CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest). Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking. Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization. Anything else is considerably harder, but it's
also justified more often than one would like. The two tools that
are provided by qemu/atomic.h are memory barriers and atomic operations.

Macros defined by qemu/atomic.h fall in three camps:

- compiler barriers: barrier();

- weak atomic access and manual memory barriers: atomic_read(),
  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();

- sequentially consistent atomic access: everything else.


COMPILER MEMORY BARRIER
=======================

barrier() prevents the compiler from moving the memory accesses on either
side of it to the other side. The compiler barrier has no direct effect
on the CPU, which may then reorder things however it wishes.

barrier() is mostly used within qemu/atomic.h itself. On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution. In this
case, qemu/atomic.h will reduce stronger memory barriers to simple
compiler barriers.

Still, barrier() can be useful when writing code that can be interrupted
by signal handlers.
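
For instance, here is a minimal sketch (the variable names and the
process() helper are made up for illustration, not taken from QEMU code)
of a request handed from normal code to a signal handler running in the
same thread. A compiler barrier is enough here: the handler runs on the
same CPU, so only compiler reordering could break the protocol.

    #include "qemu/atomic.h"

    extern void process(int data);   /* hypothetical consumer */

    static int request_data;
    static int request_pending;

    /* called from normal code */
    static void post_request(int data)
    {
        request_data = data;
        barrier();              /* don't sink the data store below the flag */
        request_pending = 1;
    }

    /* called from a signal handler in the same thread */
    static void handle_signal(int sig)
    {
        if (request_pending) {
            barrier();          /* don't hoist the data load above the flag */
            process(request_data);
        }
    }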


SEQUENTIALLY CONSISTENT ATOMIC ACCESS
=====================================

Most of the operations in the qemu/atomic.h header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

qemu/atomic.h provides the following set of atomic read-modify-write
operations:

    void atomic_inc(ptr)
    void atomic_dec(ptr)
    void atomic_add(ptr, val)
    void atomic_sub(ptr, val)
    void atomic_and(ptr, val)
    void atomic_or(ptr, val)

    typeof(*ptr) atomic_fetch_inc(ptr)
    typeof(*ptr) atomic_fetch_dec(ptr)
    typeof(*ptr) atomic_fetch_add(ptr, val)
    typeof(*ptr) atomic_fetch_sub(ptr, val)
    typeof(*ptr) atomic_fetch_and(ptr, val)
    typeof(*ptr) atomic_fetch_or(ptr, val)
    typeof(*ptr) atomic_xchg(ptr, val)
    typeof(*ptr) atomic_cmpxchg(ptr, old, new)

The first group of operations return void; the second group return the
old value of *ptr. These operations are polymorphic; they operate on any
type that is as wide as an int.
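
For example, here is a minimal reference-counting sketch built only on
these primitives (the Obj type and its fields are hypothetical, not from
QEMU code); atomic_fetch_dec() returns the value the counter held before
the decrement, so exactly one thread sees it drop from 1 to 0 and frees
the object:

    #include <stdlib.h>
    #include "qemu/atomic.h"

    typedef struct Obj {
        int refcount;
        /* ... payload ... */
    } Obj;

    static void obj_ref(Obj *obj)
    {
        atomic_inc(&obj->refcount);
    }

    static void obj_unref(Obj *obj)
    {
        /* atomic_fetch_dec() returns the old value of the counter */
        if (atomic_fetch_dec(&obj->refcount) == 1) {
            free(obj);
        }
    }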

Sequentially consistent loads and stores can be done using:

    atomic_fetch_add(ptr, 0) for loads
    atomic_xchg(ptr, val) for stores

However, they are quite expensive on some platforms, notably POWER and
ARM. Therefore, qemu/atomic.h provides two primitives with slightly
weaker constraints:

    typeof(*ptr) atomic_mb_read(ptr)
    void atomic_mb_set(ptr, val)

The semantics of these primitives map to Java volatile variables,
and are strongly related to memory barriers as used in the Linux
kernel (see below).

As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
be reordered with each other, and it is also not possible to reorder
"normal" accesses around them.

However, and this is the important difference between
atomic_mb_read/atomic_mb_set and sequential consistency, it is important
for both threads to access the same volatile variable. It is not the
case that everything visible to thread A when it writes volatile field f
becomes visible to thread B after it reads volatile field g. The store
and load have to "match" (i.e., be performed on the same volatile
field) to achieve the right semantics.

These operations work on any type that is as wide as an int or smaller.
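
As a sketch of the "matching" requirement (the variable names are
illustrative only), the writer publishes data and then performs
atomic_mb_set() on a flag; the reader gets the expected ordering only
because it performs atomic_mb_read() on that same flag:

    #include <assert.h>
    #include "qemu/atomic.h"

    int data;
    int flag;

    /* writer */
    void producer(void)
    {
        data = 42;                 /* plain store */
        atomic_mb_set(&flag, 1);   /* behaves like a Java volatile write */
    }

    /* reader */
    void consumer(void)
    {
        if (atomic_mb_read(&flag)) {   /* must read the *same* field */
            assert(data == 42);        /* then the data store is visible */
        }
    }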


WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
If the algorithm you are writing includes both writes and reads on
the same side, it is generally simpler to use sequentially consistent
primitives.

When using this model, variables are accessed with atomic_read() and
atomic_set(), and restrictions on the ordering of accesses are enforced
using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends()
memory barriers.

atomic_read() and atomic_set() prevent the compiler from using
optimizations that might otherwise optimize accesses out of existence
on the one hand, or that might create unsolicited accesses on the other.
In general this should not have any effect, because the same compiler
barriers are already implied by memory barriers. However, it is useful
to do so, because it tells readers which variables are shared with
other threads, and which are local to the current thread or protected
by other, more mundane means.
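
For example (a hypothetical polling loop, not taken from QEMU), without
atomic_read() the compiler could legally read the flag once and spin
forever on a cached value; with it, every iteration performs a real load,
and the placement of the memory barrier becomes a separate, explicit
decision:

    #include "qemu/atomic.h"

    extern int data_ready;
    extern int shared_data;

    int wait_for_data(void)
    {
        while (!atomic_read(&data_ready)) {
            /* atomic_read() keeps the load inside the loop */
        }
        smp_rmb();   /* order the flag read before the data read */
        return atomic_read(&shared_data);
    }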

Memory barriers control the order of references to shared memory.
They come in four kinds:

- smp_rmb() guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, smp_rmb() puts a partial ordering on loads, but is not
  required to have any effect on stores.

- smp_wmb() guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, smp_wmb() puts a partial ordering on stores, but is not
  required to have any effect on loads.

- smp_mb() guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  smp_mb() puts a partial ordering on both loads and stores. It is
  stronger than both a read and a write memory barrier; it implies both
  smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the
  barrier from overtaking LOADs coming after the barrier and vice versa.

- smp_read_barrier_depends() is a weaker kind of read barrier. On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  However, this is not always true---for example, it was not true on
  Alpha processors. Whenever this kind of access happens to shared
  memory (that is not protected by a lock), a read barrier is needed,
  and smp_read_barrier_depends() can be used instead of smp_rmb().

  Note that the first load really has to have a _data_ dependency and not
  a control dependency. If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a _control_
  dependency and a full read barrier or better is required; the sketch
  below illustrates the difference.
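
Here is a minimal sketch of that difference (the names are illustrative
only): dereferencing a pointer obtained from the first load is a data
dependency, so smp_read_barrier_depends() suffices; testing a flag and
then loading something else is only a control dependency and needs
smp_rmb() or stronger.

    #include "qemu/atomic.h"

    struct foo { int field; };
    struct foo *published;
    int flag, payload;

    int read_with_data_dependency(void)
    {
        struct foo *p = atomic_read(&published);
        smp_read_barrier_depends();   /* the second load uses p itself */
        return p->field;
    }

    int read_with_control_dependency(void)
    {
        if (atomic_read(&flag)) {
            /* the dependency goes through a conditional, so a real
             * read barrier (or stronger) is required here */
            smp_rmb();
            return atomic_read(&payload);
        }
        return 0;
    }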


This is the set of barriers that is required *between* two atomic_read()
and atomic_set() operations to achieve sequential consistency:

                    |              2nd operation              |
                    |-----------------------------------------|
     1st operation  | (after last) | atomic_read | atomic_set |
    ----------------+--------------+-------------+------------|
     (before first) |              | none        | smp_wmb()  |
    ----------------+--------------+-------------+------------|
     atomic_read    | smp_rmb()    | smp_rmb()*  | **         |
    ----------------+--------------+-------------+------------|
     atomic_set     | none         | smp_mb()*** | smp_wmb()  |
    ----------------+--------------+-------------+------------|

      * Or smp_read_barrier_depends().

     ** This requires a load-store barrier. How to achieve this varies
        depending on the machine, but in practice smp_rmb()+smp_wmb()
        should have the desired effect. For example, on PowerPC the
        lwsync instruction is a combined load-load, load-store and
        store-store barrier.

    *** This requires a store-load barrier. On most machines, the only
        way to achieve this is a full barrier.


You can see that the two possible definitions of atomic_mb_read()
and atomic_mb_set() are the following:

    1) atomic_mb_read(p)   = atomic_read(p); smp_rmb()
       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb()

    2) atomic_mb_read(p)   = smp_mb(); atomic_read(p); smp_rmb()
       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v)

Usually the former is used, because smp_mb() is expensive and a program
normally has more reads than writes. Therefore it makes more sense to
make atomic_mb_set() the more expensive operation.

There are two common cases in which atomic_mb_read and atomic_mb_set
generate too many memory barriers, and thus it can be useful to manually
place barriers instead:

- when a data structure has one thread that is always a writer
  and one thread that is always a reader, manual placement of
  memory barriers makes the write side faster. Furthermore,
  correctness is easy to check for in this case using the "pairing"
  trick that is explained below:

     thread 1                          thread 1
     -------------------------         ------------------------
     (other writes)
                                       smp_wmb()
     atomic_mb_set(&a, x)              atomic_set(&a, x)
                                       smp_wmb()
     atomic_mb_set(&b, y)              atomic_set(&b, y)

                                 =>
     thread 2                          thread 2
     -------------------------         ------------------------
     y = atomic_mb_read(&b)            y = atomic_read(&b)
                                       smp_rmb()
     x = atomic_mb_read(&a)            x = atomic_read(&a)
                                       smp_rmb()

- sometimes, a thread is accessing many variables that are otherwise
  unrelated to each other (for example because, apart from the current
  thread, exactly one other thread will read or write each of these
  variables). In this case, it is possible to "hoist" the implicit
  barriers provided by atomic_mb_read() and atomic_mb_set() outside
  a loop. For example, the above definition of atomic_mb_read() gives
  the following transformation:

     n = 0;                                n = 0;
     for (i = 0; i < 10; i++)        =>    for (i = 0; i < 10; i++)
       n += atomic_mb_read(&a[i]);           n += atomic_read(&a[i]);
                                           smp_rmb();

  Similarly, atomic_mb_set() can be transformed as follows:

                                           smp_wmb();
     for (i = 0; i < 10; i++)        =>    for (i = 0; i < 10; i++)
       atomic_mb_set(&a[i], false);          atomic_set(&a[i], false);
                                           smp_mb();


  The two tricks can be combined. In this case, splitting a loop in
  two lets you hoist the barriers out of the loops _and_ eliminate the
  expensive smp_mb():

                                           smp_wmb();
     for (i = 0; i < 10; i++) {      =>    for (i = 0; i < 10; i++)
       atomic_mb_set(&a[i], false);          atomic_set(&a[i], false);
       atomic_mb_set(&b[i], false);        smp_wmb();
     }                                     for (i = 0; i < 10; i++)
                                             atomic_set(&b[i], false);
                                           smp_mb();

  The other thread can still use atomic_mb_read()/atomic_mb_set().


Memory barrier pairing
----------------------

A useful rule of thumb is that memory barriers should always, or almost
always, be paired with another barrier. In the case of QEMU, however,
note that the other barrier may actually be in a driver that runs in
the guest!

For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
both count as read barriers. A read barrier shall pair with a write
barrier or a full barrier; a write barrier shall pair with a read
barrier or a full barrier. A full barrier can pair with anything.
For example:

        thread 1             thread 2
        ===============      ===============
        a = 1;
        smp_wmb();
        b = 2;               x = b;
                             smp_rmb();
                             y = a;

Note that the "writing" thread is accessing the variables in the
opposite order from the "reading" thread. This is expected: stores
before the write barrier will normally match the loads after the
read barrier, and vice versa. The same is true for more than two
accesses and for data dependency barriers:

        thread 1             thread 2
        ===============      ===============
        b[2] = 1;
        smp_wmb();
        x->i = 2;
        smp_wmb();
        a = x;               x = a;
                             smp_read_barrier_depends();
                             y = x->i;
                             smp_read_barrier_depends();
                             z = b[y];

smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs
with atomic_mb_set().
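
For instance (a sketch with illustrative variable names), a writer that
uses a plain write barrier can pair with a reader that uses
atomic_mb_read(), because the latter includes a read barrier after the
load:

        thread 1                  thread 2
        ===============           ===============
        a = 1;
        smp_wmb();
        atomic_set(&b, 2);        x = atomic_mb_read(&b);
                                  y = a;   /* sees a == 1 if x == 2 */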


COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
============================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed atomic_t type; atomic operations in QEMU are polymorphic
  and use normal C types.

- atomic_read and atomic_set in Linux give no guarantee at all;
  atomic_read and atomic_set in QEMU include a compiler barrier
  (similar to the ACCESS_ONCE macro in Linux).

- most atomic read-modify-write operations in Linux return void;
  in QEMU, the atomic_fetch_* operations, atomic_xchg and
  atomic_cmpxchg return the old value of the variable.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers; in QEMU, all of them enforce
  sequential consistency, which means they imply full memory barriers
  before and after the operation.

- Linux does not have an equivalent of atomic_mb_read() and
  atomic_mb_set(). In particular, note that set_mb() is a little
  weaker than atomic_mb_set().


SOURCES
=======

* Documentation/memory-barriers.txt from the Linux kernel

* "The JSR-133 Cookbook for Compiler Writers", available at
  http://g.oswego.edu/dl/jmm/cookbook.html