2. The problem
• Shared memory parallel programs are built on shared
variables visible to multiple threads of control.
• But there is a lot of confusion about what those variables
mean:
– Are concurrent accesses allowed?
– What is a concurrent access?
– When do updates become visible to other threads?
– Can an update be partially visible?
• Many recent efforts with serious technical issues:
– Java, OpenMP 3.0, UPC, Go …
7/28/2011 2
3. Credits and Disclaimers:
• Much of this work was done by others or jointly. I‟m
relying particularly on:
– Basic approach: Sarita Adve, Mark Hill, Ada 83 …
– JSR 133: Also Jeremy Manson, Bill Pugh, Doug Lea
– C++11: Lawrence Crowl, Clark Nelson, Paul McKenney, Herb
Sutter, …
– Improved hardware models: Peter Sewell‟s group, many Intel,
AMD, ARM, IBM participants …
– Conflict exception work: Ceze, Lucia, Qadeer, Strauss
– Recent Java Memory Model work: Sevcik, Aspinall, Cenciarelli
– Race-free optimization: Pramod Joisha, Laura Effinger-Dean,
Dhruva Chakrabarti
• But some of it is still controversial.
– This reflects my personal views.
7/28/2011 3
4. Outline
• Emerging consensus:
– Interleaving semantics (Sequential Consistency)
– But only for data-race-free programs
• Brief discussion of consequences
– Software requirements
– Hardware requirements
• Major remaining problem:
– Java can‟t outlaw races.
– We don‟t know how to give meaning to data races.
– Some speculative solutions.
7/28/2011 4
5. Naive threads programming model
(Sequential Consistency)
• Threads behave as though their memory
accesses were simply interleaved. (Sequential
consistency)
Thread 1 Thread 2
x = 1; y = 2;
z = 3;
– might be executed as
x = 1; y = 2; z = 3;
7/28/2011 5
6. Locks restrict interleavings
Thread 1 Thread 2
lock(l); lock(l);
r1 = x; r2 = x;
x = r1+1; x = r2+1;
unlock(l); unlock(l);
– can only be executed as
lock(l); r1 = x; x = r1+1; unlock(l); lock(l);
r2 = x; x = r2+1; unlock(l);
or
lock(l); r2 = x; x = r2+1; unlock(l); lock(l);
r1 = x; x = r1+1; unlock(l);
since second lock(l) must follow first unlock(l)
7/28/2011 6
7. Atomic sections / transactional memory are
just like a single global lock.
7/28/2011 7
8. But this doesn‟t quite work …
• Limits reordering and other hardware/compiler
transformations
– “Dekker‟s” example (everything initially zero) should
allow r1 = r2 = 0:
Thread 1 Thread 2
x = 1; y = 1;
r1 = y; r2 = x;
– Compilers like to move up loads.
– Hardware likes to buffer stores.
7/28/2011 8
9. Sensitive to memory access
granularity
Thread 1 Thread 2
x = 300; x = 100;
• If memory is accessed a byte at a time, this may be
executed as:
x_high = 0;
x_high = 1; // x = 256
x_low = 44; // x = 300;
x_low = 100; // x = 356;
7/28/2011 9
10. And this is at too low a level …
• And taking advantage of sequential consistency
involves reasoning about memory access
interleaving:
– Much too hard.
– Want to reason about larger “atomic” code regions
• which can‟t be visibly interleaved.
09/08/2010 10
11. Real threads programming model
(1)
• Two memory accesses conflict if they
– access the same scalar object*, e.g. variable.
– at least one access is a store.
– E.g. x = 1; and r2 = x; conflict
• Two ordinary memory accesses participate in a data
race if they
– conflict, and
– can occur simultaneously
• i.e. appear as adjacent operations (diff. threads) in interleaving.
• A program is data-race-free (on a particular input) if no
sequentially consistent execution results in a data race.
* or contiguous sequence of bit-fields
09/08/2010 11
12. Real threads programming model
(2)
• Sequential consistency only for data-race-
free programs!
– Avoid anything else.
• Data races are prevented by
– locks (or atomic sections) to restrict
interleaving
– declaring synchronization variables
• (in two slides …)
09/08/2010 12
13. Alternate data-race-free definition:
happens-before
• Memory access a happens-before b if
– a precedes b in program order. Thread 1:
– a and b are synchronization operations,
and b observes the results of a, thus lock(m);
enforcing ordering between them. t = x + 1;
• e.g. a unlocks m, b subsequently locks m.
– Or there is a c such that a happens-before x = t;
c and c happens-before b. unlock(m)
• Two ordinary conflicting memory
operations a and b participate in a Thread 2:
data race if neither lock(m);
– a happens-before b, t = x + 1;
– nor b happens-before a.
• Set of data-race-free programs usually the x = t;
same. unlock(m)
14. Synchronization variables
• Java: volatile, java.util.concurrent.atomic.
• C++11: atomic<int>, atomic_int
• C1x: _Atomic(int), _Atomic int, atomic_int
• Guarantee indivisibility of operations.
• “Don‟t count” in determining whether there is a data race:
– Programs with “races” on synchronization variables are still
sequentially consistent.
– Though there may be “escapes” (Java, C++11).
• Dekker‟s algorithm “just works” with synchronization
variables.
7/28/2011 14
15. As expressive as races
Double-checked locking: Double-checked locking:
Wrong! Correct (C++11):
bool x_init; atomic<bool> x_init;
if (!x_init) { if (!x_init) {
l.lock(); l.lock();
if (!x_init) { if (!x_init) {
initialize x; initialize x;
x_init = true; x_init = true;
} }
l.unlock(); l.unlock();
} }
use x; use x;
7/28/2011 15
16. Data Races Revisited
• Are defined in terms of sequentially
consistent executions.
• If x and y are initially zero, this does not
have a data race:
Thread 1 Thread 2
if (x) if (y)
y = 1; x = 1;
7/28/2011 16
17. SC for DRF programming model
advantages over SC
• Supports important hardware & compiler optimizations.
• DRF restriction Synchronization-free code sections
appear to execute atomically, i.e. without visible
interleaving.
– If one didn‟t:
Thread 1 (not atomic): Thread 2(observer):
a = 1;
if (a == 1 && b == 0) {
b = 1;
…
}
09/08/2010 17
18. SC for DRF implementation model
(1)
• Very restricted reordering of lock();
memory operations around synch-free
synchronization operations: code
– Compiler either understands these, or
region
treats them as opaque, potentially
updating any location. unlock();
– Synchronization operations include
synch-free
instructions to limit or prevent
hardware reordering (“memory code
fences”). region
• e.g. lock acquisition, release, atomic
store, might contain memory fences.
19. SC for DRF implementation model
(2)
• Code may be reordered lock();
between synchronization
operations. synch-free
– Another thread can only tell if it code
accesses the same data between region
reordered operations.
– Such an access would be a data unlock();
race.
synch-free
• If data races are disallowed code
(e.g. Posix, Ada, C++11, not
Java), compiler may assume region
that variables don‟t change
asynchronously.
20. Data
races
Some variants are
errors
C++11 draft SC for DRF, X
with explicit easily avoidable exceptions
C1x draft
Java SC for DRF, with some exceptions
More details later.
Ada83 SC for DRF (?) X
Pthreads SC for DRF (??) X
OpenMP SC for DRF (??, except atomics) X
Fortran 2008 SC for DRF (??, except atomics) X
C#, .Net SC for DRF when restricted to
locks + Interlocked (???)
7/28/2011 20
21. The exceptions in C++11:
atomic<bool> x_init;
if (!x_init.load(memory_order_acquire) {
l.lock();
if (!x_init.load(memory_order_relaxed) {
initialize x;
x_init.store(true, memory_order_release);
}
l.unlock();
}
use x;
We‟ll ignore this for the rest of the talk …
22. Data Races as Errors
• In many languages, data races are errors
– In the sense of out-of-bounds array stores in C.
– Anything can happen:
• E.g. program can crash, local variables can appear
to change, etc.
– There are no benign races in e.g. C++11 or Ada 83.
– Compilers may, and do, assume there are no races.
If that assumption is false, all bets are off.
– See e.g. Boehm, How to miscompile programs with
“benign” data races, HotPar „11.
.
7/28/2011 22
23. How a data race may cause a
crash
unsigned x; • Assume switch
statement compiled as
If (x < 3) { branch table.
• May assume x is in
… // async x change
range.
switch(x) {
• Asynchronous change to
case 0: … x causes wild branch.
case 1: … – Not just wrong value.
case 2: …
}
}
23 28 July 2011
24. Note: We can usually ignore
wait/notify
• In most languages wait can return without
notify.
– “Spurious wakeup”
• Except for termination issues:
– wait() ≡ unlock(); lock()
– notify() ≡ nop
• Applies to existence of data races, partial
correctness, flow analysis.
7/28/2011 24
25. Another note on data race
definition
• We define it in terms of scalar accesses, but
• Container libraries should ensure that
Container accesses don‟t race
No races on memory locations
• This means
– Accesses to hidden shared state (caches,
allocation) must be locked by implementation.
– User must lock for container-level races.
• This is often the correct library thread-safety
condition.
7/28/2011 25
26. Outline
• Emerging consensus:
– Interleaving semantics (Sequential Consistency)
– But only for data-race-free programs
• Brief discussion of consequences
– Software requirements
– Hardware requirements
• Major remaining problem:
– Java can‟t outlaw races.
– We don‟t know how to give meaning to data races.
– Some speculative solutions.
7/28/2011 26
27. Compilers must not introduce data
races
• Single thread compilers currently may add
data races: (PLDI 05)
struct {char a; char b} x;
tmp = x;
x.a = „z‟; tmp.a = „z‟;
x = tmp;
– x.a = 1 in parallel with x.b = 1 may fail to
update x.b.
• … and much more interesting examples.
• Still broken in gcc with bit-fields.
7/28/2011 27
28. Some restrictions are a bit more
annoying:
• Compiler may not introduce “speculative” stores:
int count; // global, possibly shared
…
for (p = q; p != 0; p = p -> next)
if (p -> data > 0) ++count;
int count; // global, possibly shared
…
reg = count;
for (p = q; p != 0; p = p -> next)
if (p -> data > 0) ++reg;
count = reg; // may spuriously assign to count
28 28 July 2011
29. Introducing data races:
Potentially infinite loops
?
while (…) { *p = 1; } x = 1;
x = 1; while (…) { *p = 1; }
• x not otherwise accessed by loop.
• Prohibited in Java.
• Allowed by giving undefined behavior to such
infinite loops in C++0x. (Controversial.)
• Common text book analyses treat as equivalent?
7/28/2011 29
30. Note on program analysis for
optimization
• Lots of research papers on analysis of
sequentially consistent parallel programs.
• AFAIK: Real compilers don’t do that!
– For good reason.
– Use sequential optimization in sync-free regions.
• SC analysis is
– Generally sound, but pessimistic for C++11 (minus
SC escapes)
– Very much so with incomplete information
– Unsound for Java (but …)
7/28/2011 30
31. Some positive optimization
consequences
• Rules are finally clear!
– Optimization does not need to preserve full sequential
consistency!
• In languages that outlaw data races
– Compilers already use that.
– It can be leveraged more aggressively:
– Assuming no other synchronization, x not motified by this thread,
x is loop invariant in
while (…) {
… x …
l.lock(); … l.unlock();
}
7/28/2011 31
32. Language spec challenge 1
• Some common language features almost
unavoidably introduce data races.
• Most significant example:
– Detached/daemon threads, combined with
– Destructors / finalizers for static variables
– Detached threads call into libraries that may
access static variables
• Even while they‟re being cleaned up.
7/28/2011 32
33. Language spec challenge 2:
• Some really awful code:
Thread 2: Don’t try this at home!!
Thread 1:
x = 42; while (m.trylock()==SUCCESS)
? m.lock(); m.unlock();
assert (x == 42);
• Disclaimer: Example requires
• Can the assertion fail? tweaking to be pthreads-
compliant.
• Many implementations: Yes
• Traditional specs: No. C++11: Yes
• Trylock() can effectively fail spuriously!
09/08/2010 33
34. Outline
• Emerging consensus:
– Interleaving semantics (Sequential Consistency)
– But only for data-race-free programs
• Brief discussion of consequences
– Software requirements
– Hardware requirements
• Major remaining problem:
– Java can‟t outlaw races.
– We don‟t know how to give meaning to data races.
– Some speculative solutions.
7/28/2011 34
35. Byte store instructions
• x.c = „a‟; may not visibly read and
rewrite adjacent fields.
• Byte stores must be implemented with
– Byte store instruction, or
– Atomic read-modify-write.
• Typically expensive on multiprocessors.
• Often cheaply implementable on uniprocessors.
7/28/2011 35
36. Sequential consistency must be
enforceable
• Programs using only synchronization variables
must be sequentially consistent.
• Compiler literature contains many papers on
enforcing sequential consistency by adding
fences. But:
– Not really possible on Itanium.
– Wasn‟t possible on X86 until the re-revision of the
spec last year.
– Took months of discussions with PowerPC architects
to conclude it‟s (barely, sort of) possible there.
• The core issue is “write atomicity”:
7/28/2011 36
37. Can fences enforce SC?
Unclear that hardware fences can ensure sequential
consistency. “IRIW” example:
x, y initially zero. Fences between every instruction pair!
Thread 1: Thread 2: Thread 3: Thread 4:
x = 1; r1 = x; (1) y = 1; r3 = y; (1)
fence; fence;
r2 = y; (0) r4 = x; (0)
x set first! y set first!
Fully fenced, not sequentially consistent. Does hardware allow it?
7/28/2011 37
38. Why does it matter?
• Nobody cares about IRIW!?
• It‟s a pain to enforce on at least PowerPC.
• Many people (Sarita Adve, Doug Lea,
Vijay Saraswat) spent about a year trying
to relax SC requirement here.
• (Personal opinion) The results were
incomprehensible, and broke more
important code.
• No viable alternatives!
7/28/2011 38
39. Acceptable hardware memory
models
• More challenging requirements:
1. Precise memory model specification
2. Byte stores
3. Cheap mechanism to enforce write atomicity
4. Dirt cheap mechanism to enforce data
dependency ordering(?) (Java final fields)
• Other than that, all standard approaches
appear workable, but …
7/28/2011 39
40. Replace fences completely?
Synchronization variables on X86
• atomic store: ~1 cycle dozens of cycles
– store (mov); mfence;
• atomic load: ~1 cycle
– load (mov)
• Store implicitly ensures that prior memory
operations become visible before store.
• Load implicitly ensures that subsequent memory
operations become visible later.
• Sole reason for mfence: Order atomic store
followed by atomic load.
7/28/2011 40
41. Fence enforces all kinds of
additional, unobservable orderings
• s is a synchronization variable:
x = 1;
s = 2; // includes fence
r1 = y;
• Prevents reordering of x = 1 and r1 = y;
– final load delayed until assignment to a visible.
• But this ordering is invisible to non-racing threads
– …and expensive to enforce?
• We need a tiny fraction of mfence functionality.
7/28/2011 41
42. Outline
• Emerging consensus:
– Interleaving semantics (Sequential Consistency)
– But only for data-race-free programs
• Brief discussion of consequences
– Software requirements
– Hardware requirements
• Major remaining problem:
– Java can‟t outlaw races.
– We don‟t know how to give meaning to data races.
– Some speculative solutions.
7/28/2011 42
43. Data Races in Java
• C++11 leaves data race semantics
undefined.
– “catch fire” semantics
• Java supports sand-boxed code.
• Don‟t know how to prevent data-races in
sand-boxed, malicious code.
• Java must provide some guarantees in the
presence of data races.
7/28/2011 43
44. Interesting data race outcome?
x, y initially null,
Loads may or may not see racing stores?
Thread 1: Thread 2:
r1 = x; r2 = y;
y = r1; x = r2;
Outcome: x = y = r1 = r2 =
“<your bank password here>”
7/28/2011 44
45. The Java Solution
Quotation from 17.4.8, Java Language Specification, 3rd
edition, omitted, to avoid possible copyright questions. It
addresses the issue by explicitly outlawing the causality
cycles required to get the dubious result from the last slide.
The important point is that this is a rather complex
mathematical specification.
…
7/28/2011 45
46. Complicated, but nice properties?
• Manson, Pugh, Adve: The Java Memory
Model, POPL 05
Quotation from section 9.1.2 of above paper omitted, to avoid
possible copyright questions. This asserts (Theorem 1) that
non-conflicting operations may be reordered by a compiler.
7/28/2011 46
47. Much nicer than prior attempts, but:
• Aspinall, Sevcik, “Java Memory Model Examples: Good,
Bad, and Ugly”, VAMP 2007 (also ECOOP 2008 paper)
Quotation from above paper omitted, to avoid possible
copyright questions. This ends in the statement:
“This falsifies Theorem 1 of [paper from previous slide].”
Note: The underlying observation is due to Pietro Cenciarelli.
7/28/2011 47
48. Why is this hard?
• Want
– Constrained race semantics for essential
security properties.
– Unconstrained race semantics to support
compiler and hardware optimizations.
– Simplicity.
• No known good resolution.
7/28/2011 48
49. Outline
• Emerging consensus:
– Interleaving semantics (Sequential Consistency)
– But only for data-race-free programs
• Brief discussion of consequences
– Software requirements
– Hardware requirements
• Major remaining problem:
– Java can‟t outlaw races.
– We don‟t know how to give meaning to data races.
– Some speculative solutions.
7/28/2011 49
50. A Different Approach
• Outlaw data races.
• Require violations to be detectable!
– Even in malicious sand-boxed code.
• Possible approaches:
– Statically prevent data races.
• Tried repeatedly, ongoing work …
– Dynamically detect the relevant data races.
7/28/2011 50
51. Dynamic Race Detection
• Need to guarantee one of:
– Program is data-race free and provides SC execution (done),
– Program contains a data race and raises an exception, or
– Program exhibits simple semantics anyway, e.g.
• Sequentially consistent
• Synchronization-free regions are atomic
• This is significantly cheaper than fully accurate data-race
detection.
– Track byte-level R/W information
– Mostly in cache
– As opposed to epoch number + thread id per byte
7/28/2011 51
52. For more information:
• Boehm, “Threads Basics”, HPL TR 2009-259.
• Boehm, Adve, “Foundations of the C++ Concurrency Memory
Model”, PLDI 08.
• Sevcık and Aspinall, “On Validity of Program Transformations in the
Java Memory Model”, ECOOP 08.
• Sewell et al, “x86-TSO: A Rigorous and Usable Programmer‟s
Model for x86”, CACM, July 2010.
• S. V. Adve, Boehm, “Memory Models: A Case for Rethinking Parallel
Languages and Hardware”, CACM, August 2010.
• Lucia, Strauss, Ceze, Qadeer, Boehm, "Conflict Exceptions:
Providing Simple Parallel Language Semantics with Precise
Hardware Exceptions, ISCA 2010. Also Marino et al, PLDI 10.
• Effinger-Dean, Boehm, Chakrabarti, Joisha, “Extended Sequential
Reasoning for Data-Race-Free Programs”, MSPC 11.
7/28/2011 52