Programming Language Memory Models: What do Shared Variables Mean?


Published on

Hans Boehm's ECOOP 2011 Summer School talk.

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Programming Language Memory Models: What do Shared Variables Mean?

  1. 1. Programming Language Memory Models: What do Shared Variables Mean? Hans-J. Boehm HP Labs7/28/2011 1
  2. 2. The problem• Shared memory parallel programs are built on shared variables visible to multiple threads of control.• But there is a lot of confusion about what those variables mean: – Are concurrent accesses allowed? – What is a concurrent access? – When do updates become visible to other threads? – Can an update be partially visible?• Many recent efforts with serious technical issues: – Java, OpenMP 3.0, UPC, Go …7/28/2011 2
  3. 3. Credits and Disclaimers:• Much of this work was done by others or jointly. I‟m relying particularly on: – Basic approach: Sarita Adve, Mark Hill, Ada 83 … – JSR 133: Also Jeremy Manson, Bill Pugh, Doug Lea – C++11: Lawrence Crowl, Clark Nelson, Paul McKenney, Herb Sutter, … – Improved hardware models: Peter Sewell‟s group, many Intel, AMD, ARM, IBM participants … – Conflict exception work: Ceze, Lucia, Qadeer, Strauss – Recent Java Memory Model work: Sevcik, Aspinall, Cenciarelli – Race-free optimization: Pramod Joisha, Laura Effinger-Dean, Dhruva Chakrabarti• But some of it is still controversial. – This reflects my personal views.7/28/2011 3
  4. 4. Outline• Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs• Brief discussion of consequences – Software requirements – Hardware requirements• Major remaining problem: – Java can‟t outlaw races. – We don‟t know how to give meaning to data races. – Some speculative solutions.7/28/2011 4
  5. 5. Naive threads programming model (Sequential Consistency)• Threads behave as though their memory accesses were simply interleaved. (Sequential consistency) Thread 1 Thread 2 x = 1; y = 2; z = 3; – might be executed as x = 1; y = 2; z = 3;7/28/2011 5
  6. 6. Locks restrict interleavings Thread 1 Thread 2 lock(l); lock(l); r1 = x; r2 = x; x = r1+1; x = r2+1; unlock(l); unlock(l); – can only be executed as lock(l); r1 = x; x = r1+1; unlock(l); lock(l); r2 = x; x = r2+1; unlock(l); or lock(l); r2 = x; x = r2+1; unlock(l); lock(l); r1 = x; x = r1+1; unlock(l); since second lock(l) must follow first unlock(l)7/28/2011 6
  7. 7. Atomic sections / transactional memory are just like a single global lock.7/28/2011 7
  8. 8. But this doesn‟t quite work …• Limits reordering and other hardware/compiler transformations – “Dekker‟s” example (everything initially zero) should allow r1 = r2 = 0: Thread 1 Thread 2 x = 1; y = 1; r1 = y; r2 = x; – Compilers like to move up loads. – Hardware likes to buffer stores.7/28/2011 8
  9. 9. Sensitive to memory access granularity Thread 1 Thread 2 x = 300; x = 100;• If memory is accessed a byte at a time, this may be executed as: x_high = 0; x_high = 1; // x = 256 x_low = 44; // x = 300; x_low = 100; // x = 356;7/28/2011 9
  10. 10. And this is at too low a level …• And taking advantage of sequential consistency involves reasoning about memory access interleaving: – Much too hard. – Want to reason about larger “atomic” code regions • which can‟t be visibly interleaved.09/08/2010 10
  11. 11. Real threads programming model (1) • Two memory accesses conflict if they – access the same scalar object*, e.g. variable. – at least one access is a store. – E.g. x = 1; and r2 = x; conflict • Two ordinary memory accesses participate in a data race if they – conflict, and – can occur simultaneously • i.e. appear as adjacent operations (diff. threads) in interleaving. • A program is data-race-free (on a particular input) if no sequentially consistent execution results in a data race. * or contiguous sequence of bit-fields09/08/2010 11
  12. 12. Real threads programming model (2)• Sequential consistency only for data-race- free programs! – Avoid anything else.• Data races are prevented by – locks (or atomic sections) to restrict interleaving – declaring synchronization variables • (in two slides …)09/08/2010 12
  13. 13. Alternate data-race-free definition: happens-before• Memory access a happens-before b if – a precedes b in program order. Thread 1: – a and b are synchronization operations, and b observes the results of a, thus lock(m); enforcing ordering between them. t = x + 1; • e.g. a unlocks m, b subsequently locks m. – Or there is a c such that a happens-before x = t; c and c happens-before b. unlock(m)• Two ordinary conflicting memory operations a and b participate in a Thread 2: data race if neither lock(m); – a happens-before b, t = x + 1; – nor b happens-before a.• Set of data-race-free programs usually the x = t; same. unlock(m)
  14. 14. Synchronization variables• Java: volatile, java.util.concurrent.atomic.• C++11: atomic<int>, atomic_int• C1x: _Atomic(int), _Atomic int, atomic_int• Guarantee indivisibility of operations.• “Don‟t count” in determining whether there is a data race: – Programs with “races” on synchronization variables are still sequentially consistent. – Though there may be “escapes” (Java, C++11).• Dekker‟s algorithm “just works” with synchronization variables.7/28/2011 14
  15. 15. As expressive as racesDouble-checked locking: Double-checked locking:Wrong! Correct (C++11):bool x_init; atomic<bool> x_init;if (!x_init) { if (!x_init) { l.lock(); l.lock(); if (!x_init) { if (!x_init) { initialize x; initialize x; x_init = true; x_init = true; } } l.unlock(); l.unlock();} }use x; use x;7/28/2011 15
  16. 16. Data Races Revisited• Are defined in terms of sequentially consistent executions.• If x and y are initially zero, this does not have a data race: Thread 1 Thread 2 if (x) if (y) y = 1; x = 1;7/28/2011 16
  17. 17. SC for DRF programming model advantages over SC• Supports important hardware & compiler optimizations.• DRF restriction  Synchronization-free code sections appear to execute atomically, i.e. without visible interleaving. – If one didn‟t: Thread 1 (not atomic): Thread 2(observer): a = 1; if (a == 1 && b == 0) { b = 1; … }09/08/2010 17
  18. 18. SC for DRF implementation model (1)• Very restricted reordering of lock(); memory operations around synch-free synchronization operations: code – Compiler either understands these, or region treats them as opaque, potentially updating any location. unlock(); – Synchronization operations include synch-free instructions to limit or prevent hardware reordering (“memory code fences”). region • e.g. lock acquisition, release, atomic store, might contain memory fences.
  19. 19. SC for DRF implementation model (2)• Code may be reordered lock(); between synchronization operations. synch-free – Another thread can only tell if it code accesses the same data between region reordered operations. – Such an access would be a data unlock(); race. synch-free• If data races are disallowed code (e.g. Posix, Ada, C++11, not Java), compiler may assume region that variables don‟t change asynchronously.
  20. 20. Data races Some variants are errorsC++11 draft SC for DRF, X with explicit easily avoidable exceptionsC1x draftJava SC for DRF, with some exceptions More details later.Ada83 SC for DRF (?) XPthreads SC for DRF (??) XOpenMP SC for DRF (??, except atomics) XFortran 2008 SC for DRF (??, except atomics) XC#, .Net SC for DRF when restricted to locks + Interlocked (???)7/28/2011 20
  21. 21. The exceptions in C++11:atomic<bool> x_init;if (!x_init.load(memory_order_acquire) { l.lock(); if (!x_init.load(memory_order_relaxed) { initialize x;, memory_order_release); } l.unlock();}use x;We‟ll ignore this for the rest of the talk …
  22. 22. Data Races as Errors• In many languages, data races are errors – In the sense of out-of-bounds array stores in C. – Anything can happen: • E.g. program can crash, local variables can appear to change, etc. – There are no benign races in e.g. C++11 or Ada 83. – Compilers may, and do, assume there are no races. If that assumption is false, all bets are off. – See e.g. Boehm, How to miscompile programs with “benign” data races, HotPar „11. .7/28/2011 22
  23. 23. How a data race may cause a crashunsigned x; • Assume switch statement compiled asIf (x < 3) { branch table. • May assume x is in … // async x change range. switch(x) { • Asynchronous change to case 0: … x causes wild branch. case 1: … – Not just wrong value. case 2: … }}23 28 July 2011
  24. 24. Note: We can usually ignore wait/notify• In most languages wait can return without notify. – “Spurious wakeup”• Except for termination issues: – wait() ≡ unlock(); lock() – notify() ≡ nop• Applies to existence of data races, partial correctness, flow analysis.7/28/2011 24
  25. 25. Another note on data race definition• We define it in terms of scalar accesses, but• Container libraries should ensure that Container accesses don‟t race  No races on memory locations• This means – Accesses to hidden shared state (caches, allocation) must be locked by implementation. – User must lock for container-level races.• This is often the correct library thread-safety condition.7/28/2011 25
  26. 26. Outline• Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs• Brief discussion of consequences – Software requirements – Hardware requirements• Major remaining problem: – Java can‟t outlaw races. – We don‟t know how to give meaning to data races. – Some speculative solutions.7/28/2011 26
  27. 27. Compilers must not introduce data races• Single thread compilers currently may add data races: (PLDI 05) struct {char a; char b} x; tmp = x; x.a = „z‟; tmp.a = „z‟; x = tmp; – x.a = 1 in parallel with x.b = 1 may fail to update x.b.• … and much more interesting examples.• Still broken in gcc with bit-fields.7/28/2011 27
  28. 28. Some restrictions are a bit more annoying:• Compiler may not introduce “speculative” stores: int count; // global, possibly shared … for (p = q; p != 0; p = p -> next) if (p -> data > 0) ++count; int count; // global, possibly shared … reg = count; for (p = q; p != 0; p = p -> next) if (p -> data > 0) ++reg; count = reg; // may spuriously assign to count28 28 July 2011
  29. 29. Introducing data races: Potentially infinite loops ?while (…) { *p = 1; } x = 1;x = 1; while (…) { *p = 1; }• x not otherwise accessed by loop.• Prohibited in Java.• Allowed by giving undefined behavior to such infinite loops in C++0x. (Controversial.)• Common text book analyses treat as equivalent?7/28/2011 29
  30. 30. Note on program analysis for optimization• Lots of research papers on analysis of sequentially consistent parallel programs.• AFAIK: Real compilers don’t do that! – For good reason. – Use sequential optimization in sync-free regions.• SC analysis is – Generally sound, but pessimistic for C++11 (minus SC escapes) – Very much so with incomplete information – Unsound for Java (but …)7/28/2011 30
  31. 31. Some positive optimization consequences• Rules are finally clear! – Optimization does not need to preserve full sequential consistency!• In languages that outlaw data races – Compilers already use that. – It can be leveraged more aggressively: – Assuming no other synchronization, x not motified by this thread, x is loop invariant in while (…) { … x … l.lock(); … l.unlock(); }7/28/2011 31
  32. 32. Language spec challenge 1• Some common language features almost unavoidably introduce data races.• Most significant example: – Detached/daemon threads, combined with – Destructors / finalizers for static variables – Detached threads call into libraries that may access static variables • Even while they‟re being cleaned up.7/28/2011 32
  33. 33. Language spec challenge 2: • Some really awful code: Thread 2: Don’t try this at home!!Thread 1: x = 42; while (m.trylock()==SUCCESS)? m.lock(); m.unlock(); assert (x == 42); • Disclaimer: Example requires • Can the assertion fail? tweaking to be pthreads- compliant. • Many implementations: Yes • Traditional specs: No. C++11: Yes • Trylock() can effectively fail spuriously! 09/08/2010 33
  34. 34. Outline• Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs• Brief discussion of consequences – Software requirements – Hardware requirements• Major remaining problem: – Java can‟t outlaw races. – We don‟t know how to give meaning to data races. – Some speculative solutions.7/28/2011 34
  35. 35. Byte store instructions• x.c = „a‟; may not visibly read and rewrite adjacent fields.• Byte stores must be implemented with – Byte store instruction, or – Atomic read-modify-write. • Typically expensive on multiprocessors. • Often cheaply implementable on uniprocessors.7/28/2011 35
  36. 36. Sequential consistency must be enforceable• Programs using only synchronization variables must be sequentially consistent.• Compiler literature contains many papers on enforcing sequential consistency by adding fences. But: – Not really possible on Itanium. – Wasn‟t possible on X86 until the re-revision of the spec last year. – Took months of discussions with PowerPC architects to conclude it‟s (barely, sort of) possible there.• The core issue is “write atomicity”:7/28/2011 36
  37. 37. Can fences enforce SC?Unclear that hardware fences can ensure sequentialconsistency. “IRIW” example:x, y initially zero. Fences between every instruction pair! Thread 1: Thread 2: Thread 3: Thread 4: x = 1; r1 = x; (1) y = 1; r3 = y; (1) fence; fence; r2 = y; (0) r4 = x; (0) x set first! y set first! Fully fenced, not sequentially consistent. Does hardware allow it?7/28/2011 37
  38. 38. Why does it matter?• Nobody cares about IRIW!?• It‟s a pain to enforce on at least PowerPC.• Many people (Sarita Adve, Doug Lea, Vijay Saraswat) spent about a year trying to relax SC requirement here.• (Personal opinion) The results were incomprehensible, and broke more important code.• No viable alternatives!7/28/2011 38
  39. 39. Acceptable hardware memory models• More challenging requirements: 1. Precise memory model specification 2. Byte stores 3. Cheap mechanism to enforce write atomicity 4. Dirt cheap mechanism to enforce data dependency ordering(?) (Java final fields)• Other than that, all standard approaches appear workable, but …7/28/2011 39
  40. 40. Replace fences completely? Synchronization variables on X86• atomic store: ~1 cycle dozens of cycles – store (mov); mfence;• atomic load: ~1 cycle – load (mov)• Store implicitly ensures that prior memory operations become visible before store.• Load implicitly ensures that subsequent memory operations become visible later.• Sole reason for mfence: Order atomic store followed by atomic load.7/28/2011 40
  41. 41. Fence enforces all kinds of additional, unobservable orderings• s is a synchronization variable: x = 1; s = 2; // includes fence r1 = y;• Prevents reordering of x = 1 and r1 = y; – final load delayed until assignment to a visible.• But this ordering is invisible to non-racing threads – …and expensive to enforce?• We need a tiny fraction of mfence functionality.7/28/2011 41
  42. 42. Outline• Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs• Brief discussion of consequences – Software requirements – Hardware requirements• Major remaining problem: – Java can‟t outlaw races. – We don‟t know how to give meaning to data races. – Some speculative solutions.7/28/2011 42
  43. 43. Data Races in Java• C++11 leaves data race semantics undefined. – “catch fire” semantics• Java supports sand-boxed code.• Don‟t know how to prevent data-races in sand-boxed, malicious code.• Java must provide some guarantees in the presence of data races.7/28/2011 43
  44. 44. Interesting data race outcome? x, y initially null, Loads may or may not see racing stores?Thread 1: Thread 2: r1 = x; r2 = y; y = r1; x = r2; Outcome: x = y = r1 = r2 = “<your bank password here>”7/28/2011 44
  45. 45. The Java SolutionQuotation from 17.4.8, Java Language Specification, 3rdedition, omitted, to avoid possible copyright questions. Itaddresses the issue by explicitly outlawing the causalitycycles required to get the dubious result from the last slide.The important point is that this is a rather complexmathematical specification. …7/28/2011 45
  46. 46. Complicated, but nice properties?• Manson, Pugh, Adve: The Java Memory Model, POPL 05 Quotation from section 9.1.2 of above paper omitted, to avoid possible copyright questions. This asserts (Theorem 1) that non-conflicting operations may be reordered by a compiler.7/28/2011 46
  47. 47. Much nicer than prior attempts, but:• Aspinall, Sevcik, “Java Memory Model Examples: Good, Bad, and Ugly”, VAMP 2007 (also ECOOP 2008 paper)Quotation from above paper omitted, to avoid possiblecopyright questions. This ends in the statement:“This falsifies Theorem 1 of [paper from previous slide].”Note: The underlying observation is due to Pietro Cenciarelli.7/28/2011 47
  48. 48. Why is this hard?• Want – Constrained race semantics for essential security properties. – Unconstrained race semantics to support compiler and hardware optimizations. – Simplicity.• No known good resolution.7/28/2011 48
  49. 49. Outline• Emerging consensus: – Interleaving semantics (Sequential Consistency) – But only for data-race-free programs• Brief discussion of consequences – Software requirements – Hardware requirements• Major remaining problem: – Java can‟t outlaw races. – We don‟t know how to give meaning to data races. – Some speculative solutions.7/28/2011 49
  50. 50. A Different Approach• Outlaw data races.• Require violations to be detectable! – Even in malicious sand-boxed code.• Possible approaches: – Statically prevent data races. • Tried repeatedly, ongoing work … – Dynamically detect the relevant data races.7/28/2011 50
  51. 51. Dynamic Race Detection• Need to guarantee one of: – Program is data-race free and provides SC execution (done), – Program contains a data race and raises an exception, or – Program exhibits simple semantics anyway, e.g. • Sequentially consistent • Synchronization-free regions are atomic• This is significantly cheaper than fully accurate data-race detection. – Track byte-level R/W information – Mostly in cache – As opposed to epoch number + thread id per byte7/28/2011 51
  52. 52. For more information:• Boehm, “Threads Basics”, HPL TR 2009-259.• Boehm, Adve, “Foundations of the C++ Concurrency Memory Model”, PLDI 08.• Sevcık and Aspinall, “On Validity of Program Transformations in the Java Memory Model”, ECOOP 08.• Sewell et al, “x86-TSO: A Rigorous and Usable Programmer‟s Model for x86”, CACM, July 2010.• S. V. Adve, Boehm, “Memory Models: A Case for Rethinking Parallel Languages and Hardware”, CACM, August 2010.• Lucia, Strauss, Ceze, Qadeer, Boehm, "Conflict Exceptions: Providing Simple Parallel Language Semantics with Precise Hardware Exceptions, ISCA 2010. Also Marino et al, PLDI 10.• Effinger-Dean, Boehm, Chakrabarti, Joisha, “Extended Sequential Reasoning for Data-Race-Free Programs”, MSPC 11.7/28/2011 52
  53. 53. Questions?7/28/2011 53
  54. 54. Backup slides7/28/2011 54
  55. 55. Introducing Races: Register Promotion 2 r = g; for(...) {[g is global] if(mt) {for(...) { g = r; lock(); r = g; } if(mt) lock(); use/update r instead of g; use/update g; if(mt) { if(mt) unlock(); g = r; unlock(); r = g; }} } g = r;7/28/2011 55
  56. 56. Trylock: Critical section reordering?• Reordering of memory operations with respect to critical sections: Expected (& Java): Naïve pthreads: Optimized pthreads lock() lock() lock() unlock() unlock() unlock()7/28/2011 56
  57. 57. Some open source pthread lock implementations (2006): lock() lock() lock() lock() unlock() unlock() unlock() unlock() [technically incorrect] [Correct, slow] [Correct] [Incorrect] NPTL NPTL NPTL FreeBSD {Alpha, PowerPC} Itanium (&X86) { Itanium, X86 } Itanium {mutex, spin} mutex spin spin7/28/2011 57