Lock-Free, Wait-Free Hash Table

In this presentation, Dr. Cliff Click describes a totally lock-free hash table with extremely low cost and near-perfect scaling. Readers pay no more than HashMap readers: just the cost of computing the hash, loading and comparing the key, and returning the value. Writers must use AtomicUpdate instead of a simple assignment but otherwise pay the same as readers. In particular, there is no required order between loads and stores; correctness is assured no matter how the hardware orders memory operations. A state-based technique demonstrates the correctness of the algorithm. This novel approach is straightforward and much easier to understand than the usual "happens-before" memory-order-based reasoning.

Transcript

  • 1. 2007 JavaOne Conference: A Lock-Free Hash Table. Dr. Cliff Click, Chief JVM Architect & Distinguished Engineer, Azul Systems (blogs.azulsystems.com/cliff). May 8, 2007
  • 2. Hash Tables (www.azulsystems.com)
    • Constant-Time Key-Value Mapping
    • Fast arbitrary function
    • Extendable, defined at runtime
    • Used for symbol tables, DB caching, network access, url caching, web content, etc
    • Crucial for Large Business Applications
      ─ > 1MLOC
    • Used in Very heavily multi-threaded apps
      ─ > 1000 threads
    | 2 ©2003 Azul Systems, Inc.
  • 3. Popular Java Implementations
    • Java's Hashtable
      ─ Single threaded; scaling bottleneck
    • HashMap
      ─ Faster but NOT multi-thread safe
    • java.util.concurrent.ConcurrentHashMap
      ─ Striped internal locks; 16-way the default
    • Azul, IBM, Sun sell machines >100cpus
    • Azul has customers using all cpus in same app
    • Becomes a scaling bottleneck!
  • 4. A Lock-Free Hash Table
    • No locks, even during table resize
      ─ No spin-locks
      ─ No blocking while holding locks
      ─ All CAS spin-loops bounded
      ─ Make progress even if other threads die....
    • Requires atomic update instruction: CAS (Compare-And-Swap), LL/SC (Load-Linked/Store-Conditional, PPC only) or similar
    • Uses sun.misc.Unsafe for CAS
  • 5. A Faster Hash Table
    • Slightly faster than j.u.c for 99% reads < 32 cpus
    • Faster with more cpus (2x faster)
      ─ Even with 4096-way striping
      ─ 10x faster with default striping
    • 3x Faster for 95% reads (30x vs default)
    • 8x Faster for 75% reads (100x vs default)
    • Scales well up to 768 cpus, 75% reads
      ─ Approaches hardware bandwidth limits
  • 6. Agenda
    • Motivation
    • “Uninteresting” Hash Table Details
    • State-Based Reasoning
    • Resize
    • Performance
    • Q&A
  • 7. Some “Uninteresting” Details
    • Hashtable: A collection of Key/Value Pairs
    • Works with any collection
    • Scaling, locking, bottlenecks of the collection management are the responsibility of that collection
    • Must be fast or O(1) effects kill you
    • Must be cache-aware
    • I'll present a sample Java solution
      ─ But other solutions can work, make sense
  • 8. “Uninteresting” Details
    • Closed Power-of-2 Hash Table
      ─ Reprobe on collision
      ─ Stride-1 reprobe: better cache behavior
    • Key & Value on same cache line
    • Hash memoized
      ─ Should be same cache line as K + V
      ─ But hard to do in pure Java
    • No allocation on get() or put()
    • Auto-Resize
  • 9. Example get() code
        idx = hash = key.hashCode();
        while( true ) {               // reprobing loop
          idx &= (size-1);            // limit idx to table size
          k = get_key(idx);           // start cache miss early
          h = get_hash(idx);          // memoized hash
          if( k == key || (h == hash && key.equals(k)) )
            return get_val(idx);      // return matching value
          if( k == null ) return null;
          idx++;                      // reprobe
        }
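
The reprobing loop on this slide can be sketched as runnable Java. This is a minimal illustration under stated assumptions, not the presenter's implementation: the class name `ProbeSketch`, the three parallel arrays, and the single-threaded `insert` helper are hypothetical, and the table never resizes. Unlike the slide's `while( true )`, the loop here is bounded so that a lookup in a completely full table still terminates.

```java
/** Hypothetical single-threaded sketch of a closed, power-of-2 table with stride-1 reprobe. */
final class ProbeSketch {
    private final Object[] keys;
    private final Object[] vals;
    private final int[] hashes;   // memoized hash for each occupied slot
    private final int mask;       // size - 1, where size is a power of 2

    ProbeSketch(int log2Size) {
        int size = 1 << log2Size;
        keys = new Object[size];
        vals = new Object[size];
        hashes = new int[size];
        mask = size - 1;
    }

    /** Helper so get() has something to find; NOT thread-safe (assumption for the demo). */
    boolean insert(Object key, Object val) {
        int hash = key.hashCode();
        int idx = hash;
        for (int probes = 0; probes <= mask; probes++, idx++) {
            idx &= mask;                                   // limit idx to table size
            Object k = keys[idx];
            if (k == null) {                               // empty slot: claim it
                keys[idx] = key; hashes[idx] = hash; vals[idx] = val;
                return true;
            }
            if (k == key || (hashes[idx] == hash && key.equals(k))) {
                vals[idx] = val;                           // same key: overwrite
                return true;
            }
        }
        return false;                                      // table full
    }

    /** Lookup following the slide: compare memoized hash before calling equals(). */
    Object get(Object key) {
        int hash = key.hashCode();
        int idx = hash;
        for (int probes = 0; probes <= mask; probes++, idx++) {
            idx &= mask;                                   // limit idx to table size
            Object k = keys[idx];
            if (k == null) return null;                    // hit an empty slot: miss
            if (k == key || (hashes[idx] == hash && key.equals(k)))
                return vals[idx];                          // matching key
        }
        return null;                                       // probed every slot
    }
}
```

Checking the memoized hash first means `equals()` only runs on slots whose hash already matches, which is the point of the "Hash memoized" bullet.
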
  • 10. “Uninteresting” Details
    • Could use prime table + MOD
      ─ Better hash spread, fewer reprobes
      ─ But MOD is 30x slower than AND
    • Could use open table
      ─ put() requires allocation
      ─ Follow 'next' pointer instead of reprobe
      ─ Each 'next' is a cache miss
      ─ Lousy hash -> linked-list traversal
    • Could put Key/Value/Hash on same cache line
    • Other variants possible, interesting
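
To make the AND-versus-MOD trade-off concrete, here is a small Java sketch (the class and method names are hypothetical). For a power-of-2 table size, masking with `size - 1` yields the same slot index as a floor modulus, including for negative hash codes; a prime-sized table has no such shortcut and must pay for the division. The 30x figure is the slide's claim and is not measured here.

```java
/** Hypothetical helper comparing the two ways of reducing a hash to a slot index. */
final class IndexSketch {
    /** Power-of-2 tables: a single AND. Requires size to be a power of 2. */
    static int maskIndex(int hash, int size) {
        return hash & (size - 1);
    }

    /** Any table size (e.g. prime): floor modulus, which needs an integer division. */
    static int modIndex(int hash, int size) {
        // Math.floorMod is used rather than %, because % can return a negative
        // result for a negative hash (for example, -1 % 8 is -1).
        return Math.floorMod(hash, size);
    }
}
```
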
  • 11. Agenda: Motivation; “Uninteresting” Hash Table Details; State-Based Reasoning; Resize; Performance; Q&A
  • 12. Ordering and Correctness
    • How to show table mods correct?
      ─ put, putIfAbsent, change, delete, etc.
    • Prove via: fencing, memory model, load/store ordering, “happens-before”?
    • Instead prove* via state machine
    • Define all possible {Key,Value} states
    • Define Transitions, State Machine
    • Show all states “legal”
    *Warning: hand-wavy proof follows
  • 13. State-Based Reasoning
    • Define all {Key,Value} states and transitions
    • Don't Care about memory ordering:
      ─ get() can read Key, Value in any order
      ─ put() can change Key, Value in any order
      ─ put() must use CAS to change Key or Value
      ─ But not double-CAS
    • No fencing required for correctness!
      ─ (sometimes stronger guarantees are wanted and will need fencing)
    • Proof is simple!
  • 14. Valid States
    • A Key slot is:
      ─ null – empty
      ─ K – some Key; can never change again
    • A Value slot is:
      ─ null – empty
      ─ T – tombstone
      ─ V – some Value
    • A state is a {Key,Value} pair
    • A transition is a successful CAS
  • 15. State Machine [state diagram; arrows not recoverable from the transcript]
    • States:
      ─ {null,null} – Empty
      ─ {K,null} – Partially inserted K/V pair
      ─ {K,V} – Standard K/V pair
      ─ {K,T} – Deleted key
      ─ {null,T/V} – Partially inserted K/V pair; Reader-only state
    • Transition labels: insert, change, delete
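
The states named on this slide can be written down as a small classifier over one {Key,Value} snapshot. This is a hedged sketch: the enum names, the `TOMBSTONE` object, and the class `SlotStates` are hypothetical labels for illustration and do not come from the talk's code.

```java
/** Hypothetical classifier for the {Key,Value} states named on the slide. */
final class SlotStates {
    /** Stand-in for the tombstone value T (assumption: a unique sentinel object). */
    static final Object TOMBSTONE = new Object();

    enum State {
        EMPTY,            // {null,null}
        PARTIAL_INSERT,   // {K,null}: key claimed, value not yet written
        LIVE,             // {K,V}: standard K/V pair
        DELETED,          // {K,T}: deleted key
        READER_ONLY       // {null,T/V}: a reader saw the value before it saw the key
    }

    /** Classifies whatever Key and Value a thread happened to load, in either order. */
    static State classify(Object key, Object val) {
        if (key == null) {
            return val == null ? State.EMPTY : State.READER_ONLY;
        }
        if (val == null) return State.PARTIAL_INSERT;
        return val == TOMBSTONE ? State.DELETED : State.LIVE;
    }
}
```

Because every combination of loaded Key and Value maps to a named state, a reader never has to reason about the order in which the two loads happened.
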
  • 16. Example put(key,newval) code:
        idx = hash = key.hashCode();
        while( true ) {               // Key-Claim stanza
          idx &= (size-1);
          k = get_key(idx);           // State: {k,?}
          if( k == null &&            // {null,?} -> {key,?}
              CAS_key(idx,null,key) )
            break;                    // State: {key,?}
          h = get_hash(idx);          // get memoized hash
          if( k == key || (h == hash && key.equals(k)) )
            break;                    // State: {key,?}
          idx++;                      // reprobe
        }
  • 17. Example put(key,newval) code
        // State: {key,?}
        oldval = get_val(idx);        // State: {key,oldval}
        // Transition: {key,oldval} -> {key,newval}
        if( CAS_val(idx,oldval,newval) ) {
          // Transition worked
          ...                         // Adjust size
        } else {
          // Transition failed; oldval has changed
          // We can act “as if” our put() worked but
          // was immediately stomped over
        }
        return oldval;
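
The two put() stanzas (claim the Key, then CAS the Value) can be sketched as runnable Java. Several things here are assumptions rather than the talk's code: the class name `CasTableSketch`, the use of `java.util.concurrent.atomic.AtomicReferenceArray` instead of `sun.misc.Unsafe`, the omission of the memoized hash and of resizing, and the probe bound that throws when the table is full. `AtomicReferenceArray` reads and CAS operations carry volatile semantics, which is stronger than the "no fencing" minimum the slides argue for. The sketch also re-reads the key after a failed key CAS, so that a slot just claimed by another thread for the same key is matched rather than skipped; that step is an addition, not something shown on the slide.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Hypothetical fixed-size sketch of the CAS-based put()/get() pair; no resize, no delete. */
final class CasTableSketch {
    private final AtomicReferenceArray<Object> keys;
    private final AtomicReferenceArray<Object> vals;
    private final int mask;   // size - 1, size is a power of 2

    CasTableSketch(int log2Size) {
        int size = 1 << log2Size;
        keys = new AtomicReferenceArray<>(size);
        vals = new AtomicReferenceArray<>(size);
        mask = size - 1;
    }

    /** Returns the value seen just before this put, or null if the slot had none. */
    Object put(Object key, Object newVal) {
        int idx = key.hashCode();
        int probes = 0;
        while (true) {                                    // Key-claim stanza
            idx &= mask;
            Object k = keys.get(idx);                     // State: {k,?}
            if (k == null && keys.compareAndSet(idx, null, key))
                break;                                    // {null,?} -> {key,?}
            k = keys.get(idx);                            // re-read after a lost race
            if (k == key || key.equals(k))
                break;                                    // State: {key,?}
            if (++probes > mask)
                throw new IllegalStateException("table full (sketch has no resize)");
            idx++;                                        // reprobe
        }
        Object oldVal = vals.get(idx);                    // State: {key,oldVal}
        // Transition {key,oldVal} -> {key,newVal}. If the CAS fails, another writer
        // changed the value; act as if this put() worked and was overwritten at once.
        vals.compareAndSet(idx, oldVal, newVal);
        return oldVal;
    }

    /** Returns the current value for key, or null if absent or only partially inserted. */
    Object get(Object key) {
        int idx = key.hashCode();
        for (int probes = 0; probes <= mask; probes++, idx++) {
            idx &= mask;
            Object k = keys.get(idx);
            if (k == null) return null;                   // empty slot: miss
            if (k == key || key.equals(k)) return vals.get(idx);
        }
        return null;
    }
}
```

Neither method loops on a failed value CAS, which matches the slide's point that a losing writer can simply report success.
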
  • 18. Some Things to Notice
    • Once a Key is set, it never changes
      ─ No chance of returning Value for wrong Key
      ─ Means Keys leak; table fills up with dead Keys
      ─ Fix in a few slides...
    • No ordering guarantees provided!
      ─ Bring Your Own Ordering/Synchronization
    • Weird {null,V} state meaningful but uninteresting
      ─ Means reader got an empty key and so missed
      ─ But possibly prefetched wrong Value
  • 19. Some Things to Notice
    • There is no machine-wide coherent State!
    • Nobody guaranteed to read the same State
      ─ Except on the same CPU with no other writers
    • No need for it either
    • Consider degenerate case of a single Key
    • Same guarantees as:
      ─ single shared global variable
      ─ many readers & writers, no synchronization
      ─ i.e., darned little
  • 20. A Slightly Stronger Guarantee
    • Probably want “happens-before” on Values
      ─ java.util.concurrent provides this
    • Similar to declaring that shared global 'volatile'
    • Things written into a Value before put()
      ─ Are guaranteed to be seen after a get()
    • Requires st/st fence before CAS'ing Value
      ─ “free” on Sparc, X86
    • Requires ld/ld fence after loading Value
      ─ “free” on Azul
  • 21. Agenda: Motivation; “Uninteresting” Hash Table Details; State-Based Reasoning; Resize; Performance; Q&A
  • 22. Resizing The Table
    • Need to resize if table gets full
    • Or just re-probing too often
    • Resize copies live K/V pairs
      ─ Doubles as cleanup of dead Keys
      ─ Resize (“cleanse”) after any delete
      ─ Throttled, once per GC cycle is plenty often
    • Alas, need fencing, 'happens before'
    • Hard bit for concurrent resize & put():
      ─ Must not drop the last update to old table
  • 23. Resizing
    • Expand State Machine
    • Side-effect: mid-resize is a valid State
    • Means resize is:
      ─ Concurrent – readers can help, or just read&go
      ─ Parallel – all can help
      ─ Incremental – partial copy is OK
    • Pay an extra indirection while resize in progress
      ─ So want to finish the job eventually
    • Stacked partial resizes OK, expected
  • 24. get/put during Resize
    • get() works on the old table
      ─ Unless it sees a sentinel
    • put() or other mod must use new table
    • Must check for new table every time
      ─ Late writes to old table 'happens before' resize
    • Copying K/V pairs is independent of get/put
    • Copy has many heuristics to choose from:
      ─ All touching threads, only writers, unrelated background thread(s), etc
  • 25. New State: 'use new table' Sentinel
    • X: sentinel used during table-copy
      ─ Means: not in old table, check new
    • A Key slot is:
      ─ null, K
      ─ X – 'use new table', not any valid Key
      ─ null → K OR null → X
    • A Value slot is:
      ─ null, T, V
      ─ X – 'use new table', not any valid Value
      ─ null → {T,V}* → X
  • 26. State Machine – old table [state diagram; arrows not recoverable from the transcript]
    • States:
      ─ {null,null} – Empty
      ─ {K,null} – Partially inserted K/V pair
      ─ {K,V} – Standard K/V pair
      ─ {K,T} – Deleted key
      ─ {X,null} – check newer table
      ─ {K,X} – copy {K,V} into newer table
      ─ {null,T/V/X}
    • Transition labels: insert, change, delete, kill
    • States {X,T/V/X} not possible
  • 27. State Machine: Copy One Pair
    • empty: {null,null} → {X,null}
  • 28. State Machine: Copy One Pair
    • empty: {null,null} → {X,null}
    • dead or partially inserted: {K,T/null} → {K,X}
  • 29. State Machine: Copy One Pair
    • empty: {null,null} → {X,null}
    • dead or partially inserted: {K,T/null} → {K,X}
    • alive, but old:
      ─ old table: {K,V1} → {K,X}
      ─ new table: {K,V2} → {K,V2}
  • 30. Copying Old To New
    • New States V', T' – primed versions of V,T
      ─ Prime'd values in new table copied from old
      ─ Non-prime in new table is recent put()
      ─ “happens after” any prime'd value
      ─ Prime allows 2-phase commit
      ─ Engineering: wrapper class (Java), steal bit (C)
    • Must be sure to copy late-arriving old-table write
    • Attempt to copy atomically
      ─ May fail & copy does not make progress
      ─ But old, new tables not damaged
  • 31. New States: Prime'd
    • A Key slot is:
      ─ null, K, X
    • A Value slot is:
      ─ null, T, V, X
      ─ T',V' – primed versions of T & V
      ─ Old things copied into the new table
      ─ “2-phase commit”
      ─ null → {T',V'}* → {T,V}* → X
    • State Machine again...
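
The slides mention that in Java a primed value is engineered as a wrapper class. A minimal sketch of that idea follows; the names `PrimeSketch`, `Prime`, `X`, `TOMBSTONE`, `isPrime`, and `unwrap` are hypothetical, and the copy protocol itself (including how late-arriving old-table writes are handled) is deliberately not shown.

```java
/** Hypothetical value encodings for the resize states: T, X, and primed V'/T'. */
final class PrimeSketch {
    /** T: tombstone marking a deleted key (assumption: a unique sentinel object). */
    static final Object TOMBSTONE = new Object();
    /** X: 'use new table' sentinel; never a valid Key or Value. */
    static final Object X = new Object();

    /** V' or T': a value copied from the old table and not yet committed. */
    static final class Prime {
        final Object value;   // the wrapped V, or TOMBSTONE for T'
        Prime(Object value) { this.value = value; }
    }

    /** True if the slot holds a primed (copied, uncommitted) value. */
    static boolean isPrime(Object slotValue) {
        return slotValue instanceof Prime;
    }

    /** Strips the prime, giving the value that the second commit phase would install. */
    static Object unwrap(Object slotValue) {
        return slotValue instanceof Prime ? ((Prime) slotValue).value : slotValue;
    }
}
```

With this encoding a thread can tell a copied value from a recent put() by a single `instanceof` test, which is what lets a copier skip slots that already hold a non-primed value.
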
  • 32. State Machine – new table [state diagram; arrows not recoverable from the transcript]
    • States:
      ─ {null,null} – Empty
      ─ {K,null} – Partially inserted K/V pair
      ─ {K,V} – Standard K/V pair
      ─ {K,T} – Deleted key
      ─ {K,T'/V'} – copy in from older table
      ─ {K,X} – copy {K,V} into newer table
      ─ {X,null} – check newer table
      ─ {null,T/T'/V/V'/X}
    • Transition labels: insert, delete, kill
    • States {X,T/T'/V/V'/X} not possible
  • 33. State Machine – new table [same diagram as slide 32]
  • 34. State Machine – new table [same diagram as slide 32]
  • 35. State Machine: Copy One Pair [state diagram; layout only partly recoverable]
    • old table: {K,V1} → {K,X}
    • new table: {K,V'x} → {K,V'1} (partial copy) → {K,V1} (copy complete)
    • Step labels: read V1; read V'x; K,V' in new table; X in old table
    • Two fences separate the steps
  • 36. Some Things to Notice
    • Old value could be V or T
      ─ or V' or T' (if nested resize in progress)
    • Skip copy if new Value is not prime'd
      ─ Means recent put() overwrote any old Value
    • If CAS into new fails
      ─ Means either put() or other copy in progress
      ─ So this copy can quit
    • Any thread can see any state at any time
      ─ And CAS to the next state
  • 37. Agenda: Motivation; “Uninteresting” Hash Table Details; State-Based Reasoning; Resize; Performance; Q&A
  • 38. Microbenchmark
    • Measure insert/lookup/remove of Strings
    • Tight loop: no work beyond HashTable itself and test harness (mostly RNG)
    • “Guaranteed not to exceed” numbers
    • All fences; full ConcurrentHashMap semantics
    • Variables:
      ─ 99% get, 1% put (typical cache) vs 75 / 25
      ─ Dual Athlon, Niagara, Azul Vega1, Vega2
      ─ Threads from 1 to 800
      ─ NonBlocking vs 4096-way ConcurrentHashMap
      ─ 1K entry table vs 1M entry table
  • 39. AMD 2.4GHz – 2 (ht) cpus [Charts: 1K Table and 1M Table; M-ops/sec (0–30) vs. Threads (1–8); series NB-99, CHM-99, NB-75, CHM-75 (1K) and NB, CHM (1M)]
  • 40. Niagara – 8x4 cpus [Charts: 1K Table and 1M Table; M-ops/sec (0–80) vs. Threads (0–64); series CHM-99, NB-99, CHM-75, NB-75 (1K) and NB, CHM (1M)]
  • 41. Azul Vega1 – 384 cpus [Charts: 1K Table and 1M Table; M-ops/sec (0–500) vs. Threads (0–400); series NB-99, CHM-99, NB-75, CHM-75 (1K) and NB, CHM (1M)]
  • 42. Azul Vega2 – 768 cpus [Charts: 1K Table and 1M Table; M-ops/sec (0–1200) vs. Threads (0–800); series NB-99, CHM-99, NB-75, CHM-75 (1K) and NB, CHM (1M)]
  • 43. Summary
    • A faster lock-free HashTable
    • Faster for more CPUs
    • Much faster for higher table modification rate
    • State-Based Reasoning:
      ─ No ordering, no JMM, no fencing
    • Any thread can see any state at any time
      ─ Must assume values change at each step
    • State graphs really helped coding & debugging
    • Resulting code is small & fast
  • 44. Summary
    • Obvious future work:
      ─ Tools to check states
      ─ Tools to write code
    • Seems applicable to other data structures as well
      ─ Concurrent append j.u.Vector
      ─ Scalable near-FIFO work queues
    • Code & Video available at: http://blogs.azulsystems.com/cliff/
  • 45. #1 Platform for Business Critical Java™ WWW.AZULSYSTEMS.COM Thank You