Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

Presentation by: Trey Cain and Mikko Lipasti
Paper and more information: http://soft.vub.ac.be/races/paper/edge-chasing-delayed-consistency-pushing-the-limits-of-weak-memory-models/

  • OK, so giving this talk is kind of a blast from the past for me. I did this work with Mikko Lipasti, who was my advisor at the time, when I was finishing my PhD in 2004. We never presented it outside of my defense, after which I started my job at IBM and my research shifted direction. Since this was way back in 2004, if you don’t mind I’d like to jog my memory a little at the outset of the talk: like Marty McFly in Back to the Future, I’m going to take a journey back to some highlights of 2004. This was the year Mark Zuckerberg launched Facebook. Boy, I can barely remember life before Facebook. It was also the year Janet Jackson suffered a wardrobe malfunction during the Super Bowl. How could anyone forget that? And lastly, it was a year in which an incumbent president was being challenged by a Massachusetts politician in an election. So maybe things haven’t changed that much since then, for politics as well as for multiprocessor systems. That was eight years ago, and some things have changed but much has stayed the same. And when you think about it, relaxing synchronization is about as close as a programmer can come to time travel. Will I get the old value, or will I get the new value? Will I get what I expect, or will I get a wardrobe malfunction? OK, so push the gas pedal down to 88 mph, the flux capacitor is lit, and let’s go!
  • So as I said in the lightning round, when I saw the CFP for RACES, I knew this would be a great place to share this prior work. Most of the discussion so far has been about software mechanisms for relaxing synchronization, but we were working within the constraints of a hardware developer. We were trying to achieve the same sort of scalable performance while supporting legacy applications written to the PowerPC weakly ordered memory model, which we were unable to change. Given that constraint, the lever we used was the hardware cache coherence protocol, where we attempted to avoid coherence misses by allowing a core to continue using stale data in its cache for as long as possible. By avoiding coherence misses, we hoped to improve performance. We came up with a new implementation of the PowerPC weakly ordered memory model, which we called edge-chasing delayed consistency.
  • Not that I need to motivate the problem to this audience, but shared-memory multiprocessors are proliferating everywhere you look. While they used to be relegated to high-end servers, now many cell phones, TVs, game consoles, and tablets are SMPs. And the performance of these SMPs suffers due to coherence misses, even in relatively small systems.
  • This graph measures a 16-core system with a 16MB L3 cache per core. It shows the number of L3 misses per 1000 instructions, broken down by type, where the lower blue portion of each bar is the number of coherence misses. As you can see, coherence misses are a significant fraction across all of the workloads.
  • So what I’m going to describe is an optimized implementation of weak ordering called edge-chasing delayed consistency. This is not a new consistency model for the programmer; it is a new implementation of weak ordering that allows a cache line to continue being read after it has been invalidated by another processor. In fact, it allows that cache line to be read until it is absolutely necessary that the core see the new version of the line, where the necessary conditions are dictated by the consistency model: that time is when the reading processor becomes causally dependent upon the new value. The core continues reading the old data until it must observe the new data, that is, until it observes a value that follows the invalidation of the stale block in the happens-before relationship, meaning it causally depends upon the new value.
  • So we were interested in developing a coherence protocol that enforced the necessary conditions of a consistency model, not merely sufficient conditions. To really understand what is necessary, we relied on a formalism called the “constraint graph,” which many of you are probably aware of. [Describe the constraint graph.] The key property of the constraint graph is that if it is acyclic, the execution is correct. If it contains a cycle, it is impossible to put the set of operations in a total order, and therefore the execution is incorrect.
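The acyclicity test this note describes is an ordinary cycle check over a directed graph. A minimal sketch (node names and the edge set are illustrative, built from the talk's ST A / LD A / ST B / LD B example):

```python
# Sketch: a constraint graph as an adjacency list; an execution is
# legal iff the graph is acyclic.
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def dfs(n):
        color[n] = GRAY
        for m in graph.get(n, []):
            if color[m] == GRAY:           # back edge -> cycle
                return True
            if color[m] == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# Illustrative SC example: ST A -> LD A (RAW), LD A -> ST B (program
# order), ST B -> LD B (RAW), LD B -> ST A (WAR) closes the cycle.
cyclic = {
    "P1:ST A": ["P2:LD A"], "P2:LD A": ["P2:ST B"],
    "P2:ST B": ["P1:LD B"], "P1:LD B": ["P1:ST A"],
}
print(has_cycle(cyclic))  # True: this interleaving is illegal
```

Dropping any one edge (say, the store-to-load program order removed under processor consistency) makes the same graph acyclic, which is exactly how the weaker models admit more executions.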
  • So we extended the definition of the constraint graph to weakly ordered systems, where instead of there being edges between every pair of instructions executed by a single thread, there are only edges between instructions and memory barriers, plus a few other edges corresponding to single-threaded data dependences.
  • The edge-chasing delayed consistency protocol derives its name from a class of deadlock-detection algorithms that have been described for distributed database systems.
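The edge-chasing idea from those database systems can be sketched in a few lines: each blocked process launches a probe along its wait-for edges, probes are forwarded hop by hop, and a cycle is declared when a process receives a probe it created itself (the "Wham-O!" moment on slide 9). The wait-for graph and process names here are illustrative:

```python
from collections import deque

def edge_chasing_cycle(wait_for, initiator):
    """Detect a cycle in a wait-for graph by chasing probes.
    wait_for[p] lists the processes p is blocked on (sketch)."""
    probes = deque((initiator, q) for q in wait_for.get(initiator, []))
    seen = set()
    while probes:
        origin, holder = probes.popleft()
        if holder == origin:
            return True            # our own probe came back: cycle
        if (origin, holder) in seen:
            continue               # already forwarded along this edge
        seen.add((origin, holder))
        for q in wait_for.get(holder, []):
            probes.append((origin, q))   # forward the probe
    return False

# P1 waits on P2, P2 on P3, P3 on P1 -> P1's probe comes back to P1
wfg = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
print(edge_chasing_cycle(wfg, "P1"))  # True
```

ECDC reuses this shape: instead of wait-for edges, probes chase the communication edges of the constraint graph, and a returning probe signals a would-be cycle.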
  • With 30% updates, speedups of 2.74, 1.82, and 1.18 for these list lengths; with 100% updates, speedups of 3.11, 3.87, and 1.35 for these list lengths.
  • Intolerable vs. tolerable misses; bars, left to right. We expect ECDC to improve performance through reductions in false-sharing misses and true-sharing misses to data. As we can see from this chart, most of the reduction comes from misses to falsely shared data and misses to truly shared synchronization data. We do not believe that any of these applications exhibit the data-race-tolerant quality of the lock-free list insertion microbenchmark or of convergent iterative algorithms. Raytrace exhibits the largest reduction: over 50 percent of all coherence misses can be tolerated using ECDC, but most of these are synchronization misses. Other applications that can use a significant amount of stale data are TPC-H, SPECweb99, and SPECjbb2000.
  • So this graph shows the normalized execution time for three variants of the ECDC protocol relative to a baseline coherence protocol, so lower is better. In terms of performance improvement for real applications, it is a little disappointing: around 4% for SPECweb99 and 7.5% for TPC-H. (Don’t go back.)
  • Our conclusion after staring at the data for a while was that the two success stories were mostly benefiting from the false-sharing reduction. For the other applications, either there weren’t enough coherence misses, or the avoidance of those misses does not improve performance. For example, in the case of synchronization variables, you may be able to see the “locked” value for a little longer than you otherwise would. So instead of stalling on a cache miss to retrieve the lock from the processor releasing it, you simply see the old value and spin longer. It is unclear to us why one would expect results to be any different for applications that rely on lock-based synchronization. For other synchronization models the story may be different: for example, lock-free data structures like the linked-list example we showed, or perhaps the transactional programming model. One final word of caution before concluding. While Hans says data races are pure evil, Donald Knuth has stated that premature optimization is the root of all evil. If you have a Barnes-Hut, and a vision for attacking the problem, go for it; in other words, find your nail before inventing hammers.
  • So, when I talk about causality and causal dependences, what do I mean by that?
  • Ended at 16:00
  • E.G. OoO processor
  • E.G. OoO processor
  • Ask Mikko
  • Infrastructure issues with models weaker than weak ordering
  • Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

    1. IBM T.J. Watson Research Center — RACES’12, Oct 21, 2012 © 2012 IBM Corporation. Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models. Harold “Trey” Cain, IBM T.J. Watson Research Center; Prof. Mikko H. Lipasti, University of Wisconsin
    2. Gotta go back in time! Part of Ph.D. dissertation – never submitted for publication, until now – and it looked particularly relevant when I saw the RACES CFP. Journey back in time to the year 2004, when… Mark Zuckerberg launched Facebook… Janet Jackson suffered a “wardrobe malfunction” during the Superbowl halftime show… an incumbent president was being challenged by a Massachusetts politician. 88 mph, here we come!
    3. Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering. From the RACES website: “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.” A hardware developer’s perspective: constraints of legacy code – what if we want to apply this principle, but have no control over the applications that are running on a system? Can one build a coherence protocol that avoids synchronizing cores as much as possible? For example, by allowing each core to use stale versions of cache lines as long as possible, while maintaining architectural correctness, i.e., we will not break existing code. If we do that, what will happen?
    4. Cache-coherent shared-memory multiprocessors are ubiquitous. Coherence misses are a major source of performance loss for shared-memory applications. [Images: systems 10 years ago vs. today]
    5. 16MB L3 cache misses per 1000 instructions [chart]
    6. Edge-Chasing Delayed Consistency (ECDC). A new hardware implementation of POWER weak ordering – not a new consistency model. Allows a cache line to be non-speculatively read after being invalidated. Based on necessary conditions – a processor must fetch new data only if causally dependent on it.
    7. Constraint graph. Introduced for SC by Landin et al., ISCA-18. A directed graph represents a multithreaded execution – nodes represent dynamic instances of instructions; edges represent their transitive orders (program order, RAW, WAW, WAR). If the constraint graph is acyclic, then the execution is correct.
    8. Constraint graph example – WO. [Diagram: two processors with LD/ST/MB operations linked by ST->MB, LD->MB, MB->ST, and MB->LD orders plus read-after-write and write-after-read dependence edges.] Observation: an aggressive coherence protocol can ignore coherence messages unless doing so will create a cycle in the constraint graph.
    9. Edge-chasing delayed consistency. Based on edge-chasing algorithms used by distributed database systems for deadlock detection. [Diagram: P1–P4 wait-for graph; “Wham-O!” – a cycle in the WFG is detected when a locally created probe is received.]
    10. ECDC – basic idea. Observation: cycles in the constraint graph can be detected using a similar mechanism. Protocol: upon a write miss, create a “probe”; upon receipt of an invalidation, add the probe to the cache line and continue to read the stale block until the probe is re-observed on another message; pass probes to other processors at communication.
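The three protocol steps on slide 10 can be sketched as a toy model. This is illustrative only (class names, the probe tokens, and the message plumbing are my inventions, not the paper's hardware): an invalidated line keeps serving reads until the probe attached to it is observed again on an incoming message, at which point the core is causally dependent on the new value and must miss.

```python
# Toy model of the ECDC stale-read rule (names are illustrative).
class Line:
    def __init__(self, data):
        self.data, self.stale, self.probe = data, False, None

class Core:
    def __init__(self):
        self.cache = {}        # addr -> Line
        self.observed = set()  # probes seen on incoming messages

    def invalidate(self, addr, probe):
        line = self.cache.get(addr)
        if line:               # keep the data, mark stale, attach probe
            line.stale, line.probe = True, probe

    def receive(self, probes):
        self.observed |= probes  # probes piggybacked on a message

    def load(self, addr):
        line = self.cache.get(addr)
        if line and (not line.stale or line.probe not in self.observed):
            return line.data   # hit, possibly on stale data
        return None            # causally dependent: must fetch anew

c = Core()
c.cache["A"] = Line(1)
c.invalidate("A", probe="p7")
print(c.load("A"))   # 1: stale but still usable
c.receive({"p7"})    # probe re-observed on another message
print(c.load("A"))   # None: the new version must be fetched
```

The real protocol tracks sets of probes per block and per processor (slide 33) rather than a single token, but the lifecycle is the same: create on write miss, attach on invalidation, retire the stale copy when the probe comes back around.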
    11. Example – necessary miss (SC). [Diagram: Proc 1: LD A, LD B, ST A; Proc 2: ST B, LD A; RAW and WAR edges; line A is in proc 1’s cache with valid bit = 1, then valid bit = 0; supplanter probe.]
    12. Detecting critical writes. Some write values shouldn’t be delayed (e.g., lock releases, barriers, etc.). Two heuristics: atomic primitives – any cache block that has been touched by a store-conditional should not be delayed; polling detection – if consecutive cache accesses have the same PC and address, discard the stale line.
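The polling heuristic on slide 12 needs only the previous access's (PC, address) pair: two consecutive accesses that match look like a spin loop, so the stale line should be discarded rather than let the spinner see the old value forever. A minimal sketch (the class and PC values are illustrative):

```python
# Sketch of the polling-detection heuristic from slide 12.
class PollingDetector:
    def __init__(self):
        self.last = None            # (pc, addr) of the previous access

    def access(self, pc, addr):
        polling = self.last == (pc, addr)
        self.last = (pc, addr)
        return polling              # True -> discard the stale line

d = PollingDetector()
print(d.access(0x40, "LOCK"))  # False: first access
print(d.access(0x40, "LOCK"))  # True: same PC+addr, likely a spin loop
print(d.access(0x44, "LOCK"))  # False: different PC, not polling
```

Like any heuristic it has false negatives (an unrolled spin loop touches the line from several PCs), which is consistent with the backup slide noting that critical-write measurement was inaccurate.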
    13. Performance evaluation. PHARMSim – cycle-mode full-system simulator, based on SimpleMP including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator. Out-of-order single-threaded core; 32K DM L1 icache (1), 32K DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines; memory (400-cycle/100 ns best-case latency, 10 GB/s bandwidth based on a 5 GHz clock); stride-based prefetcher modeled after Power4. Lock-free list insertion microbenchmark. Full applications – SPLASH2: fft, fmm, ocean, radix, raytrace; commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99.
    14. Why delayed consistency? False sharing/silent sharing. Convergent/data-race-tolerant algorithms – genetic algorithms, parallel equation solvers, sparse matrix factorization. Lock-free parallel linked data structures.
    15. Lock-free algorithms. For example, list insertion: the new node’s next pointer is set to cur; a CAS operation atomically updates prev’s next pointer to new. Increasingly common. [Diagram: prev -> cur, with new inserted between.]
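The two-step insertion on slide 15 can be sketched as follows. Real code would use a hardware compare-and-swap (or PowerPC's larx/stcx pair); the `cas` helper here is a simulated stand-in, and the node values are illustrative:

```python
# Sketch of lock-free list insertion (slide 15), with a simulated CAS.
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.next = val, nxt

def cas(obj, field, expected, new):
    """Simulated compare-and-swap; atomic in real hardware."""
    if getattr(obj, field) is expected:
        setattr(obj, field, new)
        return True
    return False

def insert_after(prev, val):
    while True:
        cur = prev.next
        new = Node(val, cur)             # step 1: new points at cur
        if cas(prev, "next", cur, new):  # step 2: publish, but only if
            return new                   # prev.next hasn't changed

head = Node("head", Node("tail"))
insert_after(head, 42)
print(head.next.val)       # 42
print(head.next.next.val)  # tail
```

The retry loop is why this pattern tolerates stale reads better than a lock: if the CAS fails because another thread raced ahead, the inserter simply re-reads `prev.next` and tries again, so briefly observing an old `cur` costs a retry rather than correctness.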
    16. Prior work (delayed consistency). Invalidate-based receiver-delayed and sender-delayed protocols (Dubois et al., SC ’91). Lazy release consistency (Keleher et al., ISCA ’92). Update-based receiver-delayed and sender-delayed protocols (Afek et al., TOPLAS ’93). Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95). Write cache for reducing bandwidth in an update coherence protocol (Dahlgren and Stenstrom, JPDC ’95).
    17. Lock-free list microbenchmark. [Chart: cycles/search vs. % updates, for base and ECDC at list lengths 1000, 100, and 10.] Based on the hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02]; 15 threads randomly updating or searching the linked list, 1 thread performing searches.
    18. Intolerable miss reduction. [Chart; bars left to right: a) baseline, b) ECDC base, c) ECDC merged read/write sets, d) ECDC scalar probe set.]
    19. ECDC performance (infinite resources) [chart]
    20. Conclusions. Of nine applications studied, performance improvement for two – mostly due to a reduction in false-sharing misses. Other applications: not enough coherence misses, or the avoidance of those misses does not improve performance. We believe these results generalize to lock-based programs. Other programming models may have potential – as shown, lock-free data structures; should also apply to the transactional programming model. But beware: “Premature optimization is the root of all evil” – Donald Knuth. Best to identify apps with a communication bottleneck before attacking.
    21. Questions?
    22. Backup slides
    23. Base machine model. PHARMsim: based on SimpleMP including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator. Out-of-order execution core: 15-stage, 8-wide pipeline; 256-entry reorder buffer, 128-entry load/store queue; 32-entry issue queue. Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 dcache load ports in the OoO window; 1 L1 dcache load/store port at commit. Front end: combined bimodal (16K-entry)/gshare (16K-entry) branch predictor with a 16K-entry selection table, 64-entry RAS, 8K-entry 4-way BTB. Memory system (latency): 32K DM L1 icache (1), 32K DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines; memory (400-cycle/100 ns best-case latency, 10 GB/s bandwidth based on a 5 GHz clock); stride-based prefetcher modeled after Power4.
    24. Causality (Lamport). An instruction i is causally dependent upon instruction j if there is a directed path from j to i. Two operations are concurrent if neither causally depends upon the other. Coherence misses are a significant source of performance degradation for many applications. If two operations are concurrent, why is their performance penalized? [Diagram: P1–P3 timelines with st/ld operations to A, B, and C.]
    25. Prior work: formal memory model representations. Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13). Acyclic graph representation (Landin et al., ISCA-18). Modeling a memory operation as a series of sub-operations (Collier, RAPA). Acyclic graph + sub-operations (Adve, thesis). Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis).
    26. Anatomy of a cycle. [Diagram: Proc 1: ST A, LD B; Proc 2: ST B, LD A; program-order, WAR, and RAW edges; incoming invalidate; cache miss.]
    27. Other prior work. Speculative stale-value usage – LVP with stale values (Lepak, Ph.D. thesis ’03); coherence decoupling (Huh et al., ASPLOS ’04). Delayed RFO response to improve synchronization throughput (Rajwar et al., HPCA ’00).
    28. Constraint graph extensions. The constraint graph definition differs for other consistency models. Processor consistency – remove program-order edges from stores to subsequent loads; remaining single-thread orders: edges from loads to subsequent loads, stores to subsequent stores, and loads to subsequent stores.
    29. Constraint graph extensions. Weak ordering – remove program-order edges; add single-thread ordering edges between a memory barrier and preceding/following instructions, between same-address reads/writes, and between dependent instructions.
    30. PC example – Dekker’s algorithm. [Diagram: Proc 1: ST A, LD B; Proc 2: ST B, LD A; program order and write-after-read dependence order.] The lack of store-to-load order results in an acyclic graph.
    31. Constraint graph example – SC. [Diagram: Proc 1: ST A, LD B; Proc 2: LD A, ST B; program order plus write-after-read and read-after-write dependence orders.] The cycle indicates that the execution is incorrect.
    32. Constraint graph example – PC. [Diagram: Proc 1: ST A, LD A; Proc 2: LD B, ST B; program order plus write-after-read and read-after-write dependence orders.]
    33. ECDC conceptual description. Identify causal dependences (upstream probe sets) – 1 upstream set per processor; 2 upstream sets per cache block (read set, write set). Communicating dependences – probe sets passed on response messages; probes attached to incoming invalidation messages; extra ProbePropagation messages sent at memory barriers. Identifying usable stale blocks – an extra stable state in the cache (ST); supplanter probe.
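The bookkeeping slide 33 names (one upstream set per processor, a read set and a write set per block) can be given a loose shape in code. To be clear, the merge rules below are my illustrative guesses at the flavor of the transfer functions, not the paper's exact ones; the class and field names are invented:

```python
# Loose sketch of ECDC probe-set bookkeeping (slide 33); the merge
# rules here are illustrative, not the paper's exact definitions.
class ProbeSets:
    def __init__(self):
        self.proc = set()    # this processor's upstream probe set
        self.read = {}       # addr -> per-block read probe set
        self.write = {}      # addr -> per-block write probe set

    def on_load(self, addr):
        # A load makes the processor causally downstream of the
        # block's writers, and adds us to the block's read set.
        self.proc |= self.write.get(addr, set())
        self.read.setdefault(addr, set()).update(self.proc)

    def on_store(self, addr):
        # A store is downstream of the block's prior readers and
        # writers; the block's write set becomes our upstream set.
        self.proc |= self.read.get(addr, set())
        self.proc |= self.write.get(addr, set())
        self.write[addr] = set(self.proc)

p = ProbeSets()
p.write["A"] = {"probe1"}   # block A was written under probe1
p.on_load("A")
print("probe1" in p.proc)   # True: the load inherits A's write set
```

On a response message the sender's sets would be unioned into the receiver's, which is how a probe eventually travels around a would-be constraint-graph cycle and back to its creator.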
    34. ECDC operation. [Worked example: the sequence ld A, st A, ld B, st B, ld C, showing the per-processor upstream set and the per-block read/write probe sets after each access.]
    35. Finite ECDC performance. When restricting PPB/STAB resources (220 KB per processor): 16K probe-lifetime counter; 128-entry STAB per processor; 32-entry PPB per processor/directory controller (256-entry PPB virtual namespace). TPC-H/SPECweb99 performance is within the margin of error of infinite resources.
    36. Non-atomicity of writes. Absent from the model. Effect on optimizations – forces unnecessary orders to exist; correct, but another example of over-conservatism. Hopefully an infrequent performance divot. [Example: p1: st r1, [A]; p2: ld r1, [A]; st r2, [r1]; p3: ld r1, [B]; membar; ld r2, [A].]
    37. ECDC base machine model. As the baseline, PHARMsim: based on SimpleMP including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator; the same out-of-order core, functional units, and front end as slide 23. Cache hierarchy (latency): 32K DM L1 icache (1), 32K DM L1 dcache (1), 256K 8-way L2 (7), 16MB 8-way L3 (15), 128-byte cache lines; stride-based prefetcher modeled after Power4. Memory system (latency): 2-D statically DOR-routed torus interconnect, 60 cycles per link+route (40 GB/s bandwidth per link, 5 GHz clock); memory (400-cycle best-case latency, 10 GB/s bandwidth).
    38. Mapping ECDC to HW. STAB – maintains the supplanting probe for each stale cache block. PPB – maintains an approximation of the upstream sets. In caches – 2 extra bits for the stale state and the synch heuristic. [Diagram: DRAM, directory, memory controller, NIC, L2 $, I$/D$, STAB, PPB, castout PPB.]
    39. Probe representation. Each probe is represented by an n-bit timer. A stale block may be used until the supplanting probe’s timer expires. A probe set in a p-processor system is represented by p timers.
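Slide 39's finite representation trades the unbounded probe sets of the conceptual model for n-bit countdown timers: the stale line stays usable only until its supplanting probe's timer expires. A minimal sketch (the class name is invented, and the 4-bit width is illustrative; the evaluation on slide 51 uses a 16K-cycle probe lifetime):

```python
# Sketch of the n-bit probe timer from slide 39 (illustrative width).
PROBE_BITS = 4

class ProbeTimer:
    def __init__(self, bits=PROBE_BITS):
        self.remaining = (1 << bits) - 1   # saturating countdown

    def tick(self):
        if self.remaining:
            self.remaining -= 1

    @property
    def expired(self):
        return self.remaining == 0

t = ProbeTimer()
print(not t.expired)   # True: stale line still readable
for _ in range(20):    # more than 2**4 - 1 cycles elapse
    t.tick()
print(t.expired)       # True: the stale line must be discarded
```

The timer makes forgetting safe in one direction only: expiring a probe early just forces an unnecessary refetch, so a finite STAB/PPB stays conservative rather than incorrect.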
    40. STAB detail. [Diagram: cache plus a STAB table mapping addresses to per-processor probe timers, updated by incoming invalidates.]
    41. PPB detail. [Diagram: address hash into a timer index table, shift-register probe timers, incoming upstream set, expired upstream set.]
    42. Memory consistency review. A memory consistency model specifies the programming interface to a shared memory, i.e., the allowable interleavings of instructions. Models discussed here: sequential consistency; processor consistency – no store-to-load program order; weak ordering – order w.r.t. memory barriers, same-address order, dependence order.
    43. Example – necessary miss (SC). [Diagram: Proc 1: LD A, LD B, ST A; Proc 2: ST B, LD A; RAW, WAR, and program-order edges; block A is in proc 1’s cache with valid bit = 1, then valid bit = 0.]
    44. Example – avoidable miss (SC). [Diagram: the same operations with edges such that proc 1 may keep using the stale copy of block A.]
    45. Typical ReadX transaction. When sending an invalidation, create a probe and add it to the PPB. At receipt of an invalidation (2b, 2c), add the probe to the STAB. When sending an invalidate acknowledgment, add the probe set to the response. When receiving an invalidate acknowledgment, add the incoming probe set to the PPB. [Diagram: requestor R, home H, sharers S1/S2: 1. ReadX; 2(a) sharers/data; 2(b), 2(c) inval; 3(a), 3(b) inval ack.]
    46. Invalidation to read distance. [Chart: CDF of cycles between an invalidation and the next read, as a % of load coherence misses, per workload.]
    47. Invalidation to read distance (synch). [Chart: the same CDF restricted to synchronization data.]
    48. Invalidation to read distance (data). [Chart: the same CDF restricted to non-synchronization data.]
    49. STAB entry death CDF. [Chart: % of STAB entries deallocated vs. cycles, per workload.]
    50. STAB entry lifetime [chart]
    51. ECDC performance (16K probe lifetime) [chart]
    52. ECDC performance (128-entry STAB, 32-entry PPB, 256-entry namespace) [chart]
    53. ProbePropagation messages [chart]
    54. ECDC storage overhead. [Chart: storage (KB) vs. processor count, 4p to 1024p.]
    55. What about the limit study? It indicated a larger number of avoidable coherence misses. Reasons: it did not account for the non-speculative nature of the protocol (an oracle ECDC could do better); inaccurate measurement of critical writes – many loads poll lines that have never been touched by a load-linked or store-conditional; it used an isolated stale-data detection mechanism.
    56. What about speculative load squashes? In a few applications they occur frequently (SPECjbb2000, TPC-H). We implemented and evaluated read-set tracking with squash on miss. It could eliminate a large fraction of squashes – unfortunately with little performance improvement; presumably many squashes are caused by contended spinlocks.
    57. ECDC and other consistency models. A stricter model means more ProbePropagation messages. There is potential for release consistency. In SC/PC/TSO, ECDC benefits would probably be dominated by the extra ProbePropagation messages.
    58. Cause of STAB entry deallocation [chart]
    59. Publications. [ISCA ’04] Memory Ordering: A Value-Based Approach – selected for IEEE Micro Top Picks ’04. [PACT ’03] Constraint Graph Analysis of Multithreaded Programs – selected for the Best of PACT JILP issue. [PACT ’03] Redeeming IPC as a Performance Metric for Multithreaded Programs. [CAECW ’02] Precise and Accurate Processor Simulation. [SPAA Revue ’02] Verifying Sequential Consistency Using Vector Clocks. [Micro ’01] Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing. [WBT ’01] A Dynamic Binary Translation Approach to Architectural Simulation. [HPCA ’01] An Architectural Characterization of Java TPC-W. [Euro-Par ’00] A Callgraph-Based Search Strategy for Automated Performance Diagnosis – selected as a distinguished paper. [CAECW ’00] Characterizing a Java Implementation of TPC-W.
