Prelim Slides
Slides from my PhD preliminary Exam
Slide 1: Techniques to Improve Scalability of Transactional Memory Systems
Salil Pant (Advisor: Dr. G. Byrd)
Slide 2: Introduction
- Shared memory parallel programs need synchronization
- Lock-based synchronization uses atomic read-modify-write primitives
- Problems with locks
- Solution: transactional memory
  - Speculative and optimistic
  - Relieves the programmer
Slide 3: Outline
- Issues with TM
  - Scalability
- Contributions
  - Analysis of TM scalability
  - Value predictor
- Results
- Proposed work
Slide 4: Conventional Synchronization
- Conservative, blocking, lock-based
- Atomic read-modify-write primitives
  - Provide atomicity only for a single address
  - Sync variables are exposed to the programmer
- The programmer orchestrates synchronization
- Granularity = (no. of shared R/W variables covered) / (no. of lock variables)
  - High (>> 1) = coarse, low (~1) = fine
- Fine granularity => more concurrency => better performance
  - As long as the program still runs correctly (see the sketch below)
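To make the granularity ratio concrete, here is a minimal C++ sketch (mine, not from the slides; the bucket table and all names are illustrative). The coarse variant covers 64 shared variables with one lock (granularity >> 1); the fine variant uses one lock per variable (granularity ~ 1), so disjoint updates can proceed in parallel.

    #include <array>
    #include <mutex>

    constexpr int kBuckets = 64;
    std::array<long, kBuckets> table{};       // 64 shared R/W variables

    std::mutex coarse;                        // 1 lock covering all 64: coarse
    std::array<std::mutex, kBuckets> fine;    // 1 lock per variable: fine

    void update_coarse(int b) {
        std::lock_guard<std::mutex> g(coarse);   // serializes every update
        ++table[b];
    }

    void update_fine(int b) {
        std::lock_guard<std::mutex> g(fine[b]);  // serializes only same-bucket updates
        ++table[b];
    }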
Slide 5: Problems
- Software
  - Mapping from locks to shared conf. variables
    - Programmers opt for coarse-grained locks
  - Deadlocks, livelocks, starvation, and other issues managed by the programmer
  - Blocking synchronization is not good for fault tolerance
- Hardware
  - Basic test-and-set is not scalable
  - Software queue-based locks are too heavy for the common case
- Fine granularity == lots of locks == hard to program and debug
Slide 6: Transactional Memory
- Proposed by Herlihy
- "Transactional abstraction":
  - Critical sections become "transactions"
  - ACI properties (atomicity, consistency, isolation)
- Optimistic, speculative execution of critical sections
- Conflicting accesses are detected and execution is rolled back
  - read-write, write-write, write-read
- Can be implemented by hardware or software (a software sketch follows the slide)

  Fine-grained locks:
    Lock(X); Update(A); Unlock(X);
    Lock(Y); Update(B); Unlock(Y);
  become:
    Begin_transaction; Update(A); End_transaction;
    Begin_transaction; Update(B); End_transaction;

  Coarse-grained locks:
    Lock(X); Lock(Y); Update(A, B); Unlock(Y); Unlock(X);
  become:
    Begin_transaction; Update(A, B); End_transaction;
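For flavor, the same abstraction exists in software today as GCC's experimental TM language extension (build with g++ -fgnu-tm). This only illustrates the programming model on this slide, not the hardware TM studied in this work:

    #include <thread>
    #include <vector>

    static long A = 0, B = 0;          // shared data, no explicit locks

    void update_both() {
        __transaction_atomic {         // Begin_transaction
            ++A;
            ++B;
        }                              // End_transaction: commits atomically or retries
    }

    int main() {
        std::vector<std::thread> ts;
        for (int i = 0; i < 4; ++i)
            ts.emplace_back([] { for (int n = 0; n < 1024; ++n) update_both(); });
        for (auto& t : ts) t.join();
        return (A == 4 * 1024 && B == 4 * 1024) ? 0 : 1;
    }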
Slide 7: Hardware-supported TM
- Special instructions to indicate transactional accesses
  - Initialize a buffer to save transactional data
  - Checkpoint at the beginning
- Buffer to log versions of transactional data
  - Special write buffer
  - In-memory log
- Conflict detection/resolution mechanism
  - Via the coherence protocol
  - "Timestamps": local logical clock + cpu_id
- Mechanism to roll back state
  - Hardware to checkpoint processor state
  - ROB-based
Slide 8: Hardware TM
- Additions to the chip (TLR proposal)
Slide 9: Advantages
- Transfers the burden to the designer
  - Deadlock, livelock, starvation freedom, etc.
- Ease of programming
  - More transactions does not mean harder programs
- Performs better than locks in the common case
  - More concurrency, less overhead
  - Concurrency now depends on the size of transactions
- Non-blocking advantages
- Can be implemented in software or by hardware
  - We mainly focus on hardware
Slide 10: Issues with TM
- TM is an optimistic, speculative synchronization scheme
  - Works best under mild/medium contention
- How does HTM deal with:
  - Large transaction sizes?
  - System calls or I/O inside transactions?
  - Processes/threads getting de-scheduled?
  - Thread migration?
Slide 11: Scalability Issue
- Scalability of TM with an increasing number of processors
  - Is optimistic execution still beneficial at 32 processors?
- Greater overhead on conflicts/aborts compared to lock-based sync
  - Memory + processor rollback
  - Network overhead
- Serialized commit/abort needed to maintain atomicity
- Transaction sizes are predicted to increase
  - Support for I/O and system calls within transactions
  - Integration of TM with higher-level programming language models
Slide 12: Measuring Scalability
- What are we looking for?
  - Application vs. system scalability
  - TM overhead == conflicts
- Measure speedup for up to 32-processor systems
  - "Tourmaline" simulator for TM
  - Simple TM system with a timing model for memory accesses
  - Provides version management and conflict detection
  - Timestamps for conflict resolution
  - Conflicts always abort the "younger" transaction (sketched below)
  - No network model
  - Added a simple backoff
- Two Splash benchmarks were "transactified": Cholesky and Raytrace
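A minimal sketch of the "abort the younger transaction" policy, assuming the (local logical clock, cpu_id) timestamps from slide 7; the types and function names here are mine, not the simulator's:

    #include <cstdint>
    #include <tuple>

    struct Timestamp {
        uint64_t clock;   // local logical clock at transaction begin
        uint32_t cpu_id;  // tie-breaker, makes the order total
    };

    // True if a is older than b; older transactions win conflicts.
    bool older(const Timestamp& a, const Timestamp& b) {
        return std::tie(a.clock, a.cpu_id) < std::tie(b.clock, b.cpu_id);
    }

    enum class Action { AbortRequester, AbortHolder };

    // On a detected conflict, the "younger" of the two transactions aborts.
    Action resolve(const Timestamp& requester, const Timestamp& holder) {
        return older(holder, requester) ? Action::AbortRequester
                                        : Action::AbortHolder;
    }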
Slide 13: Queue Micro-benchmark
- Queue micro-benchmark for TM (see the sketch below)
  - 2^10 insert/delete operations
  - An important structure used in the Splash benchmarks
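A sketch of how this micro-benchmark could look, with a single mutex standing in for the transaction boundaries; the queue type, thread count, and insert/delete mix are my assumptions, and only the 2^10 operation count comes from the slide:

    #include <deque>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <vector>

    std::deque<int> queue_;
    std::mutex m;  // stand-in for Begin_transaction/End_transaction

    void worker(int ops, unsigned seed) {
        std::mt19937 rng(seed);
        for (int i = 0; i < ops; ++i) {
            std::lock_guard<std::mutex> tx(m);   // "transaction" begins
            if (queue_.empty() || rng() % 2)
                queue_.push_back(i);             // insert
            else
                queue_.pop_front();              // delete
        }                                        // "transaction" commits
    }

    int main() {
        const int kThreads = 4, kOps = 1024 / kThreads;  // 2^10 ops total
        std::vector<std::thread> ts;
        for (int t = 0; t < kThreads; ++t) ts.emplace_back(worker, kOps, t + 1);
        for (auto& t : ts) t.join();
    }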
Slide 14: Micro-benchmark Results
Slide 15: Benchmark Results
Slide 16: Observations
- Conflicts increase with increasing CPU count
  - TM overhead can lead to slowdown
- The situation gets worse with increased transaction sizes
- The effect on speedup might be worse with a network model in place
- How do we make TM resilient to conflicts?
Slide 17: Value Predictor Idea
- TM performance degrades with conflicts
- Certain data structures are hard to parallelize
  - No performance difference with TM
Slide 18: Value Predictor Idea (contd.)
- Serializing data/operations are predictable
  - Pointers: head, tail, etc.
  - Sizes: constant increments/decrements
- HTM already includes
  - Speculative hardware for buffering
  - Checkpoint and rollback capability
- Still reaps the benefits of TM
- Allows transactions to run in parallel with predicted values
- Such queues are used mainly for task/memory management
  - Cholesky, Raytrace, Radiosity
Slide 19: Implementation
- Stride-based, memory-level value prediction
- Base LogTM model (sketched below)
  - In-memory logging of old values during transactional stores
  - Eager conflict detection, timestamps for conflict resolution
  - Uses a Nack-based coherence protocol
  - Deadlock detection mechanism
  - Commits are easy
  - Aborts need memory + processor rollback
- Nacks are used to trigger the value predictor
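A software sketch of the eager version management described above, assuming a flat undo log; the names are illustrative rather than LogTM's actual structures. The old value is logged before each transactional store, commit just discards the log, and abort walks it backwards to restore memory:

    #include <cstdint>
    #include <vector>

    struct LogEntry { uint64_t* addr; uint64_t old_value; };
    std::vector<LogEntry> undo_log;  // per-transaction, in-memory log

    void tx_store(uint64_t* addr, uint64_t value) {
        undo_log.push_back({addr, *addr});  // log the old value first (eager)
        *addr = value;                      // then update memory in place
    }

    void tx_commit() { undo_log.clear(); }  // commits are easy

    void tx_abort() {                       // aborts roll memory back
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it)
            *it->addr = it->old_value;
        undo_log.clear();
    }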
Slide 20: Implementation (contd.)
- Addresses are identified as predictable by the programmer/compiler
- The value predictor initializes an entry with the address
- One VP entry per VP address (see the sketch below)
  - Ordered list of real values, 2 in our design
  - Ordered list of predicted values
  - Ordered list of predicted CPUs
- Fortunately, at most 3 or 4 VP entries have been needed so far
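A sketch of one VP entry under these assumptions (field and method names are mine). With two real values the stride falls out directly, and each outstanding prediction chains off the previous one, so several CPUs can receive successive predicted values:

    #include <cstdint>
    #include <deque>

    struct VPEntry {
        uint64_t addr;                        // the predicted address
        uint64_t real[2];                     // last two committed values
        std::deque<uint64_t> predicted;       // outstanding predictions, in order
        std::deque<uint32_t> predicted_cpus;  // CPU given each prediction, in order

        // Stride-based prediction: next = last + (last - previous).
        uint64_t predict_next(uint32_t cpu) {
            uint64_t stride = real[1] - real[0];
            uint64_t base = predicted.empty() ? real[1] : predicted.back();
            uint64_t next = base + stride;
            predicted.push_back(next);
            predicted_cpus.push_back(cpu);
            return next;
        }
    };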
Slide 21: Implementation (contd.)
- Need an extra buffer to hold predicted data
  - Only with LogTM
  - Cannot log a predicted load value in memory
- Predictions are checked at commit time (sketched below)
  - Execution does not advance beyond commit until verified
- Needs changes in the coherence protocol
  - More deadlock scenarios
- Simplifications
  - Address, VP entries
  - Timing of the VP
  - Always generate exclusive requests
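A minimal sketch of that commit-time check, assuming predicted loads are buffered as (address, predicted value) pairs; names are illustrative:

    #include <cstdint>
    #include <vector>

    struct PredictedLoad { const uint64_t* addr; uint64_t predicted; };

    // Returns true if every prediction matched and the commit may proceed;
    // otherwise the caller must abort (memory + processor rollback).
    bool verify_predictions(const std::vector<PredictedLoad>& buffer) {
        for (const auto& p : buffer)
            if (*p.addr != p.predicted) return false;  // failed prediction
        return true;                                   // safe to commit
    }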
Slide 22: Implementation: base LogTM model
[protocol diagram: directory (M-1) with CPUs 1-3; messages: Data, GetX, FGetX, Nack]
Slide 23: Implementation: generating predictions
[protocol diagram: directory (M-1, S-2) with value predictor and CPUs 1-3; messages: GetX, FGetX, Nack, Pred, Retry]
Slide 24: State after predictions
[protocol diagram: directory (M-1, S-2-3) with value predictor and CPUs 1-3; messages: FGetX, Nack, Retry]
Slide 25: Successful predictions
[protocol diagram: directory goes from M-1, S-2-3 to M-2, S-3; value predictor and CPUs 1-3; messages: FGetX, Retry, Unblock, Nack, Result]
Slide 26: Failed predictions
[protocol diagram: directory goes from M-1, S-2-3 to NP, S-3; value predictor and CPUs 1-3; messages: FGetX, Retry, Unblock, Result]
Slide 27: Evaluation
- Micro-benchmarks: loop-based, 2^10 transactions
  - Shared counter
    - Simple counter, incremented by a fixed value
  - Queue-based
    - Insert only
    - Random inserts and deletes
- Simulation platform:
  - SIMICS in-order processors (1, 2, 4, 8, 16)
  - GEMS (Ruby) memory system
  - Highly optimized LogTM model for the experiments
- Cholesky and Raytrace benchmarks
  - Both contain a linked list for memory management
  - Cholesky could not be completely transactified
Slide 28: Results
Slide 29: Splash Benchmarks
[figures: Cholesky, Raytrace]
- Adding directives to support value prediction
Slide 30: Splash Benchmarks
TM parameters for 16 processors:

  Benchmark   Xn size   No. of Xn   %Stalls   Writes   %Aborts (LogTM / VP-TM)
  Cholesky    13572     24466       40.2      2.2      30 / 18.8
  Raytrace    7100      46958       32.3      3.9      20 / 13.4
Slide 31: Observations
- The value predictor can improve speedup without much overhead
- Performance gains grow with an increasing number of processors
- Aborts increase as the number of processors increases
- Is TM scalable?
  - More benchmarks needed
Slide 32: Extending the Value Predictor
- Improving the simulation model
- Exploring other types of value predictors
  - Expanding the application scope
- Controlling aggressiveness
  - Adding confidence mechanisms
- Reducing the hardware complexity of the value predictor entry
Slide 33: Proposed Ideas
- The value predictor is not general enough!
  - Need to reduce conflicts
- Better backoff schemes
  - Centralized transaction scheduler
  - "Intelligent" backoff times (sketched below)
    - Expose transactions to the directory
      - begin_Xn and end_Xn messages to the directory?
    - Count the number of memory accesses in each transaction
    - Generate a backoff time based on the count
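A sketch of how such a backoff time could be generated, assuming the directory has counted the conflicting transaction's memory accesses; the constants and the begin_Xn/end_Xn bracketing are hypothetical, as on the slide. The idea is to size the backoff to roughly the time the conflicting transaction still needs, rather than backing off blindly:

    #include <algorithm>
    #include <cstdint>

    // Assumed constants, for illustration only.
    constexpr uint64_t kCyclesPerAccess = 50;   // rough memory-access latency
    constexpr uint64_t kMaxBackoff = 1 << 16;   // cap on backoff cycles

    // accesses_total: accesses counted for the conflicting transaction on a
    // previous run; accesses_done: how many it had issued at conflict time.
    uint64_t backoff_cycles(uint64_t accesses_total, uint64_t accesses_done) {
        uint64_t remaining = accesses_total > accesses_done
                                 ? accesses_total - accesses_done : 1;
        return std::min(remaining * kCyclesPerAccess, kMaxBackoff);
    }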
Slide 34: Proposed Ideas (contd.)
- Why is this different from any other scalability research?
  - Recent work by Bobba shows that HTM design choices can impact performance by almost 80%
  - Are different data/conflict management schemes needed for different applications?
  - STM can help, but performance suffers
  - Can we have both lazy and eager version management?
- Is HTM on large systems a good idea?
Slide 35: Proposed Ideas (contd.)
- The effectiveness of Nacks/stalls decreases as the number of processors increases
  - Need a stalling mechanism without the deadlock overhead
- Stall transactions after restart
- Use timestamps to avoid starvation
- Need to understand the hardware requirements
  - Verilog model
  - Proposals needing hardware evaluation:
    - Value predictor
    - Speculative buffer
Slide 36: Experiments/Analysis
- Need better benchmarks
  - Synchronization-intensive
    - SPECjbb, STAMP, Java Grande benchmarks
  - Larger transactions
- Test up to 64 processors
  - Simulations with SIMICS + GEMS
Slide 37:
- Contributions
  - Identify the scalability bottleneck of TM
  - Value predictor for certain applications
- Proposal
  - Extending the value predictor work
  - Improved backoff schemes
  - Transaction queuing/stalling
  - Hardware evaluation*
- Questions?
Slide 38 (backup): Successful predictions
[protocol diagram: directory goes from M-1, S-2-3 to M-2, S-3; value predictor and CPUs 1-3; messages: FGetX, Retry, Unblock, Nack, Result]
Slide 40 (backup):
- "Overall, TCC's FPGA implementation adds 14% overhead in the control logic, and 29% in on-chip memory, as compared to a non-speculative incarnation of our cache."