Welcome and Lightning Intros

The first item on the program is a welcome from the organizers and a brief 30-second introduction to each presentation by its speaker.

Transcript of "Welcome and Lightning Intros"

  1. Welcome to RACES'12
  2. Thank You
     ✦ Stefan Marr, Mattias De Wael
     ✦ Presenters
     ✦ Authors
     ✦ Program Committee
     ✦ Co-chair & Organizer: Theo D'Hondt
     ✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard
     ✦ Voters
  3. Announcements
     ✦ Program at: http://soft.vub.ac.be/races/program/
     ✦ Strict timekeepers
     ✦ Dinner?
     ✦ Recording
  4. 9:00  Lightning and Welcome
     9:10  Unsynchronized Techniques for Approximate Parallel Computing
     9:35  Programming with Relaxed Synchronization
     9:50  (Relative) Safety Properties for Relaxed Approximate Programs
     10:05 Break
     10:35 Nondeterminism is unavoidable, but data races are pure evil
     11:00 Discussion
     11:45 Lunch
     1:15  How FIFO is Your Concurrent FIFO Queue?
     1:35  The case for relativistic programming
     1:55  Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
     2:15  Does Better Throughput Require Worse Latency?
     2:30  Parallel Sorting on a Spatial Computer
     2:50  Break
     3:25  Dancing with Uncertainty
     3:45  Beyond Expert-Only Parallel Programming
     4:00  Discussion
     4:30  Wrap up
  5. Lightning
  6.–15. Expandable Array (animation across slides 6–15): two threads concurrently
     run append(o) on a shared expandable array a, with fields length, next, and
     values:

         append(o)
           c = a;
           i = c.next;
           if (c.length <= i)
             n = expand c;
             a = n; c = n;
           c.values[i] = o;
           c.next = i + 1;

     The animation steps the two threads through this code; from slide 12 onward
     the interleaving is flagged with "Data Race!".
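To make the hazard on slides 6–15 concrete, here is a minimal C++ sketch of the same unsynchronized append; the class and field names mirror the slide pseudocode, and the 200,000-element test is my own addition, not part of the talk. Two threads can read the same value of next and store into the same slot, so one append is lost; under the C11/C++11 memory model this is a data race and therefore undefined behavior, so treat the program purely as a demonstration.

    // Unsynchronized append, after the pseudocode on slides 6-15 (illustration only).
    #include <cstdio>
    #include <thread>

    struct ExpandableArray {
        static const int kCapacity = 1 << 20;   // fixed capacity keeps the sketch simple;
        int values[kCapacity];                  // the slides also expand the array
        int next = 0;                           // index of the next free slot

        void append(int o) {                    // deliberately unsynchronized
            int i = next;                       // both threads may read the same i ...
            if (i < kCapacity) {
                values[i] = o;                  // ... and both store into slot i
                next = i + 1;                   // one of the colliding appends is lost
            }
        }
    };

    static ExpandableArray a;

    int main() {
        auto writer = [](int id) {
            for (int k = 0; k < 100000; ++k) a.append(id);
        };
        std::thread t1(writer, 1), t2(writer, 2);
        t1.join();
        t2.join();
        // With correct synchronization next would be 200000; with the race it is
        // usually smaller, because colliding appends overwrite the same slot.
        std::printf("next = %d (expected 200000)\n", a.next);
        return 0;
    }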
  16. Towards Approximate Computing: Programming with Relaxed Synchronization
      Renganarayanan et al., IBM Research, RACES'12, Oct. 21, 2012
      [Chart: Computation (Precise → Less Precise), Data (Accurate → Less Accurate,
      less up-to-date, possibly corrupted), Hardware (Reliable → Variable); points
      plotted: computing model today, Human Brain, Relaxed Synchronization]
  17. (Relative) Safety Properties for Relaxed Approximate Programs
      Michael Carbin and Martin Rinard
  18. Nondeterminism is Unavoidable, but Data Races are Pure Evil
      Hans-J. Boehm, HP Labs
      • Much low-level code is inherently nondeterministic, but
      • Data races
        – Are forbidden by the C/C++/OpenMP/Posix language standards.
        – May break code now or when you recompile.
        – Don't improve scalability significantly, even if the code still works.
        – Are easily avoidable in C11 & C++11.
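Boehm's last bullet is easy to demonstrate: the append above becomes race-free with a few lines of C++11. This is my own minimal sketch, not code from the talk; it uses a mutex, and for a bare counter std::atomic<int>::fetch_add would serve equally well.

    // Race-free append using C++11 synchronization (illustration only).
    #include <cstddef>
    #include <mutex>
    #include <vector>

    class SafeExpandableArray {
        std::vector<int> values_;
        std::mutex m_;
    public:
        void append(int o) {
            std::lock_guard<std::mutex> lock(m_);   // one writer at a time:
            values_.push_back(o);                   // no data race, no lost appends
        }
        std::size_t size() {
            std::lock_guard<std::mutex> lock(m_);
            return values_.size();
        }
    };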
  19. How FIFO is Your Concurrent FIFO Queue?
      Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer, University of Salzburg
      Semantically correct and therefore "slow" FIFO queues vs. semantically relaxed
      and thereby "fast" FIFO queues. Semantically relaxed FIFO queues can appear
      more FIFO than semantically correct FIFO queues.
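As a flavor of what "semantically relaxed" can mean, here is a toy queue of my own devising, not the Salzburg authors' algorithm: elements are spread round-robin over K independently locked sub-queues, so contention on any one lock drops, but an element can overtake one enqueued slightly earlier, i.e. FIFO order is only approximate.

    // A toy relaxed queue: K locked sub-queues, round-robin placement (illustration only).
    #include <array>
    #include <atomic>
    #include <deque>
    #include <mutex>
    #include <optional>

    template <typename T, std::size_t K = 4>
    class RelaxedQueue {
        struct Sub { std::mutex m; std::deque<T> q; };
        std::array<Sub, K> subs_;
        std::atomic<std::size_t> enq_{0}, deq_{0};
    public:
        void enqueue(T v) {
            Sub& s = subs_[enq_.fetch_add(1) % K];      // round-robin placement
            std::lock_guard<std::mutex> lock(s.m);
            s.q.push_back(std::move(v));
        }
        std::optional<T> dequeue() {                    // only approximately FIFO
            std::size_t start = deq_.fetch_add(1) % K;
            for (std::size_t i = 0; i < K; ++i) {       // probe each sub-queue once
                Sub& s = subs_[(start + i) % K];
                std::lock_guard<std::mutex> lock(s.m);
                if (!s.q.empty()) {
                    T v = std::move(s.q.front());
                    s.q.pop_front();
                    return v;
                }
            }
            return std::nullopt;                        // all sub-queues looked empty
        }
    };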
  20. A Case for Relativistic Programming
      Philip W. Howard and Jonathan Walpole
      • Alter ordering requirements (causal, not total)
      • Don't alter correctness requirements
      • High performance, highly scalable
      • Easy to program
  21. Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering
      Trey Cain and Mikko Lipasti, IBM Research, RACES'12, Oct 21, 2012
      § From the RACES website: "an approach towards scalability that reduces
        synchronization requirements drastically, possibly to the point of
        discarding them altogether."
      § A hardware developer's perspective:
        – Constraints of legacy code: what if we want to apply this principle, but
          have no control over the applications that are running on a system?
        – Can one build a coherence protocol that avoids synchronizing cores as
          much as possible? For example, by allowing each core to use stale
          versions of cache lines as long as possible, while maintaining
          architectural correctness (i.e., we will not break existing code).
          If we do that, what will happen?
  22. Does Better Throughput Require Worse Latency?
      David Ungar, Doug Kimelman, Sam Adams, Mark Wegman, IBM T. J. Watson Research Center
      (The slide shows the position paper; excerpt below.)

      Introduction. As we continue to make the transition from uniprocessor to
      multicore programming, pushed along by the changing trajectory of hardware
      technology and system architecture, we are seeing an explosion of techniques
      for crossing the chasm between sequential and parallel data structures and
      algorithms. In considering a spectrum of techniques for moderating
      application access to shared data on multicore and manycore systems, we have
      observed that as application synchronization latency gets closer to hardware
      inter-core latency, throughput decreases. The spectrum of techniques we
      looked at includes: locks and mutexes, lock-free approaches based on atomic
      instructions, RCU, and (non-deterministic) race-and-repair. Below we present
      definitions of our notion of synchronization latency and throughput, and
      describe our observation in greater detail. We conclude by wondering whether
      there is a fundamental law relating latency to throughput:

          Algorithms that improve application-level throughput worsen inter-core
          application-level latency.

      We believe that such a law would be of great utility as a unification that
      would provide a common perspective from which to view and compare
      synchronization approaches.

      Throughput and Latency. For this proposal, we define throughput and latency
      as follows:
      • Throughput is the amount of application-level work performed in unit time,
        normalized to the amount of work that would be accomplished with perfect
        linear scaling. In other words, a throughput of 1.0 would be achieved by a
        system that performed N times as much work per unit time with N cores as
        it did with one core. This formulation reflects how well an application
        exploits the parallelism of multiple cores.
      • Latency denotes the mean time required for a thread on one core to observe
        a change effected by a thread on another core, normalized to the best
        latency possible for the given platform. This formulation isolates the
        latency inherent in the algorithms and data structures from the latency
        arising out of the platform (operating system, processor, storage
        hierarchy, communication network, etc.). As an example of
        algorithm-and-data-structure-imposed latency, if one chooses to replicate
        a data structure, it will take additional time to update the replicas. The
        best possible latency for a given platform can be difficult to determine,
        but nonetheless it constitutes a real lower bound for the overall latency
        that is apparent to an application.

      Table 1 presents some fictional numbers in order to illustrate the concept:
      it describes two versions of the same application, A and B, running on a
      hypothetical system. The numbers are consistent with a linear version of the
      proposed law, because Version B sacrifices a factor of three in latency to
      gain a factor of 3 in throughput.

      Table 1: Hypothetical figures if tradeoff were linear

        Version                                          A           B
        Core count                                       10          10
        Best-possible inter-core latency                 200 µs      200 µs
        Mean observed latency in application             1,000 µs    3,000 µs
        Normalized latency (observed / best possible)    5           15
        App. operations/sec. (1 core)                    1,000       1,000
        App. operations/sec. (10 cores)                  2,500       7,500
        Normalized throughput (vs. perfect scaling)      0.25        0.75
        Latency / Throughput                             20          20

      A Progression of Techniques Trading Throughput for Latency. As techniques
      have evolved for improving performance, each seems to have offered more
      throughput at the expense of increased latency:
      • Mutexes and Locks: Mutexes and locks are perhaps the simplest method for
        protecting shared data [1]. In this style, each thread obtains a shared
        lock (or mutex) on a data structure before accessing or modifying it.
        Latency is minimized because a waiter will observe any changes as soon as
        the updating thread releases the lock. However, the overhead required to
        obtain a lock, and the processing time lost while waiting for a lock, can
        severely limit throughput.
      • Lock-Free: In the lock-free style, each shared data structure is organized
        so that any potential races are confined to a single word. An updating
        thread need not lock the structure in advance. Instead, it prepares an
        updated value, then uses an atomic instruction (such as Compare-And-Swap)
        to attempt to store the value into the word [1]. The atomic instruction
        ensures that the word was not changed by some other thread while the
        updater was working. If it was changed, the updater must retry.

      Slide annotations:
      Taking turns, broadcasting changes: low latency.
      Dividing into sections, round-robin: high throughput.
      throughput -> parallel -> distributed/replicated -> latency
      David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM
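The normalized figures in Table 1 follow directly from the raw numbers; as a worked check of the two definitions (just arithmetic on the quoted data, not part of the talk):

    \[
      \text{normalized latency} \;=\; \frac{\text{observed inter-core latency}}{\text{best possible latency}},
      \qquad
      \text{normalized throughput} \;=\; \frac{\text{ops/sec on } N \text{ cores}}{N \cdot \text{ops/sec on 1 core}}
    \]
    \[
      \text{A: } \frac{1000\,\mu\mathrm{s}}{200\,\mu\mathrm{s}} = 5,\quad
                 \frac{2500}{10 \cdot 1000} = 0.25,\quad
                 \frac{5}{0.25} = 20;
      \qquad
      \text{B: } \frac{3000\,\mu\mathrm{s}}{200\,\mu\mathrm{s}} = 15,\quad
                 \frac{7500}{10 \cdot 1000} = 0.75,\quad
                 \frac{15}{0.75} = 20.
    \]

The identical latency/throughput ratio of 20 is what makes these hypothetical numbers "consistent with a linear version of the proposed law."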
  23. Parallel Sorting on a Spatial Computer
      Max Orhai, Andrew P. Black
      Spatial computing offers insights into:
      • the costs and constraints of communication in large parallel computer arrays
      • how to design algorithms that respect these costs and constraints
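A classic example of a sort that respects exactly this kind of constraint (communication only between adjacent cells) is odd-even transposition sort. The sketch below is my own illustration of neighbor-only sorting and is not claimed to be Orhai and Black's algorithm; on a spatial computer each phase would run with all pairs exchanging in parallel.

    // Odd-even transposition sort: each phase compares only adjacent cells (illustration only).
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    void odd_even_transposition_sort(std::vector<int>& cells) {
        const std::size_t n = cells.size();
        for (std::size_t phase = 0; phase < n; ++phase) {
            // Even phases pair cells (0,1),(2,3),...; odd phases pair (1,2),(3,4),...
            for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
                if (cells[i] > cells[i + 1])
                    std::swap(cells[i], cells[i + 1]);   // neighbor-only communication
            }
            // On a spatial computer every pair in a phase exchanges simultaneously.
        }
    }

    int main() {
        std::vector<int> v{5, 1, 4, 2, 3};
        odd_even_transposition_sort(v);
        for (int x : v) std::printf("%d ", x);           // prints: 1 2 3 4 5
        std::printf("\n");
        return 0;
    }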
  24. Dancing with Uncertainty
      Sasa Misailovic, Stelios Sidiroglou and Martin Rinard
  25. Sea Change in Linux-Kernel Parallel Programming
      Paul E. McKenney: Beyond Expert-Only Parallel Programming?
      In 2006, Linus Torvalds noted that since 2003, the Linux kernel community's
      grasp of concurrency had improved to the point that patches were often
      correct at first submission.
      Why the improvement?
      – Not programming language: C before, during, and after
      – Not synchronization primitives: locking before, during, and after
      – Not a change in personnel: relatively low turnover
      – Not born parallel programmers: remember the Big Kernel Lock!
      So what was it?
      – Stick around for the discussion this afternoon and find out!