Self-managed and automatically reconfigurable stream processing (Vasia Kalavri)
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions (when and how much to scale) currently falls on the user.
In this talk, I share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI'18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI'18), an automatic scaling controller that identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and future challenges in this area.
The shortest path is not always a straight line (Vasia Kalavri)
The document proposes a 3-phase algorithm to compute the metric backbone of a weighted graph, in order to improve the performance of graph algorithms and queries. Phase 1 finds first-order semi-metric edges by examining only triangles. Phase 2 identifies metric edges in 2-hop paths. Phase 3 runs BFS to label the remaining edges. The algorithm removes up to 90% of semi-metric edges and scales to billion-edge graphs. Real-world graphs exhibit significant semi-metricity, and the backbone provides up to 6x speedups for graph queries and analytics.
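To make the triangle-based Phase 1 concrete, here is a minimal sketch: an edge (u, v) is first-order semi-metric if some common neighbor x offers a shorter indirect path u-x-v. The graph, edge names, and weights below are hypothetical examples, not from the paper.

```python
# Illustrative sketch of Phase 1: an edge (u, v) is first-order semi-metric
# if some triangle neighbor x offers a shorter indirect path u-x-v.

def first_order_semi_metric(graph):
    """graph: dict mapping node -> dict of neighbor -> weight (undirected)."""
    semi_metric = set()
    for u in graph:
        for v, w_uv in graph[u].items():
            if u >= v:  # examine each undirected edge once
                continue
            # common neighbors of u and v form the triangles containing (u, v)
            for x in graph[u].keys() & graph[v].keys():
                if graph[u][x] + graph[x][v] < w_uv:
                    semi_metric.add((u, v))
                    break
    return semi_metric

# Edge (a, c) weighs 5, but a-b-c costs 1 + 2 = 3, so (a, c) is semi-metric.
g = {
    "a": {"b": 1, "c": 5},
    "b": {"a": 1, "c": 2},
    "c": {"a": 5, "b": 2},
}
print(first_order_semi_metric(g))  # {('a', 'c')}
```

Because Phase 1 only inspects triangles, it misses semi-metric edges whose shortcut path is longer than two hops; that is exactly what Phases 2 and 3 address.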
Nondeterminism is unavoidable, but data races are pure evil (racesworkshop)
Presentation by Hans-J. Boehm.
Paper and more information: http://soft.vub.ac.be/races/paper/position-paper-nondeterminism-is-unavoidable-but-data-races-are-pure-evil/
(Relative) Safety Properties for Relaxed Approximate Programs (racesworkshop)
Presentation by Michael Carbin.
Paper and more information: http://soft.vub.ac.be/races/paper/relative-safety-properties-for-relaxed-approximate-programs/
Does Better Throughput Require Worse Latency? (racesworkshop)
The document discusses the tradeoff between throughput and latency in parallel systems. It provides examples of how different algorithms for shared counters can impact latency and throughput. Specifically, it shows that increasing throughput, such as through more parallelism or replication, often leads to worse latency due to the increased communication between cores. The document concludes that there is generally a tradeoff between throughput and latency based on the number of readers, writers, and contention level in a parallel system.
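The shared-counter example can be sketched in a few lines. Sharding a counter lets writers proceed without contending on a single cell (better write throughput), but a read must now visit every shard (worse read latency). The class and names below are an invented illustration of the tradeoff, not the talk's actual algorithms.

```python
# Toy illustration of the throughput/latency tradeoff for shared counters.

class ShardedCounter:
    def __init__(self, num_shards):
        self.shards = [0] * num_shards

    def increment(self, writer_id):
        # each writer touches only its own shard: writes do not contend
        self.shards[writer_id % len(self.shards)] += 1

    def read(self):
        # a read must combine all shards: read cost grows with shard count
        return sum(self.shards)

c = ShardedCounter(num_shards=4)
for writer in range(8):
    c.increment(writer)
print(c.read())        # 8
print(len(c.shards))   # 4 cells visited per read
```

With one shard, reads are cheap but every writer contends on the same cell; with many shards, the roles reverse, mirroring the reader/writer/contention tradeoff the talk describes.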
This document summarizes a research paper that proposes a new framework called FinnMun for emulating spreadsheets. The paper introduces FinnMun and describes its implementation. It then discusses the experimental setup and results from evaluating FinnMun on various hardware configurations. The evaluation analyzes trends in metrics like throughput, response time, and hit ratio. The paper finds that FinnMun can successfully emulate spreadsheets and improve system performance. It concludes that FinnMun helps advance research on producer-consumer problems and complex systems.
This document discusses the performance of MochaWet, a system for managing constant-time algorithms. The system is made up of four independent components: probabilistic communication, context-free grammar, Byzantine fault tolerance evaluation, and low-energy configurations. Experimental results show that tripling the effective flash memory speed of topologically stochastic archetypes is crucial to MochaWet's results. The document concludes that MochaWet has set a precedent for synthesizing Byzantine fault tolerance.
If you’re a technical person and you find yourself leading people, it might be worth leaning on what you know: What if we were to understand your team as a distributed system? Even if your team isn’t distributed, it can act like a distributed system.
This document summarizes a research paper that proposes a new heuristic called PAUSE for investigating the producer-consumer problem in distributed systems. The paper motivates the need to study this problem, describes PAUSE's approach of using compact configurations and decentralized components, outlines its implementation in Lisp and Java, and presents experimental results showing PAUSE outperforms previous methods. Related work investigating similar challenges is also discussed.
An introduction to R (ssuser3c3f88)
R is a language and environment for statistical computing and graphics. It provides functions for data manipulation, calculation, and graphical displays. Key features of R include its ability to produce publication-quality plots, perform statistical tests, fit models to data, and develop statistical software. R has an extensive library of additional user-contributed packages that extend its capabilities. The document provides information on downloading and using R, reading data into R, customizing plots, and interactive plotting functions.
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place data items on channels is a crucial issue, with the objective of minimizing the average access time and tuning time. Similarly, how to schedule the data retrieval process for a given request at the client side, such that all the requested items can be downloaded in a short time, is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; however, in data broadcast the objectives of reducing access latency and energy cost can be contradictory. Consequently, we define two new problems: the Minimum Cost Data Retrieval (MCDR) Problem and the Large Number Data Retrieval (LNDR) Problem. We also develop a heuristic algorithm to download a large number of items efficiently. When there is no replicated item in a broadcast cycle, we show that an optimal retrieval schedule can be obtained in polynomial time.
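The client-side scheduling problem can be illustrated with a minimal greedy sketch: channels repeat fixed schedules, the client can tune to one channel per slot, and it grabs any still-needed item it sees, counting elapsed slots (access time) and channel switches (an energy proxy). The schedules and item names are made up; the paper's LNDR heuristic is more sophisticated than this.

```python
# Minimal greedy sketch of downloading a set of items from a push-based
# multichannel broadcast.

def greedy_retrieve(channels, wanted):
    """channels: list of per-channel cyclic item schedules.
    Returns (slots elapsed, channel switches) to fetch every wanted item."""
    remaining, slot, switches, tuned = set(wanted), 0, 0, None
    cycle = max(len(c) for c in channels)
    while remaining and slot < 10 * cycle:  # safety bound for the sketch
        # tune to any channel whose current slot carries a still-needed item
        for ch, sched in enumerate(channels):
            if sched[slot % len(sched)] in remaining:
                remaining.discard(sched[slot % len(sched)])
                if tuned is not None and tuned != ch:
                    switches += 1  # each switch costs extra energy
                tuned = ch
                break
        slot += 1
    return slot, switches

chans = [["a", "b", "c"], ["d", "e", "f"]]
print(greedy_retrieve(chans, {"a", "e", "c"}))  # (3, 2)
```

The sketch makes the latency/energy conflict visible: finishing in 3 slots here requires 2 channel switches, whereas a switch-free schedule would take longer.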
Configuration Optimization for Big Data Software (Pooyan Jamshidi)
The document discusses configuration optimization for big data software using an approach developed in the DICE project funded by the European Union's Horizon 2020 program. It describes optimizing configurations for Apache Storm and Cassandra to significantly reduce configuration time. Experiments showed large performance variations between configurations and that default settings often performed poorly compared to optimized settings. Tuning on one version did not guarantee good performance on other versions, but transferring more observations from other versions improved performance, though with diminishing returns due to increased optimization costs.
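The core loop behind search-based configuration tuning can be sketched simply: sample configurations, measure each, keep the best. The parameter names and the synthetic latency function below are invented for illustration; the DICE work drives real benchmark runs against Storm and Cassandra rather than a closed-form cost.

```python
# Sketch of search-based configuration tuning with plain random search.

import random

def measured_latency(cfg):
    # stand-in for an actual benchmark run of the system under this config;
    # minimized at spouts=3, executors=16 in this synthetic example
    return (cfg["spouts"] - 3) ** 2 + (cfg["executors"] - 16) ** 2 + 5

def random_search(samples, seed=0):
    rng = random.Random(seed)
    best_cfg, best_lat = None, float("inf")
    for _ in range(samples):
        cfg = {"spouts": rng.randint(1, 8), "executors": rng.randint(1, 64)}
        lat = measured_latency(cfg)
        if lat < best_lat:
            best_cfg, best_lat = cfg, lat
    return best_cfg, best_lat

default = {"spouts": 1, "executors": 1}
best_cfg, best_lat = random_search(samples=200)
print(measured_latency(default), ">", best_lat)  # default far from optimal
```

Even this naive search recovers the abstract's observation that defaults often perform poorly; the diminishing-returns effect appears when each extra sample (or transferred observation) costs a real benchmark run.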
The document proposes BergSump, a new framework for analyzing I/O automata. BergSump aims to confirm that superblocks and flip-flop gates are generally incompatible. It discusses related work on XML, wireless networks, and cryptography. The implementation section outlines version 5.9 of BergSump and plans to release the code under an open source license. The evaluation analyzes BergSump's performance and shows its median complexity is better than prior solutions. The conclusion argues that BergSump can successfully observe many sensor networks at once.
This is the course that was presented by James Liddle and Adam Vile for Waters in September 2008.
The book of this course can be found at: http://www.lulu.com/content/4334860
The document proposes a new method called Anvil for analyzing IPv7 configurations using pseudorandom methodologies. It describes Anvil's implementation as a collection of 13 lines of Python shell scripts that must run within the same JVM as the virtual machine monitor. The document outlines experiments run using Anvil to evaluate its performance and compares the results to related work on modeling networked systems.
Brian Klumpe Unification of Producer Consumer Key Pairs (Brian_Klumpe)
This document discusses a framework called Vulva that aims to achieve several goals: (1) confirm that SCSI disks can be made omniscient, stable, and trainable; (2) evaluate the use of public-private key pairs to unify the producer-consumer problem and cryptography; (3) demonstrate that Vulva runs in O(n!) time. The paper describes experiments conducted using Vulva that analyzed seek time, complexity, bandwidth, and other metrics on various systems. However, the results were inconsistent due to bugs and electromagnetic disturbances. The paper also reviews related work on thin clients, online algorithms, and extensible symmetries.
This is a fake scientific article generated by a computer program. It is a parody of science and a perfect example of the problem of our age: achievement without actual knowledge and effort.
Constructing Operating Systems and E-Commerce (IJARIIT)
Information retrieval systems and the partition table, while essential in theory, have not until recently been considered important [15]. In fact, few theorists would disagree with the deployment of massive multiplayer online role-playing games, which embodies the robust principles of complexity theory. In this work we investigate how Smalltalk can be applied to the synthesis of lambda calculus.
This document proposes a new framework called EnodalPincers for understanding DHCP. EnodalPincers uses a novel heuristic to cache multi-processors and explores the exploration of thin clients. The methodology assumes each component enables introspective algorithms independently. Experimental results show EnodalPincers has an expected response time and energy usage that varies with work factor and signal-to-noise ratio. In conclusion, EnodalPincers runs in Θ(log n) time like other stable algorithms for congestion control.
The large-scale cyberinformatics method to replication is defined not only by the analysis of local-area networks, but also by the structured need for the Internet. Here, we confirm the refinement of superpages, which embodies the unfortunate principles of operating systems. SHODE, our new methodology for secure methodologies, is the solution to all of these obstacles.
SECURE & EFFICIENT AUDIT SERVICE OUTSOURCING FOR DATA INTEGRITY IN CLOUDS (Gyan Prakash)
Cloud-based outsourced storage relieves the client's load for storage management and maintenance by providing a comparably low-cost, scalable, location-independent platform. However, the fact that clients no longer have physical control of their data means they face a potentially formidable risk of missing or corrupted data. To avoid these security risks, audit services are critical to ensure the integrity and availability of outsourced data and to support digital forensics and reliability in cloud computing. Provable data possession (PDP), a cryptographic method for verifying the integrity of data without retrieving it from an untrusted server, can be used to realize audit services. In this project, profiting from interactive zero-knowledge proof systems, we construct an interactive PDP protocol that prevents fraudulence of the prover (soundness property) and leakage of verified data (zero-knowledge property). We prove that our construction holds these properties based on the computational Diffie–Hellman assumption and a rewindable black-box knowledge extractor. An efficient mechanism based on probabilistic queries and periodic verification is proposed to reduce the audit cost per verification and to detect abnormalities in a timely manner. We also present an efficient method for choosing an optimal parameter value to reduce the computational overhead of cloud audit services.
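The probabilistic-audit structure behind PDP can be conveyed with a deliberately simplified sketch: the client keeps per-block tags and challenges the server on a random sample of blocks instead of downloading the file. Real PDP uses homomorphic tags and, in this work, an interactive zero-knowledge protocol; the HMAC scheme, key, and block layout below are invented solely to show the spot-check idea.

```python
# Simplified spot-check illustration of provable data possession (PDP).
# NOT the paper's protocol: real PDP tags are homomorphic and the audit
# here is interactive and zero-knowledge.

import hashlib
import hmac
import random

KEY = b"client-secret"  # hypothetical client key

def tag(index, block):
    return hmac.new(KEY, index.to_bytes(4, "big") + block, hashlib.sha256).digest()

blocks = [f"block-{i}".encode() for i in range(100)]
tags = {i: tag(i, b) for i, b in enumerate(blocks)}  # retained by the client

def audit(server_blocks, sample_size, seed=1):
    rng = random.Random(seed)
    for i in rng.sample(range(len(tags)), sample_size):
        if tag(i, server_blocks[i]) != tags[i]:
            return False  # server cannot forge a tag without the key
    return True

print(audit(blocks, sample_size=10))  # True: data intact
corrupted = list(blocks)
corrupted[7] = b"oops"
print(audit(corrupted, sample_size=100))  # False: a full audit detects it
```

Sampling is what makes the audit cheap: a small random challenge detects large-scale corruption with high probability, and the paper's periodic-verification mechanism tunes the sample size against the desired detection guarantee.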
Discover the Unseen: Tailored Recommendation of Unwatched Content (ScyllaDB)
The session shares how JioCinema approaches "watch discounting." This capability ensures that once a user has watched a certain amount of a show or movie, the platform no longer recommends that content to the user. Flawless operation of this feature promotes the discovery of new content, improving the overall user experience.
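The filtering rule at the heart of watch discounting can be sketched in a few lines: drop a title from the candidate list once the user's watched fraction crosses a threshold. The threshold value and data shapes are illustrative assumptions; JioCinema's production pipeline is far richer.

```python
# Minimal sketch of watch discounting over a recommendation candidate list.

WATCHED_THRESHOLD = 0.9  # assumed cutoff for "effectively watched"

def discount_watched(candidates, watch_progress):
    """watch_progress: title -> fraction of runtime watched (0.0-1.0)."""
    return [t for t in candidates
            if watch_progress.get(t, 0.0) < WATCHED_THRESHOLD]

progress = {"movie_a": 0.95, "show_b": 0.4}
print(discount_watched(["movie_a", "show_b", "movie_c"], progress))
# ['show_b', 'movie_c']
```

The hard part in production is not the filter itself but serving per-user watch progress at recommendation-time latency and scale, which is where the session's data-store discussion comes in.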
JioCinema is an Indian over-the-top media streaming service owned by Viacom18.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk (Fwdays)
In this talk we discuss DDoS protection tools and best practices, network architectures, and what AWS has to offer. We also look into one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022, and see which techniques helped keep web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on the Ukraine experience.
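One common building block in the mitigation practices such talks cover is request rate limiting; a token bucket is the classic shape. The capacity and refill rate below are illustrative, and this in-process sketch stands in for what is normally enforced at the edge (e.g., by a WAF or load balancer).

```python
# Token-bucket rate limiter sketch: absorb short bursts, shed sustained floods.

class TokenBucket:
    def __init__(self, capacity, refill_per_tick):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_tick

    def tick(self):
        # called once per time unit to replenish tokens, up to capacity
        self.tokens = min(self.capacity, self.tokens + self.refill)

    def allow(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request shed: the bucket is empty

bucket = TokenBucket(capacity=3, refill_per_tick=1)
burst = [bucket.allow() for _ in range(5)]  # a 5-request burst in one tick
print(burst)  # [True, True, True, False, False]
bucket.tick()
print(bucket.allow())  # True again after refill
```

The capacity bounds the tolerated burst while the refill rate bounds the sustained throughput, which is why the two are tuned separately in practice.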
"Scaling RAG Applications to serve millions of users", Kevin Goedecke (Fwdays)
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months, with lessons from technical challenges around managing high load for LLMs, RAG pipelines, and vector databases.
inQuba Webinar: Mastering Customer Journey Management with Dr Graham Hill (LizaNolte)
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
"Choosing proper type of scaling", Olena Syrota (Fwdays)
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
Introducing BoxLang: A new JVM language for productivity and modularity! (Ortus Solutions, Corp)
Just like life, our code must adapt to the ever-changing world we live in: one day coding for the web, the next for tablets, APIs, or serverless applications. Multi-runtime development is the future of coding; the future is dynamic. Let us introduce you to BoxLang.
Dynamic. Modular. Productive.
BoxLang redefines development with its dynamic nature, empowering developers to craft expressive and functional code effortlessly. Its modular architecture prioritizes flexibility, allowing for seamless integration into existing ecosystems.
Interoperability at its Core
With 100% interoperability with Java, BoxLang seamlessly bridges the gap between traditional and modern development paradigms, unlocking new possibilities for innovation and collaboration.
Multi-Runtime
From the tiny 2m operating system binary to running on our pure Java web server, CommandBox, Jakarta EE, AWS Lambda, Microsoft Functions, Web Assembly, Android, and more. BoxLang has been designed to enhance and adapt according to its runtime.
The Fusion of Modernity and Tradition
Experience the fusion of modern features inspired by CFML, Node, Ruby, Kotlin, Java, and Clojure, combined with the familiarity of Java bytecode compilation, making BoxLang a language of choice for forward-thinking developers.
Empowering Transition with Transpiler Support
Transitioning from CFML to BoxLang is seamless with our JIT transpiler, facilitating smooth migration and preserving existing code investments.
Unlocking Creativity with IDE Tools
Unleash your creativity with powerful IDE tools tailored for BoxLang, providing an intuitive development experience and streamlining your workflow. Join us as we embark on a journey to redefine JVM development. Welcome to the era of BoxLang.
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
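The elasticity argument for tablets can be sketched in miniature: a table's token range is carved into independently movable fragments, so a hot tablet can be split and one half migrated without touching the rest of the table. The ranges, node names, and split policy below are invented; ScyllaDB's actual tablet scheduler is considerably more involved.

```python
# Rough sketch of tablet-based data distribution: split and migrate
# individual fragments of a table independently.

class Tablet:
    def __init__(self, lo, hi, node):
        self.lo, self.hi, self.node = lo, hi, node  # token range [lo, hi)

    def split(self):
        mid = (self.lo + self.hi) // 2
        return [Tablet(self.lo, mid, self.node), Tablet(mid, self.hi, self.node)]

tablets = [Tablet(0, 2**16, "node-1"), Tablet(2**16, 2**17, "node-2")]
hot = tablets.pop(0)
tablets[0:0] = hot.split()   # split only the overloaded fragment...
tablets[1].node = "node-3"   # ...and migrate one half, leaving the rest alone
print([(t.lo, t.hi, t.node) for t in tablets])
```

Contrast this with vNode replication, where ownership is tied to node-level token assignments and rebalancing moves much coarser units of data; per-tablet movement is what enables the dynamic distribution and elasticity the keynote describes.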
The Microsoft 365 Migration Tutorial For Beginner.pptx (operationspcvita)
This presentation will help you understand the power of Microsoft 365. It also covers every productivity app included in Office 365, discusses migration scenarios related to Office 365, and explains how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
Must Know Postgres Extension for DBA and Developer during Migration (Mydbops)
Mydbops Opensource Database Meetup 16
Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration
Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting
Date & Time: 8th June | 10 AM - 1 PM IST
Venue: Bangalore International Centre, Bangalore
Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle.
Key Takeaways:
* Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities.
* Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom.
* Discover how these key extensions can empower both developers and DBAs during the migration process.
* Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends.
Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL.
Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability.
Contact us: info@mydbops.com
Visit: https://www.mydbops.com/
Follow us on LinkedIn: https://in.linkedin.com/company/mydbops
For more details and updates, please follow the links below.
Meetup Page : https://www.meetup.com/mydbops-databa...
Twitter: https://twitter.com/mydbopsofficial
Blogs: https://www.mydbops.com/blog/
Facebook(Meta): https://www.facebook.com/mydbops/
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA, and what are its benefits?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime run from $5,000 to $10,000. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Ukraine
In this talk, we answer why improving application performance matters and which approaches are most effective. We also discuss what a cache is, which kinds of caches exist, and, most importantly, how to find a performance bottleneck.
Video and event details: https://bit.ly/45tILxj
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfleebarnesutopia
So… you want to become a Test Automation Engineer (or hire and develop one)? While there’s quite a bit of information available about important technical and tool skills to master, there’s not enough discussion around the path to becoming an effective Test Automation Engineer who knows how to add VALUE. In my experience, this has led to a proliferation of engineers who are proficient with tools and building frameworks but have skill and knowledge gaps, especially in software testing, that reduce the value they deliver with test automation.
In this talk, Lee will share his lessons learned from over 30 years of working with, and mentoring, hundreds of Test Automation Engineers. Whether you’re looking to get started in test automation or just want to improve your trade, this talk will give you a solid foundation and roadmap for ensuring your test automation efforts continuously add value. This talk is equally valuable for both aspiring Test Automation Engineers and those managing them! All attendees will take away a set of key foundational knowledge and a high-level learning path for leveling up test automation skills and ensuring they add value to their organizations.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...AlexanderRichford
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation Functions to Prevent Interaction with Malicious QR Codes.
Aim of the Study: The goal of this research was to develop a robust hybrid approach for identifying malicious and insecure URLs derived from QR codes, ensuring safe interactions.
This is achieved through:
Machine Learning Model: Predicts the likelihood of a URL being malicious.
Security Validation Functions: Ensures the derived URL has a valid certificate and proper URL format.
This innovative blend of technology aims to enhance cybersecurity measures and protect users from potential threats hidden within QR codes 🖥 🔒
This study was my first introduction to using ML which has shown me the immense potential of ML in creating more secure digital environments!
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect, Anika Systems
2. Thank You
✦ Stefan Marr, Mattias De Wael
✦ Presenters
✦ Authors
✦ Program Committee
✦ Co-chair & Organizer: Theo D’Hondt
✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard
✦ Voters
Saturday 4 May 13
3. Announcements
✦ Program at:
✦ http://soft.vub.ac.be/races/program/
✦ Strict timekeepers
✦ Dinner?
✦ Recording
4. 9:00 Lightning and Welcome
9:10 Unsynchronized Techniques for Approximate Parallel Computing
9:35 Programming with Relaxed Synchronization
9:50 (Relative) Safety Properties for Relaxed Approximate Programs
10:05 Break
10:35 Nondeterminism is unavoidable, but data races are pure evil
11:00 Discussion
11:45 Lunch
1:15 How FIFO is Your Concurrent FIFO Queue?
1:35 The case for relativistic programming
1:55 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
2:15 Does Better Throughput Require Worse Latency?
2:30 Parallel Sorting on a Spatial Computer
2:50 Break
3:25 Dancing with Uncertainty
3:45 Beyond Expert-Only Parallel Programming
4:00 Discussion
4:30 Wrap up
6. Expandable Array
(diagram: shared pointer a → array object with length = 4, next = 2, values[])

Two threads execute append(o) concurrently:

append(o)                  append(o)
  c = a;                     c = a;
  i = c.next;                i = c.next;
  if (c.length <= i)         if (c.length <= i)
    n = expand c;              n = expand c;
    a = n; c = n;              a = n; c = n;
  c.values[i] = o;           c.values[i] = o;
  c.next = i + 1;            c.next = i + 1;
12. Expandable Array
(diagram: shared pointer a → array object with length = 4, next = 2, values[])

append(o)                  append(o)
  c = a;                     c = a;
  i = c.next;                i = c.next;
  if (c.length <= i)         if (c.length <= i)
    n = expand c;              n = expand c;
    a = n; c = n;              a = n; c = n;
  c.values[i] = o;           c.values[i] = o;
  c.next = i + 1;            c.next = i + 1;

Data Race!
16. Towards Approximate Computing: Programming with Relaxed Synchronization
Renganarayanan et al., IBM Research, RACES’12, Oct. 21, 2012

(figure: a spectrum from the computing model of today to the human brain)
  Computation: Precise → Less precise
  Data: Accurate → Less accurate, less up-to-date, possibly corrupted
  Hardware: Reliable → Variable
Relaxed synchronization sits toward the human-brain end of this spectrum.
18. Nondeterminism is Unavoidable, but Data Races are Pure Evil
Hans-J. Boehm, HP Labs

• Much low-level code is inherently nondeterministic, but
• Data races
  – Are forbidden by the C/C++/OpenMP/Posix language standards.
  – May break code now, or when you recompile.
  – Don’t improve scalability significantly, even if the code still works.
  – Are easily avoidable in C11 & C++11.
19. How FIFO is Your Concurrent FIFO Queue?
Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer
University of Salzburg

Semantically correct, and therefore “slow”, FIFO queues vs. semantically relaxed, and thereby “fast”, FIFO queues.
Semantically relaxed FIFO queues can appear more FIFO than semantically correct FIFO queues.
20. A Case for Relativistic Programming
Philip W. Howard and Jonathan Walpole

• Alter ordering requirements (causal, not total)
• Don’t alter correctness requirements
• High performance, highly scalable
• Easy to program
22. Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Introduction

As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

  Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and Latency

For this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.

Table 1: Hypothetical figures if the tradeoff were linear

                                                  Version A    Version B
Core count                                        10           10
Best-possible inter-core latency                  200 µs       200 µs
Mean observed latency in application              1,000 µs     3,000 µs
Normalized latency (observed / best possible)     5            15
App-operations/sec. (1 core)                      1,000        1,000
App-operations/sec. (10 cores)                    2,500        7,500
Normalized throughput (vs. perfect scaling)       0.25         0.75
Latency / Throughput                              20           20
A Progression of Techniques Trading Throughput for Latency

As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock, can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.
Table 1: Hypothetical figures if the tradeoff were linear

                                                 Version A    Version B
  Core count                                            10           10
  Best-possible inter-core latency                  200 µs       200 µs
  Mean observed latency in application            1,000 µs     3,000 µs
  Normalized latency (observed / best possible)          5           15
  App-operations/sec (1 core)                        1,000        1,000
  App-operations/sec (10 cores)                      2,500        7,500
  Normalized throughput (vs. perfect scaling)         0.25         0.75
  Latency / throughput                                  20           20
A Progression of Techniques Trading Throughput for Latency
As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock and the processing time lost while waiting for one can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must retry.
Slide summary: taking turns and broadcasting changes gives low latency; dividing into sections and working round-robin gives high throughput (throughput -> parallel -> distributed/replicated -> latency).
David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM
Saturday 4 May 13
Parallel sorting on a spatial computer
Max Orhai, Andrew P. Black
Spatial computing offers insights into:
• the costs and constraints of communication in large parallel computer arrays
• how to design algorithms that respect these costs and constraints