LOWERING STM OVERHEAD WITH
STATIC ANALYSIS
Yehuda Afek, Guy Korland, Arie Zilberstein
Tel-Aviv University
LCPC 2010
OUTLINE
Background on STM, TL2.
STM overhead and common optimizations.
New optimizations.
Experimental results.
Conclusion.
SOFTWARE TRANSACTIONAL MEMORY
Aims to ease concurrent programming.
Idea: enclose code in atomic blocks.
Code inside an atomic block behaves as a transaction:
Atomic (executes entirely or not at all).
Consistent.
Isolated (not affected by other concurrent transactions).
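As a concrete illustration, here is a minimal Java sketch of an atomic block (with Deuce STM, used later in this talk, the block would be a method carrying the @org.deuce.Atomic annotation; it is shown as a comment here to keep the sketch dependency-free):

    class BankAccount {
        private int balance;

        // Conceptually: atomic { ... }
        // With Deuce STM this would be an @org.deuce.Atomic-annotated method.
        public void transfer(BankAccount to, int amount) {
            this.balance -= amount;   // either both writes take effect...
            to.balance += amount;     // ...or neither, and the pair is isolated
                                      // from concurrent transactions
        }
    }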
SOFTWARE TRANSACTIONAL MEMORY
Implementation:
STM compiler instruments every memory access inside atomic blocks.
STM library functions handle the synchronization according to a protocol.
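A minimal sketch of what that instrumentation amounts to; StmContext and its readInt/writeInt barrier names are invented for illustration and are not the actual Deuce API:

    // Hypothetical per-transaction barrier interface; real STM libraries
    // expose something similar, but names and signatures differ.
    interface StmContext {
        int readInt(Object owner, String field);               // read barrier
        void writeInt(Object owner, String field, int value);  // write barrier
    }

    class Counter {
        int value;

        // The programmer writes:  atomic { value = value + 1; }
        // The STM compiler rewrites the body so that every memory access
        // goes through the library, which applies the protocol:
        void incrementInstrumented(StmContext ctx) {
            int v = ctx.readInt(this, "value");
            ctx.writeInt(this, "value", v + 1);
        }
    }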
TRANSACTIONAL LOCKING II
TL2 is an influential STM protocol.
Features:
Lock-based.
Word-based.
Lazy-update.
Achieves synchronization through versioned write-locks + global version clock.
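For intuition, a highly simplified sketch of how a TL2-style read barrier uses the versioned write-locks and the global clock (all class and method names here are invented; the writeset, commit-time locking, the readset, and retry logic are omitted):

    import java.util.concurrent.atomic.AtomicLong;

    class Tl2ReadSketch {
        static final AtomicLong globalVersionClock = new AtomicLong();

        // One versioned write-lock guards each memory word (or stripe).
        static class VersionedLock {
            volatile long word;                    // low bit = locked, rest = version
            boolean lockedOrNewerThan(long readVersion) {
                long w = word;
                return (w & 1L) == 1L || (w >>> 1) > readVersion;
            }
        }

        final long readVersion = globalVersionClock.get();   // sampled at txn start

        int transactionalRead(int[] memory, int index, VersionedLock lock) {
            int value = memory[index];              // speculative read
            if (lock.lockedOrNewerThan(readVersion)) {
                throw new IllegalStateException("abort: inconsistent read");
            }
            return value;                            // consistent w.r.t. readVersion
        }
    }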
TRANSACTIONAL LOCKING II
Advantages of TL2:
Locks are held for a short time.
Zombie transactions are quickly aborted.
Rollback is cheap.
STM OVERHEAD
Instrumenting all transactional memory accesses induces a huge performance overhead.
STM compiler optimizations reduce the overhead.
STM COMPILER OPTIMIZATIONS
Common compiler optimizations:
1. Avoiding instrumentation of accesses to immutable and transaction-local memory (illustrated in the sketch below).
2. Avoiding lock acquisitions and releases for thread-local memory.
3. Avoiding readset population in read-only transactions.
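A small sketch of optimization 1 (class and field names are made up): accesses to immutable fields and to objects allocated inside the transaction need no barriers, while shared mutable fields stay instrumented.

    final class Point {
        final int x;                   // immutable after construction
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    class Accumulator {
        int total;                     // shared, mutable

        // Imagine this method body is enclosed in an atomic block.
        void add(Point p) {
            Point local = new Point(p.x, p.y);  // transaction-local object: no
                                                // other thread can reach it, so
                                                // local.x / local.y need no barriers
            int sum = local.x + local.y;        // reads of final fields: immutable,
                                                // no instrumentation needed
            total += sum;                       // shared mutable field: keeps its
                                                // read and write barriers
        }
    }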
NEW STM COMPILER OPTIMIZATIONS
In this work:
1. Reduce the number of instrumented memory reads using load elimination.
2. Reduce the number of instrumented memory writes using scalar promotion.
3. Avoid writeset lookups for memory not yet written to.
4. Avoid writeset recordkeeping for memory that will not be read.
1. LOAD ELIMINATION IN ATOMIC BLOCKS
Before (5 instrumented memory reads per loop iteration):
    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j] + feature[i][j];
    }
After applying Lazy Code Motion (2 instrumented memory reads per loop iteration):
    if (0 < nfeatures) {
        nci = new_centers[index];
        fi = feature[i];
        for (j = 0; j < nfeatures; j++) {
            nci[j] = nci[j] + fi[j];
        }
    }
1. LOAD ELIMINATION IN ATOMIC BLOCKS (cont.)
    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j] + feature[i][j];
    }
Key insight: no need to check whether new_centers[index] can change in other threads.
Still need to check that it cannot change locally or through method calls.
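For contrast, a sketch (the rebalance helper is invented, not part of the benchmark) where hoisting new_centers[index] would be unsound, because a call inside the loop may replace that row:

    // Here load elimination must not hoist new_centers[index]:
    void accumulate(float[][] new_centers, float[][] feature,
                    int index, int i, int nfeatures) {
        for (int j = 0; j < nfeatures; j++) {
            new_centers[index][j] += feature[i][j];
            rebalance(new_centers, index);   // may assign a new row to
                                             // new_centers[index], so the value
                                             // can change locally between iterations
        }
    }

    void rebalance(float[][] centers, int index) {
        // made-up helper that may reassign centers[index]
    }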
2. SCALAR PROMOTION IN ATOMIC BLOCKS
Before (num_elts instrumented memory writes):
    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }
After applying Scalar Promotion (1 instrumented memory write):
    if (0 < num_elts) {
        double temp = moments[0];
        try {
            for (int i = 0; i < num_elts; i++) {
                temp += data[i];
            }
        } finally {
            moments[0] = temp;
        }
    }
2. SCALAR PROMOTION IN ATOMIC BLOCKS (cont.)
    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }
Key insight: no need to check whether moments[0] can change in other threads.
Still need to check that it cannot change locally or through method calls.
LOAD ELIMINATION AND SCALAR PROMOTION ADVANTAGES
These optimizations are sound for every STM protocol that guarantees transaction isolation.
Lazy-update protocols, like TL2, gain the most, since reads and writes are expensive:
A read looks up the value in the writeset before looking at the memory location.
A write adds to, or replaces a value in, the writeset.
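A sketch, with an invented writeset layout, of why these operations cost something in a lazy-update protocol:

    import java.util.HashMap;
    import java.util.Map;

    class LazyUpdateBarriers {
        // The writeset buffers this transaction's pending updates.
        private final Map<String, Object> writeSet = new HashMap<>();

        Object read(String location, Object currentMemoryValue) {
            // A read must first consult the writeset, because the transaction
            // may already have buffered a newer value for this location.
            Object buffered = writeSet.get(location);
            if (buffered != null) {
                return buffered;
            }
            return currentMemoryValue;   // otherwise read memory (validation omitted)
        }

        void write(String location, Object value) {
            // A write only adds to (or replaces an entry in) the writeset;
            // memory itself is updated at commit time.
            writeSet.put(location, value);
        }
    }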
Let’s improve it further…
3. REDUNDANT WRITESET LOOKUPS
Consider a transactional read: x = o.f;
If we know that we didn’t yet write to o.f in this
transaction…
… then we can skip looking in the writeset!
Analysis: discover redundant writeset lookups
using static analysis.
Use data flow analysis to simulate the readset at compile time.
Associate every abstract memory location with a tag saying whether this location was already written to or not.
Analyze only inside transaction boundaries.
Interprocedural, flow-sensitive, forward analysis.
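A sketch of the specialized read path the compiler can emit once the analysis proves the location was not yet written to (method names invented; readset bookkeeping and validation are omitted):

    class ReadBarrierVariants {
        private final java.util.Map<String, Object> writeSet = new java.util.HashMap<>();

        // Default read barrier: always checks the writeset first.
        Object readWithLookup(String location, Object memoryValue) {
            Object buffered = writeSet.get(location);
            return (buffered != null) ? buffered : memoryValue;
        }

        // Specialized barrier used when the analysis has proven that 'location'
        // has not been written to yet in this transaction: skip the lookup.
        Object readWithoutLookup(String location, Object memoryValue) {
            return memoryValue;
        }
    }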
4. REDUNDANT WRITESET RECORDKEEPING
Consider a transactional write: o.f = x;
If we know that we aren’t going to read o.f in this
transaction…
… then we can perform a cheaper writeset insert, e.g., by not updating the Bloom filter.
Analysis: discover redundant writeset
recordkeeping using static analysis.
Use data flow analysis to simulate the writeset at compile time.
Associate every abstract memory location with a tag saying whether this location is going to be read.
Analyze only inside transaction boundaries.
Interprocedural, flow-sensitive, backward analysis.
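A sketch of the cheaper insert, assuming a writeset that keeps a Bloom-filter-style summary for fast membership tests (all names invented):

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    class WriteSetWithFilter {
        private final Map<String, Object> entries = new HashMap<>();
        private final BitSet bloom = new BitSet(1024);   // summary for fast lookups

        // Default insert: record the value and update the filter, because a
        // later read in this transaction may need to find this entry.
        void put(String location, Object value) {
            entries.put(location, value);
            bloom.set(Math.floorMod(location.hashCode(), 1024));
        }

        // Cheaper insert used when the analysis proves the location will not be
        // read again inside the transaction: skip the filter update.
        void putWithoutRecordkeeping(String location, Object value) {
            entries.put(location, value);
        }

        boolean mightContain(String location) {
            return bloom.get(Math.floorMod(location.hashCode(), 1024));
        }
    }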
EXPERIMENTS
We created analyses and transformations for
these 4 optimizations.
Software used:
Deuce STM with TL2 protocol.
Soot Java Optimization Framework.
STAMP and microbenchmarks.
Hardware used:
Sun UltraSPARC T2 Plus with 2 CPUs × 8 cores ×
8 hardware threads.
RESULTS: LINKED LIST
Locating the position of the element in all three add(), remove() and contains() transactions involves many reads to locations not written to before.
(Linked list benchmark: 10% write operations, 20 seconds, 10K items, 26K possible range.)
ANALYSIS
Load Elimination had the largest impact (up to 29% speedup).
No example of Scalar Promotion was found (rare phenomenon or bad luck?).
ANALYSIS
In transactions that perform many reads before writes, skipping the writeset lookups increased throughput by up to 28%.
Even in transactions that don't read values after they are written, skipping the writeset recordkeeping gained no more than 4% speedup.
SUMMARY
We presented 4 STM compiler optimizations.
The optimizations are biased towards lazy-update STMs, but can be applied, with some changes, to in-place-update STMs.