LOWERING STM OVERHEAD WITH
STATIC ANALYSIS
Yehuda Afek, Guy Korland, Arie Zilberstein
Tel-Aviv University

LCPC 2010
OUTLINE
• Background on STM, TL2.
• STM overhead and common optimizations.
• New optimizations.
• Experimental results.
• Conclusion.
SOFTWARE TRANSACTIONAL MEMORY
• Aims to ease concurrent programming.
• Idea: enclose code in atomic blocks.
• Code inside an atomic block behaves as a transaction:
  • Atomic (executes altogether or not at all).
  • Consistent.
  • Isolated (not affected by other concurrent transactions).
SOFTWARE TRANSACTIONAL MEMORY
• Implementation:
  • STM compiler instruments every memory access inside atomic blocks.
  • STM library functions handle the synchronization according to a protocol.
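To make the instrumentation concrete, here is a minimal sketch of what the compiler-rewritten code looks like under a lazy-update protocol. All names here (Context, onReadAccess, onWriteAccess, the string-keyed "heap") are illustrative assumptions for this sketch, not Deuce's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical STM runtime context: every field access inside an atomic
// block is rewritten into a call on this object. Lazy-update: writes are
// buffered in a writeset and only reach shared memory at commit.
class Context {
    final Map<String, Object> writeSet = new HashMap<>();

    Object onReadAccess(Map<String, Object> heap, String addr) {
        // A lazy-update read must consult the writeset first.
        if (writeSet.containsKey(addr)) return writeSet.get(addr);
        return heap.get(addr);
    }

    void onWriteAccess(String addr, Object value) {
        // Writes are buffered until commit.
        writeSet.put(addr, value);
    }

    void commit(Map<String, Object> heap) {
        // A real protocol (e.g. TL2) would acquire versioned locks and
        // validate the readset here; this sketch just publishes the buffer.
        heap.putAll(writeSet);
        writeSet.clear();
    }
}

public class InstrumentationDemo {
    public static void main(String[] args) {
        Map<String, Object> heap = new HashMap<>();
        heap.put("o.f", 1);

        Context ctx = new Context();
        // Source inside an atomic block:  o.f = o.f + 41;
        // Compiled (instrumented) form:
        int v = (int) ctx.onReadAccess(heap, "o.f");
        ctx.onWriteAccess("o.f", v + 41);

        System.out.println(heap.get("o.f")); // prints 1: write still buffered
        ctx.commit(heap);
        System.out.println(heap.get("o.f")); // prints 42: published at commit
    }
}
```

Every transactional read and write pays for such a call, which is why the optimizations below focus on removing or cheapening them.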
TRANSACTIONAL LOCKING II
 TL2

is an influential STM protocol.
 Features:
 Lock-based.
 Word-based.
 Lazy-update.

 Achieves

synchronization through
versioned write-locks + global
version clock.
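A simplified sketch of TL2's read rule under the global version clock: a transaction records the clock value at start (rv), and a read of a word is consistent only if the word's versioned write-lock is free and its version did not change during the read and is not newer than rv. Class and field names are illustrative, and locking/commit details are omitted:

```java
// Sketch of TL2-style read validation (illustrative names, not a real
// implementation). Each word carries a versioned write-lock; a transaction
// samples the global clock at start into rv.
class TL2ReadSketch {

    static class VersionedWord {
        volatile long version;   // version of the last committed write
        volatile boolean locked; // versioned write-lock bit
        volatile int value;
    }

    static class AbortException extends RuntimeException {}

    static int transactionalRead(VersionedWord w, long rv) {
        long v1 = w.version;
        boolean lockedBefore = w.locked;
        int value = w.value;
        long v2 = w.version;
        // Abort if the word was locked, changed during the read,
        // or was written after this transaction started.
        if (lockedBefore || v1 != v2 || v1 > rv) throw new AbortException();
        return value;
    }

    public static void main(String[] args) {
        VersionedWord w = new VersionedWord();
        w.value = 7;
        w.version = 3;

        long rv = 5; // transaction started when the global clock read 5
        System.out.println(transactionalRead(w, rv)); // prints 7: version 3 <= rv

        w.version = 9; // a later transaction committed to this word
        try {
            transactionalRead(w, rv);
        } catch (AbortException e) {
            System.out.println("aborted"); // version 9 > rv
        }
    }
}
```

This post-validation of the version is what aborts zombie transactions quickly, without holding locks during reads.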
TRANSACTIONAL LOCKING II
• Advantages of TL2:
  • Locks are held for a short time.
  • Zombie transactions are quickly aborted.
  • Rollback is cheap.
STM OVERHEAD
• Instrumenting all transactional memory accesses induces a huge performance overhead.
• STM compiler optimizations reduce the overhead.
STM COMPILER OPTIMIZATIONS
• Common compiler optimizations:
  1. Avoiding instrumentation of accesses to immutable and transaction-local memory.
  2. Avoiding lock acquisitions and releases for thread-local memory.
  3. Avoiding readset population in read-only transactions.
NEW STM COMPILER OPTIMIZATIONS
• In this work:
  1. Reduce the number of instrumented memory reads using load elimination.
  2. Reduce the number of instrumented memory writes using scalar promotion.
  3. Avoid writeset lookups for memory not yet written to.
  4. Avoid writeset recordkeeping for memory that will not be read.
LOAD ELIMINATION IN ATOMIC BLOCKS. 1

    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j]
                                + feature[i][j];
    }

→ 5 instrumented memory reads per loop iteration.

After Lazy Code Motion:

    if (0 < nfeatures) {
        nci = new_centers[index];
        fi = feature[i];
        for (j = 0; j < nfeatures; j++) {
            nci[j] = nci[j] + fi[j];
        }
    }

→ 2 instrumented memory reads per loop iteration.
LOAD ELIMINATION IN ATOMIC BLOCKS. 1

    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j]
                                + feature[i][j];
    }

• Key insight:
  • No need to check if new_centers[index] can change in other threads.
  • Still need to check that it cannot change locally or through method calls.
SCALAR PROMOTION IN ATOMIC BLOCKS. 2

    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }

→ num_elts instrumented memory writes.

After Scalar Promotion:

    if (0 < num_elts) {
        double temp = moments[0];
        try {
            for (int i = 0; i < num_elts; i++) {
                temp += data[i];
            }
        } finally {
            moments[0] = temp;
        }
    }

→ 1 instrumented memory write.
SCALAR PROMOTION IN ATOMIC BLOCKS. 2

    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }

• Key insight (same as before):
  • No need to check if moments[0] can change in other threads.
  • Still need to check that it cannot change locally or through method calls.
LOAD ELIMINATION AND SCALAR PROMOTION ADVANTAGES
• These optimizations are sound for every STM protocol that guarantees transaction isolation.
• Lazy-update protocols, like TL2, gain the most, since reads and writes are expensive:
  • A read looks up the value in the writeset before looking at the memory location.
  • A write adds to, or replaces, a value in the writeset.
• Let's improve it further…
REDUNDANT WRITESET LOOKUPS. 3
• Consider a transactional read: x = o.f;
• If we know that we didn't yet write to o.f in this transaction…
  • … then we can skip looking in the writeset!
• Analysis: discover redundant writeset lookups using static analysis.
  • Use data-flow analysis to simulate the readset at compile time.
  • Associate every abstract memory location with a tag saying whether this location was already written to or not.
  • Analyze only inside transaction boundaries.
  • Interprocedural, flow-sensitive, forward analysis.
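The saving can be sketched as two read paths the compiler chooses between. The optimized path is only emitted where the forward analysis proved no prior transactional write to the location on any path reaching the read. Names here are illustrative, not Deuce's API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of optimization 3: skip the writeset lookup for reads of
// locations that were provably not yet written in this transaction.
class WriteSetLookupSketch {
    final Map<String, Object> writeSet = new HashMap<>();
    int lookups = 0; // counts writeset lookups, to make the saving visible

    // Regular instrumented read: must consult the writeset first.
    Object readSlow(Map<String, Object> heap, String addr) {
        lookups++;
        Object buffered = writeSet.get(addr);
        return (buffered != null) ? buffered : heap.get(addr);
    }

    // Optimized read: sound only where static analysis proved that no
    // transactional write to addr can precede this read.
    Object readFast(Map<String, Object> heap, String addr) {
        return heap.get(addr);
    }

    public static void main(String[] args) {
        Map<String, Object> heap = new HashMap<>();
        heap.put("o.f", 10);
        WriteSetLookupSketch tx = new WriteSetLookupSketch();

        // First read of o.f in the transaction: provably not yet written.
        System.out.println(tx.readFast(heap, "o.f")); // prints 10, no lookup

        tx.writeSet.put("o.f", 11); // a transactional write to o.f
        // After a possible write, the slow path is required for correctness.
        System.out.println(tx.readSlow(heap, "o.f")); // prints 11
        System.out.println(tx.lookups);               // prints 1
    }
}
```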
REDUNDANT WRITESET RECORDKEEPING. 4
• Consider a transactional write: o.f = x;
• If we know that we aren't going to read o.f in this transaction…
  • … then we can perform a cheaper writeset insert.
  • E.g., by not updating the Bloom filter.
• Analysis: discover redundant writeset recordkeeping using static analysis.
  • Use data-flow analysis to simulate the writeset at compile time.
  • Associate every abstract memory location with a tag saying whether this location is going to be read.
  • Analyze only inside transaction boundaries.
  • Interprocedural, flow-sensitive, backward analysis.
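A sketch of the cheaper insert: the writeset keeps a Bloom filter so later reads can quickly test "was this address possibly written?". When the backward analysis proves the written location is never read again in the transaction, the filter update can be skipped; commit still sees the entry. The tiny 64-bit filter and all names are illustrative assumptions of this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of optimization 4: skip Bloom-filter recordkeeping for writes to
// locations that will provably not be read later in the transaction.
class WriteSetRecordSketch {
    static final class Entry {
        final String addr;
        final Object value;
        Entry(String addr, Object value) { this.addr = addr; this.value = value; }
    }

    final List<Entry> entries = new ArrayList<>();
    long bloom = 0L; // toy 64-bit Bloom filter over written addresses

    // Full insert: records the address in the filter for later read lookups.
    void addFull(String addr, Object value) {
        bloom |= 1L << (addr.hashCode() & 63);
        entries.add(new Entry(addr, value));
    }

    // Cheaper insert: sound only where the backward analysis proved this
    // location is not read later in the transaction.
    void addNoRecord(String addr, Object value) {
        entries.add(new Entry(addr, value));
    }

    boolean mayContain(String addr) {
        return (bloom & (1L << (addr.hashCode() & 63))) != 0;
    }

    public static void main(String[] args) {
        WriteSetRecordSketch ws = new WriteSetRecordSketch();
        ws.addFull("o.f", 1);     // may be read later: full recordkeeping
        ws.addNoRecord("o.g", 2); // proven never read later: cheap insert

        System.out.println(ws.mayContain("o.f")); // prints true
        System.out.println(ws.entries.size());    // prints 2: commit sees both
    }
}
```

The commit path is unaffected either way, which is consistent with the modest speedups reported for this optimization below.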
EXPERIMENTS
• We created analyses and transformations for these 4 optimizations.
• Software used:
  • Deuce STM with the TL2 protocol.
  • Soot Java Optimization Framework.
  • STAMP and microbenchmarks.
• Hardware used:
  • Sun UltraSPARC T2 Plus with 2 CPUs × 8 cores × 8 hardware threads.
READING THE RESULTS
[Chart: each benchmark is shown in five configurations: Unoptimized; + Immutable, + Transaction-Local, + Thread-Local; + Load Elimination; + Redundant Writeset Lookups; + Redundant Writeset Recordkeeping.]

Benchmark arguments: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input
RESULTS: K-MEANS
Load Elimination inside tight loops (e.g., new_centers[index] from the example).

Benchmark arguments: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input
RESULTS: LINKED LIST
Locating the
position of the
element in all
three add(),
remove() and
contains()
transactions
involves many
reads to
locations not
written to
before.

write operations, 20 seconds, 10K items, 26K possible range 10%
RESULTS: SSCA2
Many small transactions that update single shared values, and don't read them thereafter.

Benchmark arguments: -s 18 -i1.0 -u1.0 -l3 -p3
ANALYSIS
• Load Elimination had the largest impact (up to 29% speedup).
• No example of Scalar Promotion was found (rare phenomenon or bad luck?).
ANALYSIS
• In transactions that perform many reads before writes, skipping the writeset lookups increased throughput by up to 28%.
• Even in transactions that don't read values after they are written, skipping the writeset recordkeeping gained no more than a 4% speedup.
SUMMARY
• We presented 4 STM compiler optimizations.
• The optimizations are biased towards lazy-update STMs, but can be applied, with some changes, to in-place-update STMs.
Q&A
• Thank you!
