LOWERING STM OVERHEAD WITH
STATIC ANALYSIS
Yehuda Afek, Guy Korland, Arie Zilberstein
Tel-Aviv University

LCPC 2010
OUTLINE
• Background on STM, TL2.
• STM overhead and common optimizations.
• New optimizations.
• Experimental results.
• Conclusion.
SOFTWARE TRANSACTIONAL MEMORY
• Aims to ease concurrent programming.
• Idea: enclose code in atomic blocks.
• Code inside an atomic block behaves as a transaction:
  • Atomic (executes altogether or not at all).
  • Consistent.
  • Isolated (not affected by other concurrent transactions).
SOFTWARE TRANSACTIONAL MEMORY
• Implementation:
  • STM compiler instruments every memory access inside atomic blocks.
  • STM library functions handle the synchronization according to a protocol.
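To make the instrumentation concrete, here is a minimal sketch of what the compiler-rewritten code looks like under a lazy-update protocol. All names here (Context, onReadAccess, onWriteAccess, the string-keyed "heap") are illustrative assumptions for this sketch, not Deuce's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical STM runtime context: every field access inside an atomic
// block is rewritten into a call on this object. Lazy-update: writes are
// buffered in a writeset and only reach shared memory at commit.
class Context {
    final Map<String, Object> writeSet = new HashMap<>();

    Object onReadAccess(Map<String, Object> heap, String addr) {
        // A lazy-update read must consult the writeset first.
        if (writeSet.containsKey(addr)) return writeSet.get(addr);
        return heap.get(addr);
    }

    void onWriteAccess(String addr, Object value) {
        // Writes are buffered until commit.
        writeSet.put(addr, value);
    }

    void commit(Map<String, Object> heap) {
        // A real protocol (e.g. TL2) would acquire versioned locks and
        // validate the readset here; this sketch just publishes the buffer.
        heap.putAll(writeSet);
        writeSet.clear();
    }
}

public class InstrumentationDemo {
    public static void main(String[] args) {
        Map<String, Object> heap = new HashMap<>();
        heap.put("o.f", 1);

        Context ctx = new Context();
        // Source inside an atomic block:  o.f = o.f + 41;
        // Compiled (instrumented) form:
        int v = (int) ctx.onReadAccess(heap, "o.f");
        ctx.onWriteAccess("o.f", v + 41);

        System.out.println(heap.get("o.f")); // prints 1: write still buffered
        ctx.commit(heap);
        System.out.println(heap.get("o.f")); // prints 42: published at commit
    }
}
```

Every transactional read and write pays for such a call, which is why the optimizations below focus on removing or cheapening them.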
TRANSACTIONAL LOCKING II
 TL2

is an influential STM protocol.
 Features:
 Lock-based.
 Word-based.
 Lazy-update.

 Achieves

synchronization through
versioned write-locks + global
version clock.
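A simplified sketch of TL2's read rule under the global version clock: a transaction records the clock value at start (rv), and a read of a word is consistent only if the word's versioned write-lock is free and its version did not change during the read and is not newer than rv. Class and field names are illustrative, and locking/commit details are omitted:

```java
// Sketch of TL2-style read validation (illustrative names, not a real
// implementation). Each word carries a versioned write-lock; a transaction
// samples the global clock at start into rv.
class TL2ReadSketch {

    static class VersionedWord {
        volatile long version;   // version of the last committed write
        volatile boolean locked; // versioned write-lock bit
        volatile int value;
    }

    static class AbortException extends RuntimeException {}

    static int transactionalRead(VersionedWord w, long rv) {
        long v1 = w.version;
        boolean lockedBefore = w.locked;
        int value = w.value;
        long v2 = w.version;
        // Abort if the word was locked, changed during the read,
        // or was written after this transaction started.
        if (lockedBefore || v1 != v2 || v1 > rv) throw new AbortException();
        return value;
    }

    public static void main(String[] args) {
        VersionedWord w = new VersionedWord();
        w.value = 7;
        w.version = 3;

        long rv = 5; // transaction started when the global clock read 5
        System.out.println(transactionalRead(w, rv)); // prints 7: version 3 <= rv

        w.version = 9; // a later transaction committed to this word
        try {
            transactionalRead(w, rv);
        } catch (AbortException e) {
            System.out.println("aborted"); // version 9 > rv
        }
    }
}
```

This post-validation of the version is what aborts zombie transactions quickly, without holding locks during reads.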
TRANSACTIONAL LOCKING II
• Advantages of TL2:
  • Locks are held for a short time.
  • Zombie transactions are quickly aborted.
  • Rollback is cheap.
STM OVERHEAD
• Instrumenting all transactional memory accesses induces a huge performance overhead.
• STM compiler optimizations reduce the overhead.
STM COMPILER OPTIMIZATIONS
• Common compiler optimizations:
  1. Avoiding instrumentation of accesses to immutable and transaction-local memory.
  2. Avoiding lock acquisitions and releases for thread-local memory.
  3. Avoiding readset population in read-only transactions.
NEW STM COMPILER OPTIMIZATIONS
• In this work:
  1. Reduce the number of instrumented memory reads using load elimination.
  2. Reduce the number of instrumented memory writes using scalar promotion.
  3. Avoid writeset lookups for memory not yet written to.
  4. Avoid writeset recordkeeping for memory that will not be read.
LOAD ELIMINATION IN ATOMIC BLOCKS. 1

    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j]
                                + feature[i][j];
    }

→ 5 instrumented memory reads per loop iteration.

After Lazy Code Motion:

    if (0 < nfeatures) {
        nci = new_centers[index];
        fi = feature[i];
        for (j = 0; j < nfeatures; j++) {
            nci[j] = nci[j] + fi[j];
        }
    }

→ 2 instrumented memory reads per loop iteration.
LOAD ELIMINATION IN ATOMIC BLOCKS. 1

    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j]
                                + feature[i][j];
    }

• Key insight:
  • No need to check if new_centers[index] can change in other threads.
  • Still need to check that it cannot change locally or through method calls.
SCALAR PROMOTION IN ATOMIC BLOCKS. 2

    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }

→ num_elts instrumented memory writes.

After Scalar Promotion:

    if (0 < num_elts) {
        double temp = moments[0];
        try {
            for (int i = 0; i < num_elts; i++) {
                temp += data[i];
            }
        } finally {
            moments[0] = temp;
        }
    }

→ 1 instrumented memory write.
SCALAR PROMOTION IN ATOMIC BLOCKS. 2

    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }

• Key insight (same as before):
  • No need to check if moments[0] can change in other threads.
  • Still need to check that it cannot change locally or through method calls.
LOAD ELIMINATION AND SCALAR PROMOTION ADVANTAGES
• These optimizations are sound for every STM protocol that guarantees transaction isolation.
• Lazy-update protocols, like TL2, gain the most, since reads and writes are expensive:
  • A read looks up the value in the writeset before looking at the memory location.
  • A write adds to, or replaces, a value in the writeset.
• Let's improve it further…
REDUNDANT WRITESET LOOKUPS. 3
• Consider a transactional read: x = o.f;
• If we know that we didn't yet write to o.f in this transaction…
  • … then we can skip looking in the writeset!
• Analysis: discover redundant writeset lookups using static analysis.
  • Use data-flow analysis to simulate the readset at compile time.
  • Associate every abstract memory location with a tag saying whether this location was already written to or not.
  • Analyze only inside transaction boundaries.
  • Interprocedural, flow-sensitive, forward analysis.
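The saving can be sketched as two read paths the compiler chooses between. The optimized path is only emitted where the forward analysis proved no prior transactional write to the location on any path reaching the read. Names here are illustrative, not Deuce's API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of optimization 3: skip the writeset lookup for reads of
// locations that were provably not yet written in this transaction.
class WriteSetLookupSketch {
    final Map<String, Object> writeSet = new HashMap<>();
    int lookups = 0; // counts writeset lookups, to make the saving visible

    // Regular instrumented read: must consult the writeset first.
    Object readSlow(Map<String, Object> heap, String addr) {
        lookups++;
        Object buffered = writeSet.get(addr);
        return (buffered != null) ? buffered : heap.get(addr);
    }

    // Optimized read: sound only where static analysis proved that no
    // transactional write to addr can precede this read.
    Object readFast(Map<String, Object> heap, String addr) {
        return heap.get(addr);
    }

    public static void main(String[] args) {
        Map<String, Object> heap = new HashMap<>();
        heap.put("o.f", 10);
        WriteSetLookupSketch tx = new WriteSetLookupSketch();

        // First read of o.f in the transaction: provably not yet written.
        System.out.println(tx.readFast(heap, "o.f")); // prints 10, no lookup

        tx.writeSet.put("o.f", 11); // a transactional write to o.f
        // After a possible write, the slow path is required for correctness.
        System.out.println(tx.readSlow(heap, "o.f")); // prints 11
        System.out.println(tx.lookups);               // prints 1
    }
}
```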
REDUNDANT WRITESET RECORDKEEPING. 4
• Consider a transactional write: o.f = x;
• If we know that we aren't going to read o.f in this transaction…
  • … then we can perform a cheaper writeset insert.
  • E.g., by not updating the Bloom filter.
• Analysis: discover redundant writeset recordkeeping using static analysis.
  • Use data-flow analysis to simulate the writeset at compile time.
  • Associate every abstract memory location with a tag saying whether this location is going to be read.
  • Analyze only inside transaction boundaries.
  • Interprocedural, flow-sensitive, backward analysis.
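A sketch of the cheaper insert: the writeset keeps a Bloom filter so later reads can quickly test "was this address possibly written?". When the backward analysis proves the written location is never read again in the transaction, the filter update can be skipped; commit still sees the entry. The tiny 64-bit filter and all names are illustrative assumptions of this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of optimization 4: skip Bloom-filter recordkeeping for writes to
// locations that will provably not be read later in the transaction.
class WriteSetRecordSketch {
    static final class Entry {
        final String addr;
        final Object value;
        Entry(String addr, Object value) { this.addr = addr; this.value = value; }
    }

    final List<Entry> entries = new ArrayList<>();
    long bloom = 0L; // toy 64-bit Bloom filter over written addresses

    // Full insert: records the address in the filter for later read lookups.
    void addFull(String addr, Object value) {
        bloom |= 1L << (addr.hashCode() & 63);
        entries.add(new Entry(addr, value));
    }

    // Cheaper insert: sound only where the backward analysis proved this
    // location is not read later in the transaction.
    void addNoRecord(String addr, Object value) {
        entries.add(new Entry(addr, value));
    }

    boolean mayContain(String addr) {
        return (bloom & (1L << (addr.hashCode() & 63))) != 0;
    }

    public static void main(String[] args) {
        WriteSetRecordSketch ws = new WriteSetRecordSketch();
        ws.addFull("o.f", 1);     // may be read later: full recordkeeping
        ws.addNoRecord("o.g", 2); // proven never read later: cheap insert

        System.out.println(ws.mayContain("o.f")); // prints true
        System.out.println(ws.entries.size());    // prints 2: commit sees both
    }
}
```

The commit path is unaffected either way, which is consistent with the modest speedups reported for this optimization below.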
EXPERIMENTS
• We created analyses and transformations for these 4 optimizations.
• Software used:
  • Deuce STM with the TL2 protocol.
  • Soot Java Optimization Framework.
  • STAMP and microbenchmarks.
• Hardware used:
  • Sun UltraSPARC T2 Plus with 2 CPUs × 8 cores × 8 hardware threads.
READING THE RESULTS
[Chart: each benchmark is shown in five configurations: Unoptimized; + Immutable, + Transaction-Local, + Thread-Local; + Load Elimination; + Redundant Writeset Lookups; + Redundant Writeset Recordkeeping.]

Benchmark arguments: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input
RESULTS: K-MEANS
Load Elimination inside tight loops (e.g., new_centers[index] from the example).

Benchmark arguments: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input
RESULTS: LINKED LIST
Locating the
position of the
element in all
three add(),
remove() and
contains()
transactions
involves many
reads to
locations not
written to
before.

write operations, 20 seconds, 10K items, 26K possible range 10%
RESULTS: SSCA2
Many small transactions that update single shared values, and don't read them thereafter.

Benchmark arguments: -s 18 -i1.0 -u1.0 -l3 -p3
ANALYSIS
• Load Elimination had the largest impact (up to 29% speedup).
• No example of Scalar Promotion was found (rare phenomenon or bad luck?).
ANALYSIS
• In transactions that perform many reads before writes, skipping the writeset lookups increased throughput by up to 28%.
• Even in transactions that don't read values after they are written, skipping the writeset recordkeeping gained no more than a 4% speedup.
SUMMARY
• We presented 4 STM compiler optimizations.
• The optimizations are biased towards lazy-update STMs, but can be applied, with some changes, to in-place-update STMs.
Q&A
• Thank you!
