Transcript of "Lowering STM Overhead with Static Analysis"

  1. LOWERING STM OVERHEAD WITH STATIC ANALYSIS
     Yehuda Afek, Guy Korland, Arie Zilberstein
     Tel-Aviv University
     LCPC 2010
  2. OUTLINE
     - Background on STM, TL2.
     - STM overhead and common optimizations.
     - New optimizations.
     - Experimental results.
     - Conclusion.
  3. SOFTWARE TRANSACTIONAL MEMORY
     - Aims to ease concurrent programming.
     - Idea: enclose code in atomic blocks.
     - Code inside an atomic block behaves as a transaction:
       - Atomic (executes altogether or not at all).
       - Consistent.
       - Isolated (not affected by other concurrent transactions).
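
     A minimal sketch of such an atomic block in Java, assuming the @Atomic
     method annotation of Deuce STM (the STM used in the experiments later in
     the talk); the Counter class and its field are illustrative only.

         import org.deuce.Atomic;

         public class Counter {
             private int value;   // shared field, accessed only inside atomic methods

             // The whole method body runs as one transaction: the increment
             // either takes effect atomically or the transaction aborts and retries.
             @Atomic
             public void increment() {
                 value = value + 1;
             }

             @Atomic
             public int get() {
                 return value;
             }
         }
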
  4. SOFTWARE TRANSACTIONAL MEMORY
     - Implementation:
       - STM compiler instruments every memory access inside atomic blocks.
       - STM library functions handle the synchronization according to a protocol.
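
     A rough, self-contained sketch of what that instrumentation amounts to.
     The Context class and the txReadInt/txWriteInt helpers below are made-up
     stand-ins for the STM library's real entry points (not Deuce's actual
     API); the point is only that every read and write is routed through the
     library.

         // Source inside an atomic block:  account.balance = account.balance + amount;
         public class InstrumentationSketch {
             static class Account { int balance; }
             static class Context { /* per-transaction state lives here in a real STM */ }

             // After instrumentation, the same statement becomes two library calls,
             // letting the STM record the access in its read/write sets.
             void depositInstrumented(Context ctx, Account account, int amount) {
                 int balance = txReadInt(ctx, account, "balance");       // instrumented read
                 txWriteInt(ctx, account, "balance", balance + amount);  // instrumented write
             }

             // Trivial stubs so the sketch compiles; a real STM library supplies these.
             int txReadInt(Context ctx, Object o, String field) { return ((Account) o).balance; }
             void txWriteInt(Context ctx, Object o, String field, int v) { ((Account) o).balance = v; }
         }
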
  5. TRANSACTIONAL LOCKING II
     - TL2 is an influential STM protocol.
     - Features:
       - Lock-based.
       - Word-based.
       - Lazy-update.
     - Achieves synchronization through versioned write-locks + global version clock.
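
     A highly simplified sketch of how a versioned write-lock and the global
     version clock interact on the read path, for a single shared word. Names
     are illustrative, and the commit path (locking the writeset, validating
     the readset, writing back, releasing locks with a new version) is omitted.

         import java.util.concurrent.atomic.AtomicLong;

         public class Tl2ReadSketch {
             // Global version clock, shared by all transactions; a committing
             // writer increments it (commit path omitted in this sketch).
             static final AtomicLong GLOBAL_CLOCK = new AtomicLong(0);

             // Versioned write-lock guarding one shared word: lowest bit = lock bit,
             // remaining bits = version of the last committed write.
             private final AtomicLong versionedLock = new AtomicLong(0);
             private long word;                    // the shared data word

             // A transaction samples the clock once when it begins.
             public static long begin() { return GLOBAL_CLOCK.get(); }

             // Transactional read with post-validation: succeeds only if the word is
             // unlocked and was not written after the transaction began (version <= rv).
             public long transactionalRead(long rv) {
                 long v1 = versionedLock.get();
                 long value = word;
                 long v2 = versionedLock.get();
                 boolean locked = (v1 & 1L) != 0;
                 long version = v1 >>> 1;
                 if (locked || v1 != v2 || version > rv) {
                     throw new IllegalStateException("abort: inconsistent read"); // caller retries
                 }
                 return value;
             }
         }
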
  6. TRANSACTIONAL LOCKING II
     - Advantages of TL2:
       - Locks are held for a short time.
       - Zombie transactions are quickly aborted.
       - Rollback is cheap.
  7. STM OVERHEAD
     - Instrumenting all transactional memory accesses induces a huge performance overhead.
     - STM compiler optimizations reduce the overhead.
  8. STM COMPILER OPTIMIZATIONS
     - Common compiler optimizations:
       1. Avoiding instrumentation of accesses to immutable and transaction-local memory.
       2. Avoiding lock acquisition and release for thread-local memory.
       3. Avoiding readset population in read-only transactions.
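
     An illustrative (made-up) example of the first kind of access: reads of
     immutable fields and accesses to memory that never escapes the transaction
     can be left completely uninstrumented.

         import org.deuce.Atomic;

         public class CommonOptimizationExamples {
             static class Point {
                 final int x, y;                   // immutable once constructed
                 Point(int x, int y) { this.x = x; this.y = y; }
             }

             @Atomic
             public int example(Point p) {
                 int sum = p.x + p.y;              // immutable fields: no instrumentation needed
                 int[] scratch = new int[4];       // allocated inside the transaction and never
                 scratch[0] = sum * 2;             //   escapes it: transaction-local memory, so
                 return scratch[0];                //   these accesses need no instrumentation either
             }
         }
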
  9. NEW STM COMPILER OPTIMIZATIONS
     - In this work:
       1. Reduce the number of instrumented memory reads using load elimination.
       2. Reduce the number of instrumented memory writes using scalar promotion.
       3. Avoid writeset lookups for memory not yet written to.
       4. Avoid writeset recordkeeping for memory that will not be read.
  10. LOAD ELIMINATION IN ATOMIC BLOCKS (OPTIMIZATION 1)
      Before (5 instrumented memory reads per loop iteration):

          for (int j = 0; j < nfeatures; j++) {
              new_centers[index][j] = new_centers[index][j] + feature[i][j];
          }

      After Lazy Code Motion (2 instrumented memory reads per loop iteration):

          if (0 < nfeatures) {
              nci = new_centers[index];
              fi = feature[i];
              for (j = 0; j < nfeatures; j++) {
                  nci[j] = nci[j] + fi[j];
              }
          }
  11. LOAD ELIMINATION IN ATOMIC BLOCKS (OPTIMIZATION 1)

          for (int j = 0; j < nfeatures; j++) {
              new_centers[index][j] = new_centers[index][j] + feature[i][j];
          }

      - Key insight:
        - No need to check whether new_centers[index] can change in other threads (the transaction is isolated).
        - Still need to check that it cannot change locally or through method calls.
  12. SCALAR PROMOTION IN ATOMIC BLOCKS (OPTIMIZATION 2)
      Before (num_elts instrumented memory writes):

          for (int i = 0; i < num_elts; i++) {
              moments[0] += data[i];
          }

      After Scalar Promotion (1 instrumented memory write):

          if (0 < num_elts) {
              double temp = moments[0];
              try {
                  for (int i = 0; i < num_elts; i++) {
                      temp += data[i];
                  }
              } finally {
                  moments[0] = temp;
              }
          }
  13. SCALAR PROMOTION IN ATOMIC BLOCKS (OPTIMIZATION 2)

          for (int i = 0; i < num_elts; i++) {
              moments[0] += data[i];
          }

      - Key insight (same as for load elimination):
        - No need to check whether moments[0] can change in other threads.
        - Still need to check that it cannot change locally or through method calls.
  14. LOAD ELIMINATION AND SCALAR PROMOTION ADVANTAGES
      - These optimizations are sound for every STM protocol that guarantees transaction isolation.
      - Lazy-update protocols, like TL2, gain the most, since reads and writes are expensive:
        - A read looks up the value in the writeset before looking at the memory location.
        - A write adds a value to, or replaces a value in, the writeset.
      - Let's improve it further...
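
      A self-contained sketch of the lazy-update read and write paths just
      described: writes are buffered in a per-transaction log together with a
      small Bloom-filter-style summary, and every read must first consult that
      log before falling back to shared memory. The class below is illustrative
      only, not the actual writeset of Deuce or TL2.

          import java.util.HashMap;
          import java.util.Map;

          // Sketch of a lazy-update transaction's writeset with a 64-bit
          // Bloom-filter-like summary that makes the common "not written by
          // this transaction" case cheap to detect.
          public class WriteSetSketch {
              private final Map<Long, Object> log = new HashMap<>(); // address -> buffered value
              private long bloom;                                    // summary of written addresses

              private static long hashBit(long address) {
                  return 1L << (int) (address & 63);                 // single hash bit, for brevity
              }

              // Transactional write: buffer the value; it reaches memory only at commit.
              public void onWrite(long address, Object value) {
                  bloom |= hashBit(address);                         // recordkeeping for later reads
                  log.put(address, value);
              }

              // Transactional read: must return the value this transaction already wrote,
              // if any, so it consults the writeset before using the in-memory value.
              public Object onRead(long address, Object valueInMemory) {
                  if ((bloom & hashBit(address)) != 0 && log.containsKey(address)) {
                      return log.get(address);                       // read our own earlier write
                  }
                  return valueInMemory;                              // fall back to shared memory
              }
          }
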
  15. REDUNDANT WRITESET LOOKUPS (OPTIMIZATION 3)
      - Consider a transactional read: x = o.f;
      - If we know that we didn't yet write to o.f in this transaction...
      - ... then we can skip looking in the writeset!
      - Analysis: discover redundant writeset lookups using static analysis.
        - Use data-flow analysis to simulate the writeset at compile time.
        - Associate every abstract memory location with a tag saying whether this location was already written to or not.
        - Analyze only inside transaction boundaries.
        - Interprocedural, flow-sensitive, forward analysis.
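
      In terms of the writeset sketch above, this optimization lets the
      compiler emit a cheaper read at sites where the forward analysis proved
      no earlier write to the location; the specialized entry point below is
      hypothetical.

          // Hypothetical extra entry point for the WriteSetSketch class above, used only
          // at read sites where static analysis proved "no prior write to this location
          // in this transaction": both the Bloom-filter test and the log lookup are skipped.
          public Object onReadNoPriorWrite(long address, Object valueInMemory) {
              // Protocol-level consistency checks (e.g. TL2 version validation) still
              // happen as usual; only the writeset lookup is elided.
              return valueInMemory;
          }
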
  16. REDUNDANT WRITESET RECORDKEEPING (OPTIMIZATION 4)
      - Consider a transactional write: o.f = x;
      - If we know that we aren't going to read o.f in this transaction...
      - ... then we can perform a cheaper writeset insert, e.g. by not updating the Bloom filter.
      - Analysis: discover redundant writeset recordkeeping using static analysis.
        - Use data-flow analysis to simulate the readset at compile time.
        - Associate every abstract memory location with a tag saying whether this location is going to be read.
        - Analyze only inside transaction boundaries.
        - Interprocedural, flow-sensitive, backward analysis.
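
      Similarly, a hypothetical cheaper write entry point for the WriteSetSketch
      above, for write sites where the backward analysis proved the location is
      never read again inside the transaction: the value must still be buffered
      for commit, but the Bloom-filter recordkeeping, which only serves later
      reads, is skipped.

          // Hypothetical extra entry point for the WriteSetSketch class above, used only
          // at write sites where static analysis proved "this location is not read later
          // in this transaction".
          public void onWriteNeverReadBack(long address, Object value) {
              // No "bloom |= hashBit(address)" here: that summary exists only so later
              // reads can find the buffered value, and the analysis proved there are none.
              log.put(address, value);   // the buffered value is still written back at commit
          }
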
  17. EXPERIMENTS
      - We created analyses and transformations for these 4 optimizations.
      - Software used:
        - Deuce STM with TL2 protocol.
        - Soot Java Optimization Framework.
        - STAMP and microbenchmarks.
      - Hardware used:
        - Sun UltraSPARC T2 Plus with 2 CPUs × 8 cores × 8 hardware threads.
  18. READING THE RESULTS
      [Results-chart legend: the compared configurations are Unoptimized; + Immutable, + Transaction Local, + ThreadLocal; + Load Elimination; + Redundant Writeset Lookups; + Redundant Writeset Recordkeeping.
      Benchmark parameters: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input]
  19. RESULTS: K-MEANS
      - Load Elimination inside tight loops (e.g., new_centers[index] from the example).
      [Chart; benchmark parameters: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input]
  20. RESULTS: LINKED LIST
      - Locating the position of the element in all three add(), remove() and contains() transactions involves many reads to locations not written to before.
      [Chart; benchmark parameters: 10% write operations, 20 seconds, 10K items, 26K possible range]
  21. RESULTS: SSCA2
      - Many small transactions that update single shared values, and don't read them thereafter.
      [Chart; benchmark parameters: -s 18 -i1.0 -u1.0 -l3 -p3]
  22. ANALYSIS
      - Load Elimination had the largest impact (up to 29% speedup).
      - No example of Scalar Promotion was found (rare phenomenon, or bad luck?).
  23. ANALYSIS
      - In transactions that perform many reads before writes, skipping the writeset lookups increased throughput by up to 28%.
      - Even in transactions that don't read values after they are written, skipping the writeset recordkeeping gained no more than a 4% speedup.
  24. SUMMARY
      - We presented 4 STM compiler optimizations.
      - The optimizations are biased towards lazy-update STMs, but can be applied, with some changes, to in-place-update STMs.
  25. Q&A
      - Thank you!