Lowering STM Overhead with Static Analysis
Instrumenting all transactional memory accesses induces a huge performance overhead.

STM compiler optimizations reduce the overhead

Presentation Transcript

  • LOWERING STM OVERHEAD WITH STATIC ANALYSIS
    Yehuda Afek, Guy Korland, Arie Zilberstein
    Tel-Aviv University
    LCPC 2010
  • OUTLINE
    • Background on STM and TL2.
    • STM overhead and common optimizations.
    • New optimizations.
    • Experimental results.
    • Conclusion.
  • SOFTWARE TRANSACTIONAL MEMORY
    • Aims to ease concurrent programming.
    • Idea: enclose code in atomic blocks.
    • Code inside an atomic block behaves as a transaction:
        • Atomic (executes entirely or not at all).
        • Consistent.
        • Isolated (not affected by other concurrent transactions).
  • SOFTWARE TRANSACTIONAL MEMORY
    • Implementation:
        • The STM compiler instruments every memory access inside atomic blocks.
        • STM library functions handle the synchronization according to a protocol.
  • TRANSACTIONAL LOCKING II
    • TL2 is an influential STM protocol.
    • Features:
        • Lock-based.
        • Word-based.
        • Lazy-update.
    • Achieves synchronization through versioned write-locks and a global version clock.
  • TRANSACTIONAL LOCKING II
    • Advantages of TL2:
        • Locks are held for a short time.
        • Zombie transactions are quickly aborted.
        • Rollback is cheap.
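
The TL2 ingredients listed on the two slides above (lazy update, versioned write-locks, a global version clock) can be illustrated with a small sketch. This is a simplified, hedged illustration and not the code of TL2 or of any STM library: the names Tl2Sketch, Cell, Tx and AbortException are hypothetical, every shared word gets its own lock, and contention handling is reduced to aborting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified TL2-style sketch (not a real library API).
class Tl2Sketch {
    static final AtomicLong GLOBAL_CLOCK = new AtomicLong();

    /** One shared word guarded by a versioned write-lock: (version << 1) | lockBit. */
    static final class Cell {
        final AtomicLong versionedLock = new AtomicLong();
        volatile int value;
        Cell(int initial) { value = initial; }
    }

    static final class AbortException extends RuntimeException {}

    static final class Tx {
        final long readVersion = GLOBAL_CLOCK.get();      // sampled at transaction start
        final List<Cell> readSet = new ArrayList<>();
        final Map<Cell, Integer> writeSet = new LinkedHashMap<>();

        int read(Cell c) {
            Integer pending = writeSet.get(c);            // lazy update: check own writes first
            if (pending != null) return pending;
            long before = c.versionedLock.get();
            int v = c.value;
            long after = c.versionedLock.get();
            // Abort early if the cell is locked, changed mid-read, or is newer than our snapshot.
            if ((before & 1) != 0 || before != after || (before >>> 1) > readVersion) {
                throw new AbortException();
            }
            readSet.add(c);
            return v;
        }

        void write(Cell c, int v) { writeSet.put(c, v); } // buffered until commit

        void commit() {
            List<Cell> locked = new ArrayList<>();
            try {
                for (Cell c : writeSet.keySet()) {        // locks are held only during commit
                    long cur = c.versionedLock.get();
                    if ((cur & 1) != 0 || !c.versionedLock.compareAndSet(cur, cur | 1)) {
                        throw new AbortException();
                    }
                    locked.add(c);
                }
                long writeVersion = GLOBAL_CLOCK.incrementAndGet();
                for (Cell c : readSet) {                  // validate the readset
                    long cur = c.versionedLock.get();
                    boolean lockedByOther = (cur & 1) != 0 && !writeSet.containsKey(c);
                    if (lockedByOther || (cur >>> 1) > readVersion) throw new AbortException();
                }
                for (Map.Entry<Cell, Integer> e : writeSet.entrySet()) {
                    e.getKey().value = e.getValue();
                    e.getKey().versionedLock.set(writeVersion << 1); // publish and unlock
                }
                locked.clear();
            } finally {
                // On abort, release any locks still held; the buffered writes are simply dropped.
                for (Cell c : locked) c.versionedLock.set(c.versionedLock.get() & ~1L);
            }
        }
    }

    public static void main(String[] args) {
        Cell x = new Cell(1);
        Tx stale = new Tx();
        Tx writer = new Tx();
        writer.write(x, 5);
        writer.commit();
        try {
            stale.read(x);
        } catch (AbortException e) {
            System.out.println("stale transaction aborted on read");
        }
    }
}
```

In this sketch a transaction that started before a conflicting commit is rejected at its next read of the changed cell, and an abort only has to discard the private writeset, which is one way to picture the "zombie" and "cheap rollback" points above.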
  • STM OVERHEAD
    • Instrumenting all transactional memory accesses induces a huge performance overhead.
    • STM compiler optimizations reduce the overhead.
  • STM COMPILER OPTIMIZATIONS
    • Common compiler optimizations:
        1. Avoiding instrumentation of accesses to immutable and transaction-local memory.
        2. Avoiding lock acquisition and release for thread-local memory.
        3. Avoiding readset population in read-only transactions.
  • NEW STM COMPILER OPTIMIZATIONS
    • In this work:
        1. Reduce the number of instrumented memory reads using load elimination.
        2. Reduce the number of instrumented memory writes using scalar promotion.
        3. Avoid writeset lookups for memory not yet written to.
        4. Avoid writeset recordkeeping for memory that will not be read.
  • 1. LOAD ELIMINATION IN ATOMIC BLOCKS
    • Original loop: 5 instrumented memory reads per loop iteration.

        for (int j = 0; j < nfeatures; j++) {
            new_centers[index][j] = new_centers[index][j] + feature[i][j];
        }

    • After load elimination (using Lazy Code Motion): 2 instrumented memory reads per loop iteration.

        if (0 < nfeatures) {
            nci = new_centers[index];
            fi = feature[i];
            for (j = 0; j < nfeatures; j++) {
                nci[j] = nci[j] + fi[j];
            }
        }
  • 1. LOAD ELIMINATION IN ATOMIC BLOCKS

        for (int j = 0; j < nfeatures; j++) {
            new_centers[index][j] = new_centers[index][j] + feature[i][j];
        }

    • Key insight:
        • No need to check whether new_centers[index] can change in other threads.
        • Still need to check that it cannot change locally or through method calls.
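
The read counts quoted for the loop above (5 versus 2 per iteration) can be reproduced with a small counting sketch. It assumes that every array access in the loop body is an instrumented read and that loop variables and indices are uninstrumented locals. The helpers readRow and readElem are hypothetical stand-ins for STM read barriers, and the element type double is an assumption.

```java
// Hypothetical sketch: counts calls to stand-in read barriers before and after load elimination.
class LoadEliminationSketch {
    /** Number of calls made to the stand-in read barriers below. */
    static int instrumentedReads = 0;

    // Stand-ins for STM read barriers; a real STM library would also validate
    // versions and record the access in the readset.
    static double[] readRow(double[][] a, int i) { instrumentedReads++; return a[i]; }
    static double readElem(double[] a, int j) { instrumentedReads++; return a[j]; }

    /** Loop as written on the slide: every array access goes through a barrier. */
    static void original(double[][] newCenters, double[][] feature,
                         int index, int i, int nfeatures) {
        for (int j = 0; j < nfeatures; j++) {
            double[] target = readRow(newCenters, index);         // read 1: row for the store
            double sum = readElem(readRow(newCenters, index), j)  // reads 2 and 3
                       + readElem(readRow(feature, i), j);        // reads 4 and 5
            target[j] = sum;                                      // write barrier omitted
        }
    }

    /** Loop after hoisting the loop-invariant row loads out of the loop. */
    static void optimized(double[][] newCenters, double[][] feature,
                          int index, int i, int nfeatures) {
        if (0 < nfeatures) {
            double[] nci = readRow(newCenters, index);            // read once
            double[] fi = readRow(feature, i);                    // read once
            for (int j = 0; j < nfeatures; j++) {
                nci[j] = readElem(nci, j) + readElem(fi, j);      // 2 reads per iteration
            }
        }
    }

    public static void main(String[] args) {
        double[][] centers = {{1, 2, 3, 4}};
        double[][] feature = {{10, 20, 30, 40}};
        original(centers, feature, 0, 0, 4);
        System.out.println("original reads: " + instrumentedReads);  // 5 per iteration
        instrumentedReads = 0;
        optimized(centers, feature, 0, 0, 4);
        System.out.println("optimized reads: " + instrumentedReads); // 2 per iteration plus 2 hoisted
    }
}
```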
  • 2. SCALAR PROMOTION IN ATOMIC BLOCKS
    • Original loop: num_elts instrumented memory writes.

        for (int i = 0; i < num_elts; i++) {
            moments[0] += data[i];
        }

    • After the transformation (using Scalar Promotion): 1 instrumented memory write.

        if (0 < num_elts) {
            double temp = moments[0];
            try {
                for (int i = 0; i < num_elts; i++) {
                    temp += data[i];
                }
            } finally {
                moments[0] = temp;
            }
        }
  • 2. SCALAR PROMOTION IN ATOMIC BLOCKS

        for (int i = 0; i < num_elts; i++) {
            moments[0] += data[i];
        }

    • Key insight (same as for load elimination):
        • No need to check whether moments[0] can change in other threads.
        • Still need to check that it cannot change locally or through method calls.
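
The write counts for the moments[0] loop can be checked the same way. In this sketch writeElem is a hypothetical stand-in for an STM write barrier, and the read barriers are left out so that only writes are counted.

```java
// Hypothetical sketch: counts calls to a stand-in write barrier before and after scalar promotion.
class ScalarPromotionSketch {
    /** Number of calls made to the stand-in write barrier below. */
    static int instrumentedWrites = 0;

    // Stand-in for an STM write barrier; a lazy-update STM would buffer the
    // value in the writeset instead of storing it directly.
    static void writeElem(double[] a, int idx, double v) { instrumentedWrites++; a[idx] = v; }

    /** Loop as written on the slide: one instrumented write per iteration. */
    static void original(double[] moments, double[] data, int numElts) {
        for (int i = 0; i < numElts; i++) {
            writeElem(moments, 0, moments[0] + data[i]); // read barriers omitted
        }
    }

    /** Accumulate in a local and write back once, even if the loop throws. */
    static void promoted(double[] moments, double[] data, int numElts) {
        if (0 < numElts) {
            double temp = moments[0];
            try {
                for (int i = 0; i < numElts; i++) {
                    temp += data[i];
                }
            } finally {
                writeElem(moments, 0, temp);             // the single instrumented write
            }
        }
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3};
        double[] m1 = {10};
        original(m1, data, data.length);
        System.out.println("original writes: " + instrumentedWrites);
        instrumentedWrites = 0;
        double[] m2 = {10};
        promoted(m2, data, data.length);
        System.out.println("promoted writes: " + instrumentedWrites);
    }
}
```

The try/finally mirrors the slide: the write-back sits in the finally block so that the partial sum is stored even when the loop body exits with an exception, which matches what the original loop would have stored up to that point.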
  • LOAD ELIMINATION AND SCALAR PROMOTION ADVANTAGES
    • These optimizations are sound for every STM protocol that guarantees transaction isolation.
    • Lazy-update protocols, like TL2, gain the most, since reads and writes are expensive:
        • A read looks up the value in the writeset before looking at the memory location.
        • A write adds to, or replaces a value in, the writeset.
    • Let’s improve it further…
  • 3. REDUNDANT WRITESET LOOKUPS
    • Consider a transactional read: x = o.f;
    • If we know that we have not yet written to o.f in this transaction, then we can skip looking in the writeset!
    • Analysis: discover redundant writeset lookups using static analysis.
        • Use data flow analysis to simulate readset at compile-time.
        • Associate every abstract memory location with a tag saying whether this location was already written to or not.
        • Analyze only inside transaction boundaries.
        • Interprocedural, flow-sensitive, forward analysis.
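
A heavily reduced version of the forward analysis described above can be sketched as follows. It handles only a straight-line transaction body with no branches, aliasing, or method calls, so it is far simpler than the interprocedural analysis from the talk. The class and method names are hypothetical, and abstract memory locations are represented as plain strings such as "o.f".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a forward "already written?" analysis over straight-line code.
class WritesetLookupAnalysisSketch {
    /** One transactional access to an abstract memory location such as "o.f". */
    static final class Access {
        final boolean isWrite;
        final String location;
        Access(boolean isWrite, String location) {
            this.isWrite = isWrite;
            this.location = location;
        }
        static Access read(String location) { return new Access(false, location); }
        static Access write(String location) { return new Access(true, location); }
    }

    /**
     * Walks the transaction body forward and returns, for each read in program
     * order, whether its writeset lookup can be skipped because the location
     * has not been written earlier in the same transaction.
     */
    static List<Boolean> lookupSkippable(List<Access> txBody) {
        Set<String> maybeWritten = new HashSet<>();  // the "already written" tags
        List<Boolean> result = new ArrayList<>();
        for (Access a : txBody) {
            if (a.isWrite) {
                maybeWritten.add(a.location);
            } else {
                result.add(!maybeWritten.contains(a.location));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Access> body = Arrays.asList(
                Access.read("o.f"), Access.write("o.f"), Access.read("o.f"), Access.read("o.g"));
        System.out.println(lookupSkippable(body));
    }
}
```

With control flow, the same idea would need a join at merge points (a location counts as possibly written if it was written on any incoming path); that part is not shown here.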
  • 4. REDUNDANT WRITESET RECORDKEEPING
    • Consider a transactional write: o.f = x;
    • If we know that we are not going to read o.f in this transaction, then we can perform a cheaper writeset insert.
        • E.g., by not updating the Bloom filter.
    • Analysis: discover redundant writeset recordkeeping using static analysis.
        • Use data flow analysis to simulate writeset at compile-time.
        • Associate every abstract memory location with a tag saying whether this location is going to be read.
        • Analyze only inside transaction boundaries.
        • Interprocedural, flow-sensitive, backward analysis.
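
What a "cheaper writeset insert" might look like at run time can be sketched with a writeset that keeps a small Bloom filter in front of its map. This is an assumption-laden illustration, not the writeset of Deuce or TL2: the class WriteSetSketch, the 64-bit single-hash filter, and the method names put, putNoFilter and lookup are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical writeset with a tiny Bloom filter used to speed up read-after-write checks.
class WriteSetSketch {
    private final Map<String, Integer> entries = new HashMap<>();
    private long bloomBits = 0L;   // 64-bit filter with a single hash function
    int filterUpdates = 0;         // counts the extra recordkeeping work

    private static long bit(String location) {
        return 1L << (location.hashCode() & 63);
    }

    /** Regular insert: record the value and update the filter for later reads. */
    void put(String location, int value) {
        bloomBits |= bit(location);
        filterUpdates++;
        entries.put(location, value);
    }

    /** Cheaper insert, valid only if the location is never read later in this transaction. */
    void putNoFilter(String location, int value) {
        entries.put(location, value);
    }

    /** Read-path lookup: the filter rejects most locations without touching the map. */
    Integer lookup(String location) {
        if ((bloomBits & bit(location)) == 0) {
            return null;           // filter says: not written through put()
        }
        return entries.get(location);
    }

    /** Commit still sees every buffered write, filtered or not. */
    Map<String, Integer> entriesForCommit() {
        return entries;
    }

    public static void main(String[] args) {
        WriteSetSketch ws = new WriteSetSketch();
        ws.put("o.f", 1);          // may be read again, so keep the filter up to date
        ws.putNoFilter("o.g", 2);  // analysis proved o.g is not read afterwards
        System.out.println("filter updates: " + ws.filterUpdates);
        System.out.println("entries at commit: " + ws.entriesForCommit().size());
    }
}
```

The sketch also shows why the backward analysis has to be conservative: a later lookup of a location inserted with putNoFilter can miss the buffered value, so the cheap path is only safe when no such read exists.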
  • EXPERIMENTS
    • We created analyses and transformations for these 4 optimizations.
    • Software used:
        • Deuce STM with TL2 protocol.
        • Soot Java Optimization Framework.
        • STAMP and microbenchmarks.
    • Hardware used:
        • Sun UltraSPARC T2 Plus with 2 CPUs × 8 cores × 8 hardware threads.
  • READING THE RESULTS
    [Chart: benchmark run with -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input]
    • Legend:
        • Unoptimized
        • + Immutable, + Transaction Local, + Thread Local
        • + Load Elimination
        • + Redundant Writeset Lookups
        • + Redundant Writeset Recordkeeping
  • RESULTS: K-MEANS
    [Chart: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input]
    • Load Elimination inside tight loops (e.g., new_centers[index] from the example).
  • RESULTS: LINKED LIST
    [Chart: 10% write operations, 20 seconds, 10K items, 26K possible range]
    • Locating the position of the element in all three transactions, add(), remove() and contains(), involves many reads of locations not written to before.
  • RESULTS: SSCA2
    [Chart: -s 18 -i1.0 -u1.0 -l3 -p3]
    • Many small transactions that update single shared values, and don’t read them thereafter.
  • ANALYSIS
    • Load Elimination had the largest impact (up to 29% speedup).
    • No example of Scalar Promotion was found (rare phenomenon or bad luck?).
  • ANALYSIS
    • In transactions that perform many reads before writes, skipping the writeset lookups increased throughput by up to 28%.
    • Even in transactions that don’t read values after they are written, skipping the writeset recordkeeping gained no more than 4% speedup.
  • SUMMARY
    • We presented 4 STM compiler optimizations.
    • The optimizations are biased towards lazy-update STMs, but can be applied with some changes to in-place-update STMs.
  • Q&A
    • Thank you!