LOWERING STM OVERHEAD WITH STATIC ANALYSIS
Yehuda Afek, Guy Korland, Arie Zilberstein
Tel-Aviv University

LCPC 2010
OUTLINE
• Background on STM, TL2.
• STM overhead and common optimizations.
• New optimizations.
• Experimental results.
• Conclusion.
SOFTWARE TRANSACTIONAL MEMORY
• Aims to ease concurrent programming.
• Idea: enclose code in atomic blocks.
• Code inside an atomic block behaves as a transaction:
  - Atomic (executes entirely or not at all).
  - Consistent.
  - Isolated (not affected by other concurrent transactions).
SOFTWARE TRANSACTIONAL MEMORY
• Implementation:
  - The STM compiler instruments every memory access inside atomic blocks.
  - STM library functions handle the synchronization according to a protocol.
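As a rough illustration of what "instrumenting" a memory access means, the sketch below shows a plain `counter = counter + 1;` rewritten into calls on a transaction context. The `Context` class, its `onRead`/`onWrite`/`commit` methods, and the string-keyed memory map are hypothetical stand-ins chosen for this example, not the API of any particular STM library. Buffering writes until commit is only one possible protocol.

```java
import java.util.HashMap;
import java.util.Map;

public class InstrumentationSketch {
    /** Hypothetical transaction context (not a real STM library API). */
    public static final class Context {
        private final Map<String, Integer> writeSet = new HashMap<>();
        public int reads = 0;
        public int writes = 0;

        /** Instrumented read: check this transaction's pending writes, then memory. */
        public int onRead(Map<String, Integer> memory, String location) {
            reads++;
            Integer buffered = writeSet.get(location);
            return buffered != null ? buffered : memory.get(location);
        }

        /** Instrumented write: buffer the value instead of touching memory. */
        public void onWrite(String location, int value) {
            writes++;
            writeSet.put(location, value);
        }

        /** Publish all buffered writes (locking and validation are omitted here). */
        public void commit(Map<String, Integer> memory) {
            memory.putAll(writeSet);
        }
    }

    /** What "counter = counter + 1;" inside an atomic block might become. */
    public static void increment(Map<String, Integer> memory, Context ctx) {
        int value = ctx.onRead(memory, "counter");
        ctx.onWrite("counter", value + 1);
    }

    public static void main(String[] args) {
        Map<String, Integer> memory = new HashMap<>();
        memory.put("counter", 41);
        Context ctx = new Context();
        increment(memory, ctx);
        ctx.commit(memory);
        System.out.println(memory.get("counter") + " (reads=" + ctx.reads
                + ", writes=" + ctx.writes + ")");
    }
}
```

Every access now pays for a method call and bookkeeping, which is the overhead the optimizations in this talk try to reduce.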
TRANSACTIONAL LOCKING II
• TL2 is an influential STM protocol.
• Features:
  - Lock-based.
  - Word-based.
  - Lazy-update.
• Achieves synchronization through versioned write-locks + global version clock.
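A minimal, single-threaded sketch of how versioned write-locks and a global version clock can fit together, assuming the usual TL2 outline: sample the clock at start, abort a read whose location is locked or newer than that sample, and at commit lock the writeset, advance the clock, validate the readset, and write back. The names `Cell`, `Tx`, and `AbortException` are hypothetical. The lock here is a plain flag, so this is not thread-safe; a real implementation would pack the lock bit and version into one word and use compare-and-set.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class Tl2Sketch {
    /** Global version clock shared by all transactions. */
    public static final AtomicLong CLOCK = new AtomicLong();

    /** One memory word with its versioned write-lock (hypothetical layout). */
    public static final class Cell {
        volatile int value;
        volatile long version;   // clock value of the last committed write
        volatile boolean locked; // write-lock flag (illustration only, not atomic)
        public Cell(int initial) { value = initial; }
        public int get() { return value; }
    }

    public static final class AbortException extends RuntimeException {}

    public static final class Tx {
        private final long readVersion = CLOCK.get(); // sampled at transaction start
        private final List<Cell> readSet = new ArrayList<>();
        private final Map<Cell, Integer> writeSet = new LinkedHashMap<>();

        public int read(Cell c) {
            Integer buffered = writeSet.get(c);        // lazy update: own writes first
            if (buffered != null) return buffered;
            int v = c.value;
            if (c.locked || c.version > readVersion) throw new AbortException();
            readSet.add(c);
            return v;
        }

        public void write(Cell c, int v) { writeSet.put(c, v); } // memory untouched

        public void commit() {
            List<Cell> acquired = new ArrayList<>();
            try {
                for (Cell c : writeSet.keySet()) {     // 1. lock the writeset
                    if (c.locked) throw new AbortException();
                    c.locked = true;
                    acquired.add(c);
                }
                long writeVersion = CLOCK.incrementAndGet(); // 2. advance the clock
                for (Cell c : readSet) {               // 3. validate the readset
                    boolean lockedByOther = c.locked && !writeSet.containsKey(c);
                    if (lockedByOther || c.version > readVersion) throw new AbortException();
                }
                for (Map.Entry<Cell, Integer> e : writeSet.entrySet()) { // 4. write back
                    e.getKey().value = e.getValue();
                    e.getKey().version = writeVersion;
                }
            } finally {
                for (Cell c : acquired) c.locked = false; // 5. release the locks
            }
        }
    }
}
```

In this sketch, locks are only held during `commit`, and an aborted transaction has nothing to undo because memory is not modified before write-back.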
TRANSACTIONAL LOCKING II
• Advantages of TL2:
  - Locks are held for a short time.
  - Zombie transactions are quickly aborted.
  - Rollback is cheap.
STM OVERHEAD
• Instrumenting all transactional memory accesses induces a huge performance overhead.
• STM compiler optimizations reduce the overhead.
STM COMPILER OPTIMIZATIONS
• Common compiler optimizations:
  1. Avoiding instrumentation of accesses to immutable and transaction-local memory.
  2. Avoiding lock acquisition and release for thread-local memory.
  3. Avoiding readset population in read-only transactions.
NEW STM COMPILER OPTIMIZATIONS
• In this work:
  1. Reduce the number of instrumented memory reads using load elimination.
  2. Reduce the number of instrumented memory writes using scalar promotion.
  3. Avoid writeset lookups for memory not yet written to.
  4. Avoid writeset recordkeeping for memory that will not be read.
1. LOAD ELIMINATION IN ATOMIC BLOCKS
• Original loop (5 instrumented memory reads per loop iteration):

    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j] + feature[i][j];
    }

• After load elimination using Lazy Code Motion (2 instrumented memory reads per loop iteration):

    if (0 < nfeatures) {
        nci = new_centers[index];
        fi = feature[i];
        for (j = 0; j < nfeatures; j++) {
            nci[j] = nci[j] + fi[j];
        }
    }
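The read counts can be checked with a small counting harness. This is only a sketch: `readRow` and `readElem` are hypothetical stand-ins for instrumented reads (a real STM barrier would do far more than bump a counter), and the `double` element type is an assumption, since the slide does not give the array types.

```java
public class LoadEliminationDemo {
    /** Number of calls to the stand-in read barriers below. */
    static int instrumentedReads = 0;

    // Hypothetical read barriers; they only count the access.
    static double[] readRow(double[][] a, int i) { instrumentedReads++; return a[i]; }
    static double readElem(double[] a, int j) { instrumentedReads++; return a[j]; }

    /** Original loop: every array access in the statement is instrumented. */
    public static int readsOriginal(double[][] newCenters, double[][] feature,
                                    int index, int i, int nfeatures) {
        instrumentedReads = 0;
        for (int j = 0; j < nfeatures; j++) {
            double[] target = readRow(newCenters, index);            // read 1
            double sum = readElem(readRow(newCenters, index), j)     // reads 2 and 3
                       + readElem(readRow(feature, i), j);           // reads 4 and 5
            target[j] = sum;                                         // write, not counted
        }
        return instrumentedReads;
    }

    /** After load elimination: the two row loads are hoisted out of the loop. */
    public static int readsOptimized(double[][] newCenters, double[][] feature,
                                     int index, int i, int nfeatures) {
        instrumentedReads = 0;
        if (0 < nfeatures) {
            double[] nci = readRow(newCenters, index);  // once per loop, not per iteration
            double[] fi = readRow(feature, i);          // once per loop, not per iteration
            for (int j = 0; j < nfeatures; j++) {
                nci[j] = readElem(nci, j) + readElem(fi, j);  // 2 reads per iteration
            }
        }
        return instrumentedReads;
    }

    public static void main(String[] args) {
        double[][] centers = {{1.0, 2.0, 3.0, 4.0}};
        double[][] feature = {{0.5, 0.5, 0.5, 0.5}};
        System.out.println("original:  " + readsOriginal(centers, feature, 0, 0, 4));
        System.out.println("optimized: " + readsOptimized(centers, feature, 0, 0, 4));
    }
}
```

With `nfeatures = 4`, the original form performs 5 × 4 counted reads, while the optimized form performs 2 hoisted reads plus 2 × 4 inside the loop.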
1. LOAD ELIMINATION IN ATOMIC BLOCKS

    for (int j = 0; j < nfeatures; j++) {
        new_centers[index][j] = new_centers[index][j] + feature[i][j];
    }

• Key insight:
  - No need to check whether new_centers[index] can change in other threads (the transaction is isolated).
  - Still need to check that it cannot change locally or through method calls.
2. SCALAR PROMOTION IN ATOMIC BLOCKS
• Original loop (num_elts instrumented memory writes):

    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }

• After Scalar Promotion (1 instrumented memory write):

    if (0 < num_elts) {
        double temp = moments[0];
        try {
            for (int i = 0; i < num_elts; i++) {
                temp += data[i];
            }
        } finally {
            moments[0] = temp;
        }
    }
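The write counts can be illustrated the same way. In this sketch `writeElem` is a hypothetical stand-in for an instrumented write that only counts the access; the reads of `moments[0]` and `data[i]` are left uninstrumented to keep the focus on writes.

```java
public class ScalarPromotionDemo {
    /** Number of calls to the stand-in write barrier below. */
    static int instrumentedWrites = 0;

    // Hypothetical write barrier; a real STM would update its writeset here.
    static void writeElem(double[] a, int idx, double v) { instrumentedWrites++; a[idx] = v; }

    /** Original loop: one instrumented write per iteration. */
    public static int writesOriginal(double[] moments, double[] data) {
        instrumentedWrites = 0;
        for (int i = 0; i < data.length; i++) {
            writeElem(moments, 0, moments[0] + data[i]);
        }
        return instrumentedWrites;
    }

    /** After scalar promotion: accumulate in a local, write back once. */
    public static int writesPromoted(double[] moments, double[] data) {
        instrumentedWrites = 0;
        if (0 < data.length) {
            double temp = moments[0];
            try {
                for (int i = 0; i < data.length; i++) {
                    temp += data[i];               // plain local update, no barrier
                }
            } finally {
                writeElem(moments, 0, temp);       // single instrumented write
            }
        }
        return instrumentedWrites;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0};
        System.out.println("original: " + writesOriginal(new double[] {10.0}, data));
        System.out.println("promoted: " + writesPromoted(new double[] {10.0}, data));
    }
}
```

The `finally` block mirrors the slide: presumably it ensures the accumulated value is still written back if the loop body exits with an exception, as the original loop would have left a partial sum in `moments[0]`.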
2. SCALAR PROMOTION IN ATOMIC BLOCKS

    for (int i = 0; i < num_elts; i++) {
        moments[0] += data[i];
    }

• (Same) key insight:
  - No need to check whether moments[0] can change in other threads (the transaction is isolated).
  - Still need to check that it cannot change locally or through method calls.
LOAD ELIMINATION AND SCALAR PROMOTION ADVANTAGES
• These optimizations are sound for every STM protocol that guarantees transaction isolation.
• Lazy-update protocols, like TL2, gain the most, since reads and writes are expensive:
  - A read looks up the value in the writeset before looking at the memory location.
  - A write adds to, or replaces a value in, the writeset.
• Let’s improve it further…
3. REDUNDANT WRITESET LOOKUPS
• Consider a transactional read: x = o.f;
• If we know that we didn’t yet write to o.f in this transaction…
  - … then we can skip looking in the writeset!
• Analysis: discover redundant writeset lookups using static analysis.
  - Use data flow analysis to simulate the writeset at compile-time.
  - Associate every abstract memory location with a tag saying whether this location was already written to or not.
  - Analyze only inside transaction boundaries.
  - Interprocedural, flow-sensitive, forward analysis.
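A minimal sketch of the two read paths, assuming a lazy-update STM whose writeset is a hash map. The `Tx` class and the method name `readNotYetWritten` are hypothetical; readset population and version validation are omitted so that only the writeset lookup is visible.

```java
import java.util.HashMap;
import java.util.Map;

public class WritesetLookupSketch {
    /** Hypothetical lazy-update transaction over a string-keyed memory. */
    public static final class Tx {
        private final Map<String, Integer> memory;
        private final Map<String, Integer> writeSet = new HashMap<>();
        public int lookups = 0; // how many times the writeset was searched

        public Tx(Map<String, Integer> memory) { this.memory = memory; }

        /** General read barrier: must search the writeset first. */
        public int read(String location) {
            lookups++;
            Integer buffered = writeSet.get(location);
            return buffered != null ? buffered : memory.get(location);
        }

        /**
         * Cheaper barrier the compiler could emit when the analysis proves the
         * location has not been written earlier in this transaction.
         */
        public int readNotYetWritten(String location) {
            return memory.get(location);
        }

        public void write(String location, int value) { writeSet.put(location, value); }
    }

    public static void main(String[] args) {
        Map<String, Integer> memory = new HashMap<>();
        memory.put("o.f", 1);
        Tx tx = new Tx(memory);
        int x = tx.readNotYetWritten("o.f"); // no write to o.f so far: skip the lookup
        tx.write("o.f", x + 1);
        int y = tx.read("o.f");              // written above: the lookup is required
        System.out.println(x + " " + y + " lookups=" + tx.lookups);
    }
}
```

Using `readNotYetWritten` after a write to the same location would return a stale value, which is why the tag must be established by a sound analysis.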
4. REDUNDANT WRITESET RECORDKEEPING
• Consider a transactional write: o.f = x;
• If we know that we aren’t going to read o.f in this transaction…
  - … then we can perform a cheaper writeset insert.
  - e.g.: by not updating the Bloom filter.
• Analysis: discover redundant writeset recordkeeping using static analysis.
  - Use data flow analysis to simulate writeset at compile-time.
  - Associate every abstract memory location with a tag saying whether this location is going to be read.
  - Analyze only inside transaction boundaries.
  - Interprocedural, flow-sensitive, backward analysis.
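A sketch of what a cheaper writeset insert could look like, assuming the writeset keeps a small Bloom filter so that reads can quickly rule out locations that were never written. The `WriteSet` class, the 64-bit single-hash filter, and the method name `putWillNotBeRead` are hypothetical simplifications made for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class WritesetRecordkeepingSketch {
    /** Hypothetical writeset with a 64-bit, single-hash Bloom filter. */
    public static final class WriteSet {
        private final Map<String, Integer> entries = new HashMap<>();
        private long bloom = 0L;
        public int filterUpdates = 0; // how many inserts paid for the filter update

        private static long bit(String location) {
            return 1L << (location.hashCode() & 63);
        }

        /** General insert: keeps the filter in sync for later reads. */
        public void put(String location, int value) {
            entries.put(location, value);
            bloom |= bit(location);
            filterUpdates++;
        }

        /**
         * Cheaper insert the compiler could emit when the analysis proves the
         * location will not be read again in this transaction.
         */
        public void putWillNotBeRead(String location, int value) {
            entries.put(location, value);
        }

        /** Read-side lookup: the filter answers "definitely not written" cheaply. */
        public Integer lookup(String location) {
            if ((bloom & bit(location)) == 0) return null;
            return entries.get(location);
        }

        /** All buffered writes, as needed at commit time. */
        public Map<String, Integer> entries() { return entries; }
    }

    public static void main(String[] args) {
        WriteSet ws = new WriteSet();
        ws.put("a", 1);               // may be read later: update the filter
        ws.putWillNotBeRead("b", 2);  // never read again: skip the filter
        System.out.println(ws.lookup("a") + " " + ws.entries().size()
                + " filterUpdates=" + ws.filterUpdates);
    }
}
```

Both writes still reach memory at commit through `entries()`. Only the read-side recordkeeping is skipped, so a later read of a location inserted with `putWillNotBeRead` could miss the buffered value; the backward analysis must rule that out.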
EXPERIMENTS
• We created analyses and transformations for these 4 optimizations.
• Software used:
  - Deuce STM with TL2 protocol.
  - Soot Java Optimization Framework.
  - STAMP and microbenchmarks.
• Hardware used:
  - Sun UltraSPARC T2 Plus with 2 CPUs × 8 cores × 8 hardware threads.
READING THE RESULTS
[Figure: annotated results chart. Arguments: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input]
• Series labels in the chart:
  - Unoptimized
  - + Immutable, + Transaction Local, + Thread Local
  - + Load Elimination
  - + Redundant Writeset Lookups
  - + Redundant Writeset Recordkeeping
RESULTS: K-MEANS
[Figure: K-Means results chart. Arguments: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input]
• Load Elimination inside tight loops (e.g., new_centers[index] from the example).
RESULTS: LINKED LIST
[Figure: Linked List results chart. Parameters: 10% write operations, 20 seconds, 10K items, 26K possible range]
• Locating the position of the element in all three add(), remove() and contains() transactions involves many reads to locations not written to before.
RESULTS: SSCA2
[Figure: SSCA2 results chart. Arguments: -s 18 -i1.0 -u1.0 -l3 -p3]
• Many small transactions that update single shared values, and don’t read them thereafter.
ANALYSIS
• Load Elimination had the largest impact (up to 29% speedup).
• No example of Scalar Promotion was found (rare phenomenon or bad luck?).
ANALYSIS
• In transactions that perform many reads before writes, skipping the writeset lookups increased throughput by up to 28%.
• Even in transactions that don’t read values after they are written, skipping the writeset recordkeeping gained no more than 4% speedup.
SUMMARY
• We presented 4 STM compiler optimizations.
• The optimizations are biased towards lazy-update STMs, but can be applied with some changes to in-place-update STMs.
Q&A
• Thank you!

More Related Content

What's hot

Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes Jonathan Salwan
 
protothread and its usage in contiki OS
protothread and its usage in contiki OSprotothread and its usage in contiki OS
protothread and its usage in contiki OSSalah Amean
 
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...Jonathan Salwan
 
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW ImplementationZhen Wei
 
JVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixJVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixCodemotion Tel Aviv
 
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel FernandesKernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel FernandesAnne Nicolas
 
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)Sneeker Yeh
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAkihiro Hayashi
 
Contiki introduction I.
Contiki introduction I.Contiki introduction I.
Contiki introduction I.Dingxin Xu
 
Instruction Combine in LLVM
Instruction Combine in LLVMInstruction Combine in LLVM
Instruction Combine in LLVMWang Hsiangkai
 
Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Sneeker Yeh
 
Concurrency scalability
Concurrency scalabilityConcurrency scalability
Concurrency scalabilityMårten Rånge
 
FPGA design with CλaSH
FPGA design with CλaSHFPGA design with CλaSH
FPGA design with CλaSHConrad Parker
 
The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014Jian-Hong Pan
 
Kernel Recipes 2019 - Formal modeling made easy
Kernel Recipes 2019 - Formal modeling made easyKernel Recipes 2019 - Formal modeling made easy
Kernel Recipes 2019 - Formal modeling made easyAnne Nicolas
 
LLVM Register Allocation
LLVM Register AllocationLLVM Register Allocation
LLVM Register AllocationWang Hsiangkai
 

What's hot (20)

Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes
 
protothread and its usage in contiki OS
protothread and its usage in contiki OSprotothread and its usage in contiki OS
protothread and its usage in contiki OS
 
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
 
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
 
JVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixJVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, Wix
 
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel FernandesKernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
 
Machine Trace Metrics
Machine Trace MetricsMachine Trace Metrics
Machine Trace Metrics
 
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
 
opt-mem-trx
opt-mem-trxopt-mem-trx
opt-mem-trx
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL Generation
 
Fm wtm12-v2
Fm wtm12-v2Fm wtm12-v2
Fm wtm12-v2
 
Contiki introduction I.
Contiki introduction I.Contiki introduction I.
Contiki introduction I.
 
Instruction Combine in LLVM
Instruction Combine in LLVMInstruction Combine in LLVM
Instruction Combine in LLVM
 
Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)
 
Concurrency scalability
Concurrency scalabilityConcurrency scalability
Concurrency scalability
 
FPGA design with CλaSH
FPGA design with CλaSHFPGA design with CλaSH
FPGA design with CλaSH
 
The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014
 
Kernel Recipes 2019 - Formal modeling made easy
Kernel Recipes 2019 - Formal modeling made easyKernel Recipes 2019 - Formal modeling made easy
Kernel Recipes 2019 - Formal modeling made easy
 
LLVM Register Allocation
LLVM Register AllocationLLVM Register Allocation
LLVM Register Allocation
 

Similar to Lowering STM Overhead with Static Analysis

Deuce STM - CMP'09
Deuce STM - CMP'09Deuce STM - CMP'09
Deuce STM - CMP'09Guy Korland
 
Compiler presention
Compiler presentionCompiler presention
Compiler presentionFaria Priya
 
RTOS implementation
RTOS implementationRTOS implementation
RTOS implementationRajan Kumar
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerMarina Kolpakova
 
Arm developement
Arm developementArm developement
Arm developementhirokiht
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeDmitri Nesteruk
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory AnalysisMoabi.com
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoductionRiyaj Shamsudeen
 
Design of Real - Time Operating System Using Keil µVision Ide
Design of Real - Time Operating System Using Keil µVision IdeDesign of Real - Time Operating System Using Keil µVision Ide
Design of Real - Time Operating System Using Keil µVision Ideiosrjce
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Jorisimec.archive
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_partyOpen Party
 

Similar to Lowering STM Overhead with Static Analysis (20)

Data race
Data raceData race
Data race
 
Deuce STM - CMP'09
Deuce STM - CMP'09Deuce STM - CMP'09
Deuce STM - CMP'09
 
Compiler presention
Compiler presentionCompiler presention
Compiler presention
 
Lab6 rtos
Lab6 rtosLab6 rtos
Lab6 rtos
 
Design
DesignDesign
Design
 
RTOS implementation
RTOS implementationRTOS implementation
RTOS implementation
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
Arm developement
Arm developementArm developement
Arm developement
 
Transactional Memory
Transactional MemoryTransactional Memory
Transactional Memory
 
Java Memory Model
Java Memory ModelJava Memory Model
Java Memory Model
 
Programar para GPUs
Programar para GPUsProgramar para GPUs
Programar para GPUs
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoduction
 
Design of Real - Time Operating System Using Keil µVision Ide
Design of Real - Time Operating System Using Keil µVision IdeDesign of Real - Time Operating System Using Keil µVision Ide
Design of Real - Time Operating System Using Keil µVision Ide
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Coding style for good synthesis
Coding style for good synthesisCoding style for good synthesis
Coding style for good synthesis
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
Task and Data Parallelism
Task and Data ParallelismTask and Data Parallelism
Task and Data Parallelism
 

More from Guy Korland

Redis Developer Day TLV - Redis Stack & RedisInsight
Redis Developer Day TLV - Redis Stack & RedisInsightRedis Developer Day TLV - Redis Stack & RedisInsight
Redis Developer Day TLV - Redis Stack & RedisInsightGuy Korland
 
Using Redis As Your Online Feature Store: 2021 Highlights. 2022 Directions
Using Redis As Your  Online Feature Store:  2021 Highlights. 2022 DirectionsUsing Redis As Your  Online Feature Store:  2021 Highlights. 2022 Directions
Using Redis As Your Online Feature Store: 2021 Highlights. 2022 DirectionsGuy Korland
 
The evolution of DBaaS - israelcloudsummit
The evolution of DBaaS - israelcloudsummitThe evolution of DBaaS - israelcloudsummit
The evolution of DBaaS - israelcloudsummitGuy Korland
 
From kv to multi model RedisDay NYC19
From kv to multi model   RedisDay NYC19From kv to multi model   RedisDay NYC19
From kv to multi model RedisDay NYC19Guy Korland
 
From Key-Value to Multi-Model - RedisConf19
From Key-Value to Multi-Model - RedisConf19From Key-Value to Multi-Model - RedisConf19
From Key-Value to Multi-Model - RedisConf19Guy Korland
 
Building Scalable Producer-Consumer Pools based on Elimination-Diraction Trees
Building Scalable Producer-Consumer  Pools based on Elimination-Diraction TreesBuilding Scalable Producer-Consumer  Pools based on Elimination-Diraction Trees
Building Scalable Producer-Consumer Pools based on Elimination-Diraction TreesGuy Korland
 
Open stack bigdata NY cloudcamp
Open stack bigdata NY cloudcampOpen stack bigdata NY cloudcamp
Open stack bigdata NY cloudcampGuy Korland
 
The Open PaaS Stack
The Open PaaS StackThe Open PaaS Stack
The Open PaaS StackGuy Korland
 
Quasi-Linearizability: relaxed consistency for improved concurrency.
Quasi-Linearizability: relaxed consistency for improved concurrency.Quasi-Linearizability: relaxed consistency for improved concurrency.
Quasi-Linearizability: relaxed consistency for improved concurrency.Guy Korland
 
The Next Generation Application Server – How Event Based Processing yields s...
The Next Generation  Application Server – How Event Based Processing yields s...The Next Generation  Application Server – How Event Based Processing yields s...
The Next Generation Application Server – How Event Based Processing yields s...Guy Korland
 

More from Guy Korland (12)

Redis Developer Day TLV - Redis Stack & RedisInsight
Redis Developer Day TLV - Redis Stack & RedisInsightRedis Developer Day TLV - Redis Stack & RedisInsight
Redis Developer Day TLV - Redis Stack & RedisInsight
 
Using Redis As Your Online Feature Store: 2021 Highlights. 2022 Directions
Using Redis As Your  Online Feature Store:  2021 Highlights. 2022 DirectionsUsing Redis As Your  Online Feature Store:  2021 Highlights. 2022 Directions
Using Redis As Your Online Feature Store: 2021 Highlights. 2022 Directions
 
Vector database
Vector databaseVector database
Vector database
 
The evolution of DBaaS - israelcloudsummit
The evolution of DBaaS - israelcloudsummitThe evolution of DBaaS - israelcloudsummit
The evolution of DBaaS - israelcloudsummit
 
From kv to multi model RedisDay NYC19
From kv to multi model   RedisDay NYC19From kv to multi model   RedisDay NYC19
From kv to multi model RedisDay NYC19
 
From Key-Value to Multi-Model - RedisConf19
From Key-Value to Multi-Model - RedisConf19From Key-Value to Multi-Model - RedisConf19
From Key-Value to Multi-Model - RedisConf19
 
Building Scalable Producer-Consumer Pools based on Elimination-Diraction Trees
Building Scalable Producer-Consumer  Pools based on Elimination-Diraction TreesBuilding Scalable Producer-Consumer  Pools based on Elimination-Diraction Trees
Building Scalable Producer-Consumer Pools based on Elimination-Diraction Trees
 
Cloudify 10m
Cloudify 10mCloudify 10m
Cloudify 10m
 
Open stack bigdata NY cloudcamp
Open stack bigdata NY cloudcampOpen stack bigdata NY cloudcamp
Open stack bigdata NY cloudcamp
 
The Open PaaS Stack
The Open PaaS StackThe Open PaaS Stack
The Open PaaS Stack
 
Quasi-Linearizability: relaxed consistency for improved concurrency.
Quasi-Linearizability: relaxed consistency for improved concurrency.Quasi-Linearizability: relaxed consistency for improved concurrency.
Quasi-Linearizability: relaxed consistency for improved concurrency.
 
The Next Generation Application Server – How Event Based Processing yields s...
The Next Generation  Application Server – How Event Based Processing yields s...The Next Generation  Application Server – How Event Based Processing yields s...
The Next Generation Application Server – How Event Based Processing yields s...
 

Recently uploaded

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 

Lowering STM Overhead with Static Analysis

  • 1. LOWERING STM OVERHEAD WITH STATIC ANALYSIS Yehuda Afek, Guy Korland, Arie Zilberstein Tel-Aviv University LCPC 2010
  • 2. OUTLINE  Background on STM, TL2.  STM overhead and common optimizations.  New optimizations.  Experimental results.  Conclusion.
  • 3. SOFTWARE TRANSACTIONAL MEMORY  Aims to ease concurrent programming.  Idea: enclose code in atomic blocks.  Code inside atomic block behaves as a transaction:  Atomic (executes altogether or not at all).  Consistent.  Isolated (Not affected by other concurrent transactions).
  • 4. SOFTWARE TRANSACTIONAL MEMORY  Implementation:  STM compiler instruments every memory access inside atomic blocks.  STM library functions handle the synchronization according to a protocol.
  • 5. TRANSACTIONAL LOCKING II  TL2 is an influential STM protocol.  Features:  Lock-based.  Word-based.  Lazy-update.  Achieves synchronization through versioned write-locks + global version clock.
  • 6. TRANSACTIONAL LOCKING II  Advantages of TL2:  Locks are held for a short time.  Zombie transactions are quickly aborted.  Rollback is cheap.
  • 7. STM OVERHEAD  Instrumenting all transactional memory accesses induces a huge performance overhead.  STM compiler optimizations reduce the overhead.
  • 8. STM COMPILER OPTIMIZATIONS  Common compiler optimizations: 1. Avoiding instrumentation of accesses to immutable and transaction-local memory. 2. Avoiding lock acquisition and releases for thread-local memory. 3. Avoiding readset population in read-only transactions.
  • 9. NEW STM COMPILER OPTIMIZATIONS  In this work: 1. Reduce the number of instrumented memory reads using load elimination. 2. Reduce the number of instrumented memory writes using scalar promotion. 3. Avoid writeset lookups for memory not yet written to. 4. Avoid writeset recordkeeping for memory that will not be read.
  • 10. 1. LOAD ELIMINATION IN ATOMIC BLOCKS  Before: 5 instrumented memory reads per loop iteration.
        for (int j = 0; j < nfeatures; j++) {
            new_centers[index][j] = new_centers[index][j] + feature[i][j];
        }
    After (using Lazy Code Motion): 2 instrumented memory reads per loop iteration.
        if (0 < nfeatures) {
            nci = new_centers[index];
            fi = feature[i];
            for (j = 0; j < nfeatures; j++) {
                nci[j] = nci[j] + fi[j];
            }
        }
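The read counts on this slide can be reproduced with a small simulation. The sketch below does not use any real STM instrumentation: `readRow` and `readElem` are hypothetical stand-ins for a read barrier that only increment a counter, and the class name `LoadEliminationDemo` is invented for illustration.

```java
final class LoadEliminationDemo {
    // Hypothetical stand-in for an STM read barrier: it only counts reads.
    static int instrumentedReads = 0;

    static double[] readRow(double[][] a, int i) { instrumentedReads++; return a[i]; }
    static double readElem(double[] a, int j)    { instrumentedReads++; return a[j]; }

    // Original loop: 5 instrumented reads per iteration.
    static void original(double[][] newCenters, double[][] feature,
                         int index, int i, int nfeatures) {
        for (int j = 0; j < nfeatures; j++) {
            double[] target = readRow(newCenters, index);         // read 1
            double sum = readElem(readRow(newCenters, index), j)  // reads 2 and 3
                       + readElem(readRow(feature, i), j);        // reads 4 and 5
            target[j] = sum; // the store is instrumented too; only reads are counted here
        }
    }

    // After load elimination: both row loads are hoisted out of the loop,
    // leaving 2 instrumented reads per iteration.
    static void optimized(double[][] newCenters, double[][] feature,
                          int index, int i, int nfeatures) {
        if (0 < nfeatures) {
            double[] nci = readRow(newCenters, index);
            double[] fi = readRow(feature, i);
            for (int j = 0; j < nfeatures; j++) {
                nci[j] = readElem(nci, j) + readElem(fi, j);
            }
        }
    }
}
```

For `nfeatures = 4` the original version issues 20 counted reads, while the optimized version issues 2 hoisted reads plus 8 in the loop.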
  • 11. 1. LOAD ELIMINATION IN ATOMIC BLOCKS
        for (int j = 0; j < nfeatures; j++) {
            new_centers[index][j] = new_centers[index][j] + feature[i][j];
        }
    Key insight:  No need to check if new_centers[index] can change in other threads.  Still need to check that it cannot change locally or through method calls.
  • 12. 2. SCALAR PROMOTION IN ATOMIC BLOCKS  Before: num_elts instrumented memory writes.
        for (int i = 0; i < num_elts; i++) {
            moments[0] += data[i];
        }
    After (using Scalar Promotion): 1 instrumented memory write.
        if (0 < num_elts) {
            double temp = moments[0];
            try {
                for (int i = 0; i < num_elts; i++) {
                    temp += data[i];
                }
            } finally {
                moments[0] = temp;
            }
        }
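The write counts can be checked the same way. In this sketch `write` is a hypothetical stand-in for an STM write barrier that only counts calls, and `ScalarPromotionDemo` is an invented class name. The `try`/`finally` mirrors the slide: the accumulated value is stored back even if the loop exits with an exception.

```java
final class ScalarPromotionDemo {
    // Hypothetical stand-in for an STM write barrier: it only counts writes.
    static int instrumentedWrites = 0;

    static void write(double[] a, int i, double v) { instrumentedWrites++; a[i] = v; }

    // Original loop: one instrumented write per element.
    static void original(double[] moments, double[] data, int numElts) {
        for (int i = 0; i < numElts; i++) {
            write(moments, 0, moments[0] + data[i]);
        }
    }

    // After scalar promotion: accumulate in a local, write back once.
    static void promoted(double[] moments, double[] data, int numElts) {
        if (0 < numElts) {
            double temp = moments[0];
            try {
                for (int i = 0; i < numElts; i++) {
                    temp += data[i];
                }
            } finally {
                write(moments, 0, temp); // the single instrumented write
            }
        }
    }
}
```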
  • 13. 2. SCALAR PROMOTION IN ATOMIC BLOCKS
        for (int i = 0; i < num_elts; i++) {
            moments[0] += data[i];
        }
    Key insight (same as for load elimination):  No need to check if moments[0] can change in other threads.  Still need to check that it cannot change locally or through method calls.
  • 14. LOAD ELIMINATION AND SCALAR PROMOTION ADVANTAGES  These optimizations are sound for every STM protocol that guarantees transaction isolation.  Lazy-update protocols, like TL2, gain the most, since reads and writes are expensive.  A read looks up the value in the writeset before looking at the memory location.  A write adds a value to the writeset, or replaces one already there.  Let’s improve it further…
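Why reads and writes are expensive under lazy update can be seen in a minimal model of the writeset. The sketch below is an assumption-laden illustration, not TL2 itself: `LazyUpdateTx` is a hypothetical class, memory is modeled as an `int[]`, and locking, version validation, and the readset are left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

final class LazyUpdateTx {
    private final int[] memory;
    // Pending writes, keyed by address; applied to memory only at commit.
    private final Map<Integer, Integer> writeset = new HashMap<>();
    int writesetLookups = 0;

    LazyUpdateTx(int[] memory) { this.memory = memory; }

    // Every read first consults the writeset, then falls back to memory.
    int read(int addr) {
        writesetLookups++;
        Integer pending = writeset.get(addr);
        return pending != null ? pending : memory[addr];
    }

    // A write adds a new entry or replaces the existing one; memory is untouched.
    void write(int addr, int value) {
        writeset.put(addr, value);
    }

    // Commit publishes all pending writes (synchronization omitted in this sketch).
    void commit() {
        for (Map.Entry<Integer, Integer> e : writeset.entrySet()) {
            memory[e.getKey()] = e.getValue();
        }
        writeset.clear();
    }
}
```

Every `read` pays for a writeset lookup even when the location was never written in the transaction.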
  • 15. 3. REDUNDANT WRITESET LOOKUPS  Consider a transactional read: x = o.f;  If we know that we didn’t yet write to o.f in this transaction…  … then we can skip looking in the writeset!  Analysis: discover redundant writeset lookups using static analysis.  Use data flow analysis to simulate readset at compile-time.  Associate every abstract memory location with a tag saying whether this location was already written to or not.  Analyze only inside transaction boundaries.  Interprocedural, flow-sensitive, forward analysis.
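The tagging idea can be illustrated on straight-line code. The sketch below is a deliberately reduced model, not the interprocedural analysis from the talk: it assumes a transaction body given as a flat list of accesses to named abstract locations, with no branches, loops, calls, or aliasing. `WritesetLookupAnalysis` and `Access` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WritesetLookupAnalysis {
    // One transactional access to an abstract memory location such as "o.f".
    static final class Access {
        final boolean isWrite;
        final String location;

        private Access(boolean isWrite, String location) {
            this.isWrite = isWrite;
            this.location = location;
        }

        static Access read(String location)  { return new Access(false, location); }
        static Access write(String location) { return new Access(true, location); }
    }

    // Forward pass: result.get(k) is true when access k is a read whose
    // location has not been written earlier in the transaction.
    static List<Boolean> canSkipLookup(List<Access> body) {
        Set<String> maybeWritten = new HashSet<>();
        List<Boolean> skip = new ArrayList<>();
        for (Access a : body) {
            if (a.isWrite) {
                maybeWritten.add(a.location); // tag the location as written
                skip.add(false);
            } else {
                skip.add(!maybeWritten.contains(a.location));
            }
        }
        return skip;
    }
}
```

With control flow, the tags from different predecessors would have to be merged conservatively (a location counts as written if it may have been written on any path), which this sketch does not model.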
  • 16. 4. REDUNDANT WRITESET RECORDKEEPING  Consider a transactional write: o.f = x; If we know that we aren’t going to read o.f in this transaction…  … then we can perform a cheaper writeset insert.  e.g.: by not updating the Bloom filter.   Analysis: discover redundant writeset recordkeeping using static analysis.  Use data flow analysis to simulate writeset at compile-time.  Associate every abstract memory location with a tag saying whether this location is going to be read.  Analyze only inside transaction boundaries.  Interprocedural, flow-sensitive, backward analysis.
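The backward direction can be illustrated on the same kind of reduced model. Again this is a sketch under strong assumptions (a flat list of accesses to named abstract locations, no branches, calls, or aliasing), and `WritesetRecordkeepingAnalysis` and `Access` are hypothetical names. It stays conservative by never removing a location from the "may be read later" set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WritesetRecordkeepingAnalysis {
    // One transactional access to an abstract memory location such as "o.f".
    static final class Access {
        final boolean isWrite;
        final String location;

        private Access(boolean isWrite, String location) {
            this.isWrite = isWrite;
            this.location = location;
        }

        static Access read(String location)  { return new Access(false, location); }
        static Access write(String location) { return new Access(true, location); }
    }

    // Backward pass: result.get(k) is true when access k is a write whose
    // location is not read later in the transaction, so a cheaper writeset
    // insert (for example, one that skips the Bloom filter update) would do.
    static List<Boolean> canUseCheapInsert(List<Access> body) {
        Set<String> maybeReadLater = new HashSet<>();
        Boolean[] cheap = new Boolean[body.size()];
        for (int k = body.size() - 1; k >= 0; k--) {
            Access a = body.get(k);
            if (a.isWrite) {
                cheap[k] = !maybeReadLater.contains(a.location);
            } else {
                maybeReadLater.add(a.location); // tag the location as read later
                cheap[k] = false;
            }
        }
        return Arrays.asList(cheap);
    }
}
```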
  • 17. EXPERIMENTS  We created analyses and transformations for these 4 optimizations.  Software used: Deuce STM with TL2 protocol.  Soot Java Optimization Framework.  STAMP and microbenchmarks.   Hardware used:  Sun UltraSPARC T2 Plus with 2 CPUs × 8 cores × 8 hardware threads.
  • 18. READING THE RESULTS  Chart legend: Unoptimized; + Immutable, + Transaction Local, + ThreadLocal; + Load Elimination; + Redundant Writeset Lookups; + Redundant Writeset Recordkeeping.  Benchmark parameters: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input
  • 19. RESULTS: K-MEANS  Load Elimination inside tight loops (e.g., new_centers[index] from the example).  Benchmark parameters: -m 40 -n 40 -t 0.001 -i random-n16384-d24-c16.input
  • 20. RESULTS: LINKED LIST  Locating the position of the element in all three add(), remove() and contains() transactions involves many reads to locations not written to before.  Benchmark parameters: 10% write operations, 20 seconds, 10K items, 26K possible range.
  • 21. RESULTS: SSCA2  Many small transactions that update single shared values, and don’t read them thereafter.  Benchmark parameters: -s 18 -i1.0 -u1.0 -l3 -p3
  • 22. ANALYSIS  Load Elimination had the largest impact (up to 29% speedup).  No example of Scalar Promotion was found. (rare phenomenon or bad luck?)
  • 23. ANALYSIS  In transactions that perform many reads before writes, skipping the writeset lookups increased throughput by up to 28%.  Even in transactions that don’t read values after they are written, skipping the writeset recordkeeping gained no more than 4% speedup.
  • 24. SUMMARY  We presented 4 STM compiler optimizations.  Optimizations are biased towards lazy-update STMs, but can be applied with some changes to inplace-update STMs.