Optimizing Memory Transactions for
Large-Scale Programs
Fernando Miguel Carvalho
Supervisor: João Cachopo
Software Engineering Group
May 9, 2014
Today
Software Development
time
Shared Memory
The Problem
Thread 1
Thread 2
Thread 3
RW RW W R R W W
Shared Memory Synchronization
synchronize
The Goal
Existing Solutions
lock(obj){
...
...
}
synchronized(obj){
...
...
}
Programming fine-grained locks is hard
...
case ST7:
locksToAcquire.add(documentWriteLock);
break;
case Q6:
locksToAcquire.add(assemblyReadLocks[BASE_ASSEMBLY_LEVEL]);
locksToAcquire.add(compositePartReadLock);
for (int level = Parameters.NumAssmLevels; level > 1; level--)
locksToAcquire.add(assemblyReadLocks[level]);
break;
case ST4:
locksToAcquire.add(assemblyReadLocks[BASE_ASSEMBLY_LEVEL]);
locksToAcquire.add(documentReadLock);
break;
...
Medium-grained lock in STMBench7
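The per-operation lock lists above are what make this style tedious: every operation has to name each lock it needs, and all threads must acquire them in one consistent order to avoid deadlock. The following is a minimal sketch of that pattern, not STMBench7's actual code. `OrderedLocks`, `RankedLock`, `DOCUMENT` and `COMPOSITE_PART` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical helper (not from STMBench7): acquires a set of locks in one global order.
class OrderedLocks {
    // Each lock carries a fixed rank; every thread acquires by ascending rank.
    static final class RankedLock {
        final int rank;
        final Lock lock;
        RankedLock(int rank, Lock lock) { this.rank = rank; this.lock = lock; }
    }

    // Illustrative locks standing in for the document and composite-part locks.
    static final ReentrantReadWriteLock DOCUMENT = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock COMPOSITE_PART = new ReentrantReadWriteLock();

    static void runLocked(List<RankedLock> locksToAcquire, Runnable body) {
        List<RankedLock> ordered = new ArrayList<>(locksToAcquire);
        ordered.sort(Comparator.comparingInt((RankedLock l) -> l.rank));
        int acquired = 0;
        try {
            for (RankedLock l : ordered) {
                l.lock.lock();
                acquired++;
            }
            body.run();
        } finally {
            // Release in reverse order, and only the locks actually taken.
            for (int i = acquired - 1; i >= 0; i--) {
                ordered.get(i).lock.unlock();
            }
        }
    }
}
```

Even in this small form, correctness depends on every caller listing the right locks with the right ranks, which is the burden the slide points at.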
Coarse-grained lock prevents scalability
synchronized void m(){
...
...
}
[Diagram: timeline of Thread 1 and Thread 2 calling m() on shared memory; Thread 2 is waiting while Thread 1 performs its reads and writes]
Software Transactional Memory
STM
Atomicity + Consistency + Isolation
STM
atomic void m(){
...
...
}
@Atomic void m(){
...
...
}
Deuce STM framework
[Diagram: timeline of Thread 1 and Thread 2 calling m() on shared memory, first with synchronized and then through the STM]
STM… overheads
@Atomic void m(){
...
...
}
[Diagram: with the STM, each thread's call to m() is wrapped in Trx Begin and Trx Commit, and every shared-memory read or write goes through a barrier; shown against the lock-based timeline in which Thread 2 is waiting]
A large-scale benchmark for Java
[Figure: StmBench7 read dominated. Throughput (×10³ ops/sec) versus threads (1, 2, 4, 8, 16, 24, 32, 40, 48). Series: seq-1thread, coarse-lock, medium-lock, jvstm]
These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
[Diagram: recap of the STM timeline (Trx Begin, barriers, Trx Commit) and the lock-based timeline, then a single read (R) of shared memory behind a barrier]
Simple Memory Access
[Diagram: a read (R) follows ref directly to :Point {x: 73, y: 11} in shared memory]
Transactional Memory Access
[Diagram: a read (R) goes through a barrier. The :Point {body, x: 73, y: 11} references STM Metadata through its body field: a :VBoxBody {version: 23, value: :Point {x: 17, y: 71}} whose previous is a :VBoxBody {version: 0, value: :Point {x: 73, y: 11}, previous: null}. The access is also logged in the TRX read-set and write-set]
Do we always need this overhead?
• STM API indirection
• STM Metadata indirection
• Logging accesses in the read-set and write-set
Yes, for data under contention
No, for non-contended data
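To make the three overheads concrete, here is a minimal sketch of a multi-version box in the spirit of the :VBoxBody diagrams. It is a simplification for illustration, not JVSTM's implementation: `VBoxSketch`, `VersionedBody` and `TxnSketch` are hypothetical names, and commits are assumed to be single-threaded.

```java
import java.util.HashSet;
import java.util.Set;

// One committed version of a value; newer bodies link to older ones via previous.
final class VersionedBody<T> {
    final T value;
    final int version;
    final VersionedBody<T> previous;
    VersionedBody(T value, int version, VersionedBody<T> previous) {
        this.value = value;
        this.version = version;
        this.previous = previous;
    }
}

// Hypothetical simplified box holding the version chain (the STM metadata).
final class VBoxSketch<T> {
    volatile VersionedBody<T> body;
    VBoxSketch(T initial) { body = new VersionedBody<>(initial, 0, null); }

    // Assumption: called by one committer at a time; a real STM synchronizes commits.
    void commit(T value, int version) { body = new VersionedBody<>(value, version, body); }
}

// Hypothetical transaction that reads at a fixed snapshot version.
final class TxnSketch {
    final int snapshotVersion;
    final Set<VBoxSketch<?>> readSet = new HashSet<>();
    TxnSketch(int snapshotVersion) { this.snapshotVersion = snapshotVersion; }

    <T> T read(VBoxSketch<T> box) {          // overhead 1: call through the STM API
        readSet.add(box);                    // overhead 3: logging in the read-set
        VersionedBody<T> b = box.body;       // overhead 2: metadata indirection
        while (b.version > snapshotVersion) {
            b = b.previous;                  // walk back to the version visible to this snapshot
        }
        return b.value;
    }
}
```

A plain field read does none of this, which is why skipping the barrier pays off whenever the data is known not to be under contention.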
Approach
Fast path for non-contended objects
[Diagram: an access through ref to a non-contended :Point {x: 73, y: 11} skips the :VBoxBody version chain and the TRX read-set and write-set]
3 cases of useless STM barriers
• non-contended classes
• non-shared objects
• shared but frequently non-contended objects
3 different techniques
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
Implemented for the JVM
Deuce STM framework
TL2 LSA JVSTM
May be combined…
Transparent STM API
public class Worm implements IWorm {
    final int id;
    final int headSize;
    final int speed;
    final BodyCoord[] body;
    public void moveBody(ICoordinate newCoordinate) {
        for (BodyCoord c : body) {
            ...
            c.update(newCoordinate);
            ...
        }
    }
}
[Callouts: four STM barriers marked on the code above]
Relax the STM API Transparency
@NoSyncArray(Immutable)
@NoSyncArray(TransactionLocal)
@NoSyncArray(ThreadLocal)
@NoSyncField(Immutable)
@NoSyncField(TransactionLocal)
@NoSyncField(ThreadLocal)
Carvalho & Cachopo, ICA3PP’11
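As an illustration of how such annotations could steer instrumentation, the sketch below declares a stand-in `@NoSyncField` and a check that decides whether a field still needs a barrier. The declarations are hypothetical: the real annotations in the extended Deuce API may have a different shape, and `NoSyncKind`, `BodyCoordSketch` and `BarrierPlanner` are names invented here.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical declarations; the actual extended Deuce API may differ.
enum NoSyncKind { Immutable, TransactionLocal, ThreadLocal }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NoSyncField {
    NoSyncKind value();
}

class BodyCoordSketch {
    @NoSyncField(NoSyncKind.Immutable)
    final int id = 7;   // never written after construction, so no barrier is needed
    int x;              // ordinary shared field, keeps its barrier
}

class BarrierPlanner {
    // True when an instrumenting tool would still have to emit an STM barrier for the field.
    static boolean needsBarrier(Class<?> owner, String fieldName) {
        try {
            Field f = owner.getDeclaredField(fieldName);
            return !f.isAnnotationPresent(NoSyncField.class);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("unknown field: " + fieldName, e);
        }
    }
}
```

The trade-off is the one the slide title names: the programmer gives up some API transparency in exchange for fewer barriers.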
Applied in 5 memory-location definitions in JWormBench
JWormBench
[Figure: JWormBench, 50% RO trxs, O(n²), N Reads, 1 Write. Throughput (×10³ ops/s) versus threads (1, 2, 4, 8, 16, 24, 32, 40, 48). Series: seq-1thread, jvstm, jvstm-5nosync]
These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
Deuce STM API with Auxiliary Annotations
Eliminates the following overheads:
Optimizations: STM API with Annotations
Overheads: STM API; STM Metadata; Logging read-set and write-set
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
Classes with shared and non-shared objects
[Diagram: :Point, :Account and :Person instances, each with a body, accessed by TRX 1 and TRX 2; each transaction has its own read-set and write-set]
Captured Memory
Dragojevic et al., SPAA’09
Objects captured by their allocating transaction
Proposed by Dragojevic et al. for an unmanaged environment
[Diagram: the same objects, with those allocated inside TRX 1 marked as captured]
e.g. Read STM Barrier in Deuce
function onReadAccess(ref, addr, val, ctx)
    return ctx.onReadAccess(ref, addr, val)
end function
[Diagram: ref points to the object, addr to the field holding val; ctx is the TRX with its read-set and write-set]
Runtime Capture Analysis
function onReadAccess(ref, addr, val, ctx)
    if isCaptured(ref, ctx) then
        return val
    else
        return ctx.onReadAccess(ref, addr, val)
    end if
end function
To improve the STM performance:
Overhead(isCaptured) << Overhead(ctx.onReadAccess)
LICM
Lightweight Identification of Captured Memory
Carvalho & Cachopo, PPoPP’13
• A runtime capture analysis technique
• For a managed runtime environment, such as Java
• Lightweight
[Diagram: the transaction context (ctx) carries a fingerprint and each object (ref) carries an owner; both hold a Trx Id]
static boolean isCaptured(Object ref, Context ctx){
return ctx.fingerprint == ref.owner;
}
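The slide's `isCaptured` can be turned into a small compilable sketch by giving `owner` and `fingerprint` concrete types. The classes below are hypothetical stand-ins (`TxObject`, `TxContext`, `CaptureBarriers`), with numeric fingerprints as in the diagrams; in Deuce the corresponding roles are played by instrumented classes and the transaction context.

```java
// Hypothetical transactional object: remembers which transaction allocated it.
class TxObject {
    final long owner;   // fingerprint of the allocating transaction
    int x;
    TxObject(long owner, int x) { this.owner = owner; this.x = x; }
}

// Hypothetical transaction context with a numeric fingerprint, as in the diagrams.
class TxContext {
    final long fingerprint;
    int loggedReads = 0;
    TxContext(long fingerprint) { this.fingerprint = fingerprint; }

    // Objects allocated inside the transaction are stamped with its fingerprint.
    TxObject allocate(int x) { return new TxObject(fingerprint, x); }

    // Stands in for the full STM read barrier (validation, read-set logging).
    int onReadAccess(TxObject ref, int val) {
        loggedReads++;
        return val;
    }
}

class CaptureBarriers {
    static boolean isCaptured(TxObject ref, TxContext ctx) {
        return ctx.fingerprint == ref.owner;
    }

    // Read barrier with runtime capture analysis: captured objects take the fast path.
    static int onReadAccess(TxObject ref, int val, TxContext ctx) {
        if (isCaptured(ref, ctx)) {
            return val;
        }
        return ctx.onReadAccess(ref, val);
    }
}
```

The check is a single comparison, which is what keeps Overhead(isCaptured) far below the cost of the full barrier.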
LICM
[Diagram: transactions with fingerprints 87, 73 and 91, each with a read-set and write-set; the :Point and :Account with owner: 87 are captured by the transaction whose fingerprint is 87, while the :Point with owner: 11 and the :Person with owner: 17 are not]
Challenge
Efficient process of generating fingerprints:
• Avoiding further synchronization
• Avoiding the counter rollover
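One way to meet both constraints, shown here only as an assumption-laden sketch, is to use a freshly allocated object as the fingerprint and compare by reference: no shared counter is touched, and object identity cannot wrap around the way a fixed-width counter can. `Fingerprints` and `TxContextSketch` are hypothetical names.

```java
// Hypothetical sketch: a fresh object per transaction serves as the fingerprint.
final class Fingerprints {
    // Allocation touches no shared counter, so it adds no synchronization point,
    // and a reference cannot roll over like an int or long counter.
    static Object next() {
        return new Object();
    }
}

final class TxContextSketch {
    private Object fingerprint;

    // Each transaction start (or restart) gets a new identity.
    void begin() {
        fingerprint = Fingerprints.next();
    }

    Object fingerprint() {
        return fingerprint;
    }

    // An object is captured when its owner is this transaction's current fingerprint.
    boolean isCaptured(Object owner) {
        return owner == fingerprint;
    }
}
```

An old fingerprint stays distinct for as long as any object still references it as its owner, so objects from a finished transaction are never mistaken for captured ones.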
JWormBench
[Figure: JWormBench, 50% RO trxs, O(n²), N Reads, 1 Write. Throughput (×10³ ops/s) versus threads (1, 2, 4, 8, 16, 24, 32, 40, 48). Series: seq-1thread, jvstm, jvstm-5nosync, jvstm-1nosync-licm]
These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
LICM
Eliminates the following overheads:
Optimizations: STM API with Annotations; LICM
Overheads: STM API; STM Metadata; Logging read-set and write-set
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
AOM
Adaptive Object Metadata
Carvalho & Cachopo, Multiprog’12
[Diagram: two object layouts. Compact: :Point {body: null, x: 73, y: 11}, with the values held in place and no :VBoxBody. Extended: :Point whose body references a :VBoxBody version chain (version: 23, then version: 0 with previous: null). Extending moves an object from the compact to the extended layout; reverting moves it back once a single :VBoxBody {version: 23, previous: null} remains, copying its values 17 and 71 in place]
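Extending – in transaction write-back
[Diagram: numbered steps 1 to 4 turn the compact :Point {body: null, x: 73, y: 11} into the extended layout. Step 1 is snapshot() and step 4 is CASbody(). The result is a :VBoxBody {version: 23, value: :Point {x: 17, y: 71}} whose previous is a :VBoxBody {version: 0, value: :Point {x: 73, y: 11}, previous: null}]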
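Reverting – part of the GC’s clean task
[Diagram: numbered steps 1 to 3 revert an extended :Point whose only :VBoxBody {version: 23, previous: null} holds the value :Point {x: 17, y: 71}. toCompactLayout() copies 17 and 71 into the in-place fields, and step 3, CASbody(), sets body to null]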
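The two transitions can be sketched with an `AtomicReference` standing in for the CAS on the body field. This is an illustration under simplifying assumptions, not the JVSTM or Deuce implementation: `AdaptivePoint`, `PointBody` and `trimHistory` are hypothetical, and the full protocol that protects concurrent readers during reverting is omitted.

```java
import java.util.concurrent.atomic.AtomicReference;

// One version in the extended layout.
final class PointBody {
    final int x, y;
    final int version;
    final PointBody previous;
    PointBody(int x, int y, int version, PointBody previous) {
        this.x = x;
        this.y = y;
        this.version = version;
        this.previous = previous;
    }
}

// Hypothetical object that switches between the compact and extended layouts.
final class AdaptivePoint {
    volatile int x, y;   // in-place values, meaningful while the layout is compact
    final AtomicReference<PointBody> body = new AtomicReference<>(null); // null means compact

    AdaptivePoint(int x, int y) { this.x = x; this.y = y; }

    boolean isCompact() { return body.get() == null; }

    // Extending: snapshot the in-place values as version 0, put the new version
    // in front of it, and publish the chain with a single CAS on body.
    boolean extend(int newX, int newY, int version) {
        PointBody snapshot = new PointBody(x, y, 0, null);
        PointBody head = new PointBody(newX, newY, version, snapshot);
        return body.compareAndSet(null, head);
    }

    // Stand-in for the cleanup that drops versions no running transaction can read.
    void trimHistory() {
        PointBody head = body.get();
        if (head != null && head.previous != null) {
            body.compareAndSet(head, new PointBody(head.x, head.y, head.version, null));
        }
    }

    // Reverting: with one version left, copy it in place and CAS body back to null.
    boolean revert() {
        PointBody head = body.get();
        if (head == null || head.previous != null) {
            return false;
        }
        x = head.x;
        y = head.y;
        return body.compareAndSet(head, null);
    }
}
```

Both transitions hinge on one compare-and-set, so a thread that loses the race simply observes the layout installed by the winner.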
JWormBench
[Figure: JWormBench, 50% RO trxs, O(n²), N Reads, 1 Write. Throughput (×10³ ops/s) versus threads (1, 2, 4, 8, 16, 24, 32, 40, 48). Series: seq-1thread, jvstm, jvstm-5nosync, jvstm-1nosync-licm, jvstm-1nosync-licm-aom]
AOM
Eliminates the following overheads:
Optimizations: STM API with Annotations; LICM; AOM
Overheads: STM API; STM Metadata; Logging read-set and write-set
Memory Consumption
[Figure: STMBench7 Read Dominated. Memory (Mb, 0 to 2500) versus time (seconds, 0 to 1800). Series: jvstm, jvstm-aom]
• non-shared objects
=> Runtime analysis: LICM
• shared but frequently non-contended objects
=> Runtime adaptive technique: AOM
STM versus Medium-grained Lock
in a large-scale benchmark, such as STMBench7
STMBench7
[Figure: StmBench7 read dominated. Throughput (×10³ ops/sec) versus threads (1, 2, 4, 8, 16, 24, 32, 40, 48). Series: medium-lock, coarse-lock, jvstm, jvstm-licm, jvstm-licm-aom]
These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
Vacation
[Figure: Vacation low contention. Throughput (×10³ ops/sec) versus threads (1, 2, 4, 8, 16, 24, 32, 40, 48). Series: tl2, tl2-licm, jvstm, jvstm-licm-aom]
These tests were performed with the following configuration:
-n 256 -q 90 -u 98 -r 262144 -t 65536, proposed by [Cao Minh et al., 2008]
Main Contributions
• JWormBench—A flexible benchmark for transactional
synchronization
• 3 optimization proposals
– Extended Deuce API
– LICM—Lightweight Identification of Captured Memory
– AOM—Adaptive Object Metadata
• Implementation of these techniques in Deuce STM
framework
• Support for in-place metadata in Deuce STM framework
• Fast access path for non-contended objects: LICM + AOM
International conferences and workshops:
• STM with transparent API considered harmful
Springer-Verlag, ICA3PP’11, Melbourne, Australia.
• Adaptive object metadata to reduce the overheads of a multi-versioning STM
MULTIPROG’12, Paris, France.
• Objects with adaptive accessors to avoid STM barriers
WTM’12, Bern, Switzerland.
• Runtime elision of transactional barriers for captured memory
ACM, PPoPP ’13, Shenzhen, China
• Lightweight identification of captured memory for Software Transactional
Memory. Springer-Verlag, ICA3PP’13, Sorrento, Italy -- Best Paper Award
In progress:
• Journal of Parallel and Distributed Computing, Elsevier:
Optimizing memory transactions for large-scale programs
• Information Sciences, Elsevier:
Optimizing memory transactions with lightweight capture analysis

More Related Content

What's hot

Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_maps
lcplcp1
 
Jvm Performance Tunning
Jvm Performance TunningJvm Performance Tunning
Jvm Performance Tunning
guest1f2740
 
Loom and concurrency latest
Loom and concurrency latestLoom and concurrency latest
Loom and concurrency latest
Srinivasan Raghavan
 
Lowering STM Overhead with Static Analysis
Lowering STM Overhead with Static AnalysisLowering STM Overhead with Static Analysis
Lowering STM Overhead with Static Analysis
Guy Korland
 
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
Zhen Wei
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityDefconRussia
 
FreeRTOS
FreeRTOSFreeRTOS
FreeRTOS
Ankita Tiwari
 
The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014
Jian-Hong Pan
 
Refactoring for testability c++
Refactoring for testability c++Refactoring for testability c++
Refactoring for testability c++
Dimitrios Platis
 
FPGA design with CλaSH
FPGA design with CλaSHFPGA design with CλaSH
FPGA design with CλaSHConrad Parker
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
Sergey Platonov
 
Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...
Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...
Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...
Cybersecurity Education and Research Centre
 
Multithreading done right
Multithreading done rightMultithreading done right
Multithreading done right
Platonov Sergey
 
Silicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM MechanicsSilicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM Mechanics
Azul Systems, Inc.
 
Blocks & GCD
Blocks & GCDBlocks & GCD
Blocks & GCD
rsebbe
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL Generation
Akihiro Hayashi
 
Kernelvm 201312-dlmopen
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopenHajime Tazaki
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
Open Party
 
LLVM Register Allocation
LLVM Register AllocationLLVM Register Allocation
LLVM Register Allocation
Wang Hsiangkai
 

What's hot (20)

Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_maps
 
Jvm Performance Tunning
Jvm Performance TunningJvm Performance Tunning
Jvm Performance Tunning
 
ocelot
ocelotocelot
ocelot
 
Loom and concurrency latest
Loom and concurrency latestLoom and concurrency latest
Loom and concurrency latest
 
Lowering STM Overhead with Static Analysis
Lowering STM Overhead with Static AnalysisLowering STM Overhead with Static Analysis
Lowering STM Overhead with Static Analysis
 
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software security
 
FreeRTOS
FreeRTOSFreeRTOS
FreeRTOS
 
The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014The Simple Scheduler in Embedded System @ OSDC.TW 2014
The Simple Scheduler in Embedded System @ OSDC.TW 2014
 
Refactoring for testability c++
Refactoring for testability c++Refactoring for testability c++
Refactoring for testability c++
 
FPGA design with CλaSH
FPGA design with CλaSHFPGA design with CλaSH
FPGA design with CλaSH
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...
Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...
Novel Instruction Set Architecture Based Side Channels in popular SSL/TLS Imp...
 
Multithreading done right
Multithreading done rightMultithreading done right
Multithreading done right
 
Silicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM MechanicsSilicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM Mechanics
 
Blocks & GCD
Blocks & GCDBlocks & GCD
Blocks & GCD
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL Generation
 
Kernelvm 201312-dlmopen
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopen
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
LLVM Register Allocation
LLVM Register AllocationLLVM Register Allocation
LLVM Register Allocation
 

Similar to opt-mem-trx

Java Memory Model
Java Memory ModelJava Memory Model
Java Memory Model
Łukasz Koniecki
 
spinlock.pdf
spinlock.pdfspinlock.pdf
spinlock.pdf
Adrian Huang
 
Real-Time Load Balancing of an Interactive Mutliplayer Game Server
Real-Time Load Balancing of an Interactive Mutliplayer Game ServerReal-Time Load Balancing of an Interactive Mutliplayer Game Server
Real-Time Load Balancing of an Interactive Mutliplayer Game ServerJames Munro
 
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
Bo-Yi Wu
 
bluespec talk
bluespec talkbluespec talk
bluespec talk
Suman Karumuri
 
BlueHat v18 || A mitigation for kernel toctou vulnerabilities
BlueHat v18 || A mitigation for kernel toctou vulnerabilitiesBlueHat v18 || A mitigation for kernel toctou vulnerabilities
BlueHat v18 || A mitigation for kernel toctou vulnerabilities
BlueHat Security Conference
 
Memory model
Memory modelMemory model
Memory model
Yi-Hsiu Hsu
 
Memory model
Memory modelMemory model
Memory model
MingdongLiao
 
Persistent Memory Programming with Java*
Persistent Memory Programming with Java*Persistent Memory Programming with Java*
Persistent Memory Programming with Java*
Intel® Software
 
[Blackhat EU'14] Attacking the Linux PRNG on Android and Embedded Devices
[Blackhat EU'14] Attacking the Linux PRNG on Android and Embedded Devices[Blackhat EU'14] Attacking the Linux PRNG on Android and Embedded Devices
[Blackhat EU'14] Attacking the Linux PRNG on Android and Embedded Devices
srkedmi
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
Ceph Community
 
The Silence of the Canaries
The Silence of the CanariesThe Silence of the Canaries
The Silence of the Canaries
Kernel TLV
 
Coding style for good synthesis
Coding style for good synthesisCoding style for good synthesis
Coding style for good synthesis
Vinchipsytm Vlsitraining
 
Jvm Performance Tunning
Jvm Performance TunningJvm Performance Tunning
Jvm Performance Tunning
Terry Cho
 
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
Kinson Chan
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?
Pradeep Kumar
 
Ceph Day New York 2014: Ceph, a physical perspective
Ceph Day New York 2014: Ceph, a physical perspective Ceph Day New York 2014: Ceph, a physical perspective
Ceph Day New York 2014: Ceph, a physical perspective
Ceph Community
 
Counter Wars (JEEConf 2016)
Counter Wars (JEEConf 2016)Counter Wars (JEEConf 2016)
Counter Wars (JEEConf 2016)
Alexey Fyodorov
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Tokyo Institute of Technology
 

Similar to opt-mem-trx (20)

Java Memory Model
Java Memory ModelJava Memory Model
Java Memory Model
 
Lec07 threading hw
Lec07 threading hwLec07 threading hw
Lec07 threading hw
 
spinlock.pdf
spinlock.pdfspinlock.pdf
spinlock.pdf
 
Real-Time Load Balancing of an Interactive Mutliplayer Game Server
Real-Time Load Balancing of an Interactive Mutliplayer Game ServerReal-Time Load Balancing of an Interactive Mutliplayer Game Server
Real-Time Load Balancing of an Interactive Mutliplayer Game Server
 
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
 
bluespec talk
bluespec talkbluespec talk
bluespec talk
 
BlueHat v18 || A mitigation for kernel toctou vulnerabilities
BlueHat v18 || A mitigation for kernel toctou vulnerabilitiesBlueHat v18 || A mitigation for kernel toctou vulnerabilities
BlueHat v18 || A mitigation for kernel toctou vulnerabilities
 
Memory model
Memory modelMemory model
Memory model
 
Memory model
Memory modelMemory model
Memory model
 
opt-mem-trx

  • 1. Optimizing Memory Transactions for Large-Scale Programs Fernando Miguel Carvalho Supervisor: João Cachopo Software Engineering Group May 9, 2014
  • 4. time Shared Memory The Problem Thread 1 Thread 2 Thread 3 RW RW W R R W W Shared Memory Synchronization synchronize
  • 8. Programming fine-grained locks is hard ... case ST7: locksToAcquire.add(documentWriteLock); break; case Q6: locksToAcquire.add(assemblyReadLocks[BASE_ASSEMBLY_LEVEL]); locksToAcquire.add(compositePartReadLock); for (int level = Parameters.NumAssmLevels; level > 1; level--) locksToAcquire.add(assemblyReadLocks[level]); break; case ST4: locksToAcquire.add(assemblyReadLocks[BASE_ASSEMBLY_LEVEL]); locksToAcquire.add(documentReadLock); break; ... Medium-grained lock in STMBench7
  • 9. Coarse-grained lock prevents scalability time Shared Memory Thread 1 m() W R synchronized void m(){ ... ... } Thread 2 m()
  • 10. Coarse-grained lock prevents scalability time Shared Memory W R W R W waiting... synchronized void m(){ ... ... } Thread 1 m() Thread 2 m()
  • 11. Coarse-grained lock prevents scalability time Shared Memory RW R RW R W waiting... W Thread 1 m() Thread 2 m()
  • 12. Software Transactional Memory STM Atomicity + Consistency + Isolation
  • 13. STM time Shared Memory RW R RW R W waiting... W Thread 1 m() Thread 2 m() synchronized void m(){ ... ... }
  • 14. synchronized void m(){ ... ... } STM time Shared Memory W R W R W waiting... R RW STM Thread 1 m() Thread 2 m() atomic void m(){ ... ... } @Atomic void m(){ ... ... } Deuce STM framework
  • 15. STM time Shared Memory W R W R W waiting... R RW STM Thread 1 m() Thread 2 m() @Atomic void m(){ ... ... }
  • 16. STM… overheads time Shared Memory W R W R WR RW STM Trx Begin Trx Begin Thread 1 m() Thread 2 m() @Atomic void m(){ ... ... }
  • 17. time Shared Memory W R W R WR RW STM Trx Begin Trx Begin barrier barrier barrier barrier barrier barrier Trx Commit Trx Commit Thread 1 m() Thread 2 m() STM… overheads @Atomic void m(){ ... ... }
  • 18. Shared Memory Thread 1 m() Thread 2 m() W R WR W STM Trx Begin Trx Begin barrier barrier barrier barrier barrier barrier Trx Commit Trx Commit R W R STM… overheads @Atomic void m(){ ... ... }
  • 19. Shared Memory Thread 1 Thread 2 STM Trx Begin Trx Begin Trx Commit Trx Commit W R WR W barrier barrier barrier barrier barrier barrier R W R Shared Memory Thread 1 Thread 2 RW R RW R W waiting... W time
  • 20. A large-scale benchmark for Java. These tests were performed on a machine with 4 AMD Opteron 6168 processors, each one with 12 cores, resulting in a total of 48 cores. [Plot: StmBench7, read dominated; throughput (×10³ ops/sec) vs. threads; series: seq-1thread]
  • 21. A large-scale benchmark for Java. [Plot: StmBench7, read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock]
  • 22. A large-scale benchmark for Java. [Plot: StmBench7, read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock, jvstm]
  • 23. A large-scale benchmark for Java. [Plot: StmBench7, read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, medium-lock, coarse-lock, jvstm]
  • 24. Shared Memory Thread 1 Thread 2 STM Trx Begin Trx Begin Trx Commit Trx Commit W R WR W barrier barrier barrier barrier barrier barrier R W R Shared Memory Thread 1 Thread 2 RW R RW R W waiting... W
  • 25. Shared Memory Thread 1 Thread 2 STM Trx Begin Trx Begin Trx Commit Trx Commit Shared Memory Thread 1 Thread 2 RW R RW W waiting... WR Shared Memory Shared Memory W R WR W barrier barrier barrier barrier barrier barrier R W RR barrier
  • 28. :Point x: 73 y: 11 :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 ref STM Metadata :Point x: 73 y: 11 ref Simple Memory Access Transactional Memory Access R R barrier Shared Memory Shared Memory :Point body: x: 73 y: 11 TRX
  • 29. TRX read-set write-set :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 ref STM Metadata :Point x: 73 y: 11 ref Simple Memory Access Transactional Memory Access R R barrier Shared Memory Shared Memory :Point x: 73 y: 11 :Point body: x: 73 y: 11
  • 30. :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 ref STM Metadata Transactional Memory Access R barrier Shared Memory :Point x: 73 y: 11 :Point body: x: 73 y: 11 TRX read-set write-set Do we always need this overhead? • STM API indirection • STM Metadata indirection • Logging accesses in the read-set and write-set
  • 31. Yes, for data under contention
  • 33. :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 ref Approach TRX read-set write-set :Point body: x: 73 y: 11
  • 34. :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 ref Fast path for non-contended objects :Point body: x: 73 y: 11 TRX read-set write-set
  • 35. 3 cases of useless STM barriers • non-contended classes • non-shared objects • shared but frequently non-contended objects
  • 36. 3 different techniques • non-contended classes => Compile time technique • non-shared objects => Runtime analysis • shared but frequently non-contended objects => Runtime adaptive technique
  • 37. Implemented for the JVM • non-contended classes => Compile time technique • non-shared objects => Runtime analysis • shared but frequently non-contended objects => Runtime adaptive technique Deuce STM framework TL2 LSA JVSTM
  • 38. May be combined… • non-contended classes => Compile time technique • non-shared objects => Runtime analysis • shared but frequently non-contended objects => Runtime adaptive technique Deuce STM framework TL2 LSA JVSTM
  • 39. • non-contended classes => Compile time technique • non-shared objects => Runtime analysis • shared but frequently non-contended objects => Runtime adaptive technique
  • 40. Transparent STM API public class Worm implements IWorm { final int id; final int headSize; final int speed; final BodyCoord[] body; public void moveBody(ICoordinate newCoordinate) { for(BodyCoord c: body) { ... c.update(newCoordinate); ... } } } STM barrier STM barrier STM barrier STM barrier
  • 41. Relax the STM API Transparency @NoSyncArray(Immutable) @NoSyncArray(TransactionLocal) @NoSyncArray(ThreadLocal) @NoSyncField(Immutable) @NoSyncField(TransactionLocal) @NoSyncField(ThreadLocal) Carvalho & Cachopo, ICA3PP’11
  • 42. In 5 different memory locations definitions JWormBench @NoSyncArray(Immutable) @NoSyncArray(TransactionLocal) @NoSyncArray(ThreadLocal) @NoSyncField(Immutable) @NoSyncField(TransactionLocal) @NoSyncField(ThreadLocal) Carvalho & Cachopo, ICA3PP’11
  • 43. JWormBench. These tests were performed on a machine with 4 AMD Opteron 6168 processors, each one with 12 cores, resulting in a total of 48 cores. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s); series: seq-1thread]
  • 44. JWormBench. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm]
  • 45. JWormBench. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync]
  • 46. Deuce STM API with Auxiliary Annotations Eliminates the following overheads: Optimizations STM API with Annotations Overheads STM API STM Metadata Logging read-set and write-set
  • 47. • non-contended classes => Compile time technique • non-shared objects => Runtime analysis • shared but frequently non-contended objects => Runtime adaptive technique
  • 48. :... body:... ... Classes with shared and non-shared objects :Point body:... ... TRX 1 read-set write-set TRX 2 read-set write-set :Account body:... ... :Person body:... ... :Point body:... ... :Account body:... ...
  • 49. TRX 1TRX 1 read-set write-set :... body:... ... Captured Memory :Point body:... ... read-set write-set TRX 2 read-set write-set :Account body:... ... :Person body:... ... :Point body:... ... Dragojevic et al., SPAA’09 Captured by their allocating transaction :Account body:... ... Proposed by Dragojevic et al. for an unmanaged environment
  • 50. e.g. Read STM Barrier in Deuce function onReadAccess(ref, addr, val, ctx) return ctx.onReadAccess(ref, addr, val) end function :… …: val …: … refctx addr TRX read-set write-set
  • 51. function onReadAccess(ref, addr, val, ctx) return ctx.onReadAccess(ref, addr, val) end function Runtime Capture Analysis function onReadAccess(ref, addr, val, ctx) if isCaptured(ref, ctx) then return val else return ctx.onReadAccess(ref, addr, val) end if end function :… …: val …: … ref addr ctx TRX read-set write-set
  • 52. Runtime Capture Analysis function onReadAccess(ref, addr, val, ctx) if isCaptured(ref, ctx) then return val else return ctx.onReadAccess(ref, addr, val) end if end function Overhead(isCaptured) << Overhead(ctx.onReadAccess) To improve the STM performance:
  • 53. TRX LICM Lightweight Identification of Captured Memory Carvalho & Cachopo, PPoPP’13 :… …: val …: … ref ctx • A runtime capture analysis technique • For a managed runtime environment, such as Java • Lightweight
  • 54. TRX function onReadAccess(ref, addr, val, ctx) if isCaptured(ref, ctx) then return val else return ctx.onReadAccess(ref, addr, val) end if end function LICM Lightweight Identification of Captured Memory Carvalho & Cachopo, PPoPP’13 :… …: val …: … ref ctx fingerprint: :… owner: …: val …: … Trx Id Trx Id static boolean isCaptured(Object ref, Context ctx){ return ctx.fingerprint == ref.owner; }
  • 55. TRX read-set write-set fingerprint: 87 :... owner:... ... LICM :Point owner:11 ... TRX read-set write-set :Account owner:... ... :Person owner:17 ... :Point owner: 87 ... ... :Account owner: 87 ... fingerprint: 73
  • 56. TRX read-set write-set fingerprint: 87 :... owner:... ... LICM :Point owner:11 ... :Account owner:... ... :Person owner:17 ... :Point owner: 87 ... ... :Account owner: 87 ... TRX read-set write-set fingerprint: 73 TRX read-set write-set fingerprint: 91
  • 57. Challenge Efficient process of generating fingerprints: • Avoiding further synchronization • Avoiding the counter rollover TRX : Object …: … ref ctx fingerprint: :… owner: …: val …: … Trx Id Trx Id :... owner: ... :... owner: ...
  • 58. JWormBench. These tests were performed on a machine with 4 AMD Opteron 6168 processors, each one with 12 cores, resulting in a total of 48 cores. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync]
  • 59. JWormBench. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync, jvstm-1nosync-licm]
  • 60. LICM Eliminates the following overheads: Optimizations STM API with Annotations LICM Overheads STM API STM Metadata Logging read-set and write-set
  • 61. • non-contended classes => Compile time technique • non-shared objects => Runtime analysis • shared but frequently non-contended objects => Runtime adaptive technique
  • 62. AOM Adaptive Object Metadata Carvalho & Cachopo, Multiprog’12 :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 :Point body: x: 73 y: 11
  • 63. AOM Adaptive Object Metadata Carvalho & Cachopo, Multiprog’12 :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 Compact Extended :Point body: x: 73 y: 11 :Point body: null x: 73 y: 11 extending
  • 64. AOM Adaptive Object Metadata Carvalho & Cachopo, Multiprog’12 :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 Compact Extended :Point body: x: 73 y: 11 :Point body: null x: 73 y: 11 extending
  • 65. AOM Adaptive Object Metadata Carvalho & Cachopo, Multiprog’12 :VBoxBody previous: null version: 23 value: :Point x: 17 y: 71 Compact Extended :Point body: x: 73 y: 11 :Point body: null x: 73 y: 11 17 71 extending reverting
  • 66. Extending – in transaction write-back :Point body: null x: 73 y: 11 1 snapshot() :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 2 3 4 CASbody()
  • 67. Reverting – part of the GC’s clean task :Point body: null x: 73 y: 11 :VBoxBody previous: null version: 23 value: :Point x: 17 y: 71 17 71 1 2 toCompactLayout() null 3 CASbody()
  • 68. JWormBench. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync]
  • 69. JWormBench. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync, jvstm-1nosync-licm]
  • 70. JWormBench. [Plot: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync, jvstm-1nosync-licm, jvstm-1nosync-licm-aom]
  • 71. AOM Eliminates the following overheads: Optimizations STM API with Annotations LICM AOM Overheads STM API STM Metadata Logging read-set and write-set
  • 72. Memory Consumption. [Plot: STMBench7, read dominated; memory (Mb) vs. time (seconds); series: jvstm, jvstm-aom]
  • 73. • non-shared objects => Runtime analysis • shared but frequently non-contended objects => Runtime adaptive technique LICM AOM STM <versus> Medium-grained Lock in a large-scale benchmark, such as STMBench7
  • 74. STMBench7. These tests were performed on a machine with 4 AMD Opteron 6168 processors, each one with 12 cores, resulting in a total of 48 cores. [Plot: StmBench7, read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: medium-lock, coarse-lock, jvstm]
  • 75. STMBench7. [Plot: StmBench7, read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: medium-lock, coarse-lock, jvstm, jvstm-licm]
  • 76. STMBench7. [Plot: StmBench7, read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: medium-lock, coarse-lock, jvstm, jvstm-licm, jvstm-licm-aom]
  • 77. Vacation. These tests were performed with the following configuration: -n 256 -q 90 -u 98 -r 262144 -t 65536, proposed by [Cao Minh et al., 2008]. [Plot: Vacation, low contention; throughput (×10³ ops/sec) vs. threads (1 to 48); series: tl2, tl2-licm, jvstm, jvstm-licm-aom]
  • 78. Main Contributions • JWormBench: a flexible benchmark for transactional synchronization • 3 optimization proposals – Extended Deuce API – LICM: Lightweight Identification of Captured Memory – AOM: Adaptive Object Metadata • Implementation of these techniques in Deuce STM framework • Support for in-place metadata in Deuce STM framework • Fast access path for non-contended objects: LICM + AOM
  • 79. Main Contributions • JWormBench: a flexible benchmark for transactional synchronization • 3 optimization proposals – Extended Deuce API – LICM: Lightweight Identification of Captured Memory – AOM: Adaptive Object Metadata • Implementation of these techniques in Deuce STM framework • Support for in-place metadata in Deuce STM framework • Fast access path for non-contended objects: LICM + AOM :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 ref TRX read-set write-set :Point body: x: 73 y: 11
  • 80. Fast path for non-contended objects • JWormBench: a flexible benchmark for transactional synchronization • 3 optimization proposals – Extended Deuce API – LICM: Lightweight Identification of Captured Memory – AOM: Adaptive Object Metadata • Implementation of these techniques in Deuce STM framework • Support for in-place metadata in Deuce STM framework • Fast access path for non-contended objects: LICM + AOM :VBoxBody previous: version: 23 value: :VBoxBody previous: null version: 0 value: :Point x: 17 y: 71 :Point x: 73 y: 11 ref :Point body: x: 73 y: 11 TRX read-set write-set
  • 81. International conferences and workshops: • STM with transparent API considered harmful Springer-Verlag, ICA3PP’11, Melbourne, Australia. • Adaptive object metadata to reduce the overheads of a multi-versioning STM MULTIPROG’12, Paris, France. • Objects with adaptive accessors to avoid STM barriers WTM’12, Bern, Switzerland. • Runtime elision of transactional barriers for captured memory ACM, PPoPP ’13, Shenzhen, China • Lightweight identification of captured memory for Software Transactional Memory. Springer-Verlag, ICA3PP’13, Sorrento, Italy -- Best Paper Award In progress: • Journal of Parallel and Distributed Computing, Elsevier: Optimizing memory transactions for large-scale programs • Information Sciences, Elsevier: Optimizing memory transactions with lightweight capture analysis
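
The runtime capture analysis from slides 51 to 54 can be sketched in plain Java. This is only an illustration under stated assumptions, not the Deuce or LICM implementation: the `Context` and `Point` classes, the list used as a read-set, and the choice of a freshly allocated object as the transaction fingerprint are all hypothetical. Only the shape of `isCaptured` and `onReadAccess` follows the slides.

```java
import java.util.ArrayList;
import java.util.List;

class LicmSketch {

    /** Hypothetical transaction context (not Deuce's own Context type). */
    static final class Context {
        // Assumption: a freshly allocated object serves as the fingerprint,
        // so no shared counter is needed to keep fingerprints distinct.
        final Object fingerprint = new Object();
        // Stand-in for the read-set maintained by the full barrier.
        final List<Object> readSet = new ArrayList<>();

        /** Full STM read barrier: logs the access, then returns the value. */
        int onReadAccess(Object ref, int val) {
            readSet.add(ref);
            return val;
        }
    }

    /** Hypothetical transactional object; owner is set once at allocation. */
    static final class Point {
        final Object owner;
        final int x;

        Point(Context allocatingTx, int x) {
            this.owner = allocatingTx.fingerprint;
            this.x = x;
        }
    }

    /** An object is captured when the current transaction allocated it. */
    static boolean isCaptured(Point ref, Context ctx) {
        return ctx.fingerprint == ref.owner;
    }

    /** Read barrier with the capture check as a fast path. */
    static int onReadAccess(Point ref, int val, Context ctx) {
        if (isCaptured(ref, ctx)) {
            return val; // fast path: nothing is logged
        }
        return ctx.onReadAccess(ref, val); // slow path: full barrier
    }

    public static void main(String[] args) {
        Context tx1 = new Context();
        Context tx2 = new Context();
        Point captured = new Point(tx1, 73); // allocated by tx1
        Point shared = new Point(tx2, 17);   // allocated by another transaction

        onReadAccess(captured, captured.x, tx1);
        onReadAccess(shared, shared.x, tx1);
        // Only the read of the non-captured object reaches the read-set.
        System.out.println("logged reads in tx1: " + tx1.readSet.size()); // prints "logged reads in tx1: 1"
    }
}
```

The design point from slide 52 shows up in the two branches: the capture check is a single reference comparison, so it pays off whenever it is much cheaper than the full barrier it replaces.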

Editor's Notes

  1. I’m going to present my PhD work, entitled Optimizing Memory Transactions for Large-Scale Programs.
  2. Today, almost every computer provides more than one processing unit. We find them everywhere, even in personal devices such as tablets, mobile phones, and laptops.
  3. However, developing software that takes advantage of these multiple processors is not as easy as developing software for a single processor. Many problems arise from programming for multiprocessors.
  4. One of those well-known problems is shared memory synchronization. I am talking about software running multiple parallel threads that read and write shared memory concurrently. We must synchronize these concurrent accesses to avoid the possible occurrence of inconsistent data. This is the scope of the work that I have developed in my PhD.
  5. …and my general goal is to introduce new techniques for shared memory synchronization that increase the overall performance of this kind of software.
  6. So, let’s take a brief look at existing solutions. The most common solutions use lock-based techniques…
  7. …and they are still present in modern software environments such as Java or .NET. However, the difficulties of developing this kind of solution are widely known…
  8. … and not all programmers are able to develop fine-grained locking solutions. Here we have an example of just a small part of the STMBench7 benchmark code responsible for implementing the medium-grained locking strategy. We can observe the complexity of managing different locks for different operations. I chose this code because STMBench7 is one of the most complex, large-scale benchmarks for parallel applications, and I will use it in several examples throughout my presentation. On the other hand, when we avoid this kind of approach and simply use a coarse-grained lock…
  9. … such as using the synchronized keyword to synchronize shared access to a method, we may prevent scalability and parallel execution. In this example, every thread executing method m() acquires a coarse lock before accessing shared memory. When the second thread wants to access the shared memory, it tries to acquire the same lock and must wait…
  10. … whereas the first thread continues to write and read the shared memory. The second thread waits until the first releases the global lock.
  11. When that happens, the second thread successfully acquires the lock and proceeds to access the shared memory. In conclusion, although we have multiple processing units, we are not taking advantage of their capacity to execute tasks in parallel. Instead, we may have tasks running one after the other.
  12. But today there are alternatives to lock-based solutions, such as Transactional Memory. Instead of a pessimistic approach, Transactional Memory uses an optimistic approach that lets memory accesses proceed in parallel. It provides an abstraction that automatically uses fine-grained locks just where they are needed.
  13. So, taking the previous example again, we now replace the coarse-grained lock with the STM infrastructure.
  14. … and we replace the synchronized keyword with an atomic keyword provided by the STM API, or with an annotation, as in the case of Deuce STM. Now both threads may access shared data in parallel.
  15. But in fact this is not the real scenario, because now we also have to include in our analysis the STM-induced overheads, which are related to…
  16. the transaction bookkeeping…
  17. … and the overheads of the implicit STM barriers that are now performed by all memory accesses.
  18. So, although we may perform memory accesses in parallel without violating data consistency, we may not be able to improve the overall performance in comparison with a coarse-grained lock solution.
  20. I observed this behavior in several experiments with large-scale programs. Here I have an example for STMBench7, where the benchmark automatically transactified with the JVSTM performs even worse than a coarse-grained lock strategy. We can also observe the behavior of this benchmark with a medium-grained lock strategy, which is not easy to program but is already provided in this case. It represents the performance we would like to achieve, but preferably with less programming effort.
  21. So my goal is to reduce the overheads induced by STM barriers and thus improve the overall performance. The main question is: can we really reduce the overheads of an STM barrier and thus improve the overall performance of software synchronized with an STM?
  22. Let's take a deeper look at an STM barrier and compare a simple memory access with a transactional memory access.
  23. For instance, consider an object Point with two fields, X and Y. To get the value of the X field of this object, we have to get the reference to that object and access the corresponding field.
  24. On the other hand, if we now consider this same Point object as a transactional location, then we must take into account the additional metadata that the STM imposes on all transactional locations. In the case of the JVSTM, this metadata corresponds to a history of values. We call each element of this history a box body. A box body stores a value and the version of the transaction that committed it. The transactional object points to the head of the version history, which corresponds to the most recently committed value. So, if we are looking for the oldest value, we must traverse all the versions until we reach the desired one.
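The version history described in this note can be sketched as a singly linked list of immutable bodies, newest first. This is a simplified, single-threaded illustration and not the JVSTM source: `Body`, `VersionedBox`, `commit` and `read` are hypothetical names, and the snapshot version is assumed to be non-negative.

```java
// One element of the history: an immutable (value, version) pair that
// links to the next older body.
final class Body<T> {
    final T value;
    final int version;
    final Body<T> next;

    Body(T value, int version, Body<T> next) {
        this.value = value;
        this.version = version;
        this.next = next;
    }
}

final class VersionedBox<T> {
    // Head of the history: the most recently committed value.
    private Body<T> head;

    VersionedBox(T initial) {
        head = new Body<>(initial, 0, null);
    }

    // Records a newly committed value in front of the older ones.
    void commit(T value, int version) {
        head = new Body<>(value, version, head);
    }

    // Walks from the newest body towards older ones until it finds the
    // newest value that is visible to a transaction with this snapshot.
    T read(int snapshotVersion) {
        Body<T> body = head;
        while (body.version > snapshotVersion) {
            body = body.next;
        }
        return body.value;
    }
}
```

A transaction reading with an old snapshot has to walk further down the list, which is the traversal cost mentioned above; a plain field read involves none of this.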
  25. A transaction must also record the locations it reads and writes in its read-set and write-set. So it is easy to understand that an STM barrier typically requires orders of magnitude more machine cycles than a simple memory access.
  26. So the question is: do we really need to incur all these overheads for all memory locations?
  27. Yes for contended data, to synchronize concurrent accesses and guarantee the data consistency.
  28. But in my experiments I observed that the vast majority of the memory locations managed by a program are not under contention. In these cases, the corresponding memory accesses perform useless STM barriers.
  29. So this is the key that will allow me to reduce the STM-induced overheads. I will introduce techniques that avoid the tasks executed by an STM barrier for non-contended data.
  30. Now, when we access an object that is not under contention, we may read its own fields directly and avoid the metadata indirections and the further STM tasks.
  31. I identified 3 major situations of useless STM barriers: (1) non-contended classes: classes whose objects are never under contention, because they are immutable, transaction-local or thread-local; (2) non-shared objects: classes with both shared and non-shared objects; (3) shared objects that on many occasions are not shared, and thus do not need to perform the STM barriers in those cases.
  32. For each situation I used a different optimization technique.
  33. I chose one of the most widely used software development environments worldwide: Java. To that end, I implemented my proposals in Deuce, an STM framework, which makes my optimization techniques available to any supported STM algorithm. The exception is the last technique, which for now is specifically designed for the JVSTM.
  34. Another advantage of my proposals is that, because each technique deals with a different scenario, we may combine them in several ways to avoid different kinds of overheads. So now let's look at each proposal individually.
  35. To introduce my first technique, I will show an example with part of the code of the class Worm from the WormBench benchmark for Java. Every memory access inside the moveBody method is replaced with an STM barrier by the STM compiler, and every method invoked from it may also include STM barriers. Here I am emphasizing two statements that include STM barriers: accessing the body array and invoking the update method of a coordinate object. The STM barriers that modify the coordinate object are necessary, but what about the implicit barriers from the foreach statement? This array is not being modified, and in this case the final keyword does not prevent the use of STM barriers, because it only says that the array reference is immutable, not its elements.
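The point about final can be checked with plain Java. The classes below are a hypothetical reconstruction that only reuses the names mentioned in this note (Worm, body, moveBody, update); they are not the WormBench source, and `FinalArrayDemo` is added purely for illustration.

```java
final class Coordinate {
    int x;
    int y;

    Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Writes the coordinate's fields: these are the accesses that
    // genuinely need synchronization.
    void update(int dx, int dy) {
        x += dx;
        y += dy;
    }
}

final class Worm {
    // final fixes the reference only; body[i] can still be reassigned.
    final Coordinate[] body;

    Worm(Coordinate... body) {
        this.body = body;
    }

    void moveBody(int dx, int dy) {
        for (Coordinate c : body) {   // implicit reads of body and body[i]
            c.update(dx, dy);
        }
    }
}

final class FinalArrayDemo {
    static boolean elementCanBeReplaced() {
        Worm worm = new Worm(new Coordinate(0, 0), new Coordinate(1, 0));
        Coordinate replacement = new Coordinate(9, 9);
        worm.body[0] = replacement;        // compiles: writes an element
        // worm.body = new Coordinate[2];  // would not compile: final reference
        return worm.body[0] == replacement;
    }
}
```

Because the language cannot express that the elements never change, an STM compiler has to assume they might, and instruments the element reads in the loop unless the programmer tells it otherwise.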
  36. So my first technique extends the Deuce API with a couple of annotations that let the programmer specify the behavior of certain locations. In the previous example, we may use the first annotation to let the compiler know that the array is immutable and thus avoid the use of STM barriers. This annotation can also be parameterized in a different way according to the behavior of the annotated location as… Similarly, my solution also includes a specific annotation to control the transactification of fields.
  37. In my proposal I used JWormBench to explore the performance effects of relaxing the transparency of an STM with these annotations. I used 3 annotations on the definitions of 5 different memory locations to avoid useless STM barriers.
  38. … and with this optimization I got a speedup of almost 10 times for the TL2 STM. I ran these tests on a machine with 48 cores.
  39. So, this technique eliminates all the overheads of an STM barrier for memory locations of classes that only have non-contended objects. However, for classes that have both shared and non-shared objects we are not able to use this technique.
  40. In that case we have to identify non-shared objects at the object level, not at the level of their class definition. So let's start with an example of classes that have both shared and non-shared objects.
  41. Here we have two transactions concurrently accessing a couple of shared objects: an Account, a Person and a Point. Now if transaction 1 instantiates some objects, for instance a Point and an Account during its execution, then we are sure that transaction 2 cannot access these objects because they are not visible outside the transaction’s boundaries until it commits successfully.
  42. So these objects correspond to transaction-local memory, that is, memory allocated inside a transaction that cannot escape from it. Dragojevic introduced the concept of captured memory as the memory captured by its allocating transaction. Accesses to objects in captured memory do not need to perform a full barrier. So Dragojevic proposed a capture analysis technique for an unmanaged environment that identifies whether a memory location is captured by a transaction, and thus whether it requires a full STM barrier.
  43. To better understand the idea of a runtime capture analysis technique, let's see an example of an STM barrier. Here we have a simplified view of a read barrier in the Deuce framework. This barrier receives as arguments the target object, the field’s address, the value of that field and a context that represents the transaction object. I am hiding other low-level details in this code. For now I just want to show that an STM barrier typically redirects the memory access to the corresponding transaction through the context reference.
  44. So the goal of capture analysis is to access the transactional object directly, instead of invoking the transaction, when that object is captured by its allocating transaction, and thus avoid the further tasks of an STM barrier, such as keeping track of the read-set and write-set. Here we have the same STM barrier performing a runtime capture analysis through the isCaptured method. If isCaptured returns true, then the read barrier just needs to return the field’s value.
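A minimal sketch of the two barrier shapes described in these notes follows. It is not the Deuce API: `TxContext`, `Barriers` and `CountingContext` are hypothetical names, the capture test is deliberately left abstract behind `isCaptured`, and the counting context exists only so that the skipped work can be observed.

```java
// Hypothetical stand-in for the transaction context passed to a barrier.
interface TxContext {
    // Full transactional read: bookkeeping, validation, and so on.
    int onReadAccess(Object target, int value, long fieldAddress);

    // Capture analysis: was target allocated by this transaction?
    boolean isCaptured(Object target);
}

final class Barriers {
    // Plain barrier: every read is redirected to the transaction.
    static int fullReadBarrier(Object target, int value, long fieldAddress, TxContext ctx) {
        return ctx.onReadAccess(target, value, fieldAddress);
    }

    // Barrier with runtime capture analysis: captured objects take the
    // fast path and simply return the value already read from the field.
    static int readBarrier(Object target, int value, long fieldAddress, TxContext ctx) {
        if (ctx.isCaptured(target)) {
            return value;
        }
        return ctx.onReadAccess(target, value, fieldAddress);
    }
}

// Toy context that treats one given object as captured and counts how
// many reads took the full path.
final class CountingContext implements TxContext {
    int fullBarriers;
    private final Object captured;

    CountingContext(Object captured) {
        this.captured = captured;
    }

    @Override
    public int onReadAccess(Object target, int value, long fieldAddress) {
        fullBarriers++;   // stands in for read-set tracking
        return value;
    }

    @Override
    public boolean isCaptured(Object target) {
        return target == captured;
    }
}
```

Reading a captured object through `readBarrier` leaves `fullBarriers` untouched, while reading any other object increments it, so the gain depends on `isCaptured` being cheaper than the work it avoids.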
  45. To improve STM performance, the overhead of the capture analysis implemented by the isCaptured function must be lower than the overhead of performing a full STM barrier.
  46. To that end I implemented an efficient algorithm of runtime capture analysis for a managed runtime environment, which I called LICM.
  47. The idea of LICM is that every transaction keeps a unique identifier, called a fingerprint, which is recorded in an owner field of every newly instantiated object. So every time a transaction finds an object whose owner is equal to the transaction’s fingerprint, it can avoid the full STM barrier. The capture analysis algorithm just needs to perform an identity comparison between the transaction’s fingerprint and the object’s owner.
  48. So, revisiting our previous example: now every newly allocated object has an owner corresponding to its allocating transaction’s fingerprint. When the allocating transaction accesses its captured objects, it identifies them as owned objects and thus does not require a full barrier.
  49. Later, when the transaction commits successfully and those objects become visible, any other transaction that accesses them will perform a full STM barrier, because its fingerprint differs from the owner of those objects.
  50. So the main challenge of the LICM implementation is to find an efficient way of generating fingerprints. I chose a newly allocated instance of class Object as the fingerprint. This solution has the advantage of relying on the garbage collector to provide uniqueness and to recycle unused fingerprints. Although we are creating one more object per transaction, I am working in the scope of large-scale programs, so the overhead of this fingerprint object is very low in comparison to the whole program working set. Finally, my experimental results show the effectiveness of this technique.
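The fingerprint scheme can be sketched as follows. The way the owner reaches the object (here, a constructor argument) and the names `TxObject`, `Transaction`, `Point` and `newPoint` are assumptions made for illustration; the notes only state that the fingerprint is a fresh Object recorded in an owner field and compared by identity.

```java
// Hypothetical base class for transactional objects: it only carries the
// fingerprint of the transaction that allocated the object.
class TxObject {
    final Object owner;

    TxObject(Object owner) {
        this.owner = owner;
    }
}

final class Point extends TxObject {
    int x;
    int y;

    Point(Object owner, int x, int y) {
        super(owner);
        this.x = x;
        this.y = y;
    }
}

final class Transaction {
    // A fresh Object per transaction: the runtime guarantees that no other
    // live object has the same identity, and the garbage collector reclaims
    // the fingerprint once nothing refers to it any more.
    private final Object fingerprint = new Object();

    // Objects allocated inside the transaction record its fingerprint.
    Point newPoint(int x, int y) {
        return new Point(fingerprint, x, y);
    }

    // Capture analysis is a single identity comparison.
    boolean isCaptured(TxObject target) {
        return target.owner == fingerprint;
    }
}
```

An object allocated by one transaction is reported as captured only by that transaction; any later transaction carries a different fingerprint and therefore falls back to the full barrier.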
  51. These are the results previously obtained with the first optimization technique.
  52. And now, also adding LICM, we get even better performance. In this case LICM is able to identify transaction-local objects that were not excluded from transactification by the previous technique, because their classes have both shared and non-shared objects. Beyond that, all the transaction bookkeeping and metadata eliminated by this optimization also has an impact on memory consumption.
  54. Finally, the last scenario corresponds to objects that are subject to concurrent modifications for a small period of time, but which stay unmodified after that period and for the rest of the program execution. In this case we are incurring the performance and memory overheads of the metadata.
  55. So, with my third technique, AOM (Adaptive Object Metadata), I propose that instead of a single layout that always includes the STM metadata, transactional objects should have an adaptive layout that includes
  56. two different object layouts: a compact layout, where no memory overheads exist, and an extended layout, used when the object may be under contention. When a transactional object is created it starts in the compact layout, and later, when it is updated by a transaction, it is extended. The original field values correspond to version 0 of the versioned history. Because the JVSTM has a garbage collection algorithm that removes old versions, eventually, if this object is no longer written by any transaction…
  57. … then its history will shrink to just one box body.
  58. In this case we can revert it back to the compact layout, discarding the additional metadata. All the operations of extending and reverting an object are lock-free and must preserve the progress guarantee of the whole JVSTM algorithm, which is also lock-free. So let's take a deeper look at each operation. To reason about the steps of the extension and reversion processes, you must keep in mind that all body elements of a history are IMMUTABLE.
  59. An object is extended during the write-back of a transaction if it is in the compact layout, which involves: creating a snapshot of the object; creating a body for that snapshot, marked with version 0; and creating the body for the new value and version, which points to the entry with version 0. The operation finishes with a compare-and-swap, and if it fails, that means another transaction helped in the write-back phase and the new value is already committed.
  60. The reversion process, on the other hand, first checks whether the first body of the object’s history points to null. If it does, which means that there is only one version in the history, it copies the values contained in that body to the corresponding fields in the object and then finishes with a compare-and-swap of the object’s body to null. The CAS fails if any other transaction commits new values for this object. When the CAS fails nothing else needs to be done, because the object stays in the extended layout.
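The extension and reversion steps can be sketched with an AtomicReference, under strong simplifications: the object has a single int field, `AdaptiveCell`, `VBody` and all method names are hypothetical, `trimHistory` is only a stand-in for the JVSTM garbage collection of old versions, and the reader-side protocol is reduced to a single `read` method. It illustrates the two compare-and-swap steps, not the actual JVSTM implementation.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable history element: value, version and link to the older body.
final class VBody {
    final int value;
    final int version;
    final VBody next;

    VBody(int value, int version, VBody next) {
        this.value = value;
        this.version = version;
        this.next = next;
    }
}

final class AdaptiveCell {
    // Compact layout: the value lives directly in the object's field.
    volatile int field;
    // null means compact; non-null is the head of the version history.
    final AtomicReference<VBody> body = new AtomicReference<>(null);

    AdaptiveCell(int initial) {
        field = initial;
    }

    boolean isCompact() {
        return body.get() == null;
    }

    // Extension during write-back: snapshot the field as version 0, chain
    // the new body in front of it, and publish both with a single CAS.
    // A failed CAS means the object was already extended by someone else.
    boolean extendAndCommit(int newValue, int newVersion) {
        VBody snapshot = new VBody(field, 0, null);
        VBody head = new VBody(newValue, newVersion, snapshot);
        return body.compareAndSet(null, head);
    }

    // Reversion: only when a single body is left. Copy its value back to
    // the field, then CAS the body to null. A failed CAS means a newer
    // commit arrived, so the object simply stays extended.
    boolean tryRevert() {
        VBody head = body.get();
        if (head == null || head.next != null) {
            return false;
        }
        field = head.value;
        return body.compareAndSet(head, null);
    }

    // Stand-in for the collection of old versions: keeps only the newest body.
    void trimHistory() {
        VBody head = body.get();
        if (head != null && head.next != null) {
            body.compareAndSet(head, new VBody(head.value, head.version, null));
        }
    }

    int read() {
        VBody head = body.get();
        return head == null ? field : head.value;
    }
}
```

Because bodies are immutable, each transition is decided by one CAS on the body reference, so a thread that loses the race never has to undo anything.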
  62. With the elimination of the metadata we get a big reduction in memory consumption, as we can see in these results for STMBench7.
  63. To finish my presentation, I would like to revisit the first graph, which shows that the performance of an STM is far from the performance of a medium-grained lock in a large-scale program, and in particular for STMBench7.
  64. In the case of Vacation, we do not have any lock-based synchronization strategy to compare with the STM synchronization approach. So I include in my results another STM algorithm for comparison, for which I also used the LICM optimization technique.