The research work that I describe in this dissertation is concerned with
the problem of shared-memory synchronization in large-scale
programs.
The difficulties of developing fine-grained lock-based synchronization
are well known, and many researchers have argued for the need for
alternative approaches.
Simply put, the main goal of my work is to provide an efficient
alternative to fine-grained locking.
My proposal is based on Software Transactional Memory
(STM) and I implemented it in a well-known STM framework for
Java: Deuce STM.
To that end, I propose a new approach that significantly lowers the
overhead caused by an STM in large-scale programs for which only a
small fraction of the memory is under contention. My solution
combines two novel optimization techniques in a synergistic way,
allowing us to get, for the first time, performance with an STM that
rivals the performance of the best lock-based approaches in some of
the more challenging benchmarks. My approach and experimental
results show that STMs may be the first efficient alternative to locks
for shared-memory synchronization in real-world-sized applications.
In previous work, we proposed a new multi-versioning STM, Adaptive Object Metadata (AOM), that substantially reduces both the memory and the performance overheads associated with transactional locations that are not under contention. AOM is an object-based design that follows the general design of the JVSTM, but it is adaptive because the metadata used for each transactional object changes over time, depending on how objects are accessed. We have now implemented a new version of the AOM that is based on the lock-free version of the JVSTM, and we eliminated all the overheads of accessing objects in the compact layout during read-only transactions. To make the contention-free execution path free of any STM barrier, we duplicated the accessors of the transactional classes, so that one accesses the object fields directly and the other uses STM barriers.
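The accessor duplication described above can be sketched as follows. This is an illustrative model only: class and method names are invented, and the real implementation generates such accessors via bytecode instrumentation in Deuce rather than writing them by hand.

```java
// Sketch of the duplicated-accessor idea (illustrative names, not the actual
// AOM/Deuce-generated code). Each transactional class gets two accessors per
// field: a direct one for the barrier-free path taken over non-contended,
// compact-layout objects, and an instrumented one that goes through the STM
// read barrier.
class StmContext {
    // Stand-in for the STM context: a real barrier would log (ref, value)
    // in the transaction's read-set before returning the value.
    long onReadAccess(Object ref, long value) {
        return value;
    }
}

class Account {
    private long balance;

    Account(long initial) { this.balance = initial; }

    // Fast path: plain field read, no STM barrier.
    long getBalanceDirect() {
        return balance;
    }

    // Slow path: the same read routed through the STM read barrier.
    long getBalanceBarrier(StmContext ctx) {
        return ctx.onReadAccess(this, balance);
    }
}
```

Both accessors return the same value; the point is that the fast path touches no STM machinery at all.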
Profilers find performance bottlenecks in your app, but the information they provide can be confusing. This talk gives you insight into how your profiler and your app really interact: which profiling APIs are available, how they work, and what their implementation on the JVM (OpenJDK) side looks like:
Stack sampling profilers: a stop-motion view of your app
GetCallTrace (JVisualVM case study): The official stack sampling API
Safepoints and safepoint sampling bias
AsyncGetCallTrace (Honest Profiler case study): The unofficial API
JVM Profilers vs System Profilers: No API needed?
In this talk, Gil Yankovitch discusses the PaX patch for the Linux kernel, focusing on memory manager changes and security mechanisms for memory allocations, reads, writes from user/kernel space and ASLR.
Specializing the Data Path - Hooking into the Linux Network Stack (Kernel TLV)
Ever needed to add your custom logic into the network stack?
Ever hacked the network stack but wasn't certain you're doing it right?
Shmulik Ladkani talks about various mechanisms for adding custom packet-processing logic to the network stack's data path.
He covers topics such as packet sockets, netfilter hooks, traffic control actions and eBPF, and discusses their applicable use cases, advantages and disadvantages.
Shmulik Ladkani is a Tech Lead at Ravello Systems.
Shmulik started his career at Jungo (acquired by NDS/Cisco) implementing residential gateway software, focusing on embedded Linux, Linux kernel, networking and hardware/software integration.
51966 coffees and billions of forwarded packets later, with millions of homes running his software, Shmulik left his position as Jungo’s lead architect and joined Ravello Systems (acquired by Oracle) as tech lead, developing a virtual data center as a cloud service. He's now focused on virtualization systems, network virtualization and SDN.
Project Loom is one of the most important changes coming to the JDK. The talk explores the constraints and benefits of the thread-per-request model and why there is a big push towards async frameworks.
It also shows how Project Loom and structured concurrency give a new design paradigm for writing scalable, maintainable code.
In this webinar, we talk about hard-to-test patterns in C++ and show how to refactor them. The difficulty, in this context, does not lie in the code's inherent complexity.
The focus will be on patterns technically difficult to unit test because they may:
* Require irrelevant software to be tested too
* E.g.: 3rd party libraries, classes other than the one under test
* Delay the test execution
* E.g.: sleeps inside code under test
* Require intricate structures to be copied or written from scratch
* E.g.: fakes containing a lot of logic
* Require test details to be included in the production code
* E.g.: #ifdef UNIT_TESTS
* Make changes and/or are dependent on the runtime environment
* E.g.: Creating or reading from files
SIMD machines — machines capable of evaluating the same instruction on several elements of data in parallel — are nowadays commonplace and diverse, be it in supercomputers, desktop computers or even mobile ones. Numerous tools and libraries can make use of that technology to speed up their computations, yet it could be argued that there is no library that provides a satisfying minimalistic, high-level and platform-agnostic interface for the C++ developer.
Various open source cryptographic libraries are used these days to implement
general-purpose cryptographic functions and to provide a secure communication channel over
the internet. These libraries, which implement SSL/TLS, have been targeted by various side
channel attacks in the past that result in leakage of sensitive information flowing over the
network. Side channel attacks rely on inadvertent leakage of information from devices
through observable attributes of online communication. Some of the common side channel
attacks discovered so far rely on packet arrival and departure times (Timing Attacks), power
usage and packet sizes. Our research explores a novel side channel attack that relies on CPU
architecture and instruction sets. In this research, we explored such side channel vectors
against popular SSL/TLS implementations which were previously believed to be patched
against padding oracle attacks, like the POODLE attack. We were able to successfully extract
the plaintext bits in the information exchanged using the APIs of two popular SSL/TLS
libraries.
Multithreading with modern C++ is hard: undefined variables, deadlocks, livelocks, race conditions, spurious wakeups, the double-checked locking pattern, and so on. And at the base is the new memory model, which does not make life any easier. The list of things that can go wrong is very long. In this talk I give you a tour of the things that can go wrong and show how you can avoid them.
Azul Virtual Machine Engineer Douglas Hawkins describes how decisions made by the JVM affect how your code is compiled and run. Learn how this affects application performance and what steps you can take to optimize how the JVM acts on your code.
Accelerating Habanero-Java Program with OpenCL Generation (Akihiro Hayashi)
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.
The Java Memory Model describes how threads in the Java programming language interact through memory. Together with the description of single-threaded execution of code, the memory model provides the semantics of the Java programming language.
It is crucial for a programmer to know how, according to the Java Language Specification, to write correctly synchronized, race-free programs.
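As a small illustration of the happens-before reasoning the Java Memory Model enables (a minimal sketch, not taken from the talk):

```java
// Minimal safe-publication example under the Java Memory Model. The volatile
// write to `ready` happens-before any subsequent volatile read that observes
// true, so a reader that sees ready == true is guaranteed to also see the
// preceding write to `value`. Without `volatile`, this class would have a
// data race on `ready` and no such guarantee.
class Handoff {
    private int value;
    private volatile boolean ready;

    void publish(int v) {
        value = v;      // ordinary write, ordered before the volatile write
        ready = true;   // volatile write: "releases" value to other threads
    }

    Integer tryConsume() {
        if (ready) {
            return value;  // the volatile read above "acquires" the value
        }
        return null;       // not published yet
    }
}
```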
17. STM… overheads

[Diagram: Thread 1 and Thread 2 each execute an @Atomic method m() over shared memory; between Trx Begin and Trx Commit, every read (R) and write (W) goes through an STM barrier.]

@Atomic void m(){
  ...
  ...
}
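What the diagram depicts can be made concrete with a small sketch. This is an illustrative model of the instrumented form of an @Atomic method, not the actual bytecode Deuce emits; the Stm class is a toy stand-in.

```java
// Illustrative sketch of what STM instrumentation does to an @Atomic method:
// each shared read/write becomes a barrier call, bracketed by Trx Begin and
// Trx Commit. A real STM would also log read/write-sets and validate on commit.
class Counter {
    int value;

    // Source view:       @Atomic void increment() { value = value + 1; }
    // Instrumented view:
    void incrementInstrumented(Stm stm) {
        stm.begin();                              // Trx Begin
        int v = stm.readBarrier(this, value);     // read barrier (R)
        stm.writeBarrier(this, v + 1);            // write barrier (W)
        stm.commit(this);                         // Trx Commit
    }
}

class Stm {
    private int pendingValue;
    private boolean hasWrite;

    void begin() { hasWrite = false; }            // start a new transaction

    int readBarrier(Counter ref, int currentValue) {
        // a real STM would log the read in the read-set for later validation
        return currentValue;
    }

    void writeBarrier(Counter ref, int newValue) {
        // a real STM would buffer the write in the write-set
        pendingValue = newValue;
        hasWrite = true;
    }

    void commit(Counter ref) {
        // a real STM would validate the read-set before publishing
        if (hasWrite) ref.value = pendingValue;
    }
}
```

The point of the slides that follow is that every one of these barrier calls costs time even when no other thread touches the data.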
18. STM… overheads

[Diagram: the same two-thread execution of the @Atomic method m(), again with every shared-memory read and write going through an STM barrier.]

@Atomic void m(){
  ...
  ...
}
19. [Diagram: STM versus lock-based execution. With STM, Thread 1 and Thread 2 run concurrently but pay a barrier on every shared-memory access; with a lock, Thread 2 simply waits while Thread 1 accesses the shared memory.]
20. A large-scale benchmark for Java

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads; series: seq-1thread.]
21. A large-scale benchmark for Java

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock.]
22. A large-scale benchmark for Java

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock, jvstm.]

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
23. A large-scale benchmark for Java

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock, medium-lock, jvstm.]

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
24. [Diagram: repeat of the STM-versus-lock comparison: with STM every access pays a barrier, while with a lock Thread 2 waits for Thread 1.]
25. [Diagram: the STM-versus-lock comparison again, highlighting the barrier on each shared-memory read and write.]
35. 3 cases of useless STM barriers
• non-contended classes
• non-shared objects
• shared but frequently non-contended objects
36. 3 different techniques
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
37. Implemented for the JVM
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
Deuce STM framework
TL2, LSA, JVSTM
38. May be combined…
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime time analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
Deuce STM framework
TL2, LSA, JVSTM
39. • non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime time analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
40. Transparent STM API
public class Worm implements IWorm {
    final int id;
    final int headSize;
    final int speed;
    final BodyCoord[] body;

    public void moveBody(ICoordinate newCoordinate) {
        for (BodyCoord c : body) {
            ...
            c.update(newCoordinate);
            ...
        }
    }
}

[The slide flags four STM barriers on this method's field and array accesses.]
41. Relax the STM API Transparency
@NoSyncArray(Immutable)
@NoSyncArray(TransactionLocal)
@NoSyncArray(ThreadLocal)
@NoSyncField(Immutable)
@NoSyncField(TransactionLocal)
@NoSyncField(ThreadLocal)
Carvalho & Cachopo, ICA3PP’11
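A sketch of how such annotations look at use sites. The annotation names come from the slide, but the declarations below (the enum, retention policy, and exact value syntax) are assumptions for illustration; in Deuce they are provided by the framework.

```java
// Sketch of the extended Deuce API with auxiliary annotations. The
// @NoSyncField name is from the slide; NoSyncKind and the declarations
// here are illustrative assumptions, not the real framework types.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

enum NoSyncKind { Immutable, TransactionLocal, ThreadLocal }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NoSyncField { NoSyncKind value(); }

class WormHead {
    // Never mutated after construction: the instrumentation may skip barriers.
    @NoSyncField(NoSyncKind.Immutable)
    final int headSize;

    // Only ever touched by the thread that owns the worm: no barriers needed.
    @NoSyncField(NoSyncKind.ThreadLocal)
    int cursor;

    WormHead(int headSize) { this.headSize = headSize; }
}
```

The annotations let the programmer tell the instrumentation which memory locations can never be contended, trading a little transparency for the removal of useless barriers.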
42. In 5 different memory location definitions
JWormBench
@NoSyncArray(Immutable)
@NoSyncArray(TransactionLocal)
@NoSyncArray(ThreadLocal)
@NoSyncField(Immutable)
@NoSyncField(TransactionLocal)
@NoSyncField(ThreadLocal)
Carvalho & Cachopo, ICA3PP’11
43. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s); series: seq-1thread.]
44. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm.]
45. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync.]
46. Deuce STM API with Auxiliary Annotations

Eliminates the following overheads:

[Table crossing the optimizations (so far, the STM API with annotations) against the overheads: the STM API itself, the STM metadata, and the logging of the read-set and write-set.]
47. • non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime time analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
49. Captured Memory

[Diagram: transactions TRX 1 and TRX 2, each with a read-set and a write-set, referencing objects such as :Point, :Account and :Person. Objects are captured by their allocating transaction.]

Proposed by Dragojevic et al. for an unmanaged environment (Dragojevic et al., SPAA'09).
50. e.g. Read STM Barrier in Deuce

function onReadAccess(ref, addr, val, ctx)
    return ctx.onReadAccess(ref, addr, val)
end function

[Diagram: ctx holds the transaction's read-set and write-set; ref and addr identify the field being read, and val is its current value.]
51. Runtime Capture Analysis

The plain read barrier:

function onReadAccess(ref, addr, val, ctx)
    return ctx.onReadAccess(ref, addr, val)
end function

gains a capture test:

function onReadAccess(ref, addr, val, ctx)
    if isCaptured(ref, ctx) then
        return val
    else
        return ctx.onReadAccess(ref, addr, val)
    end if
end function
52. Runtime Capture Analysis

function onReadAccess(ref, addr, val, ctx)
    if isCaptured(ref, ctx) then
        return val
    else
        return ctx.onReadAccess(ref, addr, val)
    end if
end function

To improve the STM performance: Overhead(isCaptured) << Overhead(ctx.onReadAccess)
53. LICM
Lightweight Identification of Captured Memory
Carvalho & Cachopo, PPoPP’13

• A runtime capture analysis technique
• For a managed runtime environment, such as Java
• Lightweight
54. LICM
Lightweight Identification of Captured Memory
Carvalho & Cachopo, PPoPP’13

[Diagram: the transaction context ctx carries a fingerprint (the transaction id), and each object carries an owner field set by its allocating transaction.]

function onReadAccess(ref, addr, val, ctx)
    if isCaptured(ref, ctx) then
        return val
    else
        return ctx.onReadAccess(ref, addr, val)
    end if
end function

static boolean isCaptured(Object ref, Context ctx){
    return ctx.fingerprint == ref.owner;
}
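The capture test can be made runnable with a small sketch. Field and method names follow the slide; the scheme of allocating one fresh fingerprint object per transaction is an assumption about the mechanism, not the exact LICM implementation.

```java
// Runnable sketch of LICM's capture test. The fingerprint is an object whose
// identity names the transaction; each transactional object records, at
// allocation time, the fingerprint of the transaction that allocated it.
// An identity comparison then decides, with no synchronization, whether the
// object is transaction-local (captured) and may skip the STM barrier.
class Fingerprint { }

class TxObject {
    Fingerprint owner;   // set once, by the allocating transaction
}

class Context {
    final Fingerprint fingerprint = new Fingerprint();

    // Allocation inside the transaction tags the object with our fingerprint.
    <T extends TxObject> T capture(T obj) {
        obj.owner = fingerprint;
        return obj;
    }

    // The slide's test: a single reference comparison on the fast path.
    boolean isCaptured(TxObject ref) {
        return fingerprint == ref.owner;
    }
}
```

An object allocated by one transaction is captured for that transaction and for no other, which is exactly the property the cheap identity check exploits.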
57. Challenge

An efficient process of generating fingerprints:
• Avoiding further synchronization
• Avoiding the counter rollover

[Diagram: several objects, each with an owner field pointing to the fingerprint (Trx Id) of its allocating transaction.]
58. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync.]
59. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync, jvstm-1nosync-licm.]
60. LICM

Eliminates the following overheads:

[Table crossing the optimizations (STM API with annotations, LICM) against the overheads: the STM API itself, the STM metadata, and the logging of the read-set and write-set.]
61. • non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
71. AOM
Eliminates the following overheads:
[Table relating each optimization (STM API with Annotations, LICM, AOM) to the overheads it eliminates: STM API, STM metadata, logging of the read-set and write-set]
73.
• non-shared objects => runtime analysis (LICM)
• shared but frequently non-contended objects => runtime adaptive technique (AOM)
STM versus medium-grained lock in a large-scale benchmark, such as STMBench7
74. STMBench7
[Plot: throughput (×10³ ops/s) vs. number of threads (1–48) for the STMBench7 read-dominated workload; series: medium-lock, coarse-lock, jvstm]
These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
77. Vacation
These tests were performed with the following configuration:
-n 256 -q 90 -u 98 -r 262144 -t 65536, proposed by [Cao Minh et al., 2008].
[Plot: throughput (×10³ ops/s) vs. number of threads (1–48) for Vacation low contention; series: tl2, tl2-licm, jvstm, jvstm-licm-aom]
78. Main Contributions
• JWormBench—A flexible benchmark for transactional synchronization
• 3 optimization proposals
  – Extended Deuce API
  – LICM—Lightweight Identification of Captured Memory
  – AOM—Adaptive Object Metadata
• Implementation of these techniques in the Deuce STM framework
• Support for in-place metadata in the Deuce STM framework
• Fast access path for non-contended objects: LICM + AOM
79. Main Contributions
[Diagram: a transactional Point in the extended layout, with a history of VBoxBody entries (version 23 and version 0) that the transaction traverses, logging the access in its read-set and write-set]
80. Fast path for non-contended objects
[Diagram: with LICM + AOM, the transaction reads the Point’s fields directly through the compact layout, bypassing the VBoxBody history and the read-set/write-set logging]
81. International conferences and workshops:
• STM with transparent API considered harmful. Springer-Verlag, ICA3PP’11, Melbourne, Australia.
• Adaptive object metadata to reduce the overheads of a multi-versioning STM. MULTIPROG’12, Paris, France.
• Objects with adaptive accessors to avoid STM barriers. WTM’12, Bern, Switzerland.
• Runtime elision of transactional barriers for captured memory. ACM, PPoPP’13, Shenzhen, China.
• Lightweight identification of captured memory for Software Transactional Memory. Springer-Verlag, ICA3PP’13, Sorrento, Italy. Best Paper Award.
In progress:
• Journal of Parallel and Distributed Computing, Elsevier: Optimizing memory transactions for large-scale programs.
• Information Sciences, Elsevier: Optimizing memory transactions with lightweight capture analysis.
Editor's Notes
I’m going to present my PhD work, entitled Optimizing Memory Transactions for Large-Scale Programs.
Today, any computer provides more than one processing unit.
We can find them everywhere, even in our personal devices: tablets, mobile phones, laptops, …
However, developing software that takes advantage of these multiple processors is not as easy as developing software for a single processing unit. Many problems arise from programming for multiprocessors.
One of those well-known problems is shared-memory synchronization.
I am talking about Software running multiple parallel threads that read and write to shared memory concurrently.
And we must synchronize these concurrent accesses to avoid the eventual occurrence of inconsistent data.
So, this is the scope of the work that I have developed in my PhD.
…and my general goal is to introduce new techniques for shared-memory synchronization that are able to increase the overall performance of this kind of software.
So, let’s take a brief look over existing solutions.
The most common solutions use lock-based techniques…
…and they are still present in modern software environments, such as Java or .NET.
However, the difficulties of developing this kind of solution are widely known…
…and not all programmers are able to develop fine-grained lock solutions.
Here we have an example of just a small part of the code of the STMBench7 benchmark responsible for the implementation of the medium-lock strategy, and we can observe the complexity of managing different locks for different operations. I chose this code because STMBench7 is one of the most complex and large-scale benchmarks for parallel applications, and I will use it in several examples along my presentation.
On the other hand, when we avoid this kind of approach and simply use a coarse-grained lock…
…such as simply using the synchronized keyword to synchronize shared access to a method, then we may prevent scalability and parallel execution.
Now, in this example, every thread executing method m() acquires a coarse lock before accessing shared memory.
And in this case, when the second thread wants to access the shared memory and thus tries to acquire the same lock, it must wait…
…whereas the first thread continues to read and write the shared memory.
And the second thread waits until the former releases the global lock.
When that happens, the second thread successfully acquires the lock and proceeds to access the shared memory.
So, concluding: although we have multiple processing units, we are not taking advantage of the capacity to execute tasks in parallel.
And thus, we may have tasks running one after another, and so on.
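The coarse-grained scenario just described can be sketched in Java; the class and method names below are illustrative (they are not from the benchmark), but the behavior is the one on the slides: one intrinsic lock serializes all accesses to the shared state, so threads queue up even when their work could overlap.

```java
// Coarse-grained locking: one intrinsic lock guards all shared state.
class SharedCounter {
    private long value = 0;

    // Every thread entering m() acquires the same object-wide lock,
    // so concurrent calls are fully serialized.
    synchronized void m() {
        value++;
    }

    synchronized long get() {
        return value;
    }
}

class CoarseLockDemo {
    // Runs several threads against the single lock and returns the
    // final counter value (always threads * incrementsPerThread,
    // because the lock prevents lost updates).
    static long run(int threads, int incrementsPerThread) {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) counter.m();
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.get();
    }
}
```

The result is correct, but at the cost of running the critical sections one after another, exactly as in the animation.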
But today, there are alternatives to lock-based solutions, such as Transactional Memory. Instead of using a pessimistic approach, Transactional Memory uses an optimistic approach that lets memory accesses proceed in parallel.
It provides an abstraction that automatically uses fine-grained locks just where they are needed.
So, taking again the previous example, now we replace the coarse lock with the STM infrastructure…
…and we replace the synchronized keyword with an atomic keyword provided by the STM API, or an annotation, as happens in the case of Deuce STM.
So, now we may have both threads concurrently accessing shared data in parallel.
But, in fact, this is not the real scenario, because now we have to include in our analysis also the STM-induced overheads that are related to…
the transaction bookkeeping…
… and the overheads of the implicit STM barriers that are now performed by all memory accesses.
So, although we may perform memory accesses in parallel without violating data consistency, we may not be able to improve the overall performance in comparison with a coarse-grained lock solution.
I observed this behavior in several experiments with large-scale programs.
And here I have an example for STMBench7, where the benchmark automatically transactified with the JVSTM performs even worse than a coarse-lock strategy.
We can also observe the behavior of this benchmark with a medium-lock strategy, which is not easy to program but is already provided in this case; it represents the performance we would like to achieve, desirably with less programming effort.
So, my goal is to reduce the STM barriers induced overheads and thus improve the overall performance.
And, the main question is: can we really reduce the overheads of an STM Barrier and thus improve the overall performance pf software synchronized with an STM?
Lets take a deeper look over an STM barrier and compare a simple memory access with a transactional memory access.
For instance consider an Object Point with two fields X and Y. To get the X field value of this object we have to get the reference to that object and access the corresponding field.
On the other hand, and considering now this same object Point as a transactional location, then we must take into account the additional metadata that is imposed by the STM to all transactional locations.
In the case of the JVSTM this metadata corresponds to a history of values.
We call each element of this history a box body.
A box body stores the version of the transaction that has committed that body and its corresponding value.
The transactional object points to the head of the versions’ history corresponding to the most recent committed value.
So considering that we are looking for the oldest value, then we must track all the versions until we reach the desired value.
And a transaction must also keep track of the locations that are read and written in the read-set and write-set.
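The versioned-history design just described can be condensed into a minimal sketch; the VBox/VBoxBody names below are modeled after the JVSTM design but are a simplified illustration, not the framework's actual code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a JVSTM-style versioned box: each transactional location
// keeps a history of (version, value) bodies, newest first.
class VBoxBody<T> {
    final T value;
    final int version;           // version of the committing transaction
    final VBoxBody<T> previous;  // older body, or null for version 0

    VBoxBody(T value, int version, VBoxBody<T> previous) {
        this.value = value;
        this.version = version;
        this.previous = previous;
    }
}

class VBox<T> {
    final AtomicReference<VBoxBody<T>> body;

    VBox(T initial) {
        // The initial value is version 0 of the history.
        this.body = new AtomicReference<>(new VBoxBody<>(initial, 0, null));
    }

    // A committing writer prepends a new body to the history.
    void commit(T newValue, int txVersion) {
        body.set(new VBoxBody<>(newValue, txVersion, body.get()));
    }

    // A reader walks the history until it finds the newest body whose
    // version is not greater than its own read version -- this is the
    // traversal that makes a read barrier cost far more than a plain
    // field access.
    T read(int txReadVersion) {
        VBoxBody<T> b = body.get();
        while (b.version > txReadVersion) {
            b = b.previous;
        }
        return b.value;
    }
}
```

With a box holding 17 committed at version 0 and 73 committed at version 23, a transaction reading at version 10 still sees 17, while one reading at version 23 or later sees 73.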
So it is easy to understand that an STM barrier typically requires orders of magnitude more machine cycles than a simple memory access.
So the question is: do we really need to incur all these overheads for all memory locations?
Yes, for contended data, to synchronize concurrent accesses and guarantee data consistency.
But from my experiments I observed that the vast majority of the memory locations managed by a program are not under contention.
And in these cases, the corresponding memory accesses perform useless STM barriers.
So this is the key that will allow me to reduce the STM-induced overheads. I will introduce techniques that avoid the tasks executed by an STM barrier for non-contended data.
And now when we access an object that is not under contention we may directly read its proper fields and avoid the Metadata indirections and further STM tasks.
I identified 3 major situations of useless STM barriers:
non-contended classes -- classes whose objects are never under contention, because they are immutable, transaction-local, or thread-local.
non-shared objects -- classes with both shared and non-shared objects.
shared objects -- objects that on many occasions are not under contention and thus do not need to perform STM barriers in those cases.
For each situation I used a different optimization technique.
I chose one of the most used software development environments worldwide -- Java.
And, to that end, I implemented my proposals in Deuce -- an STM framework -- which makes my optimization techniques available to any supported STM algorithm, except the last technique, which for now was specifically designed for the JVSTM.
Another advantage of my proposals is that…
Because each technique deals with a different scenario then we may combine them in several ways to avoid different kinds of overheads.
So, now lets look to each proposal individually.
To introduce my first technique, I will show an example with part of the code of the class Worm of the WormBench benchmark for Java.
Every memory access inside the moveBody is replaced with an STM Barrier by the STM compiler.
And, every method invoked from this method may also include STM barriers.
So, here I am emphasizing two statements that include STM barriers: accessing the body array and invoking the update method of a coordinate object.
The STM barriers that modify the coordinate object are necessary, but what about the implicit barriers from the foreach statement?
This array is not being modified and in this case the final keyword does not prevent the use of STM barriers, because it just says that the array reference is immutable and not its elements.
So my first technique extends Deuce API with a couple of annotations that let the programmer specify the behavior of certain locations.
So in the previous example we may use the first annotation to let the compiler know that array is immutable and thus avoid the use of STM barriers.
And, this annotation can also be parameterized in a different way according to the behavior of the annotated location as…
Similarly, in my solution I also included a specific annotation to control the transactification of fields.
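To make the mechanism concrete, here is a sketch of what such an annotation could look like. The annotation name (@NoBarrier) and its parameter are hypothetical, chosen only to illustrate the idea; they are not necessarily the exact names of the extended Deuce API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation: tells the STM compiler that accesses to the
// annotated location never need STM barriers, and why.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NoBarrier {
    String value() default "immutable";
}

class Worm {
    // 'final' only says the array *reference* is immutable, not its
    // elements; the annotation is what tells the compiler the elements
    // are never mutated, so barriers can be elided.
    @NoBarrier("immutable")
    final int[] body = {1, 2, 3};

    // An instrumenting compiler (or, here, a reflective check) can ask
    // whether a field was marked barrier-free.
    static boolean isBarrierFree(String fieldName) {
        try {
            return Worm.class.getDeclaredField(fieldName)
                             .isAnnotationPresent(NoBarrier.class);
        } catch (NoSuchFieldException e) {
            return false;
        }
    }
}
```

At instrumentation time, a field flagged this way would get a plain field access emitted in place of the read barrier.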
In my proposal I used the JWormBench to explore the effects on performance of relaxing the transparency of an STM with these annotations.
And I used 3 annotations in 5 different memory location definitions to avoid useless STM barriers…
…and with this optimization I got a speedup in performance of the TL2 STM of almost 10 times.
I ran these tests in a machine with 48 cores.
So, this technique eliminates all the overheads of an STM barrier for memory locations of classes that only have non-contended objects.
However, for classes that have both shared and non-shared objects we are not able to use this technique.
In that case we have to identify non-shared objects at the object level and not at the level of its class definition.
So lets start to see an example of classes that have both shared and non-shared objects.
Here we have two transactions concurrently accessing a couple of shared objects: an Account, a Person and a Point.
Now if transaction 1 instantiates some objects, for instance a Point and an Account during its execution, then we are sure that transaction 2 cannot access these objects because they are not visible outside the transaction’s boundaries until it commits successfully.
So, these objects correspond to Transaction-local memory that is memory allocated inside a transaction, which cannot escape.
Dragojevic introduced the concept of Captured Memory as the memory captured by its allocating transaction.
In this case, all accesses to objects in captured memory do not need to perform a full barrier.
So, Dragojevic proposed a Capture Analysis technique for an unmanaged environment to identify if a memory location is captured by a transaction, or not, and thus if it requires a full STM barrier, or not.
To better understand the idea of a runtime Capture Analysis technique lets see an example of an STM Barrier.
Here we have a simplified view of a read barrier in the Deuce framework. This barrier receives as arguments: the target object, the field’s address, the value of that field, and a context that represents the transaction object.
I am hiding other low-level details in this code.
So, for now, I just want to show that typically an STM barrier redirects the memory access to the corresponding transaction through the context reference.
So, the goal of the capture analysis is to directly access the transactional object, instead of invoking the transaction, when that object is captured by its allocating transaction, and thus avoid the further tasks of an STM barrier, such as keeping track of the read-set and write-set.
Now, here we have the same STM Barrier performing a runtime capture analysis through the isCaptured method. If the isCaptured returns true then the Read Barrier just needs to return the field’s value.
To improve the STM performance, the overhead of the capture analysis implemented by the isCaptured function should be lower than the overhead of performing a full STM barrier.
To that end I implemented an efficient algorithm of runtime capture analysis for a managed runtime environment, which I called LICM.
The idea of LICM is that every transaction should keep a unique identifier, called a fingerprint, that is recorded in an owner field of every newly instantiated object.
So, every time a transaction finds an object with an owner equal to the transaction’s fingerprint, it can avoid the full STM barrier.
The capture analysis algorithm just needs to perform an identity comparison between the transaction’s fingerprint and the object’s owner.
So, revisiting our previous example: now, every newly allocated object has an owner corresponding to its allocating transaction’s fingerprint.
And, when the allocating transaction accesses its captured objects, the transaction identifies those objects as owned objects and thus does not require a full barrier.
Later, when the transaction commits successfully and those objects become visible, any other transaction that accesses those objects will perform a full STM barrier, because it has a fingerprint different from the owner of those objects.
So the main challenge of the LICM implementation is to find an efficient process of generating fingerprints.
So, I chose a newly allocated instance of class Object as a fingerprint.
This solution has the advantage of relying on the garbage collector to provide uniqueness and the ability to recycle unused fingerprints.
Although we are creating one more object per transaction, I am working in the scope of large-scale programs.
So the overhead of this fingerprint object is very low in comparison to the whole program working-set.
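The fingerprint mechanism can be sketched as follows; the class names are illustrative (in Deuce, the context stands for the transaction object), but the key moves are from the source: a fresh Object as fingerprint, recorded into an owner field at allocation, and a single identity comparison as the capture analysis.

```java
// An object managed by the STM, tagged at allocation time.
class TxObject {
    Object owner;   // fingerprint of the allocating transaction
    int value;
}

// Sketch of a transaction context carrying a LICM fingerprint.
class Context {
    // A plain 'new Object()' is unique while reachable, and the GC
    // recycles it once the transaction is gone: no extra
    // synchronization, no counter, no rollover to worry about.
    final Object fingerprint = new Object();

    // Allocation inside the transaction tags the object as captured.
    TxObject allocate(int value) {
        TxObject o = new TxObject();
        o.owner = this.fingerprint;
        o.value = value;
        return o;
    }

    // Capture analysis: a single identity comparison.
    static boolean isCaptured(TxObject ref, Context ctx) {
        return ctx.fingerprint == ref.owner;
    }
}
```

A transaction that allocated an object sees it as captured and skips the full barrier; any other transaction fails the identity check and takes the normal barrier path.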
Finally, my experimental results prove the effectiveness of this technique.
These are the results previously obtained with the first optimization technique.
And now, combining also the LICM, we get even better performance.
In this case, the LICM is able to identify transaction-local objects that were not excluded from transactification with the previous technique, because their classes have both shared and non-shared objects.
Beyond that, all the transaction bookkeeping and metadata that is eliminated with this optimization approach also has an impact on memory consumption.
Finally, the last scenario corresponds to objects that are subject to concurrent modifications for a small period of time, but which stay unmodified after that period and for the rest of the program execution.
So, in this case, we are incurring the performance and memory overheads of the metadata.
So, with my third technique, the AOM – Adaptive Object Metadata – I propose that instead of a unique layout, which includes the STM Metadata, transactional objects should have an adaptive layout that includes two different object layouts.
two different object layouts:
a compact layout, where no memory overheads exist,
and an extended layout, used when the object may be under contention.
When a transactional object is created it starts in the compact layout and later when it is updated by a transaction it will be extended.
The original fields values correspond to the version 0 of the versioned history.
Because the JVSTM has a garbage collector algorithm that removes old versions, eventually, if this object is no longer written by any transaction…
…it will end up with just one box body.
And, in this case we can revert it back to the compact layout, discarding the additional metadata.
All the operations of extending and reverting an object are lock-free and should guarantee the progress of the whole JVSTM algorithm that is also lock-free.
So, let’s take a deeper look at each operation. And, to reason about the tasks of each reversion and extension process, you must consider that all body elements of a history are IMMUTABLE.
An object is extended during the write-back of a transaction if it is in the compact layout, which involves:
creating a snapshot of the object
creating a body for that snapshot that is marked with version 0 and
creating the body for the new value and version that points to the entry with version 0.
The operation proceeds with a compare and swap, and if it fails that means that another transaction helped in the write-back phase and this new value is already committed.
On the other hand the reversion process involves:
First, it checks whether the first body of the object’s history points to a null previous body.
If it does, which means that there is only one version in the history, it copies the values contained in that body to the corresponding fields in the object,
and then it finishes with a compare-and-swap of the body of the object to null.
The CAS fails if any other transaction commits new values for this object. So, when the CAS fails nothing else needs to be done, because the object stays in the extended layout.
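The two AOM layout transitions can be sketched as below, under the assumption of a Point-like object with two fields. The class and method names are illustrative, and the real JVSTM-based implementation handles more than this minimal version, but the compact/extended distinction and the two CAS steps follow the description above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable body of the version history (values plus version).
class Body {
    final int x, y;
    final int version;
    final Body previous;   // older body, or null

    Body(int x, int y, int version, Body previous) {
        this.x = x; this.y = y; this.version = version; this.previous = previous;
    }
}

// An AOM object: compact layout when 'body' is null (fields hold the
// values directly), extended layout when 'body' points to a history.
class AomPoint {
    volatile int x, y;
    final AtomicReference<Body> body = new AtomicReference<>(null);

    boolean isCompact() { return body.get() == null; }

    // Extension, during write-back: snapshot the fields as version 0,
    // put the new value on top, and CAS the history in. A failed CAS
    // means another transaction helped and the value is already committed.
    boolean extend(int newX, int newY, int txVersion) {
        Body snapshot = new Body(x, y, 0, null);
        Body head = new Body(newX, newY, txVersion, snapshot);
        return body.compareAndSet(null, head);
    }

    // Reversion, once the GC has trimmed the history to one body: copy
    // its values back into the fields, then CAS the history to null.
    // A failed CAS means a new commit arrived; the object simply stays
    // extended and nothing else needs to be done.
    boolean revertToCompact() {
        Body head = body.get();
        if (head == null || head.previous != null) return false;
        x = head.x;
        y = head.y;
        return body.compareAndSet(head, null);
    }
}
```

Extension prepends a version-0 snapshot beneath the newly committed body; reversion only succeeds while the history still holds a single body, so a concurrent commit harmlessly cancels it.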
However, with the elimination of the metadata we have a big reduction in memory consumption, as we can see in these results for the STMBench7.
To finish my presentation, I would like to revisit the first graph, which shows that the performance of an STM is far from the performance of a medium-grained lock in a large-scale program, and in particular for the STMBench7.
In the case of Vacation, we do not have any lock-based synchronization strategy for comparison with the STM synchronization approach.
So, I include here in my results another STM algorithm for comparison, to which I also applied the LICM optimization technique.