The research work that I describe in this dissertation is concerned with
the problem of shared-memory synchronization in large-scale
programs.
The difficulties of developing fine-grained lock-based synchronization
are well known, and many researchers have argued for the need for
alternative approaches.
Simply put, the main goal of my work is to provide an efficient
alternative to fine-grained locking.
My proposal is based on Software Transactional Memory
(STM) and I implemented it in a well-known STM framework for
Java: Deuce STM.
To that end, I propose a new approach that significantly lowers the
overhead caused by an STM in large-scale programs for which only a
small fraction of the memory is under contention. My solution
combines two novel optimization techniques in a synergistic way,
allowing us to get, for the first time, performance with an STM that
rivals the performance of the best lock-based approaches in some of
the more challenging benchmarks. My approach and experimental
results show that STMs may be the first efficient alternative to locks
for shared-memory synchronization in real-world-sized applications.
In previous work, we proposed a new multi-versioning STM, Adaptive Object Metadata (AOM), that substantially reduces both the memory and the performance overheads associated with transactional locations that are not under contention. AOM is an object-based design that follows the general design of the JVSTM, but it is adaptive because the metadata used for each transactional object changes over time, depending on how objects are accessed. We have now implemented a new version of the AOM that is based on the lock-free version of the JVSTM, and we eliminated all the overheads of accessing objects in the compact layout during read-only transactions. To make the contention-free execution path free of any STM barrier, we duplicated the accessors of the transactional classes, so that one accesses the object fields directly and the other uses STM barriers.
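The accessor duplication described above can be sketched as follows. This is an illustrative model only: class and method names are invented, and the real implementation generates such accessors via bytecode instrumentation in Deuce rather than writing them by hand.

```java
// Sketch of the duplicated-accessor idea (illustrative names, not the actual
// AOM/Deuce-generated code). Each transactional class gets two accessors per
// field: a direct one for the barrier-free path taken over non-contended,
// compact-layout objects, and an instrumented one that goes through the STM
// read barrier.
class StmContext {
    // Stand-in for the STM context: a real barrier would log (ref, value)
    // in the transaction's read-set before returning the value.
    long onReadAccess(Object ref, long value) {
        return value;
    }
}

class Account {
    private long balance;

    Account(long initial) { this.balance = initial; }

    // Fast path: plain field read, no STM barrier.
    long getBalanceDirect() {
        return balance;
    }

    // Slow path: the same read routed through the STM read barrier.
    long getBalanceBarrier(StmContext ctx) {
        return ctx.onReadAccess(this, balance);
    }
}
```

Both accessors return the same value; the point is that the fast path touches no STM machinery at all.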
Profilers find performance bottlenecks in your app, but the information they provide can be confusing. This talk gives you insight into how your profiler and your app really interact: which profiling APIs are available, how they work, and what their implementation on the JVM (OpenJDK) side looks like:
Stack sampling profilers: a stop-motion view of your app
GetCallTrace (JVisualVM case study): The official stack sampling API
Safepoints and safepoint sampling bias
AsyncGetCallTrace (Honest Profiler case study): The unofficial API
JVM Profilers vs System Profilers: No API needed?
In this talk, Gil Yankovitch discusses the PaX patch for the Linux kernel, focusing on memory manager changes and security mechanisms for memory allocations, reads, writes from user/kernel space and ASLR.
Specializing the Data Path - Hooking into the Linux Network Stack (Kernel TLV)
Ever needed to add your custom logic into the network stack?
Ever hacked the network stack but wasn't certain you're doing it right?
Shmulik Ladkani talks about various mechanisms for adding custom packet-processing logic to the network stack's data path.
He covers topics such as packet sockets, netfilter hooks, traffic control actions and eBPF, and discusses their applicable use cases, advantages and disadvantages.
Shmulik Ladkani is a Tech Lead at Ravello Systems.
Shmulik started his career at Jungo (acquired by NDS/Cisco) implementing residential gateway software, focusing on embedded Linux, Linux kernel, networking and hardware/software integration.
51966 coffees and billions of forwarded packets later, with millions of homes running his software, Shmulik left his position as Jungo’s lead architect and joined Ravello Systems (acquired by Oracle) as tech lead, developing a virtual data center as a cloud service. He's now focused on virtualization systems, network virtualization and SDN.
Project Loom is one of the most important changes coming to the JDK. The talk explores the constraints and benefits of the thread-per-request model and why there is a big push towards async frameworks.
It also shows how Project Loom and structured concurrency give a new design paradigm for writing scalable, maintainable code.
In this webinar, we talk about hard-to-test patterns in C++ and show how to refactor them. The difficulty, in this context, does not lie in the code's inherent complexity.
The focus will be on patterns technically difficult to unit test because they may:
* Require irrelevant software to be tested too
* E.g.: 3rd party libraries, classes other than the one under test
* Delay the test execution
* E.g.: sleeps inside code under test
* Require intricate structures to be copied or written from scratch
* E.g.: fakes containing a lot of logic
* Require test details to be included in the production code
* E.g.: #ifdef UNIT_TESTS
* Make changes and/or are dependent on the runtime environment
* E.g.: Creating or reading from files
SIMD machines — machines capable of evaluating the same instruction on several elements of data in parallel — are nowadays commonplace and diverse, be it in supercomputers, desktop computers or even mobile ones. Numerous tools and libraries can make use of that technology to speed up their computations, yet it could be argued that there is no library that provides a satisfying minimalistic, high-level and platform-agnostic interface for the C++ developer.
Various open source cryptographic libraries are used these days to implement
general-purpose cryptographic functions and to provide a secure communication channel over
the internet. These libraries, which implement SSL/TLS, have been targeted by various side
channel attacks in the past that result in leakage of sensitive information flowing over the
network. Side channel attacks rely on inadvertent leakage of information from devices
through observable attributes of online communication. Some of the common side channel
attacks discovered so far rely on packet arrival and departure times (Timing Attacks), power
usage and packet sizes. Our research explores a novel side channel attack that relies on CPU
architecture and instruction sets. In this research, we explored such side channel vectors
against popular SSL/TLS implementations which were previously believed to be patched
against padding oracle attacks, like the POODLE attack. We were able to successfully extract
the plaintext bits in the information exchanged using the APIs of two popular SSL/TLS
libraries.
Multithreading with modern C++ is hard: undefined variables, deadlocks, livelocks, race conditions, spurious wakeups, the double-checked locking pattern, and so on. And at the base is the new memory model, which does not make life any easier. The list of things that can go wrong is very long. In this talk I give you a tour of the things that can go wrong and show how you can avoid them.
Azul Virtual Machine Engineer Douglas Hawkins describes how decisions made by the JVM affect how your code is compiled and run. Learn how this affects application performance and what steps you can take to optimize how the JVM acts on your code.
Accelerating Habanero-Java Program with OpenCL Generation (Akihiro Hayashi)
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.
The Java Memory Model describes how threads in the Java programming language interact through memory. Together with the description of single-threaded execution of code, the memory model provides the semantics of the Java programming language.
It is crucial for a programmer to know how, according to the Java Language Specification, to write correctly synchronized, race-free programs.
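As a small illustration of the happens-before reasoning the Java Memory Model enables (a minimal sketch, not taken from the talk):

```java
// Minimal safe-publication example under the Java Memory Model. The volatile
// write to `ready` happens-before any subsequent volatile read that observes
// true, so a reader that sees ready == true is guaranteed to also see the
// preceding write to `value`. Without `volatile`, this class would have a
// data race on `ready` and no such guarantee.
class Handoff {
    private int value;
    private volatile boolean ready;

    void publish(int v) {
        value = v;      // ordinary write, ordered before the volatile write
        ready = true;   // volatile write: "releases" value to other threads
    }

    Integer tryConsume() {
        if (ready) {
            return value;  // the volatile read above "acquires" the value
        }
        return null;       // not published yet
    }
}
```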
17. STM… overheads

[Diagram: Thread 1 and Thread 2 each execute an @Atomic method m() over shared memory; between Trx Begin and Trx Commit, every read (R) and write (W) goes through an STM barrier.]

@Atomic void m(){
  ...
  ...
}
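What the diagram depicts can be made concrete with a small sketch. This is an illustrative model of the instrumented form of an @Atomic method, not the actual bytecode Deuce emits; the Stm class is a toy stand-in.

```java
// Illustrative sketch of what STM instrumentation does to an @Atomic method:
// each shared read/write becomes a barrier call, bracketed by Trx Begin and
// Trx Commit. A real STM would also log read/write-sets and validate on commit.
class Counter {
    int value;

    // Source view:       @Atomic void increment() { value = value + 1; }
    // Instrumented view:
    void incrementInstrumented(Stm stm) {
        stm.begin();                              // Trx Begin
        int v = stm.readBarrier(this, value);     // read barrier (R)
        stm.writeBarrier(this, v + 1);            // write barrier (W)
        stm.commit(this);                         // Trx Commit
    }
}

class Stm {
    private int pendingValue;
    private boolean hasWrite;

    void begin() { hasWrite = false; }            // start a new transaction

    int readBarrier(Counter ref, int currentValue) {
        // a real STM would log the read in the read-set for later validation
        return currentValue;
    }

    void writeBarrier(Counter ref, int newValue) {
        // a real STM would buffer the write in the write-set
        pendingValue = newValue;
        hasWrite = true;
    }

    void commit(Counter ref) {
        // a real STM would validate the read-set before publishing
        if (hasWrite) ref.value = pendingValue;
    }
}
```

The point of the slides that follow is that every one of these barrier calls costs time even when no other thread touches the data.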
18. STM… overheads

[Diagram: the same two-thread execution of the @Atomic method m(), again with every shared-memory read and write going through an STM barrier.]

@Atomic void m(){
  ...
  ...
}
19. [Diagram: STM versus lock-based execution. With STM, Thread 1 and Thread 2 run concurrently but pay a barrier on every shared-memory access; with a lock, Thread 2 simply waits while Thread 1 accesses the shared memory.]
20. A large-scale benchmark for Java

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads; series: seq-1thread.]
21. A large-scale benchmark for Java

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock.]
22. A large-scale benchmark for Java

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock, jvstm.]

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
23. A large-scale benchmark for Java

[Chart: StmBench7 read dominated; throughput (×10³ ops/sec) vs. threads (1 to 48); series: seq-1thread, coarse-lock, medium-lock, jvstm.]

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
24. [Diagram: repeat of the STM-versus-lock comparison: with STM every access pays a barrier, while with a lock Thread 2 waits for Thread 1.]
25. [Diagram: the STM-versus-lock comparison again, highlighting the barrier on each shared-memory read and write.]
35. 3 cases of useless STM barriers
• non-contended classes
• non-shared objects
• shared but frequently non-contended objects
36. 3 different techniques
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
37. Implemented for the JVM
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
Deuce STM framework
TL2, LSA, JVSTM
38. May be combined…
• non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime time analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
Deuce STM framework
TL2, LSA, JVSTM
39. • non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime time analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
40. Transparent STM API
public class Worm implements IWorm {
    final int id;
    final int headSize;
    final int speed;
    final BodyCoord[] body;

    public void moveBody(ICoordinate newCoordinate) {
        for (BodyCoord c : body) {
            ...
            c.update(newCoordinate);
            ...
        }
    }
}

[The slide flags four STM barriers on this method's field and array accesses.]
41. Relax the STM API Transparency
@NoSyncArray(Immutable)
@NoSyncArray(TransactionLocal)
@NoSyncArray(ThreadLocal)
@NoSyncField(Immutable)
@NoSyncField(TransactionLocal)
@NoSyncField(ThreadLocal)
Carvalho & Cachopo, ICA3PP’11
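A sketch of how such annotations look at use sites. The annotation names come from the slide, but the declarations below (the enum, retention policy, and exact value syntax) are assumptions for illustration; in Deuce they are provided by the framework.

```java
// Sketch of the extended Deuce API with auxiliary annotations. The
// @NoSyncField name is from the slide; NoSyncKind and the declarations
// here are illustrative assumptions, not the real framework types.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

enum NoSyncKind { Immutable, TransactionLocal, ThreadLocal }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NoSyncField { NoSyncKind value(); }

class WormHead {
    // Never mutated after construction: the instrumentation may skip barriers.
    @NoSyncField(NoSyncKind.Immutable)
    final int headSize;

    // Only ever touched by the thread that owns the worm: no barriers needed.
    @NoSyncField(NoSyncKind.ThreadLocal)
    int cursor;

    WormHead(int headSize) { this.headSize = headSize; }
}
```

The annotations let the programmer tell the instrumentation which memory locations can never be contended, trading a little transparency for the removal of useless barriers.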
42. In 5 different memory location definitions
JWormBench
@NoSyncArray(Immutable)
@NoSyncArray(TransactionLocal)
@NoSyncArray(ThreadLocal)
@NoSyncField(Immutable)
@NoSyncField(TransactionLocal)
@NoSyncField(ThreadLocal)
Carvalho & Cachopo, ICA3PP’11
43. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s); series: seq-1thread.]
44. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm.]
45. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync.]
46. Deuce STM API with Auxiliary Annotations

Eliminates the following overheads:

[Table crossing the optimizations (so far, the STM API with annotations) against the overheads: the STM API itself, the STM metadata, and the logging of the read-set and write-set.]
47. • non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime time analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
49. Captured Memory

[Diagram: transactions TRX 1 and TRX 2, each with a read-set and a write-set, referencing objects such as :Point, :Account and :Person. Objects are captured by their allocating transaction.]

Proposed by Dragojevic et al. for an unmanaged environment (Dragojevic et al., SPAA'09).
50. e.g. Read STM Barrier in Deuce

function onReadAccess(ref, addr, val, ctx)
    return ctx.onReadAccess(ref, addr, val)
end function

[Diagram: ctx holds the transaction's read-set and write-set; ref and addr identify the field being read, and val is its current value.]
51. Runtime Capture Analysis

The plain read barrier:

function onReadAccess(ref, addr, val, ctx)
    return ctx.onReadAccess(ref, addr, val)
end function

gains a capture test:

function onReadAccess(ref, addr, val, ctx)
    if isCaptured(ref, ctx) then
        return val
    else
        return ctx.onReadAccess(ref, addr, val)
    end if
end function
52. Runtime Capture Analysis

function onReadAccess(ref, addr, val, ctx)
    if isCaptured(ref, ctx) then
        return val
    else
        return ctx.onReadAccess(ref, addr, val)
    end if
end function

To improve the STM performance: Overhead(isCaptured) << Overhead(ctx.onReadAccess)
53. LICM
Lightweight Identification of Captured Memory
Carvalho & Cachopo, PPoPP’13

• A runtime capture analysis technique
• For a managed runtime environment, such as Java
• Lightweight
54. LICM
Lightweight Identification of Captured Memory
Carvalho & Cachopo, PPoPP’13

[Diagram: the transaction context ctx carries a fingerprint (the transaction id), and each object carries an owner field set by its allocating transaction.]

function onReadAccess(ref, addr, val, ctx)
    if isCaptured(ref, ctx) then
        return val
    else
        return ctx.onReadAccess(ref, addr, val)
    end if
end function

static boolean isCaptured(Object ref, Context ctx){
    return ctx.fingerprint == ref.owner;
}
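The capture test can be made runnable with a small sketch. Field and method names follow the slide; the scheme of allocating one fresh fingerprint object per transaction is an assumption about the mechanism, not the exact LICM implementation.

```java
// Runnable sketch of LICM's capture test. The fingerprint is an object whose
// identity names the transaction; each transactional object records, at
// allocation time, the fingerprint of the transaction that allocated it.
// An identity comparison then decides, with no synchronization, whether the
// object is transaction-local (captured) and may skip the STM barrier.
class Fingerprint { }

class TxObject {
    Fingerprint owner;   // set once, by the allocating transaction
}

class Context {
    final Fingerprint fingerprint = new Fingerprint();

    // Allocation inside the transaction tags the object with our fingerprint.
    <T extends TxObject> T capture(T obj) {
        obj.owner = fingerprint;
        return obj;
    }

    // The slide's test: a single reference comparison on the fast path.
    boolean isCaptured(TxObject ref) {
        return fingerprint == ref.owner;
    }
}
```

An object allocated by one transaction is captured for that transaction and for no other, which is exactly the property the cheap identity check exploits.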
57. Challenge

An efficient process of generating fingerprints:
• Avoiding further synchronization
• Avoiding the counter rollover

[Diagram: several objects, each with an owner field pointing to the fingerprint (Trx Id) of its allocating transaction.]
58. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync.]
59. JWormBench

These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.

[Chart: 50% RO trxs, O(n²), N Reads, 1 Write; throughput (×10³ ops/s) vs. threads (1 to 48); series: seq-1thread, jvstm, jvstm-5nosync, jvstm-1nosync-licm.]
60. LICM

Eliminates the following overheads:

[Table crossing the optimizations (STM API with annotations, LICM) against the overheads: the STM API itself, the STM metadata, and the logging of the read-set and write-set.]
61. • non-contended classes
=> Compile time technique
• non-shared objects
=> Runtime analysis
• shared but frequently non-contended objects
=> Runtime adaptive technique
71. AOM
Eliminates the following overheads:
[Table relating each optimization (STM API with Annotations, LICM, AOM) to the overheads it eliminates: STM API, STM metadata, logging of the read-set and write-set]
73.
• non-shared objects => runtime analysis (LICM)
• shared but frequently non-contended objects => runtime adaptive technique (AOM)
STM versus medium-grained lock in a large-scale benchmark, such as STMBench7
74. STMBench7
[Plot: throughput (×10³ ops/s) vs. number of threads (1–48) for the STMBench7 read-dominated workload; series: medium-lock, coarse-lock, jvstm]
These tests were performed on a machine with 4 AMD Opteron 6168
processors, each one with 12 cores, resulting in a total of 48 cores.
77. Vacation
These tests were performed with the following configuration:
-n 256 -q 90 -u 98 -r 262144 -t 65536, proposed by [Cao Minh et al., 2008].
[Plot: throughput (×10³ ops/s) vs. number of threads (1–48) for Vacation low contention; series: tl2, tl2-licm, jvstm, jvstm-licm-aom]
78. Main Contributions
• JWormBench—A flexible benchmark for transactional synchronization
• 3 optimization proposals
  – Extended Deuce API
  – LICM—Lightweight Identification of Captured Memory
  – AOM—Adaptive Object Metadata
• Implementation of these techniques in the Deuce STM framework
• Support for in-place metadata in the Deuce STM framework
• Fast access path for non-contended objects: LICM + AOM
79. Main Contributions
[Diagram: a transactional Point in the extended layout, with a history of VBoxBody entries (version 23 and version 0) that the transaction traverses, logging the access in its read-set and write-set]
80. Fast path for non-contended objects
[Diagram: with LICM + AOM, the transaction reads the Point’s fields directly through the compact layout, bypassing the VBoxBody history and the read-set/write-set logging]
81. International conferences and workshops:
• STM with transparent API considered harmful. Springer-Verlag, ICA3PP’11, Melbourne, Australia.
• Adaptive object metadata to reduce the overheads of a multi-versioning STM. MULTIPROG’12, Paris, France.
• Objects with adaptive accessors to avoid STM barriers. WTM’12, Bern, Switzerland.
• Runtime elision of transactional barriers for captured memory. ACM, PPoPP’13, Shenzhen, China.
• Lightweight identification of captured memory for Software Transactional Memory. Springer-Verlag, ICA3PP’13, Sorrento, Italy. Best Paper Award.
In progress:
• Journal of Parallel and Distributed Computing, Elsevier: Optimizing memory transactions for large-scale programs.
• Information Sciences, Elsevier: Optimizing memory transactions with lightweight capture analysis.
Editor's Notes
I’m going to present my PhD work, entitled Optimizing Memory Transactions for Large-Scale Programs.
Today, any computer provides more than one processing unit.
We can find them everywhere, even in our personal devices: tablets, mobile phones, laptops, …
However, developing software that takes advantage of these multiple processors is not as easy as developing software for a single processing unit. Many problems arise from programming for multiprocessors.
One of those well-known problems is shared-memory synchronization.
I am talking about Software running multiple parallel threads that read and write to shared memory concurrently.
And we must synchronize these concurrent accesses to avoid the eventual occurrence of inconsistent data.
So, this is the scope of the work that I have developed in my PhD.
…and my general goal is to introduce new techniques for shared-memory synchronization that are able to increase the overall performance of this kind of software.
So, let’s take a brief look over existing solutions.
The most common solutions use lock-based techniques…
…and they are still present in modern software environments, such as Java or .NET.
However, the difficulties of developing this kind of solution are widely known…
…and not all programmers are able to develop fine-grained lock solutions.
Here we have an example of just a small part of the code of the STMBench7 benchmark responsible for the implementation of the medium-lock strategy, and we can observe the complexity of managing different locks for different operations. I chose this code because STMBench7 is one of the most complex and large-scale benchmarks for parallel applications, and I will use it in several examples along my presentation.
On the other hand, when we avoid this kind of approach and simply use a coarse-grained lock…
…such as simply using the synchronized keyword to synchronize shared access to a method, then we may prevent scalability and parallel execution.
Now, in this example, every thread executing method m() acquires a coarse lock before accessing shared memory.
And in this case, when the second thread wants to access the shared memory and thus tries to acquire the same lock, it must wait…
…whereas the first thread continues to read and write the shared memory.
And the second thread waits until the former releases the global lock.
When that happens, the second thread successfully acquires the lock and proceeds to access the shared memory.
So, concluding: although we have multiple processing units, we are not taking advantage of the capacity to execute tasks in parallel.
And thus, we may have tasks running one after another, and so on.
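The coarse-grained scenario just described can be sketched in Java; the class and method names below are illustrative (they are not from the benchmark), but the behavior is the one on the slides: one intrinsic lock serializes all accesses to the shared state, so threads queue up even when their work could overlap.

```java
// Coarse-grained locking: one intrinsic lock guards all shared state.
class SharedCounter {
    private long value = 0;

    // Every thread entering m() acquires the same object-wide lock,
    // so concurrent calls are fully serialized.
    synchronized void m() {
        value++;
    }

    synchronized long get() {
        return value;
    }
}

class CoarseLockDemo {
    // Runs several threads against the single lock and returns the
    // final counter value (always threads * incrementsPerThread,
    // because the lock prevents lost updates).
    static long run(int threads, int incrementsPerThread) {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) counter.m();
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.get();
    }
}
```

The result is correct, but at the cost of running the critical sections one after another, exactly as in the animation.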
But today, there are alternatives to lock-based solutions, such as Transactional Memory. Instead of using a pessimistic approach, Transactional Memory uses an optimistic approach that lets memory accesses proceed in parallel.
It provides an abstraction that automatically uses fine-grained locks just where they are needed.
So, taking again the previous example, now we replace the coarse lock with the STM infrastructure…
…and we replace the synchronized keyword with an atomic keyword provided by the STM API, or an annotation, as happens in the case of Deuce STM.
So, now we may have both threads concurrently accessing shared data in parallel.
But, in fact, this is not the real scenario, because now we have to include in our analysis also the STM-induced overheads that are related to…
the transaction bookkeeping…
… and the overheads of the implicit STM barriers that are now performed by all memory accesses.
So, although we may perform memory accesses in parallel without violating data consistency, we may not be able to improve the overall performance in comparison with a coarse-grained lock solution.
I observed this behavior in several experiments with large-scale programs.
And here I have an example for STMBench7, where the benchmark automatically transactified with the JVSTM performs even worse than a coarse-lock strategy.
We can also observe the behavior of this benchmark with a medium-lock strategy, which is not easy to program but is already provided in this case; it represents the performance we would like to achieve, desirably with less programming effort.
So, my goal is to reduce the STM barriers induced overheads and thus improve the overall performance.
And, the main question is: can we really reduce the overheads of an STM Barrier and thus improve the overall performance pf software synchronized with an STM?
Lets take a deeper look over an STM barrier and compare a simple memory access with a transactional memory access.
For instance consider an Object Point with two fields X and Y. To get the X field value of this object we have to get the reference to that object and access the corresponding field.
On the other hand, and considering now this same object Point as a transactional location, then we must take into account the additional metadata that is imposed by the STM to all transactional locations.
In the case of the JVSTM this metadata corresponds to a history of values.
We call each element of this history a box body.
A box body stores the version of the transaction that has committed that body and its corresponding value.
The transactional object points to the head of the versions’ history corresponding to the most recent committed value.
So considering that we are looking for the oldest value, then we must track all the versions until we reach the desired value.
And a transaction must also keep track of the locations that are read and written in the read-set and write-set.
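The versioned-history design just described can be condensed into a minimal sketch; the VBox/VBoxBody names below are modeled after the JVSTM design but are a simplified illustration, not the framework's actual code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a JVSTM-style versioned box: each transactional location
// keeps a history of (version, value) bodies, newest first.
class VBoxBody<T> {
    final T value;
    final int version;           // version of the committing transaction
    final VBoxBody<T> previous;  // older body, or null for version 0

    VBoxBody(T value, int version, VBoxBody<T> previous) {
        this.value = value;
        this.version = version;
        this.previous = previous;
    }
}

class VBox<T> {
    final AtomicReference<VBoxBody<T>> body;

    VBox(T initial) {
        // The initial value is version 0 of the history.
        this.body = new AtomicReference<>(new VBoxBody<>(initial, 0, null));
    }

    // A committing writer prepends a new body to the history.
    void commit(T newValue, int txVersion) {
        body.set(new VBoxBody<>(newValue, txVersion, body.get()));
    }

    // A reader walks the history until it finds the newest body whose
    // version is not greater than its own read version -- this is the
    // traversal that makes a read barrier cost far more than a plain
    // field access.
    T read(int txReadVersion) {
        VBoxBody<T> b = body.get();
        while (b.version > txReadVersion) {
            b = b.previous;
        }
        return b.value;
    }
}
```

With a box holding 17 committed at version 0 and 73 committed at version 23, a transaction reading at version 10 still sees 17, while one reading at version 23 or later sees 73.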
So it is easy to understand that an STM barrier typically requires orders of magnitude more machine cycles than a simple memory access.
So the question is: do we really need to incur all these overheads for all memory locations?
Yes, for contended data, to synchronize concurrent accesses and guarantee data consistency.
But from my experiments I observed that the vast majority of the memory locations managed by a program are not under contention.
And in these cases, the corresponding memory accesses perform useless STM barriers.
So this is the key that will allow me to reduce the STM-induced overheads. I will introduce techniques that avoid the tasks executed by an STM barrier for non-contended data.
And now when we access an object that is not under contention we may directly read its proper fields and avoid the Metadata indirections and further STM tasks.
I identified 3 major situations of useless STM barriers:
non-contended classes -- classes whose objects are never under contention, because they are immutable, transaction-local, or thread-local.
non-shared objects -- classes with both shared and non-shared objects.
shared objects -- objects that on many occasions are not under contention and thus do not need to perform STM barriers in those cases.
For each situation I used a different optimization technique.
I chose one of the most used software development environments worldwide -- Java.
And, to that end, I implemented my proposals in Deuce -- an STM framework -- which makes my optimization techniques available to any supported STM algorithm, except the last technique, which for now was specifically designed for the JVSTM.
Another advantage of my proposals is that…
Because each technique deals with a different scenario then we may combine them in several ways to avoid different kinds of overheads.
So, now lets look to each proposal individually.
To introduce my first technique, I will show an example with part of the code of the class Worm of the WormBench benchmark for Java.
Every memory access inside the moveBody is replaced with an STM Barrier by the STM compiler.
And, every method invoked from this method may also include STM barriers.
So, here I am emphasizing two statements that include STM barriers: accessing the body array and invoking the update method of a coordinate object.
The STM barriers that modify the coordinate object are necessary, but what about the implicit barriers from the foreach statement?
This array is not being modified and in this case the final keyword does not prevent the use of STM barriers, because it just says that the array reference is immutable and not its elements.
So my first technique extends Deuce API with a couple of annotations that let the programmer specify the behavior of certain locations.
So in the previous example we may use the first annotation to let the compiler know that array is immutable and thus avoid the use of STM barriers.
And, this annotation can also be parameterized in a different way according to the behavior of the annotated location as…
Similarly, in my solution I also included a specific annotation to control the transactification of fields.
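To make the mechanism concrete, here is a sketch of what such an annotation could look like. The annotation name (@NoBarrier) and its parameter are hypothetical, chosen only to illustrate the idea; they are not necessarily the exact names of the extended Deuce API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation: tells the STM compiler that accesses to the
// annotated location never need STM barriers, and why.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NoBarrier {
    String value() default "immutable";
}

class Worm {
    // 'final' only says the array *reference* is immutable, not its
    // elements; the annotation is what tells the compiler the elements
    // are never mutated, so barriers can be elided.
    @NoBarrier("immutable")
    final int[] body = {1, 2, 3};

    // An instrumenting compiler (or, here, a reflective check) can ask
    // whether a field was marked barrier-free.
    static boolean isBarrierFree(String fieldName) {
        try {
            return Worm.class.getDeclaredField(fieldName)
                             .isAnnotationPresent(NoBarrier.class);
        } catch (NoSuchFieldException e) {
            return false;
        }
    }
}
```

At instrumentation time, a field flagged this way would get a plain field access emitted in place of the read barrier.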
In my proposal I used the JWormBench to explore the effects on performance of relaxing the transparency of an STM with these annotations.
And I used 3 annotations in 5 different memory location definitions to avoid useless STM barriers…
…and with this optimization I got a speedup in performance of the TL2 STM of almost 10 times.
I ran these tests in a machine with 48 cores.
So, this technique eliminates all the overheads of an STM barrier for memory locations of classes that only have non-contended objects.
However, for classes that have both shared and non-shared objects we are not able to use this technique.
In that case we have to identify non-shared objects at the object level and not at the level of its class definition.
So lets start to see an example of classes that have both shared and non-shared objects.
Here we have two transactions concurrently accessing a couple of shared objects: an Account, a Person and a Point.
Now if transaction 1 instantiates some objects, for instance a Point and an Account during its execution, then we are sure that transaction 2 cannot access these objects because they are not visible outside the transaction’s boundaries until it commits successfully.
So, these objects correspond to Transaction-local memory that is memory allocated inside a transaction, which cannot escape.
Dragojevic introduced the concept of Captured Memory as the memory captured by its allocating transaction.
In this case, all accesses to objects in captured memory do not need to perform a full barrier.
So, Dragojevic proposed a Capture Analysis technique for an unmanaged environment to identify if a memory location is captured by a transaction, or not, and thus if it requires a full STM barrier, or not.
To better understand the idea of a runtime Capture Analysis technique lets see an example of an STM Barrier.
Here we have a simplified view of a read barrier in the Deuce framework. This barrier receives as arguments: the target object, the field’s address, the value of that field, and a context that represents the transaction object.
I am hiding other low-level details in this code.
So, for now, I just want to show that typically an STM barrier redirects the memory access to the corresponding transaction through the context reference.
So, the goal of the capture analysis is to directly access the transactional object, instead of invoking the transaction, when that object is captured by its allocating transaction, and thus avoid the further tasks of an STM barrier, such as keeping track of the read-set and write-set.
Now, here we have the same STM Barrier performing a runtime capture analysis through the isCaptured method. If the isCaptured returns true then the Read Barrier just needs to return the field’s value.
To improve the STM performance, the overhead of the capture analysis implemented by the isCaptured function should be lower than the overhead of performing a full STM barrier.
To that end I implemented an efficient algorithm of runtime capture analysis for a managed runtime environment, which I called LICM.
The idea of LICM is that every transaction should keep a unique identifier, called a fingerprint, that is recorded in an owner field of every newly instantiated object.
So, every time a transaction finds an object with an owner equal to the transaction’s fingerprint, it can avoid the full STM barrier.
The capture analysis algorithm just needs to perform an identity comparison between the transaction’s fingerprint and the object’s owner.
So, revisiting our previous example: now, every newly allocated object has an owner corresponding to its allocating transaction’s fingerprint.
And, when the allocating transaction accesses its captured objects, the transaction identifies those objects as owned objects and thus does not require a full barrier.
Later, when the transaction commits successfully and those objects become visible, any other transaction that accesses those objects will perform a full STM barrier, because it has a fingerprint different from the owner of those objects.
So the main challenge of the LICM implementation is to find an efficient process of generating fingerprints.
So, I chose a newly allocated instance of class Object as a fingerprint.
This solution has the advantage of relying on the garbage collector to provide uniqueness and the ability to recycle unused fingerprints.
Although we are creating one more object per transaction, I am working in the scope of large-scale programs.
So the overhead of this fingerprint object is very low in comparison to the whole program working-set.
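The fingerprint mechanism can be sketched as follows; the class names are illustrative (in Deuce, the context stands for the transaction object), but the key moves are from the source: a fresh Object as fingerprint, recorded into an owner field at allocation, and a single identity comparison as the capture analysis.

```java
// An object managed by the STM, tagged at allocation time.
class TxObject {
    Object owner;   // fingerprint of the allocating transaction
    int value;
}

// Sketch of a transaction context carrying a LICM fingerprint.
class Context {
    // A plain 'new Object()' is unique while reachable, and the GC
    // recycles it once the transaction is gone: no extra
    // synchronization, no counter, no rollover to worry about.
    final Object fingerprint = new Object();

    // Allocation inside the transaction tags the object as captured.
    TxObject allocate(int value) {
        TxObject o = new TxObject();
        o.owner = this.fingerprint;
        o.value = value;
        return o;
    }

    // Capture analysis: a single identity comparison.
    static boolean isCaptured(TxObject ref, Context ctx) {
        return ctx.fingerprint == ref.owner;
    }
}
```

A transaction that allocated an object sees it as captured and skips the full barrier; any other transaction fails the identity check and takes the normal barrier path.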
Finally, my experimental results prove the effectiveness of this technique.
These are the results previously obtained with the first optimization technique.
And now, combining also the LICM, we get even better performance.
In this case, the LICM is able to identify transaction-local objects that were not excluded from transactification with the previous technique, because their classes have both shared and non-shared objects.
Beyond that, all the transaction bookkeeping and metadata that is eliminated with this optimization approach also has an impact on memory consumption.
Finally, the last scenario corresponds to objects that are subject to concurrent modifications for a small period of time, but which stay unmodified after that period and for the rest of the program execution.
So, in this case, we are incurring the performance and memory overheads of the metadata.
So, with my third technique, the AOM – Adaptive Object Metadata – I propose that instead of a unique layout, which includes the STM Metadata, transactional objects should have an adaptive layout that includes two different object layouts.
two different object layouts:
a compact layout, where no memory overheads exist,
and an extended layout, used when the object may be under contention.
When a transactional object is created it starts in the compact layout and later when it is updated by a transaction it will be extended.
The original fields values correspond to the version 0 of the versioned history.
Because the JVSTM has a garbage collector algorithm that removes old versions, eventually, if this object is no longer written by any transaction…
…it will end up with just one box body.
And, in this case we can revert it back to the compact layout, discarding the additional metadata.
All the operations of extending and reverting an object are lock-free and should guarantee the progress of the whole JVSTM algorithm that is also lock-free.
So, let’s take a deeper look at each operation. And, to reason about the tasks of each reversion and extension process, you must consider that all body elements of a history are IMMUTABLE.
An object is extended during the write-back of a transaction if it is in the compact layout, which involves:
creating a snapshot of the object
creating a body for that snapshot that is marked with version 0 and
creating the body for the new value and version that points to the entry with version 0.
The operation proceeds with a compare and swap, and if it fails that means that another transaction helped in the write-back phase and this new value is already committed.
On the other hand the reversion process involves:
First, it checks whether the first body of the object’s history points to a null previous body.
If it does, which means that there is only one version in the history, it copies the values contained in that body to the corresponding fields in the object,
and then it finishes with a compare-and-swap of the body of the object to null.
The CAS fails if any other transaction commits new values for this object. So, when the CAS fails nothing else needs to be done, because the object stays in the extended layout.
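The two AOM layout transitions can be sketched as below, under the assumption of a Point-like object with two fields. The class and method names are illustrative, and the real JVSTM-based implementation handles more than this minimal version, but the compact/extended distinction and the two CAS steps follow the description above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable body of the version history (values plus version).
class Body {
    final int x, y;
    final int version;
    final Body previous;   // older body, or null

    Body(int x, int y, int version, Body previous) {
        this.x = x; this.y = y; this.version = version; this.previous = previous;
    }
}

// An AOM object: compact layout when 'body' is null (fields hold the
// values directly), extended layout when 'body' points to a history.
class AomPoint {
    volatile int x, y;
    final AtomicReference<Body> body = new AtomicReference<>(null);

    boolean isCompact() { return body.get() == null; }

    // Extension, during write-back: snapshot the fields as version 0,
    // put the new value on top, and CAS the history in. A failed CAS
    // means another transaction helped and the value is already committed.
    boolean extend(int newX, int newY, int txVersion) {
        Body snapshot = new Body(x, y, 0, null);
        Body head = new Body(newX, newY, txVersion, snapshot);
        return body.compareAndSet(null, head);
    }

    // Reversion, once the GC has trimmed the history to one body: copy
    // its values back into the fields, then CAS the history to null.
    // A failed CAS means a new commit arrived; the object simply stays
    // extended and nothing else needs to be done.
    boolean revertToCompact() {
        Body head = body.get();
        if (head == null || head.previous != null) return false;
        x = head.x;
        y = head.y;
        return body.compareAndSet(head, null);
    }
}
```

Extension prepends a version-0 snapshot beneath the newly committed body; reversion only succeeds while the history still holds a single body, so a concurrent commit harmlessly cancels it.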
However, with the elimination of the metadata we have a big reduction in memory consumption, as we can see in these results for the STMBench7.
To finish my presentation, I would like to revisit the first graph, which shows that the performance of an STM is far from the performance of a medium-grained lock in a large-scale program, and in particular for the STMBench7.
In the case of Vacation, we do not have any lock-based synchronization strategy for comparison with the STM synchronization approach.
So, I include here in my results another STM algorithm for comparison, to which I also applied the LICM optimization technique.