In previous work, we proposed a new multi-versioning STM, adaptive object metadata (AOM), that substantially reduces both the memory and the performance overheads associated with transactional locations that are not under contention. AOM is an object-based design that follows the general design of the JVSTM, but it is adaptive because the metadata used for each transactional object changes over time, depending on how the object is accessed. We have now implemented a new version of AOM that is based on the lock-free version of the JVSTM, and we have eliminated all the overheads of accessing objects in the compact layout during read-only transactions. To make the contention-free execution path free of any STM barrier, we duplicated the accessors of the transactional classes, so that one accesses the object fields directly and the other uses STM barriers.
1. Objects with adaptive accessors to avoid STM barriers
F. Miguel Carvalho and João Cachopo
WTM-2012
Software Engineering Group
Bern, Switzerland, April 10, 2012
23. AOM
[Figure: an object with fields x, y, and z shown in two layouts. Compact: a Header followed by the field values (32767, 2147483647, 34.7). Extended: the Header points to a chain of :VBoxBody nodes, each with version, values, and next; versions 23 and 19 are shown.]
33. AOM
• 1st release (Multiprog 12)
– implemented with the lock-based JVSTM
– reversion and extension operations specified by an AdaptiveObject interface
• 2nd release:
– implemented with the lock-free JVSTM
– AdaptiveObject as the root base class
– provides a transparent API (like Deuce STM)
34. AOM with JVSTM lock based
• increases the speedup between 13% and 35% (* Multiprog12)
[Figure: LeeTM speedup versus number of threads (1 to 16); panels: Circuit Main and Circuit Mem]
35. new AOM with JVSTM lock free
• increases the speedup between 5% and 36%
[Figure: LeeTM speedup versus number of threads (1 to 48), JVSTM compared with AOM; panels: Circuit Main and Circuit Mem]
36. STAMP Vacation, low++ & long trxs & RO
• Low contention
• ++, large data sets
• -n = 256 (longer transactions), instead of the recommended 2 or 4
• 3 kinds of transactions:
– Delete and create items: car, flight or room
– Remove defaulter clients (bill > 0)
– Query and reserve an item: car, flight or room
Split into 2 transactions: RO + RW
42. Future Work
• An improved reversion algorithm
• New design for AOM that keeps the contention-free execution
path without any barrier or validation
• Integrate the AOM compiler in the implementation of the
Deuce STM
My name is Miguel. I come from Portugal and I work at the Software Engineering Group of Inesc-Id, part of the Technical University of Lisbon.
I’m here to present my work on “Objects with adaptive accessors to avoid STM barriers”.
This work was developed by me and Professor João Cachopo.
General goal of my work => increase application performance!
How?
- parallelizing sequential programs;
But parallelization raises other problems, such as concurrent access to shared data.
To that end we use:
- STM to synchronize access to shared data.
But in many cases the STMs introduce overheads that are larger than the gains from the parallelization, rendering its benefits useless.
In fact, many STMs show good performance results in micro-benchmarks. But typically these micro-benchmarks are very simple applications that manipulate a single kind of data structure, such as a SkipList, a HashTable, or a RedBlackTree, and perform operations on it, such as deleting, moving, updating, and inserting elements.
But in more realistic benchmarks, which use more complex operations and more complicated data structures, as happens with STMBench7, the performance results are not so good.
STM barriers are one of the reasons that prevent an STM from achieving better performance.
Because, instead of just reading or updating a memory location, these STM barriers:
=> Need to consult the metadata associated with the memory locations;
=> And keep track of the read-set and write-set.
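To make this cost concrete, here is a minimal, generic sketch of what a read barrier and a write barrier typically do in an STM that buffers its writes. It is not JVSTM or AOM code; the names `TxLocation`, `SketchTransaction`, `readBarrier`, and `writeBarrier` are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical transactional location: a value plus its metadata (a version).
final class TxLocation {
    int value;
    int version;

    TxLocation(int value) {
        this.value = value;
    }
}

// Hypothetical transaction that buffers its writes until commit.
final class SketchTransaction {
    // Read-set: each location read, with the version observed at read time.
    final Map<TxLocation, Integer> readSet = new HashMap<>();
    // Write-set: each location written, with the pending new value.
    final Map<TxLocation, Integer> writeSet = new HashMap<>();

    // Read barrier: checks the write-set, consults the metadata, logs the read.
    int readBarrier(TxLocation loc) {
        Integer pending = writeSet.get(loc);
        if (pending != null) {
            return pending;                 // read-after-write in the same transaction
        }
        readSet.put(loc, loc.version);      // bookkeeping needed for later validation
        return loc.value;
    }

    // Write barrier: buffers the new value instead of updating memory in place.
    void writeBarrier(TxLocation loc, int newValue) {
        writeSet.put(loc, newValue);
    }
}
```

By comparison, an uninstrumented read is just `loc.value`: no map lookup and no logging.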
Many different approaches have been tried to mitigate the overheads incurred by STM Barriers.
My work is based on the approach of a Multi-versioning STM.
From the available implementations of a multi-versioning STM, we chose the JVSTM, which was the seminal STM using a multi-versioning approach.
Others: LSA and SMV (Selective Multi-Versioning).
Good for read intensive workloads.
A multi-versioning STM has the big advantage that the read-only transactions never abort and always succeed.
So under read-dominated scenarios a multi-versioning STM can increase the overall performance.
Yet a multi-versioning STM has a big drawback: to store the multiple versions of a transactional location, it may incur large memory overheads.
Furthermore, even when shared data is not under contention, we may need to traverse several versions to reach the desired value.
So, although a multi-versioning STM is good for read-dominated workloads, these problems mean that this approach is not a consensual choice.
Our goal is to take advantage of the good performance of multi-versioning under read-dominated scenarios and, at the same time, reduce the runtime overheads in memory and performance when the transactional locations are not under contention.
Before introducing the AOM I will give a brief description of the JVSTM.
We have already seen two presentations related to the JVSTM, so I will go through this description very quickly.
Let’s start by analyzing the JVSTM.
In the JVSTM a transactional location is known as a versioned box.
Here we have a counter object with a transactional location called current.
Instead of storing a single value, a versioned box keeps a history of values.
Each element of the versions’ history is a box body.
The version associated with each value corresponds to the number of the transaction that has committed that value.
A versioned box points to the head of the versions’ history corresponding to the most recent committed value.
When a transaction starts, it gets its transaction number from a global counter, lastCommitted.
This counter is updated by every read-write transaction that commits successfully.
Then, a transaction reads the most recent body whose version is equal to or lower than the transaction’s version.
With this approach, read-only transactions always see a valid snapshot of the memory, corresponding to the version captured at the moment they began. In other words, read-only transactions are serialized at the instant they began.
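The versioned-box read described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the JVSTM API: the names `SketchBody`, `SketchVBox`, `read`, and `commit` are hypothetical, and the commit path omits all validation and synchronization that a real commit needs.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical box body: one committed value in the versions' history.
final class SketchBody<T> {
    final T value;
    final int version;          // number of the transaction that committed this value
    final SketchBody<T> next;   // next older version, or null

    SketchBody(T value, int version, SketchBody<T> next) {
        this.value = value;
        this.version = version;
        this.next = next;
    }
}

// Hypothetical versioned box: points to the head (most recent) body.
final class SketchVBox<T> {
    private final AtomicReference<SketchBody<T>> head;

    SketchVBox(T initial) {
        head = new AtomicReference<>(new SketchBody<>(initial, 0, null));
    }

    // Returns the most recent value whose version is <= txVersion.
    // Assumes txVersion >= 0, so the version-0 body always qualifies.
    T read(int txVersion) {
        SketchBody<T> body = head.get();
        while (body.version > txVersion) {
            body = body.next;
        }
        return body.value;
    }

    // Prepends a new body; stands in for the commit of a read-write transaction.
    void commit(T value, int newVersion) {
        head.set(new SketchBody<>(value, newVersion, head.get()));
    }
}
```

For example, after committing values at versions 1 and 3, a transaction that started with version 2 still reads the value committed at version 1, no matter how many later commits happen.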
Although this model can improve performance for read-dominated workloads, it also adds some overheads:
- It largely increases the total memory managed by an application;
- It adds extra indirections to all memory accesses: instead of directly accessing a memory location, a transaction must traverse the versions’ history to get the correct version.
The key insight of our solution, the AOM, is that in the majority of realistic scenarios, a large part of the locations is not under contention.
Unlike what is happening in this roundabout…
Usually, like the cars on this highway that drive freely and without contention, most transactional locations are accessed without contention.
And in these cases we need neither metadata nor multiple versions. Multiple versions are required only when several transactions contend for the same transactional object and at least one of those transactions writes to that object.
So if we can avoid the metadata for the vast majority of objects then we can:
- Largely reduce the required memory space;
- Avoid extra indirections when reading those locations.
Our final goal is that in scenarios without contention we can read a transactional location with the same overhead as reading any other common location.
We just want to get the value from a transactional location without consulting metadata or tracking the read-set.
To achieve this, we propose that transactional locations should have two different layouts:
- Compact layout – equal to the layout dictated by the object model of the runtime environment.
- Extended layout – used when the object may be under contention.
Here we have an example of an object in the compact layout. This object has 3 fields, one of which requires two slots.
To switch between layouts, we need one additional slot, denoted in this picture by the Header. When the object is in the compact layout, this slot points to null.
This is the layout of an object at the beginning of its life cycle.
Yet, the idea is that this additional slot may be further reduced by using some unused bits of the objects’ header.
Later, when an object is written by a transaction, it must be extended and this header will point to the versions’ history.
Note that in the AOM we have removed the VBox. We don’t need it, because the object itself represents the identity of the versions’ history.
Another particularity of the JVSTM is its mechanism for garbage collecting old versions. This garbage collection algorithm removes versions when there are no running transactions that may need to access them.
So eventually, if this object is no longer written by any transaction, it will be left with just one box body. In this case we can revert it back to the compact layout, discarding all the additional metadata. So, by swinging back and forth between these two layouts, we expect to greatly reduce the runtime overheads.
Considering that the vast majority of objects are seldom written, the number of objects that need more than one version should be negligible compared to the total number of objects in the application, which substantially reduces the runtime overheads.
Now let’s review the extension and reversion processes in more detail.
If an object is being extended, then its header is pointing to null.
And the extension includes 3 tasks.
First we create a new body corresponding to the version zero and copy the values from the object fields into this body.
Then we create a 2nd body that points to version zero and contains the values written by the transaction. In this case, for example, the transaction writes only the value 11 to the field x.
And finally it updates the object’s header.
Note that in this case the object fields have not been changed.
ONLY the reversion process can update the object fields.
So a reading transaction may see the header pointing to null or to the versions’ history. But in both cases version 0 and the object fields contain the same values. And if a running transaction overlaps the extension of this object and already sees it in the extended layout, then that transaction will get version 0.
So, no matter the layout seen by the transaction it will always get consistent values.
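The three extension tasks, and the reason a reader gets consistent values in either layout, can be sketched as follows. This is an illustrative sketch, not the AOM implementation: `ExtBody`, `ExtendablePair`, `extendWritingX`, and `readX` are hypothetical names, the object is assumed to have two int fields, and each body is assumed to hold a full copy of the fields.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical body: a snapshot of the fields x and y tagged with a version.
final class ExtBody {
    final int x;
    final int y;
    final int version;
    final ExtBody next;

    ExtBody(int x, int y, int version, ExtBody next) {
        this.x = x;
        this.y = y;
        this.version = version;
        this.next = next;
    }
}

// Hypothetical adaptive object; a null header means the compact layout.
final class ExtendablePair {
    int x;
    int y;
    final AtomicReference<ExtBody> header = new AtomicReference<>(null);

    ExtendablePair(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Extends the object while committing a write of newX at commitVersion.
    // Returns false if the object had already been extended.
    boolean extendWritingX(int newX, int commitVersion) {
        ExtBody versionZero = new ExtBody(x, y, 0, null);                    // task 1
        ExtBody written = new ExtBody(newX, y, commitVersion, versionZero);  // task 2
        return header.compareAndSet(null, written);                          // task 3
    }

    // Reads x as seen by a transaction running at txVersion (assumed >= 0).
    int readX(int txVersion) {
        ExtBody body = header.get();
        if (body == null) {
            return x;                   // compact layout: read the field directly
        }
        while (body.version > txVersion) {
            body = body.next;           // extended layout: walk to a visible version
        }
        return body.x;
    }
}
```

Because the extension never touches the fields `x` and `y`, a transaction with version 0 reads the same value whether it sees the null header or the version-0 body.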
The reversion also includes 3 tasks:
- First, it reads the object’s header and checks whether the versions’ history has just one body.
- In that case, it copies the values from that body into the object fields.
- And finally, it performs a compare-and-swap operation that nullifies the header.
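The three reversion tasks can be sketched as follows. This is a simplified illustration with hypothetical names (`RevBody`, `RevertibleCell`, `tryRevert`) and a single int field. It assumes that the history has already been trimmed by the garbage collection of old versions, and it ignores the coordination with concurrent readers and writers that a real reversion algorithm needs.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical body holding one versioned value of the field x.
final class RevBody {
    final int x;
    final int version;
    final RevBody next;

    RevBody(int x, int version, RevBody next) {
        this.x = x;
        this.version = version;
        this.next = next;
    }
}

// Hypothetical adaptive object; a null header means the compact layout.
final class RevertibleCell {
    int x;
    final AtomicReference<RevBody> header = new AtomicReference<>(null);

    RevertibleCell(int x) {
        this.x = x;
    }

    // Tries to revert the object to the compact layout.
    boolean tryRevert() {
        RevBody body = header.get();                 // task 1: read the header
        if (body == null || body.next != null) {
            return false;                            // already compact, or more than one body
        }
        x = body.x;                                  // task 2: copy values into the fields
        return header.compareAndSet(body, null);     // task 3: CAS the header to null
    }
}
```

The compare-and-swap fails, and the object stays extended, if another thread replaced the header after task 1, for instance by committing a new version.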
These three tasks are defined by three methods in the AdaptiveObject class.
This AdaptiveObject class also defines two more methods for the extension process:
- The replicate that returns a clone of this object;
- And the casHeader that performs a compare and swap operation in the object’s
Finally, our instrumentation engine replaces the root of the class hierarchy.
And the AdaptiveObject will be the base class of any transactional class.
This is the new design corresponding to the new release of the AOM.
The Lee-TM is a realistic and non-trivial benchmark that uses Lee’s routing algorithm to automatically produce interconnections between electronic components.
We get no further scalability beyond 8 threads.
These results confirmed our expectations, but we had no scalability and we were confined to a restricted set of benchmarks. For instance, the results in STAMP were not so promising.
With the new release we have almost the same speedup.
But we scale up to almost 44 threads in the Main circuit and 28 threads in the Mem circuit.
We used one of the recommended configurations in the STAMP – identified by low++ in its paper. And we added long transactions and support for read-only transactions.
The speedup decreased a little, but the AOM is still better than the JVSTM and much better than TL2.
Without the RO transactions, the speedup decreased to almost half of its value.