SlideShare a Scribd company logo
1 of 33
Download to read offline
Philip A. Bernstein
Colin Reid
Ming Wu
Xinhao Yuan
Microsoft Corporation
March 7, 2012
Published at VLDB 2011: http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf
A new algorithm for optimistic concurrency control
(OCC) on tree-structured indices, called meld.
Scenario 1: A data-sharing system
The log is the database. All servers can access it.
Each transaction appends its after-images to the log.
Each server runs meld to do OCC and roll forward the log
Log
2
DBMS
cache
App
Server
Meld DBMS
cache
App
Server
Meld DBMS
cache
App
Server
Meld
Log
DBMS
App
Core
Cache
DBMS
App
Core
Core
Roll Forward
The log is the
database.
All cores can access
it.
Each transaction
appends its after-
images to the log.
One core runs meld
to do OCC and roll
forward the log
Motivation
System architecture
Meld Algorithm
Performance
Conclusion
4
5
G
B
A
H
C I
D
A D C B I H G
Binary
Search
Tree
Tree is marshaled into the log
5
D
Update
D’s value
Copy on write
To update a node, replace nodes up to the root
6
G
B
A
H
C I
D
C
G
B
6
Each server has a cache of the last committed DB state
7
A D C B I H G
Transaction execution
1. Get pointer to snapshot
2. Generate updates locally
3. Append intention log record
D C B G
Snapshot
G
B
C
D
DB cache
G
B
C
D
H
IA
last committed
Each server processes intention records in
sequence
To process transaction T’s intention record.
Check whether T experienced a conflict
If not, T committed, so the server merges the intention into
its last committed state
All servers make the same commit/abort decisions
8
A D C B I H G D C B G
T’s conflict
zone
transaction T
Did a committed transaction write
into T’s readset or writeset here?
Snapshot
1. Run transaction
2. Broadcast intention
3. Append intention to log
4. Send log location
5. De-serialize intention
6. Meld
9 9
1. Broadcasting the intention
2. Appending intention to the log
3. Optimistic concurrency control (OCC)
4. Meld
Technology will improve 1 & 2
For 3, app behavior drives OCC performance
But 4 depends on single-threaded processor
performance, which isn’t improving
Hence, it’s important to optimize Meld
10
Last-committed state
before T
Compare transaction T’s after-image to the last committed state
which is annotated with version and dependency metadata
Traverse T’s intention, comparing versions to last-committed state
Stop traversing when you reach an unchanged subtree
If version(x)=version(x) then simply replace x by x
Log x
x[read: x]x
state when T
executed
Transaction T’s
intention
x
11
Motivation
System architecture
Meld Algorithm
Performance
Conclusion
12
Must be computationally efficient
Must be deterministic
Must produce the same sequence of states on all servers
13
Meld
Latest version of
database state
Intention
record
New version of
database state
14
• T1 creates keys B,C,D,E
D
B E
C
T1
T2 D
B E
CA
D
B E
C F
T3
D
B E
C FA
• T2 and T3 do not conflict, so
the resulting melded state is
A, B, C, D, E, F
• T2 and T3 then execute
concurrently, both
based on the
result of T1
• T2 inserts A
• T3 inserts F
14
Every node n has a unique version
number (VN)
VN(n) permanently identifies the exact
content of n’s subtree
Each node n in an intention T stores
metadata about T’s snapshot
Version of n in T’s snapshot
Dependency information
Each node’s metadata compresses to
~30 bytes
15
Root
D
B E
C
Node Metadata
• version of the subtree
• dependency info
metadata
…
…
…
15
T1 D
B E
C
VN=51
VN=53
VN=54
VN=52
We need to avoid synchronization when assigning VNs
Stored as offsets from the base location of their intention
The base location is assigned when the
intention is logged
Given: T0’s root subtree has VN 50
VN of each subtree S in T1= 50 + S’s offset
16
C B E D
VN Offset +1 +2 +3 +4
Absolute VN: 50 51 52 53 54

…
T1T0
16
Subtree metadata includes a source structure version (SSV).
Intutively, SSV(n) = version of n in transaction T’s snapshot
DependsOn(n) = Y if T depends on n not having changed while
T executed
T1’s root subtree depends on the entire tree version 50.
Since SSV(D) = VN(), T1 becomes the last-committed state.
17
Absolute VN 50 51 52 53 54
C B E D
VN Offset: +1 +2 +3 +4
SSV: 0 0 0 50
DependsOn: N N N Y

…
T0 T1
17
A serial intention is one whose source version is the last
committed state.
Meld is then trivial and needs to consider only the root node.
T1 was serial.
T2 is serial, so meld makes T2 the last committed state.
Thus, a meld of a serial intention executes in constant time.
C B E D
VN Offset: +1 +2 +3 +4
SSV: 0 0 0 50
DependsOn: N N N Y
Absolute VN 51 52 53 54 55 56 57
A B D
+1 +2 +3
0 52 54
N N N
T0 T1 T2
18
19
D
B E
C
T1
T2 D
B E
CA
D
B E
C F
T3
D
B E
C FA
19
T3 is not serial because VN of D in T2 (= 57)  SSV(D) in T3 (= 54).
Meld checks if T3 conflicts with a transaction in its conflict zone
Traverses T3, comparing T3’s nodes to the last-committed state
If there are no conflicts, then since T3 is concurrent, meld
creates an ephemeral intention to merge T3’s state
20
Absolute VN 51 52 53 54 55 56 57 58 59 60
T0 T1
C B E D A B D
T2
VN Offset: +1 +2 +3 +4
SSV: 0 0 0 50
DependsOn: N N N Y
+1 +2 +3
0 53 54
N N N
F E D
T3
+1 +2 +3
0 52 54
N N N
20
M3
A committed concurrent intention produces an
ephemeral intention (e.g. M3)
It’s created deterministically in memory on all servers.
21
Absolute VN 51 52 53 54 55 56 57 58 59 60 61
+1 +2 +3
0 52 54
N N N
T0 T1
C B E D A B D
T2
VN Offset: +1 +2 +3 +4
SSV: 0 0 0 50
DependsOn: N N N Y
+1 +2 +3
0 53 54
N N N
F E D
T3
+1
57
N
D
It logically commits immediately after the intention it melds.
To meld the concurrent intention T3 above, we need to
consider metadata only on the root node D.
21
Most are automatically trimmed
Each committed intention I trims the previous ephemeral
intention with either a persisted node (if I is serial) or an
ephemeral node (if I is concurrent).
To track them use an ephemeral flag (or count) on each
node that has ephemeral descendants
Periodically run a flush transaction
It copies a set of ephemeral nodes that have no reachable
ephemeral nodes
It makes the original ephemeral nodes unreachable in the new
committed state.
It has no dependencies, so it can’t conflict
22
Phantom avoidance
Asymmetric meld operations
 Necessary in common case when subtrees do not align
 Uses a key-range as a parameter to the top-down recursion
Deletions
Use tombstones in the intention header
Checkpointing and recovery
23
23
Focus here is on meld throughput only
For latency, see the paper
We count committed and aborted transactions
Experiment setup
128K keys, all in main memory. Keys and payloads are 8 bytes.
Serializable isolation, so intentions contain readsets
De-serialize intentions on separate threads before meld
Transaction size affects meld throughput
So does conflict zone size (“concurrency degree”)
As transaction size or concurrency degree increase
 more concurrent transactions update keys with common
ancestors
 meld has to traverse deeper in the tree
24
r:w ratio is 1:1 con-di = concurrency degree i
25
26
Brute force = traverse the whole tree
27
Lots of OCC papers but none that give details of
efficient conflict-testing
By contrast, there’s a huge literature on conflict-
testing for locking
Oxenstored [Gazagnairem & Hanquezis, ICFP 09]
Similar scenario: MV trees and OCC
However, very coarse-grain conflict-testing
Uses none of our optimizations
28
New algorithm for OCC
Developed many optimizations to truncate the
conflict checking early in the tree traversal
Implemented and measure it
Future work:
Apply it to other tree structures
Measure it on various storage devices
Compare it with locking and other OCC methods on
multiversion trees
Try to apply it to physiological logging
29
© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it
should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Consider node n in Intention I
SCV(n) = VN of the node that first generated the payload in n’s
predecessor
Altered(n) = true if n‘s payload differs from its predecessor’s
DependsOn(n) = true if I depends on n’s predecessor’s content
NCV(n) = if Altered(n) then VN(n) else SCV(n)
n‘s content changed if SCV(nI)  NCV(nlast committed state)
v1 v2
SCV
IsAltered=true DependsOn=true
v14 v15
SCV
 NCV=self

Implies
a
conflict
31
Again consider node n in Intention I
DependsOnTree(n) = true if I depends on n’s subtree not having
changed
NSV(n) = oldest version of n whose subtree is exactly
subtree(VN(n))
DependsOnTree(n) & NSV(nlast committed state)SSV(n)  a conflict
Can extend DependsOnTree with DependencyRange of keys
v1 v8
NSV
DependsOnTree=true
v14 v15
SSV

Implies
a
conflict
in last committed state
32
SubtreeIsOnlyReadDependent(n) = true iff
none of n’s descendants are updated in I
(i.e., have Altered = true).
Analogous to an intention-to-read lock
Avoids traversing entire tree when a descendent of n has
DependsOnTree=true and NSV(nlast committed state) = SSV(n).
It also enables computing NSV
NSV(n) = (SubtreeIsOnlyReadDependent(n) = true)  SSV(n)
else VN(n)
33

More Related Content

What's hot

Practical byzantine fault tolerance by altanai
Practical byzantine fault tolerance by altanaiPractical byzantine fault tolerance by altanai
Practical byzantine fault tolerance by altanaiALTANAI BISHT
 
Dynamo cassandra
Dynamo cassandraDynamo cassandra
Dynamo cassandraWu Liang
 
Solution of a Problem in Concurrent Programming Control : Notes
Solution of a Problem in Concurrent Programming Control : NotesSolution of a Problem in Concurrent Programming Control : Notes
Solution of a Problem in Concurrent Programming Control : NotesSubhajit Sahu
 
Lecture 08 virtual machine ii
Lecture 08 virtual machine iiLecture 08 virtual machine ii
Lecture 08 virtual machine ii鍾誠 陳鍾誠
 
Distributed Computing On Topics of: Leader election + Byzantine algorithms)
Distributed Computing On Topics of: Leader election + Byzantine algorithms)Distributed Computing On Topics of: Leader election + Byzantine algorithms)
Distributed Computing On Topics of: Leader election + Byzantine algorithms)Osama Askoura
 
Distributed Computing Set 3 - Topics of non-Byzantine Consensus
Distributed Computing Set 3 - Topics of non-Byzantine ConsensusDistributed Computing Set 3 - Topics of non-Byzantine Consensus
Distributed Computing Set 3 - Topics of non-Byzantine ConsensusOsama Askoura
 

What's hot (10)

Ch 9
Ch 9Ch 9
Ch 9
 
Practical byzantine fault tolerance by altanai
Practical byzantine fault tolerance by altanaiPractical byzantine fault tolerance by altanai
Practical byzantine fault tolerance by altanai
 
Dynamo cassandra
Dynamo cassandraDynamo cassandra
Dynamo cassandra
 
Flowctrl
FlowctrlFlowctrl
Flowctrl
 
N1802029295
N1802029295N1802029295
N1802029295
 
Omni ledger
Omni ledgerOmni ledger
Omni ledger
 
Solution of a Problem in Concurrent Programming Control : Notes
Solution of a Problem in Concurrent Programming Control : NotesSolution of a Problem in Concurrent Programming Control : Notes
Solution of a Problem in Concurrent Programming Control : Notes
 
Lecture 08 virtual machine ii
Lecture 08 virtual machine iiLecture 08 virtual machine ii
Lecture 08 virtual machine ii
 
Distributed Computing On Topics of: Leader election + Byzantine algorithms)
Distributed Computing On Topics of: Leader election + Byzantine algorithms)Distributed Computing On Topics of: Leader election + Byzantine algorithms)
Distributed Computing On Topics of: Leader election + Byzantine algorithms)
 
Distributed Computing Set 3 - Topics of non-Byzantine Consensus
Distributed Computing Set 3 - Topics of non-Byzantine ConsensusDistributed Computing Set 3 - Topics of non-Byzantine Consensus
Distributed Computing Set 3 - Topics of non-Byzantine Consensus
 

Viewers also liked

แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้P'lam K'unanon
 
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้P'lam K'unanon
 
7 concurrency controltwo
7 concurrency controltwo7 concurrency controltwo
7 concurrency controltwoashish61_scs
 
SparkAdv Albania_Profile_Organisational Chart
SparkAdv Albania_Profile_Organisational ChartSparkAdv Albania_Profile_Organisational Chart
SparkAdv Albania_Profile_Organisational ChartBesa Xhumari
 
presentation on waste management
presentation on waste managementpresentation on waste management
presentation on waste managementallxwell
 

Viewers also liked (9)

แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
 
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
แบบสอบถามสาเหตุที่ฟุตบอลชายทีมชาติไทยไม่สามารถเข้าสู่รอบ 32 ทีมฟุตบอลโลกได้
 
Tashkent
TashkentTashkent
Tashkent
 
7 concurrency controltwo
7 concurrency controltwo7 concurrency controltwo
7 concurrency controltwo
 
SparkAdv Albania_Profile_Organisational Chart
SparkAdv Albania_Profile_Organisational ChartSparkAdv Albania_Profile_Organisational Chart
SparkAdv Albania_Profile_Organisational Chart
 
presentation on waste management
presentation on waste managementpresentation on waste management
presentation on waste management
 
22 levine
22 levine22 levine
22 levine
 
10a log
10a log10a log
10a log
 
oral presentation
oral presentationoral presentation
oral presentation
 

Similar to Oc cby meldingtreesarial

Principles in Data Stream Processing | Matthias J Sax, Confluent
Principles in Data Stream Processing | Matthias J Sax, ConfluentPrinciples in Data Stream Processing | Matthias J Sax, Confluent
Principles in Data Stream Processing | Matthias J Sax, ConfluentHostedbyConfluent
 
Convolutional Error Control Coding
Convolutional Error Control CodingConvolutional Error Control Coding
Convolutional Error Control CodingMohammed Abuibaid
 
Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks
Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical ClocksOrbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks
Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical ClocksJiaqing Du
 
00 chapter07 and_08_conversion_subroutines_force_sp13
00 chapter07 and_08_conversion_subroutines_force_sp1300 chapter07 and_08_conversion_subroutines_force_sp13
00 chapter07 and_08_conversion_subroutines_force_sp13John Todora
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistencyScyllaDB
 
Kroening et al, v2c a verilog to c translator
Kroening et al, v2c   a verilog to c translatorKroening et al, v2c   a verilog to c translator
Kroening et al, v2c a verilog to c translatorsce,bhopal
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationGeoffrey Fox
 
Vlsiexpt 11 12
Vlsiexpt 11 12Vlsiexpt 11 12
Vlsiexpt 11 12JINCY Soju
 
Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726marangburu42
 
Distributed system sans consensus
Distributed system sans consensusDistributed system sans consensus
Distributed system sans consensusPraveen Singh Bora
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithmsMazin Alwaaly
 
Digital Communication Exam Help
Digital Communication Exam HelpDigital Communication Exam Help
Digital Communication Exam HelpLive Exam Helper
 
Transaction Management, Recovery and Query Processing.pptx
Transaction Management, Recovery and Query Processing.pptxTransaction Management, Recovery and Query Processing.pptx
Transaction Management, Recovery and Query Processing.pptxRoshni814224
 

Similar to Oc cby meldingtreesarial (20)

Principles in Data Stream Processing | Matthias J Sax, Confluent
Principles in Data Stream Processing | Matthias J Sax, ConfluentPrinciples in Data Stream Processing | Matthias J Sax, Confluent
Principles in Data Stream Processing | Matthias J Sax, Confluent
 
Cdc18 dg lee
Cdc18 dg leeCdc18 dg lee
Cdc18 dg lee
 
Convolutional Error Control Coding
Convolutional Error Control CodingConvolutional Error Control Coding
Convolutional Error Control Coding
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
 
Slides
SlidesSlides
Slides
 
Ca unit v 27 9-2020
Ca unit v 27 9-2020Ca unit v 27 9-2020
Ca unit v 27 9-2020
 
Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks
Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical ClocksOrbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks
Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks
 
00 chapter07 and_08_conversion_subroutines_force_sp13
00 chapter07 and_08_conversion_subroutines_force_sp1300 chapter07 and_08_conversion_subroutines_force_sp13
00 chapter07 and_08_conversion_subroutines_force_sp13
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistency
 
Kroening et al, v2c a verilog to c translator
Kroening et al, v2c   a verilog to c translatorKroening et al, v2c   a verilog to c translator
Kroening et al, v2c a verilog to c translator
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
Concurrency
ConcurrencyConcurrency
Concurrency
 
Vlsiexpt 11 12
Vlsiexpt 11 12Vlsiexpt 11 12
Vlsiexpt 11 12
 
Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726
 
Distributed system sans consensus
Distributed system sans consensusDistributed system sans consensus
Distributed system sans consensus
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithms
 
LCDF3_Chap_01x.pptx
LCDF3_Chap_01x.pptxLCDF3_Chap_01x.pptx
LCDF3_Chap_01x.pptx
 
Digital Communication Exam Help
Digital Communication Exam HelpDigital Communication Exam Help
Digital Communication Exam Help
 
Transaction Management, Recovery and Query Processing.pptx
Transaction Management, Recovery and Query Processing.pptxTransaction Management, Recovery and Query Processing.pptx
Transaction Management, Recovery and Query Processing.pptx
 

More from ashish61_scs

More from ashish61_scs (20)

Transactions
TransactionsTransactions
Transactions
 
21 domino mohan-1
21 domino mohan-121 domino mohan-1
21 domino mohan-1
 
20 access paths
20 access paths20 access paths
20 access paths
 
19 structured files
19 structured files19 structured files
19 structured files
 
18 philbe replication stanford99
18 philbe replication stanford9918 philbe replication stanford99
18 philbe replication stanford99
 
17 wics99 harkey
17 wics99 harkey17 wics99 harkey
17 wics99 harkey
 
16 greg hope_com_wics
16 greg hope_com_wics16 greg hope_com_wics
16 greg hope_com_wics
 
15 bufferand records
15 bufferand records15 bufferand records
15 bufferand records
 
14 turing wics
14 turing wics14 turing wics
14 turing wics
 
14 scaleabilty wics
14 scaleabilty wics14 scaleabilty wics
14 scaleabilty wics
 
13 tm adv
13 tm adv13 tm adv
13 tm adv
 
11 tm
11 tm11 tm
11 tm
 
10b rm
10b rm10b rm
10b rm
 
09 workflow
09 workflow09 workflow
09 workflow
 
08 message and_queues_dieter_gawlick
08 message and_queues_dieter_gawlick08 message and_queues_dieter_gawlick
08 message and_queues_dieter_gawlick
 
06 07 lock
06 07 lock06 07 lock
06 07 lock
 
05 tp mon_orbs
05 tp mon_orbs05 tp mon_orbs
05 tp mon_orbs
 
04 transaction models
04 transaction models04 transaction models
04 transaction models
 
03 fault model
03 fault model03 fault model
03 fault model
 
02 fault tolerance
02 fault tolerance02 fault tolerance
02 fault tolerance
 

Recently uploaded

Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxdhanalakshmis0310
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxcallscotland1987
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 

Recently uploaded (20)

Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 

Oc cby meldingtreesarial

  • 1. Philip A. Bernstein Colin Reid Ming Wu Xinhao Yuan Microsoft Corporation March 7, 2012 Published at VLDB 2011: http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf
  • 2. A new algorithm for optimistic concurrency control (OCC) on tree-structured indices, called meld. Scenario 1: A data-sharing system The log is the database. All servers can access it. Each transaction appends its after-images to the log. Each server runs meld to do OCC and roll forward the log Log 2 DBMS cache App Server Meld DBMS cache App Server Meld DBMS cache App Server Meld
  • 3. Log DBMS App Core Cache DBMS App Core Core Roll Forward The log is the database. All cores can access it. Each transaction appends its after- images to the log. One core runs meld to do OCC and roll forward the log
  • 5. 5 G B A H C I D A D C B I H G Binary Search Tree Tree is marshaled into the log 5
  • 6. D Update D’s value Copy on write To update a node, replace nodes up to the root 6 G B A H C I D C G B 6
  • 7. Each server has a cache of the last committed DB state 7 A D C B I H G Transaction execution 1. Get pointer to snapshot 2. Generate updates locally 3. Append intention log record D C B G Snapshot G B C D DB cache G B C D H IA last committed
  • 8. Each server processes intention records in sequence To process transaction T’s intention record. Check whether T experienced a conflict If not, T committed, so the server merges the intention into its last committed state All servers make the same commit/abort decisions 8 A D C B I H G D C B G T’s conflict zone transaction T Did a committed transaction write into T’s readset or writeset here? Snapshot
  • 9. 1. Run transaction 2. Broadcast intention 3. Append intention to log 4. Send log location 5. De-serialize intention 6. Meld 9 9
  • 10. 1. Broadcasting the intention 2. Appending intention to the log 3. Optimistic concurrency control (OCC) 4. Meld Technology will improve 1 & 2 For 3, app behavior drives OCC performance But 4 depends on single-threaded processor performance, which isn’t improving Hence, it’s important to optimize Meld 10
  • 11. Last-committed state before T Compare transaction T’s after-image to the last committed state which is annotated with version and dependency metadata Traverse T’s intention, comparing versions to last-committed state Stop traversing when you reach an unchanged subtree If version(x)=version(x) then simply replace x by x Log x x[read: x]x state when T executed Transaction T’s intention x 11
  • 13. Must be computationally efficient Must be deterministic Must produce the same sequence of states on all servers 13 Meld Latest version of database state Intention record New version of database state
  • 14. 14 • T1 creates keys B,C,D,E D B E C T1 T2 D B E CA D B E C F T3 D B E C FA • T2 and T3 do not conflict, so the resulting melded state is A, B, C, D, E, F • T2 and T3 then execute concurrently, both based on the result of T1 • T2 inserts A • T3 inserts F 14
  • 15. Every node n has a unique version number (VN) VN(n) permanently identifies the exact content of n’s subtree Each node n in an intention T stores metadata about T’s snapshot Version of n in T’s snapshot Dependency information Each node’s metadata compresses to ~30 bytes 15 Root D B E C Node Metadata • version of the subtree • dependency info metadata … … … 15
  • 16. T1 D B E C VN=51 VN=53 VN=54 VN=52 We need to avoid synchronization when assigning VNs Stored as offsets from the base location of their intention The base location is assigned when the intention is logged Given: T0’s root subtree has VN 50 VN of each subtree S in T1= 50 + S’s offset 16 C B E D VN Offset +1 +2 +3 +4 Absolute VN: 50 51 52 53 54  … T1T0 16
  • 17. Subtree metadata includes a source structure version (SSV). Intutively, SSV(n) = version of n in transaction T’s snapshot DependsOn(n) = Y if T depends on n not having changed while T executed T1’s root subtree depends on the entire tree version 50. Since SSV(D) = VN(), T1 becomes the last-committed state. 17 Absolute VN 50 51 52 53 54 C B E D VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y  … T0 T1 17
  • 18. A serial intention is one whose source version is the last committed state. Meld is then trivial and needs to consider only the root node. T1 was serial. T2 is serial, so meld makes T2 the last committed state. Thus, a meld of a serial intention executes in constant time. C B E D VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y Absolute VN 51 52 53 54 55 56 57 A B D +1 +2 +3 0 52 54 N N N T0 T1 T2 18
  • 19. 19 D B E C T1 T2 D B E CA D B E C F T3 D B E C FA 19
  • 20. T3 is not serial because VN of D in T2 (= 57)  SSV(D) in T3 (= 54). Meld checks if T3 conflicts with a transaction in its conflict zone Traverses T3, comparing T3’s nodes to the last-committed state If there are no conflicts, then since T3 is concurrent, meld creates an ephemeral intention to merge T3’s state 20 Absolute VN 51 52 53 54 55 56 57 58 59 60 T0 T1 C B E D A B D T2 VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y +1 +2 +3 0 53 54 N N N F E D T3 +1 +2 +3 0 52 54 N N N 20
  • 21. M3 A committed concurrent intention produces an ephemeral intention (e.g. M3) It’s created deterministically in memory on all servers. 21 Absolute VN 51 52 53 54 55 56 57 58 59 60 61 +1 +2 +3 0 52 54 N N N T0 T1 C B E D A B D T2 VN Offset: +1 +2 +3 +4 SSV: 0 0 0 50 DependsOn: N N N Y +1 +2 +3 0 53 54 N N N F E D T3 +1 57 N D It logically commits immediately after the intention it melds. To meld the concurrent intention T3 above, we need to consider metadata only on the root node D. 21
  • 22. Most are automatically trimmed Each committed intention I trims the previous ephemeral intention with either a persisted node (if I is serial) or an ephemeral node (if I is concurrent). To track them use an ephemeral flag (or count) on each node that has ephemeral descendants Periodically run a flush transaction It copies a set of ephemeral nodes that have no reachable ephemeral nodes It makes the original ephemeral nodes unreachable in the new committed state. It has no dependencies, so it can’t conflict 22
  • 23. Phantom avoidance Asymmetric meld operations  Necessary in common case when subtrees do not align  Uses a key-range as a parameter to the top-down recursion Deletions Use tombstones in the intention header Checkpointing and recovery 23 23
  • 24. Focus here is on meld throughput only For latency, see the paper We count committed and aborted transactions Experiment setup 128K keys, all in main memory. Keys and payloads are 8 bytes. Serializable isolation, so intentions contain readsets De-serialize intentions on separate threads before meld Transaction size affects meld throughput So does conflict zone size (“concurrency degree”) As transaction size or concurrency degree increase  more concurrent transactions update keys with common ancestors  meld has to traverse deeper in the tree 24
  • 25. r:w ratio is 1:1 con-di = concurrency degree i 25
  • 26. 26
  • 27. Brute force = traverse the whole tree 27
  • 28. Lots of OCC papers but none that give details of efficient conflict-testing By contrast, there’s a huge literature on conflict- testing for locking Oxenstored [Gazagnairem & Hanquezis, ICFP 09] Similar scenario: MV trees and OCC However, very coarse-grain conflict-testing Uses none of our optimizations 28
  • 29. New algorithm for OCC Developed many optimizations to truncate the conflict checking early in the tree traversal Implemented and measure it Future work: Apply it to other tree structures Measure it on various storage devices Compare it with locking and other OCC methods on multiversion trees Try to apply it to physiological logging 29
  • 30. © 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
  • 31. Consider node n in Intention I SCV(n) = VN of the node that first generated the payload in n’s predecessor Altered(n) = true if n‘s payload differs from its predecessor’s DependsOn(n) = true if I depends on n’s predecessor’s content NCV(n) = if Altered(n) then VN(n) else SCV(n) n‘s content changed if SCV(nI)  NCV(nlast committed state) v1 v2 SCV IsAltered=true DependsOn=true v14 v15 SCV  NCV=self  Implies a conflict 31
  • 32. Again consider node n in Intention I DependsOnTree(n) = true if I depends on n’s subtree not having changed NSV(n) = oldest version of n whose subtree is exactly subtree(VN(n)) DependsOnTree(n) & NSV(nlast committed state)SSV(n)  a conflict Can extend DependsOnTree with DependencyRange of keys v1 v8 NSV DependsOnTree=true v14 v15 SSV  Implies a conflict in last committed state 32
  • 33. SubtreeIsOnlyReadDependent(n) = true iff none of n’s descendants are updated in I (i.e., have Altered = true). Analogous to an intention-to-read lock Avoids traversing entire tree when a descendent of n has DependsOnTree=true and NSV(nlast committed state) = SSV(n). It also enables computing NSV NSV(n) = (SubtreeIsOnlyReadDependent(n) = true)  SSV(n) else VN(n) 33