Scalable Database Logging for Multicores
Hyungsoo Jung, et al.
Hanyang University
2017, VLDB
Index
▪ Motivation
▪ Design
▪ Implementation
▪ Other Issues
Motivation: Characteristics of Modern Databases
▪ Modern Databases
▪ Write-ahead Logging (WAL) protocol
▪ ACID properties
▪ Atomicity
▪ Consistency
▪ Isolation
▪ Durability
Motivation: Architectural Issues (1)
▪ Existing architecture relies on WAL protocol
[Figure: log records are buffered in a central log buffer in DRAM and flushed to HDD or NVM; the flush incurs a synchronous I/O delay.]
Motivation: Architectural Issues (2)
▪ Existing Central log buffer
[Figure, animated over several slides: threads T1, T2, and T3 hold log records L1, L2, and L3 and share one central log buffer.]
1. T1 takes the lock and copies L1 into the buffer; L2 and L3 wait.
2. T1 unlocks.
3. T2 takes the lock and copies L2 into the buffer; L3 waits.
4. T2 unlocks.
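The lock/unlock sequence above can be sketched as follows. This is a minimal illustration, not code from any of the systems discussed; the class name `CentralLogBuffer` and its methods are hypothetical. Every append goes through one mutex, so the threads insert their records one at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical central log buffer: a single mutex serializes every append,
// which is the contention point illustrated in the slides.
class CentralLogBuffer {
public:
    // Appends one log record and returns the byte offset where it starts.
    std::uint64_t append(const std::string& record) {
        assert(!record.empty());                    // empty records are rejected in this sketch
        std::lock_guard<std::mutex> guard(lock_);   // Lock
        const std::uint64_t start = buffer_.size();
        buffer_.insert(buffer_.end(), record.begin(), record.end());
        return start;                               // UnLock when guard leaves scope
    }

    // Number of bytes buffered so far.
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(lock_);
        return buffer_.size();
    }

private:
    mutable std::mutex lock_;
    std::vector<char> buffer_;
};
```

With more cores, more threads queue on `lock_`, which is the scalability limit the deck attributes to the central log buffer.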
Motivation: Architectural Issues (3)
▪ Synchronous I/O delay
[Figure: the central log buffer in DRAM is flushed to HDD or NVM, which incurs a synchronous I/O delay.]
Thread 1, Transaction A:
1. Buffering log.
2. Flush log to storage.
3. Write data.
Summary
Motivation.
▪ The central log buffer limits the scalability of DB logging on multicore.
→ Parallel logging on multicore
Contribution.
▪ ELEDA (Express Logging Ensuring Durable Atomicity)
▪ A fast, scalable logging method for high-performance transaction systems with guaranteed atomicity and durability.
▪ Uses concurrent data structures that solve the performance bottlenecks of the central log buffer.
▪ Implementation
▪ Plug ELEDA into WiredTiger and Shore-MT and evaluate the performance improvements.
▪ (ex) Transaction throughput improves to around 3.9 million Txn/s.
Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues on Parallel Logging on Multicore
▪ Guarantee the sequential order of log records.
▪ Detect log holes.
▪ Concurrently,
▪ buffer logs.
▪ write logs to durable storage.
Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues on Parallel Logging on Multicore
▪ Guarantee the sequential order of log records.
[Figure: threads T1, T2, and T3 each obtain an LSN with Fetch_and_Add and receive LSN:1, LSN:2, and LSN:3.]
(cf) LSN: Log Sequence Number
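As a rough sketch of the Fetch_and_Add step, the allocator below hands out LSNs with a single atomic instruction, so no lock is needed. The class name `LsnAllocator` is hypothetical, and starting at LSN 1 is an assumption made only to match the numbering in the slide.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical LSN allocator: each caller reserves a disjoint range of LSNs
// with one atomic fetch-and-add instead of taking a lock.
class LsnAllocator {
public:
    // Reserves `len` consecutive LSNs and returns the first one.
    std::uint64_t reserve(std::uint64_t len) {
        assert(len > 0);
        return next_.fetch_add(len, std::memory_order_relaxed);
    }

    // First LSN that has not been handed out yet.
    std::uint64_t next() const { return next_.load(std::memory_order_relaxed); }

private:
    std::atomic<std::uint64_t> next_{1};  // assumption: numbering starts at 1
};
```

Calling `reserve(1)` from T1, T2, and T3 yields 1, 2, and 3 in some order. Passing a record length in bytes instead would turn the LSN into a byte offset into the log.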
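Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues on Parallel Logging on Multicore
▪ Detect log holes.
[Figure: T1, T2, and T3 obtain LSN:1, LSN:2, and LSN:3 with Fetch_and_Add. L1 and L3 are already in the buffer, but L2 is not, which leaves a hole. SBL marks how far the log is buffered without holes.]
(cf) SBL: sequentially buffered LSN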
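To make the hole concrete, here is a deliberately naive sketch that derives the SBL from a per-record "buffered" flag. It is not the tracking scheme of ELEDA; the function name and the flag vector are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// buffered[i] is true when the record with LSN i + 1 has been copied into the
// log buffer. Returns the sequentially buffered LSN (SBL): the largest LSN n
// such that every LSN in 1..n is buffered, or 0 if LSN 1 is still a hole.
std::uint64_t sequentially_buffered_lsn(const std::vector<bool>& buffered) {
    std::uint64_t sbl = 0;
    for (std::size_t i = 0; i < buffered.size() && buffered[i]; ++i) {
        sbl = i + 1;  // everything up to and including this LSN is buffered
    }
    assert(sbl <= buffered.size());
    return sbl;
}
```

For the situation in the figure (L1 and L3 buffered, L2 missing) the function returns 1, so only L1 may be flushed. A linear scan like this is what a fast hole-detection structure is meant to avoid.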
Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues on Parallel Logging on Multicore
▪ Concurrently,
▪ buffer logs.
▪ write logs to durable storage.
[Figure: L1 and L3 are buffered and L2 is still a hole; the log up to SBL is flushed.]
Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues on Parallel Logging on Multicore
▪ Guarantee the sequential order of log records.
▪ Detect log holes.
▪ Concurrently,
▪ buffer logs.
▪ write logs to durable storage.
▪ So, design a concurrent data structure that supports
▪ Concurrent buffering and flushing of logs,
▪ Fast log hole detection.
Design: Parallel Logging on Multicore, Grasshopper (2)
Thread type             | ELEDA-worker        | ELEDA-flusher | Database
Data structure (global) | Central log buffer (all thread types)
Data structure (others) | Hopping index (R),  | ⋅             | Hopping index (W),
                        | C&H-list, Min heap  |               | C&H-list
Operation               | Tracking holes      | Flush         | Copy log to buffer,
                        |                     |               | Garbage collection
Design: Parallel Logging on Multicore, Grasshopper (2)
▪ ELEDA logging architecture
[Figures: the DB thread, the flusher thread, and the worker thread are highlighted in turn.]
▪ Grasshopper algorithm
▪ Latency-hiding techniques (asynchronous I/O)
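Design: Execution process of ELEDA-based system
[Figure: log records L1 to L7 written by Thread1, Thread2, and Thread3 are laid out over Page 1, Page 2, and Page 3 of the log buffer, which start at offsets 0 * 4k, 1 * 4k, and 2 * 4k. Page 1 and Page 2 are passed by hopping; Page 3 is passed by crawling.]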
Design: Execution process of ELEDA-based system
[Figure: the same layout of L1 to L7 over Page 1, Page 2, and Page 3 (offsets 0 * 4k, 1 * 4k, 2 * 4k); Page 1 and Page 2 are passed by hopping, Page 3 by crawling.]
Hopping Index
[1] 4096
[2] 4096
[3] 4096 / 3
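One way to picture the hopping index is as a byte counter per log-buffer page: DB threads add to it after copying a record (the W side), and the worker reads it to see which leading pages are full (the R side). The sketch below follows that reading; the class and method names are hypothetical, and the 4096-byte page size is taken from the slide.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kPageSize = 4096;  // page size used in the slides

// Hypothetical hopping index: one byte counter per log-buffer page.
class HoppingIndex {
public:
    explicit HoppingIndex(std::size_t pages) : filled_(pages) {}

    // Writer side: called after copying `len` bytes that start at byte
    // offset `start`; the bytes are credited page by page.
    void record_copy(std::uint64_t start, std::uint64_t len) {
        while (len > 0) {
            const std::size_t page = static_cast<std::size_t>(start / kPageSize);
            assert(page < filled_.size());
            const std::uint64_t room = kPageSize - start % kPageSize;
            const std::uint64_t chunk = std::min(len, room);
            filled_[page].fetch_add(chunk, std::memory_order_release);
            start += chunk;
            len -= chunk;
        }
    }

    // Reader side: number of leading pages that are completely filled,
    // i.e. the hopping boundary (HB).
    std::size_t hopping_boundary() const {
        std::size_t hb = 0;
        while (hb < filled_.size() &&
               filled_[hb].load(std::memory_order_acquire) == kPageSize) {
            ++hb;
        }
        return hb;
    }

private:
    std::vector<std::atomic<std::uint64_t>> filled_;  // value-initialized to 0
};
```

With two full pages and a third page holding about a third of a page, as in the slide, `hopping_boundary()` returns 2: the worker can hop over Page 1 and Page 2 without inspecting individual records.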
Design: Execution process of ELEDA-based system
▪ Tracking LSN holes (= log holes) and flushing SBL
[Figure: each DB thread keeps a hopping list and a crawling list, each with a head and a tail; a flusher thread and a worker thread run alongside.]
Thread 1: LSN 1 (Page 1), LSN 4 (Page 2), LSN 7 (Page 3)
Thread 2: LSN 2 (Page 1), LSN 6 (Page 2)
Thread 3: LSN 3 (Page 1), LSN 5 (Page 2)
Min heap: 1 (root), 2, 3
1. Get HB by scanning the Hopping index table. HB is 2 in this case.
2. Remove the items that belong to pages up to page number 2 from the c-list and h-list.
3. Rebuild the min heap.
4. Pop the root (7) of the min heap.
5. Then, SBL is 7.
6. Flush LSN 1~7 to storage.
(cf)
- HB: hopping boundary
- SBL: sequentially buffered LSN
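The steps above can be approximated by a small "hop, then crawl" routine. This is a simplified single-threaded model under stated assumptions, not the Grasshopper algorithm itself: LSNs are treated as byte offsets rather than the record numbers 1 to 7 used in the slide, the per-thread lists are flattened into one vector, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

constexpr std::uint64_t kLogPageSize = 4096;

// A buffered record not yet folded into the SBL: (start offset, length).
using PendingRecord = std::pair<std::uint64_t, std::uint64_t>;

// `page_fill[i]` is the number of bytes buffered in page i (hopping index);
// `pending` holds buffered records that still have to be crawled.
// Returns the SBL as a byte offset.
std::uint64_t advance_sbl(const std::vector<std::uint64_t>& page_fill,
                          const std::vector<PendingRecord>& pending) {
    for (std::uint64_t fill : page_fill) assert(fill <= kLogPageSize);

    // Hop: skip every leading page that is completely filled.
    std::size_t hb = 0;
    while (hb < page_fill.size() && page_fill[hb] == kLogPageSize) ++hb;
    std::uint64_t sbl = hb * kLogPageSize;

    // Crawl: pop records in offset order while they touch the current SBL.
    std::priority_queue<PendingRecord, std::vector<PendingRecord>,
                        std::greater<PendingRecord>>
        heap(pending.begin(), pending.end());
    while (!heap.empty() && heap.top().first <= sbl) {
        const std::uint64_t end = heap.top().first + heap.top().second;
        if (end > sbl) sbl = end;
        heap.pop();
    }
    return sbl;  // a record left in the heap, if any, sits behind a hole
}
```

With two full pages and one 1365-byte record at the start of the third page, the routine hops to offset 8192 and then crawls to 9557, which mirrors "HB is 2, pop 7, SBL is 7" in record-number terms.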
Design: Execution process of ELEDA-based system
Hopping Index
[1] 4096 = DB page size
[2] 4096 = DB page size (HB)
[3] 4096 / 3 < DB page size
Design: Execution process of ELEDA-based system
[Figure: after the items up to HB are removed, only LSN 7 (Page 3) remains in the lists of Thread 1; the lists of Thread 2 and Thread 3 are empty. The min heap holds 7, which is popped.]
Design: Execution process of ELEDA-based system
[Figure: the hopping and crawling lists of Thread 1, Thread 2, and Thread 3 are all empty.]
1. Get HB by scanning the Hopping index table. HB is 2 in this case.
2. Remove the items that belong to pages up to page number 2 from the c-list and h-list.
3. Rebuild the min heap.
4. Pop the root (7) of the min heap.
5. Then, SBL is 7.
6. Flush LSN 1~7 to storage.
(cf)
- HB: hopping boundary
- SBL: sequentially buffered LSN
Implementation
▪ Applying to kernel file system, such as ext4.
▪ Abstraction
Thread type             | ELEDA-worker        | ELEDA-flusher | Database
Data structure (global) | Central log buffer (all thread types)
Data structure (others) | Hopping index (R),  | ⋅             | Hopping index (W),
                        | C&H-list, Min heap  |               | C&H-list
Operation               | Tracking holes      | Flush         | Copy log to buffer,
                        |                     |               | Garbage collection
Implementation
▪ Shore-MT
▪ Implement ELEDA in Shore-MT with Aether.
(cf) Aether: A Scalable Approach to Logging, R. Johnson et al.
▪ Details
▪ Replace its consolidation array-based logging subsystem.
▪ Modify its flush pipelining implementation for transaction switching.
Other issues (1)
▪ Flush
▪ I/O unit for flushing is experimentally tailored.
▪ It depends on characteristics of applications.
▪ Average size of logs
▪ Max concurrency
(cf) 6.5.3 Effects of I/O unit size (64KiB and 512KiB)
▪ Garbage Collection & Callback
▪ GC pointer is exclusively accessed by the owner DB thread.
Other issues (2)
▪ Partially sequential implementation
▪ Access of DB threads to Hopping index.
▪ Evaluation
▪ Throughput and Commit latency
▪ Workloads
▪ Key-value
▪ Online transaction processing
▪ with Different Settings by DB options
▪ CPU utilization and Effects of I/O unit size
Summary
Motivation.
▪ The central log buffer limits the scalability of DB logging on multicore.
→ Parallel logging on multicore using Grasshopper
Contribution.
▪ ELEDA (Express Logging Ensuring Durable Atomicity)
▪ A fast, scalable logging method for high-performance transaction systems with guaranteed atomicity and durability.
▪ Uses concurrent data structures that solve the performance bottlenecks of the central log buffer.
▪ Implementation
▪ Plug ELEDA into WiredTiger and Shore-MT and evaluate the performance improvements.
▪ (ex) Transaction throughput improves to around 3.9 million Txn/s.
TODO
▪ Analyze Shore-MT and Aether.
▪ Where can I insert logging and flusher modules?
▪ Design the logging subsystem and flusher modules.
▪ Implement ELEDA in Shore-MT.
▪ Starting point is C&H-list.
Progress
18.09.27
hjlee
Shore-MT and Aether
▪ Shore-MT
▪ Open-source multi-threaded storage manager.
▪ The authors use the EPFL branch of Shore-MT.
▪ Aether
▪ A scalable approach to logging.
▪ Details for implementation
▪ 4.1 Flush Pipelining → modified to ELEDA’s design
▪ A.1 Log buffer design
▪ A.2 Consolidation array → replaced with ELEDA’s design
▪ A.3 Modification to address potential delays caused by the requirement that all threads release their buffers in order
(note) Pseudocode exists.
Shore-MT
▪ Shore-MT and target for optimization
▪ Open-source multi-threaded storage manager.
▪ The authors use the EPFL branch of Shore-MT.
(cf) https://bitbucket.org/shoremt/shore-mt/src/e832a6a586048ad3f4cdefde30cf96131d4b4525?at=default
▪ Language
▪ C++
▪ Related code in src/sm/log.h & log.cpp
▪ Log manager class log_m
Aether
▪ Aether and TODO
▪ A scalable approach to logging.
▪ Details for implementation
▪ 4.1 Flush Pipelining → Modified to ELEDA’s design
▪ Related code in src/sm/log_core.cpp
▪ Default flusher method
rc_t log_core::flush(lsn_t lsn, bool block)
▪ A.1 Log buffer design
▪ A.2 Consolidation array → Replaced with ELEDA’s design
▪ A.3 Modification to address potential delays caused by the requirement that all threads release their buffers in order
▪ A.4 Difficulty of distributing the log
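The idea behind flush pipelining, as described in the Aether paper, is that a committing transaction does not block on the flush; it is set aside and acknowledged once the log is durable past its commit record. The sketch below models only that bookkeeping. It is not Shore-MT's `log_core::flush`, and the class `CommitQueue` with its methods is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical model of flush pipelining: committing transactions are parked
// with their commit LSN and completed when the durable LSN has passed it.
class CommitQueue {
public:
    // Parks transaction `txn_id`, whose commit record ends at `commit_lsn`.
    void park(int txn_id, std::uint64_t commit_lsn) {
        waiting_.emplace(commit_lsn, txn_id);
    }

    // Called after the flusher has made the log durable up to `durable_lsn`.
    // Returns the transactions that can now be acknowledged, in LSN order.
    std::vector<int> on_flushed(std::uint64_t durable_lsn) {
        assert(durable_lsn >= last_durable_);  // durable LSN never moves back
        last_durable_ = durable_lsn;
        std::vector<int> done;
        while (!waiting_.empty() && waiting_.top().first <= durable_lsn) {
            done.push_back(waiting_.top().second);
            waiting_.pop();
        }
        return done;
    }

private:
    using Entry = std::pair<std::uint64_t, int>;  // (commit LSN, txn id)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> waiting_;
    std::uint64_t last_durable_ = 0;
};
```

In an ELEDA-style design the value passed to `on_flushed` would presumably follow the flushed SBL, so the worker thread that parked a transaction is free to run another one in the meantime.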
TODO
▪ Analyze Shore-MT and Aether.
▪ Shore-MT (default) → Aether → ELEDA
: Define which features (e.g. multi logging by DB threads) are implemented in each system.
▪ Find out which parts of ELEDA can be replaced by the flush pipelining and consolidation array of Aether.
▪ Design the logging subsystem and flusher modules.
▪ Implement ELEDA in Shore-MT.
▪ Starting point is C&H-list.
Reference
▪ Johnson, Ryan, et al. "Aether: a scalable approach to logging." Proceedings of the VLDB Endowment 3.1-2 (2010): 681-692.
▪ Shore-MT (source code and docs), https://bitbucket.org/shoremt/
▪ Shore Storage Manager Modules, http://research.cs.wisc.edu/shore-mt/onlinedoc/html/index.html
▪ Implementation notes of Log manager, http://research.cs.wisc.edu/shore-mt/onlinedoc/html/implnotes.html#LOG_M

Shahin Sheidaei
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 

Recently uploaded (20)

Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 

Paper_Scalable database logging for multicores

  • 1. Scalable Database Logging for Multicores Hyungsoo Jung, et al. Hanyang University 2017, VLDB 1
  • 2. Index ▪ Motivation ▪ Design ▪ Implementation ▪ Other Issues 2
  • 3. Motivation: Characteristics of Modern Databases ▪ Modern Databases ▪ Write-ahead Logging (WAL) protocol ▪ ACID properties ▪ Atomicity ▪ Consistency ▪ Isolation ▪ Durability 3
  • 4. ▪ Existing architecture relies on WAL protocol Motivation: Architectural Issues (1) 4 DRAM Central log buffer Flush HDD or NVM Synchronous I/O Delay
  • 5. Motivation: Architectural Issues (2) ▪ Existing Central log buffer 5 T1 T2 T3 L1
  • 6. Motivation: Architectural Issues (2) ▪ Existing Central log buffer 6 T1 T2 T3 L1 Lock
  • 7. Motivation: Architectural Issues (2) ▪ Existing Central log buffer 7 T1 T2 T3 L1 Lock L2 L3
  • 8. Motivation: Architectural Issues (2) ▪ Existing Central log buffer 8 T1 T2 T3 UnLock L1 L2 L3
  • 9. Motivation: Architectural Issues (2) ▪ Existing Central log buffer 9 T1 T2 T3 L1 Lock L2 L3
  • 10. Motivation: Architectural Issues (2) ▪ Existing Central log buffer 10 T1 T2 T3 L1 L2 Lock L3
  • 11. Motivation: Architectural Issues (2) ▪ Existing Central log buffer 11 T1 T2 T3 UnLock L1 L2 L3
  • 12. ▪ Synchronous I/O delay Motivation: Architectural Issues (3) 12 DRAM Central log buffer Flush HDD or NVM Synchronous I/O Delay 1. Buffering log. 2. Flush log to storage. 3. Write data. Thread 1 Transaction A
  • 13. Summary Motivation. ▪ The central log buffer limits the scalability of DB logging on multicore. → Parallel logging on multicore Contribution. ▪ ELEDA (Express Logging Ensuring Durable Atomicity) ▪ Fast, scalable logging method for high-performance transaction systems with guaranteed atomicity and durability. ▪ Uses concurrent data structures that solve the performance bottlenecks of the central log buffer. ▪ Implementation ▪ Plug ELEDA into WiredTiger and Shore-MT and evaluate the performance improvements. ▪ (ex) Transaction throughput improves to as much as ~3.9 million Txn/s.
  • 14. Design: Parallel Logging on Multicore, Grasshopper (1) ▪ Issues on Parallel Logging on Multicore ▪ Guarantee the sequentiality of each log. ▪ Detect log holes. ▪ Concurrently, ▪ buffering logs. ▪ writing logs to durable storage.
  • 15. Design: Parallel Logging on Multicore, Grasshopper (1) ▪ Issues on Parallel Logging on Multicore ▪ Guarantee the sequentiality of each log. T1 T2 T3 Fetch_and_Add LSN:1 LSN:2 LSN:3 (cf) LSN: Log Sequence Number
  • 16. Design: Parallel Logging on Multicore, Grasshopper (1) ▪ Issues on Parallel Logging on Multicore ▪ Detect log holes. T1 T2 T3 hole Fetch_and_Add LSN:1 LSN:2 LSN:3 L1 L3 L2 SBL (cf) SBL: sequentially buffered LSN
  • 17. Design: Parallel Logging on Multicore, Grasshopper (1) ▪ Issues on Parallel Logging on Multicore ▪ Concurrently, ▪ buffering logs. ▪ writing logs to durable storage. T1 T2 T3 hole L1 L3 L2 SBL Flush
  • 18. Design: Parallel Logging on Multicore, Grasshopper (1) ▪ Issues on Parallel Logging on Multicore ▪ Guarantee the sequentiality of each log. ▪ Detect log holes. ▪ Concurrently, ▪ buffering logs. ▪ writing logs to durable storage. ▪ So, design a concurrent data structure that satisfies: ▪ Concurrent buffering and flushing of logs, ▪ Fast log hole detection.
  • 19. Design: Parallel Logging on Multicore, Grasshopper (2)
      Thread type             | ELEDA-worker                               | ELEDA-flusher      | Database
      Data structure (global) | Central log buffer                         | Central log buffer | Central log buffer
      Data structure (others) | Hopping index (R), C&H-list, Min heap      | ⋅                  | Hopping index (W), C&H-list
      Operation               | Tracking holes                             | Flush              | Copy log to buffer, Garbage collection
  • 20. Design: Parallel Logging on Multicore, Grasshopper (2) ▪ ELEDA logging architecture 20 DB thread
  • 21. Design: Parallel Logging on Multicore, Grasshopper (2) ▪ ELEDA logging architecture 21 Flusher thread
  • 22. Design: Parallel Logging on Multicore, Grasshopper (2) ▪ ELEDA logging architecture 22 Worker thread
  • 23. Design: Parallel Logging on Multicore, Grasshopper (2) ▪ Grasshopper algorithm 23
  • 24. Design: Parallel Logging on Multicore, Grasshopper (2) ▪ Latency-hiding techniques (asynchronous I/O) 24
  • 25. Design: Parallel Logging on Multicore, Grasshopper (3)
      Thread type             | ELEDA-worker                               | ELEDA-flusher      | Database
      Data structure (global) | Central log buffer                         | Central log buffer | Central log buffer
      Data structure (others) | Hopping index (R), C&H-list, Min heap      | ⋅                  | Hopping index (W), C&H-list
      Operation               | Tracking holes                             | Flush              | Copy log to buffer, Garbage collection
  • 26. Design: Execution process of ELEDA-based system 26 L1 Page 1 Page 2 Page 3 L2 L3 L4 L5 L6 L7 Thread1 Thread2 Thread3 0 * 4k 1 * 4k 2 * 4k hopping hopping crawling
  • 27. Design: Execution process of ELEDA-based system 27 L1 Page 1 Page 2 Page 3 L2 L3 L4 L5 L6 L7 0 * 4k 1 * 4k 2 * 4k hopping hopping [1] 4096 [2] 4096 [3] 4096 / 3 Crawling Hopping Index
  • 28. Design: Execution process of ELEDA-based system 28 Hopping head tail Crawling head tail 1 4 7 Page 1 Page 2 Page 3 Hopping head tail Crawling head tail 2 6 Page 1 Page 2 Hopping head tail Crawling head tail 3 5 Page 1 Page 2 1 2 3 Min heap Thread 1 Thread 2 Thread 3
  • 29. Design: Execution process of ELEDA-based system ▪ Tracking LSN holes (= log holes) and flushing SBL. Hopping index: [1] 4096 = DB page size, [2] 4096 = DB page size, [3] 4096 / 3 < DB page size, so HB = 2. Worker: 1. Get HB by scanning the hopping index table. HB is 2 in this case. 2. Remove items related to page number 2 from the c-list and h-list. 3. Rebuild the min heap. 4. Pop the root (7) of the min heap. 5. Then, SBL is 7. Flusher: 6. Flush LSN 1~7 to storage. (cf) HB: hopping boundary; SBL: sequentially buffered LSN
  • 30. Design: Execution process of ELEDA-based system 30 Hopping head tail Crawling head tail 7 Page 3 Hopping head tail Crawling head tail Hopping head tail Crawling head tail 7 Thread 1 Thread 2 Thread 3 Pop
  • 31. Design: Execution process of ELEDA-based system. Thread 1, Thread 2, Thread 3: hopping list (head, tail) and crawling list (head, tail) are now empty. Worker: 1. Get HB by scanning the hopping index table. HB is 2 in this case. 2. Remove items related to page number 2 from the c-list and h-list. 3. Rebuild the min heap. 4. Pop the root (7) of the min heap. 5. Then, SBL is 7. Flusher: 6. Flush LSN 1~7 to storage. (cf) HB: hopping boundary; SBL: sequentially buffered LSN
  • 32. Implementation ▪ Applying to kernel file system, such as ext4. ▪ Abstraction
      Thread type             | ELEDA-worker                               | ELEDA-flusher      | Database
      Data structure (global) | Central log buffer                         | Central log buffer | Central log buffer
      Data structure (others) | Hopping index (R), C&H-list, Min heap      | ⋅                  | Hopping index (W), C&H-list
      Operation               | Tracking holes                             | Flush              | Copy log to buffer, Garbage collection
  • 33. Implementation ▪ Shore-MT ▪ Implement ELEDA to Shore-MT with Aether. (cf) Aether: A Scalable Approach to Logging, R.Johnson et al. ▪ Details ▪ Replace its consolidation array-based logging subsystem. ▪ Modify its flush pipelining implementation for transaction switching. 33
  • 34. Other issues (1) ▪ Flush ▪ I/O unit for flushing is experimentally tailored. ▪ It depends on characteristics of applications. ▪ Average size of logs ▪ Max concurrency (cf) 6.5.3 Effects of I/O unit size (64KiB and 512KiB) ▪ Garbage Collection & Callback ▪ GC pointer is exclusively accessed by the owner DB thread. 34
  • 35. Other issues (2) ▪ Partially sequential implementation ▪ Access of DB threads to Hopping index. ▪ Evaluation ▪ Throughput and Commit latency ▪ Workloads ▪ Key-value ▪ Online transaction processing ▪ with Different Settings by DB options ▪ CPU utilization and Effects of I/O unit size 35
  • 36. Summary Motivation. ▪ The central log buffer limits the scalability of DB logging on multicore. → Parallel logging on multicore using Grasshopper Contribution. ▪ ELEDA (Express Logging Ensuring Durable Atomicity) ▪ Fast, scalable logging method for high-performance transaction systems with guaranteed atomicity and durability. ▪ Uses concurrent data structures that solve the performance bottlenecks of the central log buffer. ▪ Implementation ▪ Plug ELEDA into WiredTiger and Shore-MT and evaluate the performance improvements. ▪ (ex) Transaction throughput improves to as much as ~3.9 million Txn/s.
  • 37. TODO ▪ Analyze Shore-MT and Aether. ▪ Where can I insert logging and flusher modules? ▪ Design the logging subsystem and flusher modules. ▪ Implement ELEDA to Shore-MT. ▪ Starting point is C&H-list. 37
  • 39. Shore-MT and Aether ▪ Shore-MT ▪ Open-source multi-threaded storage manager. ▪ The authors use the EPFL branch of Shore-MT. ▪ Aether ▪ A scalable approach to logging. ▪ Details for implementation (pseudocode exists) ▪ 4.1 Flush Pipelining → modified to ELEDA’s design ▪ A.1 Log buffer design ▪ A.2 Consolidation array → replaced with ELEDA’s design ▪ A.3 Modification to address potential delays caused by the requirement that all threads release their buffers in order
  • 40. Shore-MT ▪ Shore-MT and target for optimization ▪ Open-source multi-threaded storage manager. ▪ The authors use the EPFL branch of Shore-MT. (cf) https://bitbucket.org/shoremt/shore-mt/src/e832a6a586048ad3f4cdefde30cf96131d4b4525?at=default ▪ Language ▪ C++ ▪ Related code in src/sm/log.h & log.cpp ▪ Log manager class log_m
  • 41. Aether ▪ Aether and TODO ▪ A scalable approach to logging. ▪ Details for implementation ▪ 4.1 Flush Pipelining → Modified to ELEDA’s design ▪ Related code in src/sm/log_core.cpp ▪ Default flusher method rc_t log_core::flush(lsn_t lsn, bool block) ▪ A.1 Log buffer design ▪ A.2 Consolidation array → Replaced with ELEDA’s design ▪ A.3 Modification to address potential delays caused by the requirement that all threads release their buffers in order ▪ A.4 Difficulty of distributing the log
  • 42. TODO ▪ Analyze Shore-MT and Aether. ▪ Shore-MT (default) → Aether → ELEDA: define which features (e.g. multi logging by DB threads) are implemented in each system. ▪ Find out which parts of Aether (flush pipelining and the consolidation array) are replaced by which parts of ELEDA. ▪ Design the logging subsystem and flusher modules. ▪ Implement ELEDA in Shore-MT. ▪ Starting point is the C&H-list.
  • 43. Reference ▪ Johnson, Ryan, et al. "Aether: a scalable approach to logging." Proceedings of the VLDB Endowment 3.1-2 (2010): 681-692. ▪ Shore-MT (source code and docs), https://bitbucket.org/shoremt/ ▪ Shore Storage Manager Modules, http://research.cs.wisc.edu/shore-mt/onlinedoc/html/index.html ▪ Implementation notes of Log manager, http://research.cs.wisc.edu/shore-mt/onlinedoc/html/implnotes.html#LOG_M
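
The hole-tracking walkthrough on slides 26 to 31 can be sketched roughly as below. This is a minimal single-threaded Python illustration, not the authors' implementation: the record sizes, the `missing` parameter, and the function names `make_example`, `hopping_boundary`, and `track_sbl` are assumptions made for the example. The h-lists are omitted, and the contiguity check during crawling is one reading of how "crawling" detects a hole.

```python
import heapq

PAGE_SIZE = 4096  # DB page size assumed on the slides (4 KB)


def make_example(missing=()):
    """Build the slide-26 layout: seven logs written by three threads.

    Record sizes and `missing` are illustrative assumptions. An LSN listed
    in `missing` has reserved its space but is not buffered yet (a hole).
    """
    sizes = {1: 1365, 2: 1365, 3: 1366, 4: 1365, 5: 1365, 6: 1366, 7: 1365}
    owner = {1: "T1", 2: "T2", 3: "T3", 4: "T1", 5: "T3", 6: "T2", 7: "T1"}
    c_lists = {"T1": [], "T2": [], "T3": []}  # per-thread (lsn, offset, size)
    hopping_index = [0, 0, 0]                 # buffered bytes per page
    offset = 0
    for lsn in range(1, 8):
        if lsn not in missing:
            c_lists[owner[lsn]].append((lsn, offset, sizes[lsn]))
            hopping_index[offset // PAGE_SIZE] += sizes[lsn]
        offset += sizes[lsn]
    return hopping_index, c_lists


def hopping_boundary(hopping_index):
    """Count the leading pages that are completely filled (the HB)."""
    hb = 0
    for filled in hopping_index:
        if filled < PAGE_SIZE:
            break
        hb += 1
    return hb


def track_sbl(hopping_index, c_lists):
    """Return (HB, SBL): hop over full pages, then crawl log by log."""
    hb = hopping_boundary(hopping_index)
    next_offset = hb * PAGE_SIZE
    sbl = 0
    # Step 2: drop every entry that lies inside the fully buffered pages.
    for tid, records in c_lists.items():
        hopped = [r for r in records if r[1] < next_offset]
        if hopped:
            sbl = max(sbl, hopped[-1][0])
        c_lists[tid] = records[len(hopped):]
    # Step 3: rebuild the min heap from the head of each c-list.
    heap = [(recs[0][0], tid) for tid, recs in c_lists.items() if recs]
    heapq.heapify(heap)
    # Steps 4-5: pop the minimum LSN while the logs stay contiguous.
    while heap:
        lsn, tid = heapq.heappop(heap)
        _, offset, size = c_lists[tid][0]
        if offset != next_offset:
            break  # a hole precedes this record; stop crawling
        c_lists[tid].pop(0)
        sbl = lsn
        next_offset += size
        if c_lists[tid]:
            heapq.heappush(heap, (c_lists[tid][0][0], tid))
    return hb, sbl
```

With the layout from the slides, `track_sbl(*make_example())` gives HB 2 and SBL 7, so the flusher may write LSN 1 through 7. If LSN 5 is still being copied (`make_example(missing={5})`), the second page is no longer full, hopping stops after page 1, and crawling stops at LSN 4.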

Editor's Notes

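  1. Broadly, I will cover the paper's motivation, design, and implementation, in that order.
  2. Modern production databases have the following two main characteristics. First, the paper points out the problems caused by the existing WAL protocol; this is explained in detail on the following slides. Second, among the ACID properties, the paper's idea particularly emphasizes atomicity and durability, and it highlights the trade-off between durability and performance. As the table here shows, existing DBs offer options that either give up durability to fix slow performance (speed) or give up some speed for durability. (The idea proposed here narrows the gap between the two.)
  3. Next is the WAL protocol mentioned earlier. Existing systems mostly use an architecture like the one in this figure for WAL. The authors raise two issues with it: first, the scalability of the central log buffer itself; second, synchronous I/O delay.
  4. First, the scalability problem of the central log buffer. As in this example, treating the central log buffer as a shared resource protected by a lock limits performance in a multicore environment.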
  5. First, the scalability problem of the central log buffer.
  6. First, the scalability problem of the central log buffer.
  7. First, the scalability problem of the central log buffer.
  8. First, the scalability problem of the central log buffer.
  9. First, the scalability problem of the central log buffer.
  10. First, the scalability problem of the central log buffer.
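  11. Next is the synchronous I/O delay problem. Under the existing WAL protocol, a thread must follow a fixed order: write the log first, flush it to storage, and only then write the actual data. The time a thread has to wait for the flush, that is, for the disk I/O, is called synchronous I/O delay. Using NVM instead of HDD as the storage device reduces the delay, but the authors say the method proposed in this paper performs even better.
  12. To summarize so far, the motivation is that the central log buffer design of existing DBs limits scalability and has performance problems such as synchronous I/O delay. To solve this, the authors propose parallel logging on multicores. They call it ELEDA and describe it as a high-performance transaction system that guarantees atomicity and durability while using concurrent data structures to remove the performance bottleneck of the existing central log buffer. As for results, they applied it to two DBs, WiredTiger and Shore-MT, and report performance improvements. For example, in the best case transaction throughput improved to as much as 3.9 million transactions per second.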
  13. Now for the design itself. The authors named ELEDA's core design technique Grasshopper. Implementing the parallel logging on multicores mentioned earlier raises roughly the following three issues.
  14. First, the sequentiality of each log must be guaranteed. Briefly, ELEDA uses a fetch-and-add operation instead of a lock to assign a unique number to each transaction. This number is called the log sequence number (LSN).
  15. To flush sequential log blocks starting from the front of the central log buffer, it must be possible to quickly find where the first hole is.
  16. Finally, it must be possible to buffer logs concurrently and also flush the buffered logs to storage.
  17. In summary, concurrent data structures and techniques must enable concurrent buffering and flushing as well as fast hole detection.
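  18. ELEDA's approach is as follows. First, it defines three types of threads for concurrent operation: the worker, which tracks holes; the flusher, which flushes using the SBL information found by the worker; and the existing DB threads, which copy logs into the buffer and perform tasks such as GC. All three thread types treat the central log buffer as a global data structure. The worker uses three data structures: the hopping index table, the hopping and crawling lists, and a min heap. Each DB thread maintains two lists of its own, a crawling list and a hopping list. What these data structures do is explained later.
  19. This explains in more detail the parallel logging implementation issues mentioned briefly earlier. When a DB thread copies a transaction's log to the buffer, the log is assigned an LSN (log sequence number), so each thread independently finds its own position in the central buffer and copies without interfering with the others (and without a lock).
  20. Previously, if there was a hole at flush time, the flush waited for the log hole to be filled. Even on multicore, everything was written out at once, so there was no benefit from parallelism. > algo (1) Flush whenever a run of consecutive logs forms in the buffer. A lock on the central buffer could be used, but its overhead is high, so instead each thread performs a fetch & add operation. The algorithm that efficiently detects the holes this creates is explained next.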
  21. As mentioned, hole detection is performed by the worker thread. This is where the Grasshopper algorithm comes in.
  22. The hopping index table, one of the data structures maintained by the worker, divides the central log buffer into units of one DB page, written as 2^H, and stores the total size of the logs in each unit. The worker scans the table in order; if an entry is full at 2^H, it hops over the logs a page at a time. If an entry is smaller than 2^H, there is a hole, so from the start of that page it crawls log by log.
  23. Next is the solution to synchronous I/O delay. > algo (2) The paper proposes asynchronous I/O that hides this latency. If a thread needs to flush the log while running transaction 1, it does not wait but context-switches to transaction 2. When the flush thread finishes its work and calls back, the thread returns to transaction 1 and performs the actual data write.
  24. To recap, the thread types are DB, worker, and flusher. Each DB thread maintains a pair of lists, a crawling list and a hopping list, abbreviated c-list and h-list. The c-list is kept in units of LSNs and the h-list in units of DB pages. The worker maintains the min heap and the hopping index. The min heap holds only the minimum LSN of each c-list. To explain how these are actually used, the next slides show the technique described earlier at the data structure level.
  25. As a simple example, assume the logs of threads 1, 2, and 3 are placed in the central buffer as shown, and that the DB page size is 4 KB. Terms: SDL: storage durable LSN. SBL: sequentially buffered LSN. LSN: log sequence number. LSN hole: partially buffered log.
  26. In this situation, the worker thread's hopping index table holds the following values. Terms: SDL: storage durable LSN. SBL: sequentially buffered LSN. LSN: log sequence number. LSN hole: partially buffered log.
  27. At the same time, each thread's h-list and c-list look like this. As mentioned, the h-list tracks pages and the c-list tracks LSNs (logs). The minimum LSN in each thread's c-list is kept in a min heap. Although omitted here, each time a DB thread adds an LSN, it adds the log size to the corresponding hopping index table entry. The system does not allow multiple DB threads to write to the hopping index at the same time.
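  28. This explains how the worker thread tracks holes in this situation and what information it passes to the flusher thread. First, the worker scans the hopping index table and finds the hopping boundary, which is the index of the first entry below 4 KB minus one; that is, the index of the page just before the page range containing a hole. Here it is 2. It then removes all items related to page 2 from the c-lists and h-lists, and rebuilds the heap. After these three steps, the state looks like the next slide.
  29. This is the result. The minimum is popped here to obtain the SBL.
  30. The flusher then flushes the sequential logs from LSN 1 through 7 to storage. This concludes the main explanation of ELEDA and the Grasshopper algorithm.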
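  31. My thinking on how to implement this is as follows. I will implement the data structures and methods according to the table summarized earlier. According to the table in the paper, lock-free concurrent logging is achieved by giving the three thread types different access rights to the pointers of each data structure, so that information should be used as a reference.
  32. Of WiredTiger and Shore-MT, the paper describes the Shore-MT implementation as follows. ELEDA was implemented in Shore-MT using Aether. The existing array-based logging subsystem was replaced with ELEDA's design, and flush pipelining was implemented for transaction context switching.
  33. Beyond this, the implementation of GC and flush needs more thought. In particular, the authors themselves leave open the I/O unit used per flush when implementing the flusher thread. The paper evaluates two cases, 64 KiB and 512 KiB, and notes that there is still a trade-off between latency and bandwidth, so the unit must be tuned carefully. (They say they have left this problem open to the community.)
  34. Also, aspects that are far from a concurrent design, such as DB threads being allowed to access the hopping index only sequentially, one at a time, could be targets for future improvement. The evaluation categories are as listed; I will skip them for now.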
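  35. To recap, the motivation is that the central log buffer design of existing DBs limits scalability and has performance problems such as synchronous I/O delay. To solve this, the authors propose parallel logging on multicores. They call it ELEDA and describe it as a high-performance transaction system that guarantees atomicity and durability while using concurrent data structures to remove the performance bottleneck of the existing central log buffer. As for results, they applied it to two DBs, WiredTiger and Shore-MT, and report performance improvements. For example, in the best case transaction throughput improved to as much as 3.9 million transactions per second.
  36. I have roughly divided the work ahead into three steps.
  37. It is worth anticipating which parts will need modification.
  38. In fact, the ELEDA paper says that this version of Shore-MT already implements several features that provide multicore scalability (DB locking, latching, logging). The authors therefore optimized on top of the existing logging. I expect the optimized parts to be the underlined ones. I need to understand flush pipelining and the consolidation array in Aether, especially the consolidation array. For flush pipelining, communication between the flush thread and the worker thread will need to be implemented. (Once the SBL is determined, the worker must pass that position to the flusher via a callback.)
  39. I need to define how far each of Shore-MT (default) → Aether → ELEDA is implemented, (ex) multi logging by DB threads, … I also need to work out what functionality the consolidation array provides; that is, define which ELEDA module each part can be replaced by.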
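
The DB-thread side described in notes 14, 19, and 27 (reserve a position with fetch-and-add, copy the log into the central buffer, then add the log size to the hopping index) might look roughly like the Python sketch below. Several things here are assumptions for illustration only: the class names `FetchAndAdd` and `LogBuffer` are hypothetical, the LSN is taken to double as a byte offset into the buffer, and a mutex stands in for the hardware fetch-and-add instruction because Python exposes no such primitive. The real system would use an atomic instruction precisely to avoid that lock.

```python
import threading


class FetchAndAdd:
    """Emulated atomic counter (assumption: a lock stands in for the
    hardware fetch-and-add instruction, which Python does not expose)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old


class LogBuffer:
    """Hypothetical central log buffer with a per-page hopping index."""

    def __init__(self, capacity, page_size=4096):
        self.data = bytearray(capacity)
        self.page_size = page_size
        self.tail = FetchAndAdd()
        pages = (capacity + page_size - 1) // page_size
        self.hopping_index = [0] * pages
        # The notes say hopping index writes by DB threads are serialized.
        self._index_lock = threading.Lock()

    def append(self, payload):
        """Reserve space, copy the log, update the index; return the offset."""
        size = len(payload)
        offset = self.tail.fetch_and_add(size)  # unique, ordered position
        if offset + size > len(self.data):
            raise MemoryError("log buffer full")
        # Ranges handed out above never overlap, so no lock guards the copy.
        self.data[offset:offset + size] = payload
        with self._index_lock:
            pos, remaining = offset, size
            while remaining:
                page = pos // self.page_size
                chunk = min(remaining, (page + 1) * self.page_size - pos)
                self.hopping_index[page] += chunk
                pos += chunk
                remaining -= chunk
        return offset
```

A worker scanning `hopping_index` can then treat any entry equal to `page_size` as a fully buffered page. Between a thread's reservation and the end of its copy, the entry stays below `page_size`, which is how a hole shows up in the index.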