The computer science behind a
modern distributed data store
Max Neunhöffer
OSDC, Berlin, 12 June 2018
www.arangodb.com
Overview
Topics
Resilience and Consensus
Sorting
Log-structured Merge Trees
Hybrid Logical Clocks
Distributed ACID Transactions
Bottom line: You need CompSci to implement a modern data store
Resilience and Consensus
The Problem
A modern data store is distributed, because it needs to scale out and/or
be resilient.
Different parts of the system need to agree on things.
Consensus is the art of achieving this as well as possible in software.
This is relatively easy when all is well, but very hard if:
the network has outages,
the network drops, delays or duplicates packets,
disks fail (and come back with corrupt data),
machines fail (and come back with old data),
racks fail (and come back with or without data).
(And we have not even talked about malicious attacks and enemy action.)
Paxos and Raft
Traditionally, one uses the Paxos Consensus Protocol (1998).
More recently, Raft (2013) has been proposed.
Paxos is a challenge to understand and to implement efficiently.
Various variants exist.
Raft is designed to be understandable.
My advice:
First try to understand Paxos for some time (do not implement it!), then
enjoy the beauty of Raft, but do not implement it either!
Use some battle-tested implementation you trust!
But most importantly: DO NOT TRY TO INVENT YOUR OWN!
Raft in a slide
An odd number of servers each keep a persisted log of events.
Everything is replicated to everybody.
They democratically elect a leader with absolute majority.
Only the leader may append to the replicated log.
An append only counts when a majority has persisted and confirmed it.
Very smart logic to ensure a unique leader and automatic recovery from failure.
It is all a lot of fun to get right, but it is proven to work.
One puts a key/value store on top; the log contains the changes.
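The commit rule from the slide above can be sketched in a few lines. This is a hypothetical toy, not real Raft (it omits terms, elections and log-matching entirely); it only shows how an appended entry counts as committed once a majority of an odd number of servers has persisted it:

```python
from dataclasses import dataclass, field

@dataclass
class ToyLeader:
    n_servers: int                              # odd number of servers
    log: list = field(default_factory=list)
    acked: dict = field(default_factory=dict)   # index -> set of server ids
    commit_index: int = -1

    def append(self, entry):
        """Leader appends locally, then would replicate to followers."""
        self.log.append(entry)
        idx = len(self.log) - 1
        self.acked[idx] = {0}                   # leader (id 0) has persisted it
        return idx

    def on_ack(self, idx, server_id):
        """A follower confirms it persisted the entry at idx."""
        self.acked[idx].add(server_id)
        majority = self.n_servers // 2 + 1
        if len(self.acked[idx]) >= majority and idx > self.commit_index:
            self.commit_index = idx             # entry now counts as durable

leader = ToyLeader(n_servers=5)
i = leader.append({"set": ("x", 1)})
leader.on_ack(i, 1)                  # 2 of 5 have it: not yet committed
assert leader.commit_index == -1
leader.on_ack(i, 2)                  # 3 of 5: majority reached, committed
assert leader.commit_index == i
```

Heed the advice above: this is for building intuition only, not for implementing.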
Raft demo
http://raft.github.io/raftscope/index.html
(by Diego Ongaro)
Sorting
The Problem
Data stores need indexes. In practice, we need to sort things.
Most published algorithms are rubbish on modern hardware.
The problem is no longer the comparison computations but the data movement.
Since the time when I was a kid playing with an Apple IIe:
compute power in one core has increased by ×20000,
a single memory access by ×40,
and now we have 32 cores in a CPU,
this means computation has outpaced memory access by ×1280!
Idea for a parallel sorting algorithm: Merge Sort
(diagram: sorted runs are merged through a min-heap)
Nearly all comparisons hit the L2 cache!
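The min-heap merge from the diagram can be sketched as follows: the heap holds one candidate element per sorted run, so almost every comparison touches only the tiny heap, which stays resident in the L2 cache (Python's stdlib also ships this as `heapq.merge`):

```python
import heapq

def kway_merge(runs):
    """Merge k sorted lists via a min-heap of (value, run id, position)."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_id, pos = heapq.heappop(heap)   # smallest remaining element
        out.append(value)
        if pos + 1 < len(runs[run_id]):            # refill from the same run
            heapq.heappush(heap, (runs[run_id][pos + 1], run_id, pos + 1))
    return out

print(kway_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

In a parallel merge sort, each core first sorts its own run; the runs are then merged through the heap.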
Log structured merge trees (LSM-trees)
The Problem
People rightfully expect that a data store
can hold more data than the available RAM,
works well with SSDs and spinning rust,
allows fast bulk inserts into large data sets, and
provides fast reads in a hot set that fits into RAM.
Traditional B-tree based structures often fail to deliver on the last two.
Log structured merge trees (LSM-trees)
(Source: http://www.benstopford.com/2015/02/14/log-structured-merge-trees/, Author: Ben Stopford, License: Creative Commons)
Log structured merge trees (LSM-trees)
LSM-trees — summary
writes first go into memtables,
all files are sorted and immutable,
compaction happens in the background,
merge sort can be used,
all writes use sequential I/O,
Bloom filters or Cuckoo filters for fast reads,
=⇒ good write throughput and reasonable read performance,
used in ArangoDB, BigTable, Cassandra, HBase, InfluxDB, LevelDB,
MongoDB, RocksDB, SQLite4 and WiredTiger, etc.
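The write and read paths summarized above can be sketched as a hypothetical miniature LSM store: writes go to an in-memory memtable; when it fills up it is flushed as a sorted, immutable run (sequential I/O); reads check the memtable first, then the runs from newest to oldest. Real stores add background compaction via merge sort and Bloom/Cuckoo filters, omitted here:

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}      # recent writes, mutable, in RAM
        self.runs = []          # sorted immutable runs, flushed in order
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # one sequential write of a sorted run; a real store would also
        # build a Bloom or Cuckoo filter here for fast negative lookups
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # hot set: served from RAM
            return self.memtable[key]
        for run in reversed(self.runs):     # newest run wins
            for k, v in run:                # a real store binary-searches
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
assert db.get("k3") == 3    # found in a flushed run
assert db.get("k9") == 9    # still in the memtable
```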
Hybrid Logical Clocks (HLC)
The Problem
Clocks in different nodes of distributed systems are not in sync.
general relativity poses fundamental obstructions to synchronicity,
in practice, clock skew happens,
Google can use atomic clocks,
even with NTP (network time protocol) we have to live with ≈ 20 ms.
Therefore, we cannot compare time stamps from different nodes!
Why would this help?
establish a “happened after” relationship between events,
e.g. for conflict resolution, log sorting, detecting network delays,
time to live could be implemented easily.
Hybrid Logical Clocks (HLC)
The Idea
Every computer has a local clock, and we use NTP to synchronize.
If two events on different machines are linked by causality, the cause
should have a smaller time stamp than the effect.
causality ⇐⇒ a message is sent
Send a time stamp with every message. The HLC always returns a value
> max(local clock, largest time stamp ever seen).
Causality is preserved, time can “catch up” with logical time eventually.
http://muratbuffalo.blogspot.com.es/2014/07/hybrid-logical-clocks.html
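The idea above can be sketched as code. This is a simplified reading of the HLC algorithm from the blog post, with millisecond physical time and a logical counter as tie-breaker; the returned stamp is always strictly larger than both the local wall clock reading and any stamp seen on a message:

```python
import time

class HLC:
    def __init__(self):
        self.pt = 0     # largest physical component seen so far
        self.lc = 0     # logical counter, breaks ties within one pt

    def now(self):
        """Stamp a local event."""
        wall = int(time.time() * 1000)
        if wall > self.pt:
            self.pt, self.lc = wall, 0
        else:                        # wall clock has not advanced: tick logically
            self.lc += 1
        return (self.pt, self.lc)

    def observe(self, stamp):
        """Merge a stamp received on a message, so the effect's stamp
        is strictly larger than the cause's."""
        wall = int(time.time() * 1000)
        pt, lc = stamp
        m = max(wall, self.pt, pt)
        if m == self.pt == pt:
            self.lc = max(self.lc, lc) + 1
        elif m == self.pt:
            self.lc += 1
        elif m == pt:
            self.lc = lc + 1
        else:                        # wall clock is ahead: physical time caught up
            self.lc = 0
        self.pt = m
        return (self.pt, self.lc)

a, b = HLC(), HLC()
s1 = a.now()          # cause on node a
s2 = b.observe(s1)    # effect on node b
assert s2 > s1        # causality is reflected in the stamps
```

Tuples compare lexicographically, so `(pt, lc)` stamps order correctly across nodes.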
Distributed ACID Transactions
Atomic either happens in its entirety or not at all
Consistent reading sees a consistent state, writing preserves consistency
Isolated concurrent transactions do not see each other
Durable committed writes are preserved after shutdown and crashes
(All relatively doable when transactions happen one after another!)
Distributed ACID Transactions
The Problem
In a distributed system:
How to make sure that all nodes agree on whether
the transaction has happened? (Atomicity)
How to create a consistent snapshot across nodes? (Consistency)
How to hide ongoing activities until commit? (Isolation)
How to handle lost nodes? (Durability)
We have to take replication, resilience and failover into account.
Distributed ACID Transactions
WITHOUT
Distributed databases without ACID transactions:
ArangoDB, BigTable, Couchbase, Datastax, Dynamo, Elastic, HBase,
MongoDB, RethinkDB, Riak, and lots more . . .
WITH
Distributed databases with ACID transactions:
(ArangoDB,) CockroachDB, FoundationDB, Spanner
=⇒ Very few distributed engines promise ACID, because this is hard!
Distributed ACID Transactions
Basic Idea
Use Multi Version Concurrency Control (MVCC), i.e. multiple revisions of
a data item are kept.
Do writes and replication in a decentralized, distributed fashion, without
them becoming visible to other transactions.
Then have some place with a switch that decides when the
transaction becomes visible. These “switches” need to
be persisted somewhere (durability),
scale out (no bottleneck for commit/abort),
be replicated (no single point of failure),
be resilient in case of fail-over (fault-tolerance).
Transaction visibility needs to be implemented (MVCC); time stamps play
a crucial role.
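A hypothetical sketch of the visibility part of this idea: every write creates a new revision tagged with its transaction id, and a revision becomes visible to readers only once the central "switch" has recorded a commit timestamp for that transaction. Everything here (names, single-node storage) is illustrative; real stores distribute both the versions and the switches:

```python
import itertools

class TinyMVCC:
    def __init__(self):
        self.ts = itertools.count(1)   # the commit-timestamp source
        self.versions = {}             # key -> list of (txn id, value)
        self.commit_ts = {}            # txn id -> commit timestamp ("the switch")

    def write(self, txn, key, value):
        # a new revision; invisible until txn commits
        self.versions.setdefault(key, []).append((txn, value))

    def commit(self, txn):
        # flipping the switch: the txn's revisions become visible atomically
        self.commit_ts[txn] = next(self.ts)

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        best = None
        for txn, value in self.versions.get(key, []):
            cts = self.commit_ts.get(txn)          # None => still invisible
            if cts is not None and cts <= snapshot_ts:
                if best is None or cts > best[0]:
                    best = (cts, value)
        return best[1] if best else None

db = TinyMVCC()
db.write("t1", "x", 1)
assert db.read("x", snapshot_ts=10) is None   # uncommitted: invisible
db.commit("t1")                               # commit timestamp 1
assert db.read("x", snapshot_ts=10) == 1      # now visible
```

Reads against an old `snapshot_ts` keep seeing the old revision, which is what gives isolation without blocking writers.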
Thanks
and YES: We are hiring, too!
Links
http://the-paper-trail.org/blog/consensus-protocols-paxos
https://raft.github.io
https://en.wikipedia.org/wiki/Merge_sort
http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
http://muratbuffalo.blogspot.com.es/2014/07/hybrid-logical-clocks.html
https://research.google.com/archive/spanner.html
https://www.cockroachlabs.com/docs/v1.1/architecture/overview.html
https://www.arangodb.com
http://mesos.apache.org

More Related Content

Similar to OSDC 2018 | The Computer science behind a modern distributed data store by Max Neunhöffer

WebWorkersCamp 2010
WebWorkersCamp 2010WebWorkersCamp 2010
WebWorkersCamp 2010
Olivier Gutknecht
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCCal Henderson
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
Christopher Curtin
 
Basho and Riak at GOTO Stockholm: "Don't Use My Database."
Basho and Riak at GOTO Stockholm:  "Don't Use My Database."Basho and Riak at GOTO Stockholm:  "Don't Use My Database."
Basho and Riak at GOTO Stockholm: "Don't Use My Database."
Basho Technologies
 
Sheepdog Status Report
Sheepdog Status ReportSheepdog Status Report
Sheepdog Status Report
Liu Yuan
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
Stefano Paluello
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
Andraz Tori
 
Software and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : NotesSoftware and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : Notes
Subhajit Sahu
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
Adam Muise
 
Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to Complexity
Theo Schlossnagle
 
Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...
Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...
Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...
CodeOps Technologies LLP
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
Clustering In The Wild
Clustering In The WildClustering In The Wild
Clustering In The WildSergio Bossa
 
Ep keyote slides
Ep  keyote slidesEp  keyote slides
Ep keyote slides
Niti Suryawanshi
 
Ep keyote slides
Ep  keyote slidesEp  keyote slides
Ep keyote slides
OpenEBS
 
"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017
Neeran Karnik
 
Cache Consistency – Requirements and its packet processing Performance implic...
Cache Consistency – Requirements and its packet processing Performance implic...Cache Consistency – Requirements and its packet processing Performance implic...
Cache Consistency – Requirements and its packet processing Performance implic...
Michelle Holley
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
Minio
 
Tarantool: теперь и с SQL
Tarantool: теперь и с SQLTarantool: теперь и с SQL
Tarantool: теперь и с SQL
Mail.ru Group
 
Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
ice799
 

Similar to OSDC 2018 | The Computer science behind a modern distributed data store by Max Neunhöffer (20)

WebWorkersCamp 2010
WebWorkersCamp 2010WebWorkersCamp 2010
WebWorkersCamp 2010
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Basho and Riak at GOTO Stockholm: "Don't Use My Database."
Basho and Riak at GOTO Stockholm:  "Don't Use My Database."Basho and Riak at GOTO Stockholm:  "Don't Use My Database."
Basho and Riak at GOTO Stockholm: "Don't Use My Database."
 
Sheepdog Status Report
Sheepdog Status ReportSheepdog Status Report
Sheepdog Status Report
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
 
Software and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : NotesSoftware and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : Notes
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to Complexity
 
Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...
Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...
Containers and Developer Defined Data Centers - Evan Powell - Keynote in Bang...
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Clustering In The Wild
Clustering In The WildClustering In The Wild
Clustering In The Wild
 
Ep keyote slides
Ep  keyote slidesEp  keyote slides
Ep keyote slides
 
Ep keyote slides
Ep  keyote slidesEp  keyote slides
Ep keyote slides
 
"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017
 
Cache Consistency – Requirements and its packet processing Performance implic...
Cache Consistency – Requirements and its packet processing Performance implic...Cache Consistency – Requirements and its packet processing Performance implic...
Cache Consistency – Requirements and its packet processing Performance implic...
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
 
Tarantool: теперь и с SQL
Tarantool: теперь и с SQLTarantool: теперь и с SQL
Tarantool: теперь и с SQL
 
Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
 

Recently uploaded

2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 

Recently uploaded (20)

2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
the network has outages,
the network has dropped or delayed or duplicated packets,
disks fail (and come back with corrupt data),
machines fail (and come back with old data),
racks fail (and come back with or without data).
(And we have not even talked about malicious attacks and enemy action.)
Paxos and Raft
Traditionally, one uses the Paxos Consensus Protocol (1998).
More recently, Raft (2013) has been proposed.
Paxos is a challenge to understand and to implement efficiently.
Various variants exist.
Raft is designed to be understandable.
My advice: First try to understand Paxos for some time (do not implement it!),
then enjoy the beauty of Raft, but do not implement it either!
Use some battle-tested implementation you trust!
But most importantly: DO NOT TRY TO INVENT YOUR OWN!
Raft in a slide
An odd number of servers each keep a persisted log of events.
Everything is replicated to everybody.
They democratically elect a leader with absolute majority.
Only the leader may append to the replicated log.
An append only counts when a majority has persisted and confirmed it.
Very smart logic ensures a unique leader and automatic recovery from failure.
It is all a lot of fun to get right, but it is proven to work.
One puts a key/value store on top; the log contains the changes.
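The majority rule above can be sketched in a few lines of Python. This is a toy model of the commit rule only, not a real Raft implementation (no elections, terms, or failure handling); the names `Server` and `Leader` are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    """One member of the cluster, holding a persisted log of events."""
    log: list = field(default_factory=list)

    def append(self, term, entry):
        self.log.append((term, entry))
        return True  # acknowledge only after persisting

class Leader:
    """Minimal commit rule: an append only counts once an absolute
    majority of servers has persisted and confirmed it."""
    def __init__(self, servers):
        self.servers = servers        # the leader replicates to everybody
        self.term = 1
        self.committed = []

    def replicate(self, entry):
        acks = sum(1 for s in self.servers if s.append(self.term, entry))
        if acks > len(self.servers) // 2:   # absolute majority
            self.committed.append(entry)
            return True
        return False

cluster = [Server() for _ in range(5)]    # an odd number of servers
leader = Leader(cluster)
leader.replicate({"set": ("key", "value")})
```

With a key/value store on top, each log entry would be a change like the `{"set": ...}` dict above.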
Sorting
The Problem
Data stores need indexes. In practice, we need to sort things.
Most published algorithms are rubbish on modern hardware.
The problem is no longer the comparison computations but the data movement.
Since the time when I was a kid and played with an Apple IIe,
compute power in one core has increased by ×20000,
a single memory access by ×40,
and now we have 32 cores in a CPU:
this means computation has outpaced memory access by ×1280!
Idea for a parallel sorting algorithm: Merge Sort
(Diagram: a min-heap merges pre-sorted runs into one sorted output.)
Nearly all comparisons hit the L2 cache!
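The min-heap merge step can be sketched as follows (a simplified single-threaded sketch; in the parallel version each core would first sort its own run, and Python's stdlib offers the same idea as `heapq.merge`). The heap holds only one element per run, so it is small enough to stay cache-resident:

```python
import heapq

def merge_sorted_runs(runs):
    """k-way merge of pre-sorted runs via a min-heap.

    Each heap entry is (value, run_index, position); popping the
    minimum and pushing that run's next element keeps nearly all
    comparisons inside the tiny, cache-resident heap.
    """
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(runs[run_idx]):
            heapq.heappush(heap, (runs[run_idx][pos + 1], run_idx, pos + 1))
    return out
```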
Log-structured merge trees (LSM-trees)
The Problem
People rightfully expect from a data store that it
can hold more data than the available RAM,
works well with SSDs and spinning rust,
allows fast bulk inserts into large data sets,
and provides fast reads in a hot set that fits into RAM.
Traditional B-tree-based structures often fail to deliver on the last two.
Log-structured merge trees (LSM-trees)
(Source: http://www.benstopford.com/2015/02/14/log-structured-merge-trees/, Author: Ben Stopford, License: Creative Commons)
Log-structured merge trees (LSM-trees)
LSM-trees — summary
writes first go into memtables,
all files are sorted and immutable,
compaction happens in the background, merge sort can be used,
all writes use sequential I/O,
Bloom filters or Cuckoo filters for fast reads,
=⇒ good write throughput and reasonable read performance,
used in ArangoDB, BigTable, Cassandra, HBase, InfluxDB, LevelDB, MongoDB, RocksDB, SQLite4 and WiredTiger, etc.
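The write path of the summary above can be sketched as a toy in-memory model (a deliberately simplified sketch, nothing like a production LSM engine: no real files, no compaction, no Bloom filters; `TinyLSM` is an illustrative name). Writes land in a memtable; when it fills up, it is flushed as a sorted, immutable run; reads check the memtable first and then the runs from newest to oldest:

```python
import bisect

class TinyLSM:
    """Toy LSM write path: mutable memtable in front of sorted,
    immutable runs. A background compaction would k-way-merge runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []            # newest first; each run is sorted (key, value) pairs
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())   # would be one sequential file write
        self.runs.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # hot set served from RAM
            return self.memtable[key]
        for run in self.runs:                 # newest version wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```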
Hybrid Logical Clocks (HLC)
The Problem
Clocks in different nodes of distributed systems are not in sync:
general relativity poses fundamental obstructions to synchronization,
in practice, clock skew happens,
Google can use atomic clocks,
even with NTP (Network Time Protocol) we have to live with ≈ 20 ms.
Therefore, we cannot compare time stamps from different nodes!
Why would this help?
establish a “happened after” relationship between events, e.g. for
conflict resolution, log sorting, detecting network delays,
time to live could be implemented easily.
Hybrid Logical Clocks (HLC)
The Idea
Every computer has a local clock, and we use NTP to synchronize.
If two events on different machines are linked by causality,
the cause should have a smaller time stamp than the effect.
causality ⇐⇒ a message is sent
Send a time stamp with every message.
The HLC always returns a value > max(local clock, largest time stamp ever seen).
Causality is preserved, and time can “catch up” with logical time eventually.
http://muratbuffalo.blogspot.com.es/2014/07/hybrid-logical-clocks.html
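A simplified sketch of the HLC tick-and-merge rules (loosely following the scheme described at the link above; a toy for illustration, not production clock code): the state is a pair (physical part, logical counter), timestamps only move forward, and merging a received timestamp guarantees the effect stamps later than its cause.

```python
import time

class HybridLogicalClock:
    """Timestamps are (physical, logical) pairs, compared lexicographically.
    The clock never returns a value <= anything it has already seen."""
    def __init__(self, wall_clock=time.time):
        self.wall = wall_clock
        self.l = 0.0    # largest physical time seen so far
        self.c = 0      # logical counter to order events within one l

    def now(self):
        """Tick for a local event or for sending a message."""
        pt = self.wall()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, msg_ts):
        """Merge the timestamp carried by an incoming message, then tick."""
        ml, mc = msg_ts
        pt = self.wall()
        if pt > self.l and pt > ml:
            self.l, self.c = pt, 0
        elif ml > self.l:
            self.l, self.c = ml, mc + 1
        elif self.l > ml:
            self.c += 1
        else:                        # self.l == ml >= pt
            self.c = max(self.c, mc) + 1
        return (self.l, self.c)
```

Even if the receiver's wall clock is far behind the sender's, the merged timestamp is strictly larger than the one in the message, so cause < effect holds.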
Distributed ACID Transactions
Atomic: either happens in its entirety or not at all
Consistent: reading sees a consistent state, writing preserves consistency
Isolated: concurrent transactions do not see each other
Durable: committed writes are preserved after shutdown and crashes
(All relatively doable when transactions happen one after another!)
Distributed ACID Transactions
The Problem
In a distributed system:
How to make sure that all nodes agree on whether the transaction has happened? (Atomicity)
How to create a consistent snapshot across nodes? (Consistency)
How to hide ongoing activities until commit? (Isolation)
How to handle lost nodes? (Durability)
We have to take replication, resilience and failover into account.
Distributed ACID Transactions
WITHOUT
Distributed databases without ACID transactions:
ArangoDB, BigTable, Couchbase, Datastax, Dynamo, Elastic, HBase, MongoDB, RethinkDB, Riak, and lots more . . .
WITH
Distributed databases with ACID transactions:
(ArangoDB,) CockroachDB, FoundationDB, Spanner
=⇒ Very few distributed engines promise ACID, because this is hard!
Distributed ACID Transactions
Basic Idea
Use Multi-Version Concurrency Control (MVCC), i.e. multiple revisions of a data item are kept.
Perform writes and replication decentrally and distributed, without them becoming visible to other transactions.
Then have some place where there is a switch that decides when the transaction becomes visible.
These “switches” need to
be persisted somewhere (durability),
scale out (no bottleneck for commit/abort),
be replicated (no single point of failure),
be resilient in case of fail-over (fault-tolerance).
Transaction visibility needs to be implemented (MVCC); time stamps play a crucial role.
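The MVCC visibility rule can be sketched as a toy single-node model (an illustration of the idea only; it omits the distributed “switch”, replication and failover that make the real problem hard, and `MVCCStore` is a hypothetical name). Each committed write is a new version tagged with its commit timestamp; a reader at snapshot time t sees, per key, the newest version committed at or before t:

```python
class MVCCStore:
    """Multiple revisions per key; flipping the 'switch' corresponds to
    assigning and publishing a commit timestamp for a transaction."""
    def __init__(self):
        self.versions = {}        # key -> list of (commit_ts, value)

    def commit_write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort(key=lambda tv: tv[0])

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None
```

A write whose transaction has not committed yet simply has no commit timestamp, so no snapshot can see it; this is where the hybrid logical clocks from the previous section supply comparable timestamps across nodes.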
Thanks
Thanks and YES: We are hiring, too!