The Power of the Log
LSM & Append Only Data Structures
Ben Stopford
Confluent Inc
@benstopford
The Log Connectors Connectors
Producer Consumer
Streaming Engine
Kafka: a Streaming Platform
KAFKA’s Distributed Log
Linear Scans Append Only
Messaging is a Log-Shaped Problem
Linear Scans Append Only
Not all problems are Log-Shaped
Many problems benefit from being
addressed in a “log-shaped” way
Supporting Lookups
Lookups in a log
Head Tail
Trees provide Selectivity
bob dave fred hary mike steve vince
Index
But the overarching structure implies Dispersed Writes
bob dave fred hary mike steve vince
Random IO
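
The contrast in the last few slides can be sketched in a few lines of Python. This is only an illustration under assumptions: the `log` contents, `lookup_in_log`, `index_put` and `index_get` are hypothetical names, and a sorted in-memory list stands in for a tree index. A lookup in a plain log scans from the tail to the head, while a sorted index gives selectivity through binary search but forces each write to land wherever its key sorts.

```python
import bisect

# Hypothetical data: an append-only log of (key, value) records, oldest first.
log = [("mike", 1), ("bob", 2), ("steve", 3), ("bob", 4), ("dave", 5)]


def lookup_in_log(records, key):
    """Scan from the tail (newest) to the head (oldest): O(n) per lookup."""
    for k, v in reversed(records):
        if k == key:
            return v
    return None


# A sorted index (a stand-in for a tree): two parallel lists kept in key order.
index_keys = []
index_vals = []


def index_put(key, value):
    """Insert or update a key; returns the position that was written."""
    pos = bisect.bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        index_vals[pos] = value       # update in place
    else:
        index_keys.insert(pos, key)   # the write lands wherever the key sorts
        index_vals.insert(pos, value)
    return pos


def index_get(key):
    """Binary search: O(log n) per lookup."""
    pos = bisect.bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        return index_vals[pos]
    return None


# The write positions jump around the structure as keys arrive out of order.
positions = [index_put(k, v) for k, v in log]
```

On disk, those scattered write positions are what turn into random IO.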
Log Structured Merge Trees
1996
Used in a range of modern databases
•  BigTable
•  HBase
•  LevelDB
•  SQLite4
•  RocksDB
•  MongoDB
•  WiredTiger
•  Cassandra
•  MySQL
•  InfluxDB
...
If a system has a natural grain, it
is one formed of sequential
operations which favour locality
Caching & Prefetching
L3 cache
L2 cache
L1 cache
Pre-fetch is your
friend
CPU Caches
Page Cache
Application-level
caching
Disk Controller
Write efficiency comes from
amortising writes into sequential
operations
Taken from ACMQueue: The Pathologies of Big Data
So if we go against the grain of
the system, RAM can actually be
slower than disk
Going against the grain means dispersed
operations that break locality
Poor Locality Good Locality
The beauty of the log lies in its
sequentiality
Linear Scans Append Only
LSM is about re-imagining search
as a “log-shaped” problem
Arrange writes to be Append Only
Append Only
Journal
(Sequential IO)
Update in Place
Ordered File
(Random IO)
Bob = Carpenter
Bob = Carpenter
Bob = Cabinet Maker
Bob = Cabinet Maker
Avoid dispersed writes
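
A minimal sketch of the append-only journal, assuming a simple `key=value` line format and a temporary file. The helper names `append_record` and `replay` are made up for illustration. Changing Bob's occupation never overwrites the earlier record; the new value is appended and the current state is recovered by replaying the journal in order.

```python
import os
import tempfile


def append_record(path, key, value):
    """Append one record to the end of the journal (sequential IO)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key}={value}\n")


def replay(path):
    """Rebuild current state by reading the journal oldest to newest."""
    state = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("=")
            state[key] = value   # later records supersede earlier ones
    return state


journal_path = os.path.join(tempfile.mkdtemp(), "journal.log")
append_record(journal_path, "Bob", "Carpenter")
append_record(journal_path, "Bob", "Cabinet Maker")
state = replay(journal_path)

with open(journal_path, encoding="utf-8") as f:
    journal_lines = f.read().splitlines()
```

Both versions of the record remain in the file, which is why the later slides need merging to de-duplicate.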
Simple LSM
Writes are collected in memory
Writes
sort
write to disk
older
files
small
index file
RAM
When enough have buffered, sort.
Writes
write to disk
older
files
small
index file
Batched
sorted
RAM
Write the sorted file to disk
Writes
write to disk
older
files
Small, sorted
immutable file
Batched
sorted
Repeat...
Writes
write to disk
Older files New files
Batched
sorted
Batching -> Fast Sequential IO
Writes
write to disk
Older files New files
Batched
Sorted
memtable
That’s the core write path
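
The write path described on these slides might look roughly like the sketch below. It rests on assumptions: `SimpleLSM`, `memtable_limit` and the use of in-memory tuples as stand-ins for immutable sorted files on disk are all illustrative choices, not any particular database's design.

```python
class SimpleLSM:
    """Toy write path: buffer writes in memory, sort, flush as an immutable run."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # the only mutable structure
        self.files = []      # oldest first; each entry is a sorted, immutable run

    def put(self, key, value):
        # Writes are collected in memory.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # When enough have buffered: sort, then write the whole batch at once.
        if not self.memtable:
            return
        sorted_run = tuple(sorted(self.memtable.items()))
        self.files.append(sorted_run)   # stand-in for one sequential disk write
        self.memtable = {}


lsm = SimpleLSM(memtable_limit=3)
for key, value in [("mike", 1), ("bob", 2), ("steve", 3), ("dave", 4)]:
    lsm.put(key, value)
```

After the four writes above, the first three keys sit in one sorted run and `dave` is still buffered in the memtable.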
What about reads?
Search reverse-chronologically
older
files
newer
files
(1) Is “bob” here?
(2) Is “bob” here?
(3) Is “bob” here?
(4) Is “bob” here?
Worst Case
We consult every file
We might have a lot of files!
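
A sketch of that read path, assuming each file is a sorted list of `(key, value)` pairs held in memory. The names `search_file` and `get` and the sample data are hypothetical. The function also returns how many files were consulted, which makes the worst case visible.

```python
import bisect


def search_file(sorted_run, key):
    """Binary search one sorted, immutable run; returns (found, value)."""
    pos = bisect.bisect_left(sorted_run, (key,))
    if pos < len(sorted_run) and sorted_run[pos][0] == key:
        return True, sorted_run[pos][1]
    return False, None


def get(memtable, files, key):
    """Check the memtable, then files newest to oldest.

    `files` is ordered oldest first. Returns (value, files_consulted).
    """
    if key in memtable:
        return memtable[key], 0
    consulted = 0
    for run in reversed(files):
        consulted += 1
        found, value = search_file(run, key)
        if found:
            return value, consulted
    return None, consulted


files = [
    [("bob", "Carpenter"), ("mike", "Plumber")],       # oldest
    [("dave", "Baker"), ("steve", "Pilot")],
    [("bob", "Cabinet Maker"), ("vince", "Tailor")],   # newest
]
memtable = {"fred": "Farmer"}
```

Here `bob` is found in the newest file after one lookup, while `mike` and any missing key force a search of all three files.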
LSM naturally optimises for writes,
over reads
This is a reasonable tradeoff to make
Optimising reads is easier than
optimising writes
Optimisation 1
Bound the number of files
Create levels
Level-0
Level-1
Separate thread merges old files,
de-duplicating them.
Level-0
Level-1
Separate thread merges old files,
de-duplicating them.
Level-0
Level-1
Merging process is reminiscent of
merge sort
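
The merge step can be sketched with the standard library's `heapq.merge`, which performs the k-way merge of a merge sort. This is a simplified illustration: `merge_runs` is a hypothetical name, runs are in-memory lists, and deletion markers (tombstones) are not handled.

```python
import heapq


def merge_runs(runs):
    """Merge sorted runs (oldest first) into one sorted run.

    When the same key appears in several runs, only the newest value is kept.
    """
    # Tag each record with the negated age of its run so that, for equal keys,
    # the record from the newest run sorts first.
    tagged = [
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue   # an older duplicate of a key already emitted
        merged.append((key, value))
    return merged


older = [("bob", "Carpenter"), ("mike", "Plumber")]
newer = [("bob", "Cabinet Maker"), ("dave", "Baker")]
merged = merge_runs([older, newer])
```

Every input is read and the output written in one sequential pass, so the merge itself stays log-shaped.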
Take this further with levels
Level-0
Level-1
Level-2
Level-3
Memtable
But single reads still require many
individual lookups:
•  Number of searches:
–  1 per base level
–  1 per level above
Optimisation 2
Caching & Friends
Add Memory
i.e. More Caching / Pre-fetch
Read Ahead & Prefetch
L3 cache
L2 cache
L1 cache
Pre-fetch is your
friend
Page Cache
Disk Controller
If only there was a more efficient
way to avoid searching each file!
Elven Magic?
Bloom Filters
Answers the question:
Do I need to look in this file to
find the value for this key?
Size -> probability of false positive
Key
Hash Function
Bit Set
Bloom Filters
•  Space efficient, probabilistic
data structure
•  As keyspace grows:
–  p(collision) increases
–  Index size is fixed
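
A minimal Bloom filter sketch, assuming SHA-256 with a per-hash prefix as the hash functions and a Python integer as the bit set. The class name and the parameters `num_bits` and `num_hashes` are illustrative. A "no" answer is definite, so the file can be skipped; a "yes" answer may be a false positive, so the file still has to be searched.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit set answering 'might this file contain the key?'"""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # a Python int used as the bit set

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bloom = BloomFilter()
for key in ["bob", "dave", "fred"]:
    bloom.add(key)
```

The bit set never grows, so adding more keys raises the chance of a false positive rather than the memory used.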
Many more degrees of freedom for
optimising reads
RAM
Disk
file metadata
& bloom filter
Log Structured Merge Trees
•  A collection of small, immutable indexes
•  All sequential operations, de-duplicate by merging files
•  Index/Bloom in RAM to increase read performance
Subtleties
•  Writes are 1 x IO (blind writes), rather than 2 x IOs
(read + modify)
•  Batching writes decreases write amplification. In trees leaf pages must be updated.
Immutability => Simpler locking semantics
Only
memtable
is mutable
Does it work?
Lots of real world examples
Measurable in the real world
•  InnoDB vs MyRocks results, taken from Mark Callaghan’s blog: http://bit.ly/2mhWT7p
•  There are many subtleties. Take all benchmarks with a pinch of salt.
Elements of Beauty
•  Reframing the problem to be Log-Centric. To go with
the grain of the system.
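•  Optimise for the harder problem
•  Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.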
Applies in many other areas
•  Sequentiality
–  Databases: write ahead logs
–  Columnar databases: Merge Joins
–  Kafka
•  Immutability
–  Snapshot isolation over explicit locking.
–  Replication (state machine replication)
Log-Centric Approaches Work in
Applications too
Event Sourcing
•  Journaling of
state changes
•  No “update in
place”
Object
Journal
+ 10.36
- 12.12
+ 23.70
+ 13.33
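
The journal on this slide can be sketched as follows, assuming the four entries are monetary amounts applied to an account-like object. The helpers `record` and `balance` are hypothetical. The object never stores a balance that is updated in place; the balance is derived by replaying the journal.

```python
from decimal import Decimal

# The journal of state changes from the slide, oldest first.
journal = [
    Decimal("10.36"),
    Decimal("-12.12"),
    Decimal("23.70"),
    Decimal("13.33"),
]


def record(entries, amount):
    """Append a state change; existing entries are never modified."""
    entries.append(Decimal(amount))


def balance(entries):
    """Derive current state by replaying every journalled change."""
    return sum(entries, Decimal("0"))


current = balance(journal)   # Decimal("35.27")
```

Because earlier entries are never touched, the state at any past point can be rebuilt by replaying a prefix of the journal.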
CQRS
Client
Command Query
Write
Optimised
Read
Optimised
log
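
A rough sketch of the CQRS split, assuming an in-memory list as the log and integer amounts. `handle_command`, `BalanceView` and the event shape are invented for illustration. Commands only append to the log, and queries are served from a separate read-optimised view that is built by consuming it.

```python
log = []   # the append-only log connecting the two sides


def handle_command(account, amount):
    """Write side: validate the command, then append an event to the log."""
    if amount == 0:
        raise ValueError("amount must be non-zero")
    log.append({"account": account, "amount": amount})


class BalanceView:
    """Read side: a pre-aggregated view, cheap to query."""

    def __init__(self):
        self.offset = 0      # how far into the log this view has read
        self.balances = {}

    def catch_up(self, events):
        for event in events[self.offset:]:
            account = event["account"]
            self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.offset = len(events)

    def query(self, account):
        return self.balances.get(account, 0)


handle_command("bob", 500)
handle_command("dave", 200)
handle_command("bob", -150)

view = BalanceView()
view.catch_up(log)
```

The view is only as fresh as its last `catch_up`, which is the usual price of separating the write-optimised and read-optimised sides.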
How Applications or Services
share state
Log-Centric Services
Writer
Read-Replica
Read-Replica
Read-Replica
Writes are localised
to a single service
Log-Centric Services
Writer
Read-Replica
Read-Replica
Read-Replica
Immutable log
Log-Centric Services
Writer
Read-Replica
Read-Replica
Read-Replica
Many, independent
read replicas
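
The single-writer, many-replica arrangement might be sketched like this, assuming an in-memory list as the immutable log. `Writer`, `ReadReplica`, `poll` and `max_records` are illustrative names. Each replica tracks its own offset, so replicas can read at different speeds without coordinating with each other or with the writer.

```python
class Writer:
    """The single service that owns writes: it only ever appends."""

    def __init__(self):
        self.log = []

    def write(self, key, value):
        self.log.append((key, value))
        return len(self.log) - 1   # offset of the new record


class ReadReplica:
    """Builds its own state from the log, at its own pace."""

    def __init__(self, log):
        self._log = log
        self.offset = 0
        self.state = {}

    def poll(self, max_records=None):
        end = len(self._log)
        if max_records is not None:
            end = min(end, self.offset + max_records)
        for key, value in self._log[self.offset:end]:
            self.state[key] = value
        self.offset = end


writer = Writer()
writer.write("bob", "Carpenter")
writer.write("dave", "Baker")
writer.write("bob", "Cabinet Maker")

fast = ReadReplica(writer.log)
slow = ReadReplica(writer.log)
fast.poll()                 # reads everything written so far
slow.poll(max_records=1)    # lags behind without affecting anyone else
```

A lagging replica simply continues from its own offset the next time it polls.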
Elements of Beauty
•  Reframing the problem to be Log-Centric. To go with
the grain of the system.
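•  Optimise for the harder problem
•  Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.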
Decentralised Design
In both database design and
application development
The Log is the central building block
Pushes us towards the natural grain of
the system
The Log
A single unifying abstraction
References
LSM:
•  benstopford.com/2015/02/14/log-structured-merge-trees/
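•  smalldatum.blogspot.co.uk/2017/02/using-modern-sysbench-to-compare.html
•  www.quora.com/How-does-the-Log-Structured-Merge-Tree-work
•  bLSM paper: http://bit.ly/2mT7Vje
Other
•  Pat Helland (Immutability) cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
•  Peter Bailis (Coordination Avoidance): http://bit.ly/2m7XxnI
•  Jay Kreps: I Heart Logs (O’Reilly 2014)
•  The Data Dichotomy: http://bit.ly/2hk9c2K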
Thank you
@benstopford
http://benstopford.com
ben@confluent.io
See the full talk here: https://www.infoq.com/presentations/lsm-append-data-structures

This talk is about the beauty of sequential access and append-only data structures, explored through a little-known paper entitled “Log Structured Merge Trees”. LSM describes a surprisingly counterintuitive approach to storing and accessing data in a sequential fashion. It came to prominence in Google's Bigtable paper, and today the use of logs, LSM and append-only data structures drives many of the world's most influential storage systems: Cassandra, HBase, RocksDB, Kafka and more. Finally we'll look at how the beauty of sequential access goes beyond database internals, right through to how applications communicate, share data and scale.

Published in: Technology
The Power of the Log

  1. The Power of the Log LSM & Append Only Data Structures Ben Stopford Confluent Inc
  2. @benstopford
  3. The Log Connectors Connectors Producer Consumer Streaming Engine Kafka: a Streaming Platform
  4. KAFKA’s Distributed Log Linear Scans Append Only
  5. Messaging is a Log-Shaped Problem Linear Scans Append Only
  6. Not all problems are Log-Shaped
  7. Many problems benefit from being addressed in a “log-shaped” way
  8. Supporting Lookups
  9. Lookups in a log Head Tail
  10. Trees provide Selectivity bob dave fred hary mike steve vince Index
  11. But the overarching structure implies Dispersed Writes bob dave fred hary mike steve vince Random IO
  12. Log Structured Merge Trees 1996
  13. Used in a range of modern databases •  BigTable •  HBase •  LevelDB •  SQLite4 •  RocksDB •  MongoDB •  WiredTiger •  Cassandra •  MySQL •  InfluxDB ...
  14. If a system has a natural grain, it is one formed of sequential operations which favour locality
  15. Caching & Prefetching L3 cache L2 cache L1 cache Pre-fetch is your friend CPU Caches Page Cache Application-level caching Disk Controller
  16. Write efficiency comes from amortising writes into sequential operations
  17. Taken from ACMQueue: The Pathologies of Big Data
  18. So if we go against the grain of the system, RAM can actually be slower than disk
  19. Going against the grain means dispersed operations that break locality Poor Locality Good Locality
  20. The beauty of the log lies in its sequentiality Linear Scans Append Only
  21. LSM is about re-imagining search as a “log-shaped” problem
  22. Arrange writes to be Append Only Append Only Journal (Sequential IO) Update in Place Ordered File (Random IO) Bob = Carpenter Bob = Carpenter Bob = Cabinet Maker Bob = Cabinet Maker
  23. Avoid dispersed writes
  24. Simple LSM
  25. Writes are collected in memory Writes sort write to disk older files small index file RAM
  26. When enough have buffered, sort. Writes write to disk older files small index file Batched sorted RAM
  27. Write the sorted file to disk Writes write to disk older files Small, sorted immutable file Batched sorted
  28. Repeat... Writes write to disk Older files New files Batched sorted
  29. Batching -> Fast Sequential IO Writes write to disk Older files New files Batched Sorted memtable
  30. That’s the core write path
  31. What about reads?
  32. Search reverse-chronologically older files newer files (1) Is “bob” here? (2) Is “bob” here? (3) Is “bob” here? (4) Is “bob” here?
  33. Worst Case We consult every file
  34. We might have a lot of files!
  35. LSM naturally optimises for writes, over reads This is a reasonable tradeoff to make
  36. Optimising reads is easier than optimising writes
  37. Optimisation 1 Bound the number of files
  38. Create levels Level-0 Level-1
  39. Separate thread merges old files, de-duplicating them. Level-0 Level-1
  40. Separate thread merges old files, de-duplicating them. Level-0 Level-1
  41. Merging process is reminiscent of merge sort
  42. Take this further with levels Level-0 Level-1 Level-2 Level-3 Memtable
  43. But single reads still require many individual lookups: •  Number of searches: –  1 per base level –  1 per level above
  44. Optimisation 2 Caching & Friends
  45. Add Memory i.e. More Caching / Pre-fetch
  46. Read Ahead & Prefetch L3 cache L2 cache L1 cache Pre-fetch is your friend Page Cache Disk Controller
  47. If only there was a more efficient way to avoid searching each file!
  48. Elven Magic?
  49. Bloom Filters Answers the question: Do I need to look in this file to find the value for this key? Size -> probability of false positive Key Hash Function Bit Set
  50. Bloom Filters •  Space efficient, probabilistic data structure •  As keyspace grows: –  p(collision) increases –  Index size is fixed
  51. Many more degrees of freedom for optimising reads RAM Disk file metadata & bloom filter
  52. Log Structured Merge Trees •  A collection of small, immutable indexes •  All sequential operations, de-duplicate by merging files •  Index/Bloom in RAM to increase read performance
  53. Subtleties •  Writes are 1 x IO (blind writes), rather than 2 x IOs (read + modify) •  Batching writes decreases write amplification. In trees leaf pages must be updated.
  54. Immutability => Simpler locking semantics Only memtable is mutable
  55. Does it work? Lots of real world examples
  56. Measurable in the real world •  InnoDB vs MyRocks results, taken from Mark Callaghan’s blog: http://bit.ly/2mhWT7p •  There are many subtleties. Take all benchmarks with a pinch of salt.
  57. Elements of Beauty •  Reframing the problem to be Log-Centric. To go with the grain of the system. •  Optimise for the harder problem •  Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
  58. Applies in many other areas •  Sequentiality –  Databases: write ahead logs –  Columnar databases: Merge Joins –  Kafka •  Immutability –  Snapshot isolation over explicit locking. –  Replication (state machine replication)
  59. Log-Centric Approaches Work in Applications too
  60. Event Sourcing •  Journaling of state changes •  No “update in place” Object Journal + 10.36 - 12.12 + 23.70 + 13.33
  61. CQRS Client Command Query Write Optimised Read Optimised log
  62. How Applications or Services share state
  63. Log-Centric Services Writer Read-Replica Read-Replica Read-Replica Writes are localised to a single service
  64. Log-Centric Services Writer Read-Replica Read-Replica Read-Replica Immutable log
  65. Log-Centric Services Writer Read-Replica Read-Replica Read-Replica Many, independent read replicas
  66. Elements of Beauty •  Reframing the problem to be Log-Centric. To go with the grain of the system. •  Optimise for the harder problem •  Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
  67. Decentralised Design In both database design and application development
  68. The Log is the central building block Pushes us towards the natural grain of the system
  69. The Log A single unifying abstraction
  70. References LSM: •  benstopford.com/2015/02/14/log-structured-merge-trees/ •  smalldatum.blogspot.co.uk/2017/02/using-modern-sysbench-to-compare.html •  www.quora.com/How-does-the-Log-Structured-Merge-Tree-work •  bLSM paper: http://bit.ly/2mT7Vje Other •  Pat Helland (Immutability) cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf •  Peter Bailis (Coordination Avoidance): http://bit.ly/2m7XxnI •  Jay Kreps: I Heart Logs (O’Reilly 2014) •  The Data Dichotomy: http://bit.ly/2hk9c2K
  71. Thank you @benstopford http://benstopford.com ben@confluent.io
