Power of the Log: LSM & Append Only Data Structures

This talk is about the beauty of sequential access and append-only data structures. We'll do this in the context of a little-known paper entitled “Log Structured Merge Trees”. LSM describes a surprisingly counterintuitive approach to storing and accessing data in a sequential fashion. It came to prominence in Google's Bigtable paper, and today the use of logs, LSM and append-only data structures drives many of the world's most influential storage systems: Cassandra, HBase, RocksDB, Kafka and more. Finally, we'll look at how the beauty of sequential access goes beyond database internals, right through to how applications communicate, share data and scale.

  1. 1. The Power of the Log: LSM & Append Only Data Structures. Ben Stopford, Confluent Inc
  2. 2. @benstopford
  3. 3. Kafka: a Streaming Platform (diagram: The Log, Connectors, Producer, Consumer, Streaming Engine)
  4. 4. Kafka's Distributed Log: Append Only, Linear Scans
  5. 5. Messaging is a Log-Shaped Problem: Append Only, Linear Scans
  6. 6. Not all problems are Log-Shaped
  7. 7. Many problems benefit from being addressed in a “log-shaped” way
  8. 8. Supporting Lookups
  9. 9. Lookups in a log (diagram: Head, Tail)
  10. 10. Trees provide Selectivity (diagram: an Index over the keys bob, dave, fred, hary, mike, steve, vince)
  11. 11. But the overarching structure implies Dispersed Writes (diagram: random IO across the keys bob, dave, fred, hary, mike, steve, vince)
  12. 12. Log Structured Merge Trees 1996
  13. 13. Used in a range of modern databases •  BigTable •  HBase •  LevelDB •  SQLite4 •  RocksDB •  MongoDB •  WiredTiger •  Cassandra •  MySQL •  InfluxDB ...
  14. 14. If systems have a natural grain, it is one formed of sequential operations which favour locality
  15. 15. Caching & Prefetching (diagram: CPU caches L1, L2, L3; page cache; application-level caching; disk controller. Pre-fetch is your friend)
  16. 16. Write efficiency comes from amortising writes into sequential operations
  17. 17. Taken from ACMQueue: The Pathologies of Big Data
  18. 18. So if we go against the grain of the system, RAM can actually be slower than disk
  19. 19. Going against the grain means dispersed operations that break locality (diagram: poor locality vs good locality)
  20. 20. The beauty of the log lies in its sequentiality: Append Only, Linear Scans
  21. 21. LSM is about re-imagining search as a “log-shaped” problem
  22. 22. Arrange writes to be Append Only (diagram: an Append Only Journal uses sequential IO, appending “Bob = Carpenter” and then “Bob = Cabinet Maker”; an update-in-place Ordered File uses random IO, overwriting “Bob = Carpenter” with “Bob = Cabinet Maker”)
  23. 23. Avoid dispersed writes
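To make slide 22 concrete, here is a minimal Java sketch (not from the deck) contrasting the two write patterns: appending a record to a journal is one sequential write, while updating an ordered file in place means seeking to the record's slot and overwriting it. The class and method names, and the fixed-size-record assumption, are illustrative only.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class WritePatterns {
        // Append-only journal: every change goes to the end of the file (sequential IO).
        static void appendToJournal(String path, String key, String value) throws IOException {
            try (FileWriter journal = new FileWriter(path, true)) {   // true = append mode
                journal.write(key + " = " + value + System.lineSeparator());
            }
        }

        // Update in place: seek to a known offset and overwrite (random, dispersed IO).
        static void updateInPlace(String path, long offset, byte[] newRecord) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
                file.seek(offset);          // jump to the record's slot in the ordered file
                file.write(newRecord);      // overwrite it, assuming fixed-size records
            }
        }
    }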
  24. 24. Simple LSM
  25. 25. Writes are collected in memory (diagram: writes buffer in RAM before being sorted and written to disk as a small index file, alongside older files)
  26. 26. When enough have buffered, sort (diagram: the batch of writes is sorted in RAM before being written to disk)
  27. 27. Write the sorted file to disk (diagram: the sorted batch becomes a small, sorted, immutable file alongside the older files)
  28. 28. Repeat... (diagram: batched, sorted writes keep adding new files alongside the older ones)
  29. 29. Batching -> Fast Sequential IO (diagram: writes are batched and sorted in the memtable, then written to disk as new files)
  30. 30. That’s the core write path
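A minimal Java sketch of this write path, assuming an in-memory TreeMap as the memtable, a tiny flush threshold, and a simple tab-separated file format; real implementations such as RocksDB or Cassandra add a write-ahead log, binary formats and background flushing, none of which is shown here.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.TreeMap;

    public class SimpleLsmWriter {
        private static final int FLUSH_THRESHOLD = 4;          // tiny, for illustration
        private final TreeMap<String, String> memtable = new TreeMap<>();
        private int fileCounter = 0;

        public void put(String key, String value) throws IOException {
            memtable.put(key, value);                           // buffered in RAM, kept sorted by key
            if (memtable.size() >= FLUSH_THRESHOLD) {
                flush();
            }
        }

        private void flush() throws IOException {
            Path file = Path.of("sstable-" + (fileCounter++) + ".txt");
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
                for (Map.Entry<String, String> e : memtable.entrySet()) {
                    out.println(e.getKey() + "\t" + e.getValue());  // one sequential, sorted write
                }
            }
            memtable.clear();                                   // start a fresh memtable
        }
    }

Because the TreeMap keeps keys ordered, each flush is a single sequential pass that produces an already-sorted, immutable file.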
  31. 31. What about reads?
  32. 32. Search reverse-chronologically (diagram: ask “Is 'bob' here?” of the newest file first, then of each progressively older file)
  33. 33. Worst Case We consult every file
  34. 34. We might have a lot of files!
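A matching read-path sketch, assuming the tab-separated files produced by the writer above and a list of files ordered newest first. It scans each file linearly for clarity; a real LSM would check the memtable first and binary-search each sorted file, but the worst case is the same: every file may be consulted.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class SimpleLsmReader {
        // Files are searched newest first; each holds sorted "key<TAB>value" lines.
        public static String get(String key, List<Path> filesNewestFirst) throws IOException {
            for (Path file : filesNewestFirst) {                    // reverse-chronological scan
                for (String line : Files.readAllLines(file)) {
                    String[] parts = line.split("\t", 2);
                    if (parts[0].equals(key)) {
                        return parts[1];                            // newest value wins
                    }
                }
            }
            return null;                                            // worst case: every file consulted
        }
    }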
  35. 35. LSM naturally optimises for writes over reads. This is a reasonable tradeoff to make
  36. 36. Optimising reads is easier than optimising writes
  37. 37. Optimisation 1 Bound the number of files
  38. 38. Create levels Level-0 Level-1
  39. 39. Separate thread merges old files, de-duplicating them (diagram: Level-0, Level-1)
  40. 40. Separate thread merges old files, de-duplicating them (diagram: Level-0, Level-1)
  41. 41. Merging process is reminiscent of merge sort
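The compaction step can be sketched as the merge phase of merge sort with de-duplication: two sorted runs are merged into one, and when the same key appears in both, the entry from the newer run wins. The method below is illustrative only, operating on in-memory key/value pairs rather than on-disk files.

    import java.util.ArrayList;
    import java.util.List;

    public class Compaction {
        // Merge two sorted runs of {key, value} pairs; the newer run wins on duplicate keys.
        public static List<String[]> merge(List<String[]> newer, List<String[]> older) {
            List<String[]> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < newer.size() && j < older.size()) {
                int cmp = newer.get(i)[0].compareTo(older.get(j)[0]);
                if (cmp < 0) {
                    out.add(newer.get(i++));
                } else if (cmp > 0) {
                    out.add(older.get(j++));
                } else {
                    out.add(newer.get(i++));    // duplicate key: keep the newer entry
                    j++;                        // and drop the older one
                }
            }
            while (i < newer.size()) out.add(newer.get(i++));
            while (j < older.size()) out.add(older.get(j++));
            return out;                         // still sorted, now de-duplicated
        }
    }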
  42. 42. Take this further with levels Level-0 Level-1 Level-2 Level-3 Memtable
  43. 43. But single reads still require many individual lookups: •  Number of searches: –  1 per base level –  1 per level above
  44. 44. Optimisation 2 Caching & Friends
  45. 45. Add Memory i.e. More Caching / Pre-fetch
  46. 46. Read Ahead & Prefetch (diagram: L1, L2, L3 caches, page cache, disk controller. Pre-fetch is your friend)
  47. 47. If only there was a more efficient way to avoid searching each file!
  48. 48. Elven Magic?
  49. 49. Bloom Filters answer the question: do I need to look in this file to find the value for this key? Size -> probability of false positive (diagram: Key -> Hash Function -> Bit Set)
  50. 50. Bloom Filters •  Space efficient, probabilistic data structure •  As keyspace grows: –  p(collision) increases –  Index size is fixed
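A toy Bloom filter in Java, assuming java.util.BitSet and a crude seeded hash; production systems use better hash functions and size the bit set from the expected key count and target false-positive rate. A negative answer means the file can be skipped entirely; a positive answer only means the key might be there.

    import java.util.BitSet;

    public class SimpleBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        public SimpleBloomFilter(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        public void add(String key) {
            for (int i = 0; i < hashCount; i++) {
                bits.set(indexFor(key, i));          // set one bit per hash function
            }
        }

        // false -> the key is definitely not in the file (skip it)
        // true  -> the key might be in the file (possible false positive)
        public boolean mightContain(String key) {
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(indexFor(key, i))) return false;
            }
            return true;
        }

        private int indexFor(String key, int seed) {
            int h = key.hashCode() * 31 + seed;      // crude seeded hash, fine for a sketch
            return Math.floorMod(h, size);
        }
    }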
  51. 51. Many more degrees of freedom for optimising reads (diagram: file metadata & bloom filters held in RAM; data files on disk)
  52. 52. Log Structured Merge Trees •  A collection of small, immutable indexes •  All sequential operations, de-duplicate by merging files •  Index/Bloom in RAM to increase read performance
  53. 53. Subtleties •  Writes are 1 x IO (blind writes), rather than 2 x IOs (read + modify) •  Batching writes decreases write amplification. In trees, leaf pages must be updated.
  54. 54. Immutability => Simpler locking semantics Only memtable is mutable
  55. 55. Does it work? Lots of real world examples
  56. 56. Measurable in the real world •  InnoDB vs MyRocks results, taken from Mark Callaghan’s blog: http://bit.ly/2mhWT7p •  There are many subtleties. Take all benchmarks with a pinch of salt.
  57. 57. Elements of Beauty •  Reframing the problem to be Log-Centric. To go with the grain of the system. •  Optimise for the harder problem •  Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
  58. 58. Applies in many other areas •  Sequentiality –  Databases: write ahead logs –  Columnar databases: Merge Joins –  Kafka •  Immutability –  Snapshot isolation over explicit locking. –  Replication (state machines replication)
  59. 59. Log-Centric Approaches Work in Applications too
  60. 60. Event Sourcing •  Journaling of state changes •  No “update in place” (diagram: an Object whose state is derived from a Journal of deltas: + 10.36, - 12.12, + 23.70, + 13.33)
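A minimal event-sourcing sketch with illustrative names: state changes are appended to a journal of deltas, mirroring the + 10.36, - 12.12, ... entries on the slide, and the current value is derived by replaying them.

    import java.util.ArrayList;
    import java.util.List;

    public class Account {
        private final List<Double> journal = new ArrayList<>();

        public void apply(double delta) {
            journal.add(delta);                               // append only, never overwrite
        }

        public double balance() {
            double balance = 0;
            for (double delta : journal) balance += delta;    // replay to derive current state
            return balance;
        }
    }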
  61. 61. CQRS (diagram: a Client sends Commands to a write-optimised side and Queries to a read-optimised side, with a log connecting the two)
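A CQRS sketch under the same hedged assumptions: commands only append events to a log, a projection consumes that log to maintain a read-optimised view, and queries are served entirely from that view. In practice the two sides are separate services and the log is something like Kafka; here everything is in-process for brevity.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CqrsSketch {
        private final List<String[]> log = new ArrayList<>();          // shared log of events
        private final Map<String, String> readView = new HashMap<>();  // read-optimised projection
        private int appliedUpTo = 0;                                    // how far the projection has read

        // Command side: just append to the log (write optimised, no read-modify-write).
        public void handleCommand(String key, String value) {
            log.add(new String[] {key, value});
        }

        // Projection: consume new log entries to keep the read view current.
        public void updateReadView() {
            while (appliedUpTo < log.size()) {
                String[] event = log.get(appliedUpTo++);
                readView.put(event[0], event[1]);
            }
        }

        // Query side: served entirely from the read-optimised view.
        public String handleQuery(String key) {
            return readView.get(key);
        }
    }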
  62. 62. How Applications or Services share state
  63. 63. Log-Centric Services: writes are localised to a single service (diagram: one Writer, several Read-Replicas)
  64. 64. Log-Centric Services: an immutable log connects the Writer to the Read-Replicas
  65. 65. Log-Centric Services: many, independent read replicas (diagram: one Writer, several Read-Replicas)
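The read-replica pattern from these slides, sketched with illustrative names: the writer is the only component that appends to the shared, immutable log, and each replica consumes it independently, tracking its own offset and building whatever local state it needs.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ReadReplica {
        private final Map<String, String> localState = new HashMap<>();
        private int offset = 0;                                  // this replica's position in the log

        public void catchUp(List<String[]> sharedLog) {
            while (offset < sharedLog.size()) {                  // replay only what is new
                String[] event = sharedLog.get(offset++);
                localState.put(event[0], event[1]);
            }
        }

        public String read(String key) {
            return localState.get(key);
        }
    }

Several ReadReplica instances can consume the same log at different speeds without coordinating with each other or with the writer; writes stay localised to that single writer.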
  66. 66. Elements of Beauty •  Reframing the problem to be Log-Centric. To go with the grain of the system. •  Optimise for the harder problem •  Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.
  67. 67. Decentralised Design, in both database design and application development
  68. 68. The Log is the central building block. It pushes us towards the natural grain of the system
  69. 69. The Log A single unifying abstraction
  70. 70. References LSM: •  benstopford.com/2015/02/14/log-structured-merge-trees/ •  smalldatum.blogspot.co.uk/2017/02/using-modern-sysbench-to-compare.html •  www.quora.com/How-does-the-Log-Structured-Merge-Tree-work •  bLSM paper: http://bit.ly/2mT7Vje Other •  Pat Helland (Immutability): cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf •  Peter Bailis (Coordination Avoidance): http://bit.ly/2m7XxnI •  Jay Kreps: I Heart Logs (O’Reilly 2014) •  The Data Dichotomy: http://bit.ly/2hk9c2K
  71. 71. Thank you @benstopford http://benstopford.com ben@confluent.io
