Building reliable systems with Apache BookKeeper

A presentation at the Barcelona JUG 19/06/2014

  1. Building reliable systems with Apache BookKeeper. Matthieu Morel, Ivan Kelly
  2. Challenges in distributed systems
  3. Expect failures (photo: Ryan Lintleman, CC BY-NC 2.0, https://flic.kr/p/5XNGow)
  4. Up to 10% annual failure rates for disks/servers
  5. "The network is reliable": NOT (photo: Jonathan Briggs, CC BY 2.0, https://flic.kr/p/bnAxuz)
  6. Symptoms (photo: Alex Proimos, CC BY 2.0, https://flic.kr/p/bt29wL)
  7. Problem 1: not available
  8. Problem 1: not available
  9. Problem 2: inconsistencies
  10. CAP: consistency, partition tolerance, availability. Systems shown: ZooKeeper / BookKeeper, Cassandra
  11. More issues... (photo: CC BY 2.0, https://flic.kr/p/8j57SG)
  12. Problem 3: split brain. Expected: a single writer for A. With both writer A and writer A’ active: 2 writers!
  13. Problem 4: failure detection (nodes A, B, C)
  14. Problem 5: recovery. Recovery protocol? For many systems, we need consistent data for recovery
  15. Solutions! (photo: Steven Depolo, CC BY 2.0, https://flic.kr/p/9APgFF)
  16. Guarantees, protocols, tools / building blocks, techniques
  17. Primary backup (active / standby), active replication (active / active)
  18. Replication: f+1 replicas for f concurrent failures
  19. Quorums: ensemble; quorum 1, quorum 2, quorum 3
  20. Useful building block: ZooKeeper
      ● Centralized coordination service
        o configuration, service discovery
        o locking, queues, barriers
        o failure detection
        o leader election, membership
      ● Reliable
      ● Source of truth
  21. Failure detection with ZooKeeper: ephemeral znodes, heartbeats, timeouts; triggers cluster update
  22. Recovery: protocol
      1. provision / acquire new node
      2. fetch under-replicated data
      3. rebuild state
      4. join ensemble
  23. Requirement: durability. Sync mutations to storage; append-only journal. Journaling: persisting mutations. Imaging: persisting the state when it changes
  24. Concretely
      ● Write-Ahead Logging
      ● Databases - data stores
      ● Durable Messaging
  25. Enter Apache BookKeeper: reliable distributed logging (photo: CC BY 2.0, https://flic.kr/p/dSHr87)
  26. BookKeeper: durability service. Durability, replication, consistency, on commodity hardware; recovery; user library. A building block for reliable systems
  27. The ledger abstraction: each ledger (Ledger 1, Ledger 2, Ledger 3) is a sequence of ops; labels: add, read, checkpoint
  28. Guarantees. If an entry has been acknowledged, it must be readable. If an entry is read once, it must always be readable
  29. History. Initial use case: Hadoop name node recovery. 2008: open sourced as a contrib of ZooKeeper. 2011: sub-project of ZooKeeper. 2012: production
  30. Community. Committers from:
      ● Yahoo!
      ● Twitter
      ● Microsoft
      ● Huawei
      ● Facebook
  31. Inside of Apache BookKeeper (photo: CC BY 2.0, https://flic.kr/p/agpLTR)
  32. Architecture: client system, client library, ledgers, bookies, ZooKeeper; labels: entry, index, metadata store
  33. Write path: client library, ledgers, bookies; entry replication + striping
  34. Reliable writes
      ● store digest along with entry
      ● fsync each entry before returning
      ● ACK when:
        ○ all previous entries
        ○ this entry accepted by quorum
  35. Read path: read(ledger X entry Y) sent to the bookies
  36. Partial writes: read(ledger X entry Y); no quorum for this entry! Read would be inconsistent
  37. Last Add Confirmed: consensus on written entries. read(ledger X entry Y) on the bookies; close(ledger X); ZooKeeper: Ledger X, LAC
  38. Recovery of a ledger (bookies, ZooKeeper): What is the last entry? Piggy-back last add confirmed
  39. Fencing: prevent multiple brains (several writers)
  40. Inside of a bookie: sequential entries, interleaved, physical file, roll. Entries shown: L2 - E7, L2 - E6, L2 - E3, L2 - E4, L1 - E4, L1 - E2, L2 - E1, L1 - E1, L3 - E1, ...
  41. Storage device: disk, fsync; add, ack. Sequential entries, synchronous writes: OK for writing. Reads interfere with writes. Entries shown: L2 - E3, L3 - E7, L1 - E4, L1 - E2, L2 - E1, L1 - E1
  42. Separate read and write devices (disk 1, disk 2). Journal device: add, fsync, ack; durability. Ledger device: async flush, cache, INDEX; read-efficient. Similar rates. Entries shown on both: L2 - E3, L3 - E7, L1 - E4, L1 - E2, L2 - E1, L1 - E1
  43. Garbage collection / compaction (disk 1, disk 2; Entry log, Journal): Ledger 1 deleted. Entries shown: L1 - E1, L1 - E2, L1 - E4, L2 - E1, L2 - E3, L3 - E7
  44. Using Apache BookKeeper as a building block (photo: Raul Hernandez, CC BY 2.0, https://flic.kr/p/aSwTKT)
  45. Guarantees. If an entry has been acknowledged, it must be readable. If an entry is read once, it must always be readable
  46. API. BookKeeper: createLedger, openLedger, deleteLedger. LedgerHandle: addEntry, readEntry, close. Asynchronous with callbacks: asyncCreateLedger, asyncOpenLedger, asyncDeleteLedger, asyncAddEntry, asyncReadEntry, asyncClose
  47. Tech stack
      ● Java
      ● Netty
      ● ZooKeeper
  48. Performance considerations: I/O bound
      - disk IOPS: ~ 120/s HDD, 500 000/s SSD
      - network: 1Gb/s ~ 100MB/s max, or less in practice; ~ 1KB msgs: 100 000/s per node
  49. Public use cases
      ● Hadoop namenode (Huawei)
      ● WAL (HubSpot)
      ● Hedwig (open source)
      ● PNUTS cross-colo replication (Yahoo)
      ● Push notifications (Yahoo)
      ● Cloud messaging (Yahoo)
  50. Primary backup (BookKeeper). Active: write. Standby: tail, apply ops, build backup state. Reads from open ledger; asks for current Last Add Confirmed from bookies
  51. Data store WAL: datastore, BookKeeper library, bookies
  52. Data structure, BookKeeper library, bookies: durability for arbitrary (distributed) data structures!
  53. Elasticity: BookKeeper library, bookies
  54. Elasticity: BookKeeper library, bookies
  55. Shared log infrastructure: Application A, Application B, System C; BookKeeper library, bookies
  56. http://zookeeper.apache.org/bookkeeper/ https://github.com/apache/bookkeeper Matthieu Morel: mmorel@apache.org, Ivan Kelly: ivank@apache.org
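
The "Reliable writes" and "Last Add Confirmed" slides describe an acknowledgement rule: an entry is ACKed only once it has been accepted by a quorum and all previous entries have been acknowledged too. The following is a minimal, self-contained sketch of that rule, not BookKeeper's actual client code. The class `QuorumAckSketch`, its methods, and the bookie names are hypothetical; starting entry ids at 0 and using -1 for "nothing confirmed yet" are assumptions made for the illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the ACK rule from the slides; not part of BookKeeper. */
final class QuorumAckSketch {
    private final int ackQuorum;                                  // replicas that must store an entry
    private final Map<Long, Set<String>> acks = new HashMap<>();  // entry id -> bookies that stored it
    private long lastAddConfirmed = -1;                           // assumption: -1 means nothing confirmed yet

    QuorumAckSketch(int ackQuorum) {
        if (ackQuorum < 1) {
            throw new IllegalArgumentException("ackQuorum must be >= 1");
        }
        this.ackQuorum = ackQuorum;
    }

    /** Slide 18: f+1 replicas are needed to survive f concurrent failures. */
    static int replicasFor(int concurrentFailures) {
        return concurrentFailures + 1;
    }

    /** Records that a bookie has stored an entry and returns the updated last add confirmed. */
    long onBookieAck(long entryId, String bookie) {
        if (entryId <= lastAddConfirmed) {
            return lastAddConfirmed;  // late ack for an entry that is already confirmed
        }
        acks.computeIfAbsent(entryId, id -> new HashSet<>()).add(bookie);
        // Advance only over a contiguous prefix: an entry is confirmed when it has a
        // quorum AND every previous entry is already confirmed.
        long next = lastAddConfirmed + 1;
        while (hasQuorum(next)) {
            acks.remove(next);
            lastAddConfirmed = next;
            next++;
        }
        return lastAddConfirmed;
    }

    private boolean hasQuorum(long entryId) {
        Set<String> bookies = acks.get(entryId);
        return bookies != null && bookies.size() >= ackQuorum;
    }

    /** True once the writer may ACK this entry to the application. */
    boolean isAcknowledged(long entryId) {
        return entryId <= lastAddConfirmed;
    }

    long lastAddConfirmed() {
        return lastAddConfirmed;
    }

    public static void main(String[] args) {
        QuorumAckSketch tracker = new QuorumAckSketch(2);
        tracker.onBookieAck(1, "bookie-a");
        tracker.onBookieAck(1, "bookie-b");               // entry 1 has a quorum, entry 0 does not
        System.out.println(tracker.lastAddConfirmed());   // prints -1
        tracker.onBookieAck(0, "bookie-a");
        tracker.onBookieAck(0, "bookie-c");               // entry 0 reaches quorum, unblocking entry 1
        System.out.println(tracker.lastAddConfirmed());   // prints 1
    }
}
```

In the usage above, entry 1 reaches its quorum first but stays unacknowledged until entry 0 does as well. This is the sense in which a reader that trusts the last add confirmed never observes a partially written entry.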

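The network figures on the "Performance considerations" slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 Gb/s = 10^9 bits/s, 1 KB = 1000 bytes) and takes the slide's ~100 MB/s as the usable rate; the class and method names are hypothetical.

```java
/** Hypothetical helper for the slide's network arithmetic; decimal units are assumed. */
final class ThroughputEstimate {
    /** Theoretical payload ceiling of a link, ignoring all protocol overhead. */
    static long bytesPerSecond(long bitsPerSecond) {
        return bitsPerSecond / 8;
    }

    /** Messages per second that fit in a given usable byte rate. */
    static long messagesPerSecond(long usableBytesPerSecond, long messageBytes) {
        return usableBytesPerSecond / messageBytes;
    }

    public static void main(String[] args) {
        long ceiling = bytesPerSecond(1_000_000_000L);          // 1 Gb/s link
        System.out.println(ceiling);                            // prints 125000000
        long usable = 100_000_000L;                             // the slide's ~100 MB/s
        System.out.println(messagesPerSecond(usable, 1_000L));  // prints 100000
    }
}
```

The raw ceiling of a 1 Gb/s link is 125 MB/s, so the slide's ~100 MB/s already allows for overhead, and at ~1 KB per message this gives the quoted 100 000 messages per second per node.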