Building reliable systems with
Apache BookKeeper
Matthieu Morel
Ivan Kelly
Challenges in distributed systems
Ryan Lintleman cc-by-nc 2.0 https://flic.kr/p/5XNGow
Expect failures
up to 10% annual failure rates for disks/servers
The network is reliable
NOT
Jonathan Briggs - cc by 2.0 https://flic.kr/p/bnAxuz
Symptoms
Alex Proimos cc-by-2.0 https://flic.kr/p/bt29wL
Problem 1: not available
Problem 1: not available
Problem 2: inconsistencies
CAP
consistency
partition tolerance
availability
zookeeper / bookkeeper
cassandra
More issues...
cc-by-2.0 https://flic.kr/p/8j57SG
Problem 3: split brain
writer A writer A writer A
writer A’
2 writers !
writer A’
single writer for A
Problem 4: failure detection
A
B
C
Problem 5: recovery
Recovery protocol?
For many systems,
we need consistent data
for recovery
Solutions !
Steven Depolo cc-by-2.0 https://flic.kr/p/9APgFF
guarantees
protocols
tools /
building blocks
techniques
Primary backup, active replication
active standby
active
active
Replication
f+1 replicas
for f concurrent
failures
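The f+1 rule can be checked with a small sketch. The function and node names below are hypothetical, and the model assumes crash failures only (a failed node simply loses its copy).

```python
def replicas_needed(f):
    """Copies required so that at least one survives f concurrent failures."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return f + 1


def surviving_copies(replicas, failed):
    """Replicas still holding the data after the given nodes have failed."""
    return set(replicas) - set(failed)


# f = 2 concurrent failures, so f + 1 = 3 replicas (node names are made up).
replicas = {"node-a", "node-b", "node-c"}
survivors = surviving_copies(replicas, {"node-a", "node-c"})
print(replicas_needed(2), survivors)
```

With one more failure than planned for, no copy would remain, which is why f has to be chosen for the worst concurrent case rather than the average one.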
Quorums
ensemble
quorum 1
quorum 2 quorum 3
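One way to read the three quorums on this slide is as the majority subsets of a three-node ensemble. The sketch below assumes majority quorums (the slide does not say how its quorums are chosen) and checks the property that makes them useful: any two quorums share at least one node. The bookie names are placeholders.

```python
from itertools import combinations


def majority_quorums(ensemble):
    """All subsets of the ensemble that contain a strict majority of nodes."""
    size = len(ensemble) // 2 + 1
    return [set(q) for q in combinations(ensemble, size)]


ensemble = ["b1", "b2", "b3"]
quorums = majority_quorums(ensemble)

# Every pair of quorums overlaps, so a read quorum always meets the last write quorum.
all_intersect = all(q1 & q2 for q1, q2 in combinations(quorums, 2))
print(quorums, all_intersect)
```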
Useful building block: ZooKeeper
● Centralized coordination service
o configuration, service discovery
o locking, queues, barriers
o failure detection
o leader election, membership
● Reliable
● Source of truth
Failure detection with ZooKeeper
ZooKeeper
Ephemeral znodes
Heartbeats
Timeouts
Triggers cluster update
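The heartbeat and timeout mechanism can be sketched without a real ZooKeeper cluster. The class below is an in-memory stand-in, not the ZooKeeper API: `MembershipTracker`, its method names, and the explicit `now` argument are all assumptions made for illustration. A member that stops sending heartbeats is dropped after the timeout, much as an ephemeral znode disappears when its session expires, and the callback plays the role of the cluster update.

```python
class MembershipTracker:
    """In-memory stand-in for ephemeral-node style liveness tracking."""

    def __init__(self, timeout_s, on_change):
        self.timeout_s = timeout_s
        self.on_change = on_change      # called with the sorted list of live members
        self.last_seen = {}             # member -> time of its last heartbeat

    def heartbeat(self, member, now):
        is_new = member not in self.last_seen
        self.last_seen[member] = now
        if is_new:
            self.on_change(sorted(self.last_seen))

    def expire(self, now):
        """Drop members whose last heartbeat is older than the timeout."""
        dead = [m for m, t in self.last_seen.items() if now - t > self.timeout_s]
        for m in dead:
            del self.last_seen[m]
        if dead:
            self.on_change(sorted(self.last_seen))
        return dead


updates = []
tracker = MembershipTracker(timeout_s=5.0, on_change=updates.append)
tracker.heartbeat("A", now=0.0)
tracker.heartbeat("B", now=0.0)
tracker.heartbeat("A", now=4.0)       # A keeps its session alive, B goes silent
expired = tracker.expire(now=6.0)     # B has missed the timeout and is dropped
print(updates)
```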
Recovery: protocol
1. provision
/ acquire
new node
2. fetch
under-
replicated
data
3. rebuild
state
4. join
ensemble
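The four steps can be walked through with plain dictionaries standing in for nodes. This is a sketch under simplifying assumptions: `placement`, `stores`, and the node names are hypothetical, and a surviving replica is assumed to hold a correct copy of every entry it is listed for.

```python
def under_replicated(placement, failed_node):
    """Entries that lost a copy when failed_node went away."""
    return sorted(e for e, nodes in placement.items() if failed_node in nodes)


def recover(placement, stores, failed_node, new_node):
    stores[new_node] = {}                                      # 1. provision / acquire new node
    for entry in under_replicated(placement, failed_node):     # 2. fetch under-replicated data
        source = next(n for n in placement[entry] if n != failed_node)
        stores[new_node][entry] = stores[source][entry]        # 3. rebuild state
        placement[entry] = (placement[entry] - {failed_node}) | {new_node}  # 4. join ensemble
    stores.pop(failed_node, None)


placement = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"A", "C"}}
stores = {
    "A": {1: b"e1", 3: b"e3"},
    "B": {1: b"e1", 2: b"e2"},
    "C": {2: b"e2", 3: b"e3"},
}
recover(placement, stores, failed_node="B", new_node="D")
print(placement, stores["D"])
```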
Requirement: durability
sync mutations
to storage
append-only
journal
Journaling:
persisting
mutations
Imaging:
persisting the state
when it changes
Concretely
● Write-Ahead Logging
● Databases - data stores
● Durable Messaging
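A minimal write-ahead log, assuming a single local file and string keys and values without `=` or newlines. `JournaledStore` is a hypothetical name; the point is only the ordering: append the mutation to the journal, fsync, then apply it in memory, so that a restart can rebuild the state by replay.

```python
import os
import tempfile


class JournaledStore:
    """Toy key-value store that logs each mutation before applying it."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            self._replay()
        self.log = open(path, "a", encoding="utf-8")

    def _replay(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("=", 1)
                self.state[key] = value

    def put(self, key, value):
        self.log.write(f"{key}={value}\n")   # journal the mutation first
        self.log.flush()
        os.fsync(self.log.fileno())          # force it to storage before applying
        self.state[key] = value

    def close(self):
        self.log.close()


path = os.path.join(tempfile.mkdtemp(), "journal.log")
store = JournaledStore(path)
store.put("a", "1")
store.put("a", "2")
store.close()

recovered = JournaledStore(path)   # simulates a restart: state comes from the journal
recovered_state = dict(recovered.state)
recovered.close()
print(recovered_state)
```

The journal here lives on one disk, so it survives a process crash but not the loss of the machine.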
Enter Apache BookKeeper
Reliable distributed logging
CC-BY-2.0 https://flic.kr/p/dSHr87
BookKeeper : durability service
durability
replication consistency
on commodity hardware
recovery
user library
A building block for reliable systems
The ledger abstraction
op op op op op op op op op op op op op op op op op
add
read
checkpoint
Ledger 1
Ledger 2
Ledger 3
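The ledger can be pictured as an append-only sequence with sequential entry ids that becomes read-only once closed. The class below is an in-memory illustration with hypothetical names, not the BookKeeper client API, and it leaves the checkpoint step out.

```python
class Ledger:
    """Append-only sequence of entries; read-only once closed."""

    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self.entries = []
        self.closed = False

    def add(self, data):
        if self.closed:
            raise RuntimeError("ledger is closed")
        self.entries.append(data)
        return len(self.entries) - 1      # entry ids are assigned sequentially

    def read(self, first, last):
        """Entries with ids first..last, both inclusive."""
        return self.entries[first:last + 1]

    def close(self):
        self.closed = True


ledger = Ledger(1)
ids = [ledger.add(op) for op in (b"op1", b"op2", b"op3")]
ledger.close()
print(ids, ledger.read(1, 2))
```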
Guarantees
If an entry
has been acknowledged,
it must be readable
If an entry
is read once,
it must always be readable
History
Initial use case : Hadoop name node recovery
2008: open sourced contrib of ZooKeeper
2011: sub-project of ZooKeeper
2012: Production
Community
Committers from:
● Yahoo!
● Twitter
● Microsoft
● Huawei
● Facebook
Inside of Apache BookKeeper
CC-BY-2.0 https://flic.kr/p/agpLTR
ledger
ledger
Architecture
bookie
zookeeper
bookie
bookie
client library
client system
ledger
ledger entry
index
metadata store
Write path
ledger
ledger
bookie
bookie
bookie
client library
ledger
ledger entry
replication
+
striping
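Replication plus striping can be sketched as a round-robin placement: each entry goes to a fixed number of bookies, and the starting bookie rotates with the entry id. The function name, the `write_quorum` parameter, and the bookie names are assumptions for illustration; the slide does not spell out the exact placement rule.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store a given entry under round-robin striping."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


ensemble = ["bookie1", "bookie2", "bookie3"]
striped = {e: write_set(e, ensemble, write_quorum=2) for e in range(4)}
for entry_id, targets in striped.items():
    print(entry_id, targets)
```

Each entry has two copies (replication), while consecutive entries land on different pairs of bookies (striping), which spreads the write load across the ensemble.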
Reliable writes
● store digest along with entry
● fsync each entry before
returning
● ACK when:
○ all previous
entries
○ this entry accepted
by quorum
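The digest and the in-order acknowledgement rule can be sketched as follows. CRC32 is an assumption chosen for illustration, as are the names `with_digest`, `verify`, and `AckTracker`. The tracker acknowledges entry N only once N has reached its quorum and every earlier entry has been acknowledged too.

```python
import zlib


def with_digest(payload):
    """Prefix the payload with a 4-byte CRC32 so corruption is detectable."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def verify(stored):
    digest, payload = stored[:4], stored[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != digest:
        raise ValueError("digest mismatch")
    return payload


class AckTracker:
    """Acknowledge entry N only when N and all of 0..N-1 reached the quorum."""

    def __init__(self, ack_quorum):
        self.ack_quorum = ack_quorum
        self.responses = {}          # entry id -> set of bookies that stored it
        self.last_acked = -1

    def on_bookie_response(self, entry_id, bookie):
        self.responses.setdefault(entry_id, set()).add(bookie)
        newly_acked = []
        nxt = self.last_acked + 1
        while len(self.responses.get(nxt, ())) >= self.ack_quorum:
            newly_acked.append(nxt)
            self.last_acked = nxt
            nxt += 1
        return newly_acked


tracker = AckTracker(ack_quorum=2)
r1 = tracker.on_bookie_response(1, "b1")
r2 = tracker.on_bookie_response(1, "b2")   # entry 1 has a quorum, but entry 0 is pending
r3 = tracker.on_bookie_response(0, "b1")
r4 = tracker.on_bookie_response(0, "b3")   # entry 0 completes, which releases entry 1 too
print(r1, r2, r3, r4)
```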
Read path
bookie
bookie
bookie
read(ledger X entry Y)
Partial writes
bookie
bookie
bookie
read(ledger X entry Y)
no quorum
for this
entry!
Read would be inconsistent
Last Add Confirmed
Consensus on written entries
bookie
bookie
bookie
read(ledger X entry Y)
Zookeeper
close(ledger X)
Ledger X, LAC
Recovery of a ledger
bookie
bookie
bookie
zookeeper
What is the last
entry?
piggy-back
last add confirmed
Fencing: prevent multiple brains
writer writer
writer writer
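Fencing can be illustrated with a toy bookie that refuses writes to a ledger once it has been fenced. The `Bookie` class and its methods are hypothetical stand-ins, not the real bookie protocol: the new writer fences the ledger on the bookies before taking over, so a late write from the old writer is rejected instead of creating a second brain.

```python
class Bookie:
    def __init__(self):
        self.entries = {}        # (ledger_id, entry_id) -> data
        self.fenced = set()      # ledgers that no longer accept writes

    def fence(self, ledger_id):
        self.fenced.add(ledger_id)

    def add_entry(self, ledger_id, entry_id, data):
        if ledger_id in self.fenced:
            raise PermissionError(f"ledger {ledger_id} is fenced")
        self.entries[(ledger_id, entry_id)] = data


bookies = [Bookie() for _ in range(3)]
for b in bookies:
    b.add_entry(7, 0, b"from old writer")

# A new writer takes over: it fences the ledger before recovering it.
for b in bookies:
    b.fence(7)

try:
    bookies[0].add_entry(7, 1, b"late write from old writer")
    old_writer_rejected = False
except PermissionError:
    old_writer_rejected = True
print(old_writer_rejected)
```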
roll
Inside of a bookie
L2 - E7
L2 - E6
L2 - E3
L2 - E4
L1 - E4
L1 - E2
L2 - E1
L1 - E1
L3 - E1
...
sequential entries
interleaved physical file
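The interleaved layout can be sketched as one shared append-only log plus an index that maps a ledger and entry id to a position in that log. `EntryLog` is an in-memory illustration with assumed names; a real bookie would keep the log and index on disk.

```python
class EntryLog:
    """Single append-only log shared by all ledgers, plus an index."""

    def __init__(self):
        self.log = []      # entries from different ledgers, in arrival order
        self.index = {}    # (ledger_id, entry_id) -> position in the log

    def append(self, ledger_id, entry_id, data):
        self.index[(ledger_id, entry_id)] = len(self.log)
        self.log.append((ledger_id, entry_id, data))

    def read(self, ledger_id, entry_id):
        return self.log[self.index[(ledger_id, entry_id)]][2]


log = EntryLog()
for ledger_id, entry_id in [(3, 1), (1, 1), (2, 1), (1, 2)]:
    log.append(ledger_id, entry_id, f"L{ledger_id}-E{entry_id}".encode())
print(log.read(1, 2), [e[0] for e in log.log])
```

Writes stay sequential regardless of how many ledgers are active; the cost is that reading one ledger means jumping around the file through the index.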
Storage device
disk
fsync Sequential entries
Synchronous writes
OK for writing
Reads interfere with
writes
add ack
L2 - E3
L3 - E7
L1 - E4
L1 - E2
L2 - E1
L1 - E1
Separate read and write devices
disk 2
fsync
ack
add
L2 - E3
L3 - E7
L1 - E4
L1 - E2
L2 - E1
L1 - E1
disk 1
L2 - E3
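L3 - E7
L1 - E4
L1 - E2
L2 - E1
L1 - E1
async flush
cache
Similar rates
Durability
Read-efficient
INDEX
Ledger device
Journal device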
Garbage collection / compaction
disk 2
L2 - E3
L3 - E7
L1 - E4
L1 - E2
L2 - E1
L1 - E1
disk 1
L2 - E3
L3 - E7
L2 - E1
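L1 - E4
L1 - E2
L1 - E1
L1 - E4
L1 - E2
L2 - E1
L1 - E1
L1 - E4
L1 - E2
L2 - E1
L1 - E1
Ledger 1 deleted
L2 - E1
Entry log
Journal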
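Compaction can be sketched as rewriting the entry log while keeping only the entries of ledgers that still exist. The function name and the tuple layout are assumptions; the entry ids below are made-up sample data in the style of the slide's labels.

```python
def compact(entry_log, live_ledgers):
    """Return a new log that keeps only entries of ledgers that still exist."""
    return [(lid, eid) for lid, eid in entry_log if lid in live_ledgers]


entry_log = [(1, 1), (2, 1), (1, 2), (1, 4), (3, 7), (2, 3)]
compacted = compact(entry_log, live_ledgers={2, 3})   # ledger 1 was deleted
print(compacted)
```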
Using Apache BookKeeper
as a building block
Raul Hernandez - CC-BY-2.0 https://flic.kr/p/aSwTKT
Guarantees
If an entry
has been acknowledged,
it must be readable
If an entry
is read once,
it must always be readable
API
BookKeeper
createLedger
openLedger
deleteLedger
LedgerHandle
addEntry
readEntry
close
asyncCreateLedger
asyncOpenLedger
asyncDeleteLedger
asyncAddEntry
asyncReadEntry
asyncClose
Asynchronous with callbacks
Tech stack
● Java
● Netty
● ZooKeeper
Performance considerations
I/O bound
- disk IOPS: ~ 120/s HDD, 500 000/s SSD
- network: 1Gb/s ~ 100MB/s max or less in practice
~ 1KB msgs: 100 000/s per node
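The network figure can be reproduced with back-of-the-envelope arithmetic. The sketch assumes decimal units (1 MB = 1 000 000 bytes) and messages of exactly 1 000 bytes. The last line is an inference rather than something the slide states: it assumes each synchronous disk write costs one I/O operation.

```python
link_bits_per_s = 1_000_000_000                          # 1 Gb/s link
theoretical_mb_per_s = link_bits_per_s / 8 / 1_000_000   # 125 MB/s before any overhead
usable_mb_per_s = 100                                    # practical ceiling quoted on the slide
message_bytes = 1_000                                    # ~1 KB messages

messages_per_s = usable_mb_per_s * 1_000_000 // message_bytes

hdd_iops = 120
# Entries that would have to share one synchronous write to keep up on an HDD.
entries_per_sync = messages_per_s / hdd_iops

print(theoretical_mb_per_s, messages_per_s, round(entries_per_sync))
```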
Public use cases
● Hadoop namenode (Huawei)
● WAL (HubSpot)
● Hedwig (open source)
● PNUTS cross-colo replication (Yahoo)
● Push notifications (Yahoo)
● Cloud messaging (Yahoo)
Primary backup
BookKeeper
active standby
write tail
apply ops
build backup state
Reads from open ledger
Asks for current
Last Add Confirmed
from bookies
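The standby's side of this pattern can be sketched as a loop that applies ledger entries in order, but never past the last add confirmed. `Standby`, the list standing in for the open ledger, and the key-value operations are all hypothetical.

```python
class Standby:
    """Rebuilds the primary's state by applying confirmed entries in order."""

    def __init__(self):
        self.state = {}
        self.next_entry = 0

    def catch_up(self, ledger_entries, last_add_confirmed):
        while self.next_entry <= last_add_confirmed:
            key, value = ledger_entries[self.next_entry]
            self.state[key] = value          # apply the operation to the backup state
            self.next_entry += 1


ledger_entries = [("x", 1), ("y", 2), ("x", 3), ("z", 4)]
standby = Standby()
standby.catch_up(ledger_entries, last_add_confirmed=2)   # entry 3 is not confirmed yet
print(standby.state, standby.next_entry)
```

Stopping at the last add confirmed keeps the standby from applying an entry that might later turn out not to have reached its quorum.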
Data store WAL
Bookkeeper library
bookies
datastore
Bookkeeper library
bookies
Data structure
Durability for arbitrary (distributed)
data structures!
Elasticity
Bookkeeper library
bookies
Elasticity
Bookkeeper library
bookies
Shared log infrastructure
Bookkeeper library
bookies
Application A Application B
System C
http://zookeeper.apache.org/bookkeeper/
https://github.com/apache/bookkeeper
Matthieu Morel: mmorel
Ivan Kelly: ivank
@apache.org

A presentation at the Barcelona JUG 19/06/2014

  1. Building reliable systems with Apache BookKeeper Matthieu Morel Ivan Kelly
  2. Challenges in distributed systems
  3. Ryan Lintleman cc-by-nc 2.0 https://flic.kr/p/5XNGow Expect failures
  4. up to 10% annual failure rates for disks/servers
  5. The network is reliable NOT Jonathan Briggs - cc by 2.0 https://flic.kr/p/bnAxuz
  6. Symptoms Alex Proimos cc-by-2.0 https://flic.kr/p/bt29wL
  7. Problem 1: not available
  8. Problem 1: not available
  9. Problem 2: inconsistencies
  10. CAP consistency partition tolerance availability zookeeper / bookkeeper cassandra
  11. More issues... cc-by-2.0 https://flic.kr/p/8j57SG
  12. Problem 3: split brain writer A writer A writer A writer A’ 2 writers ! writer A’ single writer for A
  13. Problem 4: failure detection A B C
  14. Problem 5: recovery Recovery protocol? For many systems, we need consistent data for recovery
  15. Solutions ! Steven Depolo cc-by-2.0 https://flic.kr/p/9APgFF
  16. guarantees protocols tools / building blocks techniques
  17. Primary backup, active replication active standby active active
  18. Replication f+1 replicas for f concurrent failures
  19. Quorums ensemble quorum 1 quorum 2 quorum 3
  20. Useful building block: ZooKeeper ● Centralized coordination service o configuration, service discovery o locking, queues, barriers o failure detection o leader election, membership ● Reliable ● Source of truth
  21. Failure detection with ZooKeeper ZooKeeper Ephemeral znodes Heartbeats Timeouts Triggers cluster update
  22. Recovery: protocol 1. provision / acquire new node 2. fetch under-replicated data 3. rebuild state 4. join ensemble
  23. Requirement: durability sync mutations to storage append-only journal Journaling: persisting mutations Imaging: persisting the state when it changes
  24. Concretely ● Write-Ahead Logging ● Databases - data stores ● Durable Messaging
  25. Enter Apache BookKeeper Reliable distributed logging CC-BY-2.0 https://flic.kr/p/dSHr87
  26. BookKeeper : durability service durability replication consistency on commodity hardware recovery user library A building block for reliable systems
  27. The ledger abstraction op op op op op op op op op op op op op op op op op add read checkpoint Ledger 1 Ledger 2 Ledger 3
  28. Guarantees If an entry has been acknowledged, it must be readable If an entry is read once, it must always be readable
  29. History Initial use case : Hadoop name node recovery 2008: open sourced contrib of ZooKeeper 2011: sub-project of ZooKeeper 2012: Production
  30. Community Committers from: ● Yahoo! ● Twitter ● Microsoft ● Huawei ● Facebook
  31. Inside of Apache BookKeeper CC-BY-2.0 https://flic.kr/p/agpLTR
  32. ledger ledger Architecture bookie zookeeper bookie bookie client library client system ledger ledger entry index metadata store
  33. Write path ledger ledger bookie bookie bookie client library ledger ledger entry replication + striping
  34. Reliable writes ● store digest along with entry ● fsync each entry before returning ● ACK when: ○ all previous entries ○ this entry accepted by quorum
  35. Read path bookie bookie bookie read(ledger X entry Y)
  36. Partial writes bookie bookie bookie read(ledger X entry Y) no quorum for this entry! Read would be inconsistent
  37. Last Add Confirmed Consensus on written entries bookie bookie bookie read(ledger X entry Y) Zookeeper close(ledger X) Ledger X, LAC
  38. Recovery of a ledger bookie bookie bookie zookeeper What is the last entry? piggy-back last add confirmed
  39. Fencing: prevent multiple brains writer writer writer writer
  40. roll Inside of a bookie L2 - E7 L2 - E6 L2 - E3 L2 - E4 L1 - E4 L1 - E2 L2 - E1 L1 - E1 L3 - E1 ... sequential entries interleaved physical file
  41. Storage device disk fsync Sequential entries Synchronous writes OK for writing Reads interfere with writes add ack L2 - E3 L3 - E7 L1 - E4 L1 - E2 L2 - E1 L1 - E1
  42. Separate read and write devices disk 2 fsync ack add L2 - E3 L3 - E7 L1 - E4 L1 - E2 L2 - E1 L1 - E1 disk 1 L2 - E3 L3 - E7 L1 - E4 L1 - E2 L2 - E1 L1 - E1 async flush cache Similar rates Durability Read-efficient INDEX Ledger device Journal device
  43. Garbage collection / compaction disk 2 L2 - E3 L3 - E7 L1 - E4 L1 - E2 L2 - E1 L1 - E1 disk 1 L2 - E3 L3 - E7 L2 - E1 L1 - E4 L1 - E2 L1 - E1 L1 - E4 L1 - E2 L2 - E1 L1 - E1 L1 - E4 L1 - E2 L2 - E1 L1 - E1 Ledger 1 deleted L2 - E1 Entry log Journal
  44. Using Apache BookKeeper as a building block Raul Hernandez - CC-BY-2.0 https://flic.kr/p/aSwTKT
  45. Guarantees If an entry has been acknowledged, it must be readable If an entry is read once, it must always be readable
  46. API BookKeeper createLedger openLedger deleteLedger LedgerHandle addEntry readEntry close asyncCreateLedger asyncOpenLedger asyncDeleteLedger asyncAddEntry asyncReadEntry asyncClose Asynchronous with callbacks
  47. Tech stack ● Java ● Netty ● ZooKeeper
  48. Performance considerations I/O bound - disk IOPS: ~ 120/s HDD, 500 000/s SSD - network: 1Gb/s ~ 100MB/s max or less in practice ~ 1KB msgs: 100 000/s per node
  49. Public use cases ● Hadoop namenode (Huawei) ● WAL (HubSpot) ● Hedwig (open source) ● PNUTS cross-colo replication (Yahoo) ● Push notifications (Yahoo) ● Cloud messaging (Yahoo)
  50. Primary backup BookKeeper active standby write tail apply ops build backup state Reads from open ledger Asks for current Last Add Confirmed from bookies
  51. Data store WAL Bookkeeper library bookies datastore
  52. Bookkeeper library bookies Data structure Durability for arbitrary (distributed) data structures!
  53. Elasticity Bookkeeper library bookies
  54. Elasticity Bookkeeper library bookies
  55. Shared log infrastructure Bookkeeper library bookies Application A Application B System C
  56. http://zookeeper.apache.org/bookkeeper/ https://github.com/apache/bookkeeper Matthieu Morel: mmorel Ivan Kelly: ivank @apache.org
