Apache BookKeeper and Apache ZooKeeper for Apache Pulsar
Enrico Olivelli
DataStax - Luna Streaming Team
Member of Apache Pulsar, Apache BookKeeper and Apache ZooKeeper PMC,
Apache Curator VP
Agenda
● Introduction to Apache Pulsar architecture
● Overview of Apache ZooKeeper
● Overview of Apache BookKeeper
● ManagedLedger: Pulsar and BookKeeper
● Handling Failures while guaranteeing Consistency
● Live Demo with BKVM (BookKeeper Visual Manager)
Apache Pulsar - Core Concepts
- Topic:
- Sequence of Messages
- Persistent/Non-Persistent
- Partitioned/Non-Partitioned
- Tenant and Namespace:
- Logical and physical isolation of resources
- Fine grained configuration (topic/namespace/tenant/system levels)
- Subscription:
- A cursor over a topic (tracks status of acknowledgements)
- Types: Exclusive, Failover, Shared, Key_Shared
- Modes: Durable, Non-Durable
- Producer:
- Normal, Exclusive
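The idea of a subscription as a cursor that tracks acknowledgements can be sketched with a toy model. This is only an illustration, not Pulsar code: `ToyCursor`, `mark_delete`, and `acked_ahead` are hypothetical names, and entries are assumed to be numbered 0, 1, 2, ... within a single stream.

```python
class ToyCursor:
    """Tracks acknowledgements for one subscription over entries 0, 1, 2, ..."""

    def __init__(self):
        self.mark_delete = -1      # every entry up to and including this one is acked
        self.acked_ahead = set()   # acked entries beyond mark_delete

    def ack(self, entry_id):
        if entry_id <= self.mark_delete:
            return                 # already acknowledged
        self.acked_ahead.add(entry_id)
        # Advance over any contiguous run of acknowledged entries.
        while self.mark_delete + 1 in self.acked_ahead:
            self.mark_delete += 1
            self.acked_ahead.remove(self.mark_delete)

    def backlog(self, last_entry_id):
        """Entries not yet acknowledged, up to last_entry_id inclusive."""
        return [e for e in range(self.mark_delete + 1, last_entry_id + 1)
                if e not in self.acked_ahead]


cursor = ToyCursor()
for entry_id in (0, 2, 3):         # entry 1 is still unacknowledged
    cursor.ack(entry_id)
position_before = cursor.mark_delete
cursor.ack(1)                      # fills the gap
position_after = cursor.mark_delete
```

With a Shared subscription, acknowledgements can arrive out of order, which is why the sketch keeps a set of individually acknowledged entries next to the single position.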
Apache ZooKeeper
- Born at Yahoo! and donated to the Apache Software Foundation
- Offers primitives for distributed systems coordination
- Implements a filesystem-like structure
- znodes are like directories and files
- Easy to understand
- No need for shared disks
- Strict ordering of operations
- Leader node + Followers (ZAB protocol)
- Enforced in the client
- Sessions
- Explicit notion of “lost connection”
- Heartbeat based expiration
- Ephemeral nodes
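Sessions, heartbeat-based expiration, and ephemeral nodes can be illustrated with a toy in-memory model. This is not the ZooKeeper API: `ToyCoordinator`, `heartbeat`, `expire_sessions`, and the `/brokers/...` paths are hypothetical, and time is passed in explicitly so the sketch stays deterministic.

```python
class ToyCoordinator:
    """Toy in-memory model of sessions and ephemeral nodes (not the ZooKeeper API)."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}   # session id -> time of last heartbeat
        self.ephemerals = {}       # znode path -> owning session id

    def heartbeat(self, session_id, now):
        # Opening a session and keeping it alive are the same operation here.
        self.last_heartbeat[session_id] = now

    def create_ephemeral(self, session_id, path):
        if session_id not in self.last_heartbeat:
            raise RuntimeError("session expired or unknown")
        if path in self.ephemerals:
            raise RuntimeError("node already exists")
        self.ephemerals[path] = session_id

    def expire_sessions(self, now):
        # Sessions that missed heartbeats for longer than the timeout are expired,
        # and every ephemeral node they own is deleted with them.
        expired = {s for s, t in self.last_heartbeat.items()
                   if now - t > self.session_timeout}
        for s in expired:
            del self.last_heartbeat[s]
        self.ephemerals = {p: s for p, s in self.ephemerals.items()
                           if s not in expired}
        return expired


zk = ToyCoordinator(session_timeout=10)
zk.heartbeat("broker-1", now=0)
zk.heartbeat("broker-2", now=0)
zk.create_ephemeral("broker-1", "/brokers/broker-1")
zk.create_ephemeral("broker-2", "/brokers/broker-2")

zk.heartbeat("broker-2", now=8)          # broker-1 stops sending heartbeats
expired = zk.expire_sessions(now=12)     # broker-1 is past the timeout, broker-2 is not
live_brokers = sorted(zk.ephemerals)
```

Because the ephemeral node disappears together with the session, the remaining nodes double as a list of live participants.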
Apache ZooKeeper in Apache Pulsar
- Service Discovery
- Leader Election
- Metadata Management
- Configuration Management
- Used by Apache BookKeeper
[Diagram: three Brokers and three Bookies, all connected to a three-node ZooKeeper ensemble]
Apache ZooKeeper - Conditional Writes (CW)
- Every znode has a version, a (small) content and possibly (few) children
setData(content, expectedVersion)
- Basic building block to ensure consistency
- Only the owner can update the znode
- Version conflict -> fail, assume you are no longer the owner
- Successful write -> prevents others from performing the same write (the version is incremented automatically)
- Only one Broker can make progress at a time while working on metadata
This is not enough to ensure the overall consistency of the system!
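The conditional write described above can be sketched as a compare-and-set on the znode version. The sketch below is a toy in-memory model, not the ZooKeeper client: `ToyZnode`, `BadVersionError`, and the `ledgers=[...]` content format are assumptions made for illustration.

```python
class BadVersionError(Exception):
    """Raised when the expected version does not match the stored version."""


class ToyZnode:
    """Toy znode holding a small content and a version (not the ZooKeeper API)."""

    def __init__(self, content=b""):
        self.content = content
        self.version = 0

    def set_data(self, content, expected_version):
        # The write succeeds only if nobody else updated the znode in between.
        if expected_version != self.version:
            raise BadVersionError(
                f"expected {expected_version}, found {self.version}")
        self.content = content
        self.version += 1      # incremented automatically on every successful write
        return self.version


znode = ToyZnode(b"ledgers=[123]")

# Both brokers read the znode at version 0 and believe they own it.
seen_by_a = znode.version
seen_by_b = znode.version

new_version = znode.set_data(b"ledgers=[123,124]", seen_by_a)   # broker A wins

try:
    znode.set_data(b"ledgers=[123,137]", seen_by_b)             # broker B is stale
    b_lost_ownership = False
except BadVersionError:
    b_lost_ownership = True
```

Broker B learns that it lost ownership only when it next touches the metadata, which is why this mechanism alone does not protect writes that never go through ZooKeeper.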
Apache BookKeeper
- Born at Yahoo! and donated to the Apache Software Foundation
- Subproject of ZooKeeper, then graduated as TLP
- Implements a high performance distributed storage system
- Thick Java Client
- Bookie server: storage only
- Horizontally scalable
- Write/Read paths isolation
- Durability (journal/fsync)
- Replication
- Advanced placement policies
The Broker - the Heart of Pulsar
Each Broker is the Owner for a given set of topic bundles:
- Handles reads/writes
- Redirects requests for non-owned bundles to other brokers
- Handles subscriptions, consumers and producers status
- Keeps non-persistent topics data in memory
- Manages Schemas
- Handles cluster wide requests
The Broker uses Apache BookKeeper to store:
- Messages
- Subscriptions (acks)
- Schema
- Code packages (new in 2.8)
The Broker - Data flow when a message is produced
The Broker receives a request to publish a message:
● Verifies topic ownership
● Verifies authorization
● Locates the ManagedLedger instance
● Passes the encoded entry (single message or a batch) to ManagedLedger
● ManagedLedger passes the entry to the active Ledger WriteHandle
● The BK client sends the entry in parallel to the Bookies
● Waits for acknowledgement from the Bookies
● Acknowledges the write back to the Pulsar client
● Now the Message ID is available to the client (LedgerID-EntryID...)
[Diagram: Producer -> Broker (ManagedLedger) -> three Bookies]
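The publish steps can be sketched end to end with toy classes. None of this is Pulsar or BookKeeper code: `ToyBroker`, `ToyManagedLedger`, `ToyBookie`, the `ack_quorum` parameter, and the role name `app` are assumptions for illustration, and the sends happen in a loop rather than in parallel.

```python
class ToyBookie:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.entries = {}      # (ledger_id, entry_id) -> payload

    def add_entry(self, ledger_id, entry_id, payload):
        if not self.healthy:
            return False       # no acknowledgement
        self.entries[(ledger_id, entry_id)] = payload
        return True


class ToyManagedLedger:
    def __init__(self, ledger_id, bookies, ack_quorum):
        self.ledger_id = ledger_id
        self.bookies = bookies
        self.ack_quorum = ack_quorum
        self.next_entry_id = 0

    def append(self, payload):
        entry_id = self.next_entry_id
        # The real client sends in parallel; a loop is enough for the sketch.
        acks = sum(b.add_entry(self.ledger_id, entry_id, payload)
                   for b in self.bookies)
        if acks < self.ack_quorum:
            raise IOError("not enough bookies acknowledged the entry")
        self.next_entry_id += 1
        return (self.ledger_id, entry_id)   # the basis of the message ID


class ToyBroker:
    def __init__(self):
        self.owned = {}            # topic -> ToyManagedLedger
        self.allowed = set()       # (role, topic) pairs allowed to produce

    def publish(self, role, topic, payload):
        if topic not in self.owned:
            raise LookupError("topic not owned by this broker")
        if (role, topic) not in self.allowed:
            raise PermissionError("role may not produce to this topic")
        return self.owned[topic].append(payload)


topic = "persistent://public/default/test"
bookies = [ToyBookie("bk1"), ToyBookie("bk2"), ToyBookie("bk3", healthy=False)]
broker = ToyBroker()
broker.owned[topic] = ToyManagedLedger(ledger_id=123, bookies=bookies, ack_quorum=2)
broker.allowed.add(("app", topic))

first_id = broker.publish("app", topic, b"hello")
second_id = broker.publish("app", topic, b"world")
```

In this sketch the write still succeeds with one unhealthy Bookie because two acknowledgements satisfy the assumed quorum of two.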
The Bookie - When the message is persisted
The Write path and the Read path are separated inside the Bookie.
Write path:
- The Bookie receives a copy of the entry
- The entry is written to the Journal
- The journal acknowledges the write after a successful fsync
- Entries are grouped in order to reduce the number of fsyncs
- The Bookie acknowledges the operation to the Client
The BookKeeper Client is responsible for:
- Selecting the Bookies (zone/region awareness)
- Waiting for confirmation
- Retransmissions
- Computing a checksum of the raw payload
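Grouping journal writes so that one fsync covers several entries can be sketched as follows. This is a simplified model, not the Bookie implementation: `ToyJournal` and `group_size` are hypothetical, the fsync is only counted rather than performed, and CRC32 is an assumed digest choice for the client-side checksum.

```python
import zlib


class ToyJournal:
    """Buffers entries and makes them durable with one (simulated) fsync per group."""

    def __init__(self, group_size):
        self.group_size = group_size
        self.pending = []          # (entry, callback) waiting for the next fsync
        self.durable = []          # entries that survived an fsync
        self.fsync_count = 0

    def write(self, entry, on_durable):
        self.pending.append((entry, on_durable))
        if len(self.pending) >= self.group_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.fsync_count += 1      # one fsync covers the whole group
        group, self.pending = self.pending, []
        for entry, on_durable in group:
            self.durable.append(entry)
            on_durable(entry)      # only now is the client acknowledged


def with_checksum(payload):
    # Client side: attach a CRC32 of the raw payload (digest choice is an assumption).
    return (zlib.crc32(payload), payload)


acked = []
journal = ToyJournal(group_size=3)
for payload in [b"a", b"b", b"c", b"d"]:
    journal.write(with_checksum(payload), acked.append)
journal.flush()                    # flush the last, partially filled group
```

Four entries become durable with two fsyncs here. The trade-off is latency: an entry waits for its group before the acknowledgement is sent.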
The Broker - ManagedLedger abstraction
BookKeeper relies on ZooKeeper CW features to guarantee
consistency of metadata
The Pulsar ManagedLedger is an abstraction over the BookKeeper
Ledger:
- Implements an infinite append-only stream of entries
- Concatenates BK ledgers (metadata only)
- Implements Cursors (support for durable subscriptions)
- Implements Tiered Storage
[Diagram: topic persistent://public/default/test backed by a sequence of ledgers: Ledger 123, Ledger 124, Ledger 137, Ledger 156, Ledger 168]
BookKeeper Ledger:
a write-once, append-only sequence of entries (byte[])
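How a list of ledgers can behave like one unbounded stream is sketched below. Only metadata is modelled: `ToyManagedLedgerIndex`, `roll_over`, and `locate` are hypothetical names, and the entry counts per ledger are made up for the example.

```python
class ToyManagedLedgerIndex:
    """Ordered list of ledgers that together form one append-only stream."""

    def __init__(self):
        self.ledgers = []   # [ledger_id, entry_count] pairs, oldest first

    def roll_over(self, ledger_id):
        # The previous ledger is never written again; appends go to the new one.
        self.ledgers.append([ledger_id, 0])

    def record_append(self):
        current = self.ledgers[-1]
        entry_id = current[1]
        current[1] += 1
        return (current[0], entry_id)

    def locate(self, n):
        """Return (ledger_id, entry_id) of the n-th entry of the stream, 0-based."""
        if n < 0:
            raise IndexError("negative position")
        for ledger_id, count in self.ledgers:
            if n < count:
                return (ledger_id, n)
            n -= count
        raise IndexError("beyond the end of the stream")


index = ToyManagedLedgerIndex()
index.roll_over(123)
for _ in range(3):
    index.record_append()
index.roll_over(124)
for _ in range(2):
    index.record_append()
index.roll_over(137)
last_position = index.record_append()

fourth_entry = index.locate(3)     # first entry of the second ledger
```

Rolling over to a new ledger changes only this list, which is the part protected by conditional writes.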
Handling Failures and ensuring Consistency
Failures on Broker:
- Network error/partition
- Overwhelmed Broker (Garbage collection, out of memory/CPU)
- Shutdown (or forced Bundle unload)
- ….
A new Broker becomes the Owner for the Topic (ManagedLedger)
- Perform recovery on the current BK ledger
- Create a new Ledger on BK
- Append the new Ledger ID to the list of Ledgers
- Serve write requests (verifying on each operation that it is still the owner!)
More than one broker may start this recovery process!
ZooKeeper CW covers metadata operations,
but it does not help in the hot write path
BookKeeper Fencing and Recovery
- The new Broker opens the ledger in Recovery mode
- The BookKeeper Client reads every entry from the Bookies:
- Discovers the max valid entry ID
- Sets the ledger fenced flag on the Bookies (on disk)
- Writes the new status of the Ledger to ZooKeeper
- This may fail during a CW operation!
- Only one broker can perform a successful recovery!
- The old broker:
- Receives a “Ledger Fenced error” on the next write
- Receives a “Bad Version error” while writing to ZooKeeper (if trying
to append a new ledger ID)
- It may receive a Watch Notification from ZooKeeper
At every write BookKeeper ensures the ownership of the Topic
BookKeeper fencing + ZooKeeper CW guarantee consistency of
Pulsar
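The interplay between fencing and conditional writes can be sketched with a toy model. Actual BookKeeper recovery is more involved (it works with quorums and may re-replicate entries), so the code below is only an illustration: every class and function name is hypothetical, and the metadata layout is an assumption.

```python
class LedgerFencedError(Exception):
    pass


class BadVersionError(Exception):
    pass


class ToyBookie:
    def __init__(self):
        self.entries = {}          # ledger_id -> list of payloads
        self.fenced = set()        # ledger ids that no longer accept writes

    def add_entry(self, ledger_id, payload):
        if ledger_id in self.fenced:
            raise LedgerFencedError(ledger_id)
        self.entries.setdefault(ledger_id, []).append(payload)

    def fence_and_read_last(self, ledger_id):
        self.fenced.add(ledger_id)             # persisted on disk in a real bookie
        return len(self.entries.get(ledger_id, [])) - 1


class ToyMetadataStore:
    """Versioned ledger metadata, updated only through conditional writes."""

    def __init__(self):
        self.state = {"status": "OPEN", "last_entry": None}
        self.version = 0

    def read(self):
        return dict(self.state), self.version

    def write(self, state, expected_version):
        if expected_version != self.version:
            raise BadVersionError()
        self.state = dict(state)
        self.version += 1


def recover(ledger_id, bookies, metadata):
    """Fence the ledger on the bookies, then close it in the metadata store."""
    _, version = metadata.read()
    last_entry = max(b.fence_and_read_last(ledger_id) for b in bookies)
    metadata.write({"status": "CLOSED", "last_entry": last_entry}, version)
    return last_entry


bookies = [ToyBookie(), ToyBookie(), ToyBookie()]
metadata = ToyMetadataStore()

# The old broker writes two entries to ledger 123.
for payload in (b"m0", b"m1"):
    for b in bookies:
        b.add_entry(123, payload)

# A second would-be owner read the metadata before the first recovery finished.
_, stale_version = metadata.read()

last_entry = recover(123, bookies, metadata)   # the new broker wins

try:
    bookies[0].add_entry(123, b"m2")           # the old broker's next write
    old_broker_fenced = False
except LedgerFencedError:
    old_broker_fenced = True

try:
    metadata.write({"status": "CLOSED", "last_entry": 1}, stale_version)
    second_recovery_failed = False
except BadVersionError:
    second_recovery_failed = True
```

The fenced flag stops the old broker on the data path, while the version check stops a competing recovery on the metadata path.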
Live Demo - Inspect a Pulsar Standalone instance
- Start Pulsar Standalone
- Use Visual Studio Code to inspect ZooKeeper contents
- Use BKVM to inspect BookKeeper contents
- Write to public/default/test
- Unload the topic
- See that the ManagedLedger created a new Ledger
Wrapping up
● ZooKeeper and BookKeeper came from Yahoo!, as did Pulsar!
● Pulsar ManagedLedger is the high level abstraction over BookKeeper.
● ZooKeeper provides support for Metadata Management, Service Discovery,
Configuration and Leader Election.
● Conditional Writes (CW) guarantee consistency for Metadata operations.
● The Fencing mechanism of BookKeeper ensures Consistency on the write path
● In no case can two brokers write concurrently to a Topic; one of them will eventually fail