"Event streaming platforms like Kafka have traditionally leaned on ZooKeeper as the cornerstone for coordination and metadata management. This presentation introduces Oxia, a compelling alternative solution.
Hailing from the labs of StreamNative, Oxia brings forth a genuinely horizontally scalable metadata framework. It empowers distributed messaging systems to seamlessly handle hundreds of millions of topics, all while removing the intricacies and operational burdens associated with ZooKeeper.
The transformative potential of Oxia extends to developers' messaging strategies and application architectures. It holds the promise of simplifying both, marking a significant evolution in the event streaming landscape."
2. streamnative.io
David Kjerrumgaard
Developer Advocate
● Apache Pulsar Committer
● Former Principal Software Engineer on Splunk’s messaging team, responsible for Splunk’s internal Pulsar-as-a-Service platform.
● Former Director of Solution Architecture at Streamlio.
● Global Practice Director of Professional Services at Hortonworks.
3. Distributed Coordination
Horizontally scalable messaging platforms are composed of multiple processes running on different machines.
These distributed resources must be able to discover one another and understand who is serving a particular resource.
• Service discovery
• Leader election
• Operations on distributed locks
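These primitives all reduce to simple operations on a consistent metadata store. As a rough illustration, the sketch below (in Go) implements leader election as an atomic conditional write on a key: only one candidate can create the election key, so only one wins. The kvStore type, the PutIfVersion semantics, and the key names are illustrative stand-ins, not the actual API of ZooKeeper or Oxia.

// Hypothetical sketch: leader election built on a versioned key-value store.
// kvStore is a stand-in for a coordination service, not a real client API.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type kvStore struct {
	mu       sync.Mutex
	values   map[string]string
	versions map[string]int64
}

func newKVStore() *kvStore {
	return &kvStore{values: map[string]string{}, versions: map[string]int64{}}
}

var errBadVersion = errors.New("version mismatch")

// PutIfVersion succeeds only if the key's current version matches expected
// (-1 means "the key must not exist yet"). This conditional write is the
// only primitive needed to build a simple election.
func (s *kvStore) PutIfVersion(key, value string, expected int64) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	current, exists := s.versions[key]
	if (!exists && expected != -1) || (exists && current != expected) {
		return 0, errBadVersion
	}
	next := current + 1
	if !exists {
		next = 0
	}
	s.values[key] = value
	s.versions[key] = next
	return next, nil
}

// tryAcquireLeadership attempts to create the election key; because the
// conditional write is atomic, at most one candidate can succeed.
func tryAcquireLeadership(s *kvStore, electionKey, candidate string) bool {
	_, err := s.PutIfVersion(electionKey, candidate, -1)
	return err == nil
}

func main() {
	store := newKVStore()
	for _, node := range []string{"broker-1", "broker-2", "broker-3"} {
		if tryAcquireLeadership(store, "/election/topic-owner", node) {
			fmt.Println("leader elected:", node)
		} else {
			fmt.Println("follower:", node)
		}
	}
}

Service discovery and distributed locks follow the same pattern: nodes register themselves under well-known keys and contend for keys with conditional writes.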
4. Metadata Storage
When building a stateful system whose purpose is to store data, it
is often necessary to retain “metadata” such as
• List of nodes in the cluster
• Location of the topic data
• Cursor offsets
• Topic policies (ACLs, TTLs, etc.)
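A minimal sketch of how this kind of metadata could be modeled as key/value pairs in a metadata store. The key scheme and JSON values below are invented for illustration and do not reflect Pulsar’s or Oxia’s actual layout.

// Hypothetical metadata layout: each piece of cluster state is a small
// JSON document stored under a hierarchical key.
package main

import "fmt"

func main() {
	metadata := map[string]string{
		"/cluster/nodes/broker-1":               `{"serviceUrl":"pulsar://broker-1:6650","state":"active"}`,
		"/topics/tenant/ns/orders/assignment":   "broker-1",
		"/topics/tenant/ns/orders/sub-A/cursor": `{"ledgerId":42,"entryId":1037}`,
		"/topics/tenant/ns/orders/policies":     `{"ttlSeconds":86400,"acl":["produce:app-x"]}`,
	}
	for key, value := range metadata {
		fmt.Printf("%-42s -> %s\n", key, value)
	}
}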
5. No Good Solution Exists
Existing solutions like ZooKeeper have major flaws:
• Horizontal Limitation: These systems are not horizontally scalable. An operator cannot add more nodes to expand the cluster’s capacity.
• Ineffective Vertical Scaling: Adding CPU and IO resources is a stop-gap that does not resolve the underlying problem.
• Inefficient Storage: Storing more than 1 GB of data is highly inefficient because of periodic snapshots. Each snapshot rewrites the same data again and again, consuming the available IO and slowing down write operations (see the sketch below).
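A back-of-the-envelope illustration of that snapshot cost, using assumed numbers (a 1 GB data set, 1 KB metadata updates, and a snapshot every 100,000 transactions): every snapshot cycle rewrites the entire data set even though only a small fraction of it has changed.

// Rough arithmetic only; all constants are assumptions, not measurements.
package main

import "fmt"

func main() {
	const (
		dataSetMB      = 1024.0 // assumed: 1 GB of existing metadata
		updateKB       = 1.0    // assumed: 1 KB per metadata update
		updatesPerSnap = 100000 // assumed: snapshot every 100,000 transactions
	)
	newDataMB := float64(updatesPerSnap) * updateKB / 1024.0
	amplification := (newDataMB + dataSetMB) / newDataMB
	fmt.Printf("new data between snapshots: %.0f MB\n", newDataMB)
	fmt.Printf("bytes written per snapshot cycle: %.0f MB\n", newDataMB+dataSetMB)
	fmt.Printf("write amplification: ~%.1fx\n", amplification)
}

With these assumed numbers, roughly 100 MB of new data triggers about 1.1 GB of writes per snapshot cycle, a write amplification of more than 11x that only grows with the size of the data set.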
6. Practical Implications of these Limitations
An inelastic metadata store results in:
• A hard upper limit of roughly one million topics.
• Increasingly degraded performance as the metadata size grows,
resulting in longer topic failover times and higher latency on
some operations.
• A significant amount of operational complexity w.r.t. hardware
sizing, tuning, and monitoring.
7. What About KRaft?
While KRaft removes the ZooKeeper dependency from Kafka, along with the associated operational complexity, it:
• Replicates the same ZooKeeper/etcd architecture without significant scalability improvements.
• Transfers the existing complexity of ZooKeeper to the Kafka brokers instead.
• Is just another implementation of a Paxos/Raft-style consensus algorithm, which is not horizontally scalable.
8. Pulsar Cluster
● “Bookies”
  ● Store messages and cursors
  ● Messages are grouped in segments/ledgers
  ● A group of bookies forms an “ensemble” to store a ledger
● “Brokers”
  ● Handle message routing and connections
  ● Stateless, but with caches
  ● Automatic load-balancing
  ● Topics are composed of multiple segments
● Metadata Storage
  ● Stores metadata for both Pulsar and BookKeeper
  ● Service discovery
[Diagram: Pulsar cluster — brokers handle client connections and store messages in bookies; a separate store provides metadata and service discovery for both.]
10. Oxia
A scalable metadata store and coordination system that can be
used as the core infrastructure to build large-scale distributed
systems.
• Takes a fresh approach to the problem space traditionally addressed by systems like Apache ZooKeeper.
• Goal is to support clusters with hundreds of millions of topics
without a lot of specialized hardware or operational skills.
• Released under Apache License, Version 2.0
11. Why Oxia
Oxia addresses the scalability and performance issues in the
systems currently used in this space, such as ZooKeeper and KRaft.
• Design optimized for Kubernetes environments
• Transparent horizontal scalability
• Linearizable per-key operations
• Able to sustain millions of read/write operations per second
• Able to store hundreds of GB of metadata
13. Oxia Coordinator
The Oxia Coordinator is responsible for:
• Performing error detection and recovery.
• Coordinating storage nodes, including leader election.
• Checkpointing the status of the data shards:
o Tracks the leader of each shard.
o Tracks which storage pods are part of each shard’s ensemble.
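A rough sketch of the kind of per-shard status the coordinator could checkpoint. The struct fields and JSON shape below are assumptions for illustration, not Oxia’s actual schema.

// Hypothetical per-shard record kept by the coordinator.
package main

import (
	"encoding/json"
	"fmt"
)

type ShardStatus struct {
	Shard    int      `json:"shard"`
	Term     int64    `json:"term"`     // bumped on every new leader election
	Leader   string   `json:"leader"`   // storage pod currently leading the shard
	Ensemble []string `json:"ensemble"` // storage pods hosting the shard
}

func main() {
	status := ShardStatus{
		Shard:    0,
		Term:     7,
		Leader:   "oxia-storage-1",
		Ensemble: []string{"oxia-storage-0", "oxia-storage-1", "oxia-storage-2"},
	}
	out, _ := json.MarshalIndent(status, "", "  ")
	fmt.Println(string(out))
}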
14. Storage Pods
Form the horizontally scalable storage layer:
• Under the direction of the Oxia coordinator.
• Used to form storage ensembles; based on the leader election outcome, each pod operates in either leader or follower mode.
• Do not perform health/liveness checks against each other; it is up to the coordinator to determine whether a storage pod is "up" or "down".
15. Data Replication
• All write and read operations happen on the leader.
• The leader logs every write operation in its local WAL and then pushes it to the followers.
• Once it receives acknowledgments from a majority of the servers in the ensemble, it applies the change to its local KV store.
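A simplified sketch of that write path, with follower transport simulated by Go channels. The types and function names are invented for illustration, and failure handling, leader terms, and retries are omitted.

// Hypothetical leader write path: append to WAL, push to followers,
// apply to the local KV store once a majority of the ensemble has acked.
package main

import "fmt"

type writeOp struct {
	key, value string
}

// replicate returns once the op is committed, i.e. acknowledged by a
// majority of the ensemble (the leader's own WAL append counts as one ack).
func replicate(op writeOp, wal *[]writeOp, kv map[string]string,
	followers []chan writeOp, acks <-chan bool, ensembleSize int) {
	*wal = append(*wal, op) // 1. append to the leader's local WAL
	for _, f := range followers {
		f <- op // 2. push the entry to each follower
	}
	ackCount := 1 // the leader counts as one acknowledgment
	needed := ensembleSize/2 + 1
	for ackCount < needed {
		if <-acks {
			ackCount++
		}
	}
	kv[op.key] = op.value // 3. apply to the local KV store after majority ack
}

func main() {
	const ensembleSize = 3
	followers := make([]chan writeOp, ensembleSize-1)
	acks := make(chan bool, ensembleSize-1)
	for i := range followers {
		followers[i] = make(chan writeOp, 1)
		go func(in chan writeOp) {
			for range in { // a follower appends to its own WAL, then acks
				acks <- true
			}
		}(followers[i])
	}

	var wal []writeOp
	kv := map[string]string{}
	replicate(writeOp{"/brokers/broker-1", "active"}, &wal, kv, followers, acks, ensembleSize)
	fmt.Println("committed:", kv)
}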
17. Horizontal Scalability
Oxia achieves horizontal scalability by sharding the dataset.
• In Oxia, each namespace is independent and is divided into shards.
• Shards are assigned to the available servers.
• When the cluster expands or shrinks, shards are reassigned.
• Shards can be split when they become too large or too busy.
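A minimal sketch of key-to-shard routing under these assumptions: hash the key and map the hash onto one of N contiguous ranges, so that a busy range can later be split. The FNV hash and the range scheme below are illustrative choices, not necessarily what Oxia uses.

// Hypothetical routing of keys to shards via contiguous hash ranges.
package main

import (
	"fmt"
	"hash/fnv"
)

// shardForKey maps a key to one of numShards contiguous hash ranges.
func shardForKey(key string, numShards uint32) uint32 {
	if numShards <= 1 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	rangeSize := (uint32(0xFFFFFFFF) / numShards) + 1
	return h.Sum32() / rangeSize
}

func main() {
	for _, key := range []string{
		"/managed-ledgers/tenant/ns/orders",
		"/managed-ledgers/tenant/ns/payments",
		"/namespace/tenant/ns/policies",
	} {
		fmt.Printf("%-40s -> shard %d\n", key, shardForKey(key, 4))
	}
}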
18. Storage Scalability
Systems like ZooKeeper have a storage limitation of roughly 2 GB, mainly because they take periodic snapshots of the entire data set.
• Oxia is designed to efficiently store metadata larger than the available RAM, on the order of hundreds of GB, across multiple shards.
• Oxia eliminates the need for periodic snapshots entirely and instead relies on its KV store to provide a consistent snapshot of the data, without having to dump the full data set every N updates.