Box's /events API powers our desktop sync experience and provides users with a realtime, guaranteed-delivery event stream. To do that, we use HBase to store and serve a separate message queue for each of 30+ million users. Learn how we implemented queue semantics, how we replicate our queues between clusters to enable transparent client failover, and why we chose to build a queueing system on top of HBase.
HBaseCon 2015: Events @ Box - Using HBase as a Message Queue
/events @ Box: Using HBase as a message queue
What is the /events API?
• Realtime stream of all activity happening within a user’s account
• GET /events?stream_position=1234&stream_type=all
• Persistent and re-playable
Why did we build it?
• Main use case was moving sync from batch diffs to incremental diffs
• Several requirements arose from the sync use case:
‒ Guaranteed delivery
‒ Clients can be offline for days at a time
‒ Arbitrary number of clients consuming each user’s stream
How is it implemented?
• Each user assigned a separate section of the HBase key-space
• Messages are stored in order from oldest to newest within a user’s
section of the key-space
• Reads map directly to scans from the provided position to the end of the user’s section
• Row key structure: <pseudo-random prefix>_<user_id>_<position>
‒ <pseudo-random prefix>: 2 bytes of the SHA-1 of user_id
‒ <position>: millisecond timestamp
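The row key layout above can be sketched as follows. This is a minimal illustration, not the production encoding: the hex-encoded prefix, the decimal user id, the 13-digit zero-padded timestamp, and the names `make_row_key` and `user_scan_range` are all assumptions made for the example. The start row is treated as inclusive and the stop row as exclusive, matching default HBase scan behaviour.

```python
import hashlib
from typing import Tuple


def _prefix(uid: bytes) -> bytes:
    # First 2 bytes of SHA-1(user_id), hex-encoded (the encoding is an assumption).
    return hashlib.sha1(uid).digest()[:2].hex().encode("ascii")


def make_row_key(user_id: int, position_ms: int) -> bytes:
    """Build <pseudo-random prefix>_<user_id>_<position> as bytes."""
    uid = str(user_id).encode("ascii")
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    pos = f"{position_ms:013d}".encode("ascii")
    return b"_".join([_prefix(uid), uid, pos])


def user_scan_range(user_id: int, from_position_ms: int) -> Tuple[bytes, bytes]:
    """Start/stop rows for scanning one user's queue from a position to its end."""
    uid = str(user_id).encode("ascii")
    start = make_row_key(user_id, from_position_ms)
    # "~" sorts after every digit, so this stop row bounds the user's section.
    stop = b"_".join([_prefix(uid), uid, b"~"])
    return start, stop


older = make_row_key(42, 1000)
newer = make_row_key(42, 2000)
start, stop = user_scan_range(42, 1500)
```

Because every key for a user shares the same prefix and user id, a single contiguous scan covers that user's queue, while the hashed prefix spreads different users across the key-space.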
Using a timestamp as a queue position
• Pro: Allows for allocating roughly monotonically increasing positions
with no coordination between write requests
• Con: Isn’t sufficient to guarantee append-only semantics in the presence
of parallel writes
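The con above can be illustrated with a small in-memory simulation. `UserQueue` and `naive_poll` are hypothetical stand-ins for a user's section of the key-space and for a client that advances straight to the newest event it has seen; the timestamps are made up.

```python
import bisect


class UserQueue:
    """In-memory stand-in for one user's section of the key-space."""

    def __init__(self):
        self._events = []  # sorted list of (position, payload)

    def commit(self, position, payload):
        bisect.insort(self._events, (position, payload))

    def read_after(self, position):
        """Return events strictly newer than `position`, oldest first."""
        return [e for e in self._events if e[0] > position]


def naive_poll(queue, position):
    """Read new events and advance to the newest position seen."""
    events = queue.read_after(position)
    new_position = events[-1][0] if events else position
    return events, new_position


queue = UserQueue()
# Writer A is assigned timestamp 100, but its write is still in flight.
# Writer B is assigned timestamp 105 and commits first.
queue.commit(105, "B")
seen, position = naive_poll(queue, 0)          # client sees B, advances to 105
queue.commit(100, "A")                         # A lands behind the client
late, position = naive_poll(queue, position)   # nothing newer than 105
delivered = [payload for _, payload in seen + late]
```

In this run the client never receives "A", even though the write succeeded: the queue was not append-only from the client's point of view.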
Time-bounding and Back-scanning
• Need to ensure that clients don’t advance their stream positions past
writes that will eventually succeed
‒ But clients do need to advance position eventually
‒ How do we know when it’s safe?
• Solution: time-bound writes and back-scan reads
‒ Time-bounding: every write to HBase must complete within a fixed time-bound to be considered successful
‒ No guaranteed delivery for unsuccessful writes.
‒ Clients should retry failed writes at higher stream positions.
‒ Back-scanning: clients cannot advance their stream positions further than (current
time – back-scan interval)
‒ Back-scan interval >= write time-bound
• Provides guaranteed delivery but at the cost of duplicate events
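The time-bound and back-scan rules above can be sketched with a similar simulation. The 50 ms values, the `UserQueue.write` signature, and the `poll` helper are assumptions for illustration only; real writes would be bounded by client timeouts rather than by comparing two supplied timestamps.

```python
import bisect

WRITE_TIME_BOUND_MS = 50     # hypothetical value
BACK_SCAN_INTERVAL_MS = 50   # must be >= WRITE_TIME_BOUND_MS


class UserQueue:
    """In-memory stand-in for one user's section of the key-space."""

    def __init__(self):
        self._events = []  # sorted list of (position, payload)

    def write(self, assigned_ms, committed_ms, payload):
        """Accept a write only if it completed within the time-bound."""
        if committed_ms - assigned_ms > WRITE_TIME_BOUND_MS:
            return False  # caller should retry at a higher stream position
        bisect.insort(self._events, (assigned_ms, payload))
        return True

    def read_after(self, position):
        return [e for e in self._events if e[0] > position]


def poll(queue, position, now_ms):
    """Return new events, but never advance past (now - back-scan interval)."""
    events = queue.read_after(position)
    newest = events[-1][0] if events else position
    capped = min(newest, now_ms - BACK_SCAN_INTERVAL_MS)
    return events, max(position, capped)


queue = UserQueue()
queue.write(105, 110, "B")                          # fast write
first, position = poll(queue, 0, now_ms=130)        # sees B, position held at 80
queue.write(100, 140, "A")                          # slow but within the bound
second, position = poll(queue, position, now_ms=200)
too_slow = queue.write(100, 200, "C")               # exceeds the bound, rejected
```

Here the second poll returns both "A" and "B": the late write is delivered, and "B" is delivered twice, which is the duplicate-event cost noted above.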
• Master/slave architecture
‒ One cluster per DC
‒ Master cluster handles all reads and writes
‒ Slave clusters are passive replicas
• On promotion, clients transparently fail over to the new master cluster
• Can’t use native HBase replication directly
‒ Could cause clients to miss events when failing over to a lagging cluster
• Replication system needs to be aware of master/slave failovers
‒ Stop exactly replicating messages. Start appending messages to the current ends of
• Currently, we use a client-level replication system piggybacking on MySQL
• Plan to switch to a system that hooks into HBase replication by
configuring itself as a slave HBase cluster
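One possible reading of the failover-aware behaviour described above is sketched below. The bullet describing it is cut off in the source, so this is an interpretation rather than the actual design: `ReplicaQueue`, `promote`, and `apply_replicated` are hypothetical names, and the rule "append at the current time, past the last stored position" is an assumption.

```python
class ReplicaQueue:
    """In-memory stand-in for one user's queue on a replica cluster."""

    def __init__(self):
        self.events = []       # sorted list of (position, payload)
        self.is_master = False

    def promote(self):
        self.is_master = True

    def apply_replicated(self, position, payload, now_ms):
        """Apply an event shipped from the other cluster; return its position."""
        if self.is_master:
            # After promotion, clients may already have read past `position`,
            # so append at the current end instead of replicating exactly.
            end = self.events[-1][0] if self.events else 0
            position = max(now_ms, end + 1)
        self.events.append((position, payload))
        self.events.sort()
        return position


replica = ReplicaQueue()
# Before failover: exact replication, so positions match on both clusters.
exact = replica.apply_replicated(100, "e1", now_ms=101)
replica.promote()
client_position = 150  # a client has already advanced on the new master
# A lagging event originally written at position 120 arrives after promotion.
appended = replica.apply_replicated(120, "e2", now_ms=300)
```

With exact replication the lagging event would land at 120, behind the client at 150, and be missed; appending it at the current end keeps it visible to that client.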
• Closest off-the-rack queuing system is Kafka
‒ Developed at LinkedIn. Open sourced in 2011.
‒ Originally built to power LinkedIn’s analytics pipeline
‒ Very similar model built around “ordered commit logs”
‒ Allow for easy addition of new subscribers
‒ Allow for varying subscriber consumption patterns: slow subscribers don’t back up the
Why HBase and not Kafka?
• Better consistency vs. availability tradeoffs
‒ No automatic rack-aware replica placement
‒ No automatic replica re-assignment upon replica failure
‒ On replica failure, no fast failover of new writes to new replicas.
‒ Can’t require minimum replication factor for new writes without significantly impacting
availability on replica failure
• Replication support
‒ Not enough control over Kafka queue positions to implement transparent client
failovers between replica clusters
• Unable to scale to millions of topics
‒ Currently tops out in the tens of thousands of topics.
‒ Design requires very granular topic tracking. Barrier to scale.
• We were able to leverage HBase to store millions of guaranteed delivery
message queues, each of which was:
‒ replicated between data centers
‒ independently consumable by an arbitrary number of clients
• Cluster metrics:
‒ ~30 nodes per cluster
‒ 15K writes/sec at peak. Bursts of up to 40K writes/sec.
‒ 50K-60K requests/sec at peak.
Engineering Blog tech.blog.box.com
Open Source opensource.box.com