Although Cassandra is well known for its ability to scale and handle heavy load, the team at Abc Arbitrage chose to leverage its capabilities as a distributed system. In this presentation, Kévin Lovato, Software Engineer, focuses on the design of the Directory of their home-made Service Bus, which relies on Cassandra to behave as a full-fledged distributed system.
Building Your Own Distributed System: The Easy Way - Cassandra Summit EU 2014
1.
2. Building Your Own Distributed System
The Easy Way
Kévin Lovato - @alprema
3. What this presentation will NOT talk about
• Gazillions of inserts per second
• Hundreds of nodes
• Migrations from old technology to C* that now go 100 times faster
4. What this presentation will talk about
• Servers that synchronize their state
• Out of order messages
• CQL Schema design
• Time measurement madness
6. • Hedge fund specializing in algorithmic trading
• ~80 employees
• Our C* usage:
• Historical data (6+ TB)
• Time series (metrics)
• Home-made Service Bus (Zebus)
7. Service Bus 101
• Network abstraction layer
• Allows communication between services (SOA)
• Communication happens through business-level messages (events)
• Usually relies on a broker
8. Zebus 101
• Developed in .Net
• P2P
• Lightweight
• CQRS oriented
• 1+ year of production experience
• ~150M messages / day
10. Terminology
• Peer: A program connected to the Bus
• Subscription: A message type a Peer is interested in
• Directory server: A Peer that knows all the Peers and their Subscriptions
11. (Diagram: Directory 1, Directory 2, Peer 1, Peer 2, Peer 3)
Peer 1 is not connected and needs to register on the Bus
17. The Directory servers must be identical (no master)
A peer can contact any of the Directory servers at any time
Directory servers can be updated/restarted at any time
Peers have to be able to add Subscriptions one at a time if needed
24. • Lets us offload state synchronization to Cassandra (Quorum everywhere)
• Makes restart / crash recovery easy
• Only « business » code in the Directory Server
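Since every Directory server reads and writes the shared state at Quorum, a restarted or freshly deployed server can rebuild its view with a single read. A minimal sketch of that recovery read using the DataStax C# driver; the keyspace and table names are assumptions matching the schema sketches further down, not Zebus's actual storage layout.

```csharp
using System;
using Cassandra;

// All Directory state lives in Cassandra and is read back at QUORUM,
// so a restarted server simply re-reads it at startup.
var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

var rows = session.Execute(new SimpleStatement(
        "SELECT peer_id, message_type FROM directory.subscriptions")
    .SetConsistencyLevel(ConsistencyLevel.Quorum));

foreach (var row in rows)
    Console.WriteLine($"{row.GetValue<string>("peer_id")} -> {row.GetValue<string>("message_type")}");
```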
26. Timestamps: naive implementation (server side)
(Diagram: Directory 1, Directory 2, Peer 1)
Peer 1 is already registered on the Bus and will need to do multiple Subscription updates
43. A Peer is already registered on the Bus and has subscribed to one event type
(Diagram: Peer 1 and Directory 1)
Initial subscriptions:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
44. It now needs to add a new subscription
45. It will send all its current subscriptions plus the new one:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent (new) | { misc. Info }
46. Now imagine that the peer adds 10 000 subscriptions, one at a time
(Diagram: Peer 1 and Directory 1)
48. (Diagram: the full subscription list is re-sent to Directory 1, 10 000 times over)
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent (new) | { misc. Info }
… 10 000 other events …
Peer.1 | NthEvent | { misc. Info }
49. Solution: transfer subscriptions by message type
(Diagram: Peer 1 and Directory 1)
50. Peer ID | MessageType | Sub. Info
Peer.1 | NewEvent (1st) | { misc. Info }
51. Peer ID | MessageType | Sub. Info
Peer.1 | NewEvent (2nd) | { misc. Info }
And so on…
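In code, the fix amounts to grouping pending subscription updates by message type and sending one update message per type, instead of re-sending the peer's whole list. A hedged sketch; the message and member names are illustrative, not Zebus's actual API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative update message: carries the subscriptions for ONE message
// type only, so the Directory can upsert a single row per update.
class SubscriptionsUpdated
{
    public string PeerId;
    public string MessageType;
    public List<string> SubscriptionInfos;
}

static class SubscriptionSender
{
    // Group pending subscriptions by message type and emit one update per
    // type, instead of re-sending the peer's entire subscription list.
    public static IEnumerable<SubscriptionsUpdated> BuildUpdates(
        string peerId,
        IEnumerable<(string MessageType, string Info)> pending)
    {
        return pending
            .GroupBy(s => s.MessageType)
            .Select(g => new SubscriptionsUpdated
            {
                PeerId = peerId,
                MessageType = g.Key,
                SubscriptionInfos = g.Select(s => s.Info).ToList()
            });
    }
}
```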
54. • We want to only do upserts (no read-before-write)
• We want Cassandra to use client timestamps to resolve out-of-order updates
• Subscriptions have to be updatable one by one
55. One subscription per row
Peer ID | MessageType | Subscription Info
Peer.18 | CoolEvent | { misc. Info }
… | … | …
• Primary Key (Peer Id, MessageType)
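Under these constraints, the per-subscription table and its write path might look like the following sketch (DataStax C# driver; keyspace, table, and column names are assumptions). The INSERT is a pure upsert, and USING TIMESTAMP hands Cassandra a client-generated timestamp so out-of-order updates resolve by last-write-wins.

```csharp
using System;
using System.Text;
using Cassandra;

var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

// One row per (peer, message type): peer_id is the partition key,
// message_type the clustering column. Assumes the keyspace already exists.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.subscriptions (
          peer_id text,
          message_type text,
          subscription_info blob,
          PRIMARY KEY (peer_id, message_type))"));

// Pure upsert, no read-before-write; the client-provided timestamp lets
// Cassandra resolve out-of-order updates with last-write-wins.
long timestampMicros =
    (DateTime.UtcNow.Ticks - new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc).Ticks) / 10;
byte[] subscriptionInfo = Encoding.UTF8.GetBytes("{ misc. Info }");

session.Execute(new SimpleStatement(
        "INSERT INTO directory.subscriptions (peer_id, message_type, subscription_info) " +
        "VALUES (?, ?, ?) USING TIMESTAMP ?",
        "Peer.1", "CoolEvent", subscriptionInfo, timestampMicros)
    .SetConsistencyLevel(ConsistencyLevel.Quorum));
```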
56. (Diagram: Peer 1, Peer 2, Directory)
Peer 1 and Peer 2 need to register on the Bus
57. Peer 1 registers with 2 Subscriptions:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent | { misc. Info }
58. The Directory starts to write to C*
59. Peer 2 registers during the write
60. Since the insertion was not over, Peer 2 gets an incomplete state:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
61. All subscriptions in one row
Peer ID | All Subscriptions Blob
Peer.18 | { blob }
… | …
• Primary Key (Peer Id)
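A sketch of this alternative schema (names are assumptions): the entire subscription list is serialized into a single blob column, so each write replaces the peer's complete state in one row-atomic upsert, which fixes the partial-read problem of the previous design.

```csharp
using Cassandra;

// One row per peer: the full subscription list is serialized into a single
// blob, so every write replaces the peer's complete state atomically.
var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.peer_states (
          peer_id text PRIMARY KEY,
          all_subscriptions blob)"));
```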
62. (Diagram: Directory 1, Directory 2, Peer 1)
Peer 1 is already registered on the Bus and needs to add two Subscriptions
65. A delay (again!) slows down Directory 1, causing both Subscriptions to be added simultaneously
(Diagram: Directory 1, Directory 2, Peer 1)
66. (Directory 1 state: no subscriptions. Directory 2 state: no subscriptions.)
• Peer 1 adds Subscription 1
• Peer 1 adds Subscription 2
• Directory 1 gets the state to add Subscription 1
• Directory 2 gets the state to add Subscription 2
67. (Directory 1 stores: Subscription 1. Directory 2 stores: Subscription 2.)
• They both store the updated state to C*
68. (Stored: either Subscription 1 or 2, depending on which write was slowest)
• Both store only their new subscription
69. Solution: compromise
• We split subscriptions into Static and Dynamic subscriptions
• Static subscriptions cannot be updated one by one
• The Dynamic subscription list cannot be updated atomically
• Each type has its own Column Family
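A sketch of the compromise as two column families (all names are assumptions): static subscriptions as one atomic blob per peer, dynamic subscriptions as one individually updatable row per (peer, message type).

```csharp
using Cassandra;

var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

// Static subscriptions: one blob per peer, replaced atomically as a whole.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.static_subscriptions (
          peer_id text PRIMARY KEY,
          all_subscriptions blob)"));

// Dynamic subscriptions: one row per (peer, message type), updatable one
// by one, but with no atomicity across the whole list.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.dynamic_subscriptions (
          peer_id text,
          message_type text,
          subscription_info blob,
          PRIMARY KEY (peer_id, message_type))"));
```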
73. DateTime.Now
• Calling DateTime.Now twice in a row can (and will) return the same value
• Its resolution is around 10 ms
• We had to create a unique timestamp provider (add 1 tick when called in the same « time bucket »)
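A minimal sketch of such a unique timestamp provider (an illustration, not Zebus's actual code): if a call lands in the same « time bucket » as the previous one, the result is bumped by one tick (100 ns) so every caller gets a distinct value.

```csharp
using System;
using System.Threading;

// Illustrative unique timestamp provider: if a call falls into the same
// « time bucket » as the previous one, add 1 tick (100 ns) to stay unique.
static class UniqueTimestampProvider
{
    private static long _lastTicks;

    public static DateTime NextUtc()
    {
        while (true)
        {
            long now = DateTime.UtcNow.Ticks;
            long last = Interlocked.Read(ref _lastTicks);
            long next = now > last ? now : last + 1; // same bucket: +1 tick
            if (Interlocked.CompareExchange(ref _lastTicks, next, last) == last)
                return new DateTime(next, DateTimeKind.Utc);
        }
    }
}
```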
74. Cassandra timestamp
• .Net's DateTime.Ticks is more precise than Cassandra's timestamps (100 ns vs. 1 μs)
• Our custom time provider ensured uniqueness by adding 1 tick at a time, which was lost in translation
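A small demo of the mismatch: converting ticks (100 ns) to Cassandra's microsecond timestamps divides by 10, so two values one tick apart collapse into the same timestamp. One possible fix, an assumption rather than the talk's exact solution, is to enforce uniqueness at whole-microsecond granularity instead.

```csharp
using System;

class TickTruncationDemo
{
    static readonly DateTime UnixEpoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Cassandra timestamps are microseconds since the epoch; DateTime ticks
    // are 100 ns, so the conversion divides by 10 and drops the last digit.
    static long ToCassandraMicros(DateTime utc) => (utc - UnixEpoch).Ticks / 10;

    static void Main()
    {
        var t = new DateTime(2014, 12, 4, 12, 0, 0, DateTimeKind.Utc);
        // Both lines print the same value: a +1 tick uniqueness bump is lost.
        Console.WriteLine(ToCassandraMicros(t));
        Console.WriteLine(ToCassandraMicros(t.AddTicks(1)));
    }
}
```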
75. « UselessKey »
• The Directory CF is really small and needs to be retrieved entirely and frequently
• We used a « bool UselessKey » partition key to force sequential storage and squeeze out the last bits of speed we needed
76. « UselessKey »
UselessKey | Peer ID | MessageType | Subscription info
false | Peer.18 | UserCreated | { misc. Info }
… | … | … | …
• Primary Key (UselessKey, Peer Id, MessageType)
• You should benchmark (after a flush) with your real data
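A sketch of the trick (names are assumptions): a constant boolean partition key places every row in a single partition, so the whole, deliberately small, directory can be read back as one sequential partition scan.

```csharp
using Cassandra;

var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

// A constant boolean partition key forces every row into one partition,
// stored sequentially, so the small Directory CF reads back in one scan.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.subscriptions_by_uselesskey (
          useless_key boolean,
          peer_id text,
          message_type text,
          subscription_info blob,
          PRIMARY KEY (useless_key, peer_id, message_type))"));

// Retrieving the entire directory is then a single-partition query.
var all = session.Execute(new SimpleStatement(
    "SELECT * FROM directory.subscriptions_by_uselesskey WHERE useless_key = false"));
```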
78. When you have multiple servers sharing a state, Cassandra can save you some headaches
79. Schema design is critical: think it through and make sure you understand what is atomic and what is not
80. Client-provided timestamps can be very useful, but be sure to generate unique timestamps
81. If you are not using Java, be well aware of data type differences between your language and Java
82. Want to see the code?
www.github.com/Abc-Arbitrage
83. Want to see more code?
jobs@abc-arbitrage.com