Although Cassandra is well known for its ability to scale and handle heavy load, the team at Abc Arbitrage chose to leverage its capabilities as a distributed system. In this presentation, Kévin Lovato, Software Engineer, focuses on the design of the Directory of their home-made Service Bus, which relies on Cassandra to behave as a full-fledged distributed system.
Building Your Own Distributed System: The Easy Way - Cassandra Summit EU 2014
1.
2. Building Your Own Distributed System
The Easy Way
Kévin Lovato - @alprema
3. What this presentation will NOT talk about
• Gazillions of inserts per second
• Hundreds of nodes
• Migrations from old technology to C* that now go 100 times faster
4. What this presentation will talk about
• Servers that synchronize their state
• Out of order messages
• CQL Schema design
• Time measurement madness
6. • Hedge fund specializing in algorithmic trading
• ~80 employees
• Our C* usage:
• Historical data (6+ TB)
• Time series (metrics)
• Home-made Service Bus (Zebus)
7. Service Bus 101
• Network abstraction layer
• Allows communication between services (SOA)
• Communication happens through business-level messages (events)
• Usually relies on a broker
8. Zebus 101
• Developed in .Net
• P2P
• Lightweight
• CQRS oriented
• 1+ year of production experience
• ~150M messages / day
10. Terminology
• Peer: A program connected to the Bus
• Subscription: A message type a Peer is interested in
• Directory server: A Peer that knows all the Peers and their Subscriptions
11. (Diagram: Directory 1, Directory 2, Peer 1, Peer 2, Peer 3)
Peer 1 is not connected and needs to register on the Bus
17. The Directory servers must be identical (no master)
A peer can contact any of the Directory servers at any time
Directory servers can be updated/restarted at any time
Peers have to be able to add Subscriptions one at a time if needed
24. • Lets us offload state synchronization to Cassandra (Quorum everywhere)
• Makes restart / crash recovery easy
• Only « business » code in the Directory Server
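Since every Directory server reads and writes the shared state at Quorum, a restarted or freshly deployed server can rebuild its view with a single read. A minimal sketch of that recovery read using the DataStax C# driver; the keyspace and table names are assumptions matching the schema sketches further down, not Zebus's actual storage layout.

```csharp
using System;
using Cassandra;

// All Directory state lives in Cassandra and is read back at QUORUM,
// so a restarted server simply re-reads it at startup.
var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

var rows = session.Execute(new SimpleStatement(
        "SELECT peer_id, message_type FROM directory.subscriptions")
    .SetConsistencyLevel(ConsistencyLevel.Quorum));

foreach (var row in rows)
    Console.WriteLine($"{row.GetValue<string>("peer_id")} -> {row.GetValue<string>("message_type")}");
```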
26. Timestamps: naive implementation (server side)
(Diagram: Directory 1, Directory 2, Peer 1)
Peer 1 is already registered on the Bus and will need to do multiple Subscription updates
43. A Peer is already registered on the Bus and has subscribed to one event type
(Diagram: Peer 1 and Directory 1)
Initial subscriptions:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
44. It now needs to add a new subscription
45. It will send all its current subscriptions plus the new one:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent (new) | { misc. Info }
46. Now imagine that the peer adds 10 000 subscriptions, one at a time
(Diagram: Peer 1 and Directory 1)
48. (Diagram: the full subscription list is re-sent to Directory 1, 10 000 times over)
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent (new) | { misc. Info }
… 10 000 other events …
Peer.1 | NthEvent | { misc. Info }
49. Solution: transfer subscriptions by message type
(Diagram: Peer 1 and Directory 1)
50. Peer ID | MessageType | Sub. Info
Peer.1 | NewEvent (1st) | { misc. Info }
51. Peer ID | MessageType | Sub. Info
Peer.1 | NewEvent (2nd) | { misc. Info }
And so on…
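In code, the fix amounts to grouping pending subscription updates by message type and sending one update message per type, instead of re-sending the peer's whole list. A hedged sketch; the message and member names are illustrative, not Zebus's actual API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative update message: carries the subscriptions for ONE message
// type only, so the Directory can upsert a single row per update.
class SubscriptionsUpdated
{
    public string PeerId;
    public string MessageType;
    public List<string> SubscriptionInfos;
}

static class SubscriptionSender
{
    // Group pending subscriptions by message type and emit one update per
    // type, instead of re-sending the peer's entire subscription list.
    public static IEnumerable<SubscriptionsUpdated> BuildUpdates(
        string peerId,
        IEnumerable<(string MessageType, string Info)> pending)
    {
        return pending
            .GroupBy(s => s.MessageType)
            .Select(g => new SubscriptionsUpdated
            {
                PeerId = peerId,
                MessageType = g.Key,
                SubscriptionInfos = g.Select(s => s.Info).ToList()
            });
    }
}
```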
54. • We want to only do upserts (no read-before-write)
• We want Cassandra to use client timestamps to resolve out-of-order updates
• Subscriptions have to be updatable one by one
55. One subscription per row
Peer ID | MessageType | Subscription Info
Peer.18 | CoolEvent | { misc. Info }
… | … | …
• Primary Key (Peer Id, MessageType)
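Under these constraints, the per-subscription table and its write path might look like the following sketch (DataStax C# driver; keyspace, table, and column names are assumptions). The INSERT is a pure upsert, and USING TIMESTAMP hands Cassandra a client-generated timestamp so out-of-order updates resolve by last-write-wins.

```csharp
using System;
using System.Text;
using Cassandra;

var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

// One row per (peer, message type): peer_id is the partition key,
// message_type the clustering column. Assumes the keyspace already exists.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.subscriptions (
          peer_id text,
          message_type text,
          subscription_info blob,
          PRIMARY KEY (peer_id, message_type))"));

// Pure upsert, no read-before-write; the client-provided timestamp lets
// Cassandra resolve out-of-order updates with last-write-wins.
long timestampMicros =
    (DateTime.UtcNow.Ticks - new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc).Ticks) / 10;
byte[] subscriptionInfo = Encoding.UTF8.GetBytes("{ misc. Info }");

session.Execute(new SimpleStatement(
        "INSERT INTO directory.subscriptions (peer_id, message_type, subscription_info) " +
        "VALUES (?, ?, ?) USING TIMESTAMP ?",
        "Peer.1", "CoolEvent", subscriptionInfo, timestampMicros)
    .SetConsistencyLevel(ConsistencyLevel.Quorum));
```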
56. (Diagram: Peer 1, Peer 2, Directory)
Peer 1 and Peer 2 need to register on the Bus
57. Peer 1 registers with 2 Subscriptions:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent | { misc. Info }
58. The Directory starts to write to C*
59. Peer 2 registers during the write
60. Since the insertion was not over, Peer 2 gets an incomplete state:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
61. All subscriptions in one row
Peer ID | All Subscriptions Blob
Peer.18 | { blob }
… | …
• Primary Key (Peer Id)
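A sketch of this alternative schema (names are assumptions): the entire subscription list is serialized into a single blob column, so each write replaces the peer's complete state in one row-atomic upsert, which fixes the partial-read problem of the previous design.

```csharp
using Cassandra;

// One row per peer: the full subscription list is serialized into a single
// blob, so every write replaces the peer's complete state atomically.
var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.peer_states (
          peer_id text PRIMARY KEY,
          all_subscriptions blob)"));
```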
62. (Diagram: Directory 1, Directory 2, Peer 1)
Peer 1 is already registered on the Bus and needs to add two Subscriptions
65. A delay (again!) slows down Directory 1, causing both Subscriptions to be added simultaneously
(Diagram: Directory 1, Directory 2, Peer 1)
66. (Directory 1 state: no subscriptions. Directory 2 state: no subscriptions.)
• Peer 1 adds Subscription 1
• Peer 1 adds Subscription 2
• Directory 1 gets the state to add Subscription 1
• Directory 2 gets the state to add Subscription 2
67. (Directory 1 stores: Subscription 1. Directory 2 stores: Subscription 2.)
• They both store the updated state to C*
68. (Stored: either Subscription 1 or 2, depending on which write was slowest)
• Both store only their new subscription
69. Solution: compromise
• We split subscriptions into Static and Dynamic subscriptions
• Static subscriptions cannot be updated one by one
• The Dynamic subscription list cannot be updated atomically
• Each type has its own Column Family
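A sketch of the compromise as two column families (all names are assumptions): static subscriptions as one atomic blob per peer, dynamic subscriptions as one individually updatable row per (peer, message type).

```csharp
using Cassandra;

var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

// Static subscriptions: one blob per peer, replaced atomically as a whole.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.static_subscriptions (
          peer_id text PRIMARY KEY,
          all_subscriptions blob)"));

// Dynamic subscriptions: one row per (peer, message type), updatable one
// by one, but with no atomicity across the whole list.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.dynamic_subscriptions (
          peer_id text,
          message_type text,
          subscription_info blob,
          PRIMARY KEY (peer_id, message_type))"));
```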
73. DateTime.Now
• Calling DateTime.Now twice in a row can (and will) return the same value
• Its resolution is around 10 ms
• We had to create a unique timestamp provider (add 1 tick when called in the same « time bucket »)
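A minimal sketch of such a unique timestamp provider (an illustration, not Zebus's actual code): if a call lands in the same « time bucket » as the previous one, the result is bumped by one tick (100 ns) so every caller gets a distinct value.

```csharp
using System;
using System.Threading;

// Illustrative unique timestamp provider: if a call falls into the same
// « time bucket » as the previous one, add 1 tick (100 ns) to stay unique.
static class UniqueTimestampProvider
{
    private static long _lastTicks;

    public static DateTime NextUtc()
    {
        while (true)
        {
            long now = DateTime.UtcNow.Ticks;
            long last = Interlocked.Read(ref _lastTicks);
            long next = now > last ? now : last + 1; // same bucket: +1 tick
            if (Interlocked.CompareExchange(ref _lastTicks, next, last) == last)
                return new DateTime(next, DateTimeKind.Utc);
        }
    }
}
```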
74. Cassandra timestamp
• .Net's DateTime.Ticks is more precise than Cassandra's timestamps (100 ns vs. 1 μs)
• Our custom time provider ensured uniqueness by adding 1 tick at a time, which was lost in translation
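A small demo of the mismatch: converting ticks (100 ns) to Cassandra's microsecond timestamps divides by 10, so two values one tick apart collapse into the same timestamp. One possible fix, an assumption rather than the talk's exact solution, is to enforce uniqueness at whole-microsecond granularity instead.

```csharp
using System;

class TickTruncationDemo
{
    static readonly DateTime UnixEpoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Cassandra timestamps are microseconds since the epoch; DateTime ticks
    // are 100 ns, so the conversion divides by 10 and drops the last digit.
    static long ToCassandraMicros(DateTime utc) => (utc - UnixEpoch).Ticks / 10;

    static void Main()
    {
        var t = new DateTime(2014, 12, 4, 12, 0, 0, DateTimeKind.Utc);
        // Both lines print the same value: a +1 tick uniqueness bump is lost.
        Console.WriteLine(ToCassandraMicros(t));
        Console.WriteLine(ToCassandraMicros(t.AddTicks(1)));
    }
}
```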
75. « UselessKey »
• The Directory CF is really small and needs to be retrieved entirely and frequently
• We used a « bool UselessKey » partition key to force sequential storage and squeeze out the last bits of speed we needed
76. « UselessKey »
UselessKey | Peer ID | MessageType | Subscription info
false | Peer.18 | UserCreated | { misc. Info }
… | … | … | …
• Primary Key (UselessKey, Peer Id, MessageType)
• You should benchmark (after a flush) with your real data
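A sketch of the trick (names are assumptions): a constant boolean partition key places every row in a single partition, so the whole, deliberately small, directory can be read back as one sequential partition scan.

```csharp
using Cassandra;

var session = Cluster.Builder().AddContactPoint("127.0.0.1").Build().Connect();

// A constant boolean partition key forces every row into one partition,
// stored sequentially, so the small Directory CF reads back in one scan.
session.Execute(new SimpleStatement(
    @"CREATE TABLE IF NOT EXISTS directory.subscriptions_by_uselesskey (
          useless_key boolean,
          peer_id text,
          message_type text,
          subscription_info blob,
          PRIMARY KEY (useless_key, peer_id, message_type))"));

// Retrieving the entire directory is then a single-partition query.
var all = session.Execute(new SimpleStatement(
    "SELECT * FROM directory.subscriptions_by_uselesskey WHERE useless_key = false"));
```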
78. When you have multiple servers sharing a state, Cassandra can save you some headaches
79. Schema design is critical: think it through and make sure you understand what is atomic and what is not
80. Client-provided timestamps can be very useful, but be sure to generate unique timestamps
81. If you are not using Java, be well aware of data type differences between your language and Java
82. Want to see the code?
www.github.com/Abc-Arbitrage
83. Want to see more code?
jobs@abc-arbitrage.com