BcnDevCon 2013: Using Cassandra and Zookeeper to build a distributed, high performance system

Slides from my presentation at BCN Dev Con 2013.


1. Using Zookeeper and Cassandra to build a distributed, high performance system
   BcnDevCon13
   Galo Navarro
   @srvaroa - galo@midokura.com
2. About me
   Background: backend & architecture in high-traffic systems.
   Current: software engineer @ Midokura.
3. A talk about databases
4. Takeaways
   Go beyond artificial SQL-NoSQL antagonisms. We share some fundamental problems:
   ● Latency, availability, durability...
   ● True today, 20 years ago, 3000 years ago.
   New technologies signal a different emphasis on solving each problem.
   Solutions are not exclusive: you can combine them.
5. Midonet
   Distributed network virtualization system.
   Context, dataset, requirements.
   https://www.midokura.com/
6. Virtualization
   Computational resources on demand.
7. (Diagram: a mass of VMs forming "the cloud".)
8. Midokura's use case
   (Diagram: internet, a vRouter, vSwitches, and VMs forming a virtual network.)
   Each client that rents VMs on the datacentre wants to network them as if they were their own physical resources (e.g. same L2 domain, private addresses, isolation...).
   MidoNet allows the owner of the datacentre to provide that service.
9. Dataset
   Virtual network topology: vRouter, vSwitches.
   Virtual network state:
     Routing tables:
       destination     gateway
       192.168.0.0/16  192.168.0.12
       66.82.1.0/16    66.82.1.1
       0.0.0.0/32      10.0.2.1
     ARP table:
       IP              MAC
       192.168.1.23    aa:bb:cc:dd:ee:ff
       192.168.1.11    11:22:33:44:55:66
   Metrics, audit logs, monitoring.
10. Usage
    A daemon captures packets sent from the VMs hosted on each physical host.
    On new packets, it loads a view of the virtual topology from a (distributed) data store.
11. Usage
    The daemon simulates the packet's trip through the virtual network until it reaches a destination VM, and identifies the host.
    It then instructs the kernel to route similar packets via a tunnel.
12. Midonet architecture
    (Diagram: hosts, storage cluster, API, IP bus.)
13. Constraints
    CAP: Consistency is negotiable; Availability and Partition Tolerance are critical.
    What happens if our service doesn't handle network partitions, a faulty master, GC pauses, latency, lags, locks...?
    - Not just N users unable to see their profiles
    - But infrastructure failure across the entire datacentre
14. Midokura's use case
    Coming to "NoSQL" not from "Big Data", but looking for a specific mix of:
    ● Availability
    ● Fault tolerance
    ● Performance
    ● Durability
    ● Low operational cost
    How are Cassandra and Zookeeper useful?
15. Virtual network state, assorted data, metrics
    https://cassandra.apache.org/
16. Cassandra elevator pitch
    A massively scalable open source NoSQL database.
    Supports large amounts of structured, semi-structured, and unstructured data (key-value), across multiple data centers.
    Performance, availability, and linear scalability, with no single point of failure.
17. Cassandra architecture
    P2P: no privileged nodes; clients see a unified view across datacentres (DC1, DC2, DC3).
18. Fault tolerance
    Replication factor = 3, consistency level = QUORUM.
    (Diagram: a write(x) is acknowledged as soon as a quorum of replicas answer "ok", even though one replica node is faulty. A client sketch follows below.)
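
To make the RF/CL knobs concrete, here is a minimal sketch using the DataStax Java driver; it is not the client code from the talk, and the demo keyspace and table are invented:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class QuorumWrite {
        public static void main(String[] args) {
            // Connect to any one node; the driver discovers the rest of the ring.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // RF = 3: every row is stored on three replicas.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.kv (k text PRIMARY KEY, v text)");

            // CL = QUORUM: the write succeeds once 2 of the 3 replicas ack,
            // so one faulty node does not fail the request.
            SimpleStatement write = new SimpleStatement("INSERT INTO demo.kv (k, v) VALUES ('x', '42')");
            write.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            cluster.close();
        }
    }
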
19. Fault tolerance
    Hinted handoff: the coordinator holds the data until the faulty replica recovers.
20. Fault tolerance
    RF = 3, CL = QUORUM.
    (Diagram: a read(x) likewise succeeds once a quorum of replicas returns x.)
21. Consistency
    RF = 3, CL = QUORUM.
    (Diagram: on read(x), one replica returns a stale value x'.)
    The coordinator will wait until the CL is possible across replicas (or fail).
    The CL can also be ONE, TWO, ALL...
22. Consistency
    Read repair: an order is issued to the disagreeing node to reconcile its local copy.
23. Multi DC
    RF = 6, spread across DC 1 and DC 2.
    CL = LOCAL_QUORUM (quorum inside the local DC): minimizes expensive network trips.
    CL = EACH_QUORUM (quorum on each DC).
    (A sketch follows below.)
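
Continuing the sketch above (it reuses the session and imports from the QuorumWrite example), a hypothetical multi-DC keyspace and a LOCAL_QUORUM write could look like this; the DC names follow the slide, and the rest is an assumption:

    // Hypothetical keyspace spanning two datacentres with three replicas
    // each (RF = 6 in total, as on the slide). DC names must match the
    // cluster's snitch configuration.
    session.execute("CREATE KEYSPACE IF NOT EXISTS demo_multi WITH replication = "
            + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");
    session.execute("CREATE TABLE IF NOT EXISTS demo_multi.kv (k text PRIMARY KEY, v text)");

    // LOCAL_QUORUM: ack from 2 of the 3 replicas in the client's own DC,
    // so the request path never waits on a cross-DC round trip.
    SimpleStatement stmt = new SimpleStatement("INSERT INTO demo_multi.kv (k, v) VALUES ('x', '42')");
    stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
    session.execute(stmt);
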
24. Latency + Throughput: writes
    A write(key, value) is appended to the commit log (disk) and applied to the memtable (memory); full memtables are flushed to immutable sstables, each with its own index. (A toy sketch of the scheme follows below.)
    Minimize disk access.
    Immutability: data on disk doesn't change.
    - Saves IO sync locks
    - Requires async compaction
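
The write path above is essentially a log-structured merge scheme. The following toy class sketches the idea only; it is NOT Cassandra's code, and the file names and flush threshold are invented:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    // Toy log-structured writer: durable sequential append, fast in-memory
    // update, periodic flush of sorted data to an immutable file.
    public class ToyLsmWriter {
        private final TreeMap<String, String> memtable = new TreeMap<>();
        private final FileWriter commitLog;
        private int flushed = 0;

        public ToyLsmWriter() throws IOException {
            commitLog = new FileWriter("commit.log", true); // append mode
        }

        public void write(String key, String value) throws IOException {
            commitLog.write(key + "=" + value + "\n"); // 1. durable sequential append
            commitLog.flush();
            memtable.put(key, value);                  // 2. fast in-memory update
            if (memtable.size() >= 1000) {
                flush();                               // 3. spill to disk when full
            }
        }

        private void flush() throws IOException {
            // TreeMap iterates in key order, so the "sstable" is written
            // sequentially and never modified afterwards (immutability);
            // real systems then compact these files asynchronously.
            try (FileWriter sstable = new FileWriter("sstable-" + (flushed++) + ".dat")) {
                for (Map.Entry<String, String> e : memtable.entrySet()) {
                    sstable.write(e.getKey() + "=" + e.getValue() + "\n");
                }
            }
            memtable.clear();
        }
    }
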
25. Latency + Throughput: reads
    A read(key) checks the memtable first, then uses caches, Bloom filters, and sstable indexes to avoid touching sstables that cannot contain the key.
26. Flexible data model
                 name    email                 state
    user["1"]    Julius  jcaesar@senate.rom    stabbed
    user["2"]    Marcus  mantonius@senate.rom
    NAT["192.168.1.2:80:10.1.1.1:923"] = { ip = "192.12.3.11", port = "455", ttl = .... }
    Simpler schema changes; flexible (good in growth mode). (A CQL sketch follows below.)
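
For illustration, the NAT entry could be modelled with a CQL map column; this reuses the session from the earlier sketch, and the table and column names are my own, not from the talk:

    // One row per flow key, arbitrary small attributes inside the map.
    session.execute("CREATE TABLE IF NOT EXISTS demo.nat ("
            + "flow text PRIMARY KEY, entry map<text, text>)");
    session.execute("INSERT INTO demo.nat (flow, entry) VALUES ("
            + "'192.168.1.2:80:10.1.1.1:923', "
            + "{'ip': '192.12.3.11', 'port': '455'})");
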
27. Time series
    Column names are stored physically sorted:
    ● Wide rows enable ordering + efficient filtering.
    ● Pack together data that will be queried together.
    Events (bad) <- applies the SQL approach:
      event[id] = {device=1, time=t1, val=1}
      event[id] = {device=1, time=t2, val=2}
    Events (better):
      event[device1] = { {time=t2, val=2}, {time=t1, val=3}, ... }
      event[device2] = { {time=t3, val=3}, {time=t4, val=4}, ... }
    (A CQL sketch follows below.)
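
A sketch of the "better" layout in CQL terms, again reusing the earlier session; the table and column names are invented:

    // One partition per device; rows clustered (physically sorted) by time.
    session.execute("CREATE TABLE IF NOT EXISTS demo.events ("
            + "device int, time timestamp, val int, "
            + "PRIMARY KEY (device, time)) "
            + "WITH CLUSTERING ORDER BY (time DESC)");

    // All events of one device form one wide row, so a time-range query is
    // a single sequential slice rather than a scatter over event ids.
    session.execute("SELECT time, val FROM demo.events "
            + "WHERE device = 1 AND time >= '2013-11-01' LIMIT 100");
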
28. Things to watch
    ● The data model is highly conditioned by your queries, vs. SQL's one model for many possible queries.
    ● Relearn performance tuning: GC, caches, IO patterns, repairs... understanding the internals is as important as in SQL.
    ● Counter-intuitive internals: e.g. expired data doesn't get deleted immediately (not even "soon").
    ● ...
29. Things to watch
    Know well how your clients handle failovers, and tune for your use case. E.g. when we process a packet we want low latency and no failures, so:
    ● How long is a timeout?
    ● Retry on a different node, or fail fast?
    ● How do we distinguish node failure from a transient latency spike?
    ● How many nodes must be up to satisfy the CL?
    (A client-tuning sketch follows below.)
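
One way to act on these questions with the DataStax Java driver is to fail fast: a short read timeout and a no-retry policy, so the caller sees the failure and decides. This is an illustrative assumption, not configuration from the talk, and the numbers are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.SocketOptions;
    import com.datastax.driver.core.policies.FallthroughRetryPolicy;

    public class FailFastClient {
        public static Cluster build() {
            return Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    // Give up on a node quickly instead of stalling the packet path.
                    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(500))
                    // No automatic retries: surface the error to the caller.
                    .withRetryPolicy(FallthroughRetryPolicy.INSTANCE)
                    // Default CL for all statements unless overridden per query.
                    .withQueryOptions(new QueryOptions()
                            .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                    .build();
        }
    }
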
30. Watch data changes, service discovery, coordination
    Zookeeper: "Because coordinating distributed systems is a Zoo"
    https://zookeeper.apache.org/
31. Zookeeper
    ● High availability.
    ● Performance: in memory, optimized for reads over writes. Keeping data in memory limits the dataset size (backed by disk).
    ● Reliable delivery: if one node sees an update, all eventually will.
    ● Total & causal order:
      - Data is delivered in the same order it is sent.
      - A message m is delivered only after all messages sent before m have been delivered.
32. Zookeeper architecture
    1. update -> 2. proposal -> 3. ack! -> 4. commit
    ● Paxos variant
    ● Ordered messages
    ● Atomic broadcasts
    The leader (L) is not a single point of failure: a new one is elected upon failure.
33. ZK Watchers
    Notifies subscribers of changes in the znode tree (e.g. a change at one of the port znodes below; a watcher sketch follows):
    /midonet
      /bridges
        /A
          /ports
            /1 = [.., peer = bridgeC/ports/79, .. ]
            /2 = [.., peer = routerX/ports/53, .. ]
        /B
          /ports
            /79 = [.., peer = bridgeA/ports/1, .. ]
      /routers
        ...
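
A minimal watcher sketch with the plain ZooKeeper Java client; the connect string and path are placeholders. Note that ZK watches are one-shot, so the code re-registers after each event:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.ZooKeeper;

    public class PortWatcher {
        public static void main(String[] args) throws Exception {
            // The session-level watcher is left empty in this sketch.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});
            watchPort(zk, "/midonet/bridges/A/ports/1");
            Thread.sleep(Long.MAX_VALUE); // keep the process alive to receive events
        }

        static void watchPort(ZooKeeper zk, String path) throws Exception {
            // getData arms a one-shot watch: ZK fires it once on the next
            // change, so we re-register (and re-read) after every event.
            byte[] data = zk.getData(path, (WatchedEvent event) -> {
                System.out.println("changed: " + event.getPath());
                try {
                    watchPort(zk, path);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }, null);
            System.out.println("current config: " + new String(data));
        }
    }
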
34. ZK Watchers
    On a change (e.g. cut a binding and add a new device), the affected hosts get "update A!", "update B!", "update C!" notifications.
    Important: we want to notify each node of relevant changes only!
35. Remember the scale!
36. Service discovery (WIP)
    Distributed service nodes (n1, n2, n3) register themselves; clients discover them and are notified when a node goes down.
    Clients must know the ZK cluster (static) but not the service nodes (dynamic).
    /nodes
      /n1
      /n2
      /n3
    Ephemeral nodes: if the session that created one dies, the node disappears.
    (A registration/discovery sketch follows below.)
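
A sketch of this registration/discovery pattern with the ZooKeeper Java client; paths follow the slide, host strings are placeholders, and the /nodes parent is assumed to exist:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class Discovery {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181", 5000, event -> {});

            // A service node registers itself. EPHEMERAL ties the znode to
            // this session: if the process dies, the znode disappears.
            zk.create("/nodes/n1", "host1:9999".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // A client lists the live nodes and leaves a watch so it is
            // notified when membership changes (a node joins or goes down).
            List<String> live = zk.getChildren("/nodes",
                    event -> System.out.println("membership changed: " + event.getType()));
            System.out.println("live service nodes: " + live);
        }
    }
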
37. Q ? A : Thank you!
