BcnDevCon 2013: Using Cassandra and Zookeeper to build a distributed, high performance system
Slides from my presentation at BCN Dev Con 2013.

Published in Technology
  • 1. Using Zookeeper and Cassandra to build a distributed, high performance system BcnDevCon13 Galo Navarro @srvaroa - galo@midokura.com
  • 2. About me Background: backend & architecture in high traffic systems Current: software engineer @ Midokura
  • 3. A talk about databases
  • 4. Takeaways Go beyond artificial SQL-NoSQL antagonisms: We share some fundamental problems: ● Latency, availability, durability.. ● True today, 20y ago, 3000y ago New tech signals different emphasis on solving each problem Solutions are not exclusive: you can combine them.
  • 5. Midonet Distributed network virtualization system Context, dataset, requirements https://www.midokura.com/
  • 6. Virtualization Computational resources on demand
  • 8. Midokura's use case Each client that rents VMs in the datacentre wants to network them as if they were its own physical resources (e.g.: same L2 domain, private addresses, isolation..). MidoNet allows the owner of the datacentre to provide that service. [Diagram: a virtual network in which VMs connect through vSwitches and a vRouter to the internet]
  • 9. Dataset Virtual network topology: vRouter, vSwitches. Virtual network state: ARP table (IP → MAC, e.g. aa:bb:cc:dd:ee:ff, 11:22:33:44:55:66), routing tables (destination IP → gateway). Metrics, audit logs, monitoring.
  • 10. Usage A daemon captures packets sent from the VMs hosted on each physical host. On new packets, it loads a view of the virtual topology from a (distributed) data store.
  • 11. Usage The daemon simulates the packet's trip through the virtual network until it reaches a destination VM, and identifies that VM's host. It then instructs the kernel to route similar packets via a tunnel.
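The "simulate once, then let the kernel handle similar packets" idea on slides 10 and 11 can be sketched as a small flow cache. This is only an illustration: the class name, the `(src, dst)` match key and the VM placement table are assumptions made for the example, not MidoNet's actual data structures.

```python
class FlowCache:
    """Hypothetical stand-in for the daemon's packet handling."""

    def __init__(self, simulate):
        self.simulate = simulate      # function (src, dst) -> destination host
        self.installed_flows = {}     # match -> host at the other end of the tunnel
        self.simulations = 0          # how many times the expensive path ran

    def on_packet(self, src, dst):
        match = (src, dst)            # assumed definition of "similar packets"
        if match not in self.installed_flows:
            # New flow: run the simulation and remember the result.
            self.simulations += 1
            self.installed_flows[match] = self.simulate(src, dst)
        return self.installed_flows[match]


vm_locations = {"vm-a": "host-1", "vm-b": "host-2"}   # made-up placement
cache = FlowCache(lambda src, dst: vm_locations[dst])
first = cache.on_packet("vm-a", "vm-b")
second = cache.on_packet("vm-a", "vm-b")
```

Only the first packet pays for the simulation; the second one is answered from the installed flow.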
  • 12. Midonet architecture [Diagram: hosts, storage cluster, API, IP bus]
  • 13. Constraints Consistency, Availability, Partition Tolerance [diagram labels: negotiable, critical] What happens if our service doesn't handle network partitions, faulty master, GC pauses, latency, lags, locks..? - Not just N users unable to see their profiles - But infrastructure failure in the entire datacentre
  • 14. Midokura's use case Coming to “NoSQL” not from “Big Data” but looking for specific mixes of ● Availability ● Fault tolerance ● Performance ● Durability ● Low operational cost How are Cassandra and Zookeeper useful?
  • 15. Virtual network state, assorted data, metrics https://cassandra.apache.org/
  • 16. Cassandra elevator pitch A massively scalable open source NoSQL database. Supports large amounts of structured, semi-structured, and unstructured data (key-value) across multiple data centers. Performance, availability, linear scalability, with no single point of failure (SPF)
  • 17. Cassandra architecture P2P, no privileged nodes, unified view [Diagram: clients; DC1, DC2, DC3]
  • 18. Fault tolerance Replication Factor = 3, Consistency level = QUORUM [Diagram: write(x) is acknowledged with ok by two replicas and FAILs on the faulty node; the client still gets ok]
  • 19. Fault tolerance Hinted handoff: coordinator holds data until faulty replica recovers w(x) ok
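A toy model of hinted handoff, assuming in-memory replicas and a coordinator that keeps hints in a plain list. The class and method names are invented for the sketch; Cassandra's real implementation also bounds how long hints are kept.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []               # (replica, key, value) awaiting delivery

    def write(self, key, value):
        """Write to live replicas, keep a hint for each one that is down."""
        acks = 0
        for replica in self.replicas:
            if replica.up:
                replica.data[key] = value
                acks += 1
            else:
                self.hints.append((replica, key, value))
        return acks

    def replay_hints(self):
        """Deliver stored hints to replicas that have recovered."""
        pending = []
        for replica, key, value in self.hints:
            if replica.up:
                replica.data[key] = value
            else:
                pending.append((replica, key, value))
        self.hints = pending


a, b, c = Replica("a"), Replica("b"), Replica("c")
coordinator = Coordinator([a, b, c])
c.up = False
acks = coordinator.write("x", 1)      # two live replicas acknowledge
c.up = True
coordinator.replay_hints()            # the recovered replica catches up
```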
  • 20. Fault tolerance RF = 3, CL = QUORUM [Diagram: read(x); the replicas answer x]
  • 21. Consistency RF = 3, CL = QUORUM [Diagram: read(x); one replica holds x', the others x] The coordinator waits until the CL can be satisfied across replicas (or fails) - CL can also be 1, 2, ALL..
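The arithmetic behind slides 18 to 21 is short enough to write down. QUORUM is a strict majority of the replicas, so with RF = 3 two answers are enough and one faulty node is tolerated. The helper names below are made up for the example, and the level names are written as ONE and TWO.

```python
def quorum(replication_factor):
    """Strict majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(consistency_level, replication_factor):
    levels = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def tolerated_failures(consistency_level, replication_factor):
    """How many replicas may be down while the request still succeeds."""
    return replication_factor - replicas_required(consistency_level, replication_factor)


def can_satisfy(consistency_level, replication_factor, live_replicas):
    return live_replicas >= replicas_required(consistency_level, replication_factor)
```

With RF = 3, `tolerated_failures("QUORUM", 3)` is 1 while `tolerated_failures("ALL", 3)` is 0, which is the trade-off the coordinator enforces when it waits or fails.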
  • 22. Consistency Read repair: an order is issued to the disagreeing node to reconcile its local copy.
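A minimal sketch of read repair, assuming each replica is a dict and each value carries a write timestamp so that the newest write wins. Treating x' as the newer value is an assumption of the example; the slide does not say which copy is stale.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    timestamp: int


def read_with_repair(replicas, key):
    """Return the newest value and the names of the replicas that were repaired."""
    responses = [store[key] for store in replicas.values() if key in store]
    if not responses:
        raise KeyError(key)
    newest = max(responses, key=lambda v: v.timestamp)
    repaired = []
    for name, store in replicas.items():
        current = store.get(key)
        if current is None or current.timestamp < newest.timestamp:
            store[key] = newest       # reconcile the stale local copy
            repaired.append(name)
    return newest.value, repaired


replicas = {
    "a": {"x": Versioned("x", 1)},
    "b": {"x": Versioned("x'", 2)},
    "c": {"x": Versioned("x", 1)},
}
value, repaired = read_with_repair(replicas, "x")
```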
  • 23. Multi DC Minimizes expensive network trips. RF = 6. CL = LOCAL_QUORUM (quorum inside the local DC), CL = EACH_QUORUM (quorum on each DC). [Diagram: client req; DC 1, DC 2]
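To make the two levels concrete, the sketch below counts the acknowledgements each one needs. It assumes RF = 6 is split as 3 replicas in each of two datacentres, which the slide does not state; the function names are illustrative.

```python
def quorum(replicas):
    return replicas // 2 + 1


def acks_needed(consistency_level, rf_per_dc, local_dc):
    """Acknowledgements required per datacentre for the given level."""
    if consistency_level == "LOCAL_QUORUM":
        return {local_dc: quorum(rf_per_dc[local_dc])}
    if consistency_level == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in rf_per_dc.items()}
    raise ValueError(f"level not covered by this sketch: {consistency_level}")


rf_per_dc = {"DC1": 3, "DC2": 3}      # assumed split of RF = 6
local = acks_needed("LOCAL_QUORUM", rf_per_dc, local_dc="DC1")
each = acks_needed("EACH_QUORUM", rf_per_dc, local_dc="DC1")
```

LOCAL_QUORUM only waits for replicas in the coordinator's own datacentre, so the request never blocks on a cross-datacentre round trip; EACH_QUORUM waits for a majority everywhere.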
  • 24. Latency + Throughput: W [Diagram: write(key, value) goes to the commit log on disk and to the memtable in memory, then returns ok; the memtable is flushed to an sstable with its index, and the commit log is cleaned] Minimize disk access. Immutability - Data on disk doesn't change - Saves IO sync locks - Requires async compaction
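The write path can be imitated in a few lines, with Python lists and tuples standing in for disk files. `TinyStore` and its flush threshold are invented for this sketch; it only shows why writes are cheap (append plus in-memory update) and why immutable sstables need compaction.

```python
class TinyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []          # append-only, stands in for the on-disk log
        self.memtable = {}            # mutable, in memory
        self.sstables = []            # immutable sorted tuples, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return "ok"

    def flush(self):
        """Freeze the memtable into a new sstable and clean the log."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest data wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all sstables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:
            merged.update(dict(table))
        self.sstables = [tuple(sorted(merged.items()))]


store = TinyStore(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(key, value)
tables_before = len(store.sstables)
store.compact()
```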
  • 25. Latency + Throughput: R [Diagram: read(key) checks the memtable in memory, then the sstables on disk through their index, and returns ok: X] Caches. Bloom filters.
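A Bloom filter lets a read skip sstables that certainly do not contain the key. The sketch below is a generic Bloom filter built on `hashlib`, with sizes chosen arbitrarily; it does not reproduce Cassandra's own filter. The `read` helper and its lookup counter are likewise assumptions of the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=64, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> p) & 1 for p in self._positions(key))


def read(key, memtable, sstables):
    """sstables is a list of (bloom_filter, data) pairs, newest last."""
    if key in memtable:
        return memtable[key], 0
    disk_lookups = 0
    for bloom, data in reversed(sstables):
        if not bloom.might_contain(key):
            continue                  # skip this sstable without touching it
        disk_lookups += 1
        if key in data:
            return data[key], disk_lookups
    return None, disk_lookups


def make_sstable(data):
    bloom = BloomFilter()
    for key in data:
        bloom.add(key)
    return bloom, data


sstables = [make_sstable({"a": 1}), make_sstable({"b": 2})]
value, lookups = read("a", memtable={}, sstables=sstables)
```

False positives are possible, so a filter can only save lookups, never change the answer.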
  • 26. Flexible data model user[“1”] = { name = Julius, email = jcaesar@senate.rom, state = stabbed } user[“2”] = { name = Marcus, email = mantonius@senate.rom } NAT[“”] = { ip = “”, port = “455”, ttl = .... } Simpler schema changes. Flexible (good on growth mode).
  • 27. Time series Column names are stored physically sorted ● Wide rows enable ordering + efficient filtering ● Pack together data that will be queried together Events (bad) <- applies SQL approach event[id] = {device=1, time=t1, val=1} event[id] = {device=1, time=t2, val=2} Events (better) event[device1] = { {time=t2, val=2}, {time=t1, val=3} .. } event[device2] = { {time=t3, val=3}, {time=t4, val=4} .. }
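The "better" layout on slide 27 keeps one row per device with its columns sorted by time, so a time-range query becomes a contiguous slice. A rough in-memory equivalent, using `bisect` to keep each row sorted (the class and method names are illustrative, and values are assumed numeric):

```python
import bisect
from collections import defaultdict


class WideRows:
    def __init__(self):
        self._rows = defaultdict(list)    # row key -> sorted list of (time, val)

    def insert(self, device, time, val):
        bisect.insort(self._rows[device], (time, val))

    def slice(self, device, start, end):
        """All (time, val) columns of one row with start <= time <= end."""
        row = self._rows.get(device, [])
        lo = bisect.bisect_left(row, (start,))
        hi = bisect.bisect_right(row, (end, float("inf")))
        return row[lo:hi]


events = WideRows()
events.insert("device1", 3, 30)
events.insert("device1", 1, 10)
events.insert("device1", 2, 20)
recent = events.slice("device1", 1, 2)
```

In the "bad" layout the same query would have to scan every event and filter by device and time.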
  • 28. Things to watch ● Data model highly conditioned by queries, vs. SQL's model for many possible queries ● Relearn performance tuning: GC, caches, IO patterns, repairs.. Understanding internals is as important as in SQL ● Counterintuitive internals, e.g.: expired data doesn't get deleted immediately (not even “soon”) ● ...
  • 29. Things to watch Know well how your clients handle failovers, and tune for your use case. E.g.: if we process a packet we want low latency and no failures, so: ● How long is a timeout? ● Retry to a different node or fail fast? ● How to distinguish node failure from a transient latency spike? ● How many nodes must be up to satisfy CL?
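One way to make those questions explicit is to put the timeout and the retry budget in the call signature. The sketch below is not any driver's API: nodes are modelled as plain functions, and `query_with_failover` is a hypothetical helper that tries a bounded number of nodes and then fails fast.

```python
def query_with_failover(nodes, request, timeout_s, max_attempts=2):
    """Try up to max_attempts nodes in order; fail fast once attempts run out."""
    failures = []
    for name, send in nodes[:max_attempts]:
        try:
            return name, send(request, timeout_s)
        except TimeoutError:
            failures.append(name)     # could be a dead node or a latency spike
    raise TimeoutError(f"no answer within {timeout_s}s from {failures}")


def slow_node(request, timeout_s):
    raise TimeoutError("simulated latency spike")


def healthy_node(request, timeout_s):
    return f"result for {request}"


nodes = [("n1", slow_node), ("n2", healthy_node), ("n3", healthy_node)]
answered_by, result = query_with_failover(nodes, "read(x)", timeout_s=0.05)
```

With `max_attempts=1` the same call raises instead of retrying, which is the fail-fast end of the trade-off.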
  • 30. Watch data changes, service discovery, coordination: Zookeeper “Because coordinating distributed systems is a Zoo” https://zookeeper.apache.org/
  • 31. Zookeeper ● High availability ● Performance (in memory, r > w). In memory: limits dataset size (backed by disk) ● Reliable delivery: if a node sees an update, all will eventually ● Total & causal order - Data is delivered in the same order it is sent - A message m is delivered only after all messages sent before m have been delivered
  • 32. Zookeeper architecture [Diagram: the leader (L) receives 1. update, sends 2. proposal, collects 3. ack! from each follower, then sends 4. commit] ● Paxos variant ● Ordered messages ● Atomic broadcasts ● Leader is not SPF: new one elected upon failure
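The update, proposal, ack, commit sequence can be mimicked with a leader that commits once a majority of the ensemble has acknowledged. This is a heavily simplified model with invented class names: it ignores leader election, epochs and recovery, all of which the real protocol handles.

```python
class Follower:
    def __init__(self, up=True):
        self.up = up
        self.log = []                 # proposals received, in order
        self.committed = []           # transaction ids committed

    def propose(self, zxid, update):
        """Return True as the ack if the proposal was logged."""
        if not self.up:
            return False
        self.log.append((zxid, update))
        return True

    def commit(self, zxid):
        if self.up:
            self.committed.append(zxid)


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.next_zxid = 1            # increasing ids give a total order
        self.committed = []

    def update(self, value):
        zxid = self.next_zxid
        self.next_zxid += 1
        acks = 1 + sum(f.propose(zxid, value) for f in self.followers)
        ensemble = len(self.followers) + 1
        if acks <= ensemble // 2:
            return False              # no majority, nothing is committed
        self.committed.append(zxid)
        for follower in self.followers:
            follower.commit(zxid)
        return True


f1, f2 = Follower(), Follower(up=False)
leader = Leader([f1, f2])
committed = leader.update("set /a = 1")   # 2 of 3 acks is a majority
f1.up = False
rejected = leader.update("set /a = 2")    # 1 of 3 acks is not
```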
  • 33. ZK Watchers /midonet /bridges /A /ports /1 = [.., peer = bridgeC/ports/79, .. ] /2 = [.., peer = routerX/ports/53, .. ] /B /ports /79 = [.., peer = bridgeA/ports/1, .. ] /routers ... [Diagram annotations: “Change here” on one node, “Notifies subscribers of these” on others]
  • 34. ZK Watchers Change: cut and add new device. [Diagram: binding changes on devices A, B and C trigger “update A!”, “update B!” and “update C!” notifications towards the hosts running the affected VMs] Important: we want to notify each node of relevant changes only!
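The "relevant changes only" requirement maps naturally onto per-path watches. The sketch below is an in-memory imitation, not a ZooKeeper client: `WatchedTree` is a made-up class, and it copies two properties of ZooKeeper watches, namely that they fire once and that the notification carries the path rather than the new data.

```python
from collections import defaultdict


class WatchedTree:
    def __init__(self):
        self.data = {}
        self.watchers = defaultdict(list)   # path -> callbacks waiting on it

    def watch(self, path, callback):
        self.watchers[path].append(callback)

    def set(self, path, value):
        self.data[path] = value
        # One-shot: watchers are removed as they fire and must re-register.
        for callback in self.watchers.pop(path, []):
            callback(path)


notified = []
tree = WatchedTree()
tree.watch("/midonet/bridges/A/ports/1", notified.append)
tree.watch("/midonet/bridges/B/ports/79", notified.append)
tree.set("/midonet/bridges/A/ports/1", "peer = bridgeC/ports/79")
```

Only the subscriber of the changed port is notified; the watcher on bridge B's port stays silent.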
  • 35. Remember the scale!
  • 36. Service discovery (WIP) Distributed service nodes (n1, n2, n3) register under /nodes (/n1, /n2, /n3); clients discover them and are notified when one goes down. Clients must know the ZK cluster (static) but not the service nodes (dynamic). Ephemeral nodes: if the session that created one dies, the node disappears.
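The ephemeral-node mechanism can be illustrated without a real ZooKeeper ensemble. In the sketch below a `Registry` class (hypothetical, in memory) ties every registered path to the session that created it and drops the path when that session dies.

```python
class Registry:
    def __init__(self):
        self.ephemeral = {}           # path -> id of the owning session
        self.events = []              # notifications clients would receive

    def register(self, session_id, path):
        self.ephemeral[path] = session_id
        self.events.append(("up", path))

    def session_died(self, session_id):
        """Remove every node created by the dead session and notify."""
        owned = [p for p, s in self.ephemeral.items() if s == session_id]
        for path in owned:
            del self.ephemeral[path]
            self.events.append(("down", path))

    def discover(self, prefix="/nodes/"):
        return sorted(p for p in self.ephemeral if p.startswith(prefix))


registry = Registry()
for session_id, name in enumerate(["n1", "n2", "n3"], start=1):
    registry.register(session_id, f"/nodes/{name}")
registry.session_died(2)              # n2 crashes or loses its session
alive = registry.discover()
```

Clients only need the registry's address; the list of live service nodes is whatever `discover` returns at that moment.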
  • 37. Q ? A : Thank you!