Continuous Deployment with C*:
Treating C* as First-Class Code
Michael Kjellman
@mkjellman
Software Engineer, Barracuda Networks
C* At Barracuda
• Powers 100% of our Spam and Webfilter Backend
• 48 Node Cluster
• 2 Datacenters
• Requests: 20k writes/sec, 30k reads/sec
• Latency: 1 ms/write, 1.6 ms/read
• > 30TB of Data
• Almost entirely native protocol/CQL3
Hardware Configuration
• 32GB of RAM
• 1x SSD
• 2x Spinning Disks
• 2x 6 Core AMD
Key Configuration Options
• key_cache_size_in_mb: 1024
• row_cache_size_in_mb: 0
• memtable_total_space_in_mb: 2048
• HEAP_NEWSIZE="1200M" (-Xmn)
• MAX_HEAP_SIZE="8G" (-Xmx)
• -XX:SurvivorRatio=6
• Sidenote: Java 7u40 is out!
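For illustration, the settings above might be laid out as follows. This is a sketch that assumes the stock file locations (conf/cassandra-env.sh for the JVM options, conf/cassandra.yaml for the cache and memtable options); the slides do not say where this deployment keeps them.

```shell
# conf/cassandra-env.sh (assumed stock location; values taken from the slide)
MAX_HEAP_SIZE="8G"      # handed to the JVM as -Xmx
HEAP_NEWSIZE="1200M"    # handed to the JVM as -Xmn
# Takes the place of whatever SurvivorRatio the stock file already sets.
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"

# conf/cassandra.yaml counterparts, shown as comments because they are YAML:
#   key_cache_size_in_mb: 1024
#   row_cache_size_in_mb: 0
#   memtable_total_space_in_mb: 2048
```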
How do I keep my graphs pretty during
a C* upgrade?
September 18th 2013
Make a C* Build
$> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
$> cd cassandra
$> git checkout -t origin/cassandra-1.2
$> git log
$> vim build.xml (change the version number every time you make a build!)
$> ant clean release
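The manual build.xml edit is easy to forget, so it could be scripted. The sketch below assumes the version is declared on a single line as a `base.version` property; the function name `set_build_version` and the example version suffix are hypothetical, not taken from the talk.

```shell
#!/bin/sh
# Rewrite the base.version property in a build.xml so that every custom
# build carries a distinct version string.
# Assumption: the version is declared on one line as
#   <property name="base.version" value="..."/>
set_build_version() {
    build_file=$1
    new_version=$2
    # Refuse to continue if the expected property is missing.
    grep -q 'name="base.version"' "$build_file" || return 1
    # Replace only the quoted value; the rest of the line is preserved.
    # Note: new_version must not contain "|" or "&" (sed metacharacters here).
    sed 's|\(name="base.version" value="\)[^"]*"|\1'"$new_version"'"|' \
        "$build_file" > "$build_file.tmp" && mv "$build_file.tmp" "$build_file"
}
```

A possible use before `ant clean release` would be `set_build_version build.xml 1.2.9-custom1`, where the suffix is whatever naming scheme identifies your own builds.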
Deployment
• Make release
• Test release with CCM
• Push release to Puppet (deals with config, etc)
• Run a controlled, scripted rolling restart, one datacenter
at a time
– flush
– stop
– start
– validate node
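The flush, stop, start, validate loop could be scripted along these lines. This is a minimal sketch, not the script used at Barracuda: it assumes nodes are reachable over ssh, that Cassandra runs as a service named "cassandra", and that a node counts as validated once `nodetool info` answers. The function names and the hosts-file layout are hypothetical.

```shell
#!/bin/sh
# Rolling restart sketch: flush, stop, start, validate, one node at a time.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Poll until the node answers nodetool again, or give up after $2 attempts.
# Assumption: a responding "nodetool info" is a good enough health check.
validate_node() {
    host=$1
    attempts=${2:-30}
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "nodetool -h $host info"
        return 0
    fi
    i=0
    while [ "$i" -lt "$attempts" ]; do
        nodetool -h "$host" info >/dev/null 2>&1 && return 0
        i=$((i + 1))
        sleep 10
    done
    return 1
}

restart_node() {
    host=$1
    run nodetool -h "$host" flush &&
    run ssh "$host" sudo service cassandra stop &&
    run ssh "$host" sudo service cassandra start &&
    validate_node "$host"
}

# Restart every host listed in a file (one hostname per line), stopping at
# the first node that fails so a bad build never reaches the whole ring.
# The list is read on fd 3 so that ssh cannot swallow the remaining hosts.
rolling_restart() {
    hosts_file=$1
    while read -r host <&3; do
        [ -n "$host" ] || continue
        restart_node "$host" || { echo "aborting at $host" >&2; return 1; }
    done 3< "$hosts_file"
}
```

To go one datacenter at a time, keep one hosts file per datacenter and call `rolling_restart dc1-hosts.txt` to completion before starting `rolling_restart dc2-hosts.txt`. Running with `DRY_RUN=1` first shows exactly what would be executed.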
Automate, Automate, Automate
So, why not just
apt-get install cassandra?
• Makes running a custom release in the future a
complete nightmare
• You lose visibility into the changes in the release
• WHY are you upgrading?
• Treat a C* build just as if it were a release of your
own code. What commits did you put into your own
release?
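One way to answer that question for a C* build is to list the commits between the build currently deployed and the one about to go out. A small sketch, with `commits_between` as a hypothetical helper name:

```shell
#!/bin/sh
# Print one line per commit that is in the new build but not in the old one.
# Arguments are any two git refs (tags, branches, or commit hashes).
commits_between() {
    old_ref=$1
    new_ref=$2
    git log --oneline --no-merges "$old_ref..$new_ref"
}
```

Run inside the cassandra checkout, something like `commits_between cassandra-1.2.9 origin/cassandra-1.2` would list what the branch has gained since that tag; the two refs here are only examples.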
Simply Put:
MY CODE DOESN’T WORK WITHOUT A
STABLE C* CLUSTER
When things go wrong
• Every commit (those by C* committers or my
own) comes with potential bugs and regressions
• Gossip Bugs Can Bite Hard:
– CASSANDRA-5665: Gossiper.handleMajorStateChange
can lose existing node ApplicationState
• At 48 nodes, even small mistakes are massive
Writing your code to deal with node
failure
• Upgrading a C* cluster means constant node
failures for the duration of the rolling restart
• How does your code deal with read latency and
retries?
– CASSANDRA-4705: Eager Retries for reads for 2.0+
• The mythical “constantly failing” code != stability.
– Handle exceptions (and node/read failures) gracefully!
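As an illustration of handling failures gracefully, a request that hits a restarting node can be retried with a growing delay instead of failing outright. The sketch below uses shell only as a stand-in; the same shape applies in whatever language the application is written in. The `retry` helper is hypothetical and not part of any C* tooling.

```shell
#!/bin/sh
# Run a command, retrying with exponential backoff when it fails.
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success on any attempt ends the loop immediately.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))       # double the wait before the next try
        attempt=$((attempt + 1))
    done
}
```

For example, `retry 5 1 some_read_command` would try up to five times, waiting 1, 2, 4, and 8 seconds between attempts. Capping the attempts matters: retrying forever only hides the failure from the caller.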
Why treat C* like your own code
• Using C* will move much of your own
application logic to C*
• The bugs have to go somewhere!
• Data replication at database layer or at
application layer
QUESTIONS?
Thanks for Listening!
