From the front lines
Saving stranded clusters
#CassandraSummit
Who am I and what do I do?
• Ben Bromhead
• Co-founder and CTO of Instaclustr -> www.instaclustr.com
• Instaclustr provides Cassandra-as-a-Service in the cloud.
• Currently on AWS; Google Cloud is in private beta, with more to come.
• We currently manage 50+ nodes for various customers, who do
various things with them.
In the beginning…
• Well designed schemas
• No data migrations
• Everything was perfect and happy
• Then we got customers
Our first C* patch
• CASSANDRA-6521
• Cassandra wouldn’t check the length of a column name in a range
predicate for slice operations.
• So for large column names it would throw an assertion error.
• Which would in turn tie up threads, causing the node, and
eventually the whole cluster, to become unresponsive.
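A minimal sketch of the kind of up-front check that was missing, not the actual patch. It assumes the usual 64 KB column name limit (names are length-prefixed with an unsigned 16-bit integer); `InvalidRequest` and `validate_slice_range` are hypothetical names used only for illustration.

```python
# Assumption: column names are serialized with an unsigned 16-bit length,
# so anything longer than 65535 bytes cannot be a valid name.
MAX_NAME_LENGTH = 65535


class InvalidRequest(Exception):
    """Hypothetical error type: the request is rejected during validation."""


def validate_slice_range(start: bytes, finish: bytes) -> None:
    """Reject a slice range whose bounds could never be valid column names.

    Failing here returns a clean error to the client instead of letting an
    assertion fire deeper in the read path, where it ties up a thread.
    """
    for label, name in (("start", start), ("finish", finish)):
        if len(name) > MAX_NAME_LENGTH:
            raise InvalidRequest(
                f"range {label} length {len(name)} exceeds maximum of {MAX_NAME_LENGTH}"
            )
```

With a check like this, a 130 KB range bound is refused at the door rather than taking a node down.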
Our first C* patch
• What was the size of the column name that would cause this issue?
• Around 130 KB
• wat…
Our first migration
• Receive frantic phone call
• Self-managed cluster has been down for 48 hours, for a company
that gets 25 million monthly unique views.
• They are hurting
Our first migration
• The cluster was running a very early version of C* 2.0
• Update/patch the old cluster, get everything back online
• Start the migration process…
Our first migration
• The bulk load manages to kill their new cluster with us in about 5
minutes.
• Open logs
• Read 1 live and 38456 tombstoned cells (see
tombstone_warn_threshold)
• For every column family
• wat…
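A rough sketch of why that log line hurts, not Cassandra's actual read path. It assumes the stock cassandra.yaml defaults (tombstone_warn_threshold of 1000, tombstone_failure_threshold of 100000); `Cell`, `TombstoneOverwhelming` and `scan_cells` are hypothetical names used only for illustration.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("tombstone_sketch")

# Assumed defaults, mirroring the cassandra.yaml settings of the same names.
TOMBSTONE_WARN_THRESHOLD = 1000
TOMBSTONE_FAILURE_THRESHOLD = 100000


@dataclass
class Cell:
    name: str
    deleted: bool  # True means the cell is a tombstone


class TombstoneOverwhelming(Exception):
    """Hypothetical error: the read scanned too many tombstones and is aborted."""


def scan_cells(cells, warn=TOMBSTONE_WARN_THRESHOLD, fail=TOMBSTONE_FAILURE_THRESHOLD):
    """Return (live_cells, tombstone_count) for one read.

    Every tombstone still has to be read and skipped, so a query returning
    one live cell can do the work of tens of thousands.
    """
    live = []
    tombstones = 0
    for cell in cells:
        if cell.deleted:
            tombstones += 1
            if tombstones > fail:
                raise TombstoneOverwhelming(f"scanned over {fail} tombstones")
        else:
            live.append(cell)
    if tombstones > warn:
        log.warning(
            "Read %d live and %d tombstoned cells (see tombstone_warn_threshold)",
            len(live),
            tombstones,
        )
    return live, tombstones
```

Feeding it 1 live cell and 38456 tombstones reproduces the warning above: the read succeeds, but only after wading through every deleted cell.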
Conclusion
• Everything is awesome
• Then reality occurs
• It’s actually way more fun
• Want to make C* even better? We are hiring!

Apache Cassandra Management
