Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known



A brief intro to how Barracuda Networks uses Cassandra and how they are replacing their MySQL infrastructure with it. The presentation covers the lessons they have learned during this migration.

Speaker: Michael Kjellman, Software Engineer at Barracuda Networks

Michael Kjellman is a Software Engineer, from San Francisco, working at Barracuda Networks. Michael works across multiple products, technologies, and languages. He primarily works on Barracuda's spam infrastructure and web filter classification data.

Published in: Technology

  • -usage changed and significantly increased
  • It’s never really real time. Is it 1 second? 3 seconds? 1 hour? When do you have a business problem because you are not “real-time” enough?
  • -We had a technical “realtime” issue that translated (more importantly) into a business problem: we weren’t catching spam fast enough
    -Example: 8 minutes from the first hit to classification translated into 180 messages in customers’ inboxes
    -How to close that gap to near zero?
    -New system classified the same domain in 3 seconds from the first hit: 0 messages in customers’ inboxes
  • Our Rewrite by the numbers
  • -The data grows as business continues to grow and there is a need to consolidate and aggregate data across products and systems
  • -What does “legacy” bring to mind at most companies? Ops team duct tape (the data has a life of its own)
    -Over time, the various layers of duct tape make operations harder and harder
    -Systems are built with good intentions but frequently hit an inflection point where the underlying database problem can’t be fixed anymore
    -Duct tape isn’t good enough anymore:
      -add a slave
      -add memcache
      -attempt to better batch queries
  • -If the legacy system is preventing implementation, then a new system design is required
    -Our inflection point: throwing away valuable data to keep the system stable
    -Five years ago, continuous failure in your persistence tier was virtually unthinkable
  • -Getting technical buy-in from all parties that C* and other tools were the “right” tools going forward
    -Had to engineer our migration and rewrite in stages to provide tangible business value earlier
    -Couldn’t just “go away” for a year and promise a perfect solution sometime down the road
    -Business requirements forced a hybrid period with the old and new systems operated in parallel
  • -The up front costs are high, but the ability to implement anything going forward is a powerful proposition.
  • -The old problems won’t go away during the migration
    -Prepare to manage expectations that things might get worse before they get better
  • -C* != “#dontneedtothinkaboutmyschema”
    -Counters and Composites
    -Optimize for use case
    -Don’t be afraid of writes. Storage is cheap. If multiple writes make for a cleaner, simpler read path, do it.
    -Optimize to reduce the number of tombstones
  • -Talk about the first iteration, where I also tried the select * approach to prefill our cache. Not necessary and, more importantly, bad design.
    -MySQL / relational database mentality of batch retrieval
    -Possible to get the same result, but it required different thinking and logic
  • -Almost impossible to get it right the first time
    -Give example of elements that were stored in MySQL incorrectly with a timestamp of 0 (the epoch). I incorrectly assumed that every valid timestamp would be > 0, so our initial sync missed all elements with the incorrect timestamp of 0
    -How we had to split up our sync code into pieces
    -How important is the speed of your syncing?
  • -Give example of bcd: to remove and make external changes in the hashtable, bcd would read from MySQL every n seconds (select *) and then delete everything after retrieving the records
    -Goes back to article number 2
  • -If MySQL was the bottleneck before, after migrating to C* other elements might now become the bottleneck
  • -Deploying changes to distributed systems is more complicated and more prone to human error
    -Give example of a person who tried to manually upgrade a 30+ node cluster and made a human error, which resulted in the app being down
    -With distributed systems comes more complication, and minor mistakes can lead to cascading failures

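The timestamp note above (elements stored in MySQL with an epoch timestamp of 0, silently skipped by a `> 0` filter) can be illustrated with a small sketch. This is not the actual sync code from the talk: the row layout and the function names `select_for_sync_buggy` and `select_for_sync` are hypothetical, chosen only to show how a strict comparison drops rows and how a sync step might set them aside for review instead.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]  # hypothetical row shape: {"id": ..., "ts": ...}


def select_for_sync_buggy(rows: List[Row]) -> List[Row]:
    """Mirrors the mistaken assumption: only rows with ts > 0 are synced.

    Rows that were written with a timestamp of 0 are silently dropped.
    """
    return [row for row in rows if row["ts"] > 0]


def select_for_sync(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Syncs nothing blindly: returns (valid, suspect) so no row is lost.

    Rows with a non-positive timestamp are kept in a separate list so they
    can be repaired or migrated deliberately rather than skipped.
    """
    valid: List[Row] = []
    suspect: List[Row] = []
    for row in rows:
        if row["ts"] > 0:
            valid.append(row)
        else:
            suspect.append(row)
    return valid, suspect


# Hypothetical sample data: one row has the bad timestamp of 0.
sample_rows = [
    {"id": 1, "ts": 0},
    {"id": 2, "ts": 1370000000},
    {"id": 3, "ts": 1370000500},
]

synced_buggy = select_for_sync_buggy(sample_rows)
valid_rows, suspect_rows = select_for_sync(sample_rows)
```

With the sample data, the buggy filter returns two rows and loses row 1, while the second version accounts for all three rows across its two lists. Comparing row counts between source and target after each sync pass is one cheap way to catch this kind of gap early.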
    1. 1. Hindsight is 20/20: MySQL to Cassandra
        Michael Kjellman (@mkjellman)
        Barracuda Networks
    2. 2. What I Do
        • Build and maintain “real-time” spam detection and Web Filter classification
        • Java/Perl/C (and bits of everything else)
        • Author of perlcassa (Perl C* client)
        • Frontend? Backend? Customer? Internal? Broken RAID card? Bad disk? I touch it all.
    3. 3. Our C* Cluster
        • In production for ~2 years since 0.8
        • Running 1.2.5 + minor patches
        • 24 nodes in 2 datacenters
        • (2) 2TB hard drives (no RAID)
        • (1) small SSD for small hot CFs
        • 64GB of RAM
        • Puppet for management
        • Cobbler for deployment
        • Target max load at 600GB per node
    4. 4. What is “real-time” exactly?
    5. 5. Our Rewrite by the Numbers

                                                      Cassandra Based   MySQL Based
        Average Application Latency                   2.41ms            5.0ms
        Elements in Database                          32,836,767        3,946,713
        Elements Application Handles                  32,836,767        314,974
        Element Seen Prior to Tracking                1st request       Various Thresholds
        Datacenters                                   2                 1
        Average Latency of Automated Classification   3 seconds         8 minutes
    6. 6. Should you Rewrite?
        • How To Survive a Ground-Up Rewrite Without Losing Your Sanity[1] – Joel Spolsky
        • Past engineering decisions preventing implementation of new business requirements
        • New threats smarter and more targeted
        [1]
    7. 7. Evolving Legacy Systems
        • Even good developers can write sloppy code
        • Too much duct tape
            – Most layers applied around the database
    8. 8. Hitting the Reset Button
        • Plan for continuous failure
        • Easily scalable
        • No single point of failure (that you know of)
        • Many smaller boxes vs. one monolithic box
    9. 9. Whiteboard to Reality
        • Get technical buy-in from all parties
        • Migrate and rewrite in stages
            – Business requirements forced a hybrid period with the old and new systems operated in parallel
    10. 10. Cassandra is Not…
        1. Direct MySQL replacement
        2. Magic bullet to solve everything
    11. 11. Migrating
        • Painful
        • Painful
        • Painful
        • Tons of rewriting
        • Tons of regressions
        • Did I say painful?
    12. 12. So Why Migrate?
        • C* is the best option for persistence tier
        • Business success motivation
        • Don’t let your database hold you back
    13. 13. Lessons Learned (the good)
        • Carefully defining the data model up front
        • Creating a flexible systems architecture that adapts well to changes during implementation
        • Seriously: “Measure twice, cut once.”
    14. 14. Lessons Learned (the bad)
        • Consider migration and delivery requirements from the very beginning
        • Adjust expectations: we didn’t expect to rely on legacy systems for so long
        • Make syncing data between systems a priority
    15. 15. Tips
        1. Start with the queries
        2. Think differently regarding reads
        3. Syncing and migrating data
        4. Don’t use C* as a queue
        5. Estimate capacity
        6. Automate, Automate, Automate
        7. Some maintenance required
    16. 16. 1. Start with the Queries
        • C* != “#dontneedtothinkaboutmyschema”
        • Counters and Composites
        • Optimize for use case
            – Don’t be afraid of writes. Storage is cheap.
            – Optimize to reduce the number of tombstones
    17. 17. 2. Think Differently Regarding Reads
        • Do you really need all that data at once?
        • mysql> SELECT * FROM mysupercooltable WHERE foo = 'bar';
            – Slow, but eventually will work
        • cqlsh> SELECT * FROM myreallybigcf WHERE foo = 'bar';
            – Won’t work. Expect RPC timeout exceptions on reads, generally after ~10,000 rows, even with paging
        • Our solutions:
            – ElasticSearch
            – Hadoop/Pig
    18. 18. 3. Syncing and Migrating Data
        • Sync and migration scripts: take them more seriously than production code
        • Design the sync to be continuous, with both systems running in parallel during migration
        • Prioritize the sync
    19. 19. 4. Don’t use C* as a Queue
        • Cassandra anti-patterns: Queues and queue-like datasets[2] – Aleksey Yeschenko
        • Tombstones + read performance
        • Our solution:
            – Kafka (multiple publisher, multiple consumer durable queue)
        [2]
    20. 20. 5. Estimate Capacity
        • Don’t forget the Java heap (8GB max)
        • Plan capacity, today and future
        • Stress tool: profile a node and multiply
        • MySQL hardware != Cassandra hardware
        • New bottlenecks thanks to C* being so awesome?
        • I/O still an important concern with C*
    21. 21. 6. Automate, Automate, Automate
        • Love your inner Ops self. Distributed systems move complexity to operations.
        • Puppet or something similar (really)
        • Learn CCM earlier rather than later
            –
    22. 22. 7. Some Maintenance Required
        • Repairs & cleanup ops
            – automate and run frequently
        • Rolling restart, meet rolling repair
        • Learn jconsole
        • Solution:
            – Jolokia (JMX via HTTP)
    23. 23. Where is Barracuda Today?
        • 2 years in production with Cassandra
        • Definitely the right choice for our persistence tier
        • 2 product lines on C* based system and another major product in beta
        • Achieved “real-time” response
    24. 24. 2.0 and Beyond
        • Thrift -> CQL
        • CQL helps the MySQL to C* migration
            – Easier to comprehend / grasp
        • Everyone understands SELECT * FROM cf WHERE key = 'foo';
        • CAS and other 2.0 features make C* an even better replacement option for MySQL
    25. 25. C* Community
        • Supercalifragilisticexpialidocious community!
        • Riak, HBase, Oracle are other options. How is their dev community?
        • Great client support. Great people. Great motivated developers.
        • IRC: #cassandra on freenode
        • Mailing List:
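Going back to tip 5 (Estimate Capacity), the "profile a node and multiply" idea can be sketched as simple arithmetic. The 600GB per-node target comes from the cluster slide; the replication factor, the growth multiplier, the 2000GB example dataset, and the function name `nodes_needed` are assumptions made for illustration only. This rough estimate covers disk load alone and ignores heap, I/O, and compaction headroom, all of which the talk flags as separate concerns.

```python
import math


def nodes_needed(
    raw_data_gb: float,
    replication_factor: int,
    target_load_gb_per_node: float = 600.0,  # target max load from the cluster slide
    growth_factor: float = 1.0,              # assumed multiplier for future growth
) -> int:
    """Rough node count so that no node exceeds the target on-disk load.

    total stored data = raw data * replication factor * growth factor
    nodes             = ceil(total stored data / target load per node)
    """
    if raw_data_gb < 0 or replication_factor < 1 or target_load_gb_per_node <= 0:
        raise ValueError("inputs must be positive")
    total_gb = raw_data_gb * replication_factor * growth_factor
    return math.ceil(total_gb / target_load_gb_per_node)


# Hypothetical example: 2000GB of raw data, replication factor 3 (assumed).
today = nodes_needed(2000, 3)                       # 6000GB / 600GB = 10 nodes
future = nodes_needed(2000, 3, growth_factor=2.0)   # 12000GB / 600GB = 20 nodes
```

Running the estimate for both "today" and "future" keeps the second bullet of the tip honest: the cluster is sized for where the data is going, not only for where it is now.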