Hindsight is 20/20: MySQL to Cassandra
Michael Kjellman (@mkjellman)
Barracuda Networks
#cassandra13
Hindsight is 20/20: MySQL to Cassandra

Learn about Barracuda Networks and our transition from MySQL to Cassandra. I gave this talk at the 2013 Cassandra Summit.

  • Usage changed and significantly increased.
  • It’s never really real time. Is it 1 second? 3 seconds? 1 hour? When do you have a business problem because you are not “real-time” enough?
  • We had a technical “real-time” issue that translated (more importantly) into a business problem: we weren’t catching spam fast enough. Example: vimaseg.com.br took 8 minutes from the first hit to classification, which translated into 180 messages in customers’ inboxes. How do we close that gap to near zero? The new system classified the same domain 3 seconds after the first hit, with 0 messages in customers’ inboxes.
  • Our Rewrite by the numbers
  • The data grows as the business continues to grow, and there is a need to consolidate and aggregate data across products and systems.
  • What does “legacy” bring to mind at most companies? Ops team duct tape (the data has a life of its own). Over time, the various layers of duct tape make operations harder and harder. Systems are built with good intentions but frequently hit an inflection point where the underlying database problem can’t be fixed anymore. Duct tape isn’t good enough anymore: add a slave, add memcache, attempt to better batch queries.
  • If the legacy system is preventing implementation, then a new system design is required. Our inflection point: throwing away valuable data to keep the system stable. Five years ago, continuous failure in your persistence tier was virtually unthinkable.
  • Getting technical buy-in from all parties that C* and other tools were the “right” tools going forward. Had to engineer our migration and rewrite in stages to provide tangible business value earlier. Couldn’t just “go away” for a year and promise a perfect solution sometime down the road. Business requirements forced a hybrid period with the old and new systems operated in parallel.
  • The up-front costs are high, but the ability to implement anything going forward is a powerful proposition.
  • The old problems won’t go away during the migration. Prepare to manage expectations that things might get worse before they get better.
  • What kind of queries will your application make? Do you need ordered results for all of your rows? (Solr or ElasticSearch.) What is your read load? What is your write load? It almost certainly won’t be what you think it is. Get real numbers.
  • C* != “#dontneedtothinkaboutmyschema”. Counters and composites. Optimize for the use case. Don’t be afraid of writes: storage is cheap, so if multiple writes make for a cleaner, simpler read path, do it. Optimize to reduce the number of tombstones.
  • Talk about the first iteration, where I also tried the SELECT * approach to prefill our cache. Not necessary and, more importantly, bad design. It is the MySQL / relational database mentality of batch retrieval. It was possible to get the same result, but it required different thinking and logic.
  • Almost impossible to get it right the first time. Give the example of elements that were stored in MySQL incorrectly with a timestamp of 0 (the epoch). I incorrectly assumed that only timestamps > 0 would be valid, so our initial sync missed all elements with the incorrect timestamp of 0. How we had to split up our sync code into pieces. How important is the speed of your syncing?
  • Give the example of bcd: to remove entries from and make external changes to the hashtable, bcd would read from MySQL every n seconds (SELECT *) and then delete everything after retrieving the records. Goes back to article number 2.
  • If MySQL was the bottleneck before, other elements might become the bottleneck after migrating to C*.
  • Deploying changes to distributed systems is more complicated and more prone to human error. Give the example of a person who tried to manually upgrade a 30+ node cluster and made a mistake that resulted in the app being down. With distributed systems comes more complication, and minor mistakes can lead to cascading failures.
  • Hindsight is 20/20: MySQL to Cassandra

    1. Hindsight is 20/20: MySQL to Cassandra
       Michael Kjellman (@mkjellman)
       Barracuda Networks
       #cassandra13
    2. What I Do
       • Build and maintain “real-time” Spam detection and Web Filter classification
       • Java/Perl/C (and bits of everything else)
       • Author perlcassa (Perl C* client)
       • Frontend? Backend? Customer? Internal? Broken RAID Card? Bad Disk? I touch it all.
    3. Our C* Cluster
       • In production for ~2 years since 0.8
       • Running 1.2.5 + minor patches
       • 24 nodes in 2 datacenters
       • (2) 2TB Hard Drives (no RAID)
       • (1) Small SSD for small hot CFs
       • 64GB of RAM
       • Puppet for management
       • Cobbler for deployment
       • Target max load at 600GB per node
    4. What is “real-time” exactly?
    5. (no slide text)
    6. Our Rewrite by the Numbers

       Metric                                        Cassandra Based   MySQL Based
       Average Application Latency                   2.41ms            5.0ms
       Elements in Database                          32,836,767        3,946,713
       Elements Application Handles                  32,836,767        314,974
       Element Seen Prior to Tracking                1st request       Various Thresholds
       Datacenters                                   2                 1
       Average Latency of Automated Classification   3 seconds         8 minutes
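To put the table in relative terms, the ratios can be worked out directly. This is only arithmetic on the figures shown on the slide; the variable names are made up for illustration.

```python
# Figures copied from the "Our Rewrite by the Numbers" slide.
MYSQL_CLASSIFY_SECONDS = 8 * 60      # "8 minutes"
CASSANDRA_CLASSIFY_SECONDS = 3       # "3 seconds"
MYSQL_LATENCY_MS = 5.0
CASSANDRA_LATENCY_MS = 2.41
MYSQL_ELEMENTS = 3_946_713
CASSANDRA_ELEMENTS = 32_836_767

# How many times faster automated classification became.
classification_speedup = MYSQL_CLASSIFY_SECONDS / CASSANDRA_CLASSIFY_SECONDS

# How many times lower the average application latency became.
latency_speedup = MYSQL_LATENCY_MS / CASSANDRA_LATENCY_MS

# How many times more elements the database holds.
element_growth = CASSANDRA_ELEMENTS / MYSQL_ELEMENTS

print(f"classification: {classification_speedup:.0f}x faster")
print(f"application latency: {latency_speedup:.2f}x lower")
print(f"elements stored: {element_growth:.2f}x more")
```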
    7. Should you Rewrite?
       • How To Survive a Ground-Up Rewrite Without Losing Your Sanity[1] – Joel Spolsky
       • Past engineering decisions preventing implementation of new business requirements
       • New threats smarter and more targeted
       [1] http://onstartups.com/tabid/3339/bid/97052/How-To-Survive-a-Ground-Up-Rewrite-Without-Losing-Your-Sanity.aspx
    8. Evolving Legacy Systems
       • Even good developers can write sloppy code
       • Too much duct tape
         – Most layers applied around the database
    9. Hitting the Reset Button
       • Plan for continuous failure
       • Easily Scalable
       • No Single Point of Failure – that you know of
       • Many smaller boxes vs. one monolithic box
    10. Whiteboard to Reality
       • Get technical buy-in from all parties
       • Migrate and rewrite in stages
         – Business requirements forced hybrid period with the old and new systems operated in parallel
    11. (no slide text)
    12. Cassandra is Not…
       1. Direct MySQL replacement
       2. Magic bullet to solve everything
    13. Migrating
       • Painful
       • Painful
       • Painful
       • Tons of rewriting
       • Tons of regressions
       • Did I say painful?
    14. So Why Migrate?
       • C* is the best option for persistence tier
       • Business success motivation
       • Don’t let your database hold you back
    15. Lessons Learned (the good)
       • Carefully defining data model up front
       • Creating a flexible systems architecture that adapts well to changes during implementation
       • Seriously – “Measure twice, cut once.”
    16. Lessons Learned (the bad)
       • Consider migration and delivery requirements from the very beginning
       • Adjust expectations – didn’t expect relying on legacy systems for so long
       • Make syncing data between systems a priority
    17. Tips
       1. Define requirements early
       2. Start with the queries
       3. Think differently regarding reads
       4. Syncing and migrating data
       5. Don’t use C* as a queue
       6. Estimate capacity
       7. Automate, Automate, Automate
       8. Some maintenance required
    18. 1. Define Requirements Early
       • What kind of queries will your application make?
       • Do you need ordered results for all of your rows?
       • What is your read load? Write load?
    19. 2. Start with the Queries
       • C* != “#dontneedtothinkaboutmyschema”
       • Counters and Composites
       • Optimize for use case
         – Don’t be afraid of writes. Storage is cheap.
         – Optimize to reduce the number of tombstones
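The "don't be afraid of writes" point can be sketched as writing each record once per query it must serve, so every read is a single keyed lookup. The snippet below is a toy in-memory stand-in, not a Cassandra client; `DenormalizedStore`, its method names, and the `domain`/`day`/`verdict` fields are hypothetical and do not come from the talk.

```python
from collections import defaultdict


class DenormalizedStore:
    """Toy stand-in for two query-specific tables (hypothetical schema)."""

    def __init__(self):
        # One "table" per query the application needs to answer.
        self.by_domain = defaultdict(list)
        self.by_day = defaultdict(list)

    def record_hit(self, domain, day, verdict):
        # Two writes per hit: storage is cheap, and both reads stay simple.
        row = {"domain": domain, "day": day, "verdict": verdict}
        self.by_domain[domain].append(row)
        self.by_day[day].append(row)

    def hits_for_domain(self, domain):
        # Single keyed lookup, no scan and no join.
        return list(self.by_domain.get(domain, []))

    def hits_for_day(self, day):
        return list(self.by_day.get(day, []))


store = DenormalizedStore()
store.record_hit("example.com", "2013-06-11", "spam")
store.record_hit("example.org", "2013-06-11", "clean")
store.record_hit("example.com", "2013-06-12", "spam")

print(len(store.hits_for_domain("example.com")))
print(len(store.hits_for_day("2013-06-11")))
```

The trade-off is the usual one: the write path does more work and the copies must be kept consistent, in exchange for a read path that never needs a scan.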
    20. 3. Think Differently Regarding Reads
       • Do you really need all that data at once?
       • mysql> SELECT * FROM mysupercooltable WHERE foo = 'bar';
         – Slow, but eventually will work
       • cqlsh> SELECT * FROM myreallybigcf WHERE foo = 'bar';
         – Won’t work. Expect RPC timeout exceptions on reads generally after ~10,000 rows even with paging
       • Our solutions:
         – ElasticSearch
         – Hadoop/Pig
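One way to act on "do you really need all that data at once?" is to fetch single rows on demand instead of prefilling a cache with a full-table read. The following is a minimal sketch of a read-through cache under that assumption; `ReadThroughCache` and `fetch_one` are hypothetical names, and `fetch_one` stands in for whatever single-key lookup the application uses. It is not the design used at Barracuda.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Small LRU cache that loads one key at a time on a miss."""

    def __init__(self, fetch_one, max_entries=10_000):
        self._fetch_one = fetch_one      # callable: key -> value (one keyed read)
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key in self._entries:
            # Hit: mark as most recently used and return without touching the database.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Miss: one small keyed read instead of a bulk SELECT *.
        value = self._fetch_one(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value


# Usage with a fake backing store that counts how often it is queried.
calls = []


def fake_fetch(key):
    calls.append(key)
    return f"value-for-{key}"


cache = ReadThroughCache(fake_fetch, max_entries=2)
cache.get("a")
cache.get("a")   # served from the cache
cache.get("b")
cache.get("c")   # evicts "a", the least recently used key
cache.get("a")   # fetched again after eviction
print(calls)
```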
    21. 4. Syncing and Migrating Data
       • Sync and migration scripts – take more seriously than production code
       • Design sync to be continuous with both systems running in parallel during migration
       • Prioritize the sync
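The speaker notes for this slide describe an initial sync that skipped every element stored with a timestamp of 0 because the filter assumed only values greater than 0 were valid. A sketch of that class of bug, together with a verification step that catches it, might look like this. The row layout and function names are hypothetical.

```python
def select_for_sync_buggy(rows):
    # Mirrors the mistaken assumption: treats timestamp 0 as "not a real row".
    return [row for row in rows if row["ts"] > 0]


def select_for_sync(rows):
    # Takes every row that has a timestamp at all, including the epoch (0).
    return [row for row in rows if row["ts"] is not None and row["ts"] >= 0]


def missing_ids(source_rows, synced_rows):
    # Verification pass: ids present in the source but absent after the sync.
    synced = {row["id"] for row in synced_rows}
    return sorted(row["id"] for row in source_rows if row["id"] not in synced)


source = [
    {"id": 1, "ts": 1370908800},
    {"id": 2, "ts": 0},            # stored incorrectly, but still real data
    {"id": 3, "ts": 1370995200},
]

print(missing_ids(source, select_for_sync_buggy(source)))
print(missing_ids(source, select_for_sync(source)))
```

Comparing source and target after each run, rather than trusting the filter, is what turns this from a silent data loss into a visible discrepancy.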
    22. 5. Don’t use C* as a Queue
       • Cassandra anti-patterns: Queues and queue-like datasets[2] – Aleksey Yeschenko
       • Tombstones + read performance
       • Our solution:
         – Kafka (multiple publisher, multiple consumer durable queue)
       [2] http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
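Why tombstones hurt queue-like reads can be illustrated with a toy model: a delete leaves a marker behind instead of removing the cell, so reading the head of the "queue" has to step over every marker left by earlier consumers. This is a deliberately simplified simulation with hypothetical names; it ignores compaction and gc grace and is not how Cassandra stores data internally.

```python
TOMBSTONE = object()  # marker left behind by a delete in this toy model


class QueueLikePartition:
    """Toy model of one partition used as a FIFO queue."""

    def __init__(self):
        self.cells = []

    def enqueue(self, item):
        self.cells.append(item)

    def dequeue(self):
        # Consume the first live cell and leave a tombstone in its place.
        for index, cell in enumerate(self.cells):
            if cell is not TOMBSTONE:
                self.cells[index] = TOMBSTONE
                return cell
        return None

    def peek(self):
        # Returns (first live item, number of cells scanned to find it).
        scanned = 0
        for cell in self.cells:
            scanned += 1
            if cell is not TOMBSTONE:
                return cell, scanned
        return None, scanned


partition = QueueLikePartition()
for job in range(1000):
    partition.enqueue(job)
for _ in range(999):
    partition.dequeue()

item, cells_scanned = partition.peek()
print(item, cells_scanned)
```

In this model a single live item costs a scan over everything consumed before it, which is the read-performance problem the slide points at.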
    23. 6. Estimate Capacity
       • Don’t forget the Java heap (8GB Max)
       • Plan capacity – today and future
       • Stress Tool – profile node and multiply
       • MySQL hardware != Cassandra hardware
       • New bottlenecks thanks to C* being so awesome?
       • I/O still an important concern with C*
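"Profile node and multiply" can be turned into a back-of-the-envelope calculation. The sketch below uses the 24 nodes and the 600GB per-node target from the cluster slide; the replication factor of 3 is purely an assumption for illustration, since the talk does not state it, and the function names are hypothetical.

```python
import math


def nodes_needed(unique_data_gb, replication_factor, target_load_gb_per_node):
    # Every unique byte is stored replication_factor times across the cluster.
    total_gb = unique_data_gb * replication_factor
    return math.ceil(total_gb / target_load_gb_per_node)


def unique_capacity_gb(nodes, target_load_gb_per_node, replication_factor):
    # Unique data the cluster can hold while staying at or under the per-node target.
    return nodes * target_load_gb_per_node / replication_factor


NODES = 24                 # from the cluster slide
TARGET_LOAD_GB = 600       # from the cluster slide
ASSUMED_RF = 3             # assumption, not stated in the talk

capacity = unique_capacity_gb(NODES, TARGET_LOAD_GB, ASSUMED_RF)
print(f"unique data at target load: {capacity:.0f} GB")
print(f"nodes for 6000 GB unique: {nodes_needed(6000, ASSUMED_RF, TARGET_LOAD_GB)}")
```

Disk is only one axis. The slide's other points (heap size, I/O) need their own measurements from a profiled node.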
    24. 7. Automate, Automate, Automate
       • Love your inner Ops self. Distributed systems move complexity to operations.
       • Puppet or something similar (really)
       • Learn CCM earlier rather than later
         – www.github.com/pcmanus/ccm
    25. 8. Some Maintenance Required
       • Repairs & Cleanup ops
         – automate and run frequently
       • Rolling restart, meet rolling repair
       • Learn jconsole
       • Solution:
         – Jolokia (JMX via HTTP)
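Jolokia exposes JMX over HTTP with JSON requests, which makes it scriptable from monitoring and automation tools. As a rough sketch, a read of the JVM heap usage could be prepared as below. The host name and port are placeholders (assumptions, not from the talk), the helper names are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def jolokia_read_payload(mbean, attribute, path=None):
    # A Jolokia "read" request names an MBean, an attribute, and optionally an inner path.
    payload = {"type": "read", "mbean": mbean, "attribute": attribute}
    if path is not None:
        payload["path"] = path
    return payload


def build_request(base_url, payload):
    # Jolokia accepts the JSON payload as the body of an HTTP POST.
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# java.lang:type=Memory / HeapMemoryUsage is a standard JVM MBean attribute.
payload = jolokia_read_payload("java.lang:type=Memory", "HeapMemoryUsage", path="used")

# Placeholder endpoint: adjust host, port, and path to wherever the agent listens.
request = build_request("http://cassandra-node.example:8778/jolokia/", payload)

# Sending is left to the caller, e.g. urllib.request.urlopen(request, timeout=5).
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```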
    26. Where is Barracuda Today?
       • 2 years in production with Cassandra
       • Definitely the right choice for our persistence tier
       • 2 product lines on C* based system and another major product in beta
       • Achieved “real-time” response
    27. 2.0 and Beyond
       • Thrift -> CQL
       • CQL helps the MySQL to C* migration
         – Easier to comprehend / grasp
       • Everyone understands SELECT * FROM cf WHERE key = 'foo';
       • CAS and other 2.0 features make C* an even better replacement option for MySQL
    28. C* Community
       • Supercalifragilisticexpialidocious community!
       • Riak, HBase, Oracle are other options. How is their dev community?
       • Great client support. Great people. Great motivated developers.
       • IRC: #cassandra on freenode
       • Mailing List: user@cassandra.apache.org
