Running Cassandra on Amazon EC2


What are the challenges of running Apache Cassandra on Amazon EC2? Is it a good idea?

In this presentation, we explore reasons for and against running the distributed database Cassandra on EC2. We look at the I/O performance of EC2 and

Published in: Technology, Business
  1. @cassandralondon
  2. Thanks
  3. Reminder: next meetup Wednesday 8th December. Jake Luciani will be giving a talk on "Lucandra" (a Cassandra backend for Lucene open source search software).
  4. Quick intro to Cassandra
     • Decentralized
     • Fault-tolerant
     • Tunable consistency
     • Elasticity
  5. This talk
     • Why consider EC2?
     • What are the challenges of running Cassandra on EC2?
     • Is it a good idea?
  6. Cassandra design decisions
     • Cassandra is designed to run on many commodity servers
     • It is designed to deal with unreliable hardware and networks
  7. Why consider EC2?
     • On-demand instances
     • “frees you from the costs and complexities of planning, purchasing, and maintaining hardware and transforms what are commonly large fixed costs into much smaller variable costs”
  8. Why consider EC2?
     • Multiple “Availability Zones” in multiple regions (US East, US West, Ireland and Singapore)
  9. Writing to Cassandra
     1. Write added to local log on target machine
     2. Memtable updated
     3. Memtable flushed to disk as data files (SSTable plus SSTable Index)
     4. Eventually data files are compacted
     [Diagram labels: IO, IO, IO]
  10. Reading from Cassandra
      1. Read from any node
      2. Partitioner
      3. Wait for R responses
      4. Wait for N - R responses in the background and perform read repair
      [Diagram labels: IO, IO]
  11. Reading from Cassandra
      • Reads from multiple SSTables
      • The application use case will affect performance and what the bottleneck is (totally random reads being the worst case)
      [Diagram label: IO]
  12. The challenges
      • Getting good enough I/O performance
      • Not a huge number of resources on the Internet (new and shiny)
      • Some minor setup and monitoring challenges (documentation is available)
  13. EC2 I/O performance
      • Ephemeral or EBS; low, moderate or high I/O performance indicators
      • “other resources like the network and the disk subsystem are shared among instances… when a resource is under-utilized you will often be able to consume a higher share of that resource”
  14. EBS or ephemeral?
      • Jonathan Ellis recently on the mailing list: “we recommend using raid0 ephemeral disks on EC2 with L or XL instance sizes for better i/o performance.”
      • http://cassandra-user-incubator-apache- tp5615829p5615889.html
  15. EBS or ephemeral?
      • Amazon suggest EBS is better: “Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage”
  16. “The latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases. You can also attach multiple volumes to an instance and stripe across the volumes. This is one way to improve I/O rates, especially if your application performs a lot of random access across your data set.”
  17. EC2 I/O benchmark
      • Throughput measured using dd
      • Seek measured using seeker.c
      • Software RAID uses mdadm
  18. Which is better?
      • EBS has better throughput; ephemeral is better for random seeks
      • Generic benchmarks aren’t great; it depends on your use case
      • Warning: EC2 performance is not consistent
  19. EC2 Cassandra benchmark
      • Read and write TPS
      • Benchmarks carried out by Corey Hulen
  20. Which is better?
      • Corey suggests: “Raid 0 EBS drives are the way to go”
      • “We didn’t notice a difference above the normal EC2 fluctuations when testing for 2 vs 4 drives”
  21. Conclusions
      • Cassandra will run acceptably on EC2, but real hardware is better
      • It will depend on your use case, particularly the types of read that you do
      • Real hardware may work out cheaper
  22. Conclusions
      • Ephemeral I/O seems to be better than EBS, although EBS has other advantages (it doesn’t disappear if you stop the node)
      • Again, it will depend on use case
  23. Conclusions
      • Large nodes are the best bet
      • Small nodes have poor I/O
      • Extra large nodes are probably not worth it (better to have more nodes)
      • http://cassandra-user-incubator-apache- due-to-GC-tp5128481p5131568.html
  24. Questions?
      • Please leave feedback on
      • Follow @cassandralondon on Twitter
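
Sketch for slide 9 (Writing to Cassandra). The four write steps can be illustrated with a toy model that shows where disk I/O happens. This is only a sketch under stated assumptions, not Cassandra's actual storage engine: `ToyStore`, `memtable_limit`, the JSON file format and the file names are all hypothetical, and keys are assumed to be strings.

```python
import json
import os
import tempfile


class ToyStore:
    """Toy model of the four write steps; not Cassandra's real storage engine."""

    def __init__(self, directory, memtable_limit=3):
        self.directory = directory
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.log_path = os.path.join(directory, "commit.log")
        self.memtable = {}
        self.sstables = []  # data file paths, oldest first
        self._next_id = 0

    def write(self, key, value):
        # Step 1: append the write to the local log (disk I/O).
        with open(self.log_path, "a") as log:
            log.write(json.dumps([key, value]) + "\n")
        # Step 2: update the in-memory memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def _new_data_file(self, rows):
        path = os.path.join(self.directory, f"sstable-{self._next_id}.json")
        self._next_id += 1
        with open(path, "w") as f:
            json.dump(rows, f, sort_keys=True)
        return path

    def flush(self):
        # Step 3: write the memtable to disk as a sorted data file (disk I/O).
        if self.memtable:
            self.sstables.append(self._new_data_file(self.memtable))
            self.memtable = {}

    def compact(self):
        # Step 4: merge data files into one, newer values winning (disk I/O).
        merged = {}
        for path in self.sstables:  # oldest first, so later files overwrite
            with open(path) as f:
                merged.update(json.load(f))
        old_files = self.sstables
        self.sstables = [self._new_data_file(merged)]
        for path in old_files:
            os.remove(path)

    def read(self, key):
        # Check the memtable first, then data files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):
            with open(path) as f:
                rows = json.load(f)
            if key in rows:
                return rows[key]
        return None


store = ToyStore(tempfile.mkdtemp(), memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(key, value)
store.compact()
print(store.read("a"))
```

Before `compact()` runs, a read of `"a"` may have to open more than one data file, which is the "reads from multiple SSTables" cost mentioned on slide 11.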
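
Sketch for slide 10 (Reading from Cassandra). Steps 3 and 4 of the read path can be illustrated with in-memory replicas. This is a minimal sketch, assuming each replica is a plain dict mapping a key to a `(timestamp, value)` pair and that the newest timestamp wins; `read_with_repair` and `newest` are hypothetical names, steps 1 and 2 (node selection and the partitioner) are left out, and the repair runs inline rather than in the background.

```python
def newest(versions):
    """Return the (timestamp, value) pair with the highest timestamp, or None."""
    found = [v for v in versions if v is not None]
    return max(found, key=lambda v: v[0]) if found else None


def read_with_repair(replicas, key, r):
    """Read `key` using the first `r` of N in-memory replicas.

    Each replica is a dict mapping key -> (timestamp, value).
    """
    if not 1 <= r <= len(replicas):
        raise ValueError("r must be between 1 and the number of replicas")
    # Step 3: answer the client from the first R responses.
    answer = newest(rep.get(key) for rep in replicas[:r])
    # Step 4: gather the remaining N - R responses and repair stale replicas.
    # A real system would do this in the background; here it runs inline.
    latest = newest(rep.get(key) for rep in replicas)
    if latest is not None:
        for rep in replicas:
            if rep.get(key) != latest:
                rep[key] = latest
    return None if answer is None else answer[1]


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {"k": (1, "old")}]
print(read_with_repair(replicas, "k", r=1))
print(read_with_repair(replicas, "k", r=1))
```

With `r=1` the first read can return a stale value because only one replica is consulted, while the second read sees the repaired copy. Raising `r` trades read latency for a fresher answer, which is the "tunable consistency" point from slide 4.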
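
Sketch for slide 17 (EC2 I/O benchmark). The talk measured seeks with seeker.c, which is not reproduced here. The following is only a rough Python analogue of a random-read benchmark, assuming a Unix system (it relies on `os.pread`); `seeks_per_second` and its parameters are hypothetical names. When pointed at a regular file, the operating system's page cache will inflate the result, so numbers from it are not comparable with the figures in the talk.

```python
import os
import random
import tempfile
import time


def seeks_per_second(path, duration=1.0, block_size=512):
    """Count random block-sized reads completed per second on `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of the file or device.
        size = os.lseek(fd, 0, os.SEEK_END)
        if size < block_size:
            raise ValueError("target is smaller than one block")
        count = 0
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            # Pick a random offset and read one block from it.
            offset = random.randrange(size - block_size + 1)
            os.pread(fd, block_size, offset)
            count += 1
    finally:
        os.close(fd)
    return count / duration


# Usage on a small scratch file (cached, so the rate will look unrealistically high).
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"\0" * (1024 * 1024))
print(seeks_per_second(scratch.name, duration=0.2))
os.remove(scratch.name)
```

Given the warning on slide 18 that EC2 performance is not consistent, a measurement like this would need to be repeated several times and on the actual volume type (ephemeral or EBS) before drawing conclusions.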