(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second

With the introduction of Amazon Elastic Block Store (EBS) GP2 and recent stability improvements, EBS has gained credibility in the Cassandra world for high-performance workloads. By running Cassandra on Amazon EBS, you can run denser, cheaper Cassandra clusters with just as much availability as ephemeral-storage instances. This talk walks through a highly detailed use case and configuration guide for a multi-petabyte, million-writes-per-second cluster that needs to be high-performing and cost-efficient. We explore the instance type choices, configuration, and low-level tuning that allowed us to hit 1.3 million writes per second with a replication factor of 3 on just 60 nodes.

  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jim Plush Sr Director of Engineering, CrowdStrike Dennis Opacki, Sr Cloud Systems Architect, CrowdStrike October 2015 BDT323 Amazon EBS and Cassandra 1 Million Writes Per Second on 60 Nodes
  2. 2. An Introduction to CrowdStrike We Are A Cybersecurity Technology Company We Detect, Prevent And Respond To All Attack Types In Real Time, Protecting Organizations From Catastrophic Breaches We Provide Next Generation Endpoint Protection, Threat Intelligence & Pre & Post IR Services http://www.crowdstrike.com/introduction-to-crowdstrike-falcon-host/
  3. 3. CrowdStrike Scale • Cloud-based endpoint protection • Single customer can generate > 2 TB daily • 500K+ events per second • Multi-petabytes of managed data © 2015. All Rights Reserved.
  4. 4. Truisms??? • HTTPS is too slow to run everywhere • All you need is anti-virus • Never run Cassandra on Amazon EBS © 2015. All Rights Reserved.
  5. 5. © 2015. All Rights Reserved. What is Amazon EBS? (Diagram: an EC2 instance with two EBS data volumes mounted at /mnt/foo and /mnt/bar) • Network-mounted hard drive • Ability to snapshot data • Data encryption at rest & in flight
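As a rough illustration of the slide above, here is a minimal sketch of how a gp2 data volume might be created, attached, and formatted with XFS; the volume ID, instance ID, device name, and mount point are placeholders, not values from the talk.
  # Create an encrypted gp2 volume in the instance's Availability Zone (size in GiB; IDs are hypothetical)
  aws ec2 create-volume --availability-zone us-east-1a --size 4096 --volume-type gp2 --encrypted
  # Attach it to the instance as /dev/xvdf
  aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvdf
  # Format with XFS and mount it
  sudo mkfs.xfs /dev/xvdf
  sudo mkdir -p /mnt/foo
  sudo mount -o noatime /dev/xvdf /mnt/foo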
  6. 6. Existing Amazon EBS Assumptions • Jittery I/O a.k.a: Noisy neighbors • Single point of failure in a region • Cost is too damn high • Bad volumes (dd and destroy) © 2015. All Rights Reserved.
  7. 7. A Recent Project: Initial Requirements • 1PB of incoming event data from millions of devices • Modeled as a graph • 1 million writes per second (burst) • Age data out after x days • 95% write 5% read © 2015. All Rights Reserved.
  8. 8. We Tried • Cassandra + Titan • Sharding? • Neo4J • PostgreSQL, MySQL, SQLite • LevelDB/RocksDB © 2015. All Rights Reserved.
  9. 9. We Have to Make This Work Cassandra had the properties we needed Time for a new approach? © 2015. All Rights Reserved. http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html
  10. 10. Number of Machines for 1PB © 2015. All Rights Reserved. (Chart: instance count required for 1 PB, comparing i2.xlarge against c4.2xlarge with EBS)
  11. 11. Yearly Cost for 1PB Cluster © 2015. All Rights Reserved. (Chart: yearly cost in millions of dollars for i2.xlarge on-demand, i2.xlarge reserved, c4.2xl on-demand, and c4.2xl reserved with Amazon EBS)
  12. 12. Initial Launch Date Tiered Compaction © 2015. All Rights Reserved. …more details by Jeff Jirsa, CrowdStrike Cassandra Summit 2015 - DTCS http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators
  13. 13. Initial Launch • Cassandra 2.0.12 (DSE) • m3.2xlarge 8 core • Single 4TB EBS GP2 ~10,000 IOPS • Default tunings © 2015. All Rights Reserved.
  14. 14. Performance Was Terrible • 12 node cluster • ~60K writes per second RF2 • ~10K writes per 8 core box • We went to the experts © 2015. All Rights Reserved.
  15. 15. © 2015. All Rights Reserved. Cassandra Summit 2014 Family Search asked the same question: Where’s the bottleneck? https://www.youtube.com/watch?v=Qfzg7gcSK-g
  16. 16. IOPS Available © 2015. All Rights Reserved. (Chart: IOPS available on i2.xlarge vs. c4.2xlarge)
  17. 17. © 2015. All Rights Reserved. 1.3K IOPS?
  18. 18. © 2015. All Rights Reserved. IOPS I see you there, but I can’t reach you!
  19. 19. © 2015. All Rights Reserved. The magic gates opened… We hit 1 million writes per second RF3 on 60 nodes
  20. 20. © 2015. All Rights Reserved. Testing Setup
  21. 21. Testing Methodology • Each test run • clean C* instances • old test keyspaces dropped • 13+ TBs of data loaded during read testing • 20 C4.4XL stress writers, each with its own 1-billion-key sequence © 2015. All Rights Reserved.
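One way to picture "each with its own 1-billion-key sequence": every stress writer is handed a disjoint -pop range so writers never collide on partitions. Below is a minimal sketch of deriving such a range from a per-node index; the NODENUM variable mirrors the runstress.py example later in the deck (slide 52), but the exact mechanics here are an assumption.
  # NODENUM is 1..20, one per stress instance (hypothetical helper, not from the talk)
  NODENUM=${NODENUM:-1}
  START=$(( (NODENUM - 1) * 1000000000 + 1 ))
  END=$(( NODENUM * 1000000000 ))
  echo "-pop seq=${START}..${END}"   # e.g. node 2 -> seq=1000000001..2000000000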
  22. 22. Cluster Topology © 2015. All Rights Reserved. Stress Node 10 Instances AZ: 1A Stress Nodes 10 Instances AZ: 1B 20 C* Nodes AZ: 1A 20 C* Nodes AZ: 1B 20 C* Nodes AZ: 1C OpsCenter
  23. 23. Amazon EBS © 2015. All Rights Reserved.
  24. 24. Cassandra Stress 2.1.x © 2015. All Rights Reserved. bin/cassandra-stress user duration=100000m cl=ONE profile=/home/ubuntu/summit_stress.yaml ops(insert=1) no-warmup -pop seq=1..1000000000 -mode native cql3 -node 10.10.10.XX -rate threads=1000 -errors ignore
  25. 25. © 2015. All Rights Reserved. PCSTAT - Al Tobey http://www.datastax.com/dev/blog/compaction-improvements-in-cassandra-21 https://github.com/tobert/pcstat
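pcstat reports how much of each file is resident in the Linux page cache, which is how you can see whether hot SSTables are being served from memory rather than from Amazon EBS. A hedged invocation, reusing the deck's /mnt/cassandra/data path and stress keyspace (the table directory glob is illustrative):
  # Page-cache residency for the stress keyspace's SSTable data files
  sudo pcstat /mnt/cassandra/data/summit_stress/*/*-Data.db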
  26. 26. © 2015. All Rights Reserved. Netflix Test - What is C* capable of?
  27. 27. Netflix Test © 2015. All Rights Reserved. 1+ million writes per second at RF 3, i.e., 3+ million local writes per second (each client write lands on three replicas). NICE!
  28. 28. Netflix Test © 2015. All Rights Reserved.
  29. 29. Netflix Test © 2015. All Rights Reserved. No dropped mutations, system healthy at 1.1M after 50 mins
  30. 30. Netflix Test © 2015. All Rights Reserved. I/O util is not pegged. Commit disk = steady!
  31. 31. Netflix Test © 2015. All Rights Reserved. Low I/O wait
  32. 32. Netflix Test © 2015. All Rights Reserved. 95th Latency = Reasonable
  33. 33. Netflix Test - Read Fail © 2015. All Rights Reserved. compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'} https://issues.apache.org/jira/browse/CASSANDRA-10249 https://issues.apache.org/jira/browse/CASSANDRA-8894 Data Drive Pegged
  34. 34. Reading Data • 24-hour read test • over 10 TBs of data in the CF • sustained > 350K reads per second over 24 hours • 1M reads/per sec peak • CL ONE • 12 C4.4XL stress boxes © 2015. All Rights Reserved.
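The read load was driven with the same stress profile as the writes; a hedged sketch of what a read-side invocation might look like (the query name "simple" is taken from the mixed-workload example on slide 52, while the thread count and node address are placeholders):
  # 24-hour read-only run at CL ONE against the previously loaded keyspace
  bin/cassandra-stress user duration=1440m cl=ONE profile=/home/ubuntu/summit_stress.yaml ops(simple=1) no-warmup -pop seq=1..1000000000 -mode native cql3 -node 10.10.10.XX -rate threads=500 -errors ignore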
  35. 35. Reading Data © 2015. All Rights Reserved.
  36. 36. Reading Data © 2015. All Rights Reserved.
  37. 37. Reading Data © 2015. All Rights Reserved. Not Pegged
  38. 38. Reading Data © 2015. All Rights Reserved. 7.2ms 95th latency
  39. 39. Netflix Blog Post vs. CrowdStrike: 180 fewer cores (45 fewer i2.xlarge instances) • c4.4xlarge vs. i2.xlarge, 24-hour test (sans data transfer cost) • Netflix cluster/stress cost: ~$6,300 (285 i2.xlarge at $0.85 per hour) • CrowdStrike cluster/stress with Amazon EBS cost: ~$2,600 (60 c4.4xlarge at $0.88 per hour)
  40. 40. Read Notes with Amazon EBS • Our test was a single 10K IOPS volume • More/bigger reads? • PIOPS gives you as much throughput as you need • RAID0 multiple Amazon EBS volumes (Diagram: an EC2 instance with two EBS data volumes mounted at /mnt/foo and /mnt/bar)
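If a single gp2 volume's throughput isn't enough for heavier reads, several EBS volumes can be striped together; a minimal mdadm RAID0 sketch, with the device names and volume count as assumptions:
  # Stripe two attached EBS volumes into one RAID0 array (devices are illustrative)
  sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg
  sudo mkfs.xfs /dev/md0
  sudo mount -o noatime /dev/md0 /mnt/cassandra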
  41. 41. © 2015. All Rights Reserved. What Unlocked Performance
  42. 42. Major Tweaks • Ubuntu HVM types • Enhanced networking • Now faster than PV • Ubuntu distro tuned for cloud workloads • XFS Filesystem © 2015. All Rights Reserved.
  43. 43. Major Tweaks • Cassandra 2.1 • Java 8 • G1 Garbage Collector © 2015. All Rights Reserved. https://issues.apache.org/jira/browse/CASSANDRA-7486
  44. 44. Major Tweaks • C4.4XL 16 core, EBS Optimized • 4TB, 10,000 IOPS EBS GP2 Encrypted Data Drive • 160MB/s throughput • 1TB 3000 IOPS EBS GP2 Encrypted Commit Log Drive © 2015. All Rights Reserved.
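Those IOPS figures follow from gp2's baseline of roughly 3 IOPS per GB: 4,096 GB × 3 = 12,288 IOPS, capped at the 10,000 IOPS per-volume limit of the time (see the March 2015 milestone on slide 63), and 1,024 GB × 3 = 3,072, or roughly 3,000 IOPS, for the commit log drive.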
  45. 45. Major Tweaks cassandra-env.sh • MAX_HEAP_SIZE=8G • JVM_OPTS="$JVM_OPTS -XX:+UseG1GC" • Lots of other minor tweaks in crowdstrike-tools © 2015. All Rights Reserved.
  46. 46. cassandra-env.sh © 2015. All Rights Reserved. Magic from Al Tobey: put the PID in batch mode; mask CPU0 from the process to reduce context switching
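A hedged sketch of what "batch mode" and "mask CPU0" can look like in practice, in the spirit of Al Tobey's tuning guide; the PID lookup and the core range (cores 1-15 on a 16-core c4.4xlarge) are assumptions:
  # Find the running Cassandra JVM (lookup method is illustrative)
  CASS_PID=$(pgrep -f CassandraDaemon)
  # SCHED_BATCH gives the JVM longer, less frequently preempted time slices
  sudo schedtool -B "$CASS_PID"
  # Pin the JVM to cores 1-15, leaving CPU0 free for interrupt handling
  sudo taskset -cp 1-15 "$CASS_PID"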
  47. 47. YAML Settings cassandra.yaml (based on 16 core) • concurrent_reads: 32 • concurrent_writes: 64 • memtable_flush_writers: 8 • trickle_fsync: true • trickle_fsync_interval_in_kb: 1000 • native_transport_max_threads: 256 • concurrent_compactors: 4 © 2015. All Rights Reserved.
  48. 48. cassandra.yaml © 2015. All Rights Reserved. We found a good portion of the CPU load was being used for internode compression, which reduced write throughput. internode_compression: none
  49. 49. Lessons Learned • Amazon EBS was never the bottleneck during testing, GP2 is legit • Built-in types like list and map come at a performance penalty • 30% hit on our writes using Map type • DTCS is very young (see Jeff Jirsa’s talk) • 2.1 Stress Tool is tricky but great for modeling workloads • How will compression affect your read path? © 2015. All Rights Reserved.
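To make the collection-type lesson concrete, here is a hedged schema sketch (hypothetical keyspace, table, and column names, not the project's real data model) of the kind of change that trades a map column for a clustering column:
  # Before: per-vertex properties stored in a CQL map (collections carry extra per-cell overhead in 2.x)
  echo "CREATE TABLE graph.vertex_props_map (vertex_id bigint PRIMARY KEY, props map<text, text>);" | cqlsh 10.10.10.XX
  # After: one row per property, with the map key promoted to a clustering column
  echo "CREATE TABLE graph.vertex_props_flat (vertex_id bigint, prop_name text, prop_value text, PRIMARY KEY (vertex_id, prop_name));" | cqlsh 10.10.10.XX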
  50. 50. © 2015. All Rights Reserved. Test Your Own! https://github.com/CrowdStrike/cassandra-tools
  51. 51. It’s Just Python launch 20 nodes in us-east-1 • python launch.py launch --nodes=20 --config=c4-ebs-hvm --az=us-east-1a bootstrap the new nodes with C*, RAID/Format disks, etc… • fab -u ubuntu bootstrapcass21:config=c4-highperf run arbitrary commands • fab -u ubuntu cmd:config=c4-highperf,cmd="sudo rm -rf /mnt/cassandra/data/summit_stress" © 2015. All Rights Reserved.
  52. 52. Run Custom Stress Profiles… Multi-Node Support export NODENUM=1 ubuntu@ip-10-10-10.XX:~$ python runstress.py --profile=stress10 --seednode=10.10.10.XX --threads=50 Going to run: /home/ubuntu/apache-cassandra-2.1.5/tools/bin/cassandra-stress user duration=100000m cl=ONE profile=/home/ubuntu/summit_stress.yaml ops(insert=1,simple=9) no-warmup -pop seq=1..1000000000 -mode native cql3 -node 10.10.10.XX -rate threads=50 -errors ignore export NODENUM=2 ubuntu@ip-10-10-10.XX:~$ python runstress.py --profile=stress10 --seednode=10.10.10.XX --threads=50 Going to run: /home/ubuntu/apache-cassandra-2.1.5/tools/bin/cassandra-stress user duration=100000m cl=ONE profile=/home/ubuntu/summit_stress.yaml ops(insert=1,simple=9) no-warmup -pop seq=1000000001..2000000000 -mode native cql3 -node 10.10.10.XX -rate threads=50 -errors ignore © 2015. All Rights Reserved.
  53. 53. • ~3 months on our Amazon EBS–based cluster • Hundreds of TBs of graph data and growing in C* • Billions of vertices/edges • Changing perceptions? • DataStax - Planning an Amazon EC2 cluster Where Are We Today?
  54. 54. Al Tobey’s Tuning Guide for Cassandra 2.1 https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html Resources
  55. 55. Special Thanks To Leif Jackson Marcus King Alan Hannan Jeff Jirsa © 2015. All Rights Reserved. • Al Tobey • Nick Panahi • J.B. Langston • Marcus Eriksson • Iian Finlayson • Dani Traphagen
  56. 56. Amazon EBS Heading Into 2016 © 2015. All Rights Reserved.
  57. 57. 4TB (10k IOPS) GP2 I/O Hit? Not enough to faze C*
  58. 58. © 2015. All Rights Reserved. So why the hate for Amazon EBS?
  59. 59. © 2015. All Rights Reserved. • Used instance-store image and ephemeral drives • Painful to stop/start instances, resize • Couldn’t avoid scheduled maintenance (i.e., Reboot-a-palooza) • Encryption required shenanigans Following the Crowd – Trust Issues
  60. 60. © 2015. All Rights Reserved. • We still had failures • Now we get to rebuild from scratch Guess What
  61. 61. © 2015. All Rights Reserved. What do you mean my volume is “stuck”? • April 2011 – Netflix, Reddit, and Quora • October 2012 – Reddit, Imgur, Heroku • August 2013 – Vine, Airbnb Amazon EBS’s Troubled Childhood
  62. 62. © 2015. All Rights Reserved. http://techblog.netflix.com/2011/04/lessons- netflix-learned-from-aws-outage.html Spread services across multiple regions Test failure scenarios regularly (Chaos Monkey) Make Cassandra databases more resilient by avoiding Amazon EBS Kiss of Death
  63. 63. © 2015. All Rights Reserved. Amazon moves quickly and quietly: • March 2011 – New Amazon EBS GM • July 2012 – Provisioned IOPS • May 2014 – Native encryption • Jun 2014 – GP2 (game changer) • Mar 2015 – 16TB / 10K GP2 / 20K PIOPS Redemption
  64. 64. © 2015. All Rights Reserved. • Prioritized Amazon EBS availability and consistency over features and functionality • Compartmentalized the control plane – removed cross-AZ dependencies for running volumes • Simplified workflows to favor sustained operation • Tested and simulated via TLA+/PlusCal to better understand corner cases • Dedicated a large fraction of engineering resources to reliability and performance Redemption
  65. 65. © 2015. All Rights Reserved. The Amazon EBS team targets 99.999% availability, exceeding expectations Reliability
  66. 66. © 2015. All Rights Reserved. • In the past 12 months, zero Amazon EBS-related failures • Thousands of GP2 data volumes (~2PB data) • Transitioning all systems to Amazon EBS root drives • Moved all data stores to Amazon EBS (C*, Kafka, Elasticsearch, Postgres, etc.) CrowdStrike Today
  67. 67. © 2015. All Rights Reserved. • Select a region with more than 2 AZs (e.g., us-east-1 or us-west-2) • Use Amazon EBS GP2 or PIOPS storage • Separate volumes for data and commit logs Staying Safe - Architecture
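A minimal sketch of the separate-volumes point, assuming a data volume at /dev/xvdf and a smaller commit log volume at /dev/xvdg (device names and directories are illustrative; the cassandra.yaml keys are the standard ones):
  # Mount data and commit log on different EBS volumes so commit log writes never compete with compaction I/O
  sudo mkdir -p /mnt/cassandra/data /mnt/cassandra/commitlog
  sudo mount -o noatime /dev/xvdf /mnt/cassandra/data
  sudo mount -o noatime /dev/xvdg /mnt/cassandra/commitlog
  # Then point cassandra.yaml at them: data_file_directories -> /mnt/cassandra/data, commitlog_directory -> /mnt/cassandra/commitlog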
  68. 68. © 2015. All Rights Reserved. • Use Amazon EBS volume monitoring • Pre-warm Amazon EBS volumes? • Schedule snapshots for consistent backups Staying Safe - Ops
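A few hedged operational one-liners along those lines; the alarm threshold, volume IDs, and SNS topic are placeholders:
  # Alarm when a volume's average queue depth stays high for 15 minutes
  aws cloudwatch put-metric-alarm --alarm-name ebs-queue-depth-high \
    --namespace AWS/EBS --metric-name VolumeQueueLength \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --statistic Average --period 300 --evaluation-periods 3 \
    --threshold 32 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
  # "Pre-warm" a volume restored from a snapshot by reading every block once
  sudo dd if=/dev/xvdf of=/dev/null bs=1M
  # Flush memtables, then snapshot the data volume for a point-in-time backup
  nodetool flush && aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "cassandra daily backup"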
  69. 69. © 2015. All Rights Reserved. • Challenge assumptions • Stay current on AWS blog • Talk with your peers Most Importantly http://aws.amazon.com/ebs/nosql/
  70. 70. Remember to complete your evaluations! BDT323
  71. 71. Thank you! @jimplush @opacki
