Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts


A talk by Denis Sheahan on Netflix's Cassandra architecture and open source efforts, presented on 28 March 2012 at Cassandra Europe in London.

Speaker notes:
  • Find numbers some day
  • Send to John C once finished
  • Remove cluster names, hopefully find one without purple. Might want another slide with the cluster details, if ready by 3/27
  • Compaction is happening continually, after only a few seconds. The logs show a Memtable flush every 10-15 seconds and a minor compaction every 30 seconds or so. You can also see this in the iostat data: both reads and writes are going to disk, the majority writes.
    Stress command line:
      java -jar stress.jar -d "144 node ids" -e ONE -n 27000000 -l 3 -i 1 -t 200 -p 7102 -o INSERT -c 10 -r
    So it is writing 10 columns per row, with the key id randomly chosen from 27 million ids. Thirty clients talk to the first 144 nodes and 30 talk to the second 144. For the Insert we write three replicas, which is specified in the keyspace:
      Keyspace: Keyspace1:
        Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
        Durable Writes: true
        Options: [us-east:3]
        Column Families:
          ColumnFamily: Standard1
            Key Validation Class: org.apache.cassandra.db.marshal.BytesType
            Default column value validator: org.apache.cassandra.db.marshal.BytesType
            Columns sorted by: org.apache.cassandra.db.marshal.BytesType
            Row cache size / save period in seconds: 0.0/0
            Key cache size / save period in seconds: 200000.0/14400
            Memtable thresholds: 1.7671875/1440/128 (millions of ops/minutes/MB)
            GC grace seconds: 864000
            Compaction min/max thresholds: 4/32
            Read repair chance: 0.0
            Replicate on write: true
  • Cross AZ traffic calculation – per node average 23640 + 19770 = 43410 KB/s. Times 288 nodes times 3600 seconds = 45007 GB/hour; two thirds of that = 30000 GB/hour, at $0.01/GB = $300/hour, so a 10 minute test run = 300/6 = $50.
    Slides error: the test driver was m2.4xl, not m1.xl. Test driver TX 250 Mbit/s = 31 MBytes/s, RX 35 Mbit/s = 4.3 MBytes/s. 60 drivers x 35 MB/s x 3600 = 7.5 TB/hr. Each write is about 400 bytes on disk.
    Creating the tests is heavily dependent on AWS and the fact that we can only launch 96 instances at a time.
    Looking at the AWS, Linux and Cassandra logs for the 288-way run: I kicked off the first ASG from 0->96 at 00:08:02. 27 seconds later the first Linux box was booted, at Sat Oct 22 00:08:29 UTC 2011. The last Linux box (number 288) booted at Sat Oct 22 01:09:51 UTC 2011, and the last Cassandra came online at 01:12:41, about 3 minutes later. So just over an hour to get this bad boy up. Most of the time was waiting for the nodes to join the cluster; I waited for all 96 to join before starting the next AZ.
    AWS claims it took 4 minutes and 40 seconds to launch the 96 instances, which is pretty consistent across the 3 AZs, so about 15-16 minutes of the hour was AWS. It seems to take about 1 minute 30 seconds to boot a Linux instance. Note that in launching the 96 there were failures/retries in all 3 AZs: us-east-1a had 9 failures, us-east-1c had 1, and us-east-1d also had 9.
    Run times varied a bit, mostly based on how long I could sustain the load with the number of clients writing 27 million records. In Cassandra stress you cannot specify an elapsed time, just a total number of transactions, and the load also decays irregularly as threads terminate. Sustained load: 48-way for 570 seconds, 96-way for 550 seconds, 144-way for 660 seconds, 288-way for 780 seconds.
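The cost arithmetic in the note above can be checked mechanically. A minimal sketch in plain Java, using the values from the note (decimal GB, and the $0.01/GB inter-AZ transfer rate quoted there):

```java
// Worked check of the cross-AZ traffic cost estimate in the speaker notes.
public class CrossAzCost {
    // Total network traffic for the whole cluster, in GB (decimal) per hour.
    static double clusterGbPerHour(double perNodeKBps, int nodes) {
        return perNodeKBps * nodes * 3600 / 1_000_000.0;
    }

    public static void main(String[] args) {
        double perNode = 23640 + 19770;                 // network read + write KB/s per node
        double total = clusterGbPerHour(perNode, 288);  // ~45,007 GB/hour
        double crossAz = total * 2 / 3;                 // ~2/3 of traffic crosses AZs
        double perHour = crossAz * 0.01;                // $0.01 per GB -> ~$300/hour
        double tenMin = perHour / 6;                    // ~$50 for a 10 minute run
        System.out.printf("%.0f GB/h total, $%.2f/h cross-AZ, $%.2f per 10 min%n",
                total, perHour, tenMin);
    }
}
```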
  • Complete connection pool abstraction. Queries and mutations are wrapped in objects created by the Keyspace implementation, making it possible to retry failed operations. This differs from other connection pool implementations, in which the operation is created on a specific connection and must be completely redone if it fails.
    Simplified serialization via method overloading. The low-level Thrift library only understands data that is serialized to a byte array. Hector requires serializers to be specified for nearly every call. Astyanax minimizes the places where serializers are specified by using predefined ColumnFamily and ColumnPath definitions which specify the serializers. The API also overloads set and get operations for common data types.
    The internal library does not log anything. All internal events are instead ... calls to a ConnectionPoolMonitor interface. This allows customization of log levels and filtering of repeating events outside of the scope of the connection pool.
    Super columns will soon be replaced by composite column names, so it is recommended not to use super columns at all and to use composite column names instead. There is some support for super columns in Astyanax, but those methods have been deprecated and will eventually be removed.

    1. Cassandra in the Netflix Architecture. CassandraEU, London, March 28th 2012. Denis Sheahan
    2. Agenda
        • Netflix and the Cloud
        • Why Cassandra
        • Cassandra Deployment @ Netflix
        • Cassandra Operations
        • Cassandra lessons learned
        • Scalability
        • Open Source
    3. Netflix and the Cloud. With more than 23 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV series. Netflix is almost 100% deployed on the Amazon Cloud.
    4. Out-Growing Data Center. 37x growth, Jan 2010 - Jan 2011 (chart: datacenter capacity)
    5. Netflix Deployed on AWS (architecture diagram spanning Content, Logs, Play, WWW, API and CS tiers: S3, DRM, CDN routing, CDNs, EC2, EMR Hadoop, Hive, sign-up, search, metadata, bookmarks, movie and TV choosing, ratings, logging, device config and actions, diagnostics, customer call log, CS analytics, business intelligence, social/Facebook)
    6. Netflix AWS Cost
    7. Why Cassandra
    8. Distributed Key-Value Store vs Central SQL Database
        • The datacenter had a central database and a DBA
        • Schema changes require downtime
        • The Cloud, in contrast, has many key-value data stores
          – Joins take place in java code
          – No schema to change, no scheduled downtime
    9. Goals for moving from the Netflix DC to the Cloud
        • Faster – Lower latency than the equivalent datacenter web services
        • Scalable – Avoid needing any more datacenter capacity as subscriber count increases
        • Available – Substantially higher robustness and availability than datacenter services
        • Productive – Optimize agility of a large development team with automation and tools
    10. Cassandra
        • Faster
          – Low latency, low latency variance
        • Scalable
          – Supports running on Amazon EC2
          – High and scalable read and write throughput
          – Support for multi-region clusters
        • Available
          – We value Availability over Consistency
          – Cassandra is Eventually Consistent
          – Supports Amazon Availability Zones
          – Data integrity checks and repairs
          – Online snapshot backup, restore/rollback
        • Productive
          – We want FOSS + support
    11. Cassandra Deployment
    12. Netflix Cassandra Use Cases
        • Many different profiles
          – Read heavy environments with a strict SLA
          – Batch write environments (70 rows per batch) also serving low latency reads
          – Read-modify-write environments with large rows
          – Write-only environments with rapidly increasing data sets
          – And many more…
    13. How much we use Cassandra
        30        production clusters
        12        multi-region clusters
        3         max regions for one cluster
        65        total TB of data across all clusters
        472       Cassandra nodes
        72 / 28   largest cluster (nodes / data in TB)
        6k / 250k max reads / writes per second
    14. Deployment Architecture (diagram, AWS EC2: Front End Load Balancer -> API Proxy -> Load Balancer -> API -> Component Services -> memcached and Cassandra on internal disks, with backup to S3 and a Discovery Service)
    15. High Availability Deployment
        • Fact of life – EC2 instances die
        • We store 3 local replicas in 3 different Cassandra nodes
          – One replica per EC2 Availability Zone (AZ)
        • Minimum cluster configuration is 6 nodes, 2 per AZ
          – A single instance failure still leaves at least one node in each AZ
        • Use Local Quorum for writes
        • Use Consistency Level One for reads
        • Entire cluster replicated in multi-region deployments
        • AWS Availability Zones
          – Separate buildings
          – Separate power etc.
          – Fairly close together
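The replica arithmetic behind this layout can be sketched as follows. This is a toy illustration of quorum math, not Netflix code: with RF = 3 and one replica per AZ, LOCAL_QUORUM writes (2 of 3) survive the loss of a full availability zone.

```java
// Toy check of the replica math on slide 15: RF=3 with one replica per AZ,
// so LOCAL_QUORUM writes and CL.ONE reads survive losing one whole AZ.
public class QuorumMath {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;   // floor(RF/2) + 1
    }

    // Can a quorum write still succeed if `downAzs` availability zones are
    // lost, assuming one replica per AZ?
    static boolean writeSurvives(int rf, int downAzs) {
        return rf - downAzs >= quorum(rf);
    }

    public static void main(String[] args) {
        System.out.println("quorum(3) = " + quorum(3));            // 2
        System.out.println("1 AZ down: " + writeSurvives(3, 1));   // true
        System.out.println("2 AZs down: " + writeSurvives(3, 2));  // false
    }
}
```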
    16. Astyanax – Cassandra Write Data Flows: Single Region, Multiple Availability Zones, Token Aware
        • Token-aware clients write to Cassandra nodes and zones
        • Nodes return an ack to the client
        • Data is written to the internal commit log disks (no more than 10 seconds later)
        • If a node goes offline, hinted handoff completes the write when the node comes back up
        • Requests can choose to wait for one node, a quorum, or all nodes to ack the write
        • SSTable disk writes and compactions occur asynchronously
        (diagram: six Cassandra nodes with local disks, two in each of zones A, B and C)
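The token-aware routing mentioned on this slide can be illustrated with a toy ring. This is a sketch under simplifying assumptions (a plain integer token ring and Java's String.hashCode as the hash function), not Astyanax's actual implementation:

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of token-aware routing: the client hashes the row key and
// talks directly to the node owning that token range, saving a network hop.
public class TokenAwareRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) { ring.put(token, node); }

    // Owner = first node whose token is >= h, wrapping around the ring.
    String ownerOfToken(long h) {
        Map.Entry<Long, String> e = ring.ceilingEntry(h);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    String ownerOf(String rowKey) {
        return ownerOfToken(Math.abs((long) rowKey.hashCode()));
    }

    public static void main(String[] args) {
        TokenAwareRing r = new TokenAwareRing();
        r.addNode(100, "node-A"); // zone A
        r.addNode(200, "node-B"); // zone B
        r.addNode(300, "node-C"); // zone C
        System.out.println(r.ownerOfToken(150)); // node-B
        System.out.println(r.ownerOfToken(350)); // wraps around to node-A
    }
}
```

A real client builds this map from the cluster's ring description and refreshes it as nodes join or leave.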
    17. Extending to Multi-Region. In production for UK/Ireland support
        • Create cluster in EU
        • Backup US cluster to S3
        • Restore backup in EU
        • Local repair EU cluster
        • Global repair/join
        (diagram: US and EU rings of six nodes each across zones A, B and C, with S3 between them)
    18. Data Flows for Multi-Region Writes: Token Aware, Consistency Level = Local Quorum
        • Client writes to local replicas
        • Local write acks are returned to the client, which continues when 2 of 3 local nodes have committed
        • The local coordinator writes to the remote coordinator
        • When the data arrives, the remote coordinator node acks
        • The remote coordinator sends the data to the other remote zones
        • Remote nodes ack to the local coordinator
        • Data is flushed to the internal commit log disks (no more than 10 seconds later)
        • If a node or region goes offline, hinted handoff completes the write when the node comes back up; nightly global compare and repair jobs ensure everything stays consistent
        (diagram: local US and remote EU rings across zones A, B and C)
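Hinted handoff, which both write-flow slides rely on, can be modeled in a few lines. This is an illustrative simulation, not Cassandra's implementation; the node names and mutation strings are made up:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy model of hinted handoff: if a replica is down, the coordinator stores
// a hint locally and replays it when the replica comes back up.
public class HintedHandoff {
    final Set<String> downNodes = new HashSet<>();
    final Map<String, Queue<String>> hints = new HashMap<>();
    final Map<String, Set<String>> applied = new HashMap<>();

    void write(String node, String mutation) {
        if (downNodes.contains(node)) {
            hints.computeIfAbsent(node, k -> new ArrayDeque<>()).add(mutation);
        } else {
            applied.computeIfAbsent(node, k -> new HashSet<>()).add(mutation);
        }
    }

    void nodeUp(String node) {
        downNodes.remove(node);
        Queue<String> q = hints.remove(node);
        while (q != null && !q.isEmpty()) write(node, q.poll()); // replay hints
    }

    public static void main(String[] args) {
        HintedHandoff h = new HintedHandoff();
        h.downNodes.add("eu-1");
        h.write("eu-1", "row42:col1=v");            // stored as a hint
        h.nodeUp("eu-1");                           // hint replayed
        System.out.println(h.applied.get("eu-1"));  // [row42:col1=v]
    }
}
```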
    19. Priam – Cassandra Automation. Available at
        • Open source Tomcat code running as a sidecar on each Cassandra node. Deployed as a separate rpm
        • Zero touch auto-configuration
        • Token allocation and assignment, including multi-region
        • Broken node replacement and ring expansion
        • Full and incremental backup/restore to/from S3
        • Metrics collection and forwarding via JMX
    20. Cassandra Backup/Restore
        • Full backup
          – Time based snapshot
          – SSTables compressed and copied to S3
        • Incremental
          – An SSTable write triggers a compressed copy to S3
        • Archive
          – Copy cross region
        • Restore
          – Full restore, or create a new ring from a backup
        (diagram: Cassandra ring backed up to S3)
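The incremental backup trigger can be sketched as a callback that compresses each freshly written SSTable and ships it off. The sketch below only gzips into memory and records a fake upload; the file name is illustrative, and Priam's real implementation (real S3 PUTs, throttling, retries) is more involved:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Sketch of the incremental backup idea on slide 20: each newly written
// SSTable is compressed and shipped to S3 (here: an in-memory list).
public class IncrementalBackup {
    final List<String> uploaded = new ArrayList<>();

    // Called whenever Cassandra finishes writing an SSTable.
    void onSstableWritten(String name, byte[] contents) throws IOException {
        byte[] gz = gzip(contents);
        uploaded.add(name + ".gz (" + gz.length + " bytes)"); // stand-in for S3 PUT
    }

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        IncrementalBackup b = new IncrementalBackup();
        b.onSstableWritten("Standard1-42-Data.db", new byte[4096]); // compresses well
        System.out.println(b.uploaded);
    }
}
```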
    21. Cassandra Operations
    22. Consoles, Monitors and Explorers
        • Netflix Application Console (NAC) – Primary AWS provisioning/config interface
        • EPIC Counters – Primary method for issue resolution
        • Dashboards
        • Cassandra Explorer – Browse clusters, keyspaces, column families
        • AWS Usage Analyzer – Breaks down costs by application and resource
    23. Cloud Deployment Model (diagram: Elastic Load Balancer -> Auto Scaling Group -> Instances inside a Security Group; a Launch Configuration references an Amazon Machine Image)
    24. NAC
        • Netflix Application Console (NAC) is Netflix's primary tool, in both Prod and Test, to:
          – Create and destroy applications
          – Create and destroy Auto Scaling Groups (ASGs)
          – Scale instances up and down within an ASG and manage auto-scaling
          – Manage launch configs and AMIs
    25. NAC
    26. Cassandra Explorer
        • Kiosk mode – no alerting
        • High level cluster status (thrift, gossip)
        • Warns on a small set of metrics
    27. Epic
        • Netflix-wide monitoring and alerting tool based on RRD
        • Priam sends all JMX data to Epic
        • Very useful for finding specific issues
    28. Dashboards
        • Next level cluster details
          – Throughput
          – Latency, gossip status, maintenance operations
          – Trouble indicators
        • Useful for finding anomalies
        • Most investigations start here
    29. Things we monitor
        • Cassandra
          – Throughput, latency, compactions, repairs
          – Pending threads, dropped operations
          – Backup failures
          – Recent restarts
          – Schema changes
        • System
          – Disk space, disk throughput, load average
        • Errors and exceptions in Cassandra, System and Tomcat log files
    30. Cassandra AWS Pain Points
        • Compactions cause spikes, especially on read-heavy systems
          – Affects clients (Hector, Astyanax)
          – Throttling in newer Cassandra versions helps
        • Repairs are toxic to performance
        • Disk performance on Cloud instances and its impact on SSTable count
        • Memory requirements due to filesystem cache
        • Compression unusable in our environment
        • Multi-tenancy makes performance unpredictable
        • Java heap size and OOM issues
    31. Lessons learned
        • In EC2 it is best to choose instances that are not multi-tenant
        • Better to compact on our terms and not Cassandra's: take nodes out of service for major compactions
        • Size memtable flushes for optimizing compactions
          – Helps when writes are uniformly distributed; easier to determine flush patterns
          – Best to optimize flushes based on memtable size, not time
          – Makes minor compactions smoother
    32. Lessons learned (cont)
        • Key and row caches
          – Left unbounded they can chew up JVM memory needed for normal work
          – Latencies will spike as the JVM has to fight for memory
          – The off-heap row cache still maintains data structures on-heap
        • mmap() as in-memory cache
          – When the process is terminated, the mmap pages are added to the free list
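The bounded-cache fix implied by the first bullet can be sketched with a stock LinkedHashMap in access order. This is a generic LRU illustration, not Cassandra's actual cache code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A key cache with a hard entry cap, evicting least-recently-used entries,
// instead of growing without bound and starving the rest of the JVM heap.
public class BoundedKeyCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedKeyCache(int maxEntries) {
        super(16, 0.75f, true);      // access-order iteration for LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict instead of growing unboundedly
    }

    public static void main(String[] args) {
        BoundedKeyCache<String, Long> cache = new BoundedKeyCache<>(2);
        cache.put("k1", 1L);
        cache.put("k2", 2L);
        cache.get("k1");             // touch k1 so k2 becomes the eldest entry
        cache.put("k3", 3L);         // evicts k2
        System.out.println(cache.keySet()); // [k1, k3]
    }
}
```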
    33. Lessons learned (cont)
        • Sharding
          – If a single row has many gets/mutates, the nodes holding it will become hot spots
          – If a row grows too large, it won't fit into memory
            • A problem for reads, compactions, and repairs
            • Some of our indices ran afoul of this
        • For more info see Jason Brown's slides, "Cassandra from the trenches"
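The sharding workaround described above can be sketched as key suffixing: one logical row is split across N physical rows, each landing on a different node. Everything here (shard count, key format, helper name) is illustrative, not the scheme Netflix actually used:

```java
// Sketch of row sharding for hot or oversized rows: suffix the logical key
// with a shard index derived from the column name so load spreads over the
// ring instead of hammering the replicas of one physical row.
public class RowSharding {
    static final int SHARDS = 16;

    // Physical row key for a (logicalKey, columnName) pair.
    static String shardedKey(String logicalKey, String columnName) {
        int shard = Math.floorMod(columnName.hashCode(), SHARDS);
        return logicalKey + "_" + shard;
    }

    public static void main(String[] args) {
        // Writes for one hot logical row now spread over up to 16 physical
        // rows; a full read must fan out to logicalKey_0 .. logicalKey_15.
        System.out.println(shardedKey("user123:index", "colA"));
        System.out.println(shardedKey("user123:index", "colB"));
    }
}
```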
    34. Scalability
    35. Scalability Testing
        • Cloud based testing – frictionless, elastic
          – Create/destroy any sized cluster in minutes
          – Many test scenarios run in parallel
        • Test scenarios
          – Internal app specific tests
          – The simple "stress" tool provided with Cassandra
        • Scale test: keep making the cluster bigger
          – Check that tooling and automation works…
          – How many ten-column row writes/sec can we do?
    36. Scale-Up Linearity (chart: client writes/s by node count, Replication Factor = 3; 48 EC2 instances – 174,373 w/s; 96 – 366,828 w/s; 144 – 537,172 w/s; 288 – 1,099,837 w/s)
    37. Measured at the Cassandra Server (chart: Cassandra writes per second vs elapsed time in seconds; throughput 3.3 million writes/sec)
    38. Response time (chart: Cassandra response time vs elapsed time in seconds; response time 0.014 ms)
    39. Per Node Activity
        Per Node              48 Nodes      96 Nodes      144 Nodes     288 Nodes
        Per Server Writes/s   10,900 w/s    11,460 w/s    11,900 w/s    11,456 w/s
        Mean Server Latency   0.0117 ms     0.0134 ms     0.0148 ms     0.0139 ms
        Mean CPU %Busy        74.4 %        75.4 %        72.5 %        81.5 %
        Disk Read             5,600 KB/s    4,590 KB/s    4,060 KB/s    4,280 KB/s
        Disk Write            12,800 KB/s   11,590 KB/s   10,380 KB/s   10,080 KB/s
        Network Read          22,460 KB/s   23,610 KB/s   21,390 KB/s   23,640 KB/s
        Network Write         18,600 KB/s   19,600 KB/s   17,810 KB/s   19,770 KB/s
        Node specification – Xen virtual images, AWS US East, three zones
          • Cassandra 0.8.6, CentOS, Sun JDK 6
          • AWS EC2 m1 Extra Large – standard price $0.68/hour
          • 15 GB RAM, 4 cores, 1 Gbit network
          • 4 internal disks (total 1.6 TB, striped together, md, XFS)
    40. Time is Money
                               48 nodes      96 nodes      144 nodes     288 nodes
        Writes Capacity        174,373 w/s   366,828 w/s   537,172 w/s   1,099,837 w/s
        Storage Capacity       12.8 TB       25.6 TB       38.4 TB       76.8 TB
        Nodes Cost/hr          $32.64        $65.28        $97.92        $195.84
        Test Driver Instances  10            20            30            60
        Test Driver Cost/hr    $20.00        $40.00        $60.00        $120.00
        Cross AZ Traffic (1)   5 TB/hr       10 TB/hr      15 TB/hr      30 TB/hr
        Traffic Cost/10min     $8.33         $16.66        $25.00        $50.00
        Setup Duration (2)     15 minutes    22 minutes    31 minutes    66 minutes
        AWS Billed Duration    1 hr          1 hr          1 hr          2 hr
        Total Test Cost        $60.97        $121.94       $182.92       $561.68
        (1) Estimate: two thirds of total network traffic
        (2) A workaround for a tooling bug slowed setup
    41. Open Source
    42. Open Source @ Netflix
        • Source at
        • Binaries at Maven
    43. Cassandra JMeter Plugin
        • Netflix uses JMeter across the fleet for load testing
        • The JMeter plugin provides a wide range of samplers for Get, Put, Delete and Schema Creation
        • Used extensively to load data, for Cassandra stress tests, feature testing etc.
        • Described at the wiki
    44. Astyanax. Available at
        • Cassandra java client
        • API abstraction on top of the Thrift protocol
        • "Fixed" connection pool abstraction (vs. Hector)
          – Round robin with failover
          – Retry-able operations not tied to a connection
          – Netflix PaaS Discovery service integration
          – Host reconnect (fixed interval or exponential backoff)
          – Token aware to save a network hop – lower latency
          – Latency aware to avoid compacting/repairing nodes – lower variance
        • Simplified use of serializers via method overloading (vs. Hector)
        • ConnectionPoolMonitor interface
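The "fixed interval or exponential backoff" reconnect policy mentioned above can be illustrated by its delay schedule. Only the arithmetic is shown; the class and method names are made up, not Astyanax's actual API:

```java
// Sketch of an exponential backoff reconnect schedule: delay doubles on each
// failed attempt, starting from a base and capped at a maximum.
public class ReconnectBackoff {
    // Delay before the n-th retry (n starting at 1): base * 2^(n-1), capped.
    static long exponentialDelayMs(long baseMs, int attempt, long maxMs) {
        long delay = baseMs << Math.min(attempt - 1, 30); // avoid shift overflow
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 6; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + exponentialDelayMs(250, attempt, 5000) + " ms");
        }
        // 250, 500, 1000, 2000, 4000, then capped at 5000 ms
    }
}
```

A fixed-interval policy is the degenerate case where the delay never grows; backoff is preferred when a host is likely to stay down for a while, since it avoids hammering it with reconnect attempts.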