Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
A talk from Denis Sheahan on Netflix's Cassandra architecture and open source efforts. Presented on 28 March 2012 at Cassandra Europe in London.

1. Cassandra in the Netflix Architecture
   Cassandra EU, London, March 28th, 2012
   Denis Sheahan
2. Agenda
   • Netflix and the Cloud
   • Why Cassandra
   • Cassandra Deployment @ Netflix
   • Cassandra Operations
   • Cassandra lessons learned
   • Scalability
   • Open Source
3. Netflix and the Cloud
   With more than 23 million streaming members in the United States, Canada, Latin America,
   the United Kingdom and Ireland, Netflix, Inc. is the world's leading internet subscription
   service for enjoying movies and TV series. Netflix.com is almost 100% deployed on the
   Amazon Cloud.
   Source: http://ir.netflix.com
4. Out-Growing Data Center
   http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
   [Chart: 37x growth, Jan 2010 to Jan 2011, versus datacenter capacity]
5. Netflix Deployed on AWS
   [Diagram: Netflix services running on AWS, grouped into Content, Logs, Play, WWW, API and CS,
   including video masters, S3, DRM, sign-up, metadata, CS lookup, EC2, EMR Hadoop, CDN routing,
   search, device config and actions, Hive, bookmarks, movie/TV choosing, customer call log,
   business intelligence, social/Facebook, CDNs, logging, ratings, CS analytics and diagnostics]
6. Netflix AWS Cost
7. Why Cassandra
8. Distributed Key-Value Store vs Central SQL Database
   • Datacenter had a central database and a DBA
   • Schema changes require downtime
   • The cloud, in contrast, has many key-value data stores
     – Joins take place in Java code
     – No schema to change, no scheduled downtime
9. Goals for moving from Netflix DC to the Cloud
   • Faster
     – Lower latency than the equivalent datacenter web
   • Scalable
     – Avoid needing any more datacenter capacity as subscriber count increases
   • Available
     – Substantially higher robustness and availability than datacenter services
   • Productive
     – Optimize agility of a large development team with automation and tools
10. Cassandra
    • Faster
      – Low latency, low latency variance
    • Scalable
      – Supports running on Amazon EC2
      – High and scalable read and write throughput
      – Support for multi-region clusters
    • Available
      – We value Availability over Consistency
      – Cassandra is Eventually Consistent
      – Supports Amazon Availability Zones
      – Data integrity checks and repairs
      – Online snapshot backup, restore/rollback
    • Productive
      – We want FOSS + Support
11. Cassandra Deployment
12. Netflix Cassandra Use Cases
    • Many different profiles
      – Read-heavy environments with a strict SLA
      – Batch-write environments (70 rows per batch) also serving low-latency reads
      – Read-modify-write environments with large rows
      – Write-only environments with rapidly increasing data sets
      – And many more…
13. How much we use Cassandra
    30       Number of production clusters
    12       Number of multi-region clusters
    3        Max regions, one cluster
    65       Total TB of data across all clusters
    472      Number of Cassandra nodes
    72/28    Largest Cassandra cluster (nodes / data in TB)
    6k/250k  Max reads/writes per second
14. Deployment Architecture
    [Diagram: AWS EC2 deployment with a front end load balancer, API proxy, Discovery Service,
    internal load balancer, API and component services, memcached, and Cassandra on EC2
    internal disks with backup to S3]
15. High Availability Deployment
    • Fact of life: EC2 instances die
    • We store 3 local replicas on 3 different Cassandra nodes
      – One replica per EC2 Availability Zone (AZ)
    • Minimum cluster configuration is 6 nodes, 2 per AZ
      – A single instance failure still leaves at least one node in each AZ
    • Use Local Quorum for writes
    • Use CL One for reads
    • Entire cluster is replicated in multi-region deployments
    • AWS Availability Zones
      – Separate buildings
      – Separate power, etc.
      – Fairly close together
16. Astyanax - Cassandra Write Data Flows
    Single Region, Multiple Availability Zone, Token Aware
    • Client writes to Cassandra nodes and zones
    • Nodes return ack to client
    • Data written to internal commit log disks (no more than 10 seconds later)
    Notes:
    • If a node goes offline, hinted handoff completes the write when the node comes back up
    • Requests can choose to wait for one node, a quorum, or all nodes to ack the write
    • SSTable disk writes and compactions occur asynchronously
    [Diagram: token-aware clients writing to six Cassandra nodes spread across Zones A, B and C]
    (A client configuration sketch follows this slide.)
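The token-aware routing and the consistency levels described in the two slides above map directly onto Astyanax client configuration. Below is a minimal, hedged sketch of how a client might be wired up that way; the cluster name, keyspace name, pool name and seed host are placeholders, and builder method names can differ slightly between Astyanax releases.

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class ClusterClient {
    public static Keyspace connect() {
        // Placeholder cluster/keyspace/seed values, for illustration only
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("ExampleCluster")
            .forKeyspace("ExampleKeyspace")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                // Route each request to a replica that owns the row's token
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE)
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                // Writes wait for a quorum of replicas in the local region
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                // Reads are served by a single replica for low latency
                .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_ONE))
            .withConnectionPoolConfiguration(
                new ConnectionPoolConfigurationImpl("ExamplePool")
                    .setPort(9160)
                    .setMaxConnsPerHost(3)
                    .setSeeds("cass-seed-host:9160"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());

        context.start();
        return context.getClient(); // older Astyanax releases used getEntity()
    }
}
```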
17. Extending to Multi-Region
    In production for UK/Ireland support
    • Create cluster in EU
    • Backup US cluster to S3
    • Restore backup in EU
    • Local repair EU cluster
    • Global repair/join
    [Diagram: US and EU rings, each with six nodes across Zones A, B and C, clients in each
    region, and S3 used to move the backup]
18. Data Flows for Multi-Region Writes
    Token Aware, Consistency Level = Local Quorum
    • Client writes to local replicas
    • Local write acks returned to client, which continues when 2 of 3 local nodes are committed
    • Local coordinator writes to remote coordinator
    • When data arrives, remote coordinator node acks
    • Remote coordinator sends data to other remote zones
    • Remote nodes ack to local coordinator
    • Data flushed to internal commit log disks (no more than 10 seconds later)
    Notes:
    • If a node or region goes offline, hinted handoff completes the write when the node comes
      back up
    • Nightly global compare and repair jobs ensure everything stays consistent
    [Diagram: local (US) and remote (EU) rings, each with six nodes across Zones A, B and C]
    (A per-write example follows this slide.)
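The local-quorum behaviour described above can also be requested per operation rather than only as a pool-wide default. The sketch below assumes the Keyspace handle from the earlier configuration sketch and a hypothetical column family named "user_events"; the per-batch setConsistencyLevel override is how Astyanax exposes this in the versions I know, but the exact method may vary by release.

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;

public class MultiRegionWrite {
    // Hypothetical column family used only for this illustration
    private static final ColumnFamily<String, String> CF_USER_EVENTS =
        new ColumnFamily<String, String>("user_events",
            StringSerializer.get(), StringSerializer.get());

    public static void writeEvent(Keyspace keyspace, String customerId)
            throws ConnectionException {
        MutationBatch batch = keyspace.prepareMutationBatch()
            // Wait for a quorum of replicas in the local region only;
            // replication to the remote region proceeds asynchronously
            .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);

        batch.withRow(CF_USER_EVENTS, customerId)
            .putColumn("last_event", "play_started", null)
            .putColumn("region", "us-east-1", null);

        batch.execute();
    }
}
```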
19. Priam - Cassandra Automation
    Available at http://github.com/netflix
    • Open source Tomcat code running as a sidecar on each Cassandra node, deployed as a
      separate rpm
    • Zero-touch auto-configuration
    • Token allocation and assignment, including multi-region
    • Broken node replacement and ring expansion
    • Full and incremental backup/restore to/from S3
    • Metrics collection and forwarding via JMX
20. Cassandra Backup/Restore
    • Full backup
      – Time-based snapshot
      – SSTables compressed and sent to S3
    • Incremental
      – SSTable write triggers a compressed copy to S3 (see the sketch below)
    • Archive
      – Copy cross-region
    • Restore
      – Full restore, or create a new ring from a backup
    [Diagram: Cassandra ring backing up to an S3 bucket]
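To make the incremental path concrete: conceptually, whenever Cassandra flushes a new SSTable, a sidecar can compress the file and push it to S3. The sketch below only illustrates that idea with the AWS Java SDK and is not Priam's actual implementation; the bucket name and key layout are placeholders.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class IncrementalBackup {
    private static final String BUCKET = "example-cassandra-backups"; // placeholder

    // Compress a freshly written SSTable and upload it to S3
    public static void backupSSTable(File sstable, String clusterName) throws IOException {
        File gz = new File(sstable.getPath() + ".gz");
        byte[] buf = new byte[64 * 1024];
        try (FileInputStream in = new FileInputStream(sstable);
             GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(gz))) {
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
        }

        // Key layout (cluster/incremental/filename) is purely illustrative
        AmazonS3 s3 = new AmazonS3Client();
        s3.putObject(BUCKET, clusterName + "/incremental/" + gz.getName(), gz);
    }
}
```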
21. Cassandra Operations
22. Consoles, Monitors and Explorers
    • Netflix Application Console (NAC)
      – Primary AWS provisioning/config interface
    • EPIC Counters
      – Primary method for issue resolution
    • Dashboards
    • Cassandra Explorer
      – Browse clusters, keyspaces, column families
    • AWS Usage Analyzer
      – Breaks down costs by application and resource
23. Cloud Deployment Model
    [Diagram: Elastic Load Balancer, Auto Scaling Group, Instances, Security Group,
    Launch Configuration, Amazon Machine Image]
24. NAC
    • Netflix Application Console (NAC) is Netflix's primary tool in both Prod and Test to:
      – Create and destroy applications
      – Create and destroy Auto Scaling Groups (ASGs)
      – Scale instances up and down within an ASG and manage auto-scaling (see the sketch below)
      – Manage launch configs and AMIs
    • http://www.slideshare.net/joesondow
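Under the hood, scaling a cluster up or down within an ASG boils down to adjusting the group's desired capacity through the AWS Auto Scaling API. This is not NAC's code, just a hedged sketch of that underlying call using the AWS Java SDK; the group name is a placeholder.

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.SetDesiredCapacityRequest;

public class ScaleAsg {
    // Grow or shrink an Auto Scaling Group by changing its desired capacity
    public static void setCapacity(String asgName, int desiredInstances) {
        AmazonAutoScaling autoScaling = new AmazonAutoScalingClient();
        autoScaling.setDesiredCapacity(new SetDesiredCapacityRequest()
            .withAutoScalingGroupName(asgName)   // placeholder ASG name
            .withDesiredCapacity(desiredInstances)
            .withHonorCooldown(false));          // apply immediately, ignoring cooldown
    }
}
```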
25. NAC
26. Cassandra Explorer
    • Kiosk mode, no alerting
    • High-level cluster status (thrift, gossip)
    • Warns on a small set of metrics
27. Epic
    • Netflix-wide monitoring and alerting tool based on RRD
    • Priam sends all JMX data to Epic (see the sketch below)
    • Very useful for finding specific issues
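Cassandra exposes its metrics as JMX MBeans (7199 is the default JMX port), which is the kind of data a sidecar polls before forwarding counters to a system such as Epic. Below is a minimal, generic sketch of reading those MBeans with the standard JDK JMX client; the host name is a placeholder and this is not Priam's actual collection code.

```java
import java.util.Set;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CassandraJmxPoller {
    public static void main(String[] args) throws Exception {
        // Placeholder host; 7199 is Cassandra's default JMX port
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://cass-node-1:7199/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();

            // Enumerate the Cassandra MBeans (column family stats, thread pools,
            // compaction, dropped messages, ...) so a collector can walk their
            // attributes and forward them to a metrics system
            Set<ObjectName> names =
                mbeans.queryNames(new ObjectName("org.apache.cassandra*:*"), null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}
```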
28. Dashboards
    • Next level of cluster details
      – Throughput
      – Latency, gossip status, maintenance operations
      – Trouble indicators
    • Useful for finding anomalies
    • Most investigations start here
29. Things we monitor
    • Cassandra
      – Throughput, latency, compactions, repairs
      – Pending threads, dropped operations
      – Backup failures
      – Recent restarts
      – Schema changes
    • System
      – Disk space, disk throughput, load average
    • Errors and exceptions in Cassandra, system and Tomcat log files
30. Cassandra AWS Pain Points
    • Compactions cause spikes, especially on read-heavy systems
      – Affects clients (Hector, Astyanax)
      – Throttling in newer Cassandra versions helps
    • Repairs are toxic to performance
    • Disk performance on cloud instances and its impact on SSTable count
    • Memory requirements due to filesystem cache
    • Compression unusable in our environment
    • Multi-tenancy performance unpredictable
    • Java heap size and OOM issues
31. Lessons learned
    • In EC2 it is best to choose instances that are not multi-tenant
    • Better to compact on our terms, not Cassandra's: take nodes out of service for major
      compactions
    • Size memtable flushes to optimize compactions
      – Helps when writes are uniformly distributed; easier to determine flush patterns
      – Best to optimize flushes based on memtable size, not time
      – Makes minor compactions smoother
32. Lessons Learned (cont.)
    • Key and row caches
      – Left unbounded they can chew up JVM memory needed for normal work
      – Latencies will spike as the JVM has to fight for memory
      – The off-heap row cache still maintains data structures on-heap
    • mmap() as an in-memory cache
      – When the process is terminated, mmap'd pages are added to the free list
33. Lessons Learned (cont.)
    • Sharding (see the sketch below)
      – If a single row has many gets/mutates, the nodes holding it will become hot spots
      – If a row grows too large, it won't fit into memory
        • A problem for reads, compactions, and repairs
        • Some of our indices ran afoul of this
    • For more info see Jason Brown's slides, "Cassandra from the trenches", at
      slideshare.net/netflix
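One common way to apply this advice is to split a hot or oversized logical row across a fixed number of physical rows by appending a shard suffix to the row key, so reads and writes spread over several nodes. The sketch below is a generic illustration of that pattern, not the scheme Netflix's indices actually used; the shard count and key format are arbitrary.

```java
public final class RowSharding {
    // Number of physical rows a single logical row is spread across
    // (arbitrary value chosen for illustration)
    private static final int NUM_SHARDS = 16;

    private RowSharding() {}

    // Deterministically route a column to one shard of a logical row,
    // e.g. logical key "user123" and column "evt42" -> physical key "user123:7"
    public static String shardedRowKey(String logicalKey, String columnName) {
        int shard = (columnName.hashCode() & 0x7fffffff) % NUM_SHARDS;
        return logicalKey + ":" + shard;
    }

    // Readers fan out across all shards of the logical row and merge the results
    public static String[] allShardKeys(String logicalKey) {
        String[] keys = new String[NUM_SHARDS];
        for (int i = 0; i < NUM_SHARDS; i++) {
            keys[i] = logicalKey + ":" + i;
        }
        return keys;
    }
}
```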
34. Scalability
35. Scalability Testing
    • Cloud-based testing: frictionless, elastic
      – Create/destroy any sized cluster in minutes
      – Many test scenarios run in parallel
    • Test scenarios
      – Internal app-specific tests
      – Simple "stress" tool provided with Cassandra
    • Scale test: keep making the cluster bigger (see the sketch below)
      – Check that tooling and automation work…
      – How many ten-column row writes/sec can we do?
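The write workload in these tests is simple: each operation inserts one row of ten columns. Below is a hedged sketch of what such a driver loop might look like with the Astyanax client (the actual benchmarks used the Cassandra stress tool and, later, the JMeter plugin); the column family name and thread count are placeholders.

```java
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

public class TenColumnWriteDriver {
    // Placeholder column family for the illustration
    private static final ColumnFamily<String, String> CF_STRESS =
        new ColumnFamily<String, String>("stress",
            StringSerializer.get(), StringSerializer.get());

    // Write 'rowsPerThread' rows of ten columns each from 'threads' workers
    public static void run(final Keyspace keyspace, int threads, final int rowsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < rowsPerThread; i++) {
                            MutationBatch m = keyspace.prepareMutationBatch();
                            String rowKey = UUID.randomUUID().toString();
                            for (int c = 0; c < 10; c++) {
                                m.withRow(CF_STRESS, rowKey)
                                 .putColumn("col" + c, "value" + c, null);
                            }
                            m.execute();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}
```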
36. Scale-Up Linearity
    [Chart: client writes/s by node count, replication factor = 3; y-axis transactions, x-axis
    EC2 instances. Data points: 174,373 w/s at 48 instances, 366,828 at 96, 537,172 at 144,
    1,099,837 at 288]
37. Measured at the Cassandra Server
    [Chart: throughput of 3.3 million writes/sec; Cassandra writes per second vs. elapsed time
    in seconds]
38. Response time 0.014 ms
    [Chart: Cassandra response time vs. elapsed time in seconds]
39. Per Node Activity

    Per Node              48 Nodes      96 Nodes      144 Nodes     288 Nodes
    Per Server Writes/s   10,900 w/s    11,460 w/s    11,900 w/s    11,456 w/s
    Mean Server Latency   0.0117 ms     0.0134 ms     0.0148 ms     0.0139 ms
    Mean CPU %Busy        74.4 %        75.4 %        72.5 %        81.5 %
    Disk Read             5,600 KB/s    4,590 KB/s    4,060 KB/s    4,280 KB/s
    Disk Write            12,800 KB/s   11,590 KB/s   10,380 KB/s   10,080 KB/s
    Network Read          22,460 KB/s   23,610 KB/s   21,390 KB/s   23,640 KB/s
    Network Write         18,600 KB/s   19,600 KB/s   17,810 KB/s   19,770 KB/s

    Node specification: Xen virtual images, AWS US East, three zones
    • Cassandra 0.8.6, CentOS, SunJDK6
    • AWS EC2 m1 Extra Large, standard price $0.68/hour
    • 15 GB RAM, 4 cores, 1 Gbit network
    • 4 internal disks (total 1.6 TB, striped together, md, XFS)
40. Time is Money

                            48 nodes      96 nodes      144 nodes     288 nodes
    Writes Capacity         174,373 w/s   366,828 w/s   537,172 w/s   1,099,837 w/s
    Storage Capacity        12.8 TB       25.6 TB       38.4 TB       76.8 TB
    Nodes Cost/hr           $32.64        $65.28        $97.92        $195.84
    Test Driver Instances   10            20            30            60
    Test Driver Cost/hr     $20.00        $40.00        $60.00        $120.00
    Cross-AZ Traffic        5 TB/hr       10 TB/hr      15 TB/hr      30 TB/hr (1)
    Traffic Cost/10min      $8.33         $16.66        $25.00        $50.00
    Setup Duration          15 minutes    22 minutes    31 minutes    66 minutes (2)
    AWS Billed Duration     1 hr          1 hr          1 hr          2 hr
    Total Test Cost         $60.97        $121.94       $182.92       $561.68

    (1) Estimated as two thirds of total network traffic
    (2) A workaround for a tooling bug slowed setup
41. Open Source
42. Open Source @ Netflix
    • Source at http://netflix.github.com
    • Binaries at Maven: https://issues.sonatype.org/browse/OSSRH-2116
43. Cassandra JMeter Plugin
    • Netflix uses JMeter across the fleet for load testing
    • The JMeter plugin provides a wide range of samplers for Get, Put, Delete and schema
      creation
    • Used extensively for loading data, Cassandra stress tests, feature testing, etc.
    • Described at https://github.com/Netflix/CassJMeter/wiki
44. Astyanax
    Available at http://github.com/netflix
    • Cassandra Java client
    • API abstraction on top of the Thrift protocol
    • "Fixed" connection pool abstraction (vs. Hector)
      – Round robin with failover
      – Retry-able operations not tied to a connection
      – Netflix PaaS Discovery service integration
      – Host reconnect (fixed interval or exponential backoff)
      – Token aware to save a network hop (lower latency)
      – Latency aware to avoid compacting/repairing nodes (lower variance)
    • Simplified use of serializers via method overloading (vs. Hector)
    • ConnectionPoolMonitor interface
    (A usage sketch follows this slide.)
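To show what the client API looks like in practice, here is a small, hedged sketch of a write and a read using the keyspace handle from the configuration sketch earlier in this deck. The column family name, row key and column names are placeholders, and method names may differ slightly between Astyanax versions.

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;

public class AstyanaxExample {
    // Placeholder column family: String row keys, String column names
    private static final ColumnFamily<String, String> CF_USER_INFO =
        new ColumnFamily<String, String>("user_info",
            StringSerializer.get(), StringSerializer.get());

    public static void writeAndRead(Keyspace keyspace) throws ConnectionException {
        // Write a row: putColumn overloads pick the serializer from the value
        // type, so there is no serializer boilerplate on each call
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_USER_INFO, "acct1234")
         .putColumn("firstname", "John", null)
         .putColumn("lastname", "Smith", null);
        m.execute();

        // Read the row back; a token-aware pool sends this straight to a
        // replica that owns the key
        ColumnList<String> row = keyspace.prepareQuery(CF_USER_INFO)
            .getKey("acct1234")
            .execute()
            .getResult();

        for (Column<String> column : row) {
            System.out.println(column.getName() + " = " + column.getStringValue());
        }
    }
}
```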
