A summary of past Cassandra benchmarks performed by Netflix and a description of how Netflix uses Cassandra, interspersed with a live demo, automated with Jenkins and JMeter, that created two 12-node Cassandra clusters from scratch on AWS: one with regular disks and one with SSDs. Both clusters were scaled up to 24 nodes each during the demo.
Cassandra Performance and Scalability on AWS
1. Cassandra Performance and Scalability on AWS
August 8th, 2012
Adrian Cockcroft
@adrianco #netflixcloud #cassandra12
http://www.linkedin.com/in/adriancockcroft
14. Scalability from 48 to 288 nodes on AWS
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Chart: Client Writes/s by node count – Replication Factor = 3
48 nodes: 174,373 writes/s; 96 nodes: 366,828; 144 nodes: 537,172; 288 nodes: 1,099,837
Used 288 of m1.xlarge (4 CPU, 15 GB RAM, 8 ECU), Cassandra 0.8.6
Benchmark config only existed for about 1hr
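Worked out per node, the chart shows near-linear scaling: roughly 174,373 / 48 ≈ 3,633 client writes/s per node at the small end and 1,099,837 / 288 ≈ 3,819 writes/s per node at 288 nodes, so adding nodes added throughput almost proportionally.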
15. Blah Blah Blah
(I’m skipping all the cloud intro etc. Netflix
runs in the cloud, if you hadn’t figured that
out already you aren’t paying attention and
should go read slideshare.net/netflix)
16. “Some people skate to the puck,
I skate to where the puck is going to be”
Wayne Gretzky
21. Major Front End Services
• Non-member Web Site www.netflix.com
– Marketing driven, sign up flow, SOX/PCI scope
• Member Web Site movies.netflix.com
– Personalization driven
• CDNs for delivering bulk video/audio
– Netflix CDN: openconnect.netflix.com
• API for external and device user interfaces
– Mostly private APIs, public API docs at developer.netflix.com
• API for controlling video playback
– DRM, QoS management, Bookmarks
22. Netflix Deployed on AWS
• 2009 – Content: Content Management, EC2 Encoding, S3 Petabytes
• 2009 – Logs: S3 Terabytes, EMR Hive & Pig, Business Intelligence
• 2010 – Play: DRM, CDN routing, Bookmarks, Logging
• 2010 – WWW: Sign-Up, Search, Movie Choosing, Ratings
• 2010 – API: Metadata, Device Config, TV Movie Choosing, Social Facebook
• 2011 – CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
• Delivery: CDNs -> ISPs (Terabits) -> Customers
23. Cassandra Instance Architecture
• Linux Base AMI (CentOS)
• Java7
• Tomcat/Java7 running Priam – Cassandra Manager
– Token management, backups, autoscaling
• Cassandra 1.0.9
• AppDynamics appagent monitoring Cassandra
• AppDynamics machineagent
• Monitoring, log rotation, GC and thread dump logging, etc.
24. Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra “ring”
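The token allocation and grow/shrink features above come down to spacing initial tokens evenly around the ring. The sketch below is illustrative only (it is not Priam's code) and assumes the RandomPartitioner token range of 0 to 2^127:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only (not Priam's actual implementation): evenly
// spaced initial tokens for an N-node ring under the RandomPartitioner,
// whose token range is 0 .. 2^127.
public class TokenSketch {
    static final BigInteger RANGE = BigInteger.valueOf(2).pow(127);

    static List<BigInteger> evenTokens(int nodeCount) {
        List<BigInteger> tokens = new ArrayList<BigInteger>();
        for (int i = 0; i < nodeCount; i++) {
            // token_i = i * 2^127 / nodeCount
            tokens.add(RANGE.multiply(BigInteger.valueOf(i))
                            .divide(BigInteger.valueOf(nodeCount)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // e.g. one token per node for a 12-node ring, as in the demo clusters
        for (BigInteger t : evenTokens(12)) {
            System.out.println(t);
        }
    }
}
```

Doubling a ring this way keeps data movement predictable: each new node's token falls halfway between two existing tokens, so it takes over half of one existing node's range.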
25. Astyanax
Available at http://github.com/netflix
• Features
– Complete abstraction of connection pool from RPC protocol
– Fluent Style API
– Operation retry with backoff
– Token aware
• Recipes
– Distributed row lock (without Zookeeper)
– Multi-DC row lock
– Uniqueness constraint
– Multi-row uniqueness constraint
– Large file storage
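A minimal usage sketch against the 2012-era Astyanax 1.x API, showing the fluent style and the connection-pool abstraction; class and method names are recalled from that era and may differ slightly in other versions:

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxSketch {
    public static void main(String[] args) throws Exception {
        // Build a context: the connection pool is abstracted from the RPC protocol
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("TestCluster")
            .forKeyspace("TestKeyspace")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl())
            .withConnectionPoolConfiguration(
                new ConnectionPoolConfigurationImpl("MyPool")
                    .setPort(9160)
                    .setMaxConnsPerHost(3)
                    .setSeeds("127.0.0.1:9160"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getEntity();   // getClient() in later versions

        // Serializers are defined once on the ColumnFamily, not per call
        ColumnFamily<String, String> cf = new ColumnFamily<String, String>(
            "Standard1", StringSerializer.get(), StringSerializer.get());

        // Fluent-style write; the wrapped operation can be retried with backoff
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(cf, "user-123").putColumn("email", "someone@example.com", null);
        m.execute();

        // Fluent-style read of the same row
        ColumnList<String> row = keyspace.prepareQuery(cf)
            .getKey("user-123")
            .execute()
            .getResult();
        System.out.println(row.getColumnByName("email").getStringValue());
    }
}
```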
32. High Availability
• Cassandra stores 3 local copies, 1 per zone
– Synchronous access, durable, highly available
– Read/Write ONE is fastest; use for fire and forget
– Read/Write QUORUM (2 of 3); use for read-after-write
• AWS Availability Zones
– Separate buildings
– Separate power etc.
– Fairly close together
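With RF = 3, QUORUM reads and QUORUM writes each touch 2 of the 3 replicas, so they always overlap on at least one replica; that overlap is what makes read-after-write work. A hedged Astyanax 1.x snippet for setting those defaults (method names from memory):

```java
import com.netflix.astyanax.AstyanaxConfiguration;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ConsistencyLevel;

public class ConsistencySketch {
    // Defaults applied to every read/write issued through a Keyspace built
    // with this configuration (Astyanax 1.x method names, from memory).
    static AstyanaxConfiguration quorumDefaults() {
        return new AstyanaxConfigurationImpl()
            .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_QUORUM)   // read-after-write
            .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_QUORUM); // 2 of 3 replicas
        // CL_ONE instead gives the fastest, fire-and-forget behavior.
    }
}
```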
33. “Traditional” Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Not Token Aware
(Diagram: non token aware clients writing to Cassandra nodes spread across Zones A, B and C)
1. Client writes to any Cassandra node
2. Coordinator node replicates to nodes and zones
3. Nodes return ack to coordinator
4. Coordinator returns ack to client
5. Data written to internal commit log disk (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
34. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
(Diagram: token aware clients writing directly to replica nodes spread across Zones A, B and C)
1. Client writes to Cassandra nodes and zones
2. Nodes return ack to client
3. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
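The difference between the two flows above is whether the client routes each operation directly to a replica that owns the row's token. A hedged Astyanax 1.x snippet for requesting token-aware routing (enum and method names from memory):

```java
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;

public class TokenAwareSketch {
    // Token-aware routing sends each operation straight to a replica that
    // owns the row key's token, skipping the extra coordinator hop shown
    // in the "traditional" flow (Astyanax 1.x names, from memory).
    static AstyanaxConfigurationImpl tokenAwareConfig() {
        return new AstyanaxConfigurationImpl()
            .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)       // learn the ring from Cassandra
            .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE);  // route by token
    }
}
```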
35. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
(Diagram: US and EU clients writing to Cassandra nodes spread across Zones A, B and C in each region, with 100+ms latency between regions)
1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator (100+ms latency between regions)
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up.
Nightly global compare and repair jobs ensure everything stays consistent.
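A hedged Astyanax 1.x snippet for issuing a write at LOCAL_QUORUM, so the client waits only for 2 of 3 replicas in its own region while the remote region is updated asynchronously (the column family and values are made up for the example):

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;

public class LocalQuorumSketch {
    static final ColumnFamily<String, String> CF = new ColumnFamily<String, String>(
        "Standard1", StringSerializer.get(), StringSerializer.get());

    // Wait only for 2 of 3 replicas in the local region; replication to the
    // remote region happens asynchronously (Astyanax 1.x names, from memory).
    static void writeLocalQuorum(Keyspace keyspace) throws Exception {
        MutationBatch m = keyspace.prepareMutationBatch()
            .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
        m.withRow(CF, "user-123").putColumn("last_bookmark", "42:17", null);
        m.execute();
    }
}
```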
36. Extending to Multi-Region
Added production UK/Ireland support with no downtime
Minimize impact on original cluster using bulk backup move
Take a Boeing 737 on a domestic flight, upgrade it to a 747 by adding more engines, fuel and bigger wings, and fly it to Europe without landing it on the way…
1. Create cluster in EU
2. Backup US cluster to S3
3. Restore backup in EU
4. Local repair EU cluster
5. Global repair/join
(Diagram: US and EU Cassandra clusters, each spread across Zones A, B and C, linked via S3, with 100+ms latency between regions)
37. Cassandra Backup
• Full Backup
– Time based snapshot
– SSTable compress -> S3
• Incremental Backup
– SSTable write triggers compressed copy to S3
• Archive
– Copy cross region
(Diagram: Cassandra nodes backing up to S3)
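An illustrative sketch of the incremental backup idea, not Priam's actual code: when a new SSTable is flushed, compress it and copy it to S3 with the AWS SDK for Java. The bucket and key layout here are invented for the example:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch only (not Priam's backup code): gzip one newly
// written SSTable file and copy it to S3 under a per-node prefix.
public class SSTableBackupSketch {
    public static void backup(File sstable, String bucket, String node) throws Exception {
        File gz = new File(sstable.getPath() + ".gz");

        // Compress the SSTable before upload
        FileInputStream in = new FileInputStream(sstable);
        GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(gz));
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
        }
        in.close();
        out.close();

        // Copy to S3; credentials come from the environment or instance role.
        // The "incremental" key layout is hypothetical.
        AmazonS3 s3 = new AmazonS3Client();
        s3.putObject(bucket, node + "/incremental/" + gz.getName(), gz);
    }
}
```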
38. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
– http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
– High throughput raw SSTable processing
– Re-normalizes many clusters to a consistent view
– Extract, Transform, then Load into Teradata
39. Netflix Open Source Strategy
• Release PaaS Components git-by-git
– Source at github.com/netflix – we build from it…
– Intros and techniques at techblog.netflix.com
– Blog post or new code every few weeks
• Motivations
– Give back to Apache licensed OSS community
– Motivate, retain, hire top engineers
– “Peer pressure” code cleanup, external contributions
40. Open Source Projects and Posts
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
• Priam – Cassandra as a Service
• Exhibitor – Zookeeper as a Service
• Servo and Autoscaling Scripts
• Astyanax – Cassandra client for Java
• Curator – Zookeeper Patterns
• Honu – Log4j streaming to Hadoop
• EVCache – Memcached as a Service
• CassJMeter – Cassandra test suite
• Circuit Breaker – Robust service pattern
• Cassandra – Multi-region EC2 datastore support
• Asgard – AutoScaleGroup based AWS console
• Eureka / Discovery – Service Directory
• Aegisthus – Hadoop ETL for Cassandra
• Archaius – Dynamic Properties Service
• Chaos Monkey – Robustness verification
41. Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
• Computers (Datacenter or AWS) randomly die
– Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
– Allow any instance to fail without customer impact
• Chaos Monkey hours
– Monday-Friday 9am-3pm random instance kill
• Application configuration option
– Apps now have to opt-out from Chaos Monkey
43. Roadmap for 2012
• More resiliency and improved availability
• More automation, orchestration
• “Hardening” the platform, code clean-up
• Lower latency for web services and devices
• IPv6 – running now, see techblog for details
• More open sourced components
• Las Vegas in November - AWS Re:Invent
45. Disclaimers
• We didn’t have time to tune the demo
• These are the plots from the live demo run
• Run’s need to be longer to get to steady state
• Data size only reached around 5GB per node
• Plenty of “I wonder why it did that” remains
• It’s a fair comparison, but not the best absolute
performance possible for this workload and
configuration
• When you remove the IO bottleneck, the next
few bottlenecks appear…
46. Activity during the talk 10:30-11:30
Custom AppDynamics dashboard showing CPU and IOPS per node
47. Jmeter Plots
• Plots are the output of the Jenkins build
• Each instance has its own set of plots
• Each availability zone has its own summary plots
• One of the three zone summary plots is compared for
each metric
• Plot collection is currently duplicated as we are
transitioning from “Epic” to “Atlas”
56. Takeaway
Netflix has built and deployed a scalable global platform based on
Cassandra and AWS.
Key components of the Netflix PaaS are being released as Open Source
projects so you can build your own custom PaaS.
If you like lots of SSDs, come and work for us…
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud #cassandra12
Editor's Notes
• Complete connection pool abstraction. Queries and mutations are wrapped in objects created by the Keyspace implementation, making it possible to retry failed operations. This differs from other connection pool implementations, in which the operation is created on a specific connection and must be completely redone if it fails.
• Simplified serialization via method overloading. The low level thrift library only understands data that is serialized to a byte array. Hector requires serializers to be specified for nearly every call. Astyanax minimizes the places where serializers are specified by using predefined ColumnFamily and ColumnPath definitions which specify the serializers. The API also overloads set and get operations for common data types.
• The internal library does not log anything. All internal events are instead ... calls to a ConnectionPoolMonitor interface. This allows customization of log levels and filtering of repeating events outside of the scope of the connection pool.
• Super columns will soon be replaced by Composite column names. As such it is recommended to not use super columns at all and to use Composite column names instead. There is some support for super columns in Astyanax, but those methods have been deprecated and will eventually be removed.