C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adrian Cockcroft
 


Netflix has updated and added new tools and benchmarks for Cassandra in the last year. In this talk we will cover the latest additions and recipes for the Astyanax Java client, updates to Priam to support Cassandra 1.2 Vnodes, plus newly released and upcoming tools that are all part of the NetflixOSS platform. Following on from the Cassandra on SSD on AWS benchmark that was run live during the 2012 Summit, we've been benchmarking a large write intensive multi-region cluster to see how far we can push it. Cassandra is the data storage and global replication foundation for the Cloud Native architecture that runs Netflix streaming for 36 Million users. Netflix is also offering a Cloud Prize for open source contributions to NetflixOSS, and there are ten categories including Best Datastore Integration and Best Contribution to Performance Improvements, with $10K cash and $5K of AWS credits for each winner. We'd like to pay you to use our free software!


  • Comment: How many GB/node is stored in Cassandra, and what resource utilization in terms of CPU/Memory/IO do you see on the Cassandra nodes during peak traffic?
  • When Netflix first moved to the cloud it was bleeding-edge innovation; we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to the cloud, and the principles, practices and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward-looking organizations are adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
  • Hive – thin metadata layer on top of S3. Used for ad-hoc analytics (Ursula for merge ETL). HiveQL gets compiled into a set of MR jobs (1 -> many). Hive is a CLI – it runs on the gateways, not like a relational DB server or a service that the query gets shipped to. Pig – used for ETL (can create DAGs, workflows for Hadoop processes); Pig scripts also get compiled into MR jobs. Java – straight-up Hadoop, not for the faint of heart; some recommendation algorithms are in Hadoop. Python/Java – UDFs. Applications such as Sting use the tools on some gateway to access all the various components. Next – focus on two key components: Data & Clusters.

C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adrian Cockcroft – Presentation Transcript

  • Netflix Open Source Tools and Benchmarks for Cassandra – June 2013 – Adrian Cockcroft – @adrianco #cassandra13 @NetflixOSS – http://www.linkedin.com/in/adriancockcroft
  • Cloud Native – Global Architecture – NetflixOSS Components
  • Cloud Native
  • Where time to market wins big: making a land grab, disrupting competitors (OODA), anything delivered as web services.
  • How Soon? Code features in days instead of months. Get hardware in minutes instead of weeks. Incident response in seconds instead of hours.
  • Tipping the Balance: Utopia vs. Dystopia
  • A new engineering challenge: construct a highly agile and highly available service from ephemeral and often broken components.
  • Inspiration
  • "Genius is one percent inspiration and ninety-nine percent perspiration."Thomas A. Edison
  • Perspiration… A Cloud Native Open Source Platform. See netflix.github.com
  • Netflix Platform Evolution: bleeding edge innovation (2009-2010), common pattern (2011-2012), shared pattern (2013-2014). Netflix started out several years ahead of the industry, but it's becoming commoditized now.
  • Goals: establish our solutions as best practices / standards; hire, retain and engage top engineers; build up the Netflix technology brand; benefit from a shared ecosystem.
  • Your perspiration… Boosting the @NetflixOSS Ecosystem. See netflix.github.com
  • Judges: Aino Corry (Program Chair for QCon/GOTO), Martin Fowler (Chief Scientist, Thoughtworks), Simon Wardley (Strategist), Yury Izrailevsky (VP Cloud, Netflix), Werner Vogels (CTO, Amazon), Joe Weinman (SVP Telx, author of "Cloudonomics").
  • What do you win? One winner in each of the 10 categories: a ticket and expenses to attend AWS Re:Invent 2013 in Las Vegas, and a trophy.
  • Cloud Prize process: entrants → Netflix engineering → six judges → winners. Nominations must conform to the rules: working code, community traction, fits the categories. Registration opened March 13; contributions are Apache-licensed on Github; entries close September 15 on Github; award ceremony dinner in November at AWS Re:Invent. Ten prize categories, each with $10K cash, $5K of AWS credits, AWS Re:Invent tickets and a trophy.
  • Netflix Streaming: a Cloud Native application based on an open source platform.
  • Netflix Member Web Site Home Page – personalization driven. How does it work?
  • How Netflix Streaming Works: a customer device (PC, PS3, TV…) talks to AWS cloud services – the web site or discovery API, user data, personalization, the streaming API, DRM, QoS logging, CDN management and steering, content encoding – and streams from Open Connect CDN boxes at CDN edge locations to consumer electronics.
  • Streaming bandwidth growth: Nov 2012 streaming bandwidth to March 2013 mean bandwidth, +39% in 6 months. Amazon Video at 1.31% (Netflix at roughly 18x to 25x Prime).
  • Real Web Server Dependencies Flow (Netflix home page business transaction as seen by AppDynamics). Start here: memcached, Cassandra, web services, an S3 bucket, and the personalization movie group choosers (for US, Canada and Latam). Each icon is three to a few hundred instances across three AWS zones.
  • Component Micro-Services – test with Chaos Monkey and Latency Monkey.
  • Three Balanced Availability Zones – test with Chaos Gorilla. Load balancers in front of Cassandra and Evcache replicas in Zone A, Zone B and Zone C.
  • Triple Replicated Persistence – Cassandra maintenance affects individual replicas. Load balancers in front of Cassandra and Evcache replicas in Zone A, Zone B and Zone C.
  • Isolated Regions – US-East load balancers and Cassandra replicas in Zones A, B and C; EU-West load balancers and Cassandra replicas in Zones A, B and C.
  • Failure Modes and Effects
        Application failure – probability High – automatic degraded response
        AWS region failure – probability Low – switch traffic between regions
        AWS zone failure – probability Medium – continue to run on 2 out of 3 zones
        Datacenter failure – probability Medium – migrate more functions to cloud
        Data store failure – probability Low – restore from S3 backups
        S3 failure – probability Low – restore from remote archive
    Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn't make sense. Working on it now.
  • Highly Available Storage: a highly scalable, available and durable deployment pattern based on Apache Cassandra.
  • Single Function Micro-Service Pattern: one keyspace, replaces a single table or materialized view. Single-function Cassandra cluster managed by Priam, between 6 and 144 nodes. Stateless data access REST service using the Astyanax Cassandra client. Optional datacenter update flow. Many different single-function REST clients. AppDynamics service flow visualization. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. Over 50 Cassandra clusters, over 1,000 nodes, over 30TB of backup, over 1M writes/s/cluster.
  • Stateless Micro-Service Architecture: Linux base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, non-Java apps; monitoring with log rotation to S3, the AppDynamics machine agent and Epic/Atlas; Java (JDK 6 or 7) with AppDynamics app agent monitoring, GC and thread dump logging; Tomcat running the application war file, base servlet, platform and client interface jars, Astyanax; healthcheck, status servlets, JMX interface, Servo autoscale.
  • Cassandra Instance Architecture: Linux base AMI (CentOS or Ubuntu); Tomcat and Priam on the JDK providing healthcheck and status; monitoring with the AppDynamics machine agent and Epic/Atlas; Java (JDK 7) with AppDynamics app agent monitoring, GC and thread dump logging; Cassandra server; local ephemeral disk space – 2TB of SSD or 1.6TB of disk holding the commit log and SSTables.
  • Priam – Cassandra Automation. Available at http://github.com/netflix
        • Netflix Platform Tomcat code
        • Zero touch auto-configuration
        • State management for the Cassandra JVM
        • Token allocation and assignment
        • Broken node auto-replacement
        • Full and incremental backup to S3
        • Restore sequencing from S3
        • Grow/shrink the Cassandra "ring"
  • Priam for C* 1.2 Vnodes
        • Prototype work started by Jason Brown
        • Completed – restructured Priam for Vnode management
        • To do – re-think the SSTable backup/restore strategy
  • Cloud Native Big Data: size the cluster to the data, size the cluster to the questions, never wait for space or answers.
  • Netflix Dataoven: a data warehouse of over 2 petabytes. Data pipelines – Ursula (~100 billion events/day from cloud services) and Aegisthus (terabytes of dimension data from C*). Hadoop clusters on AWS EMR – 1,300 nodes, 800 nodes, and multiple 150-node clusters nightly – plus RDS metadata, gateways and tools.
  • ETL for Cassandra
        • Data is de-normalized over many clusters!
        • Too many to restore from backups for ETL
        • Solution – read the backup files using Hadoop
        • Aegisthus – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
          – High throughput raw SSTable processing
          – Re-normalizes many clusters to a consistent view
          – Extract, Transform, then Load into Teradata
  • Global Architecture: local client traffic to Cassandra, synchronous replication across zones, asynchronous replication across regions.
  • Astyanax Cassandra Client for Java. Available at http://github.com/netflix (a minimal usage sketch follows below)
        • Features
          – Abstraction of the connection pool from the RPC protocol
          – Fluent style API
          – Operation retry with backoff
          – Token aware
          – Batch manager
          – Many useful recipes
          – New: Entity Mapper based on JPA annotations
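As a rough illustration of the fluent style, here is a minimal sketch along the lines of the Astyanax getting-started examples (Thrift transport assumed; the cluster name, keyspace, seed address and "users" column family are placeholders, not values from the talk):

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxExample {
    public static void main(String[] args) throws Exception {
        // Build a context: the connection pool is abstracted away from the RPC protocol.
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")                        // placeholder cluster name
                .forKeyspace("TestKeyspace")                      // placeholder keyspace
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)) // discover the ring, be token aware
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                        .setPort(9160)
                        .setMaxConnsPerHost(3)
                        .setSeeds("127.0.0.1:9160"))              // placeholder seed node
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        // Column family definition with key and column serializers.
        ColumnFamily<String, String> cfUsers = new ColumnFamily<String, String>(
                "users", StringSerializer.get(), StringSerializer.get());

        // Batched write using the fluent mutation API.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(cfUsers, "user-1234")
                .putColumn("firstname", "Jane", null)
                .putColumn("lastname", "Doe", null);
        m.execute();

        // Read the row back.
        ColumnList<String> row = keyspace.prepareQuery(cfUsers)
                .getKey("user-1234")
                .execute()
                .getResult();
        System.out.println(row.getColumnByName("firstname").getStringValue());

        context.shutdown();
    }
}
```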
  • Recipes (a lock usage sketch follows below)
        • Distributed row lock (without needing ZooKeeper)
        • Multi-region row lock
        • Uniqueness constraint
        • Multi-row uniqueness constraint
        • Chunked and multi-threaded large file storage
        • Reverse index search
        • All rows query
        • Durable message queue
        • Contributed: high cardinality reverse index
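For a sense of how the row lock recipe is used, a minimal sketch assuming the ColumnPrefixDistributedRowLock recipe (the "locks" column family name and row key are illustrative placeholders):

```java
import java.util.concurrent.TimeUnit;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.recipes.locks.ColumnPrefixDistributedRowLock;
import com.netflix.astyanax.serializers.StringSerializer;

public class RowLockExample {
    public static void updateWithLock(Keyspace keyspace) throws Exception {
        ColumnFamily<String, String> cfLocks = new ColumnFamily<String, String>(
                "locks", StringSerializer.get(), StringSerializer.get());

        // The lock is implemented as columns written into Cassandra itself,
        // so no external coordinator such as ZooKeeper is needed.
        ColumnPrefixDistributedRowLock<String> lock =
                new ColumnPrefixDistributedRowLock<String>(keyspace, cfLocks, "account_1234")
                        .expireLockAfter(60, TimeUnit.SECONDS); // auto-expire if the holder dies

        lock.acquire();
        try {
            // ... perform the read-modify-write that must not race ...
        } finally {
            lock.release();
        }
    }
}
```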
  • Astyanax Futures
        • Maintain backwards compatibility
        • Wrapper for the C* 1.2 Netty driver
        • More CQL support
        • NetflixOSS Cloud Prize ideas – DynamoDB backend? More recipes?
  • Astyanax – Cassandra Write Data Flows: single region, multiple availability zones, token aware.
        1. Client writes to local coordinator
        2. Coordinator writes to other zones
        3. Nodes return ack
        4. Data written to internal commit log disks (no more than 10 seconds later)
    If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
  • Data Flows for Multi-Region Writes: token aware, consistency level = local quorum (see the configuration sketch below).
        1. Client writes to local replicas
        2. Local write acks returned to the client, which continues when 2 of 3 local nodes are committed
        3. Local coordinator writes to the remote coordinator (100+ ms inter-region latency)
        4. When the data arrives, the remote coordinator node acks and copies to the other remote zones
        5. Remote nodes ack to the local coordinator
        6. Data flushed to internal commit log disks (no more than 10 seconds later)
    If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
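The local quorum behavior described above is chosen by the client, not the server. A hedged sketch of how this might be expressed with Astyanax (the configuration object is passed to AstyanaxContext.Builder as in the earlier sketch; the method and key names here are placeholders for your own):

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.model.ConsistencyLevel;

public class LocalQuorumExample {

    // Defaults applied to every read and write made through a keyspace built with
    // this configuration. CL_LOCAL_QUORUM acks once 2 of 3 replicas in the local
    // region have committed, so cross-region replication stays asynchronous.
    public static AstyanaxConfigurationImpl localQuorumConfig() {
        return new AstyanaxConfigurationImpl()
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
    }

    // Per-operation override: a read that is happy with a single local replica.
    public static ColumnList<String> readOne(Keyspace keyspace,
                                             ColumnFamily<String, String> cf,
                                             String rowKey) throws Exception {
        return keyspace.prepareQuery(cf)
                .setConsistencyLevel(ConsistencyLevel.CL_ONE)
                .getKey(rowKey)
                .execute()
                .getResult();
    }
}
```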
  • Scalability from 48 to 288 nodes on AWS – http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html. Client writes/s by node count (replication factor = 3): 48 nodes – 174,373; 96 nodes – 366,828; 144 nodes – 537,172; 288 nodes – 1,099,837. Used 288 m1.xlarge instances (4 CPU, 15 GB RAM, 8 ECU), Cassandra 0.8.6. The benchmark config only existed for about 1 hour.
  • Cassandra Disk vs. SSD Benchmark: same throughput, lower latency, half the cost.
  • 2013 – Cross Region Use Cases
        • Geographic isolation – US to Europe replication of subscriber data; read intensive, low update rate; in production use since late 2011
        • Redundancy for regional failover – US East to US West replication of everything; includes write intensive data with a high update rate; testing now
  • 2013 – Benchmarking Global Cassandra: a write intensive test of cross-region capacity. 16 x hi1.4xlarge SSD nodes per zone = 96 total. Cassandra replicas in Zones A, B and C of US-West-2 (Oregon) and US-East-1 (Virginia), with test load and validation load generators, inter-zone and inter-region traffic, and S3. 1 million writes at CL.ONE and 1 million reads at CL.ONE with no data loss.
  • Copying 18TB from East to West: Cassandra bootstrap at 9.3 Gbit/s, single threaded, 48 nodes to 48 nodes. Thanks to boundary.com for these network analysis plots.
  • Inter-Region Traffic Test: verified at desired capacity, no problems, 339 MB/s, 83 ms latency.
  • Ramp Up Load Until It Breaks! With unmodified tuning, client data started being dropped at 1.93 GB/s of inter-region traffic. Spare CPU, IOPS and network remain; it just needs some Cassandra tuning to go further.
  • Managing Multi-Region Availability: regional load balancers in front of Cassandra replicas in Zones A, B and C in each region, with UltraDNS, DynECT DNS and AWS Route53 above them. Denominator – manage traffic via multiple DNS providers.
  • How does it all fit together?
  • Example Application – RSS Reader
  • Continuous Build and Deployment: Github NetflixOSS source, the AWS base AMI and Maven Central feed Cloudbees Jenkins with Dynaslave AWS build slaves; the Aminator bakery produces AWS baked AMIs; the Asgard (+ Frigga) console and the Odin orchestration API deploy them into the AWS account.
  • NetflixOSS Services Scope: in each AWS account – Asgard console, Archaius config service, cross-region Priam C*, Pytheas dashboards, Atlas monitoring, Genie and Lipstick Hadoop services, AWS usage cost monitoring; across multiple AWS regions – Eureka registry, Exhibitor ZK, Edda history, Simian Army, Zuul traffic manager; across 3 AWS zones – application clusters, autoscale groups and instances, Priam Cassandra persistent storage, Evcache memcached ephemeral storage.
  • NetflixOSS Instance Libraries (a Hystrix command sketch follows below)
        • Initialization – baked AMI (Tomcat, Apache, your code); Governator – Guice-based dependency injection; Archaius – dynamic configuration properties client; Eureka – service registration client
        • Service requests – Karyon – base server for inbound requests; RxJava – reactive pattern; Hystrix/Turbine – dependencies and real-time status; Ribbon – REST client for outbound calls
        • Data access – Astyanax – Cassandra client and pattern library; Evcache – zone-aware memcached client; Curator – ZooKeeper patterns; Denominator – DNS routing abstraction
        • Logging – Blitz4j – non-blocking logging; Servo – metrics export for autoscaling; Atlas – high-volume instrumentation
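To make the service-request layer concrete, here is a hedged sketch of the Hystrix piece of that stack: a command that isolates a dependency call and returns a degraded fallback when the dependency fails or runs slow. The command name, group key and return values are illustrative, not from the talk.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetSubscriberCommand extends HystrixCommand<String> {
    private final String subscriberId;

    public GetSubscriberCommand(String subscriberId) {
        // Commands in the same group share a thread pool, isolating this dependency.
        super(HystrixCommandGroupKey.Factory.asKey("SubscriberService"));
        this.subscriberId = subscriberId;
    }

    @Override
    protected String run() throws Exception {
        // Normally this would call the remote service, e.g. via a Ribbon REST client.
        return "subscriber-" + subscriberId;
    }

    @Override
    protected String getFallback() {
        // Automatic degraded response when the dependency is failing or too slow.
        return "default-subscriber";
    }
}

// Usage: String subscriber = new GetSubscriberCommand("1234").execute();
```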
  • NetflixOSS Testing and Automation
        • Test tools – CassJMeter – load testing for Cassandra; Circus Monkey – test account reservation rebalancing; Gcviz – garbage collection visualization
        • Maintenance – Janitor Monkey – cleans up unused resources; Efficiency Monkey; Doctor Monkey; Howler Monkey – complains about AWS limits
        • Availability – Chaos Monkey – kills instances; Chaos Gorilla – kills availability zones; Chaos Kong – kills regions; Latency Monkey – latency and error injection
        • Security – Security Monkey – security group and S3 bucket permissions; Conformity Monkey – architectural pattern warnings
  • Dashboards with Pytheas (Explorers) – http://techblog.netflix.com/2013/05/announcing-pytheas.html
        • Cassandra Explorer – browse clusters, keyspaces, column families
        • Base Server Explorer – browse service endpoint configuration and performance
        • Anything else you want to build
  • Cassandra Clusters
  • AWS Usage (coming soon): reservation-aware cost monitoring and reporting.
  • What's Coming Next? More use cases, more features, better portability, higher availability, easier to deploy, contributions from end users, contributions from vendors.
  • Functionality and scale now, portability coming. Moving from parts to a platform in 2013. Netflix is fostering a cloud native ecosystem. Rapid evolution – low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen).
  • Takeaway: NetflixOSS makes it easier for everyone to become Cloud Native. @adrianco #cassandra13 @NetflixOSS
  • Slideshare NetflixOSS Details
        • Lightning Talks Feb S1E1 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
        • Asgard In Depth Feb S1E1 – http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
        • Lightning Talks March S1E2 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-roadmap
        • Security Architecture – http://www.slideshare.net/jason_chan/
        • Cost Aware Cloud Architectures – with Jinesh Varia of AWS – http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-varia-aws-and-adrian-cockroft-netflix