Active Active - C* Behind the Scenes at Netflix

As more and more businesses move from enterprise IT solutions to web-scale cloud solutions to cater to growing customer needs, they need to be innovative and find ways for their applications and infrastructure to scale rapidly and be highly available.

High availability is an important requirement for any online business, and architecting around failures, expecting infrastructure to fail, and remaining highly available even then is the key to success. One such effort here at Netflix was the Active-Active implementation, which provides region resiliency. This presentation gives a brief overview of the Active-Active implementation and how it leveraged Cassandra’s architecture in the backend to achieve its goal. It covers our journey to Active-Active from Cassandra’s perspective, the data validation we did to prove the backend would work without impacting the customer experience, the problems we ran into such as long repair times and gc_grace settings, the lessons we learned, and what we would do differently next time around.

Presentation Transcript

  • ABOUT NETFLIX
  • NETFLIX
  • ACTIVE - ACTIVE
  • WHAT IS ACTIVE-ACTIVE? Also called dual active, this describes a network of independent processing nodes where each node has access to a replicated database, giving every node access to and use of a single application. In an active-active system, all requests are load balanced across all available processing capacity; when a failure occurs on a node, another node in the network takes its place.
  • DOES AN INSTANCE FAIL? • It can, plan for it • Bad code / configuration pushes • Latent issues • Hardware failure • Test with Chaos Monkey
  • DOES A ZONE FAIL? • Rarely, but happened before • Routing issues • DC-specific issues • App-specific issues within a zone • Test with Chaos Gorilla
  • DOES A REGION FAIL? • Full region – unlikely, very rare • Individual services can fail region-wide • Most likely a region-wide configuration issue • Test with Chaos Kong
  • EVERYTHING FAILS… EVENTUALLY • Keep your services running by embracing isolation and redundancy • Construct a highly agile and highly available service from ephemeral and assumed broken components
  • ISOLATION • Changes in one region should not affect others • Regional outage should not affect others • Network partitioning between regions should not affect functionality / operations
  • REDUNDANCY • Make more than one (of pretty much everything) • Specifically, distribute services across Availability Zones and regions
  • HISTORY: X-MAS EVE 2012 • Netflix multi-hour outage • US-East-1 regional Elastic Load Balancing issue • “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
  • ACTIVE-ACTIVE ARCHITECTURE
  • THE PROCESS
  • IDENTIFYING CLUSTERS FOR AA
  • SNITCH CHANGES • EC2Snitch uses private IPs • EC2MultiRegionSnitch uses public IPs (see the sketch below)
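    A minimal sketch of the snitch change, assuming a hand-edited cassandra.yaml (at Netflix, Priam generates this file, so the snippet illustrates the stock Cassandra setting rather than the Priam workflow):

      # cassandra.yaml, applied to every node before extending to the new region.
      # Ec2MultiRegionSnitch gossips over public IPs so that nodes in different
      # AWS regions can reach each other; the plain Ec2Snitch only knows private IPs.
      endpoint_snitch: Ec2MultiRegionSnitch

    Seed addresses must also be reachable across regions (public IPs) so the two halves of the ring can see each other.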
  • PRIAM.MULTIREGION.ENABLE = TRUE
      tcp 7101-7101  [10.190.21.36/32, 10.232.200.17/32, 10.33.573.26/32, 10.20.151.165/32, 10.226.99.46/32, 10.244.143.193/32]
      tcp 7103-7103  [54.196.221.136/32, 54.202.200.217/32, 54.203.57.226/32, 54.205.151.165/32, 54.226.99.46/32, 54.244.143.193/32]
  • SPIN UP NODES IN NEW REGION (us-east-1 → us-west-2)
  • UPDATE KEYSPACE Update keyspace <keyspace> with placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {us-east : 3, us-west-2 : 3}; (us-east: existing region and replication factor; us-west-2: new region and replication factor)
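    The statement above is the older cassandra-cli syntax used at the time. A roughly equivalent statement in modern CQL (the keyspace name is a placeholder) would be:

      -- Add the new region to the replication settings. With the EC2 snitches,
      -- us-east-1 reports as data center "us-east" and us-west-2 as "us-west-2".
      ALTER KEYSPACE <keyspace>
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'us-east': 3, 'us-west-2': 3};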
  • REBUILD NEW REGION Run – nodetool rebuild us-east-1 on all us-west-2 nodes
  • RUN NODETOOL REPAIR
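    A minimal sketch of the repair step, assuming a rolling repair run on one node at a time with the primary-range option so token ranges are not repaired repeatedly (the keyspace name is a placeholder):

      # Run on each node in turn, in both regions, after the rebuild completes
      nodetool repair -pr <keyspace>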
  • VALIDATION
  • BENCHMARKING GLOBAL CASSANDRA • Write-intensive test of cross-region replication capacity • 16 x hi1.4xlarge SSD nodes per zone = 96 total, 192 TB of SSD in six locations, up and running Cassandra in 20 minutes • Cassandra replicas in zones A, B, and C of both us-west-2 (Oregon) and us-east-1 (Virginia), with a test load in each region plus a validation load • 1 million writes at CL.ONE (wait for one replica to ack), 1 million reads after 500 ms at CL.ONE with no data loss • Inter-region traffic up to 9 Gbit/s at 83 ms latency • 18 TB of backups from S3
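    The numbers above come from Netflix's own test setup. Purely as a hypothetical sketch, a comparable write-heavy load at CL.ONE could be generated against a test cluster with the stock cassandra-stress tool (newer syntax; node addresses and thread count are placeholders):

      # Drive 1,000,000 writes at consistency level ONE from one load generator
      cassandra-stress write n=1000000 cl=ONE -rate threads=200 -node 10.0.0.1,10.0.0.2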
  • TEST FOR THUNDERING HERD
  • TEST FOR RETRIES (failure and retry paths)
  • KEY METRICS USED • 99th/95th Read Latency (Client & C*) • Dropped Metrics on C* • Exceptions on C* • Heap Usage on C* • CPU Usage (Client & C*) • Threads Pending on C*
  • CONFIGURATION FOR TEST • 24-node C* cluster on SSDs • 220 client instances • 70+ JMeter instances
  • C* IOPS
  • TOTAL READ IOPS / TOTAL WRITE IOPS
  • 95th LATENCY / 99th LATENCY
  • CHECK FOR CEILING
  • NETWORK PARTITION (between us-east-1 and us-west-2)
  • TAKEAWAYS
  • REPAIRS AFTER EXTENSION ARE PAINFUL !!
  • TIME TO REPAIR DEPENDS ON • Number of regions • Number of replicas • Data size • Amount of entropy
  • ADJUST GC_GRACE AFTER EXTENSION • Column family setting • Defined in seconds • Default 10 days • Tweak gc_grace settings to accommodate the time taken to repair (see the sketch below) • BEWARE of deleted columns
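    A minimal CQL sketch of raising gc_grace_seconds while the post-extension repair is still running (the keyspace, table, and 20-day value are illustrative):

      -- Default is 864000 seconds (10 days). Raise it if a full repair cannot
      -- finish within that window; deleted columns can reappear if tombstones
      -- are collected before every replica has been repaired.
      ALTER TABLE <keyspace>.<table> WITH gc_grace_seconds = 1728000;  -- 20 days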
  • RUNBOOK
  • PLAN FOR CAPACITY
  • CONSISTENCY LEVEL • Check the client for its consistency level setting • In a multi-regional cluster QUORUM <> LOCAL_QUORUM • Recommended consistency levels: LOCAL_ONE (CASSANDRA-6202) for reads and LOCAL_QUORUM for writes • For region resiliency avoid ALL or QUORUM calls (see the sketch below)
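    The real setting belongs in the client driver configuration; as a quick illustration, the difference can be exercised per session in cqlsh (keyspace, table, and key are placeholders):

      -- QUORUM counts replicas across both regions, so a remote-region outage or
      -- partition can block requests; LOCAL_QUORUM only waits on replicas in the
      -- local data center.
      CONSISTENCY LOCAL_QUORUM;
      SELECT * FROM <keyspace>.<table> WHERE key = 'some-key';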
  • HOW DO WE KNOW IT WORKS? CREATE CHAOS!!
  • Benchmark … time consuming, but worth it!