SlideShare a Scribd company logo
Details:
Building Using The NetflixOSS
Architecture
May 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Architectures for High Availability
Cassandra Storage and Replication
NetflixOSS Components
Component Micro-Services
Test With Chaos Monkey, Latency Monkey
Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
Isolated Regions
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Failure Modes and Effects
Failure Mode Probability Current Mitigation Plan
Application Failure High Automatic degraded response
AWS Region Failure Low Active-Active Using Denominator
AWS Zone Failure Medium Continue to run on 2 out of 3 zones
Datacenter Failure Medium Migrate more functions to cloud
Data store failure Low Restore from S3 backups
S3 failure Low Restore from remote archive
Until we got really good at mitigating high and medium
probability failures, the ROI for mitigating regional failures
didn’t make sense. Working on Active-Active in 2013.
Application Resilience
Run what you wrote
Rapid detection
Rapid Response
Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure its not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
• Dynamic configuration options - Archaius
– http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
Resilient Design – Hystrix, RxJava
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
• Computers (Datacenter or AWS) randomly die
– Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
– Kill individual instances without customer impact
• Latency Monkey (coming soon)
– Inject extra latency and error return codes
Monkeys
Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
Edda
AWS
Instances,
ASGs, etc.
Eureka
Services
metadata
AppDynamics
Request flow
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b”]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
Platform Outage Taxonomy
Classify and name the different types
of things that can go wrong
YOLO
Zone Failure Modes
• Power Outage
– Instances lost, ephemeral state lost
– Clean break and recovery, fail fast, “no route to host”
• Network Outage
– Instances isolated, state inconsistent
– More complex symptoms, recovery issues, transients
• Dependent Service Outage
– Cascading failures, misbehaving instances, human errors
– Confusing symptoms, recovery issues, byzantine effects
Zone Power Failure
• June 29, 2012 AWS US-East - The Big Storm
– http://aws.amazon.com/message/67457/
– http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
• Highlights
– One of 10+ US-East datacenters failed generator startup
– UPS depleted -> 10min power outage for 7% of instances
• Result
– Netflix lost power to most of a zone, evacuated the zone
– Small/brief user impact due to errors and retries
Zone Failure Modes
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Zone Power Outage
Zone Network Outage
Zone Dependent
Service Outage
Regional Failure Modes
• Network Failure Takes Region Offline
– DNS configuration errors
– Bugs and configuration errors in routers
– Network capacity overload
• Control Plane Overload Affecting Entire Region
– Consequence of other outages
– Lose control of remaining zones infrastructure
– Cascading service failure, hard to diagnose
Regional Control Plane Overload
• April 2011 – “The big EBS Outage”
– http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
– Human error during network upgrade triggered cascading failure
– Zone level failure, with brief regional control plane overload
• Netflix Infrastructure Impact
– Instances in one zone hung and could not launch replacements
– Overload prevented other zones from launching instances
– Some MySQL slaves offline for a few days
• Netflix Customer Visible Impact
– Higher latencies for a short time
– Higher error rates for a short time
– Outage was at a low traffic level time, so no capacity issues
Regional Failure Modes
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Regional Network Outage
Control Plane Overload
Dependent Services Failure
• June 29, 2012 AWS US-East - The Big Storm
– Power failure recovery overloaded EBS storage service
– Backlog of instance startups using EBS root volumes
• ELB (Load Balancer) Impacted
– ELB instances couldn’t scale because EBS was backlogged
– ELB control plane also became backlogged
• Mitigation Plans Mentioned
– Multiple control plane request queues to isolate backlog
– Rapid DNS based traffic shifting between zones
Application Routing Failure
June 29, 2012 AWS US-East - The Big Storm
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Zone Power Outage
Applications not using
Zone-aware routing kept
trying to talk to dead
instances and timing out
Eureka service directory failed to mark down
dead instances due to a configuration error
Effect: higher latency and errors
Mitigation: Fixed config, and made
zone aware routing the default
Dec 24th 2012
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Partial Regional ELB Outage
• ELB (Load Balancer) Impacted
– ELB control plane database state accidentally corrupted
– Hours to detect, hours to restore from backups
• Mitigation Plans Mentioned
– Tighter process for access to control plane
– Better zone isolation
Global Failure Modes
• Software Bugs
– Externally triggered (e.g. leap year/leap second)
– Memory leaks and other delayed action failures
• Global configuration errors
– Usually human error
– Both infrastructure and application level
• Cascading capacity overload
– Customers migrating away from a failure
– Lack of cross region service isolation
Global Software Bug Outages
• AWS S3 Global Outage in 2008
– Gossip protocol propagated errors worldwide
– No data loss, but service offline for up to 9hrs
– Extra error detection fixes, no big issues since
• Microsoft Azure Leap Day Outage in 2012
– Bug failed to generate certificates ending 2/29/13
– Failure to launch new instances for up to 13hrs
– One line code fix.
• Netflix Configuration Error in 2012
– Global property updated to broken value
– Streaming stopped worldwide for ~1hr until we changed back
– Fix planned to keep history of properties for quick rollback
Global Failure Modes
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Cascading Capacity Overload
Software Bugs and Global
Configuration Errors
Capacity Demand Migrates
“Oops…”
Denominator
Portable DNS management
API Models (varied
and mostly broken)
DNS Vendor Plug-in
Common Model
Use Cases
Edda, Multi-
Region
Failover
Denominator
AWS Route53
IAM Key Auth
REST
DynECT
User/pwd
REST
UltraDNS
User/pwd
SOAP
Etc…
Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
Highly Available Storage
A highly scalable, available and
durable deployment pattern
Micro-Service Pattern
One keyspace, replaces a single table or materialized view
Single function Cassandra
Cluster Managed by Priam
Between 6 and 72 nodes
Stateless Data Access REST Service
Astyanax Cassandra Client
Optional
Datacenter
Update Flow
Many Different Single-Function REST Clients
Appdynamics Service Flow Visualization
Each icon represents a horizontally scaled service of three to
hundreds of instances deployed over three availability zones
Stateless Micro-Service Architecture
Linux Base AMI (CentOS or Ubuntu)
Optional
Apache
frontend,
memcached,
non-java apps
Monitoring
Log rotation
to S3
AppDynamics
machineagent
Epic/Atlas
Java (JDK 6 or 7)
AppDynamics
appagent
monitoring
GC and thread
dump logging
Tomcat
Application war file, base
servlet, platform, client
interface jars, Astyanax
Healthcheck, status
servlets, JMX interface,
Servo autoscale
Astyanax
Available at http://github.com/netflix
• Features
– Complete abstraction of connection pool from RPC protocol
– Fluent Style API
– Operation retry with backoff
– Token aware
• Recipes
– Distributed row lock (without zookeeper)
– Multi-DC row lock
– Uniqueness constraint
– Multi-row uniqueness constraint
– Chunked and multi-threaded large file storage
Initializing Astyanax
// Configuration either set in code or nfastyanax.properties
platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
netflix.environment=test
default.astyanax.readConsistency=CL_QUORUM
default.astyanax.writeConsistency=CL_QUORUM
MyCluster.MyKeyspace.astyanax.servers=127.0.0.1
// Must initialize platform for discovery to work
NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);
// Open a keyspace instance
Keyspace keyspace = KeyspaceFactory.openKeyspace(”MyCluster”,”MyKeyspace");
Astyanax Query Example
Paginate through all columns in a row
ColumnList<String> columns;
int pageize = 10;
try {
RowQuery<String, String> query = keyspace
.prepareQuery(CF_STANDARD1)
.getKey("A")
.setIsPaginating()
.withColumnRange(new RangeBuilder().setMaxSize(pageize).build());
while (!(columns = query.execute().getResult()).isEmpty()) {
for (Column<String> c : columns) {
}
}
} catch (ConnectionException e) {
}
Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
Token
Aware
Clients
Cassandra
•Disks
•Zone A
Cassandra
•Disks
•Zone B
Cassandra
•Disks
•Zone C
Cassandra
•Disks
•Zone A
Cassandra
•Disks
•Zone B
Cassandra
•Disks
•Zone C
1. Client Writes to local
coordinator
2. Coodinator writes to
other zones
3. Nodes return ack
4. Data written to
internal commit log
disks (no more than
10 seconds later)
If a node goes offline,
hinted handoff
completes the write
when the node comes
back up.
Requests can choose to
wait for one node, a
quorum, or all nodes to
ack the write
SSTable disk writes and
compactions occur
asynchronously
1
4
4
4
2
3
3
3
2
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
US
Clients
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
1. Client writes to local replicas
2. Local write acks returned to
Client which continues when
2 of 3 local nodes are
committed
3. Local coordinator writes to
remote coordinator.
4. When data arrives, remote
coordinator node acks and
copies to other remote zones
5. Remote nodes ack to local
coordinator
6. Data flushed to internal
commit log disks (no more
than 10 seconds later)
If a node or region goes offline, hinted handoff
completes the write when the node comes back up.
Nightly global compare and repair jobs ensure
everything stays consistent.
EU
Clients
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
6
5
5
6 6
4
4
4
1
6
6
6
2
2
2
3
100+ms latency
Cassandra Instance Architecture
Linux Base AMI (CentOS or Ubuntu)
Tomcat and
Priam on JDK
Healthcheck,
Status
Monitoring
AppDynamics
machineagent
Epic/Atlas
Java (JDK 7)
AppDynamics
appagent
monitoring
GC and thread
dump logging
Cassandra Server
Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk
holding Commit log and SSTables
Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra “ring”
ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
– http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
– High throughput raw SSTable processing
– Re-normalizes many clusters to a consistent view
– Extract, Transform, then Load into Teradata
Cloud Architecture Patterns
Where do we start?
Datacenter to Cloud Transition Goals
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
– Measured as mean and 99th percentile
– For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
– No central vertically scaled databases
– Leverage AWS elastic capacity effectively
• Available
– Substantially higher robustness and availability than datacenter services
– Leverage multiple AWS availability zones
– No scheduled down time, no central database schema to change
• Productive
– Optimize agility of a large development team with automation and tools
– Leave behind complex tangled datacenter code base (~8 year old architecture)
– Enforce clean layered interfaces and re-usable components
Datacenter Anti-Patterns
What do we currently do in the
datacenter that prevents us from
meeting our goals?
Rewrite from Scratch
Not everything is cloud specific
Pay down technical debt
Robust patterns
Netflix Datacenter vs. Cloud Arch
Central SQL Database Distributed Key/Value NoSQL
Sticky In-Memory Session Shared Memcached Session
Chatty Protocols Latency Tolerant Protocols
Tangled Service Interfaces Layered Service Interfaces
Instrumented Code Instrumented Service Patterns
Fat Complex Objects Lightweight Serializable Objects
Components as Jar Files Components as Services
Tangled Service Interfaces
• Datacenter implementation is exposed
– Oracle SQL queries mixed into business logic
• Tangled code
– Deep dependencies, false sharing
• Data providers with sideways dependencies
– Everything depends on everything else
Anti-pattern affects productivity, availability
Untangled Service Interfaces
Two layers:
• SAL - Service Access Library
– Basic serialization and error handling
– REST or POJO’s defined by data provider
• ESL - Extended Service Library
– Caching, conveniences, can combine several SALs
– Exposes faceted type system (described later)
– Interface defined by data consumer in many cases
Service Interaction Pattern
Sample Swimlane Diagram
NetflixOSS Details
• Platform entities and services
• AWS Accounts and access management
• Upcoming and recent NetflixOSS components
• In-depth on NetflixOSS components
Basic Platform Entities
• AWS Based Entities
– Instances and Machine Images, Elastic IP Addresses
– Security Groups, Load Balancers, Autoscale Groups
– Availability Zones and Geographic Regions
• NetflixOS Specific Entities
– Applications (registered services)
– Clusters (versioned Autoscale Groups for an App)
– Properties (dynamic hierarchical configuration)
Core Platform Services
• AWS Based Services
– S3 storage, to 5TB files, parallel multipart writes
– SQS – Simple Queue Service. Messaging layer.
• Netflix Based Services
– EVCache – memcached based ephemeral cache
– Cassandra – distributed persistent data store
Security Architecture
• Instance Level Security baked into base AMI
– Login: ssh only allowed via portal (not between instances)
– Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management
– Each app has its own security group (firewall ports)
– Fine grain user roles and resource ACLs
• Key Management
– AWS Keys dynamically provisioned, easy updates
– High grade app specific key management support
AWS Accounts
Accounts Isolate Concerns
• paastest – for development and testing
– Fully functional deployment of all services
– Developer tagged “stacks” for separation
• paasprod – for production
– Autoscale groups only, isolated instances are terminated
– Alert routing, backups enabled by default
• paasaudit – for sensitive services
– To support SOX, PCI, etc.
– Extra access controls, auditing
• paasarchive – for disaster recovery
– Long term archive of backups
– Different region, perhaps different vendor
Reservations and Billing
• Consolidated Billing
– Combine all accounts into one bill
– Pooled capacity for bigger volume discounts
http://docs.amazonwebservices.com/AWSConsolidatedBilling/1.0/AWSConsolidatedBillingGuide.html
• Reservations
– Save up to 71% on your baseline load
– Priority when you request reserved capacity
– Unused reservations are shared across accounts
Cloud Access Gateway
• Datacenter or office based
– A separate VM for each AWS account
– Two per account for high availability
– Mount NFS shared home directories for developers
– Instances trust the gateway via a security group
• Manage how developers login to cloud
– Access control via ldap group membership
– Audit logs of every login to the cloud
– Similar to awsfabrictasks ssh wrapper
http://readthedocs.org/docs/awsfabrictasks/en/latest/
Cloud Access Control
www-
prod
• Userid wwwprod
Dal-
prod
• Userid dalprod
Cass-
prod
• Userid cassprod
Cloud Access
ssh Gateway
Security groups don’t allow
ssh between instances
developers
AWS Usage (coming soon)
for test, carefully omitting any $ numbers…
Dashboards with Pytheas (Explorers)
http://techblog.netflix.com/2013/05/announcing-pytheas.html
• Cassandra Explorer
– Browse clusters, keyspaces, column families
• Base Server Explorer
– Browse service endpoints configuration, perf
• Anything else you want to build…
Cassandra Explorer
Cassandra Explorer
Cassandra Clusters
Bubble Chart
Slideshare NetflixOSS Details
• Lightning Talks Feb S1E1
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
• Asgard In Depth Feb S1E1
– http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
• Lightning Talks March S1E2
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-
roadmap
• Security Architecture
– http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
• Cost Aware Cloud Architectures – with Jinesh Varia of AWS
– http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-
varia-aws-and-adrian-cockroft-netflix
Amazon Cloud Terminology Reference
See http://aws.amazon.com/ This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)

More Related Content

What's hot

Gluecon keynote
Gluecon keynoteGluecon keynote
Gluecon keynote
Adrian Cockcroft
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
Adrian Cockcroft
 
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...
Amazon Web Services
 
Netflix in the Cloud
Netflix in the CloudNetflix in the Cloud
Netflix in the Cloud
Adrian Cockcroft
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
Adrian Cockcroft
 
NetflixOSS Meetup
NetflixOSS MeetupNetflixOSS Meetup
NetflixOSS Meetup
Adrian Cockcroft
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud
Amazon Web Services
 
Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Netflix Velocity Conference 2011
Netflix Velocity Conference 2011
Adrian Cockcroft
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSF
Adrian Cockcroft
 
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Amazon Web Services
 
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012
Amazon Web Services
 
Cloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 TutorialCloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 Tutorial
Stuart Charlton
 
Svc 202-netflix-open-source
Svc 202-netflix-open-sourceSvc 202-netflix-open-source
Svc 202-netflix-open-source
Ruslan Meshenberg
 
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the CloudIntuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
Sid Anand
 
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
CA API Management
 
MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012
MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012
MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012
Amazon Web Services
 
Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015
Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015 Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015
Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015
Mariano Gonzalez
 
AWS Re:Invent - Optimizing Costs with AWS
AWS Re:Invent -  Optimizing Costs with AWSAWS Re:Invent -  Optimizing Costs with AWS
AWS Re:Invent - Optimizing Costs with AWS
Coburn Watson
 
Asgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the CloudAsgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the Cloud
Joe Sondow
 
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...
Amazon Web Services
 

What's hot (20)

Gluecon keynote
Gluecon keynoteGluecon keynote
Gluecon keynote
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
 
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...
 
Netflix in the Cloud
Netflix in the CloudNetflix in the Cloud
Netflix in the Cloud
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
 
NetflixOSS Meetup
NetflixOSS MeetupNetflixOSS Meetup
NetflixOSS Meetup
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud
 
Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Netflix Velocity Conference 2011
Netflix Velocity Conference 2011
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSF
 
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
 
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012
 
Cloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 TutorialCloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 Tutorial
 
Svc 202-netflix-open-source
Svc 202-netflix-open-sourceSvc 202-netflix-open-source
Svc 202-netflix-open-source
 
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the CloudIntuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
 
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
 
MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012
MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012
MED202 Netflix’s Transcoding Transformation - AWS re: Invent 2012
 
Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015
Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015 Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015
Zero Downtime with OSGi - Chicago Coder Conference 05-15-2015
 
AWS Re:Invent - Optimizing Costs with AWS
AWS Re:Invent -  Optimizing Costs with AWSAWS Re:Invent -  Optimizing Costs with AWS
AWS Re:Invent - Optimizing Costs with AWS
 
Asgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the CloudAsgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the Cloud
 
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...
 

Viewers also liked

Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Adrian Cockcroft
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Adrian Cockcroft
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
Adrian Cockcroft
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013
Adrian Cockcroft
 
Netflix and Open Source
Netflix and Open SourceNetflix and Open Source
Netflix and Open Source
Adrian Cockcroft
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
Ruslan Meshenberg
 
Companies financial result updated on 21 mar 2017
Companies financial result updated on 21 mar 2017Companies financial result updated on 21 mar 2017
Companies financial result updated on 21 mar 2017
RAFI SECURITIES (PVT.)LTD.
 
Customer Experience Management
Customer Experience ManagementCustomer Experience Management
Clpp Torres Carora
Clpp Torres CaroraClpp Torres Carora
Clpp Torres Carora
guestadd51a
 
Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017
Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017
Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017
BookNet Canada
 
Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017
Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017
Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017
Eduardo Sánchez González
 
Dossiers noirs va 4191
Dossiers noirs va 4191Dossiers noirs va 4191
Dossiers noirs va 4191
Michel Bruley
 
Careggi Smart Hospital - candidatura premio forum pa 2017
Careggi Smart Hospital - candidatura premio forum pa 2017Careggi Smart Hospital - candidatura premio forum pa 2017
Careggi Smart Hospital - candidatura premio forum pa 2017
CareggiMobile
 

Viewers also liked (13)

Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013
 
Netflix and Open Source
Netflix and Open SourceNetflix and Open Source
Netflix and Open Source
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
 
Companies financial result updated on 21 mar 2017
Companies financial result updated on 21 mar 2017Companies financial result updated on 21 mar 2017
Companies financial result updated on 21 mar 2017
 
Customer Experience Management
Customer Experience ManagementCustomer Experience Management
Customer Experience Management
 
Clpp Torres Carora
Clpp Torres CaroraClpp Torres Carora
Clpp Torres Carora
 
Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017
Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017
Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017
 
Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017
Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017
Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017
 
Dossiers noirs va 4191
Dossiers noirs va 4191Dossiers noirs va 4191
Dossiers noirs va 4191
 
Careggi Smart Hospital - candidatura premio forum pa 2017
Careggi Smart Hospital - candidatura premio forum pa 2017Careggi Smart Hospital - candidatura premio forum pa 2017
Careggi Smart Hospital - candidatura premio forum pa 2017
 

Similar to Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)

C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
DataStax Academy
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...
Ruslan Meshenberg
 
ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012
ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012
ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012
Amazon Web Services
 
Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013
MassTLC
 
Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13
Milind Waikul
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Adrian Cockcroft
 
AWS re:Invent 2013 Recap
AWS re:Invent 2013 RecapAWS re:Invent 2013 Recap
AWS re:Invent 2013 Recap
Barry Jones
 
How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...
How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...
How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...
Amazon Web Services
 
Ceate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureCeate a Scalable Cloud Architecture
Ceate a Scalable Cloud Architecture
Amazon Web Services
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
Eran Gampel
 
Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016
Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016
Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016
Amazon Web Services
 
.NET Developer Days - So many Docker platforms, so little time...
.NET Developer Days - So many Docker platforms, so little time....NET Developer Days - So many Docker platforms, so little time...
.NET Developer Days - So many Docker platforms, so little time...
Michele Leroux Bustamante
 
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
DataStax Academy
 
Percona Live 2014 - Scaling MySQL in AWS
Percona Live 2014 - Scaling MySQL in AWSPercona Live 2014 - Scaling MySQL in AWS
Percona Live 2014 - Scaling MySQL in AWS
Pythian
 
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Amazon Web Services
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
bloomreacheng
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetups
Eran Gampel
 
Avoiding Cloud Outage
Avoiding Cloud OutageAvoiding Cloud Outage
Avoiding Cloud Outage
Nati Shalom
 
AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...
AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...
AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...
Amazon Web Services
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
Amazon Web Services Korea
 

Similar to Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2) (20)

C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...
 
ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012
ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012
ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012
 
Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013
 
Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)
 
AWS re:Invent 2013 Recap
AWS re:Invent 2013 RecapAWS re:Invent 2013 Recap
AWS re:Invent 2013 Recap
 
How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...
How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...
How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...
 
Ceate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureCeate a Scalable Cloud Architecture
Ceate a Scalable Cloud Architecture
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
 
Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016
Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016
Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016
 
.NET Developer Days - So many Docker platforms, so little time...
.NET Developer Days - So many Docker platforms, so little time....NET Developer Days - So many Docker platforms, so little time...
.NET Developer Days - So many Docker platforms, so little time...
 
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...
 
Percona Live 2014 - Scaling MySQL in AWS
Percona Live 2014 - Scaling MySQL in AWSPercona Live 2014 - Scaling MySQL in AWS
Percona Live 2014 - Scaling MySQL in AWS
 
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetups
 
Avoiding Cloud Outage
Avoiding Cloud OutageAvoiding Cloud Outage
Avoiding Cloud Outage
 
AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...
AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...
AWS re:Invent 2016: Reinventing Disaster Recovery Leveraging AWS Cloud Infras...
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 

More from Adrian Cockcroft

Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV Forum
Adrian Cockcroft
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3)
Adrian Cockcroft
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Adrian Cockcroft
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
Adrian Cockcroft
 
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Adrian Cockcroft
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
Adrian Cockcroft
 
Migrating to Public Cloud
Migrating to Public CloudMigrating to Public Cloud
Migrating to Public Cloud
Adrian Cockcroft
 
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connect
Adrian Cockcroft
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
Adrian Cockcroft
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is useless
Adrian Cockcroft
 
NoSQL for Netflix
NoSQL for NetflixNoSQL for Netflix
NoSQL for Netflix
Adrian Cockcroft
 

More from Adrian Cockcroft (11)

Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV Forum
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3)
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
 
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Migrating to Public Cloud
Migrating to Public CloudMigrating to Public Cloud
Migrating to Public Cloud
 
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connect
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is useless
 
NoSQL for Netflix
NoSQL for NetflixNoSQL for Netflix
NoSQL for Netflix
 

Recently uploaded

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Things to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUUThings to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUU
FODUU
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 

Recently uploaded (20)

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Things to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUUThings to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUU
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 

Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)

  • 1. Details: Building Using The NetflixOSS Architecture May 2013 Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS http://www.linkedin.com/in/adriancockcroft
  • 2. Architectures for High Availability Cassandra Storage and Replication NetflixOSS Components
  • 3. Component Micro-Services Test With Chaos Monkey, Latency Monkey
  • 4. Three Balanced Availability Zones Test with Chaos Gorilla Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
  • 5. Triple Replicated Persistence Cassandra maintenance affects individual replicas Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
  • 6. Isolated Regions Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers
  • 7. Failure Modes and Effects Failure Mode Probability Current Mitigation Plan Application Failure High Automatic degraded response AWS Region Failure Low Active-Active Using Denominator AWS Zone Failure Medium Continue to run on 2 out of 3 zones Datacenter Failure Medium Migrate more functions to cloud Data store failure Low Restore from S3 backups S3 failure Low Restore from remote archive Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Working on Active-Active in 2013.
  • 8. Application Resilience Run what you wrote Rapid detection Rapid Response
  • 9. Run What You Wrote • Make developers responsible for failures – Then they learn and write code that doesn’t fail • Use Incident Reviews to find gaps to fix – Make sure its not about finding “who to blame” • Keep timeouts short, fail fast – Don’t let cascading timeouts stack up • Dynamic configuration options - Archaius – http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
  • 10. Resilient Design – Hystrix, RxJava http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  • 11. Chaos Monkey http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html • Computers (Datacenter or AWS) randomly die – Fact of life, but too infrequent to test resiliency • Test to make sure systems are resilient – Kill individual instances without customer impact • Latency Monkey (coming soon) – Inject extra latency and error return codes
  • 12. Monkeys Edda – Configuration History http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html Edda AWS Instances, ASGs, etc. Eureka Services metadata AppDynamics Request flow
  • 13. Edda Query Examples Find any instances that have ever had a specific public IP address $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0" ["i-0123456789","i-012345678a","i-012345678b”] Show the most recent change to a security group $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2" --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810 +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504 @@ -1,33 +1,33 @@ { … "ipRanges" : [ "10.10.1.1/32", "10.10.1.2/32", + "10.10.1.3/32", - "10.10.1.4/32" … }
  • 14. Platform Outage Taxonomy Classify and name the different types of things that can go wrong
  • 15. YOLO
  • 16. Zone Failure Modes • Power Outage – Instances lost, ephemeral state lost – Clean break and recovery, fail fast, “no route to host” • Network Outage – Instances isolated, state inconsistent – More complex symptoms, recovery issues, transients • Dependent Service Outage – Cascading failures, misbehaving instances, human errors – Confusing symptoms, recovery issues, byzantine effects
  • 17. Zone Power Failure • June 29, 2012 AWS US-East - The Big Storm – http://aws.amazon.com/message/67457/ – http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html • Highlights – One of 10+ US-East datacenters failed generator startup – UPS depleted -> 10min power outage for 7% of instances • Result – Netflix lost power to most of a zone, evacuated the zone – Small/brief user impact due to errors and retries
  • 18. Zone Failure Modes Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers Zone Power Outage Zone Network Outage Zone Dependent Service Outage
  • 19. Regional Failure Modes • Network Failure Takes Region Offline – DNS configuration errors – Bugs and configuration errors in routers – Network capacity overload • Control Plane Overload Affecting Entire Region – Consequence of other outages – Lose control of remaining zones infrastructure – Cascading service failure, hard to diagnose
  • 20. Regional Control Plane Overload • April 2011 – “The big EBS Outage” – http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html – Human error during network upgrade triggered cascading failure – Zone level failure, with brief regional control plane overload • Netflix Infrastructure Impact – Instances in one zone hung and could not launch replacements – Overload prevented other zones from launching instances – Some MySQL slaves offline for a few days • Netflix Customer Visible Impact – Higher latencies for a short time – Higher error rates for a short time – Outage was at a low traffic level time, so no capacity issues
  • 21. Regional Failure Modes Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers Regional Network Outage Control Plane Overload
  • 22. Dependent Services Failure • June 29, 2012 AWS US-East - The Big Storm – Power failure recovery overloaded EBS storage service – Backlog of instance startups using EBS root volumes • ELB (Load Balancer) Impacted – ELB instances couldn’t scale because EBS was backlogged – ELB control plane also became backlogged • Mitigation Plans Mentioned – Multiple control plane request queues to isolate backlog – Rapid DNS based traffic shifting between zones
  • 23. Application Routing Failure June 29, 2012 AWS US-East - The Big Storm Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers Zone Power Outage Applications not using Zone-aware routing kept trying to talk to dead instances and timing out Eureka service directory failed to mark down dead instances due to a configuration error Effect: higher latency and errors Mitigation: Fixed config, and made zone aware routing the default
  • 24. Dec 24th 2012 Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers Partial Regional ELB Outage • ELB (Load Balancer) Impacted – ELB control plane database state accidentally corrupted – Hours to detect, hours to restore from backups • Mitigation Plans Mentioned – Tighter process for access to control plane – Better zone isolation
  • 25. Global Failure Modes • Software Bugs – Externally triggered (e.g. leap year/leap second) – Memory leaks and other delayed action failures • Global configuration errors – Usually human error – Both infrastructure and application level • Cascading capacity overload – Customers migrating away from a failure – Lack of cross region service isolation
  • 26. Global Software Bug Outages • AWS S3 Global Outage in 2008 – Gossip protocol propagated errors worldwide – No data loss, but service offline for up to 9hrs – Extra error detection fixes, no big issues since • Microsoft Azure Leap Day Outage in 2012 – Bug failed to generate certificates ending 2/29/13 – Failure to launch new instances for up to 13hrs – One line code fix. • Netflix Configuration Error in 2012 – Global property updated to broken value – Streaming stopped worldwide for ~1hr until we changed back – Fix planned to keep history of properties for quick rollback
  • 27. Global Failure Modes Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers Cascading Capacity Overload Software Bugs and Global Configuration Errors Capacity Demand Migrates “Oops…”
  • 28. Denominator Portable DNS management API Models (varied and mostly broken) DNS Vendor Plug-in Common Model Use Cases Edda, Multi- Region Failover Denominator AWS Route53 IAM Key Auth REST DynECT User/pwd REST UltraDNS User/pwd SOAP Etc… Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
  • 29. Highly Available Storage A highly scalable, available and durable deployment pattern
  • 30. Micro-Service Pattern One keyspace, replaces a single table or materialized view Single function Cassandra Cluster Managed by Priam Between 6 and 72 nodes Stateless Data Access REST Service Astyanax Cassandra Client Optional Datacenter Update Flow Many Different Single-Function REST Clients Appdynamics Service Flow Visualization Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones
  • 31. Stateless Micro-Service Architecture Linux Base AMI (CentOS or Ubuntu) Optional Apache frontend, memcached, non-java apps Monitoring Log rotation to S3 AppDynamics machineagent Epic/Atlas Java (JDK 6 or 7) AppDynamics appagent monitoring GC and thread dump logging Tomcat Application war file, base servlet, platform, client interface jars, Astyanax Healthcheck, status servlets, JMX interface, Servo autoscale
  • 32. Astyanax Available at http://github.com/netflix • Features – Complete abstraction of connection pool from RPC protocol – Fluent Style API – Operation retry with backoff – Token aware • Recipes – Distributed row lock (without zookeeper) – Multi-DC row lock – Uniqueness constraint – Multi-row uniqueness constraint – Chunked and multi-threaded large file storage
  • 33. Initializing Astyanax // Configuration either set in code or nfastyanax.properties platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY netflix.environment=test default.astyanax.readConsistency=CL_QUORUM default.astyanax.writeConsistency=CL_QUORUM MyCluster.MyKeyspace.astyanax.servers=127.0.0.1 // Must initialize platform for discovery to work NFLibraryManager.initLibrary(PlatformManager.class, props, false, true); NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false); // Open a keyspace instance Keyspace keyspace = KeyspaceFactory.openKeyspace(”MyCluster”,”MyKeyspace");
  • 34. Astyanax Query Example Paginate through all columns in a row ColumnList<String> columns; int pageize = 10; try { RowQuery<String, String> query = keyspace .prepareQuery(CF_STANDARD1) .getKey("A") .setIsPaginating() .withColumnRange(new RangeBuilder().setMaxSize(pageize).build()); while (!(columns = query.execute().getResult()).isEmpty()) { for (Column<String> c : columns) { } } } catch (ConnectionException e) { }
  • 35. Astyanax - Cassandra Write Data Flows Single Region, Multiple Availability Zone, Token Aware Token Aware Clients Cassandra •Disks •Zone A Cassandra •Disks •Zone B Cassandra •Disks •Zone C Cassandra •Disks •Zone A Cassandra •Disks •Zone B Cassandra •Disks •Zone C 1. Client Writes to local coordinator 2. Coodinator writes to other zones 3. Nodes return ack 4. Data written to internal commit log disks (no more than 10 seconds later) If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write SSTable disk writes and compactions occur asynchronously 1 4 4 4 2 3 3 3 2
  • 36. Data Flows for Multi-Region Writes Token Aware, Consistency Level = Local Quorum US Clients Cassandra • Disks • Zone A Cassandra • Disks • Zone B Cassandra • Disks • Zone C Cassandra • Disks • Zone A Cassandra • Disks • Zone B Cassandra • Disks • Zone C 1. Client writes to local replicas 2. Local write acks returned to Client which continues when 2 of 3 local nodes are committed 3. Local coordinator writes to remote coordinator. 4. When data arrives, remote coordinator node acks and copies to other remote zones 5. Remote nodes ack to local coordinator 6. Data flushed to internal commit log disks (no more than 10 seconds later) If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent. EU Clients Cassandra • Disks • Zone A Cassandra • Disks • Zone B Cassandra • Disks • Zone C Cassandra • Disks • Zone A Cassandra • Disks • Zone B Cassandra • Disks • Zone C 6 5 5 6 6 4 4 4 1 6 6 6 2 2 2 3 100+ms latency
  • 37. Cassandra Instance Architecture Linux Base AMI (CentOS or Ubuntu) Tomcat and Priam on JDK Healthcheck, Status Monitoring AppDynamics machineagent Epic/Atlas Java (JDK 7) AppDynamics appagent monitoring GC and thread dump logging Cassandra Server Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
  • 38. Priam – Cassandra Automation Available at http://github.com/netflix • Netflix Platform Tomcat Code • Zero touch auto-configuration • State management for Cassandra JVM • Token allocation and assignment • Broken node auto-replacement • Full and incremental backup to S3 • Restore sequencing from S3 • Grow/Shrink Cassandra “ring”
  • 39. ETL for Cassandra • Data is de-normalized over many clusters! • Too many to restore from backups for ETL • Solution – read backup files using Hadoop • Aegisthus – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html – High throughput raw SSTable processing – Re-normalizes many clusters to a consistent view – Extract, Transform, then Load into Teradata
  • 41. Datacenter to Cloud Transition Goals • Faster – Lower latency than the equivalent datacenter web pages and API calls – Measured as mean and 99th percentile – For both first hit (e.g. home page) and in-session hits for the same user • Scalable – Avoid needing any more datacenter capacity as subscriber count increases – No central vertically scaled databases – Leverage AWS elastic capacity effectively • Available – Substantially higher robustness and availability than datacenter services – Leverage multiple AWS availability zones – No scheduled down time, no central database schema to change • Productive – Optimize agility of a large development team with automation and tools – Leave behind complex tangled datacenter code base (~8 year old architecture) – Enforce clean layered interfaces and re-usable components
  • 42. Datacenter Anti-Patterns What do we currently do in the datacenter that prevents us from meeting our goals?
  • 43. Rewrite from Scratch Not everything is cloud specific Pay down technical debt Robust patterns
  • 44. Netflix Datacenter vs. Cloud Arch Central SQL Database Distributed Key/Value NoSQL Sticky In-Memory Session Shared Memcached Session Chatty Protocols Latency Tolerant Protocols Tangled Service Interfaces Layered Service Interfaces Instrumented Code Instrumented Service Patterns Fat Complex Objects Lightweight Serializable Objects Components as Jar Files Components as Services
  • 45. Tangled Service Interfaces • Datacenter implementation is exposed – Oracle SQL queries mixed into business logic • Tangled code – Deep dependencies, false sharing • Data providers with sideways dependencies – Everything depends on everything else Anti-pattern affects productivity, availability
  • 46. Untangled Service Interfaces Two layers: • SAL - Service Access Library – Basic serialization and error handling – REST or POJO’s defined by data provider • ESL - Extended Service Library – Caching, conveniences, can combine several SALs – Exposes faceted type system (described later) – Interface defined by data consumer in many cases
  • 48. NetflixOSS Details • Platform entities and services • AWS Accounts and access management • Upcoming and recent NetflixOSS components • In-depth on NetflixOSS components
  • 49. Basic Platform Entities • AWS Based Entities – Instances and Machine Images, Elastic IP Addresses – Security Groups, Load Balancers, Autoscale Groups – Availability Zones and Geographic Regions • NetflixOS Specific Entities – Applications (registered services) – Clusters (versioned Autoscale Groups for an App) – Properties (dynamic hierarchical configuration)
  • 50. Core Platform Services • AWS Based Services – S3 storage, to 5TB files, parallel multipart writes – SQS – Simple Queue Service. Messaging layer. • Netflix Based Services – EVCache – memcached based ephemeral cache – Cassandra – distributed persistent data store
  • 51. Security Architecture • Instance Level Security baked into base AMI – Login: ssh only allowed via portal (not between instances) – Each app type runs as its own userid app{test|prod} • AWS Security, Identity and Access Management – Each app has its own security group (firewall ports) – Fine grain user roles and resource ACLs • Key Management – AWS Keys dynamically provisioned, easy updates – High grade app specific key management support
  • 53. Accounts Isolate Concerns • paastest – for development and testing – Fully functional deployment of all services – Developer tagged “stacks” for separation • paasprod – for production – Autoscale groups only, isolated instances are terminated – Alert routing, backups enabled by default • paasaudit – for sensitive services – To support SOX, PCI, etc. – Extra access controls, auditing • paasarchive – for disaster recovery – Long term archive of backups – Different region, perhaps different vendor
  • 54. Reservations and Billing • Consolidated Billing – Combine all accounts into one bill – Pooled capacity for bigger volume discounts http://docs.amazonwebservices.com/AWSConsolidatedBilling/1.0/AWSConsolidatedBillingGuide.html • Reservations – Save up to 71% on your baseline load – Priority when you request reserved capacity – Unused reservations are shared across accounts
  • 55. Cloud Access Gateway • Datacenter or office based – A separate VM for each AWS account – Two per account for high availability – Mount NFS shared home directories for developers – Instances trust the gateway via a security group • Manage how developers login to cloud – Access control via ldap group membership – Audit logs of every login to the cloud – Similar to awsfabrictasks ssh wrapper http://readthedocs.org/docs/awsfabrictasks/en/latest/
  • 56. Cloud Access Control www- prod • Userid wwwprod Dal- prod • Userid dalprod Cass- prod • Userid cassprod Cloud Access ssh Gateway Security groups don’t allow ssh between instances developers
  • 57. AWS Usage (coming soon) for test, carefully omitting any $ numbers…
  • 58. Dashboards with Pytheas (Explorers) http://techblog.netflix.com/2013/05/announcing-pytheas.html • Cassandra Explorer – Browse clusters, keyspaces, column families • Base Server Explorer – Browse service endpoints configuration, perf • Anything else you want to build…
  • 63. Slideshare NetflixOSS Details • Lightning Talks Feb S1E1 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks • Asgard In Depth Feb S1E1 – http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house • Lightning Talks March S1E2 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and- roadmap • Security Architecture – http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned • Cost Aware Cloud Architectures – with Jinesh Varia of AWS – http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh- varia-aws-and-adrian-cockroft-netflix
  • 64. Amazon Cloud Terminology Reference See http://aws.amazon.com/ This is not a full list of Amazon Web Service features • AWS – Amazon Web Services (common name for Amazon cloud) • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code) • EC2 – Elastic Compute Cloud – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations. – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept. – Reserved Instances – pre-paid to reduce cost for long term usage – Availability Zone – datacenter with own power and cooling hosting cloud instances – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov • ASG – Auto Scaling Group (instances booting from the same AMI) • S3 – Simple Storage Service (http access) • EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance) • RDS – Relational Database Service (managed MySQL master and slaves) • DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB) • SQS – Simple Queue Service (http based message queue) • SNS – Simple Notification Service (http and email based topics and messages) • EMR – Elastic Map Reduce (automatically managed Hadoop cluster) • ELB – Elastic Load Balancer • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB) • VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs) • DirectConnect – secure pipe from AWS VPC to external datacenter • IAM – Identity and Access Management (fine grain role based security keys)