MassTLC Cloud Summit Keynote

My keynote at the MassTLC Cloud Summit on Oct 8th on the Netflix architecture and future in the cloud

Published in: Technology, Business

Transcript of "MassTLC Cloud Summit Keynote"

  1. Netflix Cloud Platform
     Netflix's evolution in the cloud
     Ariel Tseitlin, http://www.linkedin.com/in/atseitlin, @atseitlin
  2. About Netflix
     Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
     [1] http://ir.netflix.com/
  3. Original Content
  4. Critical Acclaim
  5. A complex distributed system
  6. How Netflix Streaming Works
     Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding
     Columns: Consumer Electronics; AWS Cloud Services; CDN Edge Locations
     Rows: Browse; Play; Watch
  7. Highly Available Architecture
     Micro-services, redundancy, resiliency
  8. Web Server Dependencies Flow
     Home page business transaction
     Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Personalization movie group chooser
     Each icon is three to a few hundred instances across three AWS zones
  9. Component Micro-Services
     Test with Chaos Monkey, Latency Monkey
  10. Three Balanced Availability Zones
      Test with Chaos Gorilla
      Diagram: Load Balancers in front of Zone A, Zone B, and Zone C, each holding Cassandra and Evcache replicas
  11. Triple Replicated Persistence
      Cassandra maintenance affects individual replicas
      Diagram: Load Balancers in front of Zone A, Zone B, and Zone C, each holding Cassandra and Evcache replicas
  12. Isolated Regions
      Will someday test with Chaos Kong
      Diagram: US-East Load Balancers in front of Cassandra replicas in Zone A, Zone B, and Zone C; EU-West Load Balancers in front of Cassandra replicas in Zone A, Zone B, and Zone C
  13. Failure Modes and Effects
      Failure Mode          Probability   Current Mitigation Plan
      Application Failure   High          Automatic degraded response
      AWS Region Failure    Low           Wait for region to recover
      AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
      Datacenter Failure    Medium        Migrate more functions to cloud
      Data store failure    Low           Restore from S3 backups
      S3 failure            Low           Restore from remote archive
      Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
  14. Application Resilience
      Run what you wrote; Rapid detection; Rapid Response; Fail often
  15. Run What You Wrote
      • Make developers responsible for failures
        – Then they learn and write code that doesn’t fail
      • Use Incident Reviews to find gaps to fix
        – Make sure it’s not about finding “who to blame”
      • Keep timeouts short, fail fast
        – Don’t let cascading timeouts stack up
  16. Rapid Detection
      • If your pilot had no instrument panel, would you ever board a plane?
        – Never run your service blind
      • Monitor services, not instances
        – Make instance failure a non-event
      • Don’t pay people to watch screens
        – Instead pay them to build alerting
  17. Rapid Rollback
      • Use a new Autoscale Group to push code
      • Leave existing ASG in place, switch traffic
      • If OK, auto-delete old ASG a few hours later
      • If “whoops”, switch traffic back in seconds
  18. Asgard
      http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
  19. Made possible in the cloud
      APIs, Elasticity, Efficiency
  20. APIs
      • Control everything (start, terminate, scale)
      • Inject failure
      • Monitor & audit
      • Automate operations
  21. Elasticity
      • Capacity planning replaced with forecasting
      • Dynamic load-based auto-scaling
      • New data centers at the click of a button
  22. Efficiency
      • ~10x trough to peak ratio. Fill trough with batch workloads
      • Optimize machine class for each service
      • Highly available red/black deployments
  23. Coming soon to a cloud near you
      Billing & Payments, Big Data & Analytics, SaaS
  24. Billing & Payments
      • PCI compliance
      • Privacy & security
      • Intermediate step of cache in the cloud
  25. Big Data & Analytics
      • On deck for cloud migration
      • ETL already in cloud with EMR (Hadoop)
      • Many cloud alternatives but not yet as mature as the old guard
  26. Corporate systems moving to SaaS
      • Email (Exchange->Google Apps)
      • Expense Management (Concur->Workday)
      • Document sharing (File Servers->Box)
      • Goal is 100% SaaS
  27. (no transcribed text)
  28. Open Source Projects
      Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
      • Priam: Cassandra as a Service
      • Astyanax: Cassandra client for Java
      • CassJMeter: Cassandra test suite
      • Cassandra: Multi-region EC2 datastore support
      • Aegisthus: Hadoop ETL for Cassandra
      • Ice: Spend analytics
      • Governator: Library lifecycle and dependency injection
      • Odin: Cloud orchestration
      • Blitz4j: Async logging
      • Exhibitor: Zookeeper as a Service
      • Curator: Zookeeper Patterns
      • EVCache: Memcached as a Service
      • Eureka / Discovery: Service Directory
      • Archaius: Dynamic Properties Service
      • Edda: Config state with history
      • Denominator
      • Ribbon: REST Client + mid-tier LB
      • Karyon: Instrumented REST Base Server
      • Servo and Autoscaling Scripts
      • Genie: Hadoop PaaS
      • Hystrix: Robust service pattern
      • RxJava: Reactive Patterns
      • Asgard: AutoScaleGroup based AWS console
      • Chaos Monkey: Robustness verification
      • Latency Monkey
      • Janitor Monkey
      • Bakeries / Aminator
  29. (no transcribed text)
  30. Our Current Catalog of Releases
      Free code available at http://netflix.github.com
  31. We’re hiring!
      • Simian Army
      • Cloud Tools
      • NetflixOSS
      • Cloud Operations
      • Reliability Engineering
      • Many, many more
      jobs.netflix.com
  32. Takeaways
      • Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS)
      • The Cloud enables elasticity, efficiency and fine-grained control via APIs
      • Credit cards, Big Data, and the rest of the corporate systems are next to move to the Cloud
      http://netflix.github.com
      http://techblog.netflix.com
      http://slideshare.net/Netflix
      http://www.linkedin.com/in/atseitlin
      @atseitlin @NetflixOSS
  33. Thank you!
      Any questions?
      Ariel Tseitlin, http://www.linkedin.com/in/atseitlin
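
Slide 15 advises keeping timeouts short and failing fast, and slide 13 lists an "automatic degraded response" as the mitigation for application failure. The sketch below combines the two ideas in a generic way using only the Python standard library. It is an illustration under assumptions, not Netflix code: `call_with_timeout`, `slow_personalization`, `fast_personalization`, and the fallback values are hypothetical names invented for this example.

```python
import concurrent.futures
import time

# Hypothetical helper names throughout; this is an illustrative sketch only.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback on error or timeout."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running is not interrupted,
        # but the caller stops waiting and answers with the degraded response.
        future.cancel()
        return fallback
    except Exception:
        # Any failure in the dependency also yields the degraded response.
        return fallback


def slow_personalization():
    # Stand-in for a dependency that has become slow.
    time.sleep(0.5)
    return ["personalized row"]


def fast_personalization():
    # Stand-in for a healthy dependency.
    return ["personalized row"]


if __name__ == "__main__":
    print(call_with_timeout(slow_personalization, 0.1, ["popular titles"]))
    print(call_with_timeout(fast_personalization, 0.1, ["popular titles"]))
```

The caller waits at most `timeout_s` for each dependency, so a slow downstream service costs a bounded delay instead of stacking up across a chain of calls.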

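The rollback steps on slide 17 (push code with a new Autoscale Group, leave the existing one in place, switch traffic, then either delete the old group or switch back) can be modeled with a small in-memory state machine. This is a sketch under assumptions and makes no AWS or Asgard calls: the `Deployment` class and the group names such as `api-v001` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Deployment:
    """In-memory model of a red/black push; all names are hypothetical."""

    live: str  # group currently receiving traffic
    standby: List[str] = field(default_factory=list)  # old groups kept for rollback

    def push(self, new_group: str) -> None:
        # Launch the new group beside the old one, then switch traffic to it.
        self.standby.append(self.live)
        self.live = new_group

    def rollback(self) -> str:
        # "Whoops": point traffic back at the most recent previous group.
        if not self.standby:
            raise RuntimeError("no previous group to roll back to")
        bad_group = self.live
        self.live = self.standby.pop()
        return bad_group

    def cleanup(self) -> List[str]:
        # "If OK": delete the old groups once the new one has proven itself.
        removed, self.standby = self.standby, []
        return removed


if __name__ == "__main__":
    d = Deployment(live="api-v001")
    d.push("api-v002")
    print(d.live)
    d.rollback()
    print(d.live)
```

Because the old group keeps running until `cleanup` is called, rolling back is only a traffic switch, which is why the slide can promise it "in seconds".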