Netflix Cloud Architecture QCon Tokyo April 12, 2011 Adrian Cockcroft @adrianco #netflixcloud http://slideshare.net/adrianco acockcroft@netflix.com
Who, Why, What
Netflix in the Cloud
Cloud Challenges and Learnings
Systems and Operations Architecture
Netflix Inc. With more than 20 million subscribers in the United States and Canada, Netflix, Inc. is the world’s leading Internet subscription service for enjoying movies and TV shows. International Expansion We plan to expand into an additional market in the second half of 2011… If the second market meets our expectations… we will continue to invest and expand aggressively in 2012. Source: http://ir.netflix.com
Unlimited streaming for $7.99/month, large and growing catalog of movies and TV
Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
– Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
– Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
– 2003-4 Chief Architect High Performance Technical Computing
– 2001 Author: Capacity Planning for Web Services
– 1999 Author: Resource Management
– 1995 & 1998 Author: Sun Performance and Tuning
– 1996 Japanese Edition of Sun Performance and Tuning • SPARC & Solaris ( )
Why is Netflix Talking about Cloud?
Netflix is Path-finding
The Cloud ecosystem is evolving very fast
Share with and learn from the cloud community
We want to use clouds, not build them Cloud technology should be a commodity Public cloud and open source for agility and scale
Why Use Cloud? For Better Business Agility For Unpredictable Business Growth
Data Center
Netflix could not build new datacenters fast enough
Capacity growth is accelerating, unpredictable
Product launch spikes - iPhone, Wii, PS3, XBox
20 Million Customers 2010-Q3 year/year +52% Total and +145% Streaming
[Chart: customers in millions (0 to 25) by quarter, 2009Q2 to 2010Q4] Source: http://ir.netflix.com
Out-Growing Data Center http://techblog.netflix.com/2011/02/redesigning-netflix-api.html 37x Growth Jan 2010-Jan 2011 Datacenter Capacity
Netflix.com is now ~100% Cloud
Account sign-up is currently being moved to cloud
All international product will be cloud based
USA specific logistics remains in the Datacenter
Leverage AWS Scale “the biggest public cloud”
AWS investment in tooling and automation
Use many AWS zones for high availability, scalability
AWS skills are most common on resumes…
Leverage AWS Feature Set “the market leader” EC2, S3, SDB, SQS, EBS, EMR, ELB, ASG, IAM, RDB, VPC… http://aws.amazon.com/jp
Amazon Cloud Terminology
See http://aws.amazon.com/jp for Japanese
This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDB – Relational Data Base (managed MySQL master and slaves)
• SDB – Simple Data Base (hosted http based NoSQL data store)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
• IAM – Identity and Access Management (fine grain role based security keys)
“The cloud lets its users focus on delivering differentiating business value instead of wasting valuable resources on the undifferentiated heavy lifting that makes up most of IT infrastructure.” Werner Vogels Amazon CTO
We want to use clouds, we don’t have time to build them
Public cloud for agility and scale
AWS because they are big enough to allocate thousands of instances per hour when we need to
Netflix EC2 Instances per Account (summer 2010, production is much higher now…) “Many Thousands” Content Encoding Test and Production Log Analysis “Several Months”
Netflix Deployed on AWS
• Content – Video Masters, EC2, S3, CDN
• Logs – S3, EMR Hadoop, Hive, Business Intelligence
• Play – DRM, CDN routing, Bookmarks, Logging
• WWW – Search, Movie Choosing, Ratings, Similars
• API – Metadata, Device Config, TV Movie Choosing, Mobile iPhone
Cloud Encoding Pipeline
[Diagram: pipeline from studio master tapes through network upload, encoding to mezzanine files on S3, encoding to 50+ formats, copy to CDN origin, and streaming to TV]
Licensed content is provided to Netflix as high quality master tapes
Many formats are reduced to a single high quality mezzanine format on S3
Individual formats and speeds are encoded in over 50 combinations
Many formats for older and newer hardware and various game consoles
Many speeds from mobile through standard and high definition
Static files are copied to each Content Delivery Network’s “origin server”
CDNs migrate files to “edge servers” near the end user
Files stream to PC/Mac/iPad or TV over HTTP using “range get” to move chunks
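The “range get” idea in the last line can be sketched in a few lines of Python. This is only an illustration of how a client might split a file into HTTP byte ranges; the function name `range_headers` and the chunk size are assumptions, not anything taken from the Netflix players.

```python
def range_headers(file_size, chunk_size):
    """Yield HTTP Range header values that cover a file in fixed-size chunks.

    Hypothetical helper for illustration only. HTTP byte ranges are
    inclusive at both ends, so a 4-byte chunk at offset 0 is "bytes=0-3".
    """
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < file_size:
        # Last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, file_size) - 1
        yield f"bytes={start}-{end}"
        start = end + 1


if __name__ == "__main__":
    for header in range_headers(10, 4):
        print("Range:", header)
```

Each value would be sent as the `Range` request header of an ordinary HTTP GET, which is why static files on a CDN are enough to serve the stream.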
Cloud Architecture
Product Trade-off
User Experience: Consistent Experience, Low Latency
Implementation: Development complexity, Operational complexity
Netflix Cloud Goals
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
– Measured as mean and 99th percentile
– For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
– No central vertically scaled databases
– Leverage AWS elastic capacity effectively
• Available
– Substantially higher robustness and availability than datacenter services
– Leverage multiple AWS availability zones
– No scheduled down time, no central database schema to change
• Productive
– Optimize agility of a large development team with automation and tools
– Leave behind complex tangled datacenter code base (~8 year old architecture)
– Enforce clean layered interfaces and re-usable components
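As a small illustration of the “mean and 99th percentile” measurement under the Faster goal, the sketch below summarizes a list of latency samples. The nearest-rank percentile definition and the names `percentile` and `summarize` are assumptions made for this example; the slides do not say how Netflix computes the figures.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it.

    pct is an integer from 1 to 100. Integer arithmetic avoids
    floating-point rounding when computing the rank.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct * n / 100, 1-based
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the two figures named on the slide for a batch of latencies."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p99": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    print(summarize([12, 15, 11, 240, 14, 13, 16, 12, 11, 13]))
```

Reporting both numbers matters because a handful of slow requests barely moves the mean but shows up clearly in the 99th percentile.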
Old Datacenter vs. New Cloud Arch
Central SQL Database vs. Distributed Key/Value NoSQL
Sticky In-Memory Session vs. Shared Memcached Session
Chatty Protocols vs. Latency Tolerant Protocols
Tangled Service Interfaces vs. Layered Service Interfaces
Instrumented Code vs. Instrumented Service Patterns
Fat Complex Objects vs. Lightweight Serializable Objects
Components as Jar Files vs. Components as Services
Learnings
• Datacenter oriented tools don’t work
– Ephemeral instances
– High rate of change
– Need too much hand-holding and manual setup
• Cloud Tools Don’t Scale for Enterprise
– Too many tools are “Startup” oriented
– Built our own tools for 1000’s of instances
– Drove vendors to be dynamic, scale, add APIs
• Un-modified Datacenter Apps are Fragile
– Too many datacenter oriented assumptions
– We re-wrote our code base!
– (We re-write it continuously anyway)
Netflix Systems Architecture
API
[Diagram: on AWS EC2, a Front End Load Balancer, Discovery Service, API Proxy, API etc., a second Load Balancer, Component Services, SQS and memcached; AWS Storage is EBS, S3 and SimpleDB; Oracle in the Netflix Data Center is connected by Replication]
Database Migration
• Why SimpleDB?
– No DBA’s in the cloud, Amazon hosted service
– Work started two years ago, fewer viable options
– Worked with Amazon to speed up and scale SimpleDB
• Alternatives?
– Rolling out Cassandra as “upgrade” from SimpleDB
– Need several options to match use cases well
• Detailed NoSQL and SimpleDB Advice
– Sid Anand - QConSF Nov 5th – Netflix’ Transition to High Availability Storage Systems
– Blog - http://practicalcloudcomputing.com/
– Download Paper PDF - http://bit.ly/bhOTLu
Cloud Operations Model Driven Architecture Capacity Planning & Monitoring
Tools and Automation
• Developer and Build Tools
– Jira, Perforce, Eclipse, Jeeves, Ivy, Artifactory
– Builds, creates .war file, .rpm, bakes AMI and launches
• Custom Netflix Application Console
– AWS Features at Enterprise Scale (hide the AWS security keys!)
– Auto Scaler Group is unit of deployment to production
• Open Source + Support
– Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
• Monitoring Tools
– Keynote – service monitoring and alerting
– AppDynamics – Developer focus for cloud http://appdynamics.com
– EpicNMS – flexible data collection and plots http://epicnms.com
– Nimsoft NMS – ITOps focus for Datacenter + Cloud alerting
Model Driven Architecture
• Datacenter Practices
– Lots of unique hand-tweaked systems
– Hard to enforce patterns
• Model Driven Cloud Architecture
– Perforce/Ivy/Jeeves based builds for everything
– Every production instance is a pre-baked AMI
– Every application is managed by an Autoscaler
No exceptions, every change is a new AMI
High Availability Zones
• Each zone is a separate datacenter
– Private power, cooling, network connections
– Located close together for low latency
• ASG Instances are distributed over 3 zones
• Data written to one zone appears in all zones
• Netflix can survive total failure of one zone
– Increase capacity of existing zones by 50%
– Small or zero downtime
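The 50% figure follows from simple arithmetic: with load spread evenly over 3 zones, each carries 1/3 of it, and after one zone fails each survivor carries 1/2, which is 1.5 times its previous share. A minimal sketch, assuming evenly balanced zones (the function name is hypothetical):

```python
from fractions import Fraction


def capacity_increase_after_zone_loss(zones, failed=1):
    """Fractional capacity each surviving zone must add to absorb the full load.

    Assumes load was evenly balanced across all zones before the failure.
    """
    if failed < 0 or zones <= failed:
        raise ValueError("at least one zone must survive")
    # Each survivor goes from 1/zones of the load to 1/(zones - failed).
    return Fraction(zones, zones - failed) - 1


if __name__ == "__main__":
    print(capacity_increase_after_zone_loss(3))  # 3 zones, 1 lost
```

With only 2 zones the survivors would have to double, which is one reason spreading over 3 zones is cheaper to protect.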
Region Migration (Netflix is working to have this in place during 2011, for international roll-out and disaster recovery)
• Data is backed up into a different cloud region
– Cloud bandwidth is much higher than Datacenter
• Restore to a new region
– “A few hours” to load data and create databases
• Create model driven architecture
– “A few hours” to create service instances and test
• Send traffic to new region
– Setup DNS records and start customer service
Model Driven Implications
• Automated “Least Privilege” Security
– Tightly specified security groups
– Fine grain IAM keys to access AWS resources
– Performance tools security and integration
• Model Driven Performance Monitoring
– Hundreds of instances appear in a few minutes…
– Tools have to “garbage collect” dead instances
Netflix App Console
Auto Scale Group Configuration
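One way a monitoring tool might “garbage collect” dead instances is sketched below. Everything here is an assumption for illustration: the registry is a plain dict of instance id to last-seen timestamp, the set of live ids would come from some cloud inventory call that is not shown, and the grace period is arbitrary.

```python
def collect_dead_instances(last_seen, live_ids, now, grace_seconds):
    """Remove instances that are no longer live and have been silent too long.

    last_seen     -- dict of instance id -> last report time (seconds), mutated in place
    live_ids      -- set of ids the cloud provider currently reports as running
    now           -- current time in the same units as last_seen values
    grace_seconds -- how long a vanished instance is kept for post-mortem viewing
    Returns the sorted list of ids that were removed.
    """
    dead = [
        instance_id
        for instance_id, seen in last_seen.items()
        if instance_id not in live_ids and now - seen > grace_seconds
    ]
    for instance_id in dead:
        del last_seen[instance_id]
    return sorted(dead)


if __name__ == "__main__":
    registry = {"i-a": 100, "i-b": 900, "i-c": 100}
    print(collect_dead_instances(registry, {"i-c"}, now=1000, grace_seconds=300))
    print(registry)
```

Keeping a grace period means data from a recently terminated instance stays visible long enough to debug why it went away.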
Capacity Planning & Monitoring
Capacity Planning in Clouds (a few things have changed…)
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases, can’t be shrunk easily
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years (reservations!)
Monitoring Issues
• Problem
– Too many tools, each with a good reason to exist
– Hard to get an integrated view of a problem
– Too much manual work building dashboards
– Tools are not discoverable, views are not filtered
• Solution
– Get vendors to add deep linking URLs and APIs
– Integration “portal” ties everything together
– Underlying dependency database
– Dynamic portal generation, relevant data, all tools
Data Sources
External Testing
• External URL availability and latency alerts and reports – Keynote
• Stress testing - SOASTA
Request Trace Logging
• Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
• Generic HTTP – AppDynamics service tier aggregation, end to end tracking
Application logging
• Tracers and counters – log4j, tracer central, Chukwa to DataOven
• Trackid and Audit/Debug logging – DataOven, AppDynamics GUID cross reference
JMX Metrics
• Application specific real time – Nimsoft, AppDynamics, Epic
• Service and SLA percentiles – Nimsoft, AppDynamics, Epic, logged to DataOven
Tomcat and Apache logs
• Stdout logs – S3 – DataOven, Nimsoft alerting
• Standard format Access and Error logs – S3 – DataOven, Nimsoft Alerting
JVM
• Garbage Collection – Nimsoft, AppDynamics
• Memory usage, call stacks, resource/call - AppDynamics
Linux
• System CPU/Net/RAM/Disk metrics – AppDynamics, Epic, Nimsoft Alerting
• SNMP metrics – Epic, Network flows - Fastip
AWS
• Load balancer traffic – Amazon Cloudwatch, SimpleDB usage stats
• System configuration - CPU count/speed and RAM size, overall usage - AWS
Integrated Dashboards
Dashboards Architecture
• Integrated Dashboard View
– Single web page containing content from many tools
– Filtered to highlight most “interesting” data
• Relevance Controller
– Drill in, add and remove content interactively
– Given an application, alert or problem area, dynamically build a dashboard relevant to your role and needs
• Dependency and Incident Model
– Model Driven - Interrogates tools and AWS APIs
– Document store to capture dependency tree and states
Dashboard Prototype (not everything is integrated yet)
AppDynamics How to look deep inside your cloud applications
• Automatic Monitoring
– Base AMI bakes in all monitoring tools
– Outbound calls only – no discovery/polling issues
– Inactive instances removed after a few days
• Incident Alarms (deviation from baseline)
– Business Transaction latency and error rate
– Alarm thresholds discover their own baseline
– Email contains URL to Incident Workbench UI
Using AppDynamics (simple example from early 2010)
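The idea of an alarm threshold that “discovers its own baseline” can be illustrated with a simple statistical check. This is not AppDynamics’ algorithm, which the slides do not describe; it is a generic sketch that assumes the baseline is the mean of recent samples and that a deviation is anything more than a few standard deviations away. The names and default values are hypothetical.

```python
import statistics


def is_deviation(history, value, sigmas=3.0, min_samples=5):
    """Return True if value deviates from the baseline learned from history.

    history     -- recent samples of the metric (e.g. transaction latency in ms)
    value       -- the new observation to judge
    sigmas      -- how many standard deviations count as a deviation (assumed)
    min_samples -- below this many samples no baseline exists yet, so never alarm
    """
    if len(history) < min_samples:
        return False
    baseline = statistics.mean(history)
    spread = statistics.pstdev(history)
    return abs(value - baseline) > sigmas * spread


if __name__ == "__main__":
    recent = [100, 102, 98, 101, 99]
    print(is_deviation(recent, 103))  # close to the baseline
    print(is_deviation(recent, 150))  # far from the baseline
```

Because the threshold is derived from each metric’s own history, no one has to hand-tune limits for hundreds of instances that appear and disappear.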
Point Finger and Assess Impact (an async S3 write was slow, no big deal)
Monitoring Summary
• Broken datacenter oriented tools are a big problem
• Integrating many different tools
– They are not designed to be integrated
– We have “persuaded” vendors to add APIs
• If you can’t see deep inside your app, you’re ☹
Wrap Up
Implications for IT Operations
• Cloud is run by developer organization
– Our IT department is Amazon Cloud
• Cloud capacity is much bigger than Datacenter
– Datacenter oriented IT staffing is flat
– We have no IT staff working on cloud
– We have moved 3 people out of IT to write code
• Traditional IT Roles are going away
– Don’t need SA, DBA, Storage, Network admins
Next Few Years…
• “System of Record” moves to Cloud (now)
– Master copies of data live only in the cloud, with backups
– Cut the datacenter to cloud replication link
• International Expansion – Global Clouds (later in 2011)
– Rapid deployments to new markets
• Cloud Standardization?
– Cloud features and APIs should be a commodity not a differentiator
– Differentiate on scale and quality of service
– Competition also drives cost down
– Higher resilience and scalability
We would prefer to be an insignificant customer in a giant cloud
Takeaway Netflix is path-finding the use of public AWS cloud to replace in-house IT for non-trivial applications with hundreds of developers and thousands of systems. acockcroft@netflix.com http://www.linkedin.com/in/adriancockcroft @adrianco #netflixcloud