1. Performance Architecture for Cloud March 7, 2011 Adrian Cockcroft @adrianco #netflixcloud #ccevent http://www.linkedin.com/in/adriancockcroft acockcroft@netflix.com
2. Who, Why, What
Netflix in the Cloud
Cloud Performance Challenges
Performance Architecture and Tools
3. Netflix.com is now ~100% Cloud
See http://techblog.netflix.com
Detailed SlideShare presentation: Netflix on Cloud – http://slideshare.net/adrianco
We have 25 minutes, not half a day, to discuss everything!
4. A Nice Problem To Have… http://techblog.netflix.com/2011/02/redesigning-netflix-api.html 37x Growth Jan 2010 - Jan 2011
5. Data Center
We stopped building our own datacenters
Capacity growth is accelerating, unpredictable
Product launch spikes - iPhone, Wii, PS3, Xbox
6. We want to use clouds, we don't have time to build them
Public cloud for agility and scale
AWS because they are big enough to allocate thousands of instances per hour for us
7. Netflix EC2 Instances per Account (summer 2010, production is up ~3x now…)
[Chart: instance counts over "Several Months" for the Content Encoding, Test and Production, and Log Analysis accounts – "Many Thousands" of instances]
8. AWS Performance? Mostly good, better than expected overall
• The Good
– Large EC2 instance types (esp. the m2 range)
– Internal disk performance
– Network performance within and between Availability Zones
– Robustness and scalability of S3, SQS
• The Bad
– Elastic Load Balancer has too many limitations
– SimpleDB needs a memcached front end, too many limitations at terabyte scale
• The Ugly
– EBS performance is slow and inconsistent, we avoid it
9. Learnings
• Datacenter oriented tools don't work
– Ephemeral instances
– High rate of change
– Need too much hand-holding and manual setup
• Cloud tools don't scale for Enterprise
– Too many tools are "Startup" oriented
– Built our own tools for 1000's of instances
– Drove vendors to be dynamic, scale, add APIs
• "Fork-lifted" apps are fragile
– Too many datacenter oriented assumptions
– We re-wrote our code base!
– (We re-write it continuously anyway)
11. Model Driven Architecture
• Datacenter Practices
– Lots of unique hand-tweaked systems
– Hard to enforce patterns
• Model Driven Cloud Architecture
– Perforce/Ivy/Hudson based builds for everything
– Every production instance is a pre-baked AMI
– Every application is managed by an Autoscaler
No exceptions, every change is a new AMI
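The "pre-baked AMI behind an autoscaler" rule maps directly onto AWS's autoscaling API. A minimal sketch, using today's boto3 as a stand-in (Netflix's actual 2011 tooling was in-house, built on Perforce/Ivy/Hudson; the AMI id, names, and sizes below are made up):

```python
import boto3

# Illustrative only: names, AMI id, and sizes are hypothetical.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Each code change bakes a new AMI; deploying means a new launch
# configuration pointing at it -- no instance is ever hand-tweaked.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="api-server-2011_03_07",
    ImageId="ami-0123456789abcdef0",   # hypothetical pre-baked AMI
    InstanceType="m2.4xlarge",
)

# Every application is managed by an autoscaler, no exceptions.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="api-server",
    LaunchConfigurationName="api-server-2011_03_07",
    MinSize=10,
    MaxSize=500,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```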
12. Model Driven Implications
• Automated "Least Privilege" Security
– Tightly specified security groups
– Fine-grained IAM keys to access AWS resources
– Performance tools security and integration
• Model Driven Performance Monitoring
– Hundreds of instances appear in a few minutes…
– Tools have to "garbage collect" dead instances
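A sketch of the "garbage collect dead instances" requirement: periodically reconcile whatever a monitoring tool thinks it is watching against what EC2 says is actually running. boto3 and the `registry` object are illustrative assumptions, not any vendor's real API:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def live_instance_ids():
    """Set of instance ids EC2 currently reports as running."""
    ids = set()
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ids.add(instance["InstanceId"])
    return ids

def garbage_collect(registry):
    """Drop monitored instances that no longer exist.
    `registry` stands in for a monitoring tool's instance list."""
    live = live_instance_ids()
    for instance_id in registry.all_instance_ids():
        if instance_id not in live:
            registry.remove(instance_id)
```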
13. Capacity Planning & Metrics
14. What is Capacity Planning?
• We care about
– CPU, Memory, Network and Disk resources consumed
– Application response times
• We need to know
– how much of each resource we are using now
– how much we will use in the future
– how much headroom we have to handle higher loads
• We want to understand
– how headroom varies
– how it relates to response times and throughput
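A back-of-envelope way to turn those three questions (usage now, usage later, headroom) into a single number, with made-up figures; this illustrates the idea, not a Netflix tool:

```python
import math

def months_of_headroom(capacity, current_usage, monthly_growth_rate):
    """Months until usage reaches capacity, assuming compound growth."""
    if current_usage >= capacity:
        return 0.0
    return math.log(capacity / current_usage) / math.log(1.0 + monthly_growth_rate)

# Example: at 40% of capacity and growing ~35% per month (roughly the
# 37x/year API growth shown earlier), headroom is gone in ~3 months.
print(months_of_headroom(capacity=1.0, current_usage=0.4, monthly_growth_rate=0.35))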
15. Capacity Planning in Clouds (a few things have changed…)
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases, can't be shrunk easily
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years (reservations!)
16. OK, so just give me the data!
Throughput – not hard
Response Time – mean + 2×SD? Percentiles?
Utilization….
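The "mean + 2×SD? percentiles?" question matters because latency distributions are heavily skewed, so the two summaries disagree. A quick synthetic demonstration (illustrative data, not Netflix measurements):

```python
import random

random.seed(42)
# Skewed, log-normal-ish latencies in ms: most requests fast, a long slow tail.
samples = sorted(random.lognormvariate(3.0, 1.0) for _ in range(100_000))

mean = sum(samples) / len(samples)
sd = (sum((x - mean) ** 2 for x in samples) / len(samples)) ** 0.5

def percentile(sorted_xs, p):
    """Nearest-rank percentile of an already-sorted sample."""
    return sorted_xs[int(p / 100.0 * (len(sorted_xs) - 1))]

# mean+2xSD lands between p95 and p99 here -- it tracks neither, and it
# moves when the tail moves, which is the argument for reporting percentiles.
print(f"mean+2xSD = {mean + 2 * sd:6.1f} ms")
print(f"p95       = {percentile(samples, 95):6.1f} ms")
print(f"p99       = {percentile(samples, 99):6.1f} ms")
```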
17. Utilization
"Utilization is virtually useless as a metric" – CMG 2006 paper by Adrian Cockcroft
Virtualization is a DOS attack on Capacity Planning…
18. What would you say if you were asked:
Q: That system is slow, how busy is it?
A: I have no idea…
A: The graph in this tool looks about 50%
A: But the graph in this other tool is 65%
A: Amazon CloudWatch says 82%
A: Linux says us sy ni id wa st ☹
A: Why do you want to know?
A: I'm sorry, you don't understand your question….
19. What's the problem with Utilization?
• CPU Capacity
– Varying capacity due to multi-tenancy
– Non-identical servers or CPUs (check /proc/cpuinfo)
– Non-linear capacity due to hyperthreading etc.
• Measurement Errors
– Monitoring tools that ignore "stolen time" (all of them)
– Mechanisms with built-in bias (clock tick counting)
– Platform and release specific changes in metrics
Every tool shows a different value for the same metric!
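The "stolen time" point is concrete on Linux: /proc/stat exposes a steal (`st`) column that most tools of that era silently dropped. A minimal sampler; note that whether steal counts as "busy" is itself the kind of ambiguity the slide complains about:

```python
import time

def cpu_times():
    """Counters from the first line of /proc/stat:
    user nice system idle iowait irq softirq steal."""
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:9]]

def utilization(interval=1.0):
    """Busy% and steal% over an interval, from counter deltas."""
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    user, nice, system, idle, iowait, irq, softirq, steal = (
        b - a for a, b in zip(before, after)
    )
    total = user + nice + system + idle + iowait + irq + softirq + steal
    # Judgment call: count steal as "busy", since the VM wanted the CPU
    # and couldn't have it. Tools that ignore the steal column instead
    # fold it into idle and under-report load.
    busy = total - idle - iowait
    return 100.0 * busy / total, 100.0 * steal / total

busy, stolen = utilization()
print(f"busy {busy:.1f}% (of which {stolen:.1f}% stolen by the hypervisor)")
```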
20. Performance Tools Architecture
21. Monitoring Issues
• Problem
– Too many tools, each with a good reason to exist
– Hard to get an integrated view of a problem
– Too much manual work building dashboards
– Tools are not discoverable, views are not filtered
• Solution
– Get vendors to add deep linking URLs and APIs
– Integration "portal" ties everything together
– Underlying dependency database
– Dynamic portal generation, relevant data, all tools
22. Data Sources
External Testing
• External URL availability and latency alerts and reports – Keynote
• Stress testing – SOASTA
Request Trace Logging
• Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
• Generic HTTP – AppDynamics service tier aggregation, end-to-end tracking
Application Logging
• Tracers and counters – log4j, tracer central, Chukwa to DataOven
• Trackid and Audit/Debug logging – DataOven, AppDynamics GUID cross-reference
• Application-specific real time – Nimsoft, AppDynamics, Epic
JMX Metrics
• Service and SLA percentiles – Nimsoft, AppDynamics, Epic, logged to DataOven
Tomcat and Apache Logs
• Stdout logs – S3 – DataOven, Nimsoft alerting
• Standard-format Access and Error logs – S3 – DataOven, Nimsoft alerting
JVM
• Garbage collection – Nimsoft, AppDynamics
• Memory usage, call stacks, resource/call – AppDynamics
Linux
• System CPU/Net/RAM/Disk metrics – AppDynamics, Epic, Nimsoft alerting
• SNMP metrics – Epic; network flows – Fastip
AWS
• Load balancer traffic – Amazon CloudWatch; SimpleDB usage stats
• System configuration – CPU count/speed and RAM size, overall usage – AWS
23. Integrated Dashboards
24. Dashboards Architecture
• Integrated Dashboard View
– Single web page containing content from many tools
– Filtered to highlight the most "interesting" data
• Relevance Controller
– Drill in, add and remove content interactively
– Given an application, alert or problem area, dynamically build a dashboard relevant to your role and needs
• Dependency and Incident Model
– Model Driven – interrogates tools and AWS APIs
– Document store to capture dependency tree and states
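One way to picture the "dynamic portal generation" idea: given an application and its dependency tree, emit a page of deep links into each tool's filtered view. Every tool URL pattern below is a hypothetical placeholder; the slide's point is that vendors had to add such linkable views first:

```python
# All URL patterns are hypothetical placeholders, not real vendor APIs.
TOOL_LINKS = {
    "AppDynamics": "https://appd.example.com/app/{app}/dashboard",
    "Epic":        "https://epic.example.com/graphs?app={app}",
    "Nimsoft":     "https://nimsoft.example.com/alerts?service={app}",
    "DataOven":    "https://dataoven.example.com/logs?app={app}",
}

def render_dashboard(app, dependencies):
    """One HTML page of deep links covering an app and its dependency tree."""
    rows = []
    for service in [app, *dependencies]:
        links = " | ".join(
            f'<a href="{url.format(app=service)}">{tool}</a>'
            for tool, url in TOOL_LINKS.items()
        )
        rows.append(f"<li><b>{service}</b>: {links}</li>")
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"

# The dependency list would come from the document store the slide
# describes; it is hard-coded here for illustration.
print(render_dashboard("api-server", ["memcached", "simpledb-proxy"]))
```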
25. Dashboard Prototype (not everything is integrated yet)
26. AppDynamics – How to look deep inside your cloud applications
• Automatic Monitoring
– Base AMI bakes in all monitoring tools
– Outbound calls only – no discovery/polling issues
– Inactive instances removed after a few days
• Incident Alarms (deviation from baseline)
– Business Transaction latency and error rate
– Alarm thresholds discover their own baseline
– Email contains URL to Incident Workbench UI
27. Using AppDynamics (simple example from early 2010)
28. Switch to Snapshot View, pick a slow call graph
29. Interactions for this Snapshot – click to view call graph
30. Point Finger and Assess Impact (an async S3 write was slow, no big deal)
31. Summary
• Performance of AWS systems isn't an issue
• Broken datacenter tools and metrics are the issue!
• Integrating too many different tools
– They are not designed to be integrated
– Did I mention that I hate flash based user interfaces?
– We have "persuaded" vendors to add APIs
• If you can't see deep inside your app, you're ☹
Questions? Job Applications? @adrianco #netflixcloud #ccevent