DevOps Nirvana: Seven Steps to a Peaceful Life on AWS (ARC210) | AWS re:Invent 2013

(Presented by Stackdriver) Key decisions related to architecture, tools, processes, and even team composition can have a dramatic effect on the human effort required to operate distributed applications on AWS. If you make the wrong decisions in these areas, you spend your days, nights, weekends, and vacations dealing with issues and noise. If you make the right decisions, you and your team can focus on building customer value, and your time away from work is spent… not working.
Stackdriver and Smugmug describe the seven most important practices that world-class operations teams employ to minimize operational overhead, highlighting real-world examples to illustrate the importance of each.

Published in: Technology, Business


Transcript

  • 1. Seven Steps to a Peaceful Life on AWS. Andrew Shieh, SmugMug (@shandrew); Philip Jacob, Stackdriver (@whirlycott). Friday, November 15, 13
  • 4. Stuff we have in common: ✓ Years of AWS experience; ✓ Success and failure with many lessons learned; ✓ Both using Stackdriver for infrastructure monitoring; ✓ Lots of data; ✓ Philosophically aligned on how to run on AWS; ‣ Superheroes
  • 6. [Chart: cloud hype (vertical axis) over time (horizontal axis). Labels: Peak of Expectations, DevOps Nirvana, Operational Enlightenment, Transition to Distributed Systems, Lure of Elasticity]
  • 7. STEPS
  • 9. 1: Apply lean production principles
  • 10. Release all the time: continuous improvement
  • 11. Make it frictionless
  • 12. $ stack deploy
  • 14. 2: Choose the right instance type
  • 15. Factors to consider: CPU, network, disk I/O, workload, cost. Tools to help you decide: vmstat, iostat, sar, R, Excel, Stackdriver + agent
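One way to turn the factors on slide 15 into a decision is to sample a host with vmstat and check which resource dominates. The following is a minimal sketch, not the speakers' method: the sample output is invented for illustration, and the function names and the 70% / 20% thresholds are assumptions to adjust for your own workload.

```python
# Illustrative vmstat output; the numbers are made up for this sketch.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 812344  65120 934112    0    0    12    40  610 1480 62  5 32  1  0
 6  0      0 801220  65120 934200    0    0     8    36  950 2100 88  9  2  1  0
 7  0      0 799104  65124 934260    0    0     4    20  990 2300 90  8  1  1  0
"""


def parse_vmstat(text):
    """Return one dict per sample row, keyed by vmstat column name."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    # The column header is the line whose first token is "r".
    header = next(cols for cols in lines if cols[0] == "r")
    # Sample rows are the lines made up entirely of integers.
    rows = [cols for cols in lines if all(tok.isdigit() for tok in cols)]
    return [dict(zip(header, map(int, cols))) for cols in rows]


def classify(samples, busy_threshold=70.0, iowait_threshold=20.0):
    """Label a workload from average CPU busy time (us + sy) and I/O wait (wa).

    Both thresholds are hypothetical defaults, not recommendations.
    """
    n = len(samples)
    cpu_busy = sum(s["us"] + s["sy"] for s in samples) / n
    io_wait = sum(s["wa"] for s in samples) / n
    if io_wait >= iowait_threshold:
        return "io-bound"
    if cpu_busy >= busy_threshold:
        return "cpu-bound"
    return "no clear bottleneck"


samples = parse_vmstat(SAMPLE_VMSTAT)
print(classify(samples))
```

A CPU-bound result would point toward a compute-optimized family and an I/O-bound one toward faster storage, but cost and network still need to be weighed separately.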
  • 16. [Chart: Distribution of EC2 Instance Usage. Instance types in chart order: m1.large, m1.small, m1.medium, c1.medium, c1.xlarge, t1.micro, m1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, m3.xlarge, m3.2xlarge, cc2.8xlarge, hi1.4xlarge, cg1.4xlarge, hs1.8xlarge, cc1.4xlarge, cr1.8xlarge. Bar values in chart order: 21%, 20%, 12%, 11%, 9%, 7%, 7%, 3%, 2%, 2%, 2%, 1%, 1%, 0%, 0%, 0%, 0%, 0%]
  • 17. + EC2
  • 18. 3: Use configuration management
  • 20. 4: Choose the right monitoring solution
  • 22. Rapid Setup, Full-stack, AWS Integration, Cluster-aware, Intelligent
  • 23. 5: Design effective alerting policies
  • 24. Simple rules for confidently waking up ops@ at 3am: 1. Something had better be broken (or close to it) for the customer. 2. The broken thing should be as obvious as possible. 3. It should be clear what action I can take to make the situation better. Example: (1) Customers seeing huge spike in 5XX errors. (2) Code deploy to web cluster one hour ago. (3) Revert!
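The three rules on slide 24 can be read as a gate that an alert must pass before it pages anyone. Here is a minimal sketch of that idea, assuming a hypothetical `Alert` record; the field names and the `should_page` function are illustrative and do not come from the talk or from any Stackdriver API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record; the field names are illustrative only."""
    summary: str
    customer_impact: bool            # rule 1: broken (or close to it) for the customer
    broken_component: Optional[str]  # rule 2: what is broken, stated plainly
    suggested_action: Optional[str]  # rule 3: what the on-call person can do


def should_page(alert: Alert) -> bool:
    """Return True only when all three rules from the slide hold."""
    return (
        alert.customer_impact
        and bool(alert.broken_component)
        and bool(alert.suggested_action)
    )


# The example from the slide: 5XX spike, recent deploy, revert.
spike = Alert(
    summary="Customers seeing huge spike in 5XX errors",
    customer_impact=True,
    broken_component="Code deploy to web cluster one hour ago",
    suggested_action="Revert!",
)

# An invented noisy alert: no customer impact and no clear action.
noise = Alert(
    summary="CPU at 85% on one batch worker",
    customer_impact=False,
    broken_component=None,
    suggested_action=None,
)

print(should_page(spike))  # True
print(should_page(noise))  # False
```

Alerts that fail the gate can still be recorded for daytime review; the sketch only decides what is worth a 3am page.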
  • 25. 6: Architect for high availability
  • 26. Apache Zookeeper, Elastic Load Balancing, Amazon RDS
  • 27. [Architecture diagram. Labels: Cloud Integration, System Agents, Custom Metrics, Workers, Agents, API, Data Ingestion, DNS, Elastic Load Balancing w/ haproxy, Load Balancing 1..n, Cell-1..n GW, MQ, Online Analysis, Archival, Serving, S3, Cassandra, Anomaly, Health, Batch Aggregation, Web/Mobile, Correlation, Trending, UI. Callouts: Localized failure; Identical dimensions; Easy to reason; Network partitions ok]
  • 28. Handling failure. Resilience: Avoid it, Mask it, Minimize it, Recover quickly. Tolerance: Cluster, AZ, Region
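As one possible reading of "mask it" at the AZ level from slide 28, a client can try an endpoint in each Availability Zone in turn so that a single-zone outage is invisible to the caller. This is a sketch under assumptions, not the architecture shown in the talk: the endpoint names, `call_with_failover`, and `fake_request` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the list has failed."""


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    errors: List[Tuple[str, Exception]] = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append((endpoint, exc))
    raise AllEndpointsFailed(errors)


# Hypothetical endpoints, one per Availability Zone.
ENDPOINTS = ["api.us-east-1a.example.com", "api.us-east-1b.example.com"]


def fake_request(endpoint: str) -> str:
    """Stand-in for a network call; simulates an outage in the first zone."""
    if "us-east-1a" in endpoint:
        raise ConnectionError("zone unavailable")
    return f"ok from {endpoint}"


result = call_with_failover(ENDPOINTS, fake_request)
print(result)
```

The same pattern extends to the Region level by listing endpoints in more than one Region, at the cost of higher latency on failover.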
  • 29. 7: Think holistically about quality assurance
  • 30. AUTOSCALING + AUTOMATION + CONTINUOUS INTEGRATION + DEVOPS GOVERNANCE + ELASTICITY + PROGRAMMABLE INFRASTRUCTURE = CONSTANT CHANGE
  • 31. You cannot pre-test every change. So you need to be really good at detecting issues. Very quickly.
  • 32. Monitoring is a key part of quality assurance for dynamic systems. But monitoring tools need to be intelligent: distributed sensors, cloud-aware, anomaly detection, synthetic transactions
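Slide 32 lists anomaly detection among the traits of an intelligent monitoring tool. A very simple form of it is a z-score check of the latest reading against recent history, sketched below. This is an assumption-laden illustration rather than how Stackdriver detects anomalies: the function name, the threshold of 3 standard deviations, and the sample error rates are all invented.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is unusual
    return abs(latest - mu) / sigma > threshold


# Invented per-minute 5XX error rates (percent) before a deploy.
baseline = [0.4, 0.5, 0.6, 0.5, 0.4, 0.5]

print(is_anomalous(baseline, 0.55))  # False
print(is_anomalous(baseline, 9.0))   # True
```

A check like this catches sudden spikes quickly but ignores daily and weekly patterns, so production systems usually compare against a seasonal baseline instead.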
  • 33. Training. Recommended reading: Systemantics (aka The Systems Bible); High Scalability (http://highscalability.com/); James Hamilton’s blog (http://perspectives.mvdirona.com/)
  • 34. Visit us at http://www.smugmug.com/
  • 35. Visit us at booth 315!