Herding cats in the Cloud

564 views

Maintaining operational sanity when developers have operational responsibility. (Devops!)

Published in: Technology

  1. HERDING CATS IN THE CLOUD
     MAINTAINING OPERATIONAL SANITY IN A CLOUDY, DEVOPS WORLD
     Dewey Sasser
     Consulting Cloud Architect
     Aligned Software
  2. ABOUT THIS TALK
     Public Clouds can give developers unprecedented levels of power
     “With Great Power Comes Great Responsibility”
     You must structure your development and production deployment process to use this power well
     How do we do this?
     Experience from a large deployment
  3. ABOUT DEWEY
     Distributed Application Developer for 20 years
     Doing build/release/software process for about that long
     Accidentally doing devops out of self-defense
     Wandered into operations about 5 years ago
     Built some private cloud for dev
     Built some private cloud for prod
     Started architecting using public cloud for everything
  4. ABOUT THE COMPANY
     Company Policy: don’t talk for the company
     Therefore, these slides don't mention The Company.
     There is no information here that is not otherwise publicly available.
     Whoever it is, I don't speak for them
     Major Gaming Company, multiple AAA titles
     History in MMOs
     All in on mobile now
  5. ROADMAP
     What we Did
     What we're Doing
     What we might do Next
  6. WE'RE COMING FROM...
     Traditionally MMOs in colo
     Windows (ugh!) based servers
     All in cloud now: mobile, cloud, Docker, MongoDB, Phoenix Servers, Chaos Monkey, (...other popular buzzwords)
  7. GOALS
     100% uptime: players want to play
     No more: Patch days, "Down for maintenance"
     Profit ( = revenue – cost)
  8. SCALE
     $100ks of monthly spend
     Many hundreds of instances
     Around 500TB of monthly transfer
     Peak to 12k tps (for a single title)
     Around 1 PB of storage
     Approximately 5 billion I/Os monthly
  9. USAGE/LOAD PATTERN
     Traditional SaaS assumes starting small and scaling. Scaling quickly is a problem, but a good problem.
     Games are weird
     Peak usage is release day, it tails off after that
     You must be able to scale out of the gate.
     Users that cannot use it the first day will often never be back!
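
A rough sketch of the launch-day load shape from slide 9. Only the 12k tps peak comes from the deck (slide 8). The exponential decay model, the `tau_days` constant, and the per-instance capacity are assumptions made for illustration.

```python
import math

def expected_tps(day, peak_tps=12_000, tau_days=30.0):
    """Assumed model: load peaks on release day (day 0) and decays exponentially.

    tau_days is a made-up decay constant, not a figure from the talk.
    """
    return peak_tps * math.exp(-day / tau_days)

def instances_needed(tps, tps_per_instance=200):
    """Hypothetical capacity figure: each instance serves tps_per_instance."""
    return math.ceil(tps / tps_per_instance)

# Capacity must already be in place on day 0; it can shrink afterwards.
for day in (0, 7, 30, 90):
    tps = expected_tps(day)
    print(f"day {day:>2}: ~{tps:,.0f} tps, {instances_needed(tps)} instances")
```

Under these assumptions the largest fleet is needed before the first player arrives, which is the opposite of the start-small pattern.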
  10. PLATFORMS
      Swarm pattern
      Pods of services
      Python/NGINX
      Batch Processing pattern
      Vertica
      Elastic Map/Reduce
      Work Queue (Kafka)
      NoSQL (MongoDB – ugh!)
      Gaming Platform
      CoreOS/Docker
      Strong Phoenix Server pattern
  11. PROCESS/SOCIAL APPROACH
      Must be (people) scalable
      Working on 3 new games at any one time
      Still supporting old games
      Supporting services for the larger company
      Don't create a bottleneck
      “I'm waiting for a VM”. Bad process. No biscuit.
      There are too many controls to get least privilege right!
      Validation, not prevention (WHAT???)
  12. POLICIES ARE GREAT, BUT...
      They change over time
      Are hard to get exactly right up front
      Always have exceptions
      The space of AWS permissions is HUGE. Permutations are deadly.
      So...measure what you care about.
      What you care about will change over time.
      Trust...and verify
  13. POWER TO THE PEOPLE (OR DEVELOPERS)
      Don't gate productivity on fine points of arbitrary policies
      Keep responsibility with dev team
      domain expertise
      put the pain where the control is
      Stuff gets automated!!!
  14. APPROACH
      Cloud Environment
      Multiple accounts (~ 2 dozen right now)
      1 central services account
      1 account per title
      All environments in different VPCs (Dev, QA, Perf, Staging, Prod)
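
The account layout on slide 14 could be written down as data along these lines. The environment list is from the slide. The title names, the VPC naming scheme, and the single VPC in the central account are placeholders, not details from the talk.

```python
ENVIRONMENTS = ["Dev", "QA", "Perf", "Staging", "Prod"]  # from the slide

def build_layout(titles):
    """Return {account_name: [vpc_name, ...]}.

    One account per title, one VPC per environment, plus one central
    services account. All names here are hypothetical.
    """
    layout = {"central-services": ["central-services-vpc"]}  # assumed single VPC
    for title in titles:
        layout[title] = [f"{title}-{env.lower()}" for env in ENVIRONMENTS]
    return layout

layout = build_layout(["game-a", "game-b"])  # placeholder title names
for account, vpcs in layout.items():
    print(account, "->", ", ".join(vpcs))
```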
  15. DEV TEAMS RESPONSIBLE FOR...
      Developing, validating, deploying and running their games
      Responding to production issues
      PRODUCTION cost control
  16. CENTRAL "CLOUD SERVICES" TEAM
      “Owns”
      Metrics, Monitoring, Alerting
      Enables use of central services & good practices
      Composable components used by the teams
      Native packaging -- make it easy
      Manages good practices
      Their job is to be cloud experts
      But they're not the only ones in the company
      LOTS of conversation!
      Automates everything non-project specific
      New account creation, ...
  17. OWNERSHIP/RESPONSIBILITY
      Clearly align authority and responsibility.
      If a Dev is getting up in the middle of the night to fix something, they have to have full power to fix it.
      On a related note, that means the teams get approval control over a great deal
  18. GREAT, HOW?
  19. CRITICAL TOOL: RULES & WORKFLOWS
      Custom developed rules/workflow system
      Rules are small, stateless snippets of Python code that trigger workflows
      But can be company public and extensible by pull request
      Workflows are potentially long running, stateful operations that trigger a list of changes.
      Can also be company public, but tighter controls around changes.
      Changes can be reviewed manually or automatically.
  20. CRITICAL TOOL: RULES & WORKFLOWS
      Runtime is HIGHLY privileged – keep it tight!
      This tool can destroy the world – but it actually keeps it running.
      (you have everything automated to recreate the world, right?)
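
The rules/workflow system itself is custom and not shown in the deck. The following is only a sketch of the split slides 19 and 20 describe: stateless rules that trigger stateful workflows whose changes get reviewed. Every class, function, and field name is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """Stateful and potentially long running: collects proposed changes for review."""
    name: str
    changes: list = field(default_factory=list)
    approved: bool = False

    def propose(self, change):
        self.changes.append(change)

    def review(self, auto_approve=lambda change: False):
        # Automatic review: approved only if the callback accepts every change.
        # Anything it rejects would be left for a human to review.
        self.approved = all(auto_approve(c) for c in self.changes)
        return self.approved

def rule_untagged_instance(instance):
    """Stateless rule: inspects only its input, returns a workflow name or None."""
    if "Owner" not in instance.get("tags", {}):
        return "notify-and-tag"
    return None

def run_rules(rules, resources):
    """Evaluate every rule against every resource; start the triggered workflows."""
    started = []
    for resource in resources:
        for rule in rules:
            workflow_name = rule(resource)
            if workflow_name:
                wf = Workflow(workflow_name)
                wf.propose({"resource": resource["id"], "action": workflow_name})
                started.append(wf)
    return started

resources = [  # made-up inventory
    {"id": "i-001", "tags": {"Owner": "alice"}},
    {"id": "i-002", "tags": {}},
]
workflows = run_rules([rule_untagged_instance], resources)
print([(wf.name, wf.changes) for wf in workflows])
```

Keeping rules free of state and side effects is what makes it plausible to accept them by pull request, while the privileged part stays inside the workflow runtime.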
  21. USER ACCESS CONTROL
      Automate user management/creation from source in GIT
      Define membership rules as intersection of desired group and account characteristics (MFA anyone?)
      Rules/Workflow enforces MFA. Central team doesn't have to
      Remove your MFA, get demoted to “User”
  22. USER ACCESS CONTROL
      Don't try for least privilege – you won't get it right and it will be different tomorrow
      There are a small number of access levels and people are sorted into those levels per account
      User
      ReadOnly (Manager)
      Finance
      Developer
      DevOps
      FullAdmin
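
A minimal sketch of the membership rule from slides 21 and 22, assuming the user definitions in Git boil down to records like the ones below. The level names are from the slide. The record fields and the function name are made up.

```python
ACCESS_LEVELS = ["User", "ReadOnly", "Finance", "Developer", "DevOps", "FullAdmin"]

def effective_level(desired_level, has_mfa):
    """Membership = desired group intersected with account characteristics.

    Without MFA the user is demoted to "User", as on the slide.
    """
    if desired_level not in ACCESS_LEVELS:
        raise ValueError(f"unknown access level: {desired_level}")
    return desired_level if has_mfa else "User"

# Hypothetical records standing in for the user definitions kept in Git.
users = [
    {"name": "alice", "account": "game-a", "desired": "DevOps", "has_mfa": True},
    {"name": "bob", "account": "game-a", "desired": "Developer", "has_mfa": False},
]
for u in users:
    print(u["name"], u["account"], effective_level(u["desired"], u["has_mfa"]))
```

Because the demotion is computed rather than requested, nobody on the central team has to chase people about MFA.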
  23. USER ACCESS CONTROL NG
      Federation? Yes, but there are issues
      SSO? Likewise
      We'll probably go to a SAML based federated MFA gateway
      We might go to AD based access
  24. NETWORK ACCESS CONTROL
      VPN into the cloud
      Bastion hosts
      Private VPCs
      Shared root keys
      Yup, shared.
      No user management on individual nodes
      Cattle, not cats
  25. COST CONTROL
      It's a thing.
      It's a really BIG THING!
  26. COST CONTROL
      Tagging policy
      Owner (who to go to)
      Environment (Dev, Prod, QA, …)
      Project (Cost Center – DO NOT USE THIS FOR AUTOMATION!)
      Enforce tagging by rules/workflow process
      Measure compliance, escalate to GM
      Kill off instances that don't comply
      With lots of warning
      Now tools will give good data
      CloudHealth (there are others)
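
One way the tagging enforcement on slide 26 might look as a check. The three required tags are from the slide. The day thresholds, the `first_seen_noncompliant` field, and the action names are assumptions, since the deck only says "with lots of warning".

```python
from datetime import date

REQUIRED_TAGS = ("Owner", "Environment", "Project")  # from the slide
WARN_DAYS = 7         # assumed: days of warnings before escalating
TERMINATE_DAYS = 21   # assumed: days before the instance is killed

def missing_tags(instance):
    tags = instance.get("tags", {})
    return [t for t in REQUIRED_TAGS if not tags.get(t)]

def next_action(instance, today):
    """Pick an action based on how long the instance has been non-compliant."""
    if not missing_tags(instance):
        return "ok"
    age = (today - instance["first_seen_noncompliant"]).days
    if age >= TERMINATE_DAYS:
        return "terminate"
    if age >= WARN_DAYS:
        return "escalate"
    return "warn"

def compliance_rate(instances):
    """Fraction of instances carrying every required tag."""
    compliant = sum(1 for i in instances if not missing_tags(i))
    return compliant / len(instances) if instances else 1.0

today = date(2024, 1, 31)  # arbitrary example date
fleet = [
    {"id": "i-1", "tags": {"Owner": "alice", "Environment": "Prod", "Project": "game-a"}},
    {"id": "i-2", "tags": {"Owner": "bob"}, "first_seen_noncompliant": date(2024, 1, 30)},
    {"id": "i-3", "tags": {}, "first_seen_noncompliant": date(2024, 1, 1)},
]
for inst in fleet:
    print(inst["id"], next_action(inst, today))
print("compliance:", compliance_rate(fleet))
```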
  27. WHAT YOU CARE ABOUT WITH COSTS (AWS SPECIFIC)
      Reserved Instances
      Go for about 80% of always on – Leave room to optimize
      Periodically review it and move RIs
      Turn off developer systems overnight – small but significant.
      Stay on current generation (instance type and OS)
      Better performance/$, results in lower $
      Pay attention to traffic – inter AZ as well as outbound.
      Compression!
      Do cost estimates based on loads – have guidelines
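
The 80% guideline and the overnight shutdown on slide 27 reduce to simple arithmetic. In this sketch "always on" is taken to be the minimum concurrent instance count over the sampled period, which is an assumption. The sample counts, the hourly rate, and the 12 off-hours per day are made up.

```python
import math

def ri_target(instance_counts, coverage=0.80):
    """Reserve about `coverage` of the always-on baseline.

    Baseline = minimum concurrently running count in the samples (assumed
    definition of "always on").
    """
    baseline = min(instance_counts)
    return math.floor(baseline * coverage)

def overnight_savings(dev_instances, hourly_rate, off_hours_per_day=12, days=30):
    """Rough monthly saving from stopping developer systems overnight."""
    return dev_instances * hourly_rate * off_hours_per_day * days

counts = [120, 118, 150, 240, 310, 205, 125]  # made-up samples
print("reserve:", ri_target(counts), "instances")
print("dev overnight saving per month:", overnight_savings(40, 0.10))
```

Reserving below the baseline leaves headroom to change instance types or shrink the fleet without stranding reservations.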
  28. ACTUALLY HERDING THE CATS
      Devops Working Group
      Senior engineers
      No managers: If you can't put hands on a keyboard to fix something going wrong, this is not the place for you
      Things are brought up, opinions are formed. Don’t attribute to individuals.
      Discuss cross-cutting needs
      GREAT place for the central cloud team to mine for new work
  29. ACTUALLY HERDING THE CATS
      Central Cloud Team
      Is ½ service organization and ½ cloud owner
      Be nice, or the cats will go away and ignore you.
      The cats are your scouts and your customers.
      Listen to them so you know what's important.
  30. RESULTS
      PROs
      Maximizes velocity, agility
      Scalable
      Can try out different working patterns
      CONs
      Inconsistent
      Have to be careful about responsibilities
      You always have some weeds in the garden
      You're always trying to keep up with developers
      But at least you know it
      And you're not in the way
  31. RESOURCES
      AWS Enterprise Support
      Expensive, but good
      Cloud based services – lots of options here
      OpEX, not CapEX (except for RIs?)
      Metrics (Librato)
      Cost Exploration (CloudHealth)
  32. TOOLS
      Automated Rules/Workflow
      Github Enterprise – it's Github that makes your security geeks happy.
      Docker
      Quay (Private Docker Hub)
      Jenkins (for network Cron)
      Chef (not much use any more)
  33. NEXT STEPS
      Cloud Services Liaisons
      Send a member of cloud central to each team's sprint planning
      "Lunch and Learn"
      Goes both ways -- NOT just the cloud central team
      More Policy Automation!!!
  34. LESSONS LEARNED
      Start when you're small – fixing the problem after the fact is much harder
      Automate everything, even when you don't “have” to – it makes things easier to change
      Have a Central Services Team to deal with cross-cutting concerns
      Put the power in the hands of people who can make things better
  35. QUESTIONS?
  36. PHOTO CREDITS
      • https://www.flickr.com/photos/pelican/6180235561
      • https://www.youtube.com/watch?v=puijCrETsrY
      • https://www.flickr.com/photos/afu007/2398217277
      • https://www.flickr.com/photos/jurvetson/5419597546
      • https://commons.wikimedia.org/wiki/File:Catch_cats_3.JPG
      • https://pixabay.com/en/photos/pet/?cat=industry
      • https://commons.wikimedia.org/wiki/File:White_Cat_and_a_mouse.jpg
      • https://www.flickr.com/photos/dan4th/2839915202
      • https://pixabay.com/en/cat-annoyed-mauzen-teeth-stress-1370024/
      • https://et.wikipedia.org/wiki/Pilt:PR_Siriuksen_EeroCurl_ACS_ds_09_24_1.JPG
      • https://commons.wikimedia.org/wiki/File:PR_Siriuksen_EeroCurl_ACS_ds_09_24_2.JPG
      • https://commons.wikimedia.org/wiki/File:Tunnel_cat_(6414878527).jpg
      • https://www.flickr.com/photos/petsadviser-pix/8652859754
      • https://commons.wikimedia.org/wiki/File:Antu_mongodb.svg
      • https://www.flickr.com/photos/michael-broad/4642745499
      • http://maxpixel.freegreatpicture.com/Cat-Animal-Kennel-Cats-Eyes-Cute-Cat-Animals-269047
      • http://maxpixel.freegreatpicture.com/Cat-Kitty-Kitten-Cute-Pipe-Curious-Tube-Feline-568593
      • https://www.flickr.com/photos/santamonicamtns/16613805934
      • http://maxpixel.freegreatpicture.com/Surprise-Kitten-Kittens-Cat-Money-Animals-Pet-602944
      • https://www.pexels.com/photo/animals-cat-pets-7792/
      • https://commons.wikimedia.org/wiki/File:Cat_into_the_box.jpg
      • https://commons.wikimedia.org/wiki/File:White_cat_over_water_2012.jpg
      • https://pixabay.com/en/black-cat-reading-white-paper-33843/
      • https://pixabay.com/en/photos/hidden/
      • https://www.flickr.com/photos/editor/1195653047
      • https://en.wikipedia.org/wiki/File:Exponential_Decay_Function.png
      • Other Photos by Chris Williams, Dewey Sasser, and Jennifer Moore
      All photos found by Google Images marked for commercial reuse, or by personal permission
