StartOps: Growing an ops team from 1 founder


Bootstrapped startups don't have the luxury of a full team of ops engineers available to respond to issues 24/7, so how can you survive on your own? This talk will tell the story of how to run your infrastructure as a single founder through to growing that into a team of on-call engineers. It will include some interesting war stories as well as tips and suggestions for how to run ops at a startup.

Presented at DevOpsDays London 2013 by David Mytton.


  1. StartOps: Growing an ops team from 1 founder
     - Lots of knowledge online, but it usually assumes you have a team and plenty of time and money
     - That is the goal, but it doesn't start like that, so I'm going to talk about the stages to get there
     - Tips and tools to help along the way
     - I'll use my own company, and gratuitous photos of Japan, to illustrate the points
  2. David Mytton
     Woop Japan!
  3. Bootstrapping sometimes means leaving things to the last minute.
     - First tip
     - Limited resources: people, time
  4. April 2009
     - Quick development
     - Experience with PHP + MySQL
     - Slicehost was cheap
     - Problems with MySQL, so moved to MongoDB
  5. Why?
     • Replication
  6. Why?
     • Replication
     • Official drivers
  7. Why?
     • Replication
     • Official drivers
     • Easy deployment
  8. Why?
     • Replication
     • Official drivers
     • Easy deployment
     • Fast out of the box (sort of)¹
     ¹ = changes to WriteConcern
  9. david@pan ~: df -a
     Filesystem     1K-blocks      Used Available Use% Mounted on
     /dev/sda1      156882796 148489776    423964 100% /
     proc                   0         0         0    - /proc
     none                   0         0         0    - /dev/pts
     none             2097260         0   2097260   0% /dev/shm
     none                   0         0         0    - /proc/sys/fs/binfmt_misc
     david@pan ~: df -ah
     Filesystem     Size  Used Avail Use% Mounted on
     /dev/sda1      150G  142G  415M 100% /
     proc              0     0     0    - /proc
     none              0     0     0    - /dev/pts
     none            2.1G     0  2.1G   0% /dev/shm
     none              0     0     0    - /proc/sys/fs/binfmt_misc
     - Needed to upgrade a machine
     - Resize = downtime
     - Resyncing finished just in time
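The lesson from the slide above is not to find out from `df` that a disk is at 100%. A minimal sketch of an early-warning check, using only the standard library (the 90% threshold is an illustrative choice, not from the talk):

```python
import shutil

def disk_usage_pct(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

# Alert well before df reports 100%, while there is still time to resize or resync
if disk_usage_pct("/") > 90:
    print("warning: root filesystem is above 90% used")
```

Run from cron or a monitoring agent, a check like this buys the lead time the slide's resync only just managed to find.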
  10. MongoDB at Server Density
      • 27 nodes
  11. MongoDB at Server Density
      • 27 nodes
      • 17TB data per month
  12. MongoDB at Server Density
      • Queues
      • Primary data store
      • Time series
  13. It also means trying to find the quickest way.
      david@asriel ~: scp david@stelmaria:~/local/local.11 .
      local.11    100% 2047MB   6.8MB/s   05:01
      - Needed to resync a database server across the US
      - Would take too long; oplog not large enough
      - Fast internal network but slow internet
  14. 1d, 1h, 58m
      11.22MB/s
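Before kicking off a cross-country resync like this, a back-of-the-envelope estimate tells you whether the oplog window will survive the copy. A sketch using the numbers from the scp output above (the function is mine, not from the talk):

```python
def transfer_seconds(size_mb: float, throughput_mb_s: float) -> float:
    """Wall-clock seconds to copy size_mb at a sustained throughput."""
    return size_mb / throughput_mb_s

# The 2047MB chunk at 6.8MB/s shown in the scp output:
secs = transfer_seconds(2047, 6.8)
print(f"{int(secs // 60)}m {int(secs % 60)}s")  # matches the ~05:01 scp reported
```

Scale the same arithmetic up to the full data set and you get the "1d, 1h, 58m" figure on the slide: too long for a small oplog, which is why the transfer route and throughput matter.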
  15. Hacking traveling
      • Roaming is expensive
      - Wifi hotspot
      - Prepaid SIM
      - Euro data cap
  16. Hacking traveling
      • Starbucks free wifi + power
  17. Hacking traveling
      • Travel light
      - Buy things locally
  18. Hacking traveling
      • Don't update
      - Like "no deploy Friday"
      - Server updates
      - Local OS updates
  19. Let other people help
      - Summer 2009: moved to several managed servers with Rackspace
  20. Let other people help
      • Managed hosts
      - Rackspace managed hosting
      - Softlayer charges $1/ticket
  21. Let other people help
      • Managed hosts
      • Support contracts
      - Depending on the level of support you buy
      - Expensive
      - There are ways to work around that, e.g. getting involved with the projects
  22. Outsourcing
      - Engineers are terrible at valuing their own time
      - "Why pay for something I can build/install/configure myself?"
      - You can pay a trusted company/individual to do things
      - Lots of little things that need doing
      - Examples follow
  23. Outsourcing: Service access list
      - List of services employees have access to
      - Revoking credentials
      - Adding new users
      - Password management
  24. Outsourcing: PCI certification
      - Paperwork / checklist
  25. Outsourcing: CDN research
      - Paperwork / checklist
  26. Outsourcing
      Is it time consuming?
  27. Outsourcing
      Is it time consuming? Boring?
  28. Outsourcing
      Is it time consuming? Boring? Measurable improvement?
  29. 2010 - 2011: And then there were 3
      - Added a new engineer at the end of 2009, and the team stayed at 3 until the start of 2011
      - With more than 1 person you have to start thinking about things properly
  30. Dealing with humans
      - As much as we'd like an API to life, managing human issues becomes important for scaling
  31. Dealing with humans: Automate as much as possible
      - You want to remove humans from as much as possible
      - Prevents mistakes, makes things easier and faster
      - Keeps a log of what happened
      - Ideally you only ever want to do something manually once
      - Even with just 1 person, setting up config management is a minimum
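The core properties the slide asks for (idempotent, logged, re-runnable) can be sketched in a few lines; full config management tools like Puppet or Chef generalise the same idea. The file path and setting here are hypothetical:

```python
import logging
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO)

def ensure_line(path: Path, line: str) -> bool:
    """Idempotently ensure `line` is present in `path`; log what happened."""
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        logging.info("unchanged: %s already has %r", path, line)
        return False
    with path.open("a") as f:
        f.write(line + "\n")
    logging.info("changed: appended %r to %s", line, path)
    return True

# Running it twice changes the file once: safe to automate and re-run
cfg = Path(tempfile.mkstemp()[1])
assert ensure_line(cfg, "maxconns = 100") is True
assert ensure_line(cfg, "maxconns = 100") is False
```

Because every run reports "changed" or "unchanged", the log doubles as the audit trail the slide mentions.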
  32. Dealing with humans: Silo'd information
      - Small team, so usually 1 person is responsible for a lot of code
      - Not reasonable to have to ask that person every time there's a problem with that bit
  33. Dealing with humans: Up to date docs
      - Every component should be fully documented
      - Consider appliance manuals, with the troubleshooting tables they have at the back
      - Table of potential failures and how to deal with them
      - Vendor contact information
      - Team contact information
      - Have someone responsible for keeping them up to date
  34. Dealing with humans: Checklists
      - Stolen from The Checklist Manifesto / the airline industry
      - Any manual steps, however trivial, should be checklisted
      - Failover, backup recovery, incident handling
  35. Dealing with humans: Force scripting
      - Takes a bit of extra time but the ROI is massive
      - Disallow direct access to things, e.g. database queries
      - Better to push a button and get a guaranteed result than risk mistakes
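One minimal way to implement "push a button, get a guaranteed result" is a wrapper that only runs named, pre-reviewed actions instead of allowing ad-hoc commands or queries. The action names and commands below are illustrative, not from the talk:

```python
import subprocess

# Pre-reviewed actions engineers may trigger; reviewed once, reused safely.
# Real entries would invoke failover/maintenance scripts, not echo.
ACTIONS = {
    "rotate-logs": ["echo", "rotating logs"],
    "failover-db": ["echo", "promoting secondary"],
}

def run_action(name: str) -> int:
    """Run a whitelisted action; refuse anything else."""
    if name not in ACTIONS:
        raise ValueError(f"unknown action {name!r}: direct commands are not allowed")
    return subprocess.run(ACTIONS[name], check=True).returncode
```

The extra indirection is the point: the dangerous part is written and reviewed once, and everyone afterwards gets the same guaranteed result.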
  36. 2012 - 2013: Growing to 12
      - 12 people, 11 of whom are technical
      - Now have the luxury of being able to spread things out
      - Proper on-call schedule
  37. Dealing with humans: On-call
      - Sharing out the responsibility
      - Determining the level of response: 24/7 real monitoring or first responder
      - 24/7 real monitoring for HA environments: real people at a screen at all times
      - First responder: people at the end of a phone
  38. Dealing with humans: On-call
      1) Ops engineer
      - During working hours our dedicated ops engineers take the first level
      - Avoids interrupting product engineers for initial firefighting
  39. Dealing with humans: On-call
      1) Ops engineer
      2) All engineers
      - Out of hours we rotate every engineer, product and ops
      - Rotation every 7 days, on a Tuesday
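A weekly rotation with a fixed hand-over day is simple enough to compute rather than maintain by hand. A sketch of the "every 7 days on a Tuesday" schedule; the roster names and epoch date are hypothetical (2013-01-01 happened to be a Tuesday):

```python
from datetime import date

ENGINEERS = ["alice", "bob", "carol", "dave"]  # hypothetical roster
EPOCH = date(2013, 1, 1)  # a Tuesday: the rotation's hand-over day

def on_call(today: date) -> str:
    """Who holds the out-of-hours pager, rotating every 7 days on a Tuesday."""
    weeks = (today - EPOCH).days // 7
    return ENGINEERS[weeks % len(ENGINEERS)]

print(on_call(date(2013, 3, 15)))
```

Deriving the schedule from a formula means the rota, the alerting config, and any status page can never disagree about who is on call this week.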
  40. Dealing with humans: On-call
      1) Ops engineer
      2) All engineers
      3) Ops engineer
      - Always have a secondary
      - This is always an ops engineer
      - The thinking is: if the issue needs to be escalated, then it's likely a bigger problem that needs additional systems expertise
  41. Dealing with humans: On-call
      1) Ops engineer
      2) All engineers
      3) Ops engineer
      4) Others
      - Next month we're launching a major new product into beta
      - Support from design / frontend engineering
      - Have to press a button to get them involved
  42. Dealing with humans: Off-call
      - Responders to an incident get the next 24 hours off-call
      - Social issues to deal with
  43. Dealing with humans: On-call CEO
      - I receive push notifications + e-mails for all outages
  44. Dealing with humans: Uptime reporting
      - Weekly internal report on G+
      - Gives visibility to the entire company about any incidents
      - Allows us to discuss incidents to get to that 100% uptime
  45. Dealing with humans: Social issues
      - How quickly can you get to a computer?
      - Are they out drinking on a Friday?
      - What happens if someone is ill?
      - What if there's a sudden emergency: an accident? a family emergency?
      - Do they have enough phone battery?
      - Can you hear the ringtone?
  46. Dealing with humans: Backup responder
      - Time out the initial responder
      - Escalate difficult problems
      - Essentially human redundancy: phone provider, geographic area, internet connectivity
  47. Dealing with outages: Expected
      - Outages are going to happen, especially at the beginning
      - Redundancy costs money
      - What matters is how you deal with them
  48. Dealing with outages: Communication, externally
      - Telling people what is happening
      - Frequently
      - Dependent on audience; we can go into more detail because our customers are techies
      - GitHub do a good job of providing incident writeups but don't give a good idea of what is happening right now
      - Generally Amazon and Heroku are good and go into more detail
  49. Dealing with outages: Communication, internally
      - Open Skype conferences between the responders
      - Usually mostly silence or the sound of the keyboard, but it simulates being in the situation room
      - Faster than typing
  50. Dealing with outages: Really test your vendors
      - Shows up flaws in vendor support processes
      - Frustrating when waiting on someone else
      - You want as much information as possible
      - Major outage? Everyone will be calling them
  51. Dealing with outages: Simulations
      - Try to avoid unnecessary problems
      - Do servers come back up from a reboot?
      - Can hot spares handle the load?
      - Test failover: databases, HA firewalls
      - Regularly reboot servers
      - Wargames can happen at a later stage: startups are usually too focused on building things first
  52. You want your own team
      - They are the ones who care the most
      - Know the most
      - Can fix things fastest
  53. Monitoring tools
      Server Density
  54. Japan!
  55. David Mytton
      @davidmytton
      david@serverdensity.com
      www.serverdensity.com
      Woop Japan!