StartOps: Growing an ops team from 1 founder

Bootstrapped startups don't have the luxury of a full team of ops engineers available to respond to issues 24/7, so how can you survive on your own? This talk will tell the story of how to run your infrastructure as a single founder through to growing that into a team of on call engineers. It will include some interesting war stories as well as tips and suggestions for how to run ops at a startup.

Presented at DevOpsDays London 2013 by David Mytton.

Published in: Technology

Transcript

  • 1. StartOps: Growing an ops team from 1 founder
    - Lots of knowledge online, but it usually assumes you have a team, lots of time and money
    - That is the goal, but it doesn’t start like that, so I’m going to talk about the stages to achieve that
    - Tips and tools to help along the way
    - Use my own company and gratuitous photos of Japan to illustrate the point
  • 2. David Mytton
    Woop Japan!
  • 3. Bootstrapping sometimes means leaving things to the last minute.
    Photo: dannychoo.com
    - First tip
    - Limited resources, people, time
  • 4. April 2009
    - Quick development
    - Experience with PHP + MySQL
    - Slicehost was cheap
    - Problems with MySQL so moved to MongoDB
  • 5. Why?
    • Replication
  • 6. Why?
    • Replication
    • Official drivers
  • 7. Why?
    • Replication
    • Official drivers
    • Easy deployment
  • 8. Why?
    • Replication
    • Official drivers
    • Easy deployment
    • Fast out of the box (sort of)
    1 = changes to WriteConcern
  • 9. david@pan ~: df -a
    Filesystem     1K-blocks      Used Available Use% Mounted on
    /dev/sda1      156882796 148489776    423964 100% /
    proc                   0         0         0    - /proc
    none                   0         0         0    - /dev/pts
    none             2097260         0   2097260   0% /dev/shm
    none                   0         0         0    - /proc/sys/fs/binfmt_misc
    david@pan ~: df -ah
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1       150G  142G  415M 100% /
    proc               0     0     0    - /proc
    none               0     0     0    - /dev/pts
    none            2.1G     0  2.1G   0% /dev/shm
    none               0     0     0    - /proc/sys/fs/binfmt_
    - Needed to upgrade a machine
    - Resize = downtime
    - Resyncing finished just in time
  • 10. MongoDB at Server Density
    • 27 nodes
  • 11. MongoDB at Server Density
    • 27 nodes
    • 17TB data per month
  • 12. MongoDB at Server Density
    Queues
    Primary data store
    Time series
  • 13. It also means trying to find the quickest way.
    david@asriel ~: scp david@stelmaria:~/local/local.11 .
    local.11                100% 2047MB   6.8MB/s   05:01
    - Needed to resync a database server across the US
    - Would take too long; oplog not large enough
    - Fast internal network but slow internet
  • 14. 1d, 1h, 58m
    11.22MB/s
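The figures on slides 13 and 14 can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming the sizes are in MB, the rates are averages in MB/s, and slide 14 shows total elapsed time at that rate; the function names are illustrative and do not come from the talk.

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to move size_mb at a steady rate_mb_per_s."""
    return size_mb / rate_mb_per_s


def format_duration(seconds: float) -> str:
    """Render a duration as days, hours, minutes and seconds."""
    total = round(seconds)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


def data_moved_mb(days: int, hours: int, minutes: int, rate_mb_per_s: float) -> float:
    """MB moved over the given elapsed time at a steady rate (assumed average)."""
    elapsed_s = days * 86400 + hours * 3600 + minutes * 60
    return elapsed_s * rate_mb_per_s


# Slide 13: one 2047MB file at 6.8MB/s
print(format_duration(transfer_seconds(2047, 6.8)))  # 0d 0h 5m 1s

# Slide 14: 1d 1h 58m at 11.22MB/s, expressed in TB (1 TB taken as 1024 * 1024 MB)
print(round(data_moved_mb(1, 1, 58, 11.22) / (1024 * 1024), 2))
```

The first result matches the 05:01 that scp reports. Under the stated assumptions, the second suggests that roughly 1 TB was moved in total.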
  • 15. Hacking traveling
    • Roaming is expensive
    - Wifi hotspot
    - Prepaid SIM
    - Euro data cap
  • 16. Hacking traveling
    • Starbucks free wifi + power
  • 17. Hacking traveling
    • Travel light
    - Buying things locally
  • 18. Hacking traveling
    • Don’t update
    - Like no deploy Friday
    - Server updates
    - Local OS updates
  • 19. Let other people help
    - Summer 2009 moved to several managed servers with Rackspace.
  • 20. Let other people help
    • Managed hosts
    - Rackspace managed hosting
    - Softlayer charge $1/ticket
  • 21. Let other people help
    • Managed hosts
    • Support contracts
    - Depending on the level of support you buy
    - Expensive
    - There are ways to work around that, e.g. getting involved with projects
  • 22. Outsourcing
    - Engineers terrible at valuing their own time
    - “Why pay for something I can build/install/configure myself?”
    - Can pay a trusted company/individual to do things
    - Lots of little things that need doing
    - Examples
  • 23. Outsourcing
    Service access list
    - List of services employees have access to
    - Revoking credentials
    - Adding new users
    - Password management
  • 24. Outsourcing
    PCI certification
    - Paperwork / checklist
  • 25. Outsourcing
    CDN research
    - Paperwork / checklist
  • 26. Outsourcing
    Is it time consuming?
  • 27. Outsourcing
    Is it time consuming?
    Boring?
  • 28. Outsourcing
    Is it time consuming?
    Boring?
    Measurable improvement?
  • 29. 2010 - 2011
    And then there were 3
    - Added a new engineer at the end of 2009 and the team stayed at 3 until the start of 2011.
    - With more than 1 you have to start thinking properly
  • 30. Dealing with humans
    - As much as we’d like an API to life, managing human issues becomes important for scaling
  • 31. Dealing with humans
    Automate as much as possible
    - You want to remove humans from as much as possible
    - Prevents mistakes, makes things easier and faster
    - Keeps a log of what has happened
    - Ideally you only ever want to do something manually once
    - Even with just 1 person, setting up config management is a minimum
  • 32. Dealing with humans
    Silo’d information
    - Small team so usually 1 person responsible for a lot of code
    - Not reasonable to have to ask that person every time there’s a problem with that bit
  • 33. Dealing with humans
    Up to date docs
    - Every component should be fully documented
    - Consider appliance manuals with the troubleshooting tables they have at the back
    - Table of potential failures and how to deal with them
    - Vendor contact information
    - Team contact information
    - Have someone responsible for keeping them up to date
  • 34. Dealing with humans
    Checklists
    - Stolen from the Checklist Manifesto / airline industry
    - Any manual steps, however trivial, should be checklisted
    - Failover, backup recovery, incident handling
  • 35. Dealing with humans
    Force scripting
    - Takes a bit of extra time but the ROI is massive
    - Disallow direct access to things e.g. database queries
    - Better to push a button and get a guaranteed result than risk mistakes
  • 36. 2012 - 2013
    Growing to 12
    - 12, 11 of which are technical
    - Now have the luxury of being able to spread things out
    - Proper on call schedule
  • 37. Dealing with humans
    On-call
    - Sharing out the responsibility
    - Determining level of response: 24/7 real monitoring or first responder
    - 24/7 real monitoring for HA environments, real people at a screen at all times
    - First responder: people at the end of a phone
  • 38. Dealing with humans
    On-call
    1) Ops engineer
    - During working hours our dedicated ops engineers take the first level
    - Avoids interrupting product engineers for initial fire fighting
  • 39. Dealing with humans
    On-call
    1) Ops engineer
    2) All engineers
    - Out of hours we rotate every engineer, product and ops
    - Rotation every 7 days on a Tuesday
  • 40. Dealing with humans
    On-call
    1) Ops engineer
    2) All engineers
    3) Ops engineer
    - Always have a secondary
    - This is always an ops engineer
    - Thinking is if the issue needs to be escalated then it’s likely a bigger problem that needs additional systems expertise
  • 41. Dealing with humans
    On-call
    1) Ops engineer
    2) All engineers
    3) Ops engineer
    4) Others
    - Next month we’re launching a major new product into beta
    - Support from design / frontend engineering
    - Have to press a button to get them involved
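The out-of-hours rotation from slide 39 (every engineer takes a turn, changing over every 7 days on a Tuesday) can be sketched as a small date calculation. This is an illustration only: the engineer names, the starting Tuesday and the function names are hypothetical, and the talk does not say how Server Density actually schedules the rota.

```python
from datetime import date, timedelta


def rotation_start(day: date) -> date:
    """Most recent Tuesday on or before `day` (Monday is weekday 0, Tuesday is 1)."""
    return day - timedelta(days=(day.weekday() - 1) % 7)


def on_call(engineers: list[str], first_tuesday: date, day: date) -> str:
    """Engineer on call on `day`, rotating through the list once per week."""
    weeks_elapsed = (rotation_start(day) - first_tuesday).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Hypothetical rota starting on Tuesday 1 January 2013
rota = ["alice", "bob", "carol"]
start = date(2013, 1, 1)

print(on_call(rota, start, date(2013, 1, 7)))   # Monday, still the first week: alice
print(on_call(rota, start, date(2013, 1, 8)))   # Tuesday, handover: bob
print(on_call(rota, start, date(2013, 1, 22)))  # three weeks in, wraps around: alice
```

Deriving the rota from a fixed start date means anyone can work out who is on call for a given week without consulting a shared calendar, although swaps for holidays or illness would still need handling separately.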
  • 42. Dealing with humans
    Off-call
    - Responders to an incident get next 24 hours off-call
    - Social issues to deal with
  • 43. Dealing with humans
    On-call
    CEO
    - I receive push notifications + e-mails for all outages
  • 44. Dealing with humans
    Uptime reporting
    - Weekly internal report on G+
    - Gives visibility to entire company about any incidents
    - Allows us to discuss incidents to get to that 100% uptime
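For a weekly uptime report, the headline number is simple arithmetic over the outage durations in that week. A minimal sketch, assuming outages are recorded in whole minutes and do not overlap; the function name and the sample figures are made up for illustration and are not taken from the talk.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def uptime_percent(outage_minutes: list[int], period_minutes: int = MINUTES_PER_WEEK) -> float:
    """Percentage of the period with no outage, given non-overlapping outage durations."""
    downtime = sum(outage_minutes)
    if downtime > period_minutes:
        raise ValueError("outages exceed the reporting period")
    return 100.0 * (period_minutes - downtime) / period_minutes


# Hypothetical week with two incidents of 4 and 6 minutes
print(f"{uptime_percent([4, 6]):.3f}%")  # 99.901%
```

Ten minutes of downtime in a week already drops the figure to about 99.9%, which shows how little room there is when aiming for 100%.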
  • 45. Dealing with humans
    Social issues
    - How quickly can you get to a computer?
    - Are they out drinking on a Friday?
    - What happens if someone is ill?
    - What if there’s a sudden emergency: accident? family emergency?
    - Do they have enough phone battery?
    - Can you hear the ringtone?
  • 46. Dealing with humans
    Backup responder
    - Backup responder
    - Time out the initial responder
    - Escalate difficult problems
    - Essentially human redundancy: phone provider, geographic area, internet connectivity
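Timing out the initial responder and escalating to a backup amounts to a small escalation policy. The sketch below is one way to express it, assuming nobody acknowledges the alert; the tier names and the 15 minute timeouts are hypothetical and are not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    ack_timeout_min: int  # minutes to wait for an acknowledgement before escalating


def paged_so_far(tiers: list[Tier], minutes_since_alert: int) -> list[str]:
    """Names of every tier paged by now, assuming no one has acknowledged yet."""
    paged = []
    deadline = 0  # minute at which the current tier gets paged
    for tier in tiers:
        if minutes_since_alert >= deadline:
            paged.append(tier.name)
        deadline += tier.ack_timeout_min
    return paged


# Hypothetical policy: primary, then the secondary ops engineer, then others
policy = [Tier("primary", 15), Tier("secondary ops", 15), Tier("others", 0)]

print(paged_so_far(policy, 0))   # ['primary']
print(paged_so_far(policy, 15))  # ['primary', 'secondary ops']
print(paged_so_far(policy, 30))  # ['primary', 'secondary ops', 'others']
```

Keeping the policy as data makes it easy to review the timeouts alongside the social issues listed on slide 45, such as how quickly someone can realistically get to a computer.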
  • 47. Dealing with outages
    Expected
    - Outages are going to happen, especially at the beginning
    - Costs money for redundancy
    - How you deal with them
  • 48. Dealing with outages
    Communication
    Externally
    - Telling people what is happening
    - Frequently
    - Dependent on audience - we can go into more detail because our customers are techies
    - GitHub do a good job of providing incident writeups but don’t provide a good idea of what is happening right now
    - Generally Amazon and Heroku are good and go into more detail
  • 49. Dealing with outages
    Communication
    Internally
    - Open Skype conferences between the responders
    - Usually mostly silence or the sound of the keyboard, but simulates being in the situation room
    - Faster than typing
  • 50. Dealing with outages
    Really test your vendors
    - Shows up flaws in vendor support processes
    - Frustrating when waiting on someone else
    - You want as much information as possible
    - Major outage? Everyone will be calling them
  • 51. Dealing with outages
    Simulations
    - Try and avoid unnecessary problems
    - Do servers come back up from boot?
    - Can hot spares handle the load?
    - Test failover: databases, HA firewalls
    - Regularly reboot servers
    - Wargames can happen at another stage: startups are usually too focused on building things first
  • 52. You want your own team
    - The only ones who care the most
    - Know the most
    - Can fix things fastest
  • 53. Monitoring tools
    Server Density
  • 54. www.serverdensity.com/dd
    Woop Japan!
  • 55. David Mytton
    @davidmytton
    david@serverdensity.com
    www.serverdensity.com
    Woop Japan!
