StartOps: Growing an ops team from 1 founder




- There is a lot of knowledge online but it usually assumes you have a team and lots of time and money
- That is the goal but it doesn’t start like that, so I’m going to talk about the stages to get there
- Tips and tools to help along the way
- I’ll use my own company and gratuitous photos of Japan to illustrate the points
David Mytton




Woop Japan!
Bootstrapping sometimes means leaving things to the last minute.




Photo: dannychoo.com
- First tip
- Limited resources, people, time
April 2009




-   Quick development
-   Experience with PHP + MySQL
-   Slicehost was cheap
-   Problems with MySQL so moved to MongoDB
Why?


• Replication
• Official drivers
• Easy deployment
• Fast out of the box         (sort of)

1 = changes to WriteConcern
david@pan ~: df -a
Filesystem                 1K-blocks      Used Available Use% Mounted on
/dev/sda1                  156882796 148489776    423964 100% /
proc                               0         0         0   - /proc
none                               0         0         0   - /dev/pts
none                         2097260         0   2097260   0% /dev/shm
none                               0         0         0   - /proc/sys/fs/binfmt_misc

david@pan ~: df -ah
Filesystem                  Size   Used Avail Use% Mounted on
/dev/sda1                   150G   142G 415M 100% /
proc                           0      0     0   - /proc
none                           0      0     0   - /dev/pts
none                        2.1G      0 2.1G    0% /dev/shm
none                           0      0     0   - /proc/sys/fs/binfmt_




- Needed to upgrade a machine
- Resize = downtime
- Resyncing finished just in time
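The `df` output above shows the root filesystem effectively full. A minimal sketch of a check that could flag this earlier, using the standard library's `shutil.disk_usage`. The 90% threshold and the function names are illustrative assumptions, not something from the talk.

```python
import shutil


def percent_used(used: int, free: int) -> float:
    """Used space as a percentage of used + free (roughly what df reports as Use%)."""
    total = used + free
    return 100.0 * used / total if total else 0.0


def disk_nearly_full(path: str = "/", threshold: float = 90.0) -> bool:
    """True when the filesystem holding `path` is at or above `threshold` percent used.

    The 90% default is an assumed value for illustration only.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return percent_used(usage.used, usage.free) >= threshold
```

Feeding in the Used and Available columns for /dev/sda1 from the slide gives about 99.7%, which df displays as 100%.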
MongoDB at Server Density


• 27 nodes
• 17TB data per month
MongoDB at Server Density


• Queues
• Primary data store
• Time series
It also means trying to find the quickest way.



          david@asriel ~: scp david@stelmaria:~/local/local.11 .
          local.11                 100% 2047MB   6.8MB/s   05:01




- Needed to resync a database server across the US
- Would take too long; oplog not large enough
- Fast internal network but slow internet
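The arithmetic behind "too long" can be sketched in a few lines. The file size and rate come from the `scp` line on the slide; the idea of comparing the transfer time against an oplog window, and the function names, are assumptions for illustration.

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time to copy `size_mb` megabytes at a sustained `rate_mb_per_s`."""
    return size_mb / rate_mb_per_s


def resync_fits_oplog(size_mb: float, rate_mb_per_s: float, oplog_window_s: float) -> bool:
    """True if the copy would finish before the oplog window (assumed known) runs out."""
    return transfer_seconds(size_mb, rate_mb_per_s) < oplog_window_s


# One 2047MB file at 6.8MB/s, as in the scp output above
seconds = transfer_seconds(2047, 6.8)
print(f"{int(seconds // 60):02d}:{int(seconds % 60):02d}")
```

If the estimated copy time exceeds the window the oplog covers, the new member falls too far behind to catch up, which is why a faster route for the data is needed.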
1d, 1h, 58m

11.22MB/s
Hacking traveling



• Roaming is expensive




- Wifi hotspot
- Prepaid SIM
- Euro data cap
Hacking traveling




•Starbucks free wifi + power
Hacking traveling



• Travel light




- Buying things locally
Hacking traveling



• Don’t update




- Like no deploy Friday
- Server updates
- Local OS updates
Let other people help




- Summer 2009 moved to several managed servers with Rackspace.
Let other people help

• Managed hosts




- Rackspace managed hosting
- Softlayer charge $1/ticket
Let other people help

• Managed hosts
• Support contracts


- Depending on the level of support you buy
- Expensive
- There are ways to work around that, such as getting involved with projects
Outsourcing




-   Engineers are terrible at valuing their own time
-   “Why pay for something I can build/install/configure myself?”
-   Can pay a trusted company/individual to do things
-   Lots of little things that need doing
-   Examples
Outsourcing




Service access list




-   List of services employees have access to
-   Revoking credentials
-   Adding new users
-   Password management
Outsourcing




PCI certification




- Paperwork / checklist
Outsourcing

CDN research




- Paperwork / checklist
Outsourcing


Is it time consuming?

Boring?

Measurable improvement?
2010 - 2011




And then there were 3




- Added a new engineer at the end of 2009 and the team stayed at 3 until the start of 2011.
- With more than 1 person you have to start thinking properly
Dealing with humans




- As much as we’d like an API to life, managing human issues becomes important for scaling
Dealing with humans


Automate as much as possible




-   You want to remove humans from as much as possible
-   Prevents mistakes, makes things easier and faster
-   Keeps a log of what happened
-   Ideally you only ever want to do something manually once
-   Even with just 1 person, setting up config management is a minimum
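As a rough illustration of what config management buys you, here is a minimal sketch of an idempotent step that logs what it did. It is not any particular tool's API; `ensure_line` and the logger name are hypothetical.

```python
import logging
from pathlib import Path

log = logging.getLogger("automation")  # hypothetical logger name


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in `path`. Returns True only if the file changed."""
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        log.info("unchanged: %s already contains %r", path, line)
        return False
    # Start on a fresh line if the existing file lacks a trailing newline
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{separator}{line}\n")
    log.info("changed: appended %r to %s", line, path)
    return True
```

Running it twice is safe: the second run changes nothing and the log records both runs.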
Dealing with humans


Silo’d information




- Small team so usually 1 person responsible for a lot of code
- Not reasonable to have to ask that person every time there’s a problem with that bit
Dealing with humans


Up to date docs




-   Every component should be fully documented
-   Consider appliance manuals with the troubleshooting tables they have at the back
-   Table of potential failures and how to deal with them
-   Vendor contact information
-   Team contact information
-   Have someone responsible for keeping them up to date
Dealing with humans


Checklists




- Stolen from the Checklist Manifesto / airline industry
- Any manual steps, however trivial, should be checklisted
- Failover, backup recovery, incident handling
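A checklist can be as simple as an ordered list that refuses to skip steps. This is a minimal sketch; the `Checklist` class and the failover steps shown are hypothetical examples, not the ones used at Server Density.

```python
from dataclasses import dataclass


@dataclass
class Checklist:
    name: str
    steps: list
    completed: int = 0  # number of steps ticked off so far

    def complete(self, index: int) -> None:
        """Tick off step `index`; steps must be done in order, none skipped."""
        if index >= len(self.steps):
            raise IndexError(f"no step {index} in {self.name}")
        if index != self.completed:
            raise ValueError(f"expected step {self.completed}, got {index}")
        self.completed += 1

    def remaining(self) -> list:
        """Steps still to be done, in order."""
        return self.steps[self.completed:]

    def finished(self) -> bool:
        return self.completed == len(self.steps)


# Hypothetical failover checklist
failover = Checklist("db failover", [
    "Confirm primary is unreachable",
    "Promote secondary",
    "Update status page",
])
failover.complete(0)
print(failover.remaining())
```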
Dealing with humans


Force scripting




- Takes a bit of extra time but the ROI is massive
- Disallow direct access to things e.g. database queries
- Better to push a button and get a guaranteed result than risk mistakes
2012 - 2013




Growing to 12




- 12 people, 11 of whom are technical
- Now have the luxury of being able to spread things out
- Proper on call schedule
Dealing with humans


On-call




-   Sharing out the responsibility
-   Determining level of response: 24/7 real monitoring or first responder
-   24/7 real monitoring for HA environments, real people at a screen at all times
-   First responder: people at the end of a phone
Dealing with humans


On-call                                     1) Ops engineer




- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
Dealing with humans


On-call                                     1) Ops engineer
                                            2) All engineers




- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days on a Tuesday
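A weekly rotation that changes over on Tuesdays can be computed from the date alone. A minimal sketch, assuming a fixed ordered list of engineers and a known date in the first rotation week; both, and the function names, are hypothetical.

```python
from datetime import date, timedelta


def rotation_start(day: date) -> date:
    """Most recent Tuesday on or before `day` (Monday is 0, Tuesday is 1)."""
    return day - timedelta(days=(day.weekday() - 1) % 7)


def on_call(day: date, engineers: list, first_rotation: date) -> str:
    """Engineer on call on `day`, rotating through `engineers` every 7 days."""
    weeks = (rotation_start(day) - rotation_start(first_rotation)).days // 7
    return engineers[weeks % len(engineers)]
```

For example, with a first rotation starting on Tuesday 1 October 2013, the first engineer in the list stays on call through Monday 7 October and the second takes over on Tuesday 8 October.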
Dealing with humans


On-call                                       1) Ops engineer
                                              2) All engineers
                                              3) Ops engineer


- Always have a secondary
- This is always an ops engineer
- The thinking is that if the issue needs to be escalated then it’s likely a bigger problem that needs additional systems expertise
Dealing with humans


On-call                                    1) Ops engineer
                                           2) All engineers
                                           3) Ops engineer
                                           4) Others
- Next month we’re launching a major new product into beta
- Support from design / frontend engineering
- Have to press a button to get them involved
Dealing with humans


Off-call




- Responders to an incident get next 24 hours off-call
- Social issues to deal with
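The 24 hour off-call rule is easy to enforce in whatever picks the next responder. A minimal sketch, assuming a record of when each engineer last responded to an incident; the names and data structure are hypothetical.

```python
from datetime import datetime, timedelta

OFF_CALL = timedelta(hours=24)  # rest period after responding, from the slide


def eligible_responders(engineers: list, last_response: dict, now: datetime) -> list:
    """Engineers who have not responded to an incident in the last 24 hours."""
    return [
        name for name in engineers
        if name not in last_response or now - last_response[name] >= OFF_CALL
    ]
```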
Dealing with humans


On-call CEO




- I receive push notifications + e-mails for all outages
Dealing with humans

Uptime reporting




- Weekly internal report on G+
- Gives visibility to entire company about any incidents
- Allows us to discuss incidents to get to that 100% uptime
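The headline number in a weekly report is simple arithmetic. A minimal sketch; the function name and the sample outage durations are made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def uptime_percent(period_minutes: float, outage_minutes: list) -> float:
    """Percentage of the period with no outage, given each outage's length in minutes."""
    downtime = sum(outage_minutes)
    if downtime > period_minutes:
        raise ValueError("outages are longer than the reporting period")
    return 100.0 * (period_minutes - downtime) / period_minutes


# Hypothetical week with a 30 minute and a 12 minute outage
print(round(uptime_percent(MINUTES_PER_WEEK, [30, 12]), 2))
```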
Dealing with humans


Social issues




-   How quickly can they get to a computer?
-   Are they out drinking on a Friday?
-   What happens if someone is ill?
-   What if there’s a sudden emergency: accident? family emergency?
-   Do they have enough phone battery?
-   Can they hear the ringtone?
Dealing with humans


Backup responder




-   Backup responder
-   Time out the initial responder
-   Escalate difficult problems
-   Essentially human redundancy: phone provider, geographic area, internet connectivity
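Timing out the initial responder amounts to walking down an escalation chain as time passes without an acknowledgement. A minimal sketch; the 15 minute timeout and the function name are assumptions, not values from the talk.

```python
def current_responder(chain: list, minutes_since_alert: float, timeout_minutes: float = 15) -> str:
    """Who should be paged now, given an unacknowledged alert and an escalation chain.

    Each level gets `timeout_minutes` (assumed value) before the alert moves on;
    the last person in the chain keeps it once everyone above has timed out.
    """
    if not chain:
        raise ValueError("escalation chain is empty")
    level = min(int(minutes_since_alert // timeout_minutes), len(chain) - 1)
    return chain[level]
```

With a chain of `["primary", "backup"]`, the primary holds the alert for the first 15 minutes and the backup responder is paged after that.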
Dealing with outages


Expected




- Outages are going to happen, especially at the beginning
- Costs money for redundancy
- What matters is how you deal with them
Dealing with outages
Communication



                                                               Externally



- Telling people what is happening
- Frequently
- Dependent on audience - we can go into more detail because our customers are techies
- GitHub do a good job of providing incident writeups but don’t provide a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail
Dealing with outages
Communication



                                                                Internally



- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but it simulates being in the situation room
- Faster than typing
Dealing with outages


Really test your vendors




-   Shows up flaws in vendor support processes
-   Frustrating when waiting on someone else
-   You want as much information as possible
-   Major outage? Everyone will be calling them
Dealing with outages


Simulations




- Try and avoid unnecessary problems
- Do servers come back up from boot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at another stage: startups are usually too focused on building things first
You want your own team




- The ones who care the most
- Know the most
- Can fix things fastest
Monitoring tools

Server Density
www.serverdensity.com/dd



Woop Japan!
David Mytton

 @davidmytton

david@serverdensity.com

www.serverdensity.com

Woop Japan!

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 

Recently uploaded (20)

Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 

StartOps: Growing an ops team from 1 founder

  • 1. StartOps: Growing an ops team from 1 founder - Lot of knowledge online but it usually assumes you have a team, lots of time and money - That is the goal but it doesn’t start like that so I’m going to talk about the stages to achieve that - Tips and tools to help along the way - Use my own company and gratuitous photos of Japan to illustrate the point
  • 5. Bootstrapping sometimes means leaving things to the last minute. Photo: dannychoo.com - First tip - Limited resources, people, time
  • 6. April 2009 - Quick development - Experience with PHP + MySQL - Slicehost was cheap - Problems with MySQL so moved to MongoDB
  • 10. Why? • Replication • Official drivers • Easy deployment • Fast out of the box (sort of) 1 = changes to WriteConcern
  • 11. david@pan ~: df -a
        Filesystem   1K-blocks       Used  Available  Use%  Mounted on
        /dev/sda1    156882796  148489776     423964  100%  /
        proc                 0          0          0     -  /proc
        none                 0          0          0     -  /dev/pts
        none           2097260          0    2097260    0%  /dev/shm
        none                 0          0          0     -  /proc/sys/fs/binfmt_misc
        david@pan ~: df -ah
        Filesystem   Size  Used  Avail  Use%  Mounted on
        /dev/sda1    150G  142G   415M  100%  /
        proc            0     0      0     -  /proc
        none            0     0      0     -  /dev/pts
        none         2.1G     0   2.1G    0%  /dev/shm
        none            0     0      0     -  /proc/sys/fs/binfmt_
        - Needed to upgrade a machine
        - Resize = downtime
        - Resyncing finished just in time
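The full disk on the slide above is the kind of thing a small scheduled check can catch early. A minimal sketch in Python, assuming the percentage is computed as used / (used + available) and rounded up, which appears to match the 100% that `df` reports for `/dev/sda1`. The function names and the 90% threshold are illustrative, not from the talk.

```python
import math
import shutil


def use_percent(used: int, available: int) -> int:
    """Share of usable space consumed, rounded up to a whole percent."""
    return math.ceil(100 * used / (used + available))


def nearly_full(path: str = "/", threshold: int = 90) -> bool:
    """True when the filesystem holding `path` is at or above `threshold` percent."""
    usage = shutil.disk_usage(path)
    return use_percent(usage.used, usage.free) >= threshold


if __name__ == "__main__":
    # Figures for /dev/sda1 taken from the df -a output on the slide (1K blocks).
    print(use_percent(148489776, 423964))
    # The 90% threshold is an assumption; pick whatever leaves time to react.
    print(nearly_full("/", threshold=90))
```

Run from cron or a monitoring agent, a check like this leaves room to schedule the resize rather than racing the disk.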
  • 13. MongoDB at Server Density •27 nodes •17TB data per month
  • 14. MongoDB at Server Density Queues Primary data store Time series
  • 15. It also means trying to find the quickest way. david@asriel ~: scp david@stelmaria:~/local/local.11 . local.11 100% 2047MB 6.8MB/s 05:01 - Needed to resync a database server across the US - A normal resync would take too long; oplog not large enough - Fast internal network but slow internet
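The scp line above gives enough to do the arithmetic behind that decision: 2047MB at 6.8MB/s is about five minutes per file. A rough sketch of the same estimate, plus the comparison against the oplog window that decides whether a resync can catch up. The total data size and oplog window below are hypothetical figures for illustration, not numbers from the talk.

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time to copy `size_mb` megabytes at a sustained `rate_mb_per_s`."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_s


def as_mm_ss(seconds: float) -> str:
    """Format a duration the way scp prints it."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def copy_fits_oplog(copy_seconds: float, oplog_window_seconds: float) -> bool:
    """A resync can only catch up if the copy finishes inside the oplog window."""
    return copy_seconds < oplog_window_seconds


if __name__ == "__main__":
    # One data file, using the figures from the scp output on the slide.
    print(as_mm_ss(transfer_seconds(2047, 6.8)))  # 05:01, as scp reported

    # Hypothetical: 500 such files and a 24-hour oplog window.
    total = transfer_seconds(2047 * 500, 6.8)
    print(copy_fits_oplog(total, 24 * 3600))
```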
  • 17. Hacking traveling • Roaming is expensive - Wifi hotspot - Prepaid SIM - Euro data cap
  • 19. Hacking traveling • Travel light - Buying things locally
  • 20. Hacking traveling • Don’t update - Like no deploy Friday - Server updates - Local OS updates
  • 21. Let other people help - Summer 2009 moved to several managed servers with Rackspace.
  • 22. Let other people help • Managed hosts - Rackspace managed hosting - Softlayer charge $1/ticket
  • 23. Let other people help • Managed hosts • Support contracts - Depending on the level of support you buy - Expensive - There are ways to work around that; getting involved with projects
  • 24. Outsourcing - Engineers are terrible at valuing their own time - “Why pay for something I can build/install/configure myself?” - Can pay a trusted company/individual to do things - Lots of little things that need doing - Examples
  • 25. Outsourcing Service access list - List of services employees have access to - Revoking credentials - Adding new users - Password management
  • 30. Outsourcing Is it time consuming? Boring? Measurable improvement?
  • 31. 2010 - 2011 And then there were 3 - Added a new engineer at the end of 2009 and the team stayed at 3 until the start of 2011. - With more than 1 person you have to start thinking properly
  • 32. Dealing with humans - As much as we’d like an API to life, managing human issues becomes important for scaling
  • 33. Dealing with humans Automate as much as possible - You want to remove humans from as much as possible - Prevents mistakes, makes things easier and faster - Keeps a log of what has happened - Ideally you only ever want to do something manually once - Even with just 1 person, setting up config management is a minimum
  • 34. Dealing with humans Silo’d information - Small team so usually 1 person responsible for a lot of code - Not reasonable to have to ask that person every time there’s a problem with that bit
  • 35. Dealing with humans Up to date docs - Every component should be fully documented - Consider appliance manuals with the troubleshooting tables they have at the back - Table of potential failures and how to deal with them - Vendor contact information - Team contact information - Have someone responsible for keeping them up to date
  • 36. Dealing with humans Checklists - Stolen from the Checklist Manifesto / airline industry - Any manual steps, however trivial, should be checklisted - Failover, backup recovery, incident handling
  • 37. Dealing with humans Force scripting - Takes a bit of extra time but the ROI is massive - Disallow direct access to things e.g. database queries - Better to push a button and get a guaranteed result than risk mistakes
  • 38. 2012 - 2013 Growing to 12 - 12, 11 of which are technical - Now have the luxury of being able to spread things out - Proper on call schedule
  • 39. Dealing with humans On-call - Sharing out the responsibility - Determining level of response: 24/7 real monitoring or first responder - 24/7 real monitoring for HA environments, real people at a screen at all times - First responder: people at the end of a phone
  • 40. Dealing with humans On-call 1) Ops engineer - During working hours our dedicated ops engineers take the first level - Avoids interrupting product engineers for initial fire fighting
  • 41. Dealing with humans On-call 1) Ops engineer 2) All engineers - Out of hours we rotate every engineer, product and ops - Rotation every 7 days on a Tuesday
  • 42. Dealing with humans On-call 1) Ops engineer 2) All engineers 3) Ops engineer - Always have a secondary - This is always an ops engineer - Thinking is if the issue needs to be escalated then it’s likely a bigger problem that needs additional systems expertise
  • 43. Dealing with humans On-call 1) Ops engineer 2) All engineers 3) Ops engineer 4) Others - Next month we’re launching a major new product into beta - Support from design / frontend engineering - Have to press a button to get them involved
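The out-of-hours rotation described above (every engineer in turn, handing over every 7 days on a Tuesday) can be computed rather than kept in a spreadsheet. A minimal sketch, assuming a fixed ordered list of engineers and an anchor Tuesday from which weeks are counted. The names and the anchor date are hypothetical.

```python
from datetime import date
from typing import Sequence

TUESDAY = 1  # date.weekday(): Monday is 0


def on_call(engineers: Sequence[str], day: date, anchor: date) -> str:
    """Engineer holding the out-of-hours shift on `day`.

    `anchor` is any Tuesday on which engineers[0] started a shift.
    """
    if anchor.weekday() != TUESDAY:
        raise ValueError("anchor must be a Tuesday")
    weeks_elapsed = (day - anchor).days // 7
    return engineers[weeks_elapsed % len(engineers)]


if __name__ == "__main__":
    team = ["alice", "bob", "carol"]  # hypothetical names
    start = date(2013, 1, 1)          # a Tuesday, chosen arbitrarily
    for d in (date(2013, 1, 7), date(2013, 1, 8), date(2013, 1, 22)):
        print(d.isoformat(), on_call(team, d, start))
```

The secondary tier in the talk is always an ops engineer, so it would need its own list passed to the same function.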
  • 44. Dealing with humans Off-call - Responders to an incident get next 24 hours off-call - Social issues to deal with
  • 45. Dealing with humans On-call CEO - I receive push notifications + e-mails for all outages
  • 46. Dealing with humans Uptime reporting - Weekly internal report on G+ - Gives visibility to entire company about any incidents - Allows us to discuss incidents to get to that 100% uptime
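The headline number for a weekly report like this is simple arithmetic: downtime as a share of the minutes in a week. A small sketch follows. The incident durations are invented for illustration and are not Server Density figures.

```python
WEEK_MINUTES = 7 * 24 * 60  # 10080


def uptime_percent(downtime_minutes: float,
                   period_minutes: float = WEEK_MINUTES) -> float:
    """Percentage of the period during which the service was up."""
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    incidents = [4.0, 11.5]  # hypothetical outage lengths in minutes
    print(f"{uptime_percent(sum(incidents)):.3f}%")
```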
  • 47. Dealing with humans Social issues - How quickly can you get to a computer? - Are they out drinking on a Friday? - What happens if someone is ill? - What if there’s a sudden emergency: accident? family emergency? - Do they have enough phone battery? - Can you hear the ringtone?
  • 48. Dealing with humans Backup responder - Backup responder - Time out the initial responder - Escalate difficult problems - Essentially human redundancy: phone provider, geographic area, internet connectivity
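Timing out the initial responder, as described above, amounts to a rule: if an alert is still unacknowledged after some interval, page the next person in the chain. A sketch of that rule, assuming a 15-minute timeout per level, which is an assumption rather than a figure from the talk. Paging tools usually provide this as configuration, so this only illustrates the logic.

```python
from typing import Optional, Sequence


def responder_to_page(minutes_since_alert: float,
                      acknowledged: bool,
                      chain: Sequence[str],
                      timeout_minutes: float = 15) -> Optional[str]:
    """Who should be paged now, or None once someone has acknowledged.

    Each level in `chain` gets `timeout_minutes` before the alert moves on;
    the last level keeps being paged until the alert is acknowledged.
    """
    if acknowledged:
        return None
    level = int(minutes_since_alert // timeout_minutes)
    return chain[min(level, len(chain) - 1)]


if __name__ == "__main__":
    chain = ["primary", "backup"]  # hypothetical role names
    for minutes in (0, 14, 15, 120):
        print(minutes, responder_to_page(minutes, False, chain))
```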
  • 49. Dealing with outages Expected - Outages are going to happen, especially at the beginning - Costs money for redundancy - How you deal with them
  • 50. Dealing with outages Communication Externally - Telling people what is happening - Frequently - Dependent on audience - we can go into more detail because our customers are techies - GitHub do a good job of providing incident writeups but don’t provide a good idea of what is happening right now - Generally Amazon and Heroku are good and go into more detail
  • 51. Dealing with outages Communication Internally - Open Skype conferences between the responders - Usually mostly silence or the sound of the keyboard, but simulates being in the situation room - Faster than typing
  • 52. Dealing with outages Really test your vendors - Shows up flaws in vendor support processes - Frustrating when waiting on someone else - You want as much information as possible - Major outage? Everyone will be calling them
  • 53. Dealing with outages Simulations - Try and avoid unnecessary problems - Do servers come back up from boot? - Can hot spares handle the load? - Test failover: databases, HA firewalls - Regularly reboot servers - Wargames can happen at another stage: startups are usually too focused on building things first
  • 54. You want your own team - The ones who care the most - Know the most - Can fix things fastest