Lessons learned trying to implement
DevOps in a rapidly growing environment
Local Community
Sustainable!
Thank you!
Lessons learned trying to implement
DevOps in a rapidly growing environment
• Lament of a Failed DevOps Manager
• Origin of this talk
• Excuses
Introductions
Michael Collins
Principal Systems Architect
http://www.demonware.net/
@ook
Demonware
• Online services for Console Games
• SaaS APIs
• Cross platform SDKs
• Middleware
• Consultancy & Design
• Part of Activision Blizzard
Demonware
• 435+ million gamers
• 3.2 million+ concurrent online gamers
• 95+ games
• 300,000+ requests per second at peak
• Avg. query response time of < 0.01 second
• Collect 500,000+ metrics a minute
• 100 billion+ API calls per month
Lessons learned trying to implement
DevOps in a rapidly growing environment
• What is rapidly growing?
  • 50-100% annual growth
  • People, Scale & Complexity
• How Applicable are our lessons?
  • This talk == Not technical
  • For DW Tech talks see:
    • Erlang and First-Person Shooters in online games - Malcolm Dowse - Erlang Factory London 2011
    • PyCon.ie 2011 Keynote - Damien Marshall
    • Puppet at Demonware - Ruaidhrí Power - PuppetConf ’12
[Chart: growth by year, 2007-2014; y-axis 0-3500; legend labels truncated in the source: "New Serv", "Retired Se", "Reused se", "Servers Cu", "Ops Staff"]
A brief history of "DevOps"
at Demonware
• Early years (2003 - 2007)
  • Focused on P2P, handful of Services, minimal data persistence, 10s of servers, random hardware, Golden Images, Shell Scripts
  • root for (almost) everybody!
  • “NoOps”
• Early 2007 - Removed root access for developers
• April 2008 - Started to dabble with Puppet
• September 2008 - Automated installs (preseed), Standard Hardware, Puppet based installs for production base
• Spring 2009 - Started to build OS packages for our stack
• June 2009 - DW Engineers attend Velocity for first time
A brief history of "DevOps"
at Demonware
• Summer 2010 - Rushed switch to Cobbler/CentOS, more Puppet driven by custom ENC - disabled noop
• January 2011 - First Ops Intern
• Early 2011 - Ops Re-Org, enter DevOps team
• August 2011 - First Engineer moved from Dev to Ops
• September 2011 - Move to Continuous Deployment for Puppet to Prod
• February 2012 - Work with DTO Solutions on “Dev Environment provisioning blueprint”
• March 2012 - Disband DevOps team, new Org Structure - first official Ops Software Engineer job title
• September 2012 - Rundeck in Production
• October 2012 - Internal hackathon week to kickstart “Ops API”
• November 2012 - Ops API first release, read-only cached access to Inventory system
• December 2012 - Our current Build engineer started
• December 2012 - Prototype v1 of internal IaaS API for bare metal provisioning
• February 2013 - First engineer transferred from Ops team to another team (Datawarehouse)
• March 2013 - First release of Build Engineering automated developer environment setup tool
Initial Thoughts on our
DevOps History
• Suspect Typical Evolution for Traditional Busy Ops & Dev?
• “DevOps” almost exclusively focused on Ops :-(
• Big Wins
  • Building internal APIs
  • Continuous deployment of Puppet to Production
• Big Losses
  • Restricting Prod Access
  • Starting with Prod & trying to retrofit
  • Being stereotypical BOFHs
10 Lessons Learned
1 - Be able to clearly
articulate what DevOps is

What is
DevOps?
What does DevOps mean
for …
• You
  • Day to Day
  • Big Picture
• Your Organization
  • Boss
  • Teams you work with
  • Colleagues
  • Leadership
• Can you explain Clearly, Articulately & Concisely to everyone you deal with?
DevOps for me @
Demonware
• For me
  • Day to day - “Automate all the things!”
  • Big Picture - “Service Delivery Pipeline, Organizational Sympathy”
• For Demonware
  • Colleagues - “Buzzword Bullsh*t - almost Cloud”
  • My Boss - “DevOps within Ops, Visible Ops”
  • Old Boss - “Developer Self Service; You build it, you run it”
  • Leadership - “Bridge Dev vs Ops divide, maintain agility as we grow”
  • Teams I work with - “We have to write puppet? What happened to the puppet guys? Wow puppet sucks”
  • Everyone - “Something Michael rambles about”
• PS. Above Quotes Fabricated
2 - Trust your developers
• My Single Biggest Mistake
  • Revoking developer access to Production
• Ops: be a good Customer for your Developers. Provide:
  • Requirements
  • Bug Reports
  • Examples
  • Metrics & Data
3 - Start with Dev
• Working on Automation for over 5 years
  • Almost exclusively focused on Production
  • Never quite usable in Development
  • Don’t do this
• In 2013 it is easy to start with Dev
  • Packer, Vagrant, Docker, Boxen etc
  • First day: sign in, push “make go now”, get coffee, work
4 - Toolchains not Tools
• Demonware - “We build & run services which use Erlang, Python, RabbitMQ, MySQL & Cassandra with Hadoop for Data Analytics”
• DevOps@Demonware were “The Puppet guys”
• Demonware Ops have:
  • Nagios guy
  • Elasticsearch/Logstash/Kibana girl
  • Graphite guy
4 - Toolchains not Tools
• CFEngine vs Puppet vs Chef vs Ansible
• Apache vs Lighttpd vs Nginx vs Jetty
• Who cares?
• What matters is:
  • Using Configuration Management
  • Using an HTTP server
Distinguish between Tools &
Toolchain Components
• Knowledge not Trade
• Components not Things
• Bezos Amazon Service mandate
• Containers / VMs / APIs / PaaS
• Describe not Prescribe
5 - Service Delivery
Pipelines
[Diagram: labels "Development", "Operations", "Build", "Run"; four intermediate stages marked "??"]
5 - Service Delivery
Pipelines
[Diagram: labels "Development", "Operations", "Build", "Run"]
DevOps Toolchain & Service
Delivery
• Not my idea
  • ITIL Service Delivery
  • DTO Solutions
  • Many Others
• http://dev2ops.org/category/devops-toolchainproject/
6 - Organizational Sympathy
• Mechanical Sympathy
  • “Hardware and software working together in harmony”
  • Martin Thompson, High Performance Low Latency Specialist
  • Blog & Mailing List
6 - Organizational Sympathy
• Understand your organization
  • Goals, Processes etc
• Then decide which Toolchain elements make sense to re-use
  • And what you have to build
• Your organization is not Etsy, Facebook or Twitter
• You can’t map their Toolchain & Processes without appropriate Transformations
PHB Alert
7 - Organizational Flexibility
• Org Structure not sacred
• Annual re-orgs normal?
• Examples
  • Valve
  • Internally
    • Good - Engineers continuing to work together post “reorg”
    • Bad - Ops Area, Dev Area :-(
7 - Organizational Flexibility
• Spend time in different roles
  • Sit with other teams
  • Google “Mission Control”
  • Gatecrash scrums
• Understand your colleagues’ POV
8 - Communication is Hard
• Timezones Suck
• Cultural differences are Hard
• Managing Growth without missteps is impossible?
• Most Nerds^wEngineers pick crappy mediums
  • Face, VC, Voice, IM, Mail …
• No Silver Bullets
• Best Writing Advice for Engineers I've Ever Seen. Period.
9 - Hiring Matters
• The biggest contribution I have made to Demonware is managing to hire people who are smarter than me
• Especially crucial for “DevOps”
10 - Metrics & Data
• Business Metrics not CPU utilization
• Data justifies
  • Change
  • Resources
  • Experiments
TL;DR
“How does <X> make it easier to deploy and
run our services?”
Aside - Puppet Continuous
Deployment
• Problem
  • Automation just for system build & service prop
  • “Just stopping puppet, will fix later” - Divergence not Convergence
• Solution
  • Sledgehammer
  • Toolchain
    • Monitoring & Alerting based on Puppet (Internal Daemon & Nagios)
    • “Positive” Policy enforcement - Disease build “bears”
    • Code Review & Aggressive pushing (Git & Gerrit & Fan-out)
    • Testing - dcinabox
• Result
  • Most production hosts 100% Puppet managed (working on staging)
  • In large clusters Drain & Rebuild easier than troubleshooting
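A minimal sketch of the "Monitoring & Alerting based on Puppet" idea, for illustration only. It is not Demonware's internal daemon. It assumes something else has already collected each host's last successful Puppet run time into a dict. The function names, the two-hour threshold and the sample hosts are all hypothetical. The only external convention relied on is the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: a host whose last successful Puppet run is older
# than this is treated as diverging ("just stopping puppet, will fix later").
MAX_AGE = timedelta(hours=2)


def find_stale_hosts(last_runs, now, max_age=MAX_AGE):
    """Return the sorted hostnames whose last run is older than max_age."""
    return sorted(host for host, ts in last_runs.items() if now - ts > max_age)


def nagios_status(stale_hosts):
    """Map the stale list to a Nagios-style (exit_code, message) pair."""
    if stale_hosts:
        return 2, "CRITICAL: puppet stale on " + ", ".join(stale_hosts)
    return 0, "OK: all hosts converged recently"


if __name__ == "__main__":
    # Illustrative data only; a real check would read run times from
    # wherever the site records them.
    now = datetime(2013, 6, 1, 12, 0, tzinfo=timezone.utc)
    last_runs = {
        "web01": now - timedelta(minutes=20),
        "db01": now - timedelta(hours=5),
    }
    code, message = nagios_status(find_stale_hosts(last_runs, now))
    print(message)
    raise SystemExit(code)
```

Alerting on a stopped agent makes divergence visible as soon as it starts, rather than at the next rebuild.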
Looking Forward
• Distributed Configuration
  • Promise Theory, Cluster State Transitions, Multiple Sources of Truth, Constraint Solving
• Distributed System Platform Blocks
  • Separating Infrastructure, Platform & Applications
  • Containers
  • Netflix / Twitter OSS Stacks
  • DC wide cluster scheduling
• Scaling Organisations
  • Remote Workers?
  • Embedded Ops?
  • Flat organizations?
DevOps Lessons Learned
1. Be able to clearly articulate what DevOps is at multiple Levels of Detail
2. Trust your developers
3. Start with Dev
4. Toolchains not Tools
5. Service Delivery Pipelines
6. Organizational Sympathy
7. Organizational Flexibility
8. Communication is hard
9. Hiring Matters
10. Metrics & Data
Surprise - We are Hiring!
• jobs@demonware.net
• http://www.demonware.net/
• @demonware
• Also food & some drinks later are on us …
Questions?
Random
• Contenders for inclusion:
  • Operational Acceptance
  • Versioning
  • Release Management
  • Repository Management
  • Agile!11!

Dev ops lessons learned - Michael Collins
