Our long road to….continuous improvement (DevOps Days Boston 2014)


A two-year operations journey: our bumpy road toward continuous improvement. The discussion covers handling technical debt, finding the bottleneck, the evaluate-and-improve cycle, and some tooling.

Published in: Technology

  • journey can be applied to enterprise & startups.
  • This is my fourth time around, and this time I was going to make new mistakes.
    Your mileage will vary.
  • make new mistakes
  • significant technical debt & existing processes
    No Battle Plan Survives Contact With the Enemy
    this was our journey – NOT The silver bullet
  • Marketing
    no walls, bottleneck, improve flow
    shorten feedback loop
    shared tools & process
    continuous improvement
  • understand current condition
  • “More than tech” time, tools, process, people. incremental improvement
  • cattle/pets….
  • debian packages & deb repo per branch
  • hipchat, jira, bitbucket
  • chef , graphite + sensu
  • feedback loop
  • Infochimps Ironfan – needed support for VPC and spot instances; effort on discovery.
    Asgard workflow – primitives.
    knife-ec2 – basic framework, needed additions.
  • I need a cluster, which one should I use?
    what does the db look like?

    1. Our long road to… continuous improvement (Kevin Amorin)
    2. BitSight Team
       15 yrs in Enterprise & Startups
         • Enterprise Internal IT
         • IT Virtualization Software
         • SaaS low latency, high volume
         • SaaS Big Data, Analytics
       team: Issa Ashwash, Isaac Boehman, Sathya Ragavan, Pavel Sadikov
       Kevin
    3. starting the journey
         • it’s a journey, not a destination
         • no silver bullet
         • “always make new mistakes”
         • do not re-invent the wheel
         • continuous improvement
         • find and reduce the bottleneck (Lean)
    4. Greenfield
    5. first 100 days
         • ask questions and listen
         • hiring (co-ops)
         • marketing the message
         • ‘devops’
         • continuous improvement
         • evaluate current condition & improve
    6. evaluate and improve
         • (simple) value stream mapping
         • minimal waste in: Design, Build & Sustain
         • find bottleneck, optimize flow, repeat
    7. bottleneck: symptoms or cause?
    8. improvement: simple yet difficult
    9. infrastructure v1: “what does this server do?”
       symptom: lost time debugging/managing products in the data center
       cause: organic growth, little planning
       improvement: redesign infrastructure & process for naming (host+subnet+app), access (vpn), users, deployment
    10. build v1: “build is broken!”
        symptom: lost time debugging broken builds
        cause: lack of a uniform build environment; code committed with minimal testing and review
        improvement: centralized build & CI, branching, pull requests, additional unit tests
    11. deployment v1: “that system doesn’t have my fix.”
        symptom: lost time debugging the wrong code
        cause: lack of revisioned artifacts
        improvement: central artifact repo; each package tagged with branch, time, commit
    12. communication v1: “what new feature?”
        symptom: lost time to misinformation
        cause: siloing of information
        improvement: chat, single issue tracking & process, representatives in standup & retrospective
    13. infrastructure v2: “that system is missing a pip module.”
        symptom: lost time debugging misconfigured applications & systems
        cause: lack of consistency across systems & applications
        improvement: config management, monitoring & alerting
    14. build v2: “ran on my local system…”
        symptom: code freeze branch would not run end to end; lost time debugging which change caused the issue
        cause: lack of regression/functional tests
        improvement: functional tests, required to run before merge
    15. deployment v2: “db is not correct?”
        symptom: database schema/data did not match code
        cause: inconsistent process & manual steps on schema/data updates
        improvement: db schema management tool + process
    16. deployment v3: “new server didn’t come up right?”
        symptom: lost time on misconfigured or failed node provisioning in AWS
        cause: inconsistent, semi-automated provisioning steps that lacked the flexibility needed for a growing product line
        improvement: knife-bs provisioning & deployment http://github.com/CBitLabs/knife-bs
    17. provisioning research & design
          • Netflix Asgard: web interface for application deployments and cloud management in Amazon Web Services (AWS)
          • Infochimps Ironfan: Chef orchestration layer (“your system diagram come to life”)
          • Chef knife-ec2: plugin that gives knife the ability to create, bootstrap, and manage EC2 instances
    18. knife-bs: cloud provisioning tool built on top of opscode/knife-ec2. Using a description of your infrastructure and stacks (in either YAML or JSON), knife-bs will build the correct stack in the correct environment and bootstrap chef.
    19. Describe: Infrastructure & Application
        environment, region, vpc, subnet, stack, profile (one or more)
        profile settings: naming, type, volumes, raid, sg, spot_price, eip, ami… run_list
    20. knife-bs examples
          knife bs server create ame1.prod portal -count 10 -ebs 10 -eip -ami-id=xxxx
          knife bs stack create ame1.stag hadoop flavor=c3.xlarge -spot-price=1.00
          knife bs server delete ame1.stag tt
    21. infrastructure v3: “can I grab a cluster?”
        symptom: who is using what, and what state is it in?
        cause: lack of visibility into ownership & state of applications
        improvement: infrastructure web UI which overlays org meta-data http://github.com/CBitLabs/atlas
    22. Atlas
    23. Delivery Pipeline
    24. Lessons Learned
          • ‘devops’ messaging
          • analyze current state
          • use symptom to find cause
          • find solution that fits
          • KISS then scale
    25. @veritaskev, kevin@amorin.org (img: Marc Cluet)
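
The "evaluate and improve" slide (simple value stream mapping, then find the bottleneck) can be illustrated with a small sketch. The stage names, the numbers, and the `Stage`, `find_bottleneck`, and `flow_efficiency` helpers below are all hypothetical; the talk does not show how the team recorded its value stream, so this is only one plausible way to do it.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    work_minutes: float  # hands-on time spent in the stage
    wait_minutes: float  # time a change sits idle before the stage starts


def find_bottleneck(stages):
    """Return the stage with the largest total (work + wait) time."""
    return max(stages, key=lambda s: s.work_minutes + s.wait_minutes)


def flow_efficiency(stages):
    """Share of total lead time that is hands-on work."""
    work = sum(s.work_minutes for s in stages)
    total = sum(s.work_minutes + s.wait_minutes for s in stages)
    return work / total if total else 0.0


# Made-up numbers for illustration only; not measurements from the talk.
pipeline = [
    Stage("design", 120, 60),
    Stage("build", 30, 240),
    Stage("deploy", 20, 90),
]
bottleneck = find_bottleneck(pipeline)
print(bottleneck.name, round(flow_efficiency(pipeline), 2))
```

Re-running the same measurement after each improvement gives the "find bottleneck, optimize flow, repeat" loop something concrete to compare against.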
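
The "deployment v1" slide calls for each package to carry its branch, build time, and commit. A minimal sketch of how such a version string might be composed follows; the exact format, the `package_version` helper, and the sample values are assumptions, since the talk does not show the scheme the team actually used.

```python
import re
from datetime import datetime, timezone


def package_version(base, branch, commit, built_at):
    """Build a version string that carries branch, build time, and commit."""
    # Replace anything other than letters, digits, and dots with '.', so the
    # branch name stays within a conservative version-string character set.
    safe_branch = re.sub(r"[^A-Za-z0-9.]+", ".", branch).strip(".").lower()
    stamp = built_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base}+{safe_branch}.{stamp}.{commit[:7]}"


# Hypothetical inputs for illustration.
version = package_version(
    "1.4.0",
    "feature/login-fix",
    "9fceb02d0ae598e95dc970b74767f19372d61af8",
    datetime(2014, 10, 1, 12, 30, 0, tzinfo=timezone.utc),
)
print(version)  # 1.4.0+feature.login.fix.20141001123000.9fceb02
```

With a version like this on every artifact, "that system doesn't have my fix" becomes a question the installed package can answer directly.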
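
The "deployment v2" slide names a db schema management tool plus process as the fix for schema drift, without saying which tool. The sketch below shows the general idea behind such tools (ordered, recorded, repeatable migrations) using Python's built-in `sqlite3`; the table names and the `migrate` helper are hypothetical and are not the tool used in the talk.

```python
import sqlite3

# Hypothetical migrations: (version, SQL) pairs applied in ascending order.
MIGRATIONS = [
    (1, "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE companies ADD COLUMN rating INTEGER"),
]


def migrate(conn, migrations):
    """Apply any migrations not yet recorded; return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue  # already applied on this database
        conn.execute(sql)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn, MIGRATIONS)   # applies both migrations
second_run = migrate(conn, MIGRATIONS)  # nothing left to apply
columns = [row[1] for row in conn.execute("PRAGMA table_info(companies)")]
print(first_run, second_run, columns)
```

Because applied versions are recorded in the database itself, running the same step on every deploy replaces the manual schema updates the slide identifies as the cause.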