© Beamly Limited
Evolving operational maturity in a start-up environment
Adrian Spender, Head of Server Engineering
CTO’s in London Meetup – Octopus Labs
6th October 2016
About me
• Software engineer with 18 years of JVM-based development experience
• Now focused on engineering management
• Four years at Beamly
• Responsible for our operational support for the last two years
@aspender https://linkedin.com/in/aspender
About Beamly
Second screen mobile apps and website
• Oct 2011: Zeebox launch
• Jan 2012: Sky investment
• Sep 2012: US launch with investment from Comcast/NBC, Viacom and HBO
• Nov 2012: AU launch with Ten and Foxtel
About Beamly
Timeline, Oct 2011 to now:
• Oct 2011: Launch
• Jan 2012: Sky investment
• Sep 2012: US launch with investment from Comcast/NBC, Viacom and HBO
• Nov 2012: AU launch with Ten and Foxtel
• Apr 2014: Rebranded as Beamly
• Aug 2014: Founders step down
• Apr 2015: 10m MAUs
• Oct 2015: Acquisition by Coty Inc.
Other timeline markers: Feb 2015 (Facebook spend tools); Aug 2015 and Nov 2015 (AU shutdown, EOL mobile apps); Aug 2016
Phases:
• Second screen mobile apps and website
• PIVOT! TV focussed social network: produce original TV/celeb articles
• PIVOT! Social content marketing and tooling
• PIVOT! In-house digital marketing and web agency: host Coty brand sites, run campaigns, data science
What do I mean by operational maturity?
• How our ability to support our code running in production has changed over time, and with the following variables:
– Product strategy
– Customer base
– Geography
– Technical architecture and practices
– Organisational structure and people
• Let’s focus on the last two
Technical architecture and practices
• We got some things right from very early on
– A DevOps culture of ‘you write it, you run it’
• Testing
• Continuous integration
– A ‘platform’ team whose focus is developer effectiveness, not operations
– Service endpoints
• https://github.com/beamly/se4
– Runbooks
– Monitoring and alerting
SE4
• Common endpoints for every service, regardless of tech
– /service/status
– /service/healthcheck/gtg
– /service/healthcheck
– /service/metrics
– /service/config
• Acts as a single point of understanding about the runtime deployment of the service
• Useful for problem determination
• Useful for ELB/haproxy/any other healthcheck
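As a rough illustration of the idea of common endpoints, here is a minimal sketch in Python of a service answering two of the paths listed above. Only the paths /service/status and /service/healthcheck/gtg come from the slide. The response bodies, the status codes, the SERVICE_INFO values and the dependencies_ok helper are assumptions made for this example and are not taken from the SE4 specification.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical service details; the names and values are illustrative only.
SERVICE_INFO = {"name": "example-service", "version": "0.0.1"}


def dependencies_ok() -> bool:
    """Placeholder check; a real service would probe its database, queues, etc."""
    return True


def handle(path: str) -> tuple[int, str, str]:
    """Map a request path to (HTTP status, content type, body)."""
    if path == "/service/status":
        return 200, "application/json", json.dumps(SERVICE_INFO)
    if path == "/service/healthcheck/gtg":
        # "Good to go": a cheap yes/no answer that a load balancer can poll.
        # The 200/500 codes and the OK/FAIL bodies are assumptions of this sketch.
        ok = dependencies_ok()
        return (200 if ok else 500), "text/plain", ("OK" if ok else "FAIL")
    return 404, "text/plain", "not found"


class Handler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle() so the routing can be tested on its own."""

    def do_GET(self):
        status, content_type, body = handle(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the example server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling serve() starts the server locally. Because every service would answer the same paths, a load balancer healthcheck or an on-call engineer can ask the same question of any service without knowing how it was built.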
Architectural evolution
Oct 2011 to now: Monoliths → Microservices
Operational considerations of Microservices
• Foo is alerting
– What does that actually mean, what is the impact?
– Architect for failure
• Know and eliminate your SPoFs
• Have good tooling to support problem determination
– Runbooks to describe service responsibilities and problem determination steps
– Log aggregation
– Metrics aggregation
– Monitoring
• Internet Scale Services Checklist - Adrian Colyer
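One way to approach the question "Foo is alerting, what is the impact?" is to keep a map of which services call which, and walk it backwards from the alerting service. The sketch below assumes such a map exists. The service names and the DEPENDS_ON structure are hypothetical and are not taken from Beamly's systems.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "web": ["feed", "auth"],
    "feed": ["foo"],
    "auth": ["foo"],
    "foo": [],
    "reports": [],
}


def impacted_by(failing: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges: service -> services that call it.
    callers: dict[str, list[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first walk from the failing service up through its callers.
    impacted: set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


def blast_radius(depends_on: dict[str, list[str]]) -> dict[str, int]:
    """Count how many other services each service would take down with it."""
    return {name: len(impacted_by(name, depends_on)) for name in depends_on}
```

In this made-up map, impacted_by("foo", DEPENDS_ON) returns feed, auth and web, while an alert on reports affects nothing else. Services with the largest blast radius are the first candidates to examine as single points of failure.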
Architectural evolution
Oct 2011 to now: Monoliths → Microservices → Event-sourced
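Organisational structure
Oct 2011 to now: Tech silos → Feature team hybrid → Product teams
[Figure: team members shown as 🐰, 🐧 and 🐼, grouped by type in tech silos, joined by one mixed group in the feature team hybrid, and mixed together in product teams]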
Conway’s law in action
[Figure: the same team groupings along the Oct 2011 to now timeline, annotated “No communication” and “Synchronous meetings”]
Organisational structure
Oct 2011 to now: ’Spotify model’
[Figure: 🐰, 🐧 and 🐼 team members arranged in mixed groups]
The operational problem with product teams/squads
[Figure, built up over five slides along an Oct 2011 to now timeline: items labelled A to I, then J added alongside C, then “Shared Infra” added]
The operational problem with product teams/squads
• Incident follow-up – let’s create a Jira board
• Cross-cutting / fall through the gaps – Engineering Excellence initiative
[Charts: Incident post mortem tickets and Engineering Excellence tickets, each split into Not Done and Done]
People
[Chart: Engineering team length of service; x-axis: length of service (years) in bands 0-1, 1-2, 2-3, 3-4, 4-5, 5+; y-axis scale 0 to 9]
People
[Chart: Incidents per month, Jan-15 to Aug-16; y-axis scale 0 to 14; annotations: Acquisition, ‘cleanup’]
[Chart: Engineering team length of service; x-axis: length of service (years) in bands 0-1, 1-2, 2-3, 3-4, 4-5, 5+; y-axis scale 0 to 9]
45% of engineers have never been involved in handling an operational incident
[Chart: Incident and EE tickets completed by job role; series: Completed tickets, Engineers; roles: Head of Engineering, Technical Architect, Senior Software Engineer, Software Engineer, Junior Software Engineer; y-axis scale 0 to 8]
Current operational challenges
• Product team structure focused on moving forwards
– Velocity vs stability tension
• Cross-cutting tech and issues have no owner
• Lack of operational issue handling practice and experience
• Little investment in improving our availability and reliability through improved monitoring and automation
• Over-reliance on small subset of the engineering team
– Lack of opportunity for experience and growth for the rest
– Tacit knowledge not being shared / encoded
Site Reliability Engineer
• We are hiring into this role to focus exclusively on our availability and reliability
• Will not be part of a product team
• Will spend at least 50% of their time writing code to automate away operational burden and improve monitoring
• Will have power to fix things in your production systems if you can’t/don’t
• Will own the maintenance and evolution of common runtime infrastructure (e.g. haproxy, Tyk)
• Will help teams plan for production including capacity planning, performance, architecture
• Will help us evolve operational processes and practices
• Is not ‘platform’ – not focused on developer effectiveness or IT.
https://thebeamlyagency.bamboohr.co.uk/jobs/view.php?id=17
Being on call – current structure
Mon Tue Wed Thu Fri Sat Sun
Bob Alice Don John Zed Joe Joan
Third line teams
Engineering management
Being on call – new structure
Mon Tue Wed Thu Fri Sat Sun
Bob Alice
Third line teams
Engineering management
Being on call during week == Site Reliability Engineer
• Handling incidents that occur
• Writing up incident post mortems
• Responding to any non-incident issues e.g. automated warnings in the Slack #live-monitoring channel
• Picking up tickets outstanding from previous post mortems
• Picking up Engineering Excellence tickets. Examples of which would include:
– Resolving issues/pain points through automation
– Improving documentation
– Improving alerting
• Improvements to common infrastructure/services
• Performing routine maintenance on common systems (e.g. HiveMQ upgrade)
• Expanding their knowledge of Beamly systems/architecture (e.g. performing chaos monkey tests)
• Working on technical debt within their own product team that is not specifically prioritised in that team’s own plans.
Much more room for improvement
• Better measurement of availability/reliability
• Error budgeting
• More automation
• Continuous delivery
• Improved tooling
• Never ending…
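To make the error budgeting item concrete, the arithmetic can be sketched as below. The 99.9% target and the 30-day window are illustrative assumptions, not figures from the talk. With those numbers the budget is 0.1% of 43,200 minutes, which is roughly 43.2 minutes of downtime per window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window.

    `slo` is a fraction, e.g. 0.999 for "three nines" (an assumed target).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left; a negative result means the budget is overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

A team that has already had 30 minutes of downtime in the window would have about 13 minutes left under these assumptions, which is one way of framing the velocity versus stability tension in numbers.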
Thank you. Questions?
@aspender https://linkedin.com/in/aspender

Editor's Notes

  1. We started out as a very tech-driven startup called zeebox in the second screen TV space with an iOS, Android and web app. The company and all engineering has always been based in London. Investment after the UK launch came from Sky, quickly followed by a US launch with Comcast, NBC, Viacom and HBO and later an Australia launch with Network Ten and Foxtel. People watch TV during a fairly limited prime-time of three hours in the evening, and we were soon in the position where we had these three hours in four main geographic regions (Sydney, UK, US East Coast, US West Coast) all being supported from a UK engineering team.
  2. Our partners often promoted the app during their prime-time shows. This made our traffic patterns extremely peaky with huge spikes in an otherwise pretty low level baseline of traffic. Additionally our monetisation mechanism was through in-app advertising synchronised to TV. This led to the first couple of years including a lot of late nights hand-holding unstable technology and simply providing support in case of issues as demanded by our investors. The TV world is quite different from a tech startup environment. Their aversion to risk was very high which led to a lot of over-engineering, scaling and support for what may happen, rather than what did happen. When things did go wrong, we were directly answerable to the TV companies. One other aspect impacting our operational approach during this time was that for a startup we were extremely well funded. To an extent this led to a laziness in that it was quicker to throw more AWS instances at a scale problem than it was to engineer our way out of it.
  3. We later changed strategy to become more of a 24/7 proposition around a TV-based social network, rebranding along the way as Beamly. As part of this we also hired an editorial team to write original article content around TV and celebrity news. The operational impact of this strategy was that firstly we greatly increased the number of services running in production as we aggressively built out social network, news aggregation and publishing functionality. Secondly it was a natural point at which we moved to a Microservices based approach. Over the course of this time we also had an emerging strategy of gaining reach through promoting our article content via Facebook. Over time we built our own tooling to support that spend and this had two effects. Firstly we got very good at bringing in users to our platform, from where the challenge then became to retain them. This led to our high watermark of around 10 million Monthly Active Users in April 2015. The second effect was that we started a pivot into social content marketing as our core competency and began taking on external clients. Ultimately this led to our acquisition by Coty Inc. in September 2015. We are now their in-house digital and web agency. We host and build out brand web presence for over thirty leading Coty fragrance and beauty brands, as well as run social, display and video based digital ad campaigns based on a data-led approach. This final pivot has introduced new operational considerations. We no longer run 95 microservices in production but our estate is now much more heterogeneous and includes code that we have not written but has been provided by third party agencies. Additionally outages are no longer primarily a reputational issue for us, but a revenue issue for our parent company (and other clients). At the current time, the Beamly engineering team is made up of 20 engineers in London.
  4. Product strategy has changed over time, but in general we adopt a ‘dual track’ agile approach of doing the minimum to discover the potential of an idea (user testing, low cost tests etc) before we commit to deliver it. Delivery is done as an MVP that is then iterated and built upon based on data feedback. Whilst good from the perspective of finding what works, this approach can present some operational challenges in terms of its tendency to leave services implemented to a ‘good enough’ level when teams then move onto the next thing. Good enough doesn’t always cover non-functional considerations. Sometimes they are not quite good enough… Our customer base has evolved dramatically from the early days of mainly male, technology oriented ’geeks’ and our investors through to a specific attempt to target female 18-25 year olds (including changing our name and branding) through to our current position as an agency where our direct customers are our parent company and other brands, but indirectly we are primarily again focused on a female demographic. Operationally we’ve moved full circle from outages causing us issues with our investors, through to them causing our own reputational damage to them now affecting revenue generating activity for our clients. Geography has always been a difficult issue in operational terms. We are London based but have had to support a global product in a 24/7 fashion for nearly all of our existence. This is still true now as part of a multi-national organisation. The main challenge here is in building an on-call and incident response process that gives us the coverage we require but in a way that is fair to our team. For this presentation, the effect on operations of how our technical architecture and practices, and our organisational structure and people have evolved are the things I’ll expand on.
  5. We’ve followed a typical evolution from monolithic systems to microservices. An in-depth discussion about the pros and cons of various architectures is beyond the scope of this presentation, but we will discuss some of the operational impacts of such an approach. Monoliths are actually pretty good in one regard of operational maturity. If you have a smaller number of codebases, by definition you have more people familiar with that code and more able to support it. You also have fewer moving parts.
  6. The key operational aspect of a microservices architecture for us is to understand the actual impact to end users when a service is failing or unavailable. We historically monitored individual services but have found that there is more value in attempting to monitor functionality instead. It is more useful for a pager to alert that login is not working, than for one of the microservices tangentially involved in login to alert that it is failing. This is particularly true if you do not spend the necessary effort to properly implement a microservices architecture. By that I mean you have made assumptions about the way that systems will behave in the face of failure. If you are doing things properly (and believe me we did not in a lot of cases) then you will design for failure from the outset and you’ll aim for graceful degradation rather than total collapse. It is also incredibly important to find and eliminate your Single Points of Failure. A common one in a microservices architecture is your mechanism for internal load balancing of requests. For instance we use HAProxy and had to spend a lot of time understanding how to run that reliably in a fault tolerant way, especially when it is common for instances to come and go and configuration to be re-written. You should not even consider microservices (and autoscaling) without the non-functional tooling in place to make it work. You need log aggregation, metrics aggregation, monitoring and the like. Adrian Colyer’s Internet Scale Services Checklist is a great resource for understanding the things you should be thinking about. We use a slimmed down version of this as a pre-live checklist for any new service. Finally, microservices introduce a cognitive overhead in terms of there being many more codebases, potentially in various languages/frameworks. In our case we have a variety of Scala, Scala/Play and node.js based services. It is harder for every engineer to know what everything running in the estate does, how it works, and how to troubleshoot it. This is where Runbooks have been very useful for us.
  7. We’ve now moved beyond microservices to a more event-sourced/driven approach that utilises Apache Kafka and Apache Spark to run code in response to events. This is not suitable for all cases but for example it works really well when building publishing flows. It introduces more challenge in understanding how to run ZooKeeper/Kafka/Spark in a fault tolerant and scalable way (we run everything on AWS) but has the operational advantage that it drastically reduces the complexity of fault handling inherent in a microservices architecture that relies on HTTP communication between services.
  8. Finally from a technical perspective it is worth discussing briefly our approach to the proliferation of technologies we run in production. We have always had a fairly simple approach to this. We like to use the right tool for the job, but it is incumbent on any engineer (or team) looking to introduce a new technology to do the work to implement it in a scalable and fault tolerant manner that also integrates into our logging/metrics/monitoring and alerting mechanisms. We avoid ‘CV Driven Development’ through this approach and any new technology needs to go through the same pre-live checklist as any code we write. We used to have a rule that any technology used by three or more teams would become owned by the Platform team (whose primary responsibility is to maximise developer effectiveness) but in reality this doesn’t work as that team are not direct consumers of that technology themselves and are therefore not close enough to the pain points to prioritise effort on it. However, this approach means that our technology estate has become more heterogeneous over time. The above logos are an indication of the complexity of our environment in the early zeebox second screen days. Almost all backend services were written in Scala and the world was pretty simple.
  9. In the social network era of Beamly and in a microservices world, things become more complex…
  10. And today they are more complex still. Significantly we are now writing or supporting code written in Scala (Services, Spark), JavaScript (Node) and PHP (Drupal/WordPress), and a lot of this complexity is inherited from brand websites that we have taken over. There are some examples in these slides of how we’ve evolved approaches over time in response to the challenges we’ve had with technology. A good example is configuration management and orchestration. We started off with a hand-written set of Python-based tooling known as Verrot. This was replaced by Puppet and Hieradata, which in turn has been replaced with Ansible/Consul. In each case that migration was costly in engineering effort but paid off in greater effectiveness and operational improvements.
  11. Moving on to organisation structure, this is perhaps the single biggest factor affecting our operational approaches. Again, we’ve followed a fairly typical evolution in our product/engineering structure. We started off with technology aligned vertical silos (iOS team, Android team, Web team, Backend team etc) – operationally this works quite well as service ownership is clearly delineated and knowledge is co-located. You can build an on-call rotation based around those teams. The possible downsides are that particular teams may tend to get overly burdened by operational issues (there is little that a mobile app team needs to be on-call for as true app issues cannot be resolved without an app store release…) Again, the product value aspects of team structures are beyond the scope of this presentation, but we started moving to more multi-disciplinary ’feature teams’ gradually which then evolved naturally into a ‘product team’ structure. Product teams have a product manager, tech lead and UX designer (or Data Scientist) as a core, supplemented by the right mix of engineers to achieve their goals. They are given business problems or metrics to affect and have the autonomy to do so in whichever way they think best. They make data-led decisions and look to prove approaches with minimal code before committing to delivery.
  12. As an aside, it is interesting to see how Conway’s Law appears to be true in the context of Beamly. Conway’s Law states that “organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”. Tech aligned silos do not promote much cross-communication and lead to monoliths. Multi-disciplinary product teams act with a fair degree of autonomy and possess the skills to produce their own end to end output. When they do need to communicate it is often via the inefficient form of synchronous meetings or ‘scrum of scrums’ – this is mirrored in our evolution by the use of Microservices reliant on unreliable HTTP based communication. Finally, like pretty much everybody else we’ve been using Slack as the de-facto communications mechanism for two years now. This is asynchronous in nature, discoverable and effective. This is mirrored by our move to a more event-driven architecture.
  13. Where we are now is a subtle but significant evolution of the product team model inspired by the Spotify approach, but tailored to our context. The main outcome of this is that we moved engineer management from within the product teams (which inhibited movement and limited the motivation for teams to make time to expand knowledge of engineers outside of the scope of that team) to a matrixed approach with ‘Heads of Engineering’. A Head of Engineering is not in any product team and is primarily a people manager. If they make technical contribution it is on non-critical-path stuff and slow progress will not hinder anybody. They have the head-space to look at how we are working as a whole and to identify how we can improve. This has been immensely valuable not only in the ability to focus more on the growth and development of our engineers but also in giving us people who are more incented to identify and start to solve our operational challenges.
  14. Product teams are great for optimisation of delivering product value, but they introduce significant operational headaches. Naturally those teams will start to build services and to begin with all is good with the world.
  15. But then some services will not clearly ’fit’ into a single team and their ownership is indistinct.
  16. And then there will be shared infrastructure which multiple teams rely on…
  17. And then by nature, product teams/squads can disband when their goals are achieved. Imagine the example of a team tasked with improving the login experience. They build services to support Facebook and Google+ login, things are better, then that team’s mission is complete and they move off to other things, but the services they built remain critical.
  18. And finally, even if teams persist their focus shifts as they evolve. They leave services behind them which are still running in production but are not a priority to spend time on.
  19. All of these are things we’ve seen happen with the product team/squad model. It causes problems when orphaned or fuzzily-owned services have incidents or we identify improvements we’d like to make. The challenge is that all engineers are in product teams that are not primarily incented to pick up work on systems they do not ’own’ or which does not deliver against their current goals. In Spotify it is probable that these are solved by their scale (and the concept of Tribes) but we are not that big. So, we’ve tried creating a specific Incident post-mortem JIRA board. We also created an ‘Engineering Excellence’ initiative whereby anybody can bring up an initiative to improve something or tackle some tech-debt. These are then up-voted and the most popular can be considered and given a ‘Directly Responsible Individual’ to champion. We then try to carve out time and people to do it by negotiation with product managers. The problem is that both of these initiatives have been failures. In both cases we’ve completed only 27% of the identified tickets over the past 12 months. This leaves us with a problem.
  20. The final aspect affecting our operational maturity is our engineering team itself. Whilst we are five years old as a company, 55% of our engineering team have been with the company under two years.
  21. Additionally, over the past year since acquisition, the number of operational incidents we’ve had to handle has dropped dramatically. Some of this is down to improvements, but the biggest factor is that we shut down our Beamly product to focus on the requirements from our new parent company. We went from over 90 services in production to around 25. Less code = less complexity and fewer things to go wrong.
  22. The impact of this is that a significant percentage of our engineers have now never had to handle an operational incident whilst they’ve been on call. Again, in one respect this is good (hey boss, fewer incidents!) but actually we need a certain level of incident handling to keep discipline, to keep troubleshooting skills sharp and to maintain knowledge of the process.
  23. Another aspect relates back to the 27% of successfully completed incident and Engineering Excellence tickets. When broken down by job role, it is clear that we overly rely on the most senior (and most tenured) engineers to shoulder the burden of this work. This is down to a number of factors including their experience, knowledge of our systems, the fact that they are more likely to have an emotional attachment to the services being worked on, individual effectiveness and ability to absorb additional work. This presents a cyclic problem – our newest engineers are not engaged with the opportunity to expand their knowledge by working on these issues, and they can’t effectively work on these issues due to the lack of knowledge.
  24. All this builds up to a significant number of challenges to our operational effectiveness.
  25. So what’s next? Site Reliability Engineering has become an increasing trend, driven by the success of this model at Google and other companies. SRE isn’t ops, but is an application of Software Engineering approaches to the problems of how availability and reliability are maximised. It tackles how operational burden can be eliminated through obsessive automation. It is also much more – and the O’Reilly book is an excellent read. Of course, we are not Google. But we want to create an SRE style role within the engineering team to start to address some of the issues. However to begin with we certainly can’t justify a full SRE team…
  26. So we are also going to restructure our on-call structure. Currently, all engineers staff a second line rota on a 24 hour rotation (10am-10am) – this worked well for us when there were fairly regular operational issues but for the last 18 months people’s expectations of being paged have been minimal and as such it is not uncommon now for the schedule to get into a poor state (people on rota when on holiday for instance) – in short we’ve lost some discipline.
  27. So, the new approach will be to move to a 4/3 rota, whereby an engineer is on-call from 10am Monday to 10am Friday. During this time however they will also be extracted from their normal product team duties to work alongside the SRE. Effectively we create a near full time SRE team of two people. Regardless of whether incidents occur, this person gets time and space to focus on wider issues alongside the full-time SRE. The weekend on-call engineer just handles pages as usual.
  28. Whilst on-call in the SRE role during the week, the SRE ‘team’ can focus on a variety of tasks.
  29. The aim of this presentation is not to claim that we are operationally mature, nor to claim that we have best practices but just to share our experience. We are knowingly deficient and still learning all the time. As is common everywhere, the list of things we would like to do far outstrips our ability to spend time and resource on them.
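Note 6 argues for alerting on user-facing functionality (e.g. "login is not working") rather than on each microservice, and for graceful degradation over total collapse. The following is a minimal Python sketch of that idea, not a description of Beamly's actual monitoring. The `Dependency` type, the `feature_status` and `should_page` functions and the service names in the usage example are all hypothetical, and the rule that only a critical failure pages someone is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Dependency:
    """One service that a user-facing feature relies on (hypothetical model)."""
    name: str
    check: Callable[[], bool]  # returns True when the service looks healthy
    critical: bool             # False: the feature can degrade without it


def feature_status(dependencies: List[Dependency]) -> str:
    """Summarise a feature as 'ok', 'degraded' or 'down'."""
    failed = []
    for dep in dependencies:
        try:
            healthy = dep.check()
        except Exception:
            # A check that raises is treated the same as an unhealthy service.
            healthy = False
        if not healthy:
            failed.append(dep)
    if any(dep.critical for dep in failed):
        return "down"
    if failed:
        return "degraded"
    return "ok"


def should_page(status: str) -> bool:
    """Assumption: page a human only when the feature itself is down."""
    return status == "down"


# Usage with made-up dependencies for a 'login' feature.
login = [
    Dependency("auth-service", lambda: True, critical=True),
    Dependency("avatar-service", lambda: False, critical=False),
]
status = feature_status(login)
print(status, should_page(status))
```

In this sketch a failing non-critical dependency marks the feature as degraded without waking anybody up, which is one way to express the difference between a tangential service failing and login actually being broken.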
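Note 27 describes the proposed 4/3 rota: one engineer is on call from 10am Monday to 10am Friday, and the weekend engineer covers the rest. A small Python sketch of that split is shown here. The function name `shift_for` and the labels `"weekday-sre"` and `"weekend"` are hypothetical, and the sketch assumes naive local (London) timestamps with the handover happening exactly at 10:00.

```python
from datetime import datetime, time

HANDOVER = time(10, 0)  # 10am handover, as described in note 27


def shift_for(moment: datetime) -> str:
    """Return which on-call shift covers the given local time."""
    weekday = moment.weekday()  # Monday == 0 ... Sunday == 6
    after_handover = moment.time() >= HANDOVER
    if weekday == 0:
        # Monday: the weekend shift runs until the 10am handover.
        return "weekday-sre" if after_handover else "weekend"
    if weekday in (1, 2, 3):
        # Tuesday to Thursday are fully inside the weekday shift.
        return "weekday-sre"
    if weekday == 4:
        # Friday: the weekday shift ends at the 10am handover.
        return "weekend" if after_handover else "weekday-sre"
    # Saturday and Sunday.
    return "weekend"


# Usage: 6 October 2016 was a Thursday.
print(shift_for(datetime(2016, 10, 6, 15, 30)))
```

The weekday shift spans four 24-hour periods and the weekend shift three, which is where the 4/3 name comes from.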