Evolving Operational Maturity in a Startup Environment

0
© Beamly Limited
Evolving operational maturity in a start-up environment
Adrian Spender, Head of Server Engineering
CTO’s in London Meetup – Octopus Labs
6th October 2016

1
About me
• Software engineer with 18 years of JVM-based development experience
• Now focused on engineering management
• Four years at Beamly
• Responsible for our operational support for the last two years
@aspender https://linkedin.com/in/aspender

2
About Beamly
• Oct 2011: Zeebox launch (second screen mobile apps and website)
• Jan 2012: Sky investment
• Sep 2012: US launch, with investment from Comcast/NBC, Viacom and HBO
• Nov 2012: AU launch with Ten and Foxtel

4
About Beamly
• Oct 2011: Launch (second screen mobile apps and website)
• Jan 2012: Sky investment
• Sep 2012: US launch, with investment from Comcast/NBC, Viacom and HBO
• Nov 2012: AU launch with Ten and Foxtel
• Apr 2014: Rebranded as Beamly
• Aug 2014: Founders step down
• Between Feb 2015 and Aug 2016 (markers at Feb 2015, Apr 2015, Aug 2015, Nov 2015 and Aug 2016): Facebook spend tools; original TV/celeb articles; 10m MAUs; AU shutdown; mobile apps EOL; hosting Coty brand sites, running campaigns and data science
• Oct 2015: Acquisition by Coty Inc.
Pivots along the way:
– PIVOT! TV focussed social network
– PIVOT! Social content marketing and tooling
– PIVOT! In-house digital marketing and web agency

5
What do I mean by operational maturity?
• How our ability to support our code running in production has changed over time, and in response to changes in the following variables:
– Product strategy
– Customer base
– Geography
– Technical architecture and practices
– Organisational structure and people
• Let’s focus on the last two

6
Technical architecture and practices
• We got some things right from very early on
– A DevOps culture of ‘you write it, you run it’
• Testing
• Continuous integration
– A ‘platform’ team whose focus is developer effectiveness, not operations
– Service endpoints
• https://github.com/beamly/se4
– Runbooks
– Monitoring and alerting

7
SE4
• Common endpoints for every service, regardless of tech
– /service/status
– /service/healthcheck/gtg
– /service/healthcheck
– /service/metrics
– /service/config
• Acts as a single point of understanding about the runtime deployment of the service
• Useful for problem determination
• Useful for ELB/haproxy/any other healthcheck
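A minimal sketch of a service exposing the five endpoint paths listed above, using only the Python standard library. The paths come from the slide; the response bodies, field names, status codes for failure and the `database_reachable` check are illustrative assumptions, not taken from the SE4 specification.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical service metadata; field names are illustrative only.
SERVICE_INFO = {"name": "example-service", "version": "1.0.0"}


def database_reachable() -> bool:
    """Placeholder dependency check; a real service would probe its datastore."""
    return True


def handle(path: str):
    """Map a request path to (HTTP status, content type, body)."""
    healthy = database_reachable()
    if path == "/service/status":
        return 200, "application/json", json.dumps(SERVICE_INFO)
    if path == "/service/healthcheck/gtg":
        # "Good to go": a cheap yes/no answer suitable for a load balancer.
        if healthy:
            return 200, "text/plain", "OK"
        return 503, "text/plain", "FAIL"
    if path == "/service/healthcheck":
        # Fuller report for humans doing problem determination.
        report = {"checks": [{"name": "database", "passed": healthy}]}
        return 200, "application/json", json.dumps(report)
    if path == "/service/metrics":
        return 200, "application/json", json.dumps({"requests_total": 0})
    if path == "/service/config":
        return 200, "application/json", json.dumps({"log_level": "INFO"})
    return 404, "text/plain", "not found"


class ServiceHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, content_type, body = handle(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ServiceHandler).serve_forever()
```

Calling `serve()` starts the sketch on port 8080. Keeping the routing in a plain function (`handle`) means the same answers can be reused behind any web framework, which fits the "regardless of tech" point.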

9
Architectural evolution
Timeline, Oct 2011 to Now: Monoliths, then Microservices

10
Operational considerations of Microservices
• Foo is alerting
– What does that actually mean, what is the impact?
– Architect for failure
• Know and eliminate your SPoFs
• Have good tooling to support problem determination
– Runbooks to describe service responsibilities and problem determination steps
– Log aggregation
– Metrics aggregation
– Monitoring
• Internet Scale Services Checklist - Adrian Colyer
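One way to answer "Foo is alerting, what is the impact?" is to keep a map of which service calls which and walk it backwards from the failing service. The sketch below assumes such a map exists; the service names and the `DEPENDS_ON` structure are hypothetical, not Beamly's actual architecture.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "web": ["feed", "auth"],
    "feed": ["foo"],
    "auth": [],
    "foo": ["db"],
    "db": [],
}


def impacted_by(failing, depends_on):
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges so we can walk from a service to its callers.
    callers = {}
    for caller, callees in depends_on.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    impacted = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


def blast_radius(depends_on):
    """Count, per service, how many other services its failure would touch."""
    return {name: len(impacted_by(name, depends_on) - {name}) for name in depends_on}
```

Services with the largest blast radius are candidates for a closer look as single points of failure, and a runbook can state the impacted set for each alert.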

11
Architectural evolution
Timeline, Oct 2011 to Now: Monoliths, then Microservices, then Event-sourced

15
Organisational structure
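Timeline, Oct 2011 to Now: Tech silos, then Feature team hybrid, then Product teams
(Diagram: 🐰, 🐧 and 🐼 mark three kinds of engineer. Tech silos: each team holds one kind. Feature team hybrid: the silos remain, plus one mixed team. Product teams: every team is mixed.)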

16
Conway’s law in action
Timeline, Oct 2011 to Now
(Diagram: the silo and mixed-team layouts of 🐰, 🐧 and 🐼, annotated "No communication" and "Synchronous meetings".)

17
Organisational structure
Timeline, Oct 2011 to Now: ‘Spotify model’
(Diagram: mixed groups of 🐰, 🐧 and 🐼.)

18
The operational problem with product teams/squads
(Diagram, Oct 2011 to Now: items A to I.)

19
(Diagram, Oct 2011 to Now: items A to I, with J added next to C.)

20
(Diagram, Oct 2011 to Now: items A to I, with J next to C, plus Shared Infra.)

23
• Incident follow-up – let’s create a Jira board
• Cross-cutting / fall through the gaps – Engineering Excellence initiative
(Charts: Incident post mortem tickets, Not Done vs Done; Engineering Excellence tickets, Not Done vs Done.)

24
People
(Chart: Engineering team length of service. X-axis: length of service in years, in bands 0-1, 1-2, 2-3, 3-4, 4-5 and 5+. Y-axis: 0 to 9.)

25
People
(Chart: Incidents by month, Jan-15 to Aug-16. Y-axis: 0 to 14. Annotations: Acquisition, ‘cleanup’.)

26
(Chart: Engineering team length of service. X-axis: length of service in years, in bands 0-1, 1-2, 2-3, 3-4, 4-5 and 5+. Y-axis: 0 to 9.)
45% of engineers have never been involved in handling an operational incident

27
(Chart: Incident and EE tickets completed by job role. Series: Completed tickets, Engineers. Roles: Head of Engineering, Technical Architect, Senior Software Engineer, Software Engineer, Junior Software Engineer. Y-axis: 0 to 8.)

28
Current operational challenges
• Product team structure focused on moving forwards
– Velocity vs stability tension
• Cross-cutting tech and issues have no owner
• Lack of operational issue handling practice and experience
• Little investment in improving our availability and reliability through improved monitoring and automation
• Over-reliance on small subset of the engineering team
– Lack of opportunity for experience and growth for the rest
– Tacit knowledge not being shared / encoded

Site Reliability Engineer
• We are hiring into this role to focus exclusively on our
availability and reliability
• Will not be part of a product team
• Will spend at least 50% of their time writing code to
automate away operational burden and improve monitoring
• Will have power to fix things in your production systems if
you can’t/don’t
• Will own the maintenance and evolution of common runtime
infrastructure (e.g. haproxy, Tyk)
• Will help teams plan for production including capacity
planning, performance, architecture
• Will help us evolve operational processes and practices
• Is not ‘platform’ – not focused on developer effectiveness or
IT.
https://thebeamlyagency.bamboohr.co.uk/jobs/view.php?id=17

Being on call – current structure
Mon   Tue     Wed   Thu    Fri   Sat   Sun
Bob   Alice   Don   John   Zed   Joe   Joan
Third line teams
Engineering management

Being on call – new structure
Mon Tue Wed Thu Fri Sat Sun
Bob Alice
Third line teams
Engineering management
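The change in rota can be sketched in code. This assumes, as the new layout suggests but the slide does not spell out, that one engineer covers Monday to Friday and a second covers the weekend, rotating each week. The function name, the rotation rule and the example names are all hypothetical.

```python
from datetime import date, timedelta


def build_rota(engineers, first_monday, weeks):
    """Return {day: engineer} with one person per working week and one per weekend."""
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")
    if not engineers:
        raise ValueError("need at least one engineer")

    rota = {}
    count = len(engineers)
    for week in range(weeks):
        # Assumed rule: the weekend person takes the following working week.
        weekday_person = engineers[week % count]
        weekend_person = engineers[(week + 1) % count]
        monday = first_monday + timedelta(weeks=week)
        for offset in range(7):
            day = monday + timedelta(days=offset)
            rota[day] = weekday_person if offset < 5 else weekend_person
    return rota


# Example: three engineers over two weeks, starting Monday 3 October 2016.
example = build_rota(["Bob", "Alice", "Don"], date(2016, 10, 3), 2)
```

Compared with a different person every day, a week-long shift gives the on-call engineer enough continuity to pick up follow-up work between incidents.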

Being on call during the week == Site Reliability Engineer
• Handling incidents that occur
• Writing up incident post mortems
• Responding to any non-incident issues, e.g. automated warnings in the Slack #live-monitoring channel
• Picking up tickets outstanding from previous post mortems
• Picking up Engineering Excellence tickets, for example:
– Resolving issues/pain points through automation
– Improving documentation
– Improving alerting
• Improvements to common infrastructure/services
• Performing routine maintenance on common systems (e.g. HiveMQ upgrade)
• Expanding their knowledge of Beamly systems/architecture (e.g. performing chaos monkey tests)
• Working on technical debt within their own product team that is not specifically prioritised in that team’s own plans

Much more room for improvement
• Better measurement of availability/reliability
• Error budgeting
• More automation
• Continuous delivery
• Improved tooling
• Never ending…
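As a worked example of availability measurement and error budgeting: an availability target of 99.9% over a 30-day window allows roughly 43.2 minutes of downtime. The sketch below shows the arithmetic; the target, window and function names are illustrative assumptions, not figures from Beamly.

```python
MINUTES_PER_DAY = 24 * 60


def availability(downtime_minutes, window_days):
    """Fraction of the window during which the service was up."""
    total_minutes = window_days * MINUTES_PER_DAY
    return 1.0 - downtime_minutes / total_minutes


def error_budget_minutes(slo, window_days):
    """Downtime allowed in the window for a target such as 0.999."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def budget_remaining_minutes(slo, window_days, downtime_minutes):
    """Positive means budget is left; negative means the target was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Example: 99.9% target over 30 days, with 20 minutes of downtime so far.
remaining = budget_remaining_minutes(0.999, 30, 20)
```

A team with budget remaining can spend it on faster releases; a team that has overspent would prioritise reliability work, which speaks to the velocity vs stability tension raised earlier.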

34
Thank you. Questions?
@aspender https://linkedin.com/in/aspender
