Infrastructure Growing Pains by David Poblador i Garcia - DevOpsBCN - March 2024

DAVID POBLADOR I GARCIA. DEVOPS BCN. MARCH 2024
A PAINFUL MÉMOIRE OF DOS AND DON’TS IN 8 CHAPTERS
INFRASTRUCTURE
GROWING PAINS

RIGHT-SIZING AN INFRASTRUCTURE TEAM
IS A HARD PROBLEM

MAKING THE RIGHT INFRASTRUCTURE INVESTMENT
AT THE RIGHT MOMENT
IS AN IMPOSSIBLE PROBLEM

networks: @davidpoblador
email: david@poblador.com
ABOUT DAVID
(MANDATORY FOLLOW ME SLIDE)
@entredevyops
@2048enlla
@DevTunerHQ

- Thanks to Ignasi Fosch and Javi Arellano I got acquainted with the wonderful world of Linux in 1998
- Thanks to Albert Horta I started up an ISP in the middle of the .com bust. We survived. For 7 years
- I led infrastructure departments in several companies in BCN
- I wanted to learn the ropes of high up management and I became CTO at an online retail company
- I hated^H^H^H^H^H disliked it.
- I emigrated to Sweden
- The idea was to build the infra department at a small streaming company
- Emil Fredriksson promised it would be hands on
- It wasn’t. (exclusively)
- And I loved it. The gig lasted for 12 years, I was knackered
- I came back to the motherland.
- I started advising companies who wanted to scale. (You know… they thought money was free)
- I became a CTO for a VC firm, Secways
- Until a week ago!
- I am building something new to help engineers and teams to be productive without the usual corporate bullshit: DevTuner.
ABOUT DAVID

FOREWORD
(GO AND TELL YOUR BOSSES ABOUT THIS)

SOME OF MY
MISTAKES
LEARNINGS
(SO YOU CAN COME UP WITH YOUR VERY OWN)

Because naming your teams squads will not make your company the next Spotify.
Because using OKRs will not make your business the next Google.
Because having unlimited vacation will not make you become the next Netflix.
In the same way as creating a DevOps Engineer job title doesn't make you live by DevOps.
Cargo-culting must not prevent you from making mistakes.
THE MYTH OF THE GREENER GRASS

YAY/NAY
DECISIONS, DECISIONS…

MY ALERTING
DASHBOARD IS ALL RED
LET'S DOUBLE THE SIZE OF THE INFRASTRUCTURE TEAM

MY ALERTING
DASHBOARD IS ALL RED
LET'S GO TO THE BOTTOM OF ALERTS AND REMEDIATE SYSTEMIC PROBLEMS

- You have an idea
- You raised some money
- You have not proved the idea
- You spend too much time on Hacker News
OPPORTUNITY COST

LET’S BUILD A SUPER-SCALABLE SYSTEM
USING ALL THE MODERN PRIMITIVES
AUTO-SCALING
CONTINUOUS DEPLOYMENT
REQUEST TRACING
ALL THE COOL FRAMEWORKS
YOU NAME IT…
FOR YOUR… ZERO REAL USERS
BECAUSE THINGS NEED TO SCALE, RIGHT?

Most B2B ideas can run on one server. If they could in the early 2000s, they can do it today.
Every day you delay your MVP, you delay every subsequent iteration.
Have an inventory of every shortcut you’ve taken. Otherwise you’ll be complaining about technical debt 2 years from now.
Combine traits. Right balance between daydreaming and pragmatism. They don’t grow on trees.
FOCUS, FOCUS, FOCUS

CHAPTER 2
TIMES OF CHAOS AT A SWEDISH STREAMING COMPANY

- In 2011, we had an operations team and a backend infrastructure team
- We had fewer than 100 employees. Most of tech was in the same building (floor)
- The backend infra team was in charge of building everything connected with plumbing, service discovery, B2B comms, logging, messaging, building core systems, optimising everything
- They were 6 people.
- The operations team was in charge of rollouts, on-call, giving laptops to new employees, racking servers, installing switches, configuring BGP, signing contracts with new datacenter providers, being woken up every night, working 16 hours a day/night, policing (scarce) resources. And much more.
- They were 6 people.
- There were around 20 important systems. Each had an ops owner and a dev owner.
- Everyone was super busy.
- There were already a few million monthly active users.
- We were also building new features.
- Hiring like crazy!
- Systemic problems were not solved.
- Communication was broken: Backend infra felt they were interrupted. Operations felt they were unheard.
TIMES OF CHAOS AT A SWEDISH STREAMING COMPANY

OPS IN TROUBLE?
WRITE A PINK NOTE TO BACKEND
INFRA AND WAIT
YES, LIKE THIS ONE:

Hire people with the right mindset, someone who can show the value of constant communication. Start sharing some pain!
Find the right balance between gardening and landscaping.
OBSESSIVE SWAT TEAM TO THE RESCUE

CHAPTER 3
3X GROWTH IN ONE YEAR (MOST DIMENSIONS)

- 20 to 60 systems
- 100 to 300 employees
- 3X active users
- From 7 to 20 teams (squads)
- In one year, however, we could only hire one systems engineer
- Teams had multiple bottlenecks
- Releasing something required a titanic effort
3X GROWTH IN ONE YEAR (MOST DIMENSIONS)

SYNCHRONISATION PROBLEMS?
PUT ONE PERSON FROM EACH TEAM
IN A 6-SEAT MEETING ROOM
YES, LIKE THIS ONE

Ask for help, you are not the first one suffering from a given problem.
Make sure you distribute operational responsibilities into teams.
It's not only about making teams feel the pain. It's also about allowing them to fly solo!
OPERATIONS IN SQUADS

CHAPTER 4
SPLITTING WORK AMONG 100 INFRA PEOPLE

- Alright, by now teams feel the pain, but who does "operations"?
- Who owns "the service being up"?
- Who owns "the service being down"?
- Who owns cross-cutting workflows (provisioning, capacity planning, monitoring)?
- Who owns "Architecture"?
- Conventions? Best practices? Consistency?
- Onboarding?
- Procurement?
- Security?
- ...
- (Difficult to talk about this without looking like an old-fashioned gatekeeping sysadmin)
SPLITTING WORK AMONG 100 INFRA PEOPLE

YOU KNOW WHAT, IT DOESN'T MATTER...
EACH FEATURE TEAM WILL OWN THEIR
INFRA, AND WE DON'T CARE ABOUT
CONSISTENCY
WE ARE SMART ENGINEERS ANYWAY, AREN'T WE?

Make a list of problems faced by the average team. Factor out what's common. Find a sensible split.
Treat each space as its own "product".
Each product gets its own team, PM, backlog, planning, customer interviews. Yeah, like a real product.
Make those teams autonomous!
INFRASTRUCTURE AS A PRODUCT ORG

CHAPTER 5
CAPACITY PLANNING? NOPE... WAITING FOR CONSTRUCTION

- It turns out there is a very thin line between doing capacity planning for backend services and becoming a real estate planner... when you grow fast.
- Large parts of your attention and energy go to a set of problems far from your business...
CAPACITY PLANNING? NOPE... WAITING FOR CONSTRUCTION

LET'S RUN SOME COMPUTE IN THE ☁
AND SLOWLY PORT EVERY COMPONENT
AWAY FROM THE DATACENTER
BECAUSE WE LOVE HYBRID ARCHITECTURES, RIGHT?

If you make a move to remove distractions, the end game must not be more distracting than the original situation. Bet, or don't, but don't half-arse it.
When you do a major infrastructure shift, opportunity cost can kill you. Netflix, Dropbox, Twitter… all of them know about this.
MAKE BOLD DECISIONS, LIMIT HYBRID ENVIRONMENTS

CHAPTER 6
WHAT DO WE DO WITH ALL THE INFRA PEOPLE? IDENTITY CRISIS

- The traditional systems owned by each infrastructure team are not as cool as what's out there.
- It doesn't make sense to replicate functionality that is available in the cloud.
- Technical debt prevents a "real" cloud workload.
- "What's my job now here?"
- Teams building user-facing features are lagging behind a blessed stack.
WHAT DO WE DO WITH ALL THE INFRA PEOPLE? IDENTITY CRISIS

WE DON'T NEED INFRA PEOPLE
THEY WILL NEED TO FIND
ANOTHER TEAM
BECAUSE SOMEONE IN FINANCE READ
CLOUD PROVIDERS ARE THE MODERN SOFTWARE OUTSOURCING COMPANIES, RIGHT?

There is probably no one in your org who knows how the sausage is made as well as your infra people do.
Years of technology will forcefully require heavy alignment.
There are plenty of higher level abstractions you have not paid attention to because you were too busy stocking up on SSD drives.
It’s time to start encoding your conventions in your infrastructure layer.
INFRA PEOPLE ARE BEST-IN-CLASS TECHNOLOGY AMBASSADORS

CHAPTER 7
WAIT, DO WE REALLY NEED TO REINVENT THE WHEEL?

- As alignment improves, bespoke solutions make less sense
- Higher order infrastructure problems become commodity (containers, orchestration, monitoring, distributed databases)
- Cloud providers integrate lots of those products "for free" (ha!)
- The cost of building some of those components in-house is difficult to calculate. In a cloud invoice, everything is much clearer (ha!)
- The higher order primitives become messy, it's difficult to understand how pieces fit together.
- Failure domains are impossible to reason about.
WAIT, DO WE REALLY NEED TO REINVENT THE WHEEL?

WE CAN'T MAKE SENSE OF THE
ECOSYSTEM ANYMORE
LET'S DOUBLE THE SIZE OF THE TEAM
BECAUSE MONEY IS^H^H WAS FREE. AND BECAUSE ONBOARDING IS CHEAP. RIGHT?

Managed services must honour some parameters: no data lock-in, based on standard formats, etc.
Do not underestimate the future costs of price increases, or architectural revamps.
Have a well represented group that tracks architectural decisions. They must not be gatekeepers. They own ensuring the strategy is spread, understood, shared and evolved by everyone.
TWO-WAY DOOR DECISIONS, ALWAYS

CHAPTER 8
OH RIGHT, WE'VE LOST THE LEVERAGE

- We don't own infra.
- We don't run infra.
- We forgot we knew how to build infra.
- When people build infra, they don't dare to say they have built infra.
- Some senior people spend too much time on Hacker News.
- Wait, can we really run this cheaper? — says your newly hired VP from BigCorp, Inc.
- But that’s going to make our hands dirty, won’t it?
OH RIGHT, WE'VE LOST THE LEVERAGE

MULTICLOUD
WILL SOLVE
ALL MY
BECAUSE WE HAVE NOT LEARNED ANYTHING
ABOUT THE COSTS OF HYBRID SYSTEMS,
RIGHT?

Decide carefully which battles you want to pick.
A cheap service used by a few teams in very different ways? A bad choice!
An expensive service used by many in a limited number of ways? You can save millions.
You can still build infra. Cloud shines at commodity services. But cloud providers fund those investments with higher order services.
YOU CAN STILL BUILD INFRA

THERE IS ONE RIGHT MOMENT
FOR (ALMOST) EVERY INFRASTRUCTURE
INVESTMENT
(AND MANY WRONG MOMENTS)

DON’T FEEL BAD IF YOU GET IT WRONG
MOST SUCCESSFUL PRODUCTS WE USE WOULDN’T
EXIST WITHOUT SUCH POORLY TIMED DECISIONS

- We have spent many years optimising for the real-time use cases. We forgot about the batch compute use case.
- Big chunks of compute are becoming batch (again)
- This will create space for new "cloud providers"
- This will force us to develop new ways to do resource management
- Lots of software and infrastructure powering AI models needs to be rewritten
- There is a surprising amount of technical debt
- It's time we bring to the table a lot of our treasured knowledge about "reproducible" infrastructure into the new primitives.
- We will have a job, but we must escape the comfort zone.
- We've done it at least twice in 20 years, this will be the third time.
ARE WE GETTING REPLACED BY AI?
(YOU THOUGHT I WAS TOO OLD TO RIDE ON THE BUZZWORD… NOPE!)

DOES EVERYTHING WE DO NEED TO BE BIG?

THANK YOU

ONE MORE THING
IF YOU WANT TO BE AMONG THE FIRST
TO TRY DEVTUNER…
WE HAVE A WAITING LIST
