A field guide to the Financial Times, Rhys Evans, Financial Times
1. A field guide to the
Financial Times
Rhys Evans
Principal Engineer, Financial Times
@wheresrhys
2. @wheresrhys
Who I am
● Worked in tech 10+ years
● Gradually moved into tooling
● Co-lead the FT’s Reliability
Engineering team
● Lifelong birdwatcher
3. @wheresrhys
What is a field guide
From Wikipedia:
A book designed to help the
reader identify wildlife (plants
or animals) or other objects of
natural occurrence (e.g.
minerals).
4. ● Why the FT needs a
field guide
● Organising our guide
with neo4j and
GraphQL
● Filling in the details
12. @wheresrhys
“A tool dating from before the
trees that built the ark. Unowned,
unknown, and worth £250k of
business. One day it fell over. We
found docs dated 1999... which
helped”
Greg Cope, Tech Director, FT
16. @wheresrhys
Microservices
FT were early adopters of microservices
architecture
Lots of independently deployed services make it easier to
● Pick the right tool for the job
● Release and iterate
● Replace and decommission
18. @wheresrhys
OUT | IN
Data Centre | Your favourite cloud
‘The FT Platform’ | Pick your own SaaS
Java, Java, Java | I hear Rust’s good...
Ivory tower | What works
19. @wheresrhys
“The upside of this is teams, left
to their own devices, and trusted
to make responsible decisions will
choose what is best for
themselves and the business in
the long-term.”
Matt Chadburn
http://matt.chadburn.co.uk/notes/teams-as-services.html
21. @wheresrhys
Legacy is sooner than you think
● All images appearing on our websites relied on
1 person... who left
● A vanity url service built by a feature team that
disbanded shortly after
● Part of our membership platform built in a
niche language
● And many, many more
22. @wheresrhys
5 years is a long time in tech
Long enough for
● Shiny new things to become legacy
● Budgets and business priorities to move on
● People to leave
23. @wheresrhys
In summary
● Have to keep lots of tech ticking over
● Generating more new stuff than ever before to
keep track of
● Liberalising the tech department leads to
ownership & maintenance problems
Need a field guide to help us navigate the space
27. @wheresrhys
3 priorities to improve reliability
● Reaffirm who owns the various bits of FT tech
● Improve information about what is actually
running and why
● Determine what state it’s in at any given time
28. @wheresrhys
Who is our audience?
Operations team
● Active 24/7
● Broad knowledge of our tech platforms
● Need to know which approaches can be
applied to incident X
● If nothing works, who to call
29. @wheresrhys
Not the first attempt
CMDB versions 1 - 3 were:
● Too inert - Enter once and forget about it
● Too brittle - Chains of responsibility easily lost
● Too discrete - Hard to make important
connections
30. @wheresrhys
Who can help me with system X?
● The natural question to ask when addressing a
problem
● Links between people and things dotted all
over our previous CMDBs
● Intuitive but brittle
31. @wheresrhys
Relational databases constrain
● Hard to connect data, so we end up with overly
simplified models of reality
● Several degrees of separation are modelled as
a single systemOwner field
● Simple, but inaccurate and hard to maintain
32. @wheresrhys
Graph databases liberate
● Designed to model complex relationships
● No need to simplify and abstract away details
that actually matter
● If person X is a stakeholder via 4 degrees of
separation, represent them as such
33. @wheresrhys
A graph restatement of the
problem
‘How can I ensure systems are assigned to the
right people’
→
‘How can I ensure systems are connected
somehow to the right people’
36. @wheresrhys
When systems are created we:
● Pick a unique, human readable code
● Kill infrastructure not tagged with it
● In our graph, the System record must be
connected to a Team
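The creation rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the code format, the `systemCode` tag name, the `Resource` shape and the team name are all invented here, not taken from Biz Ops.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Assumed format: lowercase words joined by hyphens, e.g. "biz-ops-api".
SYSTEM_CODE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


@dataclass
class Resource:
    """A piece of infrastructure and its tags (hypothetical shape)."""
    name: str
    tags: Dict[str, str] = field(default_factory=dict)


def validate_new_system(code: str, team: Optional[str], known_teams: Set[str]) -> None:
    """Reject a System record with a bad code or no connection to a known Team."""
    if not SYSTEM_CODE.match(code):
        raise ValueError(f"{code!r} is not a valid system code")
    if team is None or team not in known_teams:
        raise ValueError(f"system {code!r} must be connected to a known team")


def untagged(resources: List[Resource], known_codes: Set[str]) -> List[Resource]:
    """Return infrastructure whose systemCode tag is missing or unrecognised."""
    return [r for r in resources if r.tags.get("systemCode") not in known_codes]


teams = {"reliability-engineering"}  # invented team code
validate_new_system("biz-ops-api", "reliability-engineering", teams)  # passes silently

fleet = [
    Resource("api-server", {"systemCode": "biz-ops-api"}),
    Resource("mystery-box"),  # no tag: a candidate for removal
]
print([r.name for r in untagged(fleet, {"biz-ops-api"})])
```

The point of the sketch is that both checks happen at creation time, so the System to Team link never has to be reconstructed later.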
38. @wheresrhys
Our tech connected to
● Stable, manageable subdivisions of the
organisation
● Tech director who is ultimately responsible
On top of this stable foundation we can add the
more ephemeral things
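A minimal sketch of why this chain is cheap to maintain: each record holds a single link to the more stable entity above it, and the responsible person is found by walking the chain. The record identifiers below are invented, and a plain dictionary stands in for the neo4j graph.

```python
from typing import Dict, List

# One outgoing "responsibility" edge per record. All names are invented.
RESPONSIBLE: Dict[str, str] = {
    "system:biz-ops-api": "team:reliability-engineering",
    "team:reliability-engineering": "group:operations-and-reliability",
    "group:operations-and-reliability": "person:example-tech-director",
}


def chain_of_responsibility(start: str, edges: Dict[str, str]) -> List[str]:
    """Follow edges from `start` until a record with no outgoing edge is reached."""
    path = [start]
    seen = {start}
    while path[-1] in edges:
        nxt = edges[path[-1]]
        if nxt in seen:
            raise ValueError(f"cycle detected at {nxt!r}")
        path.append(nxt)
        seen.add(nxt)
    return path


print(" -> ".join(chain_of_responsibility("system:biz-ops-api", RESPONSIBLE)))
```

If the tech director changes, only the one group-level edge is updated; no System record needs touching.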
41. @wheresrhys
Data warehouse free
● Self-service
● No such thing as a power user
● Extensible
● API first, but UI a close second
42. @wheresrhys
Some poor API options
REST API
● OK when fetching a single record type
● Painful to traverse
‘Canned query’ endpoints
● Less generic
● Limited by our imagination
43. @wheresrhys
GraphQL to the rescue
“GraphQL is a query language for
APIs [...] gives clients the power to
ask for exactly what they need [...]
not just the properties of one
resource but also smoothly follows
references between them”
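To make "smoothly follows references" concrete, here is a toy resolver in Python. It is not a GraphQL implementation, only a sketch of the idea: the caller names the fields it wants and the resolver follows references between records in one pass. The record contents and the field names `deliveredBy` and `techDirector` are assumptions made for the example.

```python
from typing import Any, Dict, Optional

Selection = Dict[str, Optional[dict]]

# In-memory stand-in for the graph. All records and field names are invented.
RECORDS: Dict[str, Dict[str, str]] = {
    "System:biz-ops-api": {
        "name": "Biz Ops API",
        "deliveredBy": "Team:reliability-engineering",
    },
    "Team:reliability-engineering": {
        "name": "Reliability Engineering",
        "techDirector": "Person:example.director",
    },
    "Person:example.director": {"name": "Example Director"},
}


def resolve(record_id: str, selection: Selection, records: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Return only the requested fields, following references for nested selections."""
    record = records[record_id]
    result: Dict[str, Any] = {}
    for field_name, sub_selection in selection.items():
        value = record[field_name]
        if sub_selection is None:
            result[field_name] = value  # plain property
        else:
            result[field_name] = resolve(value, sub_selection, records)  # reference
    return result


# Roughly what a GraphQL query of this shape would ask for:
#   { name deliveredBy { name techDirector { name } } }
query: Selection = {
    "name": None,
    "deliveredBy": {"name": None, "techDirector": {"name": None}},
}
print(resolve("System:biz-ops-api", query, RECORDS))
```

With plain REST endpoints the same answer would take three separate requests, one per record type, with the client doing the traversal.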
46. @wheresrhys
GraphQL big wins
● User friendly: Single, grokable query to get
unlimited connected info
● Future proof: Mirrors the neo4j graph as its
complexity grows
● More efficient: Fewer API calls and fewer and
faster DB calls
47. @wheresrhys
Pitfalls of GraphQL
● Hungry users: Allows unwitting construction
of very expensive queries
● Caching: Not obvious what caching behaviour
to implement
● To write or not to write: Not persuaded to
move away from REST yet
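One common guard against the "hungry users" pitfall is to cap how deeply a query may nest before running it. The talk does not say whether Biz Ops does this, so treat the following as a generic sketch: queries are modelled as nested dictionaries and the limit of 3 is an arbitrary assumption.

```python
from typing import Dict, Optional

Selection = Dict[str, Optional[dict]]


def selection_depth(selection: Optional[Selection]) -> int:
    """Depth of a nested selection: a flat list of fields has depth 1."""
    if selection is None:
        return 0
    return 1 + max((selection_depth(sub) for sub in selection.values()), default=0)


def check_depth(selection: Selection, max_depth: int = 3) -> int:
    """Raise if the query nests deeper than max_depth, otherwise return its depth."""
    depth = selection_depth(selection)
    if depth > max_depth:
        raise ValueError(f"query depth {depth} exceeds limit {max_depth}")
    return depth


cheap = {"name": None, "deliveredBy": {"name": None}}
greedy = {"a": {"b": {"c": {"d": {"name": None}}}}}

print(check_depth(cheap))
try:
    check_depth(greedy)
except ValueError as err:
    print(err)
```

Depth is a crude proxy for cost; a production limit would more likely weigh fields by how many records they can return.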
52. @wheresrhys
In summary
● Some confidence that Biz Ops won’t degrade
into a data graveyard
● Unlimited access to data for any person or
machine
But is the data actually any good?
54. @wheresrhys
Not the first attempt
CMDB versions 1 - 3 were:
● Too inert - Enter once and forget about it
● Too brittle - Chains of responsibility easily lost
● Too discrete - Hard to make important
connections
56. @wheresrhys
Automate
● Machines don’t forget to update information
● Restrict write access for certain records/types
to privileged clients
○ people-api → Writes details of FT staff
○ github-importer → Writes details of repositories
○ …
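The write restriction can be pictured as a lookup from record type to the clients allowed to write it. The client names `people-api` and `github-importer` come from the slide; the record type names and the rule that unlisted types are open to any client are assumptions made for this sketch.

```python
from typing import Dict, Set

# Record types locked to a privileged client. Type names are assumed.
WRITE_PERMISSIONS: Dict[str, Set[str]] = {
    "Person": {"people-api"},
    "Repository": {"github-importer"},
}


def can_write(client_id: str, record_type: str,
              permissions: Dict[str, Set[str]] = WRITE_PERMISSIONS) -> bool:
    """True if the client may write records of this type."""
    allowed = permissions.get(record_type)
    if allowed is None:
        return True  # assumption: unrestricted types stay self-service
    return client_id in allowed


print(can_write("people-api", "Person"))
print(can_write("github-importer", "Person"))
print(can_write("some-engineer", "System"))
```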
61. @wheresrhys
Not just visual design
● Understand your users
● Uncover sources of friction
● Learn about their existing/ideal workflow
● Don’t expect them to come to you
● “Good design is invisible”
62. @wheresrhys
Example: runbook authorship
● System source code changes in GitHub,
● But runbook authorship in Biz Ops
● Bound to get out of step
● What if they happened concurrently?
63. @wheresrhys
Example: runbook authorship
● Runbooks written in RUNBOOK.md with front
matter metadata
● Content pulled into Biz Ops when production
code release detected
● GitHub PR integrations to follow
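As an illustration of the RUNBOOK.md idea, the sketch below splits a file into front matter metadata and Markdown body. Front matter is normally YAML; to stay dependency-free this version only understands flat `key: value` lines, and the keys in the sample (`systemCode`, `serviceTier`) are invented rather than taken from the FT's format.

```python
from typing import Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split '---' delimited front matter from the body of a Markdown file."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter at all
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # opening delimiter never closed
    metadata: Dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


SAMPLE = """---
systemCode: biz-ops-api
serviceTier: gold
---
# Biz Ops API

What to check first when the API is unhealthy.
"""

meta, body = split_front_matter(SAMPLE)
print(meta)
print(body.splitlines()[0])
```

Keeping the metadata next to the code means a release can trigger the import, so the runbook and the system change together.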
64. @wheresrhys
Beyond operational info
● Underpinning how we handle GDPR requests
● Quicker triaging of security incidents
● Integrating with leavers process
More benefits → more incentives to improve data
71. Model the stable stuff first
UX and other
feedback loops
can keep it fresh
72. Thank you
The team:
Geoff Thorpe, Laura Carvajal, Charlie Briggs,
Katie Koschland, Simon Legg, Maggie Allen,
Courtney Osborn, Kat Downes, Sentayhu
Mekoonnali, David Balfour
Images from: https://www.audubon.org/birds-of-america/
@wheresrhys
www.ft.com/dev/null
Editor's Notes
Flagship website - ft.com
Diverse range of websites
A range of tech not seen much externally
Print & distribute 6 days a week
If this weren’t enough, we are at the whims of a fickle news cycle
It’s a lot to keep an eye on
Key phrase: unowned and unknown
There was a conscious movement away from this centralised approach as it was failing to deliver
2 responses emerged at around the same time
Side effect of this means instead of one big thing you have many little things to look after
Rather than build one big thing, build lots of little things
Freeing up teams to choose what they need to deliver value for the business quickly
Draw particular attention to long-term
Left running the stuff that they built
Quickly find yourself in a position where you have some legacy system nobody is looking after
When you liberalise this WILL happen
Even if not liberalised, these facts are still true
If you recognise any of the problems above in your own organisation, maybe some of our solutions can inspire you
New team set up
I’ll talk mainly about the first 2, but we’ll touch on the third as well
It needed a rethink
Hard to track movement of people as they move a lot
Systems connected directly to a variety of things - idiomatic in a relational data store to have few degrees of separation, because it creaks under more complex relationships
Show diagram of the old model mapped to neo4j
No longer have to have a model that ‘leaks’ the choice of DB out
Make sure each system is connected to the graph
This doesn’t solve the problem by itself
Ultimately people move on - THAT is the problem. Neo4j allows us to connect to better behaved entities, such as teams, and from there connect to people
Now can concentrate on the relationships that matter, not the relationships that are easy
Explain the direct connections to tech director mean lots of records need maintaining, but with graph only one link
We can stop the battle of attrition
When systems are created we
Enforce assigning a unique, human readable code, to the infrastructure e.g. biz-ops-api
In our graph, the System record must be connected to a Team
Teams are relatively few, their hierarchy easily maintained, and ultimately lead to a Tech Director
Fixed -/. Less fixed
Inaccurate data that’s waiting to happen
Start with things you know you can maintain
Poorly maintained ACCURATE data will become inaccurate data
Compare to previous problem… rather than...nt that we won’t lose track of the critical stuff
With system -> Team as the core datum (ENFORCED ON CREATION, and cannot create infrastructure without a system code) we can build on top of it
Special people relationships e.g. technicalOwner still exist, but the responsibility clearly lies with the team to find a new person
Cost attribution
System -> team -> group -> tech director is the critical path
BUT clear responsibility doesn’t necessarily mean well maintained - we are all busy people
Lots of connections between people and systems
Who wants to know about GDPR & this system - HAS_DATA_OWNER
List of lists
Can piggyback on that chain of responsibility
Amazing what interesting connections you find
Any query goes
Extensible without needing lots of dev work
Talk about componentisation, origami etc
But with this richer, more democratised and extensible data set, the hope is that we will store more connected data able to answer more and more of the questions the business wants to answer
How can we open up access to the data and stop our team being a bottleneck?
Examples
Simple REST endpoints and expect users to traverse themselves? Bad for users (complex) and bad for us (load). Canned queries help, but begin to add opinions, and favour the interactions we can imagine now, not what people may want in the future
Perfect - ask for things and the things they’re connected to
E.g. if the query is simple, a little caching is probably fine. Otherwise it is far less obvious what keys to cache on, and for how long
DB & API can grow organically…
...but our users want a UI
Which must similarly be able to grow without our team becoming a bottleneck
With graphQL as the foundation, we’ve extended the schema to create an entire read/write ecosystem for this data:
GraphQL = name, description, type
Biz Ops = name, description, type, label, isSearchable, required…
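A rough sketch of what "extending the schema" could look like: one richer definition per record type, from which the GraphQL type is generated while the extra attributes (`label`, `isSearchable`, `required`) drive the UI and search. The attribute names come from the notes above; the dictionary layout, the sample System type and both helper functions are assumptions, not the Biz Ops implementation.

```python
from typing import Any, Dict, List

# Hypothetical Biz Ops style definition, written as a Python dict for brevity.
SYSTEM_SCHEMA: Dict[str, Any] = {
    "name": "System",
    "description": "A piece of software run by the FT",
    "properties": {
        "code": {"type": "String", "description": "Unique, human readable code",
                 "label": "Code", "isSearchable": True, "required": True},
        "name": {"type": "String", "description": "Display name",
                 "label": "Name", "isSearchable": True, "required": False},
        "description": {"type": "String", "description": "What the system does",
                        "label": "Description", "isSearchable": False, "required": False},
    },
}


def to_graphql_type(schema: Dict[str, Any]) -> str:
    """Emit a GraphQL type definition using only name, description and type."""
    lines = [f'"""{schema["description"]}"""', f'type {schema["name"]} {{']
    for prop_name, prop in schema["properties"].items():
        suffix = "!" if prop.get("required") else ""  # required maps to non-null
        lines.append(f'  "{prop["description"]}"')
        lines.append(f'  {prop_name}: {prop["type"]}{suffix}')
    lines.append("}")
    return "\n".join(lines)


def searchable_fields(schema: Dict[str, Any]) -> List[str]:
    """Fields a search index would cover, driven by the isSearchable flag."""
    return [n for n, p in schema["properties"].items() if p.get("isSearchable")]


print(to_graphql_type(SYSTEM_SCHEMA))
print(searchable_fields(SYSTEM_SCHEMA))
```

Because every layer reads the same definition, adding a property is a schema change rather than a code change.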
Use ES, but neo4j should be our search DB soon too
Some people don’t like yaml, because some people are wrong
Don’t let any of the code in any layers be opinionated
Take what’s given, apply generic rules
Data & schema driven
Mention mobile friendly
No downer on liberalisation
Would never have happened under central planning
This is what the cool kids are calling it
Tackled brittle & discrete, but not inert yet
Accurate data is still bad data if you have no confidence in how current it is e.g. misleading confidence ‘don’t know what you don’t know’
But any people problem shouldn’t be attributed to human error (https://www.outcome-eng.com/human-error-never-root-cause/)
We arrive back at tech or process to fix what’s wrong
No such thing as human error
There is a source of truth we can rely on for current information, and biz ops to make the right connections
Provide tangible benefits
Data correction journey - link to restricted form
Show good dashboards
Getting good quality data is rarely purely a technology problem
Systems don’t forget to update data, _people_ forget to update data
Visibility, easy wins, Natural catalyst
On a public website we work with UX to drive up conversions. Why not on an internal site to drive up ‘behaviour conversions’?
UX = tech x 10
Refine the solution so that people can be successful in doing what you want them to do
And this is for, what, a documentation site? Roll over Confluence and GitHub
If the tools you provide are a pleasure to use, people warm to the task
Invisibility can apply to workflow
We as engineers should think of more invisibility
Runbook = pages of the fieldguide
We are persisting in making biz-ops the default choice of data store. The more types of data it contains, the more useful connections can be made, and the more powerful it becomes.
Within 3 months of building the platform, which is naturally extensible, it’s already starting to snowball and we are unable to keep up with demand
Bringing forward features such as self-deploying schema updates to remove us as a bottleneck
Obviously, try to represent _some_ detail - don’t represent everything as a single amorphous blob -
but as soon as you have doubts about how easy it will be to maintain the data, step back to a less granular level
A mistake previous incarnations had made was to model what we want to know, regardless of what we can realistically maintain. Misleading in the end
Poorly maintained ACCURATE data will become inaccurate data