A field guide to the
Financial Times
Rhys Evans
Principal Engineer, Financial Times
@wheresrhys
Who I am
● Worked in tech 10+ years
● Gradually moved into tooling
● Co-lead the FT’s Reliability
Engineering team
● Lifelong birdwatcher
What is a field guide
From Wikipedia:
A book designed to help the
reader identify wildlife (plants
or animals) or other objects of
natural occurrence (e.g.
minerals).
● Why the FT needs a
field guide
● Organising our guide
with neo4j and
GraphQL
● Filling in the details
Why the FT
needs a
field guide
Insert non dramatic screenshot
“A tool dating from before the
trees that built the ark. Unowned,
unknown, and worth £250k of
business. One day it fell over. We
found docs dated 1999... which
helped”
Greg Cope, Tech Director, FT
Starting about 5 years ago, the
range of tech we have to support
exploded
Previously
Centralised decision making
Monolithic architectures
Data centres
Infrequent releases
Move slow
and achieve little
Microservices
FT were early adopters of microservices
architecture
Lots of independently deployed services make it easier to
● Pick the right tool for the job
● Release and iterate
● Replace and decommission
Liberalisation
Matt Chadburn
http://matt.chadburn.co.uk/notes/teams-as-services.html
“[...] follow the mechanics of free-
market economy. Teams are allowed
and encouraged to pick the best value
tools for the job at hand”
OUT                  IN
Data Centre          Your favourite cloud
‘The FT Platform’    Pick your own SaaS
Java, Java, Java     I hear Rust’s good...
Ivory tower          What works
“The upside of this is teams, left
to their own devices, and trusted
to make responsible decisions will
choose what is best for
themselves and the business in
the long-term.”
Matt Chadburn
http://matt.chadburn.co.uk/notes/teams-as-services.html
Build stuff and
disappear
Legacy is sooner than you think
● All images appearing on our websites relied on
1 person... who left
● A vanity url service built by a feature team that
disbanded shortly after
● Part of our membership platform built in a
niche language
● And many, many more
5 years is a long time in tech
Long enough for
● Shiny new things to become legacy
● Budgets and business priorities to move on
● People to leave
In summary
● Have to keep lots of tech ticking over
● Generating more new stuff than ever before to
keep track of
● Liberalising the tech department leads to
ownership & maintenance problems
Need a field guide to help us navigate the space
Unowned &
unknown
Owned &
known
Organising
our guide
with neo4j and
GraphQL
3 priorities to improve reliability
● Reaffirm who owns the various bits of FT tech
● Improve information about what is actually
running and why
● Determine what state it’s in at any given time
Who is our audience?
Operations team
● Active 24/7
● Broad knowledge of our tech platforms
● Need to know which approaches can be
applied to incident X
● If nothing works, who to call
Not the first attempt
CMDB versions 1 - 3 were:
● Too inert - Enter once and forget about it
● Too brittle - Chains of responsibility easily lost
● Too discrete - Hard to make important
connections
Who can help me with system X?
● The natural question to ask when addressing a
problem
● Links between people and things dotted all
over our previous CMDBs
● Intuitive but brittle
Relational databases constrain
● Hard to connect data, so get overly simplified
models of reality
● Several degrees of separation is modelled as
a systemOwner field
● Simple, but inaccurate and hard to maintain
Graph databases liberate
● Designed to model complex relationships
● No need to simplify and abstract away details
that actually matter
● If person X is a stakeholder via 4 degrees of
separation, represent them as such
A graph restatement of the
problem
‘How can I ensure systems are assigned to the
right people’
→
‘How can I ensure systems are connected
somehow to the right people’
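The restated question can be sketched as a graph traversal. The following is a minimal in-memory illustration in Python, not the FT's implementation: every node name, relationship type and the `max_hops` parameter are hypothetical, chosen only to show a person being found several degrees of separation away from a system.

```python
from collections import deque

# Hypothetical graph: node -> list of (relationship, neighbour) pairs.
# All names and relationship types here are illustrative assumptions.
GRAPH = {
    "System:biz-ops-api": [("DELIVERED_BY", "Team:reliability-engineering")],
    "Team:reliability-engineering": [
        ("HAS_MEMBER", "Person:alice"),
        ("PART_OF", "Group:operations-and-reliability"),
    ],
    "Group:operations-and-reliability": [("HAS_TECH_DIRECTOR", "Person:bob")],
}


def people_connected_to(system, graph=GRAPH, max_hops=4):
    """Breadth-first search returning {person: hops} within max_hops of system."""
    found = {}
    seen = {system}
    queue = deque([(system, 0)])
    while queue:
        node, hops = queue.popleft()
        if node.startswith("Person:"):
            found[node] = hops  # record the person, do not traverse beyond them
            continue
        if hops == max_hops:
            continue
        for _rel, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return found
```

Nobody is assigned to the system directly here; people are reached only through the team and its group, which is the point of the restatement.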
[Diagram: a System node surrounded by unknown (?) connections]
Model the stable
stuff first
When systems are created we:
● Pick a unique, human readable code
● Kill infrastructure not tagged with it
● In our graph, the System record must be
connected to a Team
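The three creation rules could be checked along these lines. This is a hedged sketch: the code format regex, the `deliveredBy` field and the `systemCode` tag name are assumptions made for illustration, not details given in the talk.

```python
import re

# Assumed format for a system code: lowercase words joined by hyphens.
SYSTEM_CODE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def validate_new_system(record, known_teams):
    """Return a list of problems; an empty list means the record may be created."""
    problems = []
    code = record.get("code", "")
    if not SYSTEM_CODE.match(code):
        problems.append(f"invalid system code: {code!r}")
    # 'deliveredBy' is a hypothetical name for the System -> Team link.
    if record.get("deliveredBy") not in known_teams:
        problems.append("system must be connected to a known Team")
    return problems


def untagged_infrastructure(resources, known_codes):
    """Names of resources whose systemCode tag is missing or unknown."""
    return [
        resource["name"]
        for resource in resources
        if resource.get("tags", {}).get("systemCode") not in known_codes
    ]
```

Anything returned by `untagged_infrastructure` would be a candidate for the "kill" rule; how that is enforced in practice is not covered by the slides.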
Our tech connected to
● Stable, manageable subdivisions of the
organisation
● Tech director who is ultimately responsible
On top of this stable foundation we can add the
more ephemeral things
BIZ-OPS MAN
● Self-service
● No such thing as a power user
● Extensible
● API first, but UI a close second
Data warehouse
free
Some poor API options
REST API
● OK when fetching a single record type
● Painful to traverse
‘Canned query’ endpoints
● Less generic
● Limited by our imagination
GraphQL to the rescue
“GraphQL is a query language for
APIs [...] gives clients the power to
ask for exactly what they need [...]
not just the properties of one
resource but also smoothly follows
references between them”
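To make "follows references between them" concrete, here is a sketch of a single query that walks from a system to its team, the team's group and the tech director. The type and field names (`System`, `deliveredBy`, `techLeads`, `group`, `techDirector`) are hypothetical and do not come from the real Biz Ops schema; the Python wrapper only builds the JSON body a GraphQL-over-HTTP client would typically send.

```python
import json

# Hypothetical schema: type and field names below are illustrative only.
QUERY = """
query SystemOwnership($code: String!) {
  System(code: $code) {
    code
    name
    deliveredBy {
      name
      techLeads { name email }
      group { name techDirector { name } }
    }
  }
}
"""


def build_request_body(code):
    """Serialise the query and its variables as a JSON request body."""
    return json.dumps({"query": QUERY, "variables": {"code": code}})
```

With a plain REST API the same information would need one request per hop, which is the traversal pain described on the previous slide.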
neo4j-graphql-js
● GraphQL normally talks to multiple APIs and
combines the results
● neo4j-graphql-js converts GraphQL queries
to Cypher, and talks to neo4j directly
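The idea of turning a nested selection into one Cypher statement can be illustrated with a toy translator. This is not neo4j-graphql-js and makes no claim about the exact Cypher that library emits; the function names, the input shape and the `DELIVERED_BY` relationship are all assumptions. It only shows how nesting can map onto Cypher map projections and pattern comprehensions so that one database call answers the whole query.

```python
def project(var, fields, rels):
    """Build a Cypher map projection for the variable `var`.

    fields: list of scalar property names.
    rels: dict of field name -> (relationship type, target label,
          scalar fields, nested rels), all hypothetical.
    """
    parts = [f".{name}" for name in fields]
    for field, (rel_type, label, sub_fields, sub_rels) in rels.items():
        child = f"{var}_{field}"
        inner = project(child, sub_fields, sub_rels)
        # A pattern comprehension collects the projected related nodes.
        parts.append(f"{field}: [({var})-[:{rel_type}]->({child}:{label}) | {inner}]")
    return f"{var} {{ {', '.join(parts)} }}"


def to_cypher(label, key, fields, rels):
    """Translate one root selection into a single parameterised Cypher query."""
    var = label.lower()
    return (
        f"MATCH ({var}:{label} {{{key}: ${key}}}) "
        f"RETURN {project(var, fields, rels)} AS {var}"
    )
```

For example, `to_cypher("System", "code", ["code", "name"], {"deliveredBy": ("DELIVERED_BY", "Team", ["name"], {})})` yields one `MATCH ... RETURN` statement containing a nested comprehension for the team.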
GraphQL big wins
● User friendly: Single, grokable query to get
unlimited connected info
● Future proof: Mirrors the neo4j graph as its
complexity grows
● More efficient: Fewer API calls and fewer and
faster DB calls
Pitfalls of GraphQL
● Hungry users: Allows unwitting construction
of very expensive queries
● Caching: Not obvious what caching behaviour
to implement
● To write or not to write: Not persuaded to
move away from REST yet
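One common guard against the "hungry users" pitfall is to cap how deeply a query may nest. The talk does not say whether Biz Ops does this, so the following is only a sketch of the general technique, and the default limit of 5 is an arbitrary assumption.

```python
def query_depth(query):
    """Maximum nesting depth of selection sets, counted by braces.

    Deliberately naive: it does not account for braces inside string
    literals, comments or input-object arguments, which a real GraphQL
    parser would handle.
    """
    depth = max_depth = 0
    for char in query:
        if char == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif char == "}":
            depth -= 1
    return max_depth


def check_query(query, max_depth=5):
    """Return the query depth, or raise ValueError if it exceeds max_depth."""
    depth = query_depth(query)
    if depth > max_depth:
        raise ValueError(f"query depth {depth} exceeds limit {max_depth}")
    return depth
```

Depth is a crude proxy for cost; a query can be shallow and still expensive, so a limit like this reduces rather than removes the risk.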
An extensible UI
#GRANDstack
GraphQL + React + Apollo + Neo4j Database
https://grandstack.io/
In summary
● Some confidence that Biz Ops won’t degrade
into a data graveyard
● Unlimited access to data for any person or
machine
But is the data actually any good?
Filling in the
details
Not the first attempt
CMDB versions 1 - 3 were:
● Too inert - Enter once and forget about it
● Too brittle - Chains of responsibility easily lost
● Too discrete - Hard to make important
connections
Don’t rely on good behaviour
● Automate
● More carrot, less stick
● Gamify
● UX
Automate
● Machines don’t forget to update information
● Restrict write access for certain records/types
to privileged clients
○ people-api → Writes details of FT staff
○ github-importer → Writes details of repositories
○ …
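Restricting writes by record type could look like the sketch below. The client names come from the slide; the record type names (`Person`, `Repository`) and the rule that unlisted types are open to everyone are assumptions for illustration.

```python
# Hypothetical mapping of record type -> the only clients allowed to write it.
RESTRICTED_TYPES = {
    "Person": {"people-api"},
    "Repository": {"github-importer"},
}


def can_write(client_id, record_type):
    """Unrestricted types are writable by any client; restricted ones only by listed clients."""
    allowed = RESTRICTED_TYPES.get(record_type)
    return allowed is None or client_id in allowed
```

Keeping the rule in data rather than code means a new automated importer only needs a new entry in the mapping.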
More carrot, less stick
Gamify
Teams respond
well to seeing how
they compare, and
how they can
improve
UX
Not just visual design
● Understand your users
● Uncover sources of friction
● Learn about their existing/ideal workflow
● Don’t expect them to come to you
● “Good design is invisible”
Example: runbook authorship
● System source code changes in GitHub,
● But runbook authorship in Biz Ops
● Bound to get out of step
● What if they happened concurrently?
Example: runbook authorship
● Runbooks written in RUNBOOK.md with front
matter metadata
● Content pulled into Biz Ops when production
code release detected
● GitHub PR integrations to follow
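Reading a RUNBOOK.md with front matter can be sketched as follows. This is a simplified stand-in for whatever the real importer does: it handles only flat `key: value` pairs between two `---` lines, and any field names used with it (for example `systemCode`) are hypothetical.

```python
def parse_runbook(text):
    """Split a RUNBOOK.md into (metadata dict, markdown body).

    Handles only flat `key: value` front matter between two `---` lines;
    a real importer would more likely use a YAML parser.
    """
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        return {}, text
    try:
        end = stripped.index("---", 1)  # position of the closing delimiter
    except ValueError:
        return {}, text
    metadata = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body
```

The metadata dict could then be written to the System record and the body stored as the runbook text when a production release is detected.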
Beyond operational info
● Underpinning how we handle GDPR requests
● Quicker triaging of security incidents
● Integrating with leavers process
More benefits → more incentives to improve data
What have
we learned
today?
Legacy code
comes to us all
Documented legacy
is good legacy
Graphs enable
more powerful
modelling
Using #GRANDstack
is like being the
film version of Mark
Zuckerberg
Your data won’t
update itself
UX and other
feedback loops
can keep it fresh
Thank you
The team:
Geoff Thorpe, Laura Carvajal, Charlie Briggs,
Katie Koschland, Simon Legg, Maggie Allen,
Courtney Osborn, Kat Downes, Sentayhu
Mekoonnali, David Balfour
Images from: https://www.audubon.org/birds-
of-america/
www.ft.com/dev/null


Editor's Notes

  • #7 Flagship website - ft.com
  • #8 Diverse range of websites
  • #9 A range of tech not seen much externally
  • #10 Print & distribute 6 days a week
  • #11 If this weren’t enough, we are at the whims of a fickle news cycle
  • #12 It’s a lot to keep an eye on
  • #13 Key phrase unowned and unknown
  • #16 There was a conscious movement away from this centralised approach as it was failing to deliver. 2 responses emerged at around the same time.
  • #17 Side effect of this: instead of one big thing you have many little things to look after. Rather than build one big thing, build lots of little things.
  • #18 Freeing up teams to choose what they need to deliver value for the business quickly
  • #20 Draw particular attention to long-term
  • #21 Left running the stuff that they built
  • #22 Quickly find yourself in a position where you have some legacy system nobody is looking after. When you liberalise this WILL happen.
  • #23 Even if not liberalised, these facts are still true
  • #24 If you recognise any of the problems above in your own organisation, maybe some of our solutions can inspire you
  • #28 New team set up. I’ll talk mainly about the first 2, but we’ll touch on the third as well.
  • #30 It needed a rethink
  • #31 Hard to track movement of people as they move a lot
  • #33 Systems connected directly to a variety of things - idiomatic in a relational data store to have few degrees of separation, because it creaks under more complex relationships. Show diagram of the old model mapped to neo4j. No longer have to have a model that ‘leaks’ the choice of DB out.
  • #34 Make sure each system is connected to the graph. This doesn’t solve the problem by itself. Ultimately people move on - THAT is the problem. Neo4j allows us to connect to better behaved entities, such as teams, and from there connect to people. Now can concentrate on the relationships that matter, not the relationships that are easy. Explain that direct connections to tech director mean lots of records need maintaining, but with a graph only one link. We can stop the battle of attrition.
  • #35 When systems are created we enforce assigning a unique, human readable code to the infrastructure, e.g. biz-ops-api. In our graph, the System record must be connected to a Team. Teams are relatively few, their hierarchy easily maintained, and ultimately lead to a Tech Director. Fixed vs. less fixed.
  • #36 Inaccurate data that’s waiting to happen. Start with things you know you can maintain. Poorly maintained ACCURATE data will become inaccurate data.
  • #39 Compare to previous problem… rather than... nt that we won’t lose track of the critical stuff. With System -> Team as the core datum (ENFORCED ON CREATION, and cannot create infrastructure without a system code) we can build on top of it. Special people relationships, e.g. technicalOwner, still exist, but the responsibility clearly lies with the team to find a new person. Cost attribution. System -> team -> group -> tech director is the critical path. BUT clear responsibility doesn’t necessarily mean well maintained - we are all busy people.
  • #40 Lots of connections between people and systems. Who wants to know about GDPR & this system - HAS_DATA_OWNER.
  • #41 List of lists. Can piggyback on that chain of responsibility. Amazing what interesting connections you find.
  • #42 Any query goes. Extensible without needing lots of dev work. Talk about componentisation, origami etc. But with this richer, more democratised and extensible data set, the hope is that we will store more connected data able to answer more and more of the questions the business wants to answer. How can we open up access to the data and stop our team being a bottleneck? Examples.
  • #43 Simple REST endpoints and expect users to traverse themselves? Bad for users (complex) and bad for us (load). Canned query endpoints help, but begin to add opinions, and favour the interactions we can imagine now, not what people may want in the future.
  • #44 Perfect - ask for things and the things they’re connected to
  • #48 E.g. if the query is simple, probably a little cache is fine. Far less obvious what keys to cache on, and for how long.
  • #49 DB & API can grow organically… ...but our users want a UI, which must similarly be able to grow without our team becoming a bottleneck. With GraphQL as the foundation, we’ve extended the schema to create an entire read/write ecosystem for this data: GraphQL = name, description, type; Biz Ops = name, description, type, label, isSearchable, required…. Use ES, but neo4j should be our search DB soon too. Some people don’t like yaml, because some people are wrong.
  • #50 Don’t let any of the code in any layers be opinionated. Take what’s given, apply generic rules. Data & schema driven. Mention mobile friendly.
  • #51 No downer on liberalisation. Would never’ve happened under central planning.
  • #52 This is what the cool kids are calling it
  • #55 Tackled brittle & discrete, but not inert yet. Accurate data is still bad data if you have no confidence in how current it is, e.g. misleading confidence, ‘don’t know what you don’t know’. But any people problem shouldn’t be attributed to human error: https://www.outcome-eng.com/human-error-never-root-cause/ We arrive back at tech or process to fix what’s wrong.
  • #56 No such thing as human error
  • #57 There is a source of truth we can rely on for current information, and biz ops to make the right connections
  • #58 Provide tangible benefits
  • #59 Data correction journey - link to restricted form. Show good dashboards. Getting good quality data is rarely purely a technology problem. Systems don’t forget to update data, _people_ forget to update data. Visibility, easy wins, natural catalyst.
  • #60 On a public website we work with UX to drive up conversions. Why not on an internal site to drive up ‘behaviour conversions’? UX = tech x 10. Refine the solution so that people can be successful in doing what you want them to do.
  • #61 On a public website we work with UX to drive up conversions. Why not on an internal site to drive up ‘behaviour conversions’? And this is for, what, a documentation site? Roll over Confluence and GitHub. If the tools you provide are a pleasure to use, people warm to the task.
  • #62 Invisibility can apply to workflow. We as engineers should think of more invisibility.
  • #63 Runbook = pages of the fieldguide
  • #65 We are persisting in making biz-ops the default choice of data store. The more types of data it contains, the more useful connections can be made, and the more powerful it becomes. Within 3 months of building the platform, which is naturally extensible, it’s already starting to snowball and we are unable to keep up with demand. Bringing forward features such as self-deploying schema updates to remove us as a bottleneck.
  • #67 Obviously, try to represent _some_ detail - don’t represent everything as a single amorphous blob - but as soon as you have doubts about how easy it will be to maintain the data, step back to a less granular level. A mistake previous incarnations had made was to model what we want to know, regardless of what we can realistically maintain. Misleading in the end. Poorly maintained ACCURATE data will become inaccurate data.