Considerations for creating, storing and trusting a unified business approach to data in a distributed environment, in order to prevent disjointed and competing views of business facts.
2. Sharing Data Is Caring Data
Is your data playing well with others?
3. Introduction / Motivation
20+ years experience in the industry
Working at Holiday Extras
Sharing data is not easy
Microservices playing nice with a data lake
We are still learning...
4. The monolith
Common in smaller organisations
Often seen as legacy or older tech
Unintended victims of their own success
Serving businesses well for years
15. Generating Even More Events!
[Diagram: four microservices, each with its own DB, each emitting events]
16. Positive Data Culture
What do we need to report on?
What state is changing?
Which business entities are involved?
How do we measure success?
Can this data be useful to others?
What future products could the data enable?
Hi, I’m Mark Terry.
I thought I’d talk about a recent blog post on how we are sharing data at Holiday Extras between the various parts of the business and teams, allowing us to implement a growing number of microservices and still maintain a usable data warehouse.
For these slides I’ll be focusing on the implementation detail of how we are currently doing this.
But we are still learning here, and still making improvements in this space.
So this is where we started, years ago. Much like other companies I’ve worked in.
Companies either let them be or spend years moving away from them; generally they work well, so they are hard to just kill.
Monoliths do get a fair bit of bad press, but in data terms things are ok so far...
Probably the simplest diagram I’ve put on a slide.
Generally monolithic apps are paired with a large datastore too. This was the case in several places I’ve worked.
Things are great data-wise, as there is a single place where engineers can store data, and no one needs to think about differing standards or schemas as it can be tightly coupled with the app.
From an engineering point of view this could be seen as a negative but from a data view often the data is just appended to in whatever format is already there. Consistency wins here.
Over time this DB is also used for serving reporting to the business, and there might be some simple admin screens to give some insight into the data contained within it.
One source of truth of the business data.
One place to go to find the numbers.
These databases often creak under the strain of needing to be quick for the application while also containing enough data for good reporting (pick one).
Enter the world of microservices.
Often there are reasons to break a monolith down into smaller services. Those reasons are a whole other talk but mostly relate to developer experience or deployment cadence.
A common pattern here is to identify components inside the larger app and move these out into their own service.
The new service will still use the original datastore, to limit the amount of refactoring required at each step. This is a good example of not thinking about data first.
After several services are broken out of the larger app, you end up with this architecture anti-pattern, the monolithic datastore.
Multiple distinct services still sharing the same datastore.
We had this problem at Holiday Extras. You will not be affected by it immediately, but it will get you soon enough.
This couples the data of services together so database schema changes require complicated deploys.
Services can access and update data without going through the advertised interfaces of the services, making it harder to cache and to identify sources of truth.
Microservices should each have their own operational datastore that is only accessible by that service. The data stored there relates to the function that the service provides.
Data might be duplicated across the different stores, and the technology and formats may differ.
The sole access to data is via the service’s advertised interface.
Operationally things are great at this point, but our precious data is locked away in many databases; reporting and sharing are going to be much harder now.
We go through a process of identifying what business entity a service changes and we have that service emit an event when this happens.
An event is a payload describing something that has happened. For example a new customer account has been created or a booking has been made.
If there are multiple services that perform similar tasks, then similar events should be sent, for example when you have two booking systems.
These events are the key to sharing business data, they serve as an abstraction layer from the implementation detail in a service.
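A minimal sketch of the kind of event described above: a small envelope around a business-entity state change. The event type, entity name, and payload fields here are illustrative assumptions, not Holiday Extras' actual schema.

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type, entity, payload):
    """Wrap a business-entity state change in a self-describing envelope."""
    return {
        "id": str(uuid.uuid4()),          # unique id, useful for de-duplication
        "type": event_type,               # e.g. "booking.created" (assumed name)
        "entity": entity,                 # business entity the event concerns
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,               # detail, decoupled from implementation
    }

event = make_event("booking.created", "booking",
                   {"booking_ref": "ABC123", "total_pence": 4999})
print(json.dumps(event, indent=2))
```

The envelope fields stay stable even if the booking service's internal schema changes, which is what gives the abstraction layer mentioned above.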
Now the hardest part of this whole process.
Schema’ing!
Deciding what makes up an event.
When you go through this process, even the smallest of points will take time. You’ll be surprised by the differences of opinion here.
These discussions do pay off in the long run; it’s upfront pain which the engineers need to go through, but it gets easier the more schemas are created.
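One hedged sketch of what the agreed schema can look like in practice: a simple required-field and type check in plain Python. The field names are assumptions for illustration; a real team might reach for JSON Schema, Avro or protobuf instead.

```python
# Assumed shape of a "booking.created" event envelope (illustrative only).
BOOKING_CREATED_SCHEMA = {
    "id": str,
    "type": str,
    "occurred_at": str,
    "payload": dict,
}

def validate(event, schema):
    """Return a list of problems; an empty list means the event conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

good = {"id": "1", "type": "booking.created",
        "occurred_at": "2020-01-01T00:00:00Z", "payload": {}}
bad = {"id": "1", "type": "booking.created"}
print(validate(good, BOOKING_CREATED_SCHEMA))  # []
print(validate(bad, BOOKING_CREATED_SCHEMA))
```

Encoding the agreement as an executable check is one way to make those hard-won schema discussions stick.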
So what do we do with these events?
Well we collect them all into a single “pipeline”.
The pipeline is made up of several smaller components (microservices) to provide the features we need to use this new data in the business.
In this example we are storing raw data as files and then also storing the events into a single datastore for warehousing.
Other tasks could be added to the pipeline as required, redaction, segregation etc..
From the data warehouse we can then add reports required by the business from a single source. Great for compliance, and it makes it much easier to join related data together.
We can run large queries here as it’s completely separated from the operational space.
No data is deleted, which is great for having large datasets for trends and predicting customer intent.
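The pipeline described above can be sketched as a chain of small, independent stages that each receive every event. The stage names and the in-memory "stores" here are stand-ins, assumed for illustration, for the raw file store and warehouse in the real pipeline.

```python
raw_store = []   # stands in for raw event files
warehouse = []   # stands in for the single warehouse datastore

def store_raw(event):
    """Keep the original event untouched; nothing is ever deleted."""
    raw_store.append(event)

def load_warehouse(event):
    """Flatten the event into a reporting-friendly row (append-only)."""
    warehouse.append({"type": event["type"], **event["payload"]})

# Further stages (redaction, segregation, etc.) could be appended here.
PIPELINE = [store_raw, load_warehouse]

def ingest(event):
    for stage in PIPELINE:
        stage(event)

ingest({"type": "booking.created", "payload": {"booking_ref": "ABC123"}})
```

Because each stage is independent, adding a new task to the pipeline means adding a function (or, in the real system, a microservice) rather than touching the existing ones.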
The other major feature of this approach is that, just as you have services generating events, you can also have services consuming those events.
Advantages here can include looser coupling of services and queue processing for free.
Services can be built around business entities state changes rather than from current implementation details.
For example, send an email when we have a booking event, rather than having our booking API send a booking confirmation when someone books online.
It makes the engineers think a bit more generically about how a new service might be useful to others. Services are built for an individual team but can be used by the entire business.
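The email example above can be sketched as a simple publish/subscribe pattern: the email service subscribes to booking events instead of the booking API calling it directly. The topic name and handler below are illustrative assumptions, not the actual Holiday Extras services.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # event type -> list of handlers
sent_emails = []                  # stands in for an email service's outbox

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event):
    # The producer knows nothing about who consumes the event.
    for handler in subscribers[event["type"]]:
        handler(event)

# The email team builds against the event, not the booking API.
subscribe("booking.created",
          lambda e: sent_emails.append(
              f"confirmation for {e['payload']['booking_ref']}"))

publish({"type": "booking.created", "payload": {"booking_ref": "ABC123"}})
```

The booking service never learns that emails exist, which is the looser coupling mentioned above: new consumers can be added without changing the producer.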
Then the whole process starts again, services consuming events will change state and generate more events, more microservices.
More data to report and analyse!
This whole process gives us a separation of business data from the implementation detail, allowing services to be changed while data consistency remains.
It makes engineers and stakeholders think about the data they need to report on or how a new service would alter business data.
Data-driven development can be used if you take this to the extreme.
Some example questions shown that can help during the development process.
Twitter account if you want to get in touch or happy to chat later this evening.