Valliappa Lakshmanan says: “Ask someone a question in Google and you are likely to receive a link to a BigQuery view or query rather than the actual answer”. That’s what we are doing at Travelstart! I’ll present our DataOps approach and a way to create a culture of DIY, avoiding the BI bottleneck.
3. Who is Travelstart?
[Map labels: South Africa, UAE, Taipei, Egypt, Nigeria, Tanzania, Kenya, Athens, Porto. Regions: Middle East, Asia, Africa, Europe]
Travelstart is a leading online travel agency that helps
today’s business and leisure travellers search, compare
and book the best flight, hotel and car options with all
your favourite airline and accommodation suppliers.
Based in Cape Town, Travelstart proudly employs a team of
more than 850 staff from all walks of life who are
passionate about one goal - making travel simple.
5. What Are We Doing?
Context
“At Google, for example, nearly 80% of employees use Dremel (the internal counterpart to Google BigQuery) every month... everyone touches data on a regular basis to inform their decisions”
Valliappa Lakshmanan (2018)
7. 7
What Are We Doing?
Context
https://www.statista.com/statistics/219333/number-of-google-employees-by-department/
~38%
~62%
8. What Are We Doing?
Context
“Ask someone a question and you are likely to receive a link to a BigQuery view or query rather than the actual answer”
Valliappa Lakshmanan (2018)
11. How Did It Start?
Context
2014: Inception
● Classic BI architecture (pull-based)
● Designed to supply a short-term need
2018: Shit Hits the Fan
● Non-performant
● No test environments
● No version control
● Users don’t trust the data
2019: DataOps Approach
● Event-Driven
● Best practices of Software Engineering
● Remove the BI bottleneck
● DIY -> Data Democratisation
13. Data Process Generations
DataOps
● First Generation: Proprietary enterprise data warehouse and business intelligence platforms
● Second Generation: Big Data ecosystem with a data lake as a silver bullet
● Third Generation: Event-Driven, Near Real Time and Cloud Based
https://martinfowler.com/articles/data-monolith-to-mesh.html
14. Data Process Generations
DataOps
● Data Mesh: Embracing ubiquitous data
○ Distributed Domain Driven Architecture (DDD), Self-serve Platform Design, and Product Thinking with Data
○ Data is discoverable, addressable, trustworthy & truthful, with self-describing semantics & syntax, inter-operable & governed by global standards; domain data cross-functional teams
https://martinfowler.com/articles/data-monolith-to-mesh.html
15. What is DataOps?
DataOps
“DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts”
Gartner Glossary
https://www.gartner.com/en/information-technology/glossary/dataops
16. What is DataOps?
DataOps
“DataOps uses technology to automate the design, deployment and management of data delivery with appropriate levels of governance, and it uses metadata to improve the usability and value of data in a dynamic environment.”
Gartner Glossary
https://www.gartner.com/en/information-technology/glossary/dataops
21. DataOps at Travelstart
DataOps
● Values: Do It Yourself (DIY), Data democratisation, Data-Driven Decision Making, Don’t Repeat Yourself (DRY)
● Quality Attributes: Data Quality, Availability, Usability, Reliability, Performance, Flexibility
● Framework: Data Lakes, IaC & CI/CD, Data Dictionary, Data Catalog, Data Pipelines, Data Warehouse
● Descriptive, Predictive, Prescriptive
23. Reference Model
The Architecture
Lambda ~ Kappa Architecture + Event Driven Architecture
[Diagram: a new data stream feeds a Batch Layer (immutable master data, batch recompute, precomputed batch views) and a Speed Layer (stream processing, real-time increment of real-time views). The Serving Layer answers queries from the batch views and real-time views (View 1, View 2, ... View N).]
Main Architectural Pattern: Publisher-Subscriber
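A minimal sketch of how the serving layer in a Lambda-style design can answer a query by merging a batch view with a real-time increment. The route names, counts and function names are hypothetical and are not taken from the Travelstart implementation.

```python
from collections import Counter

# Hypothetical batch view: bookings per route, recomputed from immutable master data.
batch_view = Counter({"CPT-JNB": 120, "JNB-LOS": 45})

# Hypothetical real-time view: increments for events that arrived after the last batch run.
realtime_view = Counter()


def on_event(event: dict) -> None:
    """Speed layer: increment the real-time view for each new event."""
    realtime_view[event["route"]] += 1


def query(route: str) -> int:
    """Serving layer: merge the batch view with the real-time increment."""
    return batch_view[route] + realtime_view[route]


on_event({"route": "CPT-JNB"})
on_event({"route": "CPT-NBO"})
print(query("CPT-JNB"), query("CPT-NBO"))
```

When the next batch recompute finishes, the real-time view would be reset so that events are not counted twice.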
24. Reference Architecture
The Architecture
● Source Tier: External, Internal
● Aggregation Tier
● Distribution Tier: Raw, Clean
● Transform Tier: Ingestion, Integration, Quality
● Serving Tier:
○ Raw Data: Data Lakes, Dead Letters, Backups
○ OLAP: Data Marts, Ubiquitous Data, Real Time Views
○ Analytics
● Monitoring & Automation Tier: Alerting, Performance Monitoring, Dashboarding, CI/CD
26. Solution Architecture
Architecture
Source Tier
● All internal and/or external services, components, instances and processes that behave as publishers/producers of events
● An event is a data record expressing an occurrence and its context.
○ Slowly Changing Dimensions (SCD)
■ Type 1: Last state
■ Type 2: Complete state
■ Type 3: What changed
● According to CloudEvents 1.0, an event MUST have:
○ Unique ID (id)
○ Source (source)
○ SpecVersion (specversion)
○ Type (type)
● According to CloudEvents 1.0, these attributes are OPTIONAL but nice to have:
○ DataContentType (datacontenttype)
○ DataSchema (dataschema)
○ Subject (subject)
○ Time (time)
https://github.com/cloudevents/spec/blob/v1.0/spec.md
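The attribute lists above can be sketched as a small event builder and validator. This is an illustration only: the attribute names follow the CloudEvents 1.0 spec, while the helper names, the source path and the event type are hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional

# Attributes that CloudEvents 1.0 marks as REQUIRED.
REQUIRED = ("id", "source", "specversion", "type")


def make_event(source: str, event_type: str, data: dict,
               subject: Optional[str] = None) -> dict:
    """Build an event with the required attributes plus a few optional ones."""
    event = {
        "id": str(uuid.uuid4()),
        "source": source,
        "specversion": "1.0",
        "type": event_type,
        "datacontenttype": "application/json",
        "time": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    if subject is not None:
        event["subject"] = subject
    return event


def missing_required(event: dict) -> List[str]:
    """Return the required attributes that are absent or empty."""
    return [name for name in REQUIRED if not event.get(name)]


# Hypothetical producer and event type, for illustration only.
booking_created = make_event("/booking-service", "com.example.booking.created",
                             {"reference": "ABC123"})
```

A producer could call `missing_required` before publishing and refuse to send events that fail the check.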
37. Solution Architecture
Architecture
Aggregation Tier
● All the services and components that aggregate Atomic Events into a Domain Event
● This aggregation can happen:
○ Inside the source
■ In memory
■ Using storage
○ Outside the source
■ Dedicated middleware
■ Ingestion or Integration pipelines
● External sources use dedicated endpoints that receive data from the external source, set up the canonical structure and send it to the topic
○ Serverless functions
○ Stateless containers
● Schedulers can serve as triggers to start the ETL
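A minimal sketch of in-memory aggregation, where several atomic events are folded into one domain event. The field names (`booking_id`, `amount`) and the event type `booking.completed` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate(atomic_events: List[dict]) -> List[dict]:
    """Group atomic events by booking and emit one domain event per booking."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for event in atomic_events:
        grouped[event["booking_id"]].append(event)
    return [
        {
            "type": "booking.completed",  # hypothetical domain event type
            "booking_id": booking_id,
            "steps": [event["type"] for event in events],
            "total": sum(event.get("amount", 0) for event in events),
        }
        for booking_id, events in grouped.items()
    ]


atomic = [
    {"booking_id": "B1", "type": "flight.selected", "amount": 900},
    {"booking_id": "B1", "type": "baggage.added", "amount": 100},
    {"booking_id": "B2", "type": "flight.selected", "amount": 500},
]
domain_events = aggregate(atomic)
```

The same function could run inside the source, in dedicated middleware or in a pipeline step; only where the atomic events are buffered changes.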
41. Solution Architecture
Architecture
Aggregation Tier
Areas of aggregation
[Diagram: aggregation can take place in the Source System, in a dedicated Aggregation System, or in Distribution & Transformation. Labels: Producer 1, Producer 2, Producer n, Event Producer, Map & Aggregator, Windowing [session], Event Delivery, Persistence, Read Replica(s) or API, DWH/DL Read Replica(s), Orchestrator. Axis: Complexity.]
45. Solution Architecture
Architecture
Distribution Tier
● All the Topics/Queues in charge of receiving raw and clean events and distributing them to multiple subscribers
● Topic per event type
● Subscription per subscriber
● Communication between producers and topics can fail, so keep in mind:
○ Retries
○ Graveyard (dead letters)
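The points above can be sketched with an in-memory stand-in for a message broker: one topic per event type, one subscription per subscriber, and a publish helper that retries and then sends the event to a graveyard. All class, function and subscriber names are hypothetical; a real deployment would rely on the retry and dead-letter features of its messaging service.

```python
from typing import Callable, Dict, List


class Topic:
    """One topic per event type, holding one subscription per subscriber."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.subscriptions: Dict[str, Callable[[dict], None]] = {}

    def subscribe(self, subscriber: str, handler: Callable[[dict], None]) -> None:
        self.subscriptions[subscriber] = handler

    def publish(self, event: dict) -> None:
        for handler in self.subscriptions.values():
            handler(event)


def publish_with_retries(publish: Callable[[dict], None], event: dict,
                         graveyard: List[dict], max_retries: int = 3) -> bool:
    """Try to publish; after max_retries failures, move the event to the graveyard."""
    for _ in range(max_retries):
        try:
            publish(event)
            return True
        except ConnectionError:
            continue  # transient failure between producer and topic: try again
    graveyard.append(event)
    return False


topics = {"booking.created": Topic("booking.created")}
received: List[dict] = []
topics["booking.created"].subscribe("warehouse-loader", received.append)

graveyard: List[dict] = []
delivered = publish_with_retries(topics["booking.created"].publish,
                                 {"type": "booking.created", "id": "1"}, graveyard)
```

Keeping the graveyard separate from the topic means failed events stay available for later inspection and replay.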
52. Solution Architecture
Architecture
Transformation Tier
● Stream/Batch Data Pipelines in charge of receiving domain events, processing them and persisting them
● A pipeline can be split by domain and/or resource consumption
● Three categories:
○ Ingestion
○ Integration
○ Quality
■ Data Lineage and Data Provenance
● Serverless functions can be used here, but be careful with them!
56. Solution Architecture
Architecture
Transformation Tier
Ingestion Pipelines: in charge of receiving raw events, then processing, cleaning, enriching and validating their quality
- Uncompress
- Standardize
- Cleansing
- Enrichment
- Schematization
- Quality Checks
- Alerting
Directed Acyclic Graph (DAG) steps: Canonize Events, Process, Quality Assurance, Persist Raw (Data Lake), Persistence, Publish Clean Event (RAW in, CLEAN out)
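A minimal sketch of the ingestion steps listed above, applied to a single gzip-compressed JSON event. The required fields and the default currency are assumptions made for the example, not the actual Travelstart schema.

```python
import gzip
import json
from typing import List, Tuple

# Hypothetical schema: fields every clean event is expected to carry.
REQUIRED_FIELDS = ("id", "type", "currency", "amount")


def ingest(raw: bytes) -> Tuple[dict, List[str]]:
    """Run one raw event through the ingestion steps and report quality problems."""
    event = json.loads(gzip.decompress(raw))           # uncompress
    event = {k.lower(): v for k, v in event.items()}   # standardize key casing
    event = {k: v.strip() if isinstance(v, str) else v
             for k, v in event.items()}                # cleanse stray whitespace
    event.setdefault("currency", "ZAR")                # enrich (hypothetical default)
    problems = [f"missing {field}" for field in REQUIRED_FIELDS
                if field not in event]                 # schema check
    if "amount" in event and not isinstance(event["amount"], (int, float)):
        problems.append("amount is not numeric")       # quality check
    return event, problems


raw_event = gzip.compress(json.dumps(
    {"ID": "1", "Type": " booking.created ", "Amount": 10}).encode())
clean_event, problems = ingest(raw_event)
```

An event with an empty problem list would be published as a clean event; anything else would go to the dead letters and trigger an alert.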
62. Solution Architecture
Architecture
Transformation Tier
Quality Pipelines: in charge of reading dead letters, running reports to analyse quality issues and, where possible, bringing dead/lost events back to life (necromancer)
Directed Acyclic Graph (DAG) steps: Read Event Sources, Identify Quality Check, Run Quality Check, Report (output: CLEAN)
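A minimal sketch of the "necromancer" idea: read the dead letters, count the failure reasons for a report, and try to repair each event so it can be published again. The dead-letter layout (`reason`, `event`) and the repair rule are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def run_quality_pipeline(
    dead_letters: List[dict],
    repair: Callable[[dict], Optional[dict]],
) -> Tuple[Counter, List[dict], List[dict]]:
    """Report failure reasons and split dead letters into revived and still dead."""
    report = Counter(letter["reason"] for letter in dead_letters)
    revived: List[dict] = []
    still_dead: List[dict] = []
    for letter in dead_letters:
        fixed = repair(letter["event"])
        if fixed is None:
            still_dead.append(letter)
        else:
            revived.append(fixed)
    return report, revived, still_dead


def add_default_currency(event: dict) -> Optional[dict]:
    """Hypothetical repair rule: a missing currency is fixable, a missing amount is not."""
    if "amount" not in event:
        return None
    return {**event, "currency": event.get("currency", "ZAR")}


dead_letters = [
    {"reason": "missing currency", "event": {"id": "1", "amount": 100}},
    {"reason": "missing amount", "event": {"id": "2"}},
]
report, revived, still_dead = run_quality_pipeline(dead_letters, add_default_currency)
```

Revived events would be published back as clean events, while the report shows which quality checks fail most often.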
75. Solution Architecture
Architecture
Serving Tier
● Stores all the raw and clean data so that it is accessible to our stakeholders
● About raw data:
○ Data Lakes
○ Dead Letters
○ Backups
● About OLAP:
○ Data Marts
○ Ubiquitous Data / Data Sourcing
○ Real Time Views
● About Analytics:
○ Whatever makes you happy (with data governance, of course)
81. Solution Architecture
Architecture
Serving Tier
Data Lakes, Dead Letters & Backups
● Standard / Hotline
● Nearline: min duration 30 days, 43% cost reduction
● Coldline: min duration 90 days, 69% cost reduction
● Archive: min duration 365 days, 89% cost reduction
[Diagram labels: timeline marks at 180 days and 365 days; Deletion Cost]
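A minimal sketch of why the minimum duration matters for deletion cost, assuming the classes on the slide are the Google Cloud Storage classes: an object deleted before its minimum storage duration is billed as if it had been stored for the whole minimum. The function names are hypothetical.

```python
# Minimum storage durations in days for each storage class.
MIN_DURATION_DAYS = {"standard": 0, "nearline": 30, "coldline": 90, "archive": 365}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days billed for an object deleted after days_stored days."""
    return max(days_stored, MIN_DURATION_DAYS[storage_class])


def early_deletion_days(storage_class: str, days_stored: int) -> int:
    """Extra days charged because the object was deleted before the minimum duration."""
    return billable_days(storage_class, days_stored) - days_stored


print(billable_days("coldline", 10), early_deletion_days("archive", 100))
```

This is why short-lived data such as recent dead letters fits the Standard class better, while long-lived backups benefit from the colder classes.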
82. Solution Architecture
Architecture
Serving Tier
Data Lakes vs Data Warehouses
● Data Lake
○ Raw data
○ It’s dirty
○ Transactional structure, not analytical
● Data Warehouse
○ Clean/Skew data
○ Quality processes applied
○ Schema defined for analytical purposes
88. Solution Architecture
Architecture
Monitoring & Automation Tier
● The command and control area of the whole architecture
● It needs to monitor everything related to data ingestion, processing and infrastructure
● Alerting:
○ Policies
○ Multiple notification channels
● Performance and Stability Monitoring
○ Bottlenecks, bugs, etc.
○ Slow steps, underutilized infrastructure, memory issues, etc.
● CI/CD + IaC
○ Development and deployment lifecycle completely automated
○ Any architectural change should be tracked through version control
● Dashboards!