Scaling Your Data:
Data Democratisation and DataOps
Juan Urrego
March 2020
Juan Urrego
2
DataOps Team Lead
YOU-Anne
JOO-Anne
ONE
Who-Anne
Colombia
Columbia
3
Who is Travelstart?
Locations: South Africa, UAE, Taipei, Egypt, Nigeria, Tanzania, Kenya, Athens, Porto
Regions: Middle East, Asia, Africa, Europe
Travelstart is a leading online travel agency that helps
today’s business and leisure travellers search, compare
and book the best flight, hotel and car options with all
your favourite airline and accommodation suppliers.
Based in Cape Town, Travelstart proudly employs a team of
more than 850 staff from all walks of life who are
passionate about one goal - making travel simple.
Context
What Are We Doing?
Context
“At Google, for example, nearly 80% of employees use Dremel (the internal counterpart to Google BigQuery) every month... everyone touches data on a regular basis to inform their decisions”
Valliappa Lakshmanan (2018)
6
What Are We Doing?
Context
https://www.statista.com/statistics/219333/number-of-google-employees-by-department/
~38%
~62%
“Ask someone a question and you are likely to receive a link to a BigQuery view or query rather than the actual answer”
Valliappa Lakshmanan (2018)
How Did It Start?
Context
● 2014: Inception
○ Classic BI architecture (pull-based)
○ Designed to supply a short-term need
● 2018: Shit Hits the Fan
○ Non-performant
○ No test environments
○ No version control
○ Users don’t trust the data
● 2019: DataOps Approach
○ Event-Driven
○ Best practices of Software Engineering
○ Remove the BI bottleneck
○ DIY -> Data Democratisation
DataOps
Data Process Generations
DataOps
https://martinfowler.com/articles/data-monolith-to-mesh.html
● First Generation: Proprietary enterprise data warehouse and business intelligence platforms
● Second Generation: Big Data ecosystem with a data lake as a silver bullet
● Third Generation: Event-Driven, Near Real Time and Cloud Based
● Data Mesh: Embracing ubiquitous data
○ Distributed Domain Driven Architecture (DDD), Self-serve Platform Design, and Product Thinking with Data
○ Data is discoverable, addressable, trustworthy & truthful, self-describing semantics & syntax, inter-operable & governed by global standards, domain data cross-functional teams
What is DataOps?
DataOps
https://www.gartner.com/en/information-technology/glossary/dataops
“DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts”
Gartner Glossary
“DataOps uses technology to automate the design, deployment and management of data delivery with appropriate levels of governance, and it uses metadata to improve the usability and value of data in a dynamic environment.”
Gartner Glossary
What is DataOps?
DataOps
[Diagram: DataOps shown together with DevOps, Agile and Lean]
https://www.saagie.com/blog/dataops-devops-2-0/
● A few practices are common to DevOps and DataOps:
○ Automation (CI/CD)
○ Unit tests (code coverage is relevant)
○ Environments management
○ Versions management
○ Monitoring
DataOps at Travelstart
DataOps
● Values: Do It Yourself (DIY), Data democratisation, Data-Driven Decision Making, Don’t Repeat Yourself (DRY)
● Quality Attributes: Data Quality, Availability, Usability, Reliability, Performance, Flexibility
● Framework: Data Lakes, IaC & CI/CD, Data Dictionary, Data Catalog, Data Pipelines, Data Warehouse
● Descriptive, Predictive, Prescriptive
The Architecture
Reference Model
The Architecture
● Lambda ~ Kappa Architecture + Event Driven Architecture
● Main Architectural Pattern: Publisher-Subscriber
● Input: new data stream
● Batch Layer: immutable master data, batch recompute, precomputed views
● Speed Layer: process stream, real-time increment, incremented views
● Serving Layer: batch views and real-time views (View 1, View 2 ... View N), which answer queries
Reference Architecture
The Architecture
● Source Tier: Internal, External
● Aggregation Tier
● Distribution Tier: Raw, Clean
● Transform Tier: Ingestion, Integration, Quality
● Serving Tier: Raw Data (Data Lakes, Dead Letters, Backups), OLAP (Data Marts, Ubiquitous Data, Real Time Views), Analytics
● Monitoring & Automation Tier: Alerting, Performance Monitoring, Dashboarding, CI/CD
Source Tier
● Every internal or external service, component, instance or process that acts as a publisher/producer of events
● An event is a data record expressing an occurrence and its context.
○ Slowly Changing Dimensions (SCD)
■ Type 1: Last state
■ Type 2: Complete state
■ Type 3: What changed
● According to CloudEvents 1.0, an event MUST have:
○ id (unique ID)
○ source
○ specversion
○ type
● According to CloudEvents 1.0, it is NICE to have (OPTIONAL attributes):
○ datacontenttype
○ dataschema
○ subject
○ time
26
Solution Architecture
Architecture
https://github.com/cloudevents/spec/blob/v1.0/spec.md
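The required and optional attributes listed for the Source Tier can be checked with a few lines of Python. This is only a minimal sketch: the attribute names come from the CloudEvents 1.0 spec, while `new_event`, the `/backend/search` source and the payload are hypothetical and not part of the talk.

```python
import uuid
from datetime import datetime, timezone

# Attribute names as defined by the CloudEvents 1.0 spec.
REQUIRED = ("id", "source", "specversion", "type")
OPTIONAL = ("datacontenttype", "dataschema", "subject", "time")


def missing_required(event: dict) -> list:
    """Return the REQUIRED attributes that are absent or empty."""
    return [name for name in REQUIRED if not event.get(name)]


def new_event(event_type: str, source: str, data: dict) -> dict:
    """Build an event envelope (hypothetical helper, payload layout assumed)."""
    return {
        "id": str(uuid.uuid4()),
        "source": source,
        "specversion": "1.0",
        "type": event_type,
        "datacontenttype": "application/json",
        "time": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }


event = new_event("search_event", "/backend/search", {"origin": "CPT"})
print(missing_required(event))          # []
print(missing_required({"type": "x"}))  # ['id', 'source', 'specversion']
```

A producer could run a check like this before publishing, so that malformed events are rejected at the source rather than deep inside a pipeline.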
Source Tier
Event Granularity
● Granularity ranges from coarse-grained to fine-grained; aggregation is required along that range
● Extreme: field or sub-property (Phone, Email)
● Domain Event: SearchEvent
● Composition vs Aggregation
Source Tier
Format, Compression and Codecs
● Plain Text. Compression: GZIP, BZIP2, LZ4
● Binary. Codec: Deflate, Snappy, LZO
Source Tier
Travelstart Event Template
● Main Format: JSON
● Compression: GZIP
● JSON Minify + GZIP -> ~96% size reduction
Header
Type: search_event
Format: json
System: backend
UUID: 34c86044-4b73-46ec-b766
Ip: 10.10.0.1
InstanceId: backend-01
Hostname: mybackend.travelstart
Compression: gzip
Timestamp: 1581749960349
Payload
{
  "uuid": "34c86044-4b73-46ec-b766",
  "timestamp": 1581749960349,
  ….
}
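The "JSON Minify + GZIP" step of the event template can be sketched with the Python standard library. The payload below is invented for illustration, so the size ratio it prints says nothing about the ~96% figure from the slide; `pack` and `unpack` are hypothetical helper names.

```python
import gzip
import json


def pack(payload: dict) -> bytes:
    """Minify the JSON (no extra whitespace) and compress it with GZIP."""
    minified = json.dumps(payload, separators=(",", ":"))
    return gzip.compress(minified.encode("utf-8"))


def unpack(blob: bytes) -> dict:
    """Reverse of pack: decompress and parse."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical payload: repetitive search results tend to compress well.
payload = {
    "timestamp": 1581749960349,
    "results": [
        {"airline": "XX", "price": 1000 + i, "currency": "ZAR"}
        for i in range(200)
    ],
}
# Header fields mirror the template; values are illustrative.
header = {"Type": "search_event", "Format": "json", "Compression": "gzip"}

pretty_size = len(json.dumps(payload, indent=2).encode("utf-8"))
packed = pack(payload)
print(f"{pretty_size} bytes pretty-printed, {len(packed)} bytes packed")
```

Keeping `Format` and `Compression` in the header lets a subscriber decide how to decode the body without inspecting it first.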
Aggregation Tier
● All the services and components that aggregate Atomic Events into a Domain Event
● This aggregation can happen:
○ Inside the source
■ In memory
■ Using storage
○ Outside the source
■ Dedicated middleware
■ Ingestion or Integration pipelines
● External sources use dedicated endpoints that receive the data, set up the canonical structure and send it to the topic
○ Serverless functions
○ Stateless containers
● Schedulers can serve as triggers to start the ETL
37
Solution Architecture
Architecture
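As a rough illustration of in-memory aggregation, the sketch below folds field-level atomic events into one domain event per entity. The event shape (`entity_id`, `field`, `value`, `timestamp`) and the `customer_updated` type are assumptions made for the example, not Travelstart's actual schema.

```python
from collections import defaultdict


def aggregate(atomic_events: list) -> list:
    """Fold atomic field-level events into one domain event per entity."""
    grouped = defaultdict(dict)
    for event in sorted(atomic_events, key=lambda e: e["timestamp"]):
        # Later values overwrite earlier ones for the same field.
        grouped[event["entity_id"]][event["field"]] = event["value"]
    return [
        {"type": "customer_updated", "entity_id": entity_id, "data": fields}
        for entity_id, fields in grouped.items()
    ]


atomic = [
    {"entity_id": "c1", "field": "phone", "value": "111", "timestamp": 1},
    {"entity_id": "c1", "field": "email", "value": "a@example.com", "timestamp": 2},
    {"entity_id": "c1", "field": "phone", "value": "222", "timestamp": 3},
]
domain_events = aggregate(atomic)
print(domain_events)
```

The same fold could run inside the source, in dedicated middleware or in a pipeline; only the place where `grouped` lives (memory or storage) changes.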
Aggregation Tier
Areas of aggregation
[Diagram labels: Source System; Producer 1, Producer 2 ... Producer n; Event Producer; Windowing [session]; Map & Aggregator; Persistence; Event Delivery; Aggregation System; Orchestrator; Read Replica(s) or API; Distribution & Transformation; DWH/DL Read Replica(s); Complexity]
Aggregation Tier
External Sources [figure only]
Aggregation Tier
Schedulers [figure only]
Distribution Tier
● All the Topics/Queues in charge of receiving raw and clean events and distributing them to multiple subscribers
● Topic per event type
● Subscription per subscriber
● Communication between producers and topics can fail, so keep in mind:
○ Retries
○ Graveyard (dead letters)
45
Solution Architecture
Architecture
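Retries and the graveyard can be pictured with a small wrapper around a publish call. This is a sketch under assumptions: `FlakyTopic` stands in for a real topic client, and the retry count and backoff values are arbitrary.

```python
import time


def publish_with_retries(publish, event, dead_letters, max_attempts=3, base_delay=0.0):
    """Try to publish; after max_attempts failures, park the event as a dead letter."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            publish(event)
            return True
        except ConnectionError as error:
            last_error = str(error)
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    dead_letters.append({"event": event, "reason": last_error, "attempts": max_attempts})
    return False


class FlakyTopic:
    """Stand-in for a real topic client: fails the first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def publish(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("topic unavailable")
        self.received.append(event)


graveyard = []
ok_topic = FlakyTopic(failures=2)
down_topic = FlakyTopic(failures=10)
print(publish_with_retries(ok_topic.publish, {"id": "1"}, graveyard))    # True
print(publish_with_retries(down_topic.publish, {"id": "2"}, graveyard))  # False
print(len(graveyard))                                                    # 1
```

Keeping the failure reason next to the dead event is what later allows a quality pipeline to report on it or replay it.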
Distribution Tier
Topics & Subscriptions
Transformation Tier
● Stream/Batch Data Pipelines in charge of receiving domain events,
processing and persisting them
● A Pipeline can be split by domain and/or resource consumption
● 3 different categories:
○ Ingestion
○ Integration
○ Quality
■ Data Lineage and Data Provenance
● Serverless functions can be used here, but be careful with them!
52
Solution Architecture
Architecture
Transformation Tier
Ingestion Pipelines
In charge of receiving the raw events, then processing, cleaning and enriching them and validating their quality
- Uncompress
- Standardize
- Cleansing
- Enrichment
- Schematization
- Quality Checks
- Alerting
Directed Acyclic Graph (DAG) stages: Canonize Events, Process, Quality Assurance, Persist Raw (Data Lake), Persistence, Publish Clean Event (RAW in, CLEAN out)
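A few of the ingestion steps (uncompress, standardize, quality check, publish clean or dead-letter) can be chained as plain functions. This is a simplified sketch, not the actual pipeline: the required fields, the standardization rules and the list-based "topics" are assumptions for illustration.

```python
import gzip
import json

REQUIRED_FIELDS = ("uuid", "timestamp")  # hypothetical schema


def uncompress(raw: bytes) -> dict:
    return json.loads(gzip.decompress(raw).decode("utf-8"))


def standardize(event: dict) -> dict:
    # Lower-case the keys and trim string values.
    return {k.lower(): v.strip() if isinstance(v, str) else v for k, v in event.items()}


def quality_check(event: dict) -> list:
    return [f"The field '{f}' can't be null" for f in REQUIRED_FIELDS if event.get(f) is None]


def ingest(raw: bytes, clean_topic: list, dead_letters: list) -> None:
    """Run the steps in order; failures go to dead letters, the rest is published as clean."""
    event = standardize(uncompress(raw))
    errors = quality_check(event)
    if errors:
        dead_letters.append({"event": event, "errors": errors})
    else:
        clean_topic.append(event)


def compress(event: dict) -> bytes:
    return gzip.compress(json.dumps(event).encode("utf-8"))


clean, dead = [], []
ingest(compress({"UUID": " abc ", "timestamp": 1581749960349}), clean, dead)
ingest(compress({"timestamp": 1581749960349}), clean, dead)
print(clean)  # [{'uuid': 'abc', 'timestamp': 1581749960349}]
print(dead)
```

In a real stream or batch framework each function would be a node of the DAG, which makes it possible to scale or replace one step without touching the others.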
[Figures omitted: Ingestion Pipelines screenshots; two tables with columns Col 1, Col 2, Col 3, one labelled Quality Check and one labelled Metadata]
Transformation Tier
Integration Pipelines
In charge of integrating multiple sources to create consolidated data marts
Directed Acyclic Graph (DAG) stages: Identify Integration Type, Read Event Sources, Join Sources, Persistence (input: RAW/CLEAN)
Transformation Tier
Quality Pipelines
In charge of reading dead letters, running reports to analyse quality issues and, where possible, bringing dead/lost events back to life (necromancer)
Directed Acyclic Graph (DAG) stages: Identify Quality Check, Read Event Sources, Run Quality Check, Report (input: CLEAN)
com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'bookingUuid' can't be null | at com.travelstart.bi.etl.pipelines.booking.transforms.TsMessageToBookingFn.processElement(TsMessageToBookingFn.java:152) : 821270
com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'reference' can't be null | at com.travelstart.bi.etl.pipelines.booking.transforms.TsMessageToBookingFn.processElement(TsMessageToBookingFn.java:152) : 7336
com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'bookingId' can't be null | at com.travelstart.bi.etl.pipelines.booking.transforms.TsMessageToBookingFn.processElement(TsMessageToBookingFn.java:152) : 2482
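A report like the one above (failure reason followed by a count) could be produced by grouping dead letters by reason. The sketch below assumes each dead letter carries a `reason` string; the sample counts are made up and far smaller than the real ones.

```python
from collections import Counter


def quality_report(dead_letters: list) -> list:
    """Count dead letters per failure reason, most frequent first."""
    counts = Counter(letter["reason"] for letter in dead_letters)
    return counts.most_common()


# Hypothetical sample of dead letters.
dead_letters = (
    [{"reason": "The field 'bookingUuid' can't be null"}] * 3
    + [{"reason": "The field 'reference' can't be null"}] * 2
    + [{"reason": "The field 'bookingId' can't be null"}]
)
for reason, count in quality_report(dead_letters):
    print(f"{reason} : {count}")
```

Sorting by frequency points the team at the producer bug that is costing the most events first.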
Still Awake?
Serving Tier
● Stores all the raw and clean data so that it is accessible to our stakeholders
● About raw data:
○ Data Lakes
○ Dead Letters
○ Backups
● About OLAP:
○ Data Marts
○ Ubiquitous Data / Data Sourcing
○ Real Time Views
● About Analytics:
○ Whatever makes you happy (with data governance, of course)
75
Solution Architecture
Architecture
Serving Tier
Data Lakes, Dead Letters & Backups
● Plain Text / Schema-less. Compression: GZIP, BZIP2, LZ4
● Binary / Schema. Codec: Deflate, Snappy, LZO, LZIP, GZIP
Serving Tier
Data Lakes, Dead Letters & Backups
● Storage classes: Standard/Hotline, Nearline, Coldline, Archive
● Min Duration: 30 days (Nearline), 90 days (Coldline), 365 days (Archive)
● Cost reductions shown: 43%, 69%, 89%
● Other labels on the slide: 180 days, 365 days, Deletion Cost
Serving Tier
Data Lakes vs Data Warehouses
● Data Lake: raw data; it’s dirty; transactional structure, not analytical
● Data Warehouse: clean/skew data; quality processes applied; schema defined for analytical purposes
Serving Tier
Data Marts, Event Sourcing and Real Time Views
● HyperCube operations: Pivot, Drilldown, Roll Up, Dicing, Slicing
● Data marts: more than 300 columns; include facts and aggregations; mostly follow star modelling and denormalisation
● Views and Materialised Views
Monitoring & Automation Tier
● The command and control area of the whole architecture
● It needs to monitor everything related to data ingestion, processing and infrastructure
● Alerting:
○ Policies
○ Multiple notification channels
● Performance and Stability Monitoring
○ Bottlenecks, bugs, etc.
○ Slow steps, underutilized infrastructure, memory issues, etc.
● CI/CD + IaC
○ Development and deployment lifecycle completely automated
○ Any architectural change should be tracked through version control
● Dashboards!
Monitoring & Automation Tier
Monitoring [figures only]
Monitoring & Automation Tier
Alerting
● Messages delivered rate
● Messages in queue
● Too Many Resources
● Error Rate
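Alerting policies over metrics such as messages in queue or error rate can be modelled as threshold rules with notification channels. The metric names, thresholds and channels below are all assumptions for illustration; a managed monitoring service would normally hold these policies.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    metric: str
    breached: Callable[[float, float], bool]  # comparison against the threshold
    threshold: float
    channels: tuple


# Hypothetical policies and thresholds.
POLICIES = [
    Policy("messages_in_queue", operator.gt, 10_000, ("email", "chat")),
    Policy("error_rate", operator.gt, 0.05, ("pager",)),
    Policy("messages_delivered_rate", operator.lt, 1.0, ("chat",)),
]


def evaluate(metrics: dict) -> list:
    """Return (metric, value, channels) for every breached policy."""
    alerts = []
    for policy in POLICIES:
        value = metrics.get(policy.metric)
        if value is not None and policy.breached(value, policy.threshold):
            alerts.append((policy.metric, value, policy.channels))
    return alerts


alerts = evaluate(
    {"messages_in_queue": 25_000, "error_rate": 0.01, "messages_delivered_rate": 40.0}
)
print(alerts)  # [('messages_in_queue', 25000, ('email', 'chat'))]
```

Declaring policies as data rather than code fits the IaC idea: they can be versioned and reviewed like any other change.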
Monitoring & Automation Tier
CI/CD & IaC
● Jenkins (Pipelines)
● Cloud Deployment Manager
● Cloud Build
● GitHub Actions
● Spotless Code Analysis
● PR Analysis
YAML Schema
info:
  title: Buckets Template
  description: Creates buckets in GCS with the correct life cycle rules and security constraints
  version: 1.0
properties:
  name:
    type: string
    description: Name property of the bucket standard `ts-<env>-dataops-<name>`
    pattern: ^[a-z0-9]+$
  labels:
    type: object
    description: User-provided bucket labels, in key/value pairs.
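The `pattern` constraint and the `ts-<env>-dataops-<name>` naming standard from the schema can be mimicked in Python to see what the template would accept. This is a sketch only: `bucket_name` is a hypothetical helper, and the real validation is done by the deployment tooling, not by this code.

```python
import re

# Same pattern as the `name` property in the schema.
NAME_PATTERN = re.compile(r"^[a-z0-9]+$")


def bucket_name(env: str, name: str) -> str:
    """Validate `name` and expand it to the ts-<env>-dataops-<name> standard."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid bucket name property: {name!r}")
    return f"ts-{env}-dataops-{name}"


print(bucket_name("dev", "batch"))  # ts-dev-dataops-batch
```

Rejecting names at template level means a badly named bucket fails in CI/CD instead of appearing in the cloud project.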
Jinja/YAML Template
- name: ts-dataops-batch
  type: buckets.jinja
  properties:
    name: batch
    environment: {{ properties['environment'] }}
    labels:
      description: temporary_files_to_insert
      data_type: raw
    deleteRule:
      age: 7
      isLive: true
Questions
juancho088
@juancho088
@tech-travelstart
juan-urrego
jsurrego
We love talent.
We are recruiting!

 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
 
Putting data to work
Putting data to workPutting data to work
Putting data to work
 
Denodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data VirtualizationDenodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data Virtualization
 

Recently uploaded

Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
varanasisatyanvesh
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
zifhagzkk
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
LuisMiguelPaz5
 

Recently uploaded (20)

Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1
 
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
 
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTSDBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
 

Scaling Your Data: Data Democratisation and DataOps

  • 1. Scaling Your Data: Data Democratisation and DataOps Juan Urrego March 2020
  • 2. Juan Urrego, DataOps Team Lead. YOU-Anne / JOO-Anne / ONE / Who-Anne. Colombia / Columbia.
  • 3. Who is Travelstart? Travelstart is a leading online travel agency that helps today’s business and leisure travellers search, compare and book the best flight, hotel and car options with all your favourite airline and accommodation suppliers. Based in Cape Town, Travelstart proudly employs a team of more than 850 staff from all walks of life who are passionate about one goal - making travel simple. [Map labels: Africa, Europe, Middle East, Asia; South Africa, Egypt, Nigeria, Tanzania, Kenya, Athens, Porto, UAE, Taipei]
  • 5. Context: What Are We Doing? “At Google, for example, nearly 80% of employees use Dremel (the internal counterpart to Google BigQuery) every month...everyone touches data on a regular basis to inform their decisions” Valliappa Lakshmanan (2018)
  • 7. What Are We Doing? [Chart: Google employees by department, split ~38% / ~62%] Source: https://www.statista.com/statistics/219333/number-of-google-employees-by-department/
  • 8. What Are We Doing? “Ask someone a question and you are likely to receive a link to a BigQuery view or query rather than the actual answer” Valliappa Lakshmanan (2018)
  • 11. How Did It Start?
    ● 2014, Inception: classic BI architecture (pull-based), designed to supply a short-term need
    ● 2018, Shit Hits the Fan: non-performant, no test environments, no version control, users don’t trust the data
    ● 2019, DataOps Approach: event-driven, best practices of software engineering, remove the BI bottleneck, DIY -> Data Democratisation
  • 13. DataOps: Data Process Generations
    ● First Generation: proprietary enterprise data warehouse and business intelligence platforms
    ● Second Generation: Big Data ecosystem with a data lake as a silver bullet
    ● Third Generation: Event-Driven, Near Real Time and Cloud Based
    Source: https://martinfowler.com/articles/data-monolith-to-mesh.html
  • 14. Data Process Generations
    ● Data Mesh: Embracing ubiquitous data
      ○ Distributed Domain Driven Architecture (DDD), Self-serve Platform Design, and Product Thinking with Data
      ○ Data is discoverable, addressable, trustworthy & truthful, self-describing semantics & syntax, inter-operable & governed by global standards, domain data cross-functional teams
    Source: https://martinfowler.com/articles/data-monolith-to-mesh.html
  • 15. What is DataOps? “DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts” Gartner Glossary. Source: https://www.gartner.com/en/information-technology/glossary/dataops
  • 16. What is DataOps? “DataOps uses technology to automate the design, deployment and management of data delivery with appropriate levels of governance, and it uses metadata to improve the usability and value of data in a dynamic environment.” Gartner Glossary. Source: https://www.gartner.com/en/information-technology/glossary/dataops
  • 18. What is DataOps?
    ● A few practices are common to DevOps and DataOps:
      ○ Automation (CI/CD)
      ○ Unit tests (code coverage is relevant)
      ○ Environments management
      ○ Versions management
      ○ Monitoring
    Source: https://www.saagie.com/blog/dataops-devops-2-0/
  • 21. DataOps at Travelstart
    ● Values: Do It Yourself (DIY), Data democratisation, Data-Driven Decision Making, Don’t Repeat Yourself (DRY)
    ● Quality Attributes: Data Quality, Availability, Usability, Reliability, Performance, Flexibility
    ● Framework: Data Lakes, IaC & CI/CD, Data Dictionary, Data Catalog, Data Pipelines, Data Warehouse
    ● Descriptive, Predictive, Prescriptive
  • 23. The Architecture: Reference Model. Lambda ~ Kappa Architecture: Event Driven Architecture + immutable master data. Main architectural pattern: Publisher-Subscriber.
    ● New data stream feeds the Batch Layer and the Speed Layer
    ● Batch Layer: batch recompute, precompute views
    ● Speed Layer: process stream, real-time increment, increment views
    ● Serving Layer: batch views (View 1, View 2, ... View N) and real time views (View 1, View 2, ... View N), both read by Query
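The reference model above can be illustrated with a toy example. This is only a sketch under stated assumptions: the event shape (`{"type": ...}`) and the function names `batch_recompute`, `realtime_increment` and `query` are hypothetical and not taken from the talk. It shows a batch view recomputed from the full master data, a real-time view incremented per event, and a query that reads both.

```python
from collections import Counter


def batch_recompute(master_data):
    """Batch layer: rebuild the view from the full, immutable master data."""
    return Counter(event["type"] for event in master_data)


def realtime_increment(view, event):
    """Speed layer: update the real-time view one event at a time."""
    view[event["type"]] += 1
    return view


def query(batch_view, realtime_view, event_type):
    """Serving layer: answer a query by reading both views."""
    return batch_view[event_type] + realtime_view[event_type]


# Hypothetical events already covered by the last batch run.
master_data = [{"type": "search"}, {"type": "search"}, {"type": "booking"}]
batch_view = batch_recompute(master_data)

# Hypothetical events that arrived after the last batch run.
realtime_view = Counter()
for new_event in [{"type": "search"}, {"type": "booking"}]:
    realtime_increment(realtime_view, new_event)

print(query(batch_view, realtime_view, "search"))
```

In a real deployment the real-time view would be reset once a new batch run has absorbed its events; that bookkeeping is left out here.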
  • 24. Reference Architecture
    ● Source Tier: External, Internal
    ● Aggregation Tier
    ● Distribution Tier: Raw, Clean
    ● Transform Tier: Ingestion, Integration, Quality
    ● Serving Tier: Raw Data (Data Lakes, Dead Letters, Backups), OLAP (Data Marts, Ubiquitous Data, Real Time Views), Analytics
    ● Monitoring & Automation Tier: Alerting, Performance Monitoring, Dashboarding, CI/CD
  • 26. Solution Architecture: Source Tier
    ● All the internal and/or external services, components, instances or processes that behave as publishers/producers of events
    ● An event is a data record expressing an occurrence and its context.
      ○ Slowly Changing Dimensions (SCD)
        ■ Type 1: Last state
        ■ Type 2: Complete state
        ■ Type 3: What changed
    ● According to CloudEvents 1.0 it MUST have:
      ○ Unique ID
      ○ Source
      ○ SpecVersion
      ○ Type
    ● According to CloudEvents 1.0 it is NICE to have (OPTIONAL in the spec):
      ○ DataContentType
      ○ DataSchema
      ○ Subject
      ○ Time
    Source: https://github.com/cloudevents/spec/blob/v1.0/spec.md
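As a minimal sketch of the attribute lists above, the check below reports which required CloudEvents 1.0 attributes an event is missing. The attribute names are the lowercase names used by the spec; the helper `missing_required` and the sample `source` and `time` values are assumptions made for illustration.

```python
# Context attribute names as defined by CloudEvents 1.0.
REQUIRED = ("id", "source", "specversion", "type")
OPTIONAL = ("datacontenttype", "dataschema", "subject", "time")


def missing_required(event: dict) -> list:
    """Return the required attributes that are absent or empty."""
    return [name for name in REQUIRED if not event.get(name)]


event = {
    "id": "34c86044-4b73-46ec-b766",
    "source": "/backend/search",      # hypothetical source URI
    "specversion": "1.0",
    "type": "search_event",
    "time": "2020-02-15T06:59:20Z",   # optional attribute
}

print(missing_required(event))
print(missing_required({"id": "1"}))
```

A producer could run such a check before publishing, so that malformed events are rejected at the source instead of surfacing later as dead letters.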
  • 33. Source Tier: Event Granularity. A spectrum from fine-grained to coarse-grained events. Extreme: a field or sub-property (e.g. Phone, Email), which requires aggregation. Domain Event: e.g. SearchEvent. Composition vs Aggregation.
  • 34. Source Tier: Format, Compression and Codecs. Plain Text, compression: GZIP, BZIP2, LZ4. Binary, codec: Deflate, Snappy, LZO.
  • 35. Source Tier: Travelstart Event Template
    ● Main Format: JSON
    ● Compression: GZIP
    ● JSON Minify + GZIP -> ~96% size reduction
    ● Header: Type: search_event; Format: json; System: backend; UUID: 34c86044-4b73-46ec-b766; Ip: 10.10.0.1; InstanceId: backend-01; Hostname: mybackend.travelstart; Compression: gzip; Timestamp: 1581749960349
    ● Payload: { "uuid": "34c86044-4b73-46ec-b766", "timestamp": 1581749960349, …. }
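The "JSON Minify + GZIP" step can be sketched with the Python standard library. This is an illustration, not the Travelstart implementation: the `results` field of the payload and the helper names `encode_event` and `decode_event` are hypothetical, and the size reduction you observe depends entirely on the payload, so the ~96% figure from the slide is not reproduced here.

```python
import gzip
import json


def encode_event(payload: dict) -> bytes:
    """Minify the JSON payload (no extra whitespace) and GZIP it."""
    minified = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return gzip.compress(minified)


def decode_event(body: bytes) -> dict:
    """Reverse of encode_event: gunzip, then parse the JSON."""
    return json.loads(gzip.decompress(body).decode("utf-8"))


# Only "uuid" and "timestamp" appear on the slide; "results" is made up.
payload = {
    "uuid": "34c86044-4b73-46ec-b766",
    "timestamp": 1581749960349,
    "results": [{"airline": "XX", "price": 1000 + i} for i in range(200)],
}
# Subset of the header fields listed on the slide.
header = {"Type": "search_event", "Format": "json", "Compression": "gzip"}

body = encode_event(payload)
pretty_size = len(json.dumps(payload, indent=2).encode("utf-8"))
print(f"{pretty_size} bytes pretty-printed, {len(body)} bytes minified + gzipped")
```

Keeping the format and compression in the header lets a subscriber decide how to decode the body without inspecting it first.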
  • 37. Aggregation Tier
    ● All the services and components that aggregate Atomic Events into a Domain Event
    ● This aggregation can happen:
      ○ Inside the source
        ■ In memory
        ■ Using storage
      ○ Outside the source
        ■ Middleware dedicated to this
        ■ Ingestion or Integration pipelines
    ● External sources use dedicated endpoints that receive data from external sources, set up the canonical structure and send it to the topic
      ○ Serverless functions
      ○ Stateless containers
    ● Schedulers can serve as triggers to start the ETL
  • 41. Aggregation Tier: Areas of aggregation. [Diagram. Areas: Source System, Aggregation System, Distribution & Transformation. Components shown: Producer 1, Producer 2, Producer n, Event Producer, Persistence, Map & Aggregator, Windowing [session], Event Delivery, Read Replica(s) or API, DWH/DL Read Replica(s), Orchestrator. Axis: Complexity]
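The "Windowing [session]" and "Map & Aggregator" labels suggest aggregating atomic events into one record per session. The sketch below shows one common way to do that, closing a session after a gap of inactivity. The event fields (`session_key`, `timestamp`, `name`), the 30-second gap and the function name `sessionize` are all assumptions for illustration; the talk does not specify them.

```python
def sessionize(events, gap_ms):
    """Group atomic events per key into sessions split by inactivity gaps."""
    sessions = []
    open_sessions = {}  # session key -> session currently being built
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = event["session_key"]
        current = open_sessions.get(key)
        # Start a new session for unseen keys or after a long silence.
        if current is None or event["timestamp"] - current["end"] > gap_ms:
            current = {
                "session_key": key,
                "start": event["timestamp"],
                "end": event["timestamp"],
                "events": [],
            }
            open_sessions[key] = current
            sessions.append(current)
        current["end"] = event["timestamp"]
        current["events"].append(event["name"])
    return sessions


# Hypothetical atomic events (timestamps in milliseconds).
atomic_events = [
    {"session_key": "a", "timestamp": 0, "name": "phone_entered"},
    {"session_key": "b", "timestamp": 500, "name": "search_started"},
    {"session_key": "a", "timestamp": 1000, "name": "email_entered"},
    {"session_key": "a", "timestamp": 100000, "name": "search_started"},
]

for session in sessionize(atomic_events, gap_ms=30000):
    print(session)
```

Each resulting session plays the role of a coarse-grained domain event built from fine-grained ones. A streaming framework would provide this windowing natively; the loop above only shows the idea.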
  • 45. Distribution Tier
    ● All the Topics/Queues in charge of receiving raw and clean events and distributing them to multiple subscribers
    ● Topic per event type
    ● Subscription per subscriber
    ● Communication between producers and topics can fail, so keep in mind:
      ○ Retries
      ○ Graveyard (dead letters)
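The retry and dead-letter advice above can be sketched as a small wrapper around a publish call. Everything here is hypothetical: `publish` stands for whatever client call sends an event to a topic, the use of `ConnectionError` is an assumption about how failures surface, and the in-memory `dead_letters` list stands in for a real graveyard topic or bucket.

```python
import time


def publish_with_retries(publish, event, dead_letters, max_attempts=3, base_delay=0.1):
    """Try to publish; after max_attempts failures, park the event as a dead letter."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            publish(event)
            return True
        except ConnectionError as error:
            last_error = error
            # Exponential backoff between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
    dead_letters.append({"event": event, "error": str(last_error)})
    return False


def flaky_publisher(failures):
    """Test double: fails the first `failures` calls, then succeeds."""
    remaining = {"count": failures}
    sent = []

    def publish(event):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionError("topic unavailable")
        sent.append(event)

    return publish, sent


graveyard = []
publish, sent = flaky_publisher(failures=2)
ok = publish_with_retries(publish, {"type": "search_event"}, graveyard, base_delay=0)
print(ok, sent, graveyard)
```

Storing the error next to the event makes the graveyard useful later, since a quality pipeline can group dead letters by cause.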
  • 49. Distribution Tier: Topics & Subscriptions. [Diagram]
  • 52. Transformation Tier
    ● Stream/Batch Data Pipelines in charge of receiving domain events, processing and persisting them
    ● A Pipeline can be split by domain and/or resource consumption
    ● 3 different categories:
      ○ Ingestion
      ○ Integration
      ○ Quality
        ■ Data Lineage and Data Provenance
    ● Serverless functions can be used here, but be careful with them!
  • 56. Transformation Tier: Ingestion Pipelines. In charge of receiving the raw events, then processing, cleaning and enriching them and validating their quality.
    ● Steps: Uncompress, Standardize, Cleansing, Enrichment, Schematization, Quality Checks, Alerting
    ● Directed Acyclic Graph (DAG), from RAW to CLEAN: Canonize Events, Process, Quality Assurance, Persist Raw (Data Lake), Persistence, Publish Clean Event
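A stripped-down version of the ingestion steps above might look like the following. It is a sketch only: the step order, the non-nullable field list and the in-memory lists standing in for the data lake, the clean topic and the dead letters are assumptions, and a production pipeline would run on a stream or batch processing framework instead of plain functions.

```python
import gzip
import json


def uncompress(raw: bytes) -> dict:
    """Uncompress: gunzip the raw body and parse the JSON."""
    return json.loads(gzip.decompress(raw))


def standardize(event: dict) -> dict:
    """Standardize: trim and lowercase field names."""
    return {key.strip().lower(): value for key, value in event.items()}


def quality_errors(event: dict, non_nullable=("uuid", "timestamp")) -> list:
    """Quality check: report non-nullable fields that are missing or null."""
    return [f"The field '{name}' can't be null" for name in non_nullable if event.get(name) is None]


def ingest(raw: bytes, data_lake: list, clean_topic: list, dead_letters: list):
    """Persist the raw event, clean it, then publish it or park it."""
    data_lake.append(raw)
    event = standardize(uncompress(raw))
    errors = quality_errors(event)
    if errors:
        dead_letters.append({"event": event, "errors": errors})
        return None
    clean_topic.append(event)
    return event


def pack(event: dict) -> bytes:
    """Helper for the example: build a gzipped JSON body."""
    return gzip.compress(json.dumps(event).encode("utf-8"))


lake, clean, dead = [], [], []
ingest(pack({" UUID ": "34c86044-4b73-46ec-b766", "Timestamp": 1581749960349}), lake, clean, dead)
ingest(pack({"uuid": None, "timestamp": 1581749960349}), lake, clean, dead)
print(len(lake), clean, dead)
```

Persisting the raw body before any processing means a bug in a later step never loses data, since the event can be replayed from the data lake.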
  • 59. Transformation Tier: Ingestion Pipelines. [Figure: table with columns Col 1, Col 2, Col 3 and a Quality Check]
  • 61. Transformation Tier: Integration Pipelines. In charge of integrating multiple sources to create consolidated data marts. Directed Acyclic Graph (DAG), reading RAW/CLEAN: Identify Integration Type, Read Event Sources, Join Sources, Persistence.
  • 62. Transformation Tier: Quality Pipelines. In charge of reading dead letters, running reports to analyse quality issues and bringing dead/lost events back to life where possible (necromancer). Directed Acyclic Graph (DAG), reading CLEAN: Identify Quality Check, Read Event Sources, Run Quality Check, Report.
  • 67. Transformation Tier: Quality Pipelines. Dead-letter report (exception : count):
    ● com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'bookingUuid' can't be null | at com.travelstart.bi.etl.pipelines.booking.transforms.TsMessageToBookingFn.processElement(TsMessageToBookingFn.java:152) : 821270
    ● com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'reference' can't be null | at com.travelstart.bi.etl.pipelines.booking.transforms.TsMessageToBookingFn.processElement(TsMessageToBookingFn.java:152) : 7336
    ● com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'bookingId' can't be null | at com.travelstart.bi.etl.pipelines.booking.transforms.TsMessageToBookingFn.processElement(TsMessageToBookingFn.java:152) : 2482
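A report like the one above can be produced by grouping dead letters by their error message and counting them. The sketch below assumes each dead letter is a dictionary with an `error` field, which is a hypothetical shape; the sample data is invented and far smaller than the counts on the slide.

```python
from collections import Counter


def quality_report(dead_letters):
    """Count dead letters per error message, most frequent first."""
    counts = Counter(letter["error"] for letter in dead_letters)
    return counts.most_common()


# Invented sample; the real report on the slide has hundreds of thousands of rows.
sample_dead_letters = [
    {"error": "The field 'bookingUuid' can't be null"},
    {"error": "The field 'reference' can't be null"},
    {"error": "The field 'bookingUuid' can't be null"},
    {"error": "The field 'bookingUuid' can't be null"},
]

for message, count in quality_report(sample_dead_letters):
    print(f"{message} : {count}")
```

Sorting by frequency points the team at the producer-side bug that is costing the most events.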
  • 75. Serving Tier
    ● Stores all the raw and clean data so that it is accessible to our stakeholders
    ● About raw data:
      ○ Data Lakes
      ○ Dead Letters
      ○ Backups
    ● About OLAP:
      ○ Data Marts
      ○ Ubiquitous Data / Data Sourcing
      ○ Real Time Views
    ● About Analytics:
      ○ Whatever makes you happy (of course with data governance)
  • 79. Serving Tier: Data Lakes, Dead Letters & Backups. Plain Text / Schema-less, compression: GZIP, BZIP2, LZ4. Binary / Schema, codec: Deflate, Snappy, LZO, LZIP, GZIP.
  • 81. Serving Tier: Data Lakes, Dead Letters & Backups. [Diagram of storage classes: Standard/Hotline, Nearline, Coldline, Archive. Minimum durations shown: 30 days, 90 days, 365 days. Ages shown: 180 days, 365 days. Cost reductions shown: 43%, 69%, 89%. Deletion cost]
  • 82. Serving Tier: Data Lakes vs Data Warehouses
    ● Data Lake: raw data; it’s dirty; transactional structure, not analytical
    ● Data Warehouse: clean/skew data; quality processes applied; schema defined for analytical purposes
  • 83. Serving Tier: Data Marts, Event Sourcing and Real Time Views. HyperCube operations: Pivot, Drilldown, Roll Up, Dicing, Slicing.
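Three of the hypercube operations named above can be illustrated on a tiny in-memory fact table. This is a teaching sketch with invented data and hypothetical function names; a data mart would express the same operations as queries against the warehouse.

```python
from collections import defaultdict

# Invented fact rows: three dimensions and one measure.
facts = [
    {"country": "ZA", "month": "2020-01", "channel": "web", "bookings": 10},
    {"country": "ZA", "month": "2020-01", "channel": "app", "bookings": 5},
    {"country": "ZA", "month": "2020-02", "channel": "web", "bookings": 7},
    {"country": "NG", "month": "2020-01", "channel": "web", "bookings": 3},
]


def slice_cube(rows, dimension, value):
    """Slicing: fix one dimension to a single value."""
    return [row for row in rows if row[dimension] == value]


def dice_cube(rows, **allowed):
    """Dicing: keep rows whose dimensions fall inside the allowed value sets."""
    return [row for row in rows if all(row[dim] in values for dim, values in allowed.items())]


def roll_up(rows, dimensions, measure):
    """Roll up: sum the measure over the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[dim] for dim in dimensions)] += row[measure]
    return dict(totals)


print(roll_up(facts, ["country"], "bookings"))
# Drilling down is the same aggregation at a finer grain.
print(roll_up(facts, ["country", "month"], "bookings"))
print(len(slice_cube(facts, "channel", "web")))
```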
  • 84. Serving Tier: Data Marts, Event Sourcing and Real Time Views. More than 300 columns. Includes facts and aggregations. Mostly following star modeling and denormalisation.
  • 85. Serving Tier: Data Marts, Event Sourcing and Real Time Views. Views, Materialised Views.
  • 86. Serving Tier: Data Marts, Event Sourcing and Real Time Views. [Diagram]
  • 88. Monitoring & Automation Tier
    ● The command and control area of the whole architecture
    ● It needs to monitor everything related to data ingestion, processing and infrastructure
    ● Alerting:
      ○ Policies
      ○ Multiple notification channels
    ● Performance and Stability Monitoring
      ○ Bottlenecks, bugs, etc.
      ○ Slow steps, underutilized infrastructure, memory issues, etc.
    ● CI/CD + IaC
      ○ Development and deployment lifecycle completely automated
      ○ Any architectural change should be tracked through version control
    ● Dashboards!
  • 94. Monitoring & Automation Tier: Alerting. Alerts shown: messages delivered rate, messages in queue, too many resources, error rate.
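Alerting policies with multiple notification channels can be sketched as data plus a small evaluation loop. The metric names echo the alerts listed above, but the thresholds, the `AlertPolicy` class and the list-based channel are assumptions made for illustration; a cloud monitoring service would normally own this logic.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertPolicy:
    name: str
    metric: str
    threshold: float
    breached: Callable[[float, float], bool]  # e.g. operator.gt or operator.lt


def evaluate(policies, metrics, channels):
    """Notify every channel for each policy whose metric breaches its threshold."""
    fired = []
    for policy in policies:
        value = metrics.get(policy.metric)
        if value is not None and policy.breached(value, policy.threshold):
            message = f"{policy.name}: {policy.metric}={value} (threshold {policy.threshold})"
            for notify in channels:
                notify(message)
            fired.append(policy.name)
    return fired


# Hypothetical thresholds.
policies = [
    AlertPolicy("High error rate", "error_rate", 0.05, operator.gt),
    AlertPolicy("Queue backlog", "messages_in_queue", 1000, operator.gt),
    AlertPolicy("Delivery stalled", "messages_delivered_rate", 10.0, operator.lt),
]
metrics = {"error_rate": 0.07, "messages_in_queue": 120, "messages_delivered_rate": 50.0}

inbox = []  # stands in for a chat or e-mail channel
print(evaluate(policies, metrics, channels=[inbox.append, print]))
```

Keeping policies as plain data means they can live in version control next to the infrastructure code, in line with the CI/CD + IaC point of the deck.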
  • 98. Monitoring & Automation Tier: CI/CD & IaC. Jenkins (Pipelines), Cloud Deployment Manager, Cloud Build, GitHub Actions.
  • 101. Monitoring & Automation Tier: CI/CD & IaC. Spotless, Code Analysis, PR Analysis.
  • 104. Monitoring & Automation Tier: CI/CD & IaC. YAML Schema:
      info:
        title: Buckets Template
        description: Creates buckets in GCS with the correct life cycle rules and security constraints
        version: 1.0
      properties:
        name:
          type: string
          description: Name property of the bucket standard `ts-<env>-dataops-<name>`
          pattern: ^[a-z0-9]+$
        labels:
          type: object
          description: User-provided bucket labels, in key/value pairs.
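The schema above constrains the `name` property with the pattern `^[a-z0-9]+$` and describes the naming standard `ts-<env>-dataops-<name>`. The sketch below applies both in plain Python to show what the constraint accepts and rejects. The function `bucket_name` and the `prod` environment value are hypothetical; in the deck the validation is performed by the deployment tooling, not by code like this.

```python
import re

# Same pattern as the `name` property in the schema.
NAME_PATTERN = re.compile(r"^[a-z0-9]+$")


def bucket_name(env: str, name: str) -> str:
    """Validate `name` and build the standard bucket name."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid bucket name part: {name!r}")
    return f"ts-{env}-dataops-{name}"


print(bucket_name("prod", "batch"))

try:
    bucket_name("prod", "Batch_Files")
except ValueError as error:
    print(error)
```

Rejecting uppercase letters and separators at template level keeps every bucket name predictable, which makes lifecycle and security rules easier to apply uniformly.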
  • 106. Monitoring & Automation Tier: CI/CD & IaC. Jinja Template. [Screenshot]
  • 107. Monitoring & Automation Tier: CI/CD & IaC. Jinja/YAML Template:
      - name: ts-dataops-batch
        type: buckets.jinja
        properties:
          name: batch
          environment: {{ properties['environment'] }}
          labels:
            description: temporary_files_to_insert
            data_type: raw
          deleteRule:
            age: 7
            isLive: true
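The `deleteRule` in the template above (age 7, isLive true) reads like input for a Cloud Storage lifecycle rule. As a hedged sketch, the function below turns such a dictionary into the lifecycle structure used by the Cloud Storage JSON API, where a rule has an `action` of type `Delete` and a `condition` with `age` and `isLive`. How the actual `buckets.jinja` template performs this mapping is not shown in the deck, so the function name and the mapping itself are assumptions.

```python
def lifecycle_from_delete_rule(delete_rule: dict) -> dict:
    """Build a bucket lifecycle configuration with a single Delete rule."""
    condition = {"age": delete_rule["age"]}  # age is expressed in days
    if "isLive" in delete_rule:
        condition["isLive"] = delete_rule["isLive"]
    return {"rule": [{"action": {"type": "Delete"}, "condition": condition}]}


# Values taken from the template on the slide.
delete_rule = {"age": 7, "isLive": True}
print(lifecycle_from_delete_rule(delete_rule))
```

With this rule, live objects older than seven days are deleted automatically, which fits a bucket holding temporary files.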