Scaling Your Data: Data Democratisation and DataOps

Juan Urrego
March 2020

Juan Urrego
DataOps Team Lead
YOU-Anne
JOO-Anne
ONE
Who-Anne
Colombia
Columbia

Who is Travelstart?
SOUTH AFRICA
UAE TAIPEI
EGYPT
NIGERIA
TANZANIA
KENYA
MIDDLE EAST ASIA AFRICA EUROPE
ATHENS
PORTO
Travelstart is a leading online travel agency that helps
today’s business and leisure travellers search, compare
and book the best flight, hotel and car options with all
your favourite airline and accommodation suppliers.
Based in Cape Town, Travelstart proudly employs a team of
more than 850 staff from all walks of life who are
passionate about one goal - making travel simple.

What Are We Doing?
Context
“At Google, for example, nearly 80% of
employees use Dremel (the internal
counterpart to Google BigQuery) every
month...everyone touches data on a
regular basis to inform their decisions”
Valliappa Lakshmanan (2018)

What Are We Doing?
Context
https://www.statista.com/statistics/219333/number-of-google-employees-by-department/
~38%
~62%

“Ask someone a question and you are likely
to receive a link to a BigQuery view or
query rather than the actual answer”
Valliappa Lakshmanan (2018)

How Did It Start?
Context
● 2014: Inception
○ Classic BI architecture (pull-based)
○ Designed to supply a short-term need
● 2018: Shit Hits the Fan
○ Non-performant
○ No test environments
○ No version control
○ Users don’t trust the data
● 2019: DataOps Approach
○ Event-Driven
○ Best practices of Software Engineering
○ Remove the BI bottleneck
○ DIY -> Data Democratisation

Data Process Generations
DataOps
https://martinfowler.com/articles/data-monolith-to-mesh.html
● First Generation: Proprietary enterprise data warehouse and business intelligence platforms
● Second Generation: Big Data ecosystem with a data lake as a silver bullet
● Third Generation: Event-Driven, Near Real Time and Cloud Based

● Data Mesh: Embracing ubiquitous data
○ Distributed Domain Driven Architecture (DDD), Self-serve Platform Design, and Product Thinking with Data
○ Data is discoverable, addressable, trustworthy & truthful, self-describing semantics & syntax, inter-operable & governed by global standards, domain data cross-functional teams

What is DataOps?
DataOps
https://www.gartner.com/en/information-technology/glossary/dataops
“DataOps is a collaborative data management
practice focused on improving the communication,
integration and automation of data flows
between data managers and data consumers
across an organization. The goal of DataOps is to
deliver value faster by creating predictable
delivery and change management of data, data
models and related artifacts”
Gartner Glossary

“DataOps uses technology to automate the design,
deployment and management of data delivery with
appropriate levels of governance, and it uses metadata to
improve the usability and value of data in a dynamic
environment.”
Gartner Glossary

[Diagram: DataOps in relation to DevOps, Agile and Lean]

https://www.saagie.com/blog/dataops-devops-2-0/
● A few practices are common to DevOps and DataOps:
○ Automation (CI/CD)
○ Unit tests (code coverage is relevant)
○ Environment management
○ Version management
○ Monitoring

DataOps at Travelstart
DataOps
● Values: Do It Yourself (DIY), Data democratisation, Data-Driven Decision Making, Don’t Repeat Yourself (DRY)
● Quality Attributes: Data Quality, Availability, Usability, Reliability, Performance, Flexibility
● Framework: Data Lakes, IaC & CI/CD, Data Dictionary, Data Catalog, Data Pipelines, Data Warehouse
● Descriptive, Predictive, Prescriptive

Reference Model
The Architecture
● Lambda ~ Kappa Architecture + Event Driven Architecture
● Main Architectural Pattern: Publisher-Subscriber
● Batch Layer: immutable master data, batch recompute, precomputed views
● Speed Layer: new data stream, stream processing, real-time increment of views
● Serving Layer: batch views (View 1 ... View N) and real-time views (View 1 ... View N), both available to queries
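
The three layers of the reference model can be sketched roughly as follows. This is a minimal illustration, not Travelstart's implementation: the function names, the "route" field and the booking-count view are all hypothetical.

```python
from collections import Counter

def batch_recompute(master_data):
    """Batch layer: recompute the whole view from the immutable master data."""
    return Counter(event["route"] for event in master_data)

def realtime_increment(view, event):
    """Speed layer: increment the real-time view with a single new event."""
    view[event["route"]] += 1
    return view

def query(batch_view, realtime_view, route):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view[route] + realtime_view[route]

# Hypothetical data: three events already in master data, one still streaming in.
master_data = [{"route": "CPT-JNB"}, {"route": "CPT-JNB"}, {"route": "LOS-NBO"}]
batch_view = batch_recompute(master_data)
realtime_view = realtime_increment(Counter(), {"route": "CPT-JNB"})
total = query(batch_view, realtime_view, "CPT-JNB")
```

When the next batch recompute absorbs the streamed events, the real-time view can be reset, which is why the master data must stay immutable.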

Reference Architecture
The Architecture
● Source Tier: External, Internal
● Aggregation Tier
● Distribution Tier: Raw, Clean
● Transform Tier: Ingestion, Integration, Quality
● Serving Tier
○ Raw Data: Data Lakes, Dead Letters, Backups
○ OLAP: Data Marts, Ubiquitous Data, Real Time Views
○ Analytics
● Monitoring & Automation Tier: Alerting, Performance Monitoring, Dashboarding, CI/CD

Source Tier
Solution Architecture
https://github.com/cloudevents/spec/blob/v1.0/spec.md
● All internal and/or external services, components, instances or processes that behave as publishers/producers of events
● An event is a data record expressing an occurrence and its context.
○ Slowly Changing Dimensions (SCD)
■ Type 1: Last state
■ Type 2: Complete state
■ Type 3: What changed
● According to CloudEvents 1.0, an event MUST have:
○ Unique ID
○ Source
○ SpecVersion
○ Type
● According to CloudEvents 1.0, it is NICE to have:
○ DataContentType
○ DataSchema
○ Subject
○ Time
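
As a rough illustration of the attribute lists above, the sketch below builds an event as a plain dictionary using the CloudEvents 1.0 attribute names (id, source, specversion, type, plus the optional ones) and checks the required ones. The helper names, the source path and the event type are assumptions made for the example; no CloudEvents SDK is used.

```python
import uuid
from datetime import datetime, timezone

# Attribute names as defined by the CloudEvents 1.0 specification.
REQUIRED = ("id", "source", "specversion", "type")
OPTIONAL = ("datacontenttype", "dataschema", "subject", "time")

def make_event(source, event_type, data, subject=None):
    """Build an event dictionary carrying the required and some optional attributes."""
    event = {
        "id": str(uuid.uuid4()),
        "source": source,
        "specversion": "1.0",
        "type": event_type,
        "datacontenttype": "application/json",
        "time": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    if subject is not None:
        event["subject"] = subject
    return event

def missing_required(event):
    """Return the required attributes that are absent or empty."""
    return [name for name in REQUIRED if not event.get(name)]

# Hypothetical source and type values.
event = make_event("/backend/search", "com.example.search_event", {"from": "CPT"})
```

A producer could call missing_required before publishing and refuse to send anything that returns a non-empty list.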

Source Tier
Event Granularity
● Coarse-grained vs Fine-grained
● Aggregation Required
● Extreme: Field or sub-property (Phone, Email)
● Extreme: Domain Event (SearchEvent)
● Composition vs Aggregation
34
Architecture
Source Tier
Format, Compression and Codecs
Plain Text
Compression: GZIP, BZIP2, LZ4
Binary
Codec: Deﬂate, Snappy, LZO

Source Tier
Travelstart Event Template
● Main Format: JSON
● Compression: GZIP
● JSON Minify + GZIP -> ~96% size reduction
Header
Type: search_event
Format: json
System: backend
UUID: 34c86044-4b73-46ec-b766
Ip: 10.10.0.1
InstanceId: backend-01
Hostname: mybackend.travelstart
Compression: gzip
Timestamp: 1581749960349
Payload
{
"uuid": "34c86044-4b73-46ec-b766",
"timestamp": 1581749960349,
….
}
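
A minimal sketch of how a producer might assemble such a message: the payload is minified JSON compressed with GZIP, and the header carries the metadata a consumer needs to decode it. The function names are hypothetical and only a subset of the header fields is filled in; the actual size reduction depends on the payload.

```python
import gzip
import json

def pack_payload(payload):
    """Minify the JSON (no extra whitespace) and compress it with GZIP."""
    minified = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return gzip.compress(minified)

def unpack_payload(blob):
    """Reverse of pack_payload, as a subscriber would do."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

def build_message(payload, event_type, system, instance_id):
    """Return (header, compressed body) following the template's field names."""
    header = {
        "Type": event_type,
        "Format": "json",
        "System": system,
        "UUID": payload["uuid"],
        "InstanceId": instance_id,
        "Compression": "gzip",
        "Timestamp": payload["timestamp"],
    }
    return header, pack_payload(payload)

payload = {"uuid": "34c86044-4b73-46ec-b766", "timestamp": 1581749960349}
header, body = build_message(payload, "search_event", "backend", "backend-01")
```

Keeping Format and Compression in the header lets a subscriber pick the right decoder without inspecting the body.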

Aggregation Tier
● All the services and components that aggregate Atomic Events into a Domain Event
● This aggregation can happen:
○ Inside the source
■ In memory
■ Using storage
○ Outside the source
■ Dedicated middleware
■ Ingestion or Integration pipelines
● External sources use dedicated endpoints that receive the data, set up the canonical structure and send it to the topic
○ Serverless functions
○ Stateless containers
● Schedulers can serve as triggers to start the ETL
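
The in-memory option can be pictured with the sketch below: atomic (field-level) events are buffered per session and folded into one domain event when the session closes. The class name, the event fields and the "latest value wins" rule are assumptions made for illustration.

```python
from collections import defaultdict

class SessionAggregator:
    """Buffers atomic events in memory and emits one domain event per session."""

    def __init__(self):
        self._buffer = defaultdict(list)

    def add(self, atomic_event):
        """Store an atomic event under its session key."""
        self._buffer[atomic_event["session"]].append(atomic_event)

    def close(self, session):
        """Fold the buffered events into a domain event; the latest value wins."""
        events = self._buffer.pop(session, [])
        if not events:
            return None
        fields = {}
        for event in sorted(events, key=lambda e: e["timestamp"]):
            fields[event["field"]] = event["value"]
        return {"type": "SearchEvent", "session": session, "fields": fields}

aggregator = SessionAggregator()
aggregator.add({"session": "s1", "timestamp": 2, "field": "email", "value": "new@example.com"})
aggregator.add({"session": "s1", "timestamp": 1, "field": "email", "value": "old@example.com"})
aggregator.add({"session": "s1", "timestamp": 3, "field": "phone", "value": "555-0100"})
domain_event = aggregator.close("s1")
```

The trade-off of buffering in memory is that events are lost if the process dies before the session closes, which is where the storage-backed variant comes in.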

Aggregation Tier
Areas of aggregation
[Diagram: areas where aggregation can happen: Source System, Aggregation System, and Distribution & Transformation. Labelled elements: Producer 1, Producer 2, Producer n, Event Producer, Map & Aggregator, Windowing [session], Persistence, Event Delivery, Read Replica(s) or API, DWH/DL Read Replica(s), Orchestrator. Axis: Complexity]

Distribution Tier
● All the Topics/Queues in charge of receiving raw and clean events and distributing them to multiple subscribers
● Topic per event type
● Subscription per subscriber
● Communication between producers and topics can fail, so keep in mind:
○ Retries
○ Graveyard (dead letters)
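
One way to combine retries with a graveyard is sketched below: the producer retries with exponential backoff and, once the attempts are exhausted, parks the event in a dead-letter store instead of dropping it. The publish callable, the list used as the dead-letter store and the default limits are all assumptions; a real setup would use the messaging service's own client.

```python
import time

def publish_with_retries(publish, event, dead_letters,
                         max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Try to publish; back off exponentially; dead-letter the event on failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            publish(event)
            return True
        except ConnectionError as error:
            last_error = error
            if attempt < max_attempts:
                # Wait 0.1s, 0.2s, 0.4s, ... between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    dead_letters.append(
        {"event": event, "error": str(last_error), "attempts": max_attempts}
    )
    return False
```

Passing sleep as a parameter keeps the function easy to test, and the dead-letter record keeps enough context for a quality pipeline to investigate later.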

Transformation Tier
● Stream/Batch Data Pipelines in charge of receiving domain events, processing and persisting them
● A Pipeline can be split by domain and/or resource consumption
● 3 different categories:
○ Ingestion
○ Integration
○ Quality
■ Data Lineage and Data Provenance
● Serverless functions can be used here, but be careful with them!

Transformation Tier
Ingestion Pipelines
In charge of receiving the raw events and processing, cleaning, enriching and validating their quality
- Uncompress
- Standardize
- Cleansing
- Enrichment
- Schematization
- Quality Checks
- Alerting
Directed Acyclic Graph (DAG) steps: Canonize Events, Process, Quality Assurance, Persist Raw (Data Lake), Persistence, Publish Clean Event (RAW in, CLEAN out)
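
The ingestion steps can be pictured with the following sketch, which persists the raw message, uncompresses and standardizes it, runs a non-null quality check and either publishes the clean event or sends the message to the dead letters. Plain lists stand in for the data lake, the clean topic and the dead-letter store, and the required field names are hypothetical.

```python
import gzip
import json
import zlib

REQUIRED_FIELDS = ("uuid", "timestamp")  # hypothetical non-nullable fields

def uncompress(raw):
    """Decode a GZIP-compressed JSON message."""
    return json.loads(gzip.decompress(raw).decode("utf-8"))

def standardize(event):
    """Normalise key names so downstream steps see one convention."""
    return {key.strip().lower(): value for key, value in event.items()}

def quality_check(event):
    """Return the required fields that are missing or null."""
    return [name for name in REQUIRED_FIELDS if event.get(name) is None]

def ingest(raw, data_lake, clean_topic, dead_letters):
    """Persist raw, then clean and validate; failures go to the dead letters."""
    data_lake.append(raw)
    try:
        event = standardize(uncompress(raw))
    except (OSError, EOFError, ValueError, zlib.error) as error:
        dead_letters.append({"raw": raw, "reason": str(error)})
        return None
    failed = quality_check(event)
    if failed:
        dead_letters.append({"raw": raw, "reason": f"null fields: {failed}"})
        return None
    clean_topic.append(event)
    return event
```

Persisting the raw message first means a bug in a later step never loses data: the event can always be replayed from the data lake.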

Transformation Tier
Ingestion Pipelines
[Figures: pipeline examples, including per-column (Col 1, Col 2, Col 3) quality checks and metadata]

Transformation Tier
Integration Pipelines
In charge of integrating multiple sources to create consolidated data marts
Directed Acyclic Graph (DAG) steps: Identify Integration Type, Read Event Sources, Join Sources, Persistence (RAW/CLEAN)

Transformation Tier
Quality Pipelines
In charge of reading dead letters, running reports to analyse quality issues and, where possible, bringing dead/lost events back to life (necromancer)
Directed Acyclic Graph (DAG) steps: Identify Quality Check, Read Event Sources, Run Quality Check, Report (CLEAN)
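
A quality pipeline of this kind could be sketched as two small steps: a report that counts dead letters per failure reason, and a "necromancer" step that republishes whatever a repair function manages to fix. The record layout and the repair callback are assumptions for the example.

```python
from collections import Counter

def quality_report(dead_letters):
    """Count dead letters per failure reason, most frequent first."""
    return Counter(letter["reason"] for letter in dead_letters).most_common()

def resurrect(dead_letters, repair, clean_topic):
    """Republish the events that repair() can fix; return the ones still dead."""
    still_dead = []
    for letter in dead_letters:
        event = repair(letter)
        if event is None:
            still_dead.append(letter)
        else:
            clean_topic.append(event)
    return still_dead

def fill_reference(letter):
    """Hypothetical repair: derive a missing 'reference' from the booking id."""
    event = dict(letter["event"])
    if letter["reason"] == "reference is null" and event.get("bookingId"):
        event["reference"] = f"REF-{event['bookingId']}"
        return event
    return None
```

The report tells the team which quality rule fails most often, while resurrect keeps recoverable events from being lost for good.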

Transformation Tier
Quality Pipelines
com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'bookingUuid' can't be null | at com.travelstart.bi.etl.pipelines.booking.transforms.TsMessageToBookingFn.processElement(TsMessageToBookingFn.java:152) : 821270
com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'reference' can't be null | at
com.travelstart.bi.etl.obqm.exceptions.NonNullableFieldException: The field 'bookingId' can't be null | at

Serving Tier
● Stores all the raw and clean data so that it is accessible to our stakeholders
● About raw data:
○ Data Lakes
○ Dead Letters
○ Backups
● About OLAP:
○ Data Marts
○ Ubiquitous Data / Data Sourcing
○ Real Time Views
● About Analytics:
○ Whatever makes you happy (with data governance, of course)

Serving Tier
Data Lakes, Dead Letters & Backups
● Plain Text / Schema-less. Compression: GZIP, BZIP2, LZ4
● Binary / Schema. Codec: Deflate, Snappy, LZO, LZIP, GZIP

Serving Tier
● Standard/Hotline
● Nearline: min duration 30 days, 43% cost reduction
● Coldline: min duration 90 days, 69% cost reduction
● Archive: min duration 365 days, 89% cost reduction
● Deletion Cost
● Other marks on the slide: 180 days, 365 days

Serving Tier
Data Lakes vs Data Warehouses
● Data Lake
○ Raw data
○ It’s dirty
○ Transactional structure, not analytical
● Data Warehouse
○ Clean/Skew data
○ Quality processes applied
○ Schema defined for analytical purposes

Serving Tier
Data Marts, Event Sourcing and Real Time Views
● HyperCube operations: Pivot, Drilldown, Roll Up, Dicing, Slicing
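
For readers less familiar with the cube operations named above, here is a small sketch over an in-memory fact table. The facts, dimension names and helper functions are invented for illustration; a data mart would express the same operations in SQL.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions and one measure.
FACTS = [
    {"country": "ZA", "year": 2019, "product": "flight", "bookings": 5},
    {"country": "ZA", "year": 2019, "product": "hotel", "bookings": 2},
    {"country": "ZA", "year": 2020, "product": "flight", "bookings": 3},
    {"country": "NG", "year": 2019, "product": "flight", "bookings": 4},
]

def slice_cube(facts, dimension, value):
    """Slicing: fix a single dimension to one value."""
    return [fact for fact in facts if fact[dimension] == value]

def dice_cube(facts, **allowed):
    """Dicing: keep facts whose dimensions fall inside the given value sets."""
    return [
        fact for fact in facts
        if all(fact[dim] in values for dim, values in allowed.items())
    ]

def roll_up(facts, dimensions, measure):
    """Roll up: total the measure over the given (coarser) set of dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[dim] for dim in dimensions)] += fact[measure]
    return dict(totals)

by_country = roll_up(FACTS, ["country"], "bookings")
by_country_year = roll_up(FACTS, ["country", "year"], "bookings")  # drill-down
```

Drilling down is simply rolling up over more dimensions, which is why by_country_year uses the same function with a finer grouping.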

Serving Tier
● More than 300 columns
● Includes facts and aggregations
● Mostly follows star modelling and denormalisation

Serving Tier
● Views
● Materialised Views

Monitoring & Automation Tier
● The command and control area of the whole architecture
● It needs to monitor everything related to data ingestion, processing and infrastructure
● Alerting:
○ Policies
○ Multiple notification channels
● Performance and Stability Monitoring
○ Bottlenecks, bugs, etc.
○ Slow steps, underutilized infrastructure, memory issues, etc.
● CI/CD + IaC
○ Development and deployment lifecycle completely automated
○ Any architectural change should be tracked through version control
● Dashboards!

Alerting
● Messages delivered rate
● Messages in queue
● Too Many Resources
● Error Rate
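
The idea of policies plus multiple notification channels can be sketched as below. The metric names, thresholds and channel callables are made up for the example; in practice these would be defined in the cloud provider's monitoring service.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    name: str
    metric: str
    breached: Callable[[float], bool]

# Hypothetical thresholds for the four alert types listed above.
POLICIES = [
    Policy("Low delivery rate", "messages_delivered_rate", lambda v: v < 10),
    Policy("Queue backlog", "messages_in_queue", lambda v: v > 1000),
    Policy("Too many resources", "cpu_utilisation", lambda v: v > 0.9),
    Policy("High error rate", "error_rate", lambda v: v > 0.05),
]

def evaluate(metrics, policies, channels):
    """Fire every breached policy on every notification channel."""
    fired = []
    for policy in policies:
        value = metrics.get(policy.metric)
        if value is not None and policy.breached(value):
            fired.append(policy.name)
            for notify in channels:
                notify(f"{policy.name}: {policy.metric}={value}")
    return fired
```

A channel here is any callable that accepts a message, so email, chat and paging integrations can be plugged in side by side.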

CI/CD & IaC
● Jenkins (Pipelines)
● Cloud Deployment Manager
● Cloud Build
● GitHub Actions

CI/CD & IaC
● Spotless Code Analysis
● PR Analysis

CI/CD & IaC
YAML Schema
info:
  title: Buckets Template
  description: Creates buckets in GCS with the correct life cycle rules and security constraints
  version: 1.0
properties:
  name:
    type: string
    description: Name property of the bucket standard `ts-<env>-dataops-<name>`
    pattern: ^[a-z0-9]+$
  labels:
    type: object
    description: User-provided bucket labels, in key/value pairs.

CI/CD & IaC
Jinja/YAML Template
- name: ts-dataops-batch
  type: buckets.jinja
  properties:
    name: batch
    environment: {{ properties['environment'] }}
    labels:
      description: temporary_files_to_insert
      data_type: raw
    deleteRule:
      age: 7
      isLive: true
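
The naming convention that the schema describes can be checked before deployment with a few lines of code. This is only a sketch: the helper name and the "prod" environment value are assumptions, while the pattern and the `ts-<env>-dataops-<name>` standard come from the schema shown earlier.

```python
import re

# Pattern copied from the bucket schema's `name` property.
NAME_PATTERN = re.compile(r"^[a-z0-9]+$")

def bucket_name(env, name):
    """Validate the short name and expand it to ts-<env>-dataops-<name>."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return f"ts-{env}-dataops-{name}"

full_name = bucket_name("prod", "batch")
```

Running the same check in CI means a malformed name is rejected in the pull request rather than at deployment time.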

Questions
juancho088
@juancho088
@tech-travelstart
juan-urrego
jsurrego
We love talent.
We are recruiting!
