2. Same same but different
Data
Engineering
Data Systems
3. Our mission
To deliver a Data Platform
that empowers both Data Creators and Data Consumers,
maximising the capability of Coolblue
to keep our Customers smiling
53. Cookie master chef
Cindy Cressot
● From France
● Has a little daughter
● PhD in Applied Sciences
● Interests: Big Data, Spark
● Data Engineer
● Coolblue since April 2018
54. First things first
Agenda
● Context
● Challenges
● Method (PySpark)
● A step further with Big Data
● Key takeaways
66. Why?
● Each customer has 1 Coolblue account
● 1 account is linked to 1 email address
● 1 email address represents 1 person
○ However, customers can update their email address
○ And customers can share the same account (households)
● All the different scenarios with different attributes (email, address, phone,
etc.) make it harder to recognize the customer.
67. Let's start with customers and emails

customer_id  email
1            a@a
2            a@a
3            b@b
4            c@c
68. Step 1
● PARTITION customers BY email
● ORDER BY registration_date
● promote the FIRST customer as MASTER
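A minimal plain-Python sketch of this step (the talk implements it with a PySpark window function; the sample rows and registration dates below are invented for illustration):

```python
from collections import defaultdict

# Sample rows; registration dates are invented for illustration.
customers = [
    {"customer_id": 1, "email": "a@a", "registration_date": "2017-01-10"},
    {"customer_id": 2, "email": "a@a", "registration_date": "2018-03-05"},
    {"customer_id": 3, "email": "b@b", "registration_date": "2017-06-20"},
    {"customer_id": 4, "email": "c@c", "registration_date": "2019-02-14"},
]

def assign_masters(rows):
    """PARTITION BY email, ORDER BY registration_date,
    promote the first customer in each partition as master."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["email"]].append(row)
    result = []
    for group in partitions.values():
        group.sort(key=lambda r: r["registration_date"])
        master = group[0]["customer_id"]
        result.extend({**r, "master_customer_id": master} for r in group)
    return sorted(result, key=lambda r: r["customer_id"])
```

In PySpark this would be a `Window.partitionBy("email").orderBy("registration_date")` with `first("customer_id")` over it; the grouping and sorting above mirror that logic.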
69. Group customers by email

customer_id  email  master_customer_id
1            a@a    1
2            a@a    1
3            b@b    3
4            c@c    4

Customers 1, 3 and 4 are the masters of their groups.
70. Dealing with updates
What if customers 1 and 3 update their email addresses?
● We would need to keep track of all updates to their email addresses.
E.g.: a@a changed email to b@b
76. Group customers by email

customer_id  email  master_customer_id
2            a@a    1
1            a@a    1
1            b@b    1
3            b@b    1
3            c@c    3
4            c@c    3

What about this? Customer 3 now has two different masters (1 and 3).
77. Step 2 (a bit different)
● PARTITION master_customer_ids BY customer_id
● ORDER BY master_customer_id
● promote the FIRST master_customer_id as deduplicated_customer_id
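A plain-Python sketch of this second window step (again a PySpark window function in the talk), applied to the step-1 output of the running example:

```python
from collections import defaultdict

# Step-1 output from the running example: customer 3 ended up with
# two different masters (1 via b@b, 3 via c@c).
rows = [
    {"customer_id": 2, "email": "a@a", "master_customer_id": 1},
    {"customer_id": 1, "email": "a@a", "master_customer_id": 1},
    {"customer_id": 1, "email": "b@b", "master_customer_id": 1},
    {"customer_id": 3, "email": "b@b", "master_customer_id": 1},
    {"customer_id": 3, "email": "c@c", "master_customer_id": 3},
    {"customer_id": 4, "email": "c@c", "master_customer_id": 3},
]

def assign_dedup_ids(rows):
    """PARTITION BY customer_id, ORDER BY master_customer_id,
    promote the first master as the deduplicated_customer_id."""
    masters = defaultdict(list)
    for row in rows:
        masters[row["customer_id"]].append(row["master_customer_id"])
    # Ordering by master_customer_id and taking the first is a min().
    dedup = {cid: min(ids) for cid, ids in masters.items()}
    return [{**r, "deduplicated_customer_id": dedup[r["customer_id"]]}
            for r in rows]
```

Customer 3's two masters (1 and 3) collapse to deduplicated id 1, matching the final table in the slides.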
78. Group Master by Customer

customer_id  email  master_customer_id  deduplicated_customer_id
2            a@a    1                   1
1            a@a    1                   1
1            b@b    1                   1
3            b@b    1
3            c@c    3
4            c@c    3                   3
79. Group Master by Customer

customer_id  email  master_customer_id  deduplicated_customer_id
2            a@a    1                   1
1            a@a    1                   1
1            b@b    1                   1
3            b@b    1                   1
3            c@c    3
4            c@c    3                   3
80. Group Master by Customer

customer_id  email  master_customer_id  deduplicated_customer_id
2            a@a    1                   1
1            a@a    1                   1
1            b@b    1                   1
3            b@b    1                   1
3            c@c    3                   1
4            c@c    3                   3
90. Conclusion
● We only saw 2 iterations
○ But we could have multiple depths of relationships.
○ We could have more attributes, like address or phone number.
● We had to deal with multiple columns
○ customer_id, master_id, dedup_id
● Are you confused?
○ It was already hard to explain and to understand.
● Is there a different way to see it?
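One different way to see it: treat customers and attribute values as nodes of a graph and deduplicated customers as connected components. A hedged union-find sketch on the running example (the helper names are ours, not the talk's):

```python
def find(parent, x):
    # Find the root of x, halving paths along the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(edges):
    """Union customers that share any attribute value (emails here;
    addresses or phone numbers would simply add more edges)."""
    parent = {}
    for customer, attribute in edges:
        for node in (customer, attribute):
            parent.setdefault(node, node)
        root_a, root_b = find(parent, customer), find(parent, attribute)
        if root_a != root_b:
            parent[root_b] = root_a
    return parent

# Running example: (customer_id, email) pairs, including the updates.
edges = [(1, "a@a"), (2, "a@a"), (1, "b@b"), (3, "b@b"),
         (3, "c@c"), (4, "c@c")]
parent = components(edges)
roots = {c: find(parent, c) for c in (1, 2, 3, 4)}
```

With the full transitive closure, all four customers land in one component: the chain 1-b@b-3-c@c-4 links customer 4 to customer 1, which the two window-function iterations above only resolved partway. This is exactly the "multiple depths of relationships" problem.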
107. Data Systems | Migrating to the Cloud: Our Journey | 23-05-2019
108. Flipping tables
Data Systems Team
● We work with data, a lot of data!
● Keep data clean and centralized in our Data Warehouse
● Create data pipelines in Airflow
● Support the semantic layer using OLAP cubes
109. The Python Jedi
Gwildor Sok
● Data Engineer at Coolblue since July 2017
● Started with Python in 2012
● Game and full stack Web development
before moving to Data Engineering
110. They call me the cube guy
André Santos
● Business Intelligence Engineer at Coolblue since April 2018
● Experienced in the Microsoft BI stack
● Learning to love open source
● OLAP! OLAP everywhere!
113. Data Center Architecture
[Diagram] Data Sources 1-3 and External Systems → Staging → Data Warehouse → Data Mart 1 and Data Mart 2

114. Data Center Architecture
[Diagram] The same flow, with Data Mart 1 feeding OLAP 1 and Data Mart 2 feeding OLAP 2

115-118. Data Center Architecture
[Diagram] Azkaban orchestrates every step of the flow above
119. Migration Steps
● Move OLAP Server from the Data Center to the Cloud
● Move SQL Server from the Data Center to the Cloud
● Move away from Azkaban and adopt Airflow
● Create data validation mechanisms
121. Azkaban vs Airflow
Azkaban
● Task configuration separate from code
● Interface is not great for browsing historical runs
● Hard to rerun individual tasks
Airflow
● Templated arguments, like SQL queries
● A lot of building blocks designed for data engineering work
● Dates as first-class citizens; easily trigger historical runs
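Airflow's templated arguments let an operator render runtime macros, such as the execution date, into a SQL query before each run. A minimal stdlib sketch of the idea (not Airflow's actual Jinja templating; the query is hypothetical):

```python
from string import Template

# Hypothetical daily query; $ds stands in for Airflow's {{ ds }} macro.
QUERY = Template("SELECT * FROM orders WHERE order_date = DATE '$ds'")

def render(execution_date):
    """Render the templated query for one (possibly historical) run."""
    return QUERY.substitute(ds=execution_date)
```

Because the date is a parameter of the run rather than hardcoded, triggering a historical run is just rendering and executing the same query with an older date.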
126. Our new daily process(es!)
Load to staging area (325 tasks) → Heavy calculations (50 tasks) → Loading data warehouse tables (140 tasks) → Process semantic layer (20 tasks)
127. Checkpoints approach
Advantages
● Easier to read and reason about
● Maintainable
● Separate logical units
● Less dependency management
● Easier to test
Disadvantages
● Interdependency checks can fail, blocking the next step
● More code because of the interdependency checks
● Generally slower
129. Old code
Most code looked like this:

def main():
    result = run_query('some_query.sql')
    filename = create_csv(result)
    upload_to_gcs(filename, GCS_FILENAME)
    load_to_mssql(GCS_FILENAME, MSSQL_TABLE)
130. New code
Now we have this:

OracleToGCSOperator(
    sql='some_query.sql',
    gcs_location=GCS_FILENAME)

GCSToMSSQLOperator(
    gcs_location=GCS_FILENAME,
    mssql_table=MSSQL_TABLE)
131. Advantages
● Configuration as code
○ Easier to read
○ Very easy to test
● Less code to maintain
○ Written and maintained by Airflow contributors
○ Custom code is now the exception instead of the default
● Quicker to create new pipelines
132. Summary
Airflow
● All configuration now in code
● Building blocks for faster pipeline development
● A lot less code
● Manageable daily process
134. Migration Steps
● Move OLAP Server from the Data Center to the Cloud
● Move SQL Server from the Data Center to the Cloud
● Move away from Azkaban and adopt Airflow
● Create data validation mechanisms
136. SQL Server(less)
Amazon Relational Database Service (RDS)
● Simple to set up and configure
● Supports multiple database providers
● Patching the database software, backing up databases and other DBA tasks are managed by AWS itself
137. SQL Server(less)
Step by Step - SQL Server Migration to Cloud
1. New SQL Server Instance on RDS
2. Deploy DW onto new Instance
3. Populate historical tables
4. Configure daily ETL in Airflow
5. Data Validation tools
138. SQL Server(less)
Step by Step: New SQL Server Instance on RDS

MyDB:
  Type: "AWS::RDS::DBInstance"
  Properties:
    AllocatedStorage: "100"
    DBInstanceClass: db.m1.small
    Engine: sqlserver-se
    EngineVersion: "14.00.3015.40.v1"
139. SQL Server(less)
Step by Step: Deploy DW onto new Instance (TeamCity deployment)
140. SQL Server(less)
Step by Step: Populate historical tables
141. SQL Server(less)
Step by Step: Configure daily ETL in Airflow (Data Sources 1-3 → ETL → DW)
143. SQL Server(less)
Step by Step: Data Validation tools (Apache Beam, NBi)
144. Summary
SQL Server on RDS
● We can easily scale our instance
● No server maintenance
● All configuration in code (CloudFormation) facilitates maintenance
● The backup mechanism offered by AWS has some limitations
146. Migration Steps
● Move OLAP Server from the Data Center to the Cloud
● Move SQL Server from the Data Center to the Cloud
● Move away from Azkaban and adopt Airflow
● Create data validation mechanisms
147. OLAP Server
What is an OLAP database?
● OLAP stands for OnLine Analytical Processing
● An OLAP database is a multi-dimensional array of data, commonly referred to as a "cube"
● This technology is used to speed up query processing on the data warehouse
148. OLAP Server
OLAP on top of the Data Warehouse
[Diagram] Data Warehouse → OLAP cube → Report 1, Report 2, ... Report N
153. OLAP Server
Main Challenges
No support for our OLAP technology, which means:
● Owning and supporting our own VM (EC2)
● Configuring the VM using "code" (no UI on Windows Server Core)
160. OLAP Server
Integrate OLAP Server with Airflow
[Diagram] Weekly cube partitions: 2019W04, 2019W03, 2019W02, ...

161-162. OLAP Server
Integrate OLAP Server with Airflow
[Diagram] Airflow tasks Create Partition and Process Partition manage the weekly partitions

163. OLAP Server
Integrate OLAP Server with Airflow and... USERS
[Diagram] The same weekly partitions, now also queried by users
164. Summary
OLAP Server on EC2
● We can easily scale our instance
● Infrastructure as Code facilitates maintenance
● Easy to rebuild the machine if it gets corrupted
● A lot of upfront training overhead (really)
166. Migration Steps
● Move OLAP Server from the Data Center to the Cloud
● Move SQL Server from the Data Center to the Cloud
● Move away from Azkaban and adopt Airflow
● Create data validation mechanisms
171. Automated validation
Grouping the output

Table  Type           Count
A      not_in_target  0
A      not_in_source  5
A      different      1000
B      not_in_target  20
B      not_in_source  0
B      different      500
174. Automated validation
Automated validation steps
1. Get result set from source and target
2. Calculate hashes
3. Compare hashes, track differences
4. Store counts of differences in tracking tables
5. Talk through differences every day
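Steps 1-4 can be sketched as follows; the talk runs this at scale with Apache Beam, while the helper names and the assumption that the first column is the key are ours:

```python
import hashlib

def row_hash(row):
    # Hash the full row so any changed column shows up as a difference.
    return hashlib.md5("|".join(map(str, row)).encode()).hexdigest()

def compare(source_rows, target_rows):
    """Compare two result sets keyed on the first column and return
    counts per difference type (the numbers stored in tracking tables)."""
    src = {row[0]: row_hash(row) for row in source_rows}
    tgt = {row[0]: row_hash(row) for row in target_rows}
    counts = {"not_in_target": 0, "not_in_source": 0, "different": 0}
    for key, digest in src.items():
        if key not in tgt:
            counts["not_in_target"] += 1
        elif tgt[key] != digest:
            counts["different"] += 1
    counts["not_in_source"] = sum(1 for key in tgt if key not in src)
    return counts

counts = compare(
    source_rows=[(1, "a"), (2, "b"), (3, "c")],
    target_rows=[(1, "a"), (2, "x"), (4, "d")],
)
```

Comparing hashes instead of full rows keeps the comparison cheap and uniform across tables with different schemas.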
177. Custom validation
NBi
● Unit testing for Business Intelligence, based on NUnit
● For tables where the logic changed and which therefore need custom validation
● For validating the OLAP Server output
178. Summary
Validation
● Automated validation for most of our data (Apache Beam)
● Custom validation for tables that changed (NBi)
● Custom validation for important parts of the OLAP Server (NBi)
181. What we learned
Lessons learned from this migration (1 / 2)
● Not everything you run in your data center will be supported by AWS as-is
● Fewer monitoring capabilities compared to the data center: no sysadmin superpowers on RDS
● Doing two migrations in parallel (Azkaban → Airflow, data center → AWS) might not be such a smart idea
182. What we learned
Lessons learned from this migration (2 / 2)
● Get extra training on AWS/DevOps upfront
● Think infrastructure as code, for Airflow pipelines as well as the weekly OLAP recycling: everything lives in code now, not in documentation or manual changes
● AWS's flexibility lets you scale your infrastructure with ease