Ingestion and historization in the Data Lake
Illia Todor
corporate.hrs.com
|
HRS |
/iamtodor
Ingestion and historization in the Data Lake 2
Data Engineer
AWS Cloud Practitioner Certified
|
HRS |
/iamtodor
Ingestion and historization in the Data Lake 3
Data Engineer
AWS Cloud Practitioner Certified
|
HRS |
/iamtodor
Ingestion and historization in the Data Lake 4
https://github.com/iamtodor
https://stackoverflow.com/users/5151861/iamtodor
https://iamtodor.medium.com/
https://www.linkedin.com/in/iamtodor/
Data Engineer
AWS Cloud Practitioner Certified
HRS |
provides the world's premier lodging
experience to corporations globally.
50
markets worldwide > 800.000
yearly contacts
with Hotels
1,000
employees
> 6,000
Global customers
32,000
cities globally
HRS |
Lodging as a Service
HRS |
Lodging as a Service
HRS |
Lodging as a Service
|
HRS |
● DataSource
● Historical Changes
● DataLake
● DataWarehouse
● Dashboards
Request
9
|
HRS |
We wanted to move ingestions from DataSource into Data Lake along with applying historization
approach. The data we need in the DWH for reporting.
The idea of historization approach is to track historical changes.
Historization: the process
Ingestion and historization in the Data Lake 10
|
HRS |
Architecture: abstract
Ingestion and historization in the Data Lake 11
|
HRS |
Architecture: in details
Ingestion and historization in the Data Lake 12
|
HRS |
Infrastructure as a Code
13
|
HRS |
First part: global overview
14
|
HRS |
First part: DMS
15
|
HRS |
AWS DMS: Database Migration Service
16
|
HRS |
AWS DMS: Database Migration Service
17
|
HRS |
AWS DMS: Database Migration Service
18
AWS Secrets Manager
|
HRS |
DMS: Replication Instance
Ingestion and historization in the Data Lake 19
|
HRS |
DMS: Target Endpoint
Ingestion and historization in the Data Lake 20
|
HRS |
DMS: Target Endpoint
Ingestion and historization in the Data Lake 21
|
HRS |
DMS: Replication Task
Ingestion and historization in the Data Lake 22
|
HRS |
DMS: Replication Task
Ingestion and historization in the Data Lake 23
|
HRS |
DMS: Replication Task
Ingestion and historization in the Data Lake 24
|
HRS |
DMS: Replication Task
Ingestion and historization in the Data Lake 25
|
HRS |
DMS: Replication Task
Ingestion and historization in the Data Lake 26
|
HRS |
DMS: Replication Task
Ingestion and historization in the Data Lake 27
|
HRS |
First part: DMS
28
|
HRS |
Process automation and scaling
29
|
HRS |
First part: Metadata management
30
|
HRS |
What is metadata itself
Ingestion and historization in the Data Lake 31
|
HRS |
Suitable tool for metadata
Ingestion and historization in the Data Lake 32
|
HRS |
Suitable tool for metadata
Ingestion and historization in the Data Lake 33
|
HRS |
Suitable tool for metadata
Ingestion and historization in the Data Lake 34
|
HRS |
config.json
Ingestion and historization in the Data Lake 35
|
HRS |
config.json
Ingestion and historization in the Data Lake 36
|
HRS |
config.json
Ingestion and historization in the Data Lake 37
|
HRS |
config.json
Ingestion and historization in the Data Lake 38
|
HRS |
config.json
Ingestion and historization in the Data Lake 39
|
HRS |
config.json
Ingestion and historization in the Data Lake 40
|
HRS |
config.json
Ingestion and historization in the Data Lake 41
|
HRS |
First part: Metadata management
42
|
HRS |
First part: DMS saving data
43
|
HRS |
DMS: saving data
Ingestion and historization in the Data Lake 44
|
HRS |
DMS: saving data
Ingestion and historization in the Data Lake 45
|
HRS |
DMS: saving data
Ingestion and historization in the Data Lake 46
|
HRS |
First part: DMS
47
|
HRS |
AWS CloudWatch
Ingestion and historization in the Data Lake 48
|
HRS |
AWS CloudWatch
Ingestion and historization in the Data Lake 49
Data Platform product
owner
|
HRS |
AWS CloudWatch
Ingestion and historization in the Data Lake 50
Data Platform product
owner
He is happy when he
observes metrics
|
HRS |
AWS CloudWatch
Ingestion and historization in the Data Lake 51
Data Platform product
owner
He is happy when he
observes metrics
|
HRS |
First part: DMS
52
|
HRS |
Second part: Historization
53
|
HRS |
Historization: components
Ingestion and historization in the Data Lake 54
|
HRS |
Historization: components
Ingestion and historization in the Data Lake 55
|
HRS |
Historization: Athena
Ingestion and historization in the Data Lake 56
|
HRS |
Historization: components
Ingestion and historization in the Data Lake 57
|
HRS |
Spark jobs on EMR cluster
Ingestion and historization in the Data Lake 58
How it looks like in general
|
HRS |
Historization: script input params
Ingestion and historization in the Data Lake 59
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 60
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 61
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 62
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 63
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 64
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 65
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 66
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 67
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 68
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 69
|
HRS |
Historization: algorithm
Ingestion and historization in the Data Lake 70
|
HRS |
The workflow is orchestrated by MWAA
Workflow management platform
Ingestion and historization in the Data Lake 71
|
HRS |
The workflow is orchestrated by MWAA
Workflow management platform
Ingestion and historization in the Data Lake 72
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 73
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 74
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 75
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 76
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 77
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 78
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 79
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 80
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 81
|
HRS |
Airflow: Dag first iteration
Ingestion and historization in the Data Lake 82
|
HRS |
Dag first iteration: Gantt diagramm
Ingestion and historization in the Data Lake 83
time
tasks
|
HRS |
Airflow: Dag second iteration
Ingestion and historization in the Data Lake 84
|
HRS |
Airflow: Monitoring and alerting
Ingestion and historization in the Data Lake 85
|
HRS |
Airflow: Monitoring
Ingestion and historization in the Data Lake 86
|
HRS |
Airflow: Monitoring
Ingestion and historization in the Data Lake 87
|
HRS |
Airflow: Monitoring
Ingestion and historization in the Data Lake 88
|
HRS |
Airflow: points to improve
Ingestion and historization in the Data Lake 89
|
HRS |
EMR EC2 Spot Instances
Ingestion and historization in the Data Lake 90
Name your own price for EC2 Compute
● A market where price of compute
changes based upon Supply and
Demand
● When Bid Price exceeds Spot Market
Price, instance is launched
● Instance is terminated (with 2 minute
warning) if market price exceeds bid
price
● Unused On-Demand Instances
|
HRS |
Airflow: points to improve
Ingestion and historization in the Data Lake 91
|
HRS |
Airflow: ec2 fleet
Ingestion and historization in the Data Lake 92
● The instance fleets configuration for a
cluster offers the widest variety of
provisioning options for EC2
instances
● With instance fleets, you specify
target capacities for On-Demand
Instances and Spot Instances within
each fleet.
|
HRS |
That’s it?
93
Ingestion and historization in the Data Lake
|
HRS |
That’s it?
94
Ingestion and historization in the Data Lake
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 95
● time_millis
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 96
● time_millis
Spark ParquetSchemaConverter.scala
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 97
● time_millis
Spark ParquetSchemaConverter.scala
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 98
● time_millis
● inadequate dates
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 99
● time_millis
● inadequate dates
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 100
● time_millis
● inadequate dates
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 101
● time_millis
● inadequate dates
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 102
● time_millis
● inadequate dates
|
HRS |
Data type issues
Ingestion and historization in the Data Lake 103
● time_millis
● inadequate dates
● decimal
|
HRS |
Data type issues: solution
Ingestion and historization in the Data Lake 104
|
HRS |
Data type issues: solution
Ingestion and historization in the Data Lake 105
|
HRS |
DMS: Source Endpoint
Ingestion and historization in the Data Lake 106
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 107
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 108
No tables were found at task initialization. Either the selected table(s) no longer exist
or no match was found for the table selection pattern
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 109
DataSource schema: my_schm
No tables were found at task initialization. Either the selected table(s) no longer exist
or no match was found for the table selection pattern
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 110
DataSource schema: my_schm
No tables were found at task initialization. Either the selected table(s) no longer exist
or no match was found for the table selection pattern
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 111
DataSource schema: my_schm
known issue when using
DB2 as source with
schema name having (_)
underscore and less than 8
characters selection rule
fails
No tables were found at task initialization. Either the selected table(s) no longer exist
or no match was found for the table selection pattern
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 112
DataSource schema: my_schm
known issue when using
DB2 as source with
schema name having (_)
underscore and less than 8
characters selection rule
fails
No tables were found at task initialization. Either the selected table(s) no longer exist
or no match was found for the table selection pattern
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 113
DataSource schema: my_schm
known issue when using
DB2 as source with
schema name having (_)
underscore and less than 8
characters selection rule
fails
No tables were found at task initialization. Either the selected table(s) no longer exist
or no match was found for the table selection pattern
We are in process of updating the documentation with this
limitation.
Being said that there is a workaround to select one specific
table by changing "rule-action" from "include" to "explicit"
|
HRS |
DMS: ReplicationTask fun part
Ingestion and historization in the Data Lake 114
DataSource schema: my_schm
known issue when using
DB2 as source with
schema name having (_)
underscore and less than 8
characters selection rule
fails
No tables were found at task initialization. Either the selected table(s) no longer exist
or no match was found for the table selection pattern
We are in process of updating the documentation with this
limitation.
Being said that there is a workaround to select one specific
table by changing "rule-action" from "include" to "explicit"
|
HRS |
Further work
Ingestion and historization in the Data Lake 115
|
HRS |
Achievements
116
Ingestion and historization in the Data Lake
|
HRS |
● We have built ETL process for full and
incremental load data from Data Source to
Data Lake
Achievements
117
Ingestion and historization in the Data Lake
|
HRS |
● We have built ETL process for full and
incremental load data from Data Source to
Data Lake
● We have automated and auto scaled
process
Achievements
118
Ingestion and historization in the Data Lake
|
HRS |
● We have built ETL process for full and
incremental load data from Data Source to
Data Lake
● We have automated and auto scaled
process
● About 70 tables are running day by day
Achievements
119
Ingestion and historization in the Data Lake
|
HRS |
● We have built ETL process for full and
incremental load data from Data Source to
Data Lake
● We have automated and auto scaled
process
● About 70 tables are running day by day
● We have just one entry point for adding new
tables to the process - all other steps are
catching up and reusing this metadata
Achievements
120
Ingestion and historization in the Data Lake
|
HRS |
● We have built ETL process for full and
incremental load data from Data Source to
Data Lake
● We have automated and auto scaled
process
● About 70 tables are running day by day
● We have just one entry point for adding new
tables to the process - all other steps are
catching up and reusing this metadata
● We enabled process monitoring and
alarming on data and process issues
Achievements
121
Ingestion and historization in the Data Lake
|
HRS |
● We have built ETL process for full and
incremental load data from Data Source to
Data Lake
● We have automated and auto scaled
process
● About 70 tables are running day by day
● We have just one entry point for adding new
tables to the process - all other steps are
catching up and reusing this metadata
● We enabled process monitoring and
alarming on data and process issues
● Our data stored in effective way and easy to
access with different tools
Achievements
122
Ingestion and historization in the Data Lake
|
HRS |
Learning points to you
123
Ingestion and historization in the Data Lake
|
HRS |
Q&A: Your voice matters
124
Link to the anonymous survey
Ingestion and historization in the Data Lake
|
HRS |
Metadata in DynamoDB
Ingestion and historization in the Data Lake 125
Partition key:
TableID is the following composition <Source>-<Target>-<SchemaName>-<TableName>
|
HRS |
Metadata in DynamoDB
Ingestion and historization in the Data Lake 126
Partition key:
TableID is the following composition <Source>-<Target>-<SchemaName>-<TableName>
Sort key: TableName
|
HRS |
Metadata in DynamoDB
Ingestion and historization in the Data Lake 127
Partition key:
TableID is the following composition <Source>-<Target>-<SchemaName>-<TableName>
Sort key: TableName

Ingestion and Historization in the Data Lake