Thomas Bombeke
Infofarm
We harvest business value
Data Science
Applied Artificial Intelligence services, Machine Learning & Deep Learning techniques
Big Data
Setting up and maintaining on-premise and cloud-based Big Data architectures with Apache Spark and Hadoop
Training & Workshops
Training and guiding organisations through the digital era using the power of Data Science
OUR PASSION
Plow data – Grow information – Harvest value
Datalake Architecture
Concepts
Building Blocks
Use Cases
Sinkholes
Concepts
Datalake
“If you think of a datawarehouse as a store of bottled water – cleansed
and packaged and structured for easy consumption – the data lake is a
large body of water in a more natural state. The contents of the data
lake stream in from a source to fill the lake, and various users of the lake
can come to examine, dive in, or take samples” – James Dixon, CTO of
Pentaho
Datalake
• Single truth
• Ingests data from different sources
• Accessible for multiple users
Puddles
• Based on data source or category
• Has one schema
• Single type of storage
• Queryable
[Diagram: a datalake containing multiple puddles]
Raw Puddles
• Raw dump
• Not ready for use
• Kept for reprocessing
Accepted Puddles
• Schema and formats applied
• Time-sorted event data
• Ready for any use
Master Puddles
• Enriched/aggregated data
• Single purpose
• Dashboarding
• Exporting
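The three puddle types map naturally onto a partitioned directory layout. A minimal sketch, assuming a path convention (the function names and prefixes here are illustrative, not from the talk): raw puddles partitioned by ingest date, accepted puddles by event date plus ingest timestamp, master puddles by event date.

```python
from datetime import date, datetime

def raw_path(source: str, ingest_date: date) -> str:
    """Raw puddle: a dump per source, partitioned by ingest date."""
    return f"raw/{source}/{ingest_date:%Y/%m/%d}"

def accepted_path(source: str, event_date: date, ingest_ts: datetime) -> str:
    """Accepted puddle: partitioned by event date, then ingest timestamp,
    so late-arriving data lands next to earlier loads of the same day."""
    return f"accepted/{source}/{event_date:%Y/%m/%d}/{ingest_ts:%Y%m%d}"

def master_path(dataset: str, event_date: date) -> str:
    """Master puddle: one partition per event date, single purpose."""
    return f"master/{dataset}/{event_date:%Y/%m/%d}"

print(accepted_path("clicks", date(2019, 4, 1), datetime(2019, 4, 7, 5)))
# accepted/clicks/2019/04/01/20190407
```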
[Diagram: streams are ingested into a puddle, processed into further puddles, and exported]
Example
Ingestion: the click event stream and the daily database snapshot are ingested into the lake.
Processing to raw: each source lands in its own raw puddle (daily database ingest, click event stream).
Processing raw to accepted: the raw partition 2019/04/07 is reprocessed into accepted partitions per event date and ingest date, e.g. 2019/04/01/20190407 and 2019/04/07/20190407.
Processing accepted to master: the accepted partition 2019/04/07/20190407 is aggregated into the master partition 2019/04/07.
Exporting: the master puddle is exported and sends a trigger downstream.
Building Blocks
Getting Event Data
• Point in time
• Late arriving data
• Stream or batch
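Late-arriving data is the main reason to partition events by the point in time they happened rather than by arrival time. A minimal sketch (field names are assumptions): an event from April 1 that arrives on April 7 still lands in the April 1 bucket, so reprocessing that day picks it up.

```python
from collections import defaultdict

def bucket_by_event_date(events):
    """Group events by their event timestamp (not arrival time), so a
    late-arriving event still lands in the partition of the day it
    happened and can be picked up by reprocessing that day."""
    buckets = defaultdict(list)
    for e in events:
        buckets[e["event_ts"][:10]].append(e)  # "YYYY-MM-DD" prefix
    return dict(buckets)

events = [
    {"event_ts": "2019-04-07T10:00:00", "arrived": "2019-04-07"},
    {"event_ts": "2019-04-01T23:59:00", "arrived": "2019-04-07"},  # late
]
print(sorted(bucket_by_event_date(events)))  # ['2019-04-01', '2019-04-07']
```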
[Diagram: the click event stream and the daily database snapshot as incoming sources]
Getting Entity Data
• No relations
• Avoid if it does not bring value
• Reduces load on the source
• Snapshot history
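Keeping a history of daily snapshots lets you answer "what did this entity look like on day X" without hitting the source again. A sketch, assuming snapshots keyed by ISO date (the data shapes are illustrative):

```python
from bisect import bisect_right

def state_as_of(snapshots: dict, day: str) -> dict:
    """Return the most recent snapshot taken on or before `day`.
    `snapshots` maps ISO dates to the entity table captured that day."""
    days = sorted(snapshots)
    i = bisect_right(days, day)
    if i == 0:
        raise KeyError(f"no snapshot on or before {day}")
    return snapshots[days[i - 1]]

snapshots = {
    "2019-04-01": {"user-1": {"plan": "free"}},
    "2019-04-07": {"user-1": {"plan": "premium"}},
}
print(state_as_of(snapshots, "2019-04-05"))  # {'user-1': {'plan': 'free'}}
```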
Getting Entity Data
Click Event Stream
Daily Database Snapshot
Storing Data
File System
• CSV, Parquet, Avro
• High latency
• Needs work for querying
• Write once, read many
Data Store
• Column- or row-based
• Low latency
• REST, Thrift, JDBC, …
• Updates possible
Streaming Platform
• Messages
• Log compaction
• Reloading from start
[Diagram: the click event stream and the daily database ingest stored as raw and accepted puddles]
Processing Data
Batch
• ”All” the data
• Big data
Streaming
• Micro-batching
• Rolling window
• Fast data
• More challenges
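Micro-batching with a rolling window can be sketched without any framework. A minimal example (the class and window size are assumptions): each micro-batch is pushed in, and batches older than the window fall out of the aggregate.

```python
from collections import deque

class RollingCounter:
    """Keep a count over the last `window` micro-batches; each call to
    `push` consumes one micro-batch and returns the windowed total."""
    def __init__(self, window: int):
        self.batches = deque(maxlen=window)  # old batches fall out

    def push(self, batch):
        self.batches.append(len(batch))
        return sum(self.batches)

counter = RollingCounter(window=3)
totals = [counter.push(batch) for batch in ([1, 2], [3], [4, 5, 6], [7])]
print(totals)  # [2, 3, 6, 5]
```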
[Diagram: processing moves data from raw to accepted, and from accepted puddles to master]
Exporting Data
Scheduling
Cronjob
• Frequency
• Dependency
• Execution time
Scheduling
1. Ingest click stream hourly
2. Prepare raw clickstream hourly
3. Ingest database every day at 05:00
4. Create master sessions every day after step 2
5. Export master sessions after step 4
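The dependency part of this schedule ("after step 2", "after step 4") is a small DAG, which Python's stdlib graphlib can order. The job names mirror the slide; the extra edge from the database ingest to the master sessions job is an assumption for illustration.

```python
from graphlib import TopologicalSorter

# job -> the set of jobs it depends on, mirroring the numbered schedule
deps = {
    "ingest_clicks": set(),
    "prepare_raw_clicks": {"ingest_clicks"},
    "ingest_database": set(),
    "create_master_sessions": {"prepare_raw_clicks", "ingest_database"},
    "export_master_sessions": {"create_master_sessions"},
}
order = list(TopologicalSorter(deps).static_order())
print(order[-1])  # export_master_sessions
```

A real scheduler (cron, Airflow, …) adds frequencies and retries on top; the topological order only settles who runs before whom.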
Checkpointing
Job                     Checkpoint
Ingest TV catalogue     2019/04/10 05:00
Ingest click events     2019/04/10 15:00
Create master sessions  2019/04/07 07:00
Export master sessions  2019/04/07 07:20
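A checkpoint table like the one above only needs to record, per job, the last partition it fully processed. A minimal sketch (the function and store are assumptions): each run picks up only partitions newer than the checkpoint, then advances it.

```python
checkpoints = {}  # job -> last fully processed partition timestamp

def run_incremental(job: str, available: list):
    """Process only partitions newer than the job's checkpoint,
    then advance the checkpoint to the newest one processed."""
    last = checkpoints.get(job, "")
    todo = [p for p in available if p > last]
    if todo:
        checkpoints[job] = max(todo)
    return todo

partitions = ["2019/04/07 05:00", "2019/04/10 05:00"]
print(run_incremental("ingest_tv_catalogue", partitions))  # both (first run)
print(run_incremental("ingest_tv_catalogue", partitions))  # [] (nothing new)
```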
Cloud or On-premise?
Cloud
• Elastic computing/storage
• No admins needed
• Cheaper
On-premise
• Hard limits
• “More secure”
• Maintenance
• Better performance
Cloud or On-premise?
Cloud
• Sandbox
• Test environment
• Ad hoc
On-premise
• Non-elastic load
• Sensitive data
Use Cases
Personal Advertisement
[Diagram: accepted and master puddles feed the personal advertisement service – which datastore and connector to use?]
Personal Advertisement
[Diagram: the puddles are served through a datastore]
Type: Datastore
Tech: Kudu, HBase, Hive
Connector: Impala, Spark
AWS Predictive autoscaling
[Diagram: predictions are used to adjust scaling]
Sinkholes
GDPR
• Retention time
• Right to be informed
• Right to be forgotten
GDPR - Retention Time
• What? Puddle schema
• When? Puddle partitioning
• Where? Puddle location
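With date-based puddle partitioning, retention becomes "drop partitions older than N days". A sketch over partition names only (the actual file deletion and the 90-day window are assumptions):

```python
from datetime import date, timedelta

def expired_partitions(partitions, today: date, retention_days: int):
    """Return partition names (YYYY/MM/DD) older than the retention
    window; string comparison works because the format sorts by date."""
    cutoff = (today - timedelta(days=retention_days)).strftime("%Y/%m/%d")
    return [p for p in partitions if p < cutoff]

parts = ["2019/01/01", "2019/03/01", "2019/04/07"]
print(expired_partitions(parts, date(2019, 4, 10), retention_days=90))
# ['2019/01/01']
```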
GDPR - RTBI
• Puddles are queryable
• A master script queries all the puddles
GDPR - RTBF
• Datastores with delete
functionality
• What about file systems?
GDPR - RTBF
1. Create an alias
2. Replace the customer id with the alias during ingestion
3. Replace the alias with the customer id during export
4. RTBF request? Delete the alias
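The alias steps above can be sketched as follows (function names and the in-memory alias store are assumptions; in practice the mapping lives outside the lake): deleting the alias severs the link between stored events and the real customer.

```python
import uuid

aliases = {}  # customer_id -> alias, kept outside the lake

def pseudonymize(customer_id: str) -> str:
    """Steps 1-2: replace the customer id with a stable alias on ingest."""
    return aliases.setdefault(customer_id, uuid.uuid4().hex)

def resolve(alias: str):
    """Step 3: map the alias back to the customer id on export."""
    return next((c for c, a in aliases.items() if a == alias), None)

def forget(customer_id: str):
    """Step 4: RTBF - delete the alias; lake data becomes anonymous."""
    aliases.pop(customer_id, None)

alias = pseudonymize("customer-42")
forget("customer-42")
print(resolve(alias))  # None: stored events can no longer be linked
```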
Schema Evolution
• Some datastores/file formats can handle schema changes
• Breaking changes? New puddle
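Non-breaking changes, such as an added optional field, can often be absorbed at read time by applying defaults. A schema-on-read sketch (field names and defaults are illustrative):

```python
def read_with_schema(record: dict, schema: dict) -> dict:
    """Apply defaults for fields added after the record was written,
    and drop fields the current schema no longer knows about."""
    return {field: record.get(field, default) for field, default in schema.items()}

# v2 of the schema added "country" with a default; old records lack it
schema_v2 = {"user": None, "clicks": 0, "country": "unknown"}
old_record = {"user": "u1", "clicks": 3}
print(read_with_schema(old_record, schema_v2))
# {'user': 'u1', 'clicks': 3, 'country': 'unknown'}
```

A breaking change (renamed or retyped field) cannot be patched this way, which is why it warrants a new puddle.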
Small Files/Partitions
• Many small files on filesystem? Compact
• Datastore partitioning? Roll up
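Compaction can be sketched as plain file concatenation, standing in for e.g. a Spark coalesce job (the directory layout and file names are assumptions):

```python
import os
import tempfile

def compact(part_dir: str, out_path: str):
    """Concatenate all part files in a directory into one file,
    then remove the small parts."""
    parts = sorted(os.path.join(part_dir, f) for f in os.listdir(part_dir))
    with open(out_path, "w") as out:
        for p in parts:
            with open(p) as f:
                out.write(f.read())
            os.remove(p)

with tempfile.TemporaryDirectory() as d:
    for i in range(5):  # five tiny part files
        with open(os.path.join(d, f"part-{i}.csv"), "w") as f:
            f.write(f"row-{i}\n")
    compact(d, os.path.join(d, "compacted.csv"))
    remaining = os.listdir(d)
print(remaining)  # ['compacted.csv']
```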
Data Quality
• A rejected puddle for malformed records
• Stream metrics for data going in and out
• Monitor metrics with fixed conditions or a change-point-detection (CPD) algorithm
• Normalization and formats
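Counting records in and out of each step catches silent data loss. A fixed-threshold sketch (the function and the 1% threshold are assumptions; a change-point-detection algorithm would replace the fixed threshold):

```python
def check_step(records_in: int, records_out: int, rejected: int,
               max_reject_ratio: float = 0.01):
    """Verify every input record is accounted for, and that the share
    of malformed (rejected) records stays under a threshold."""
    assert records_in == records_out + rejected, "records lost in step"
    ratio = rejected / records_in if records_in else 0.0
    return ratio <= max_reject_ratio

print(check_step(records_in=1000, records_out=995, rejected=5))
# True: 0.5% rejected, under the 1% threshold
```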
Wrapping up
Get in Touch!
Thomas Bombeke
Big Data Engineer
+32 (0)498 60 66 55
thomas.bombeke@infofarm.be
info@infofarm.be
www.infofarm.be
www.xploregroup.be

Meetup 25/04/19: Big Data