Thomas Bombeke
Infofarm
We harvest business value
Data Science
Applied Artificial Intelligence services, Machine Learning & Deep Learning techniques
Big Data
Setting up and maintaining on-premise and cloud-based Big Data architectures with Apache Spark and Hadoop
Training & Workshops
Training and guiding organisations through the digital era using the power of Data Science
OUR PASSION
Plow data – Grow information – Harvest value
Datalake Architecture
Concepts
Building Blocks
Use Cases
Sinkholes
Concepts
Datalake
“If you think of a datawarehouse as a store of bottled water – cleansed
and packaged and structured for easy consumption – the data lake is a
large body of water in a more natural state. The contents of the data
lake stream in from a source to fill the lake, and various users of the lake
can come to examine, dive in, or take samples” – James Dixon, CTO of
Pentaho
Datalake
• Single truth
• Ingests data from different sources
• Accessible for multiple users
Puddles
• Based on data source or category
• Has one schema
• Single type of storage
• Queryable
[Diagram: a datalake containing multiple puddles]
Raw Puddles
• Raw dump
• Not ready for use
• Kept for reprocessing
Accepted Puddles
• Schema and formats applied
• Time-sorted event data
• Ready for any use
Master Puddles
• Enriched/aggregated data
• Single purpose
• Dashboarding
• Exporting
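The three puddle types map naturally onto a partitioned directory layout. A minimal sketch, assuming a path convention (the function names and prefixes here are illustrative, not from the talk): raw puddles partitioned by ingest date, accepted puddles by event date plus ingest timestamp, master puddles by event date.

```python
from datetime import date, datetime

def raw_path(source: str, ingest_date: date) -> str:
    """Raw puddle: a dump per source, partitioned by ingest date."""
    return f"raw/{source}/{ingest_date:%Y/%m/%d}"

def accepted_path(source: str, event_date: date, ingest_ts: datetime) -> str:
    """Accepted puddle: partitioned by event date, then ingest timestamp,
    so late-arriving data lands next to earlier loads of the same day."""
    return f"accepted/{source}/{event_date:%Y/%m/%d}/{ingest_ts:%Y%m%d}"

def master_path(dataset: str, event_date: date) -> str:
    """Master puddle: one partition per event date, single purpose."""
    return f"master/{dataset}/{event_date:%Y/%m/%d}"

print(accepted_path("clicks", date(2019, 4, 1), datetime(2019, 4, 7, 5)))
# accepted/clicks/2019/04/01/20190407
```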
[Diagram: streams are ingested into a puddle, processed into further puddles, and exported]
Example
Ingestion: the click event stream and the daily database snapshot are ingested into the lake.
Processing to raw: each source lands in its own raw puddle (daily database ingest, click event stream).
Processing raw to accepted: the raw partition 2019/04/07 is reprocessed into accepted partitions per event date and ingest date, e.g. 2019/04/01/20190407 and 2019/04/07/20190407.
Processing accepted to master: the accepted partition 2019/04/07/20190407 is aggregated into the master partition 2019/04/07.
Exporting: the master puddle is exported and sends a trigger downstream.
Building Blocks
Getting Event Data
• Point in time
• Late arriving data
• Stream or batch
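Late-arriving data is the main reason to partition events by the point in time they happened rather than by arrival time. A minimal sketch (field names are assumptions): an event from April 1 that arrives on April 7 still lands in the April 1 bucket, so reprocessing that day picks it up.

```python
from collections import defaultdict

def bucket_by_event_date(events):
    """Group events by their event timestamp (not arrival time), so a
    late-arriving event still lands in the partition of the day it
    happened and can be picked up by reprocessing that day."""
    buckets = defaultdict(list)
    for e in events:
        buckets[e["event_ts"][:10]].append(e)  # "YYYY-MM-DD" prefix
    return dict(buckets)

events = [
    {"event_ts": "2019-04-07T10:00:00", "arrived": "2019-04-07"},
    {"event_ts": "2019-04-01T23:59:00", "arrived": "2019-04-07"},  # late
]
print(sorted(bucket_by_event_date(events)))  # ['2019-04-01', '2019-04-07']
```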
[Diagram: the click event stream and the daily database snapshot as incoming sources]
Getting Entity Data
• No relations
• Avoid if it does not bring value
• Reduces load on the source
• Snapshot history
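Keeping a history of daily snapshots lets you answer "what did this entity look like on day X" without hitting the source again. A sketch, assuming snapshots keyed by ISO date (the data shapes are illustrative):

```python
from bisect import bisect_right

def state_as_of(snapshots: dict, day: str) -> dict:
    """Return the most recent snapshot taken on or before `day`.
    `snapshots` maps ISO dates to the entity table captured that day."""
    days = sorted(snapshots)
    i = bisect_right(days, day)
    if i == 0:
        raise KeyError(f"no snapshot on or before {day}")
    return snapshots[days[i - 1]]

snapshots = {
    "2019-04-01": {"user-1": {"plan": "free"}},
    "2019-04-07": {"user-1": {"plan": "premium"}},
}
print(state_as_of(snapshots, "2019-04-05"))  # {'user-1': {'plan': 'free'}}
```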
Getting Entity Data
Click Event Stream
Daily Database Snapshot
Storing Data
File System
• CSV, Parquet, Avro
• High latency
• Needs work for querying
• Write once, read many
Data Store
• Column- or row-based
• Low latency
• REST, Thrift, JDBC, …
• Updates possible
Streaming Platform
• Messages
• Log compaction
• Reloading from start
[Diagram: the click event stream and the daily database ingest stored as raw and accepted puddles]
Processing Data
Batch
• ”All” the data
• Big data
Streaming
• Micro-batching
• Rolling window
• Fast data
• More challenges
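Micro-batching with a rolling window can be sketched without any framework. A minimal example (the class and window size are assumptions): each micro-batch is pushed in, and batches older than the window fall out of the aggregate.

```python
from collections import deque

class RollingCounter:
    """Keep a count over the last `window` micro-batches; each call to
    `push` consumes one micro-batch and returns the windowed total."""
    def __init__(self, window: int):
        self.batches = deque(maxlen=window)  # old batches fall out

    def push(self, batch):
        self.batches.append(len(batch))
        return sum(self.batches)

counter = RollingCounter(window=3)
totals = [counter.push(batch) for batch in ([1, 2], [3], [4, 5, 6], [7])]
print(totals)  # [2, 3, 6, 5]
```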
[Diagram: processing moves data from raw to accepted, and from accepted puddles to master]
Exporting Data
Scheduling
Cronjob
• Frequency
• Dependency
• Execution time
Scheduling
1. Ingest click stream hourly
2. Prepare raw clickstream hourly
3. Ingest database every day at 05:00
4. Create master sessions every day after step 2
5. Export master sessions after step 4
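The dependency part of this schedule ("after step 2", "after step 4") is a small DAG, which Python's stdlib graphlib can order. The job names mirror the slide; the extra edge from the database ingest to the master sessions job is an assumption for illustration.

```python
from graphlib import TopologicalSorter

# job -> the set of jobs it depends on, mirroring the numbered schedule
deps = {
    "ingest_clicks": set(),
    "prepare_raw_clicks": {"ingest_clicks"},
    "ingest_database": set(),
    "create_master_sessions": {"prepare_raw_clicks", "ingest_database"},
    "export_master_sessions": {"create_master_sessions"},
}
order = list(TopologicalSorter(deps).static_order())
print(order[-1])  # export_master_sessions
```

A real scheduler (cron, Airflow, …) adds frequencies and retries on top; the topological order only settles who runs before whom.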
Checkpointing
Job                     Checkpoint
Ingest TV catalogue     2019/04/10 05:00
Ingest click events     2019/04/10 15:00
Create master sessions  2019/04/07 07:00
Export master sessions  2019/04/07 07:20
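A checkpoint table like the one above only needs to record, per job, the last partition it fully processed. A minimal sketch (the function and store are assumptions): each run picks up only partitions newer than the checkpoint, then advances it.

```python
checkpoints = {}  # job -> last fully processed partition timestamp

def run_incremental(job: str, available: list):
    """Process only partitions newer than the job's checkpoint,
    then advance the checkpoint to the newest one processed."""
    last = checkpoints.get(job, "")
    todo = [p for p in available if p > last]
    if todo:
        checkpoints[job] = max(todo)
    return todo

partitions = ["2019/04/07 05:00", "2019/04/10 05:00"]
print(run_incremental("ingest_tv_catalogue", partitions))  # both (first run)
print(run_incremental("ingest_tv_catalogue", partitions))  # [] (nothing new)
```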
Cloud or On-premise?
Cloud
• Elastic computing/storage
• No admins needed
• Cheaper
On-premise
• Hard limits
• “More secure”
• Maintenance
• Better performance
Cloud or On-premise?
Cloud
• Sandbox
• Test environment
• Ad hoc
On-premise
• Non-elastic load
• Sensitive data
Use Cases
Personal Advertisement
[Diagram: accepted and master puddles feed the personal advertisement service – which datastore and connector to use?]
Personal Advertisement
[Diagram: the puddles are served through a datastore]
Type: Datastore
Tech: Kudu, HBase, Hive
Connector: Impala, Spark
AWS Predictive autoscaling
[Diagram: predictions are used to adjust scaling]
Sinkholes
GDPR
• Retention time
• Right to be informed
• Right to be forgotten
GDPR - Retention Time
• What? Puddle schema
• When? Puddle partitioning
• Where? Puddle location
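With date-based puddle partitioning, retention becomes "drop partitions older than N days". A sketch over partition names only (the actual file deletion and the 90-day window are assumptions):

```python
from datetime import date, timedelta

def expired_partitions(partitions, today: date, retention_days: int):
    """Return partition names (YYYY/MM/DD) older than the retention
    window; string comparison works because the format sorts by date."""
    cutoff = (today - timedelta(days=retention_days)).strftime("%Y/%m/%d")
    return [p for p in partitions if p < cutoff]

parts = ["2019/01/01", "2019/03/01", "2019/04/07"]
print(expired_partitions(parts, date(2019, 4, 10), retention_days=90))
# ['2019/01/01']
```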
GDPR - RTBI
• Puddles are queryable
• A master script queries all the puddles
GDPR - RTBF
• Datastores with delete
functionality
• What about file systems?
GDPR - RTBF
1. Create an alias
2. Replace the customer id with the alias during ingestion
3. Replace the alias with the customer id during export
4. RTBF request? Delete the alias
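The alias steps above can be sketched as follows (function names and the in-memory alias store are assumptions; in practice the mapping lives outside the lake): deleting the alias severs the link between stored events and the real customer.

```python
import uuid

aliases = {}  # customer_id -> alias, kept outside the lake

def pseudonymize(customer_id: str) -> str:
    """Steps 1-2: replace the customer id with a stable alias on ingest."""
    return aliases.setdefault(customer_id, uuid.uuid4().hex)

def resolve(alias: str):
    """Step 3: map the alias back to the customer id on export."""
    return next((c for c, a in aliases.items() if a == alias), None)

def forget(customer_id: str):
    """Step 4: RTBF - delete the alias; lake data becomes anonymous."""
    aliases.pop(customer_id, None)

alias = pseudonymize("customer-42")
forget("customer-42")
print(resolve(alias))  # None: stored events can no longer be linked
```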
Schema Evolution
• Some datastores/file formats can handle schema changes
• Breaking changes? New puddle
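Non-breaking changes, such as an added optional field, can often be absorbed at read time by applying defaults. A schema-on-read sketch (field names and defaults are illustrative):

```python
def read_with_schema(record: dict, schema: dict) -> dict:
    """Apply defaults for fields added after the record was written,
    and drop fields the current schema no longer knows about."""
    return {field: record.get(field, default) for field, default in schema.items()}

# v2 of the schema added "country" with a default; old records lack it
schema_v2 = {"user": None, "clicks": 0, "country": "unknown"}
old_record = {"user": "u1", "clicks": 3}
print(read_with_schema(old_record, schema_v2))
# {'user': 'u1', 'clicks': 3, 'country': 'unknown'}
```

A breaking change (renamed or retyped field) cannot be patched this way, which is why it warrants a new puddle.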
Small Files/Partitions
• Many small files on filesystem? Compact
• Datastore partitioning? Roll up
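Compaction can be sketched as plain file concatenation, standing in for e.g. a Spark coalesce job (the directory layout and file names are assumptions):

```python
import os
import tempfile

def compact(part_dir: str, out_path: str):
    """Concatenate all part files in a directory into one file,
    then remove the small parts."""
    parts = sorted(os.path.join(part_dir, f) for f in os.listdir(part_dir))
    with open(out_path, "w") as out:
        for p in parts:
            with open(p) as f:
                out.write(f.read())
            os.remove(p)

with tempfile.TemporaryDirectory() as d:
    for i in range(5):  # five tiny part files
        with open(os.path.join(d, f"part-{i}.csv"), "w") as f:
            f.write(f"row-{i}\n")
    compact(d, os.path.join(d, "compacted.csv"))
    remaining = os.listdir(d)
print(remaining)  # ['compacted.csv']
```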
Data Quality
• A rejected puddle for malformed records
• Stream metrics for data going in and out
• Monitor metrics with fixed conditions or a change-point-detection (CPD) algorithm
• Normalization and formats
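Counting records in and out of each step catches silent data loss. A fixed-threshold sketch (the function and the 1% threshold are assumptions; a change-point-detection algorithm would replace the fixed threshold):

```python
def check_step(records_in: int, records_out: int, rejected: int,
               max_reject_ratio: float = 0.01):
    """Verify every input record is accounted for, and that the share
    of malformed (rejected) records stays under a threshold."""
    assert records_in == records_out + rejected, "records lost in step"
    ratio = rejected / records_in if records_in else 0.0
    return ratio <= max_reject_ratio

print(check_step(records_in=1000, records_out=995, rejected=5))
# True: 0.5% rejected, under the 1% threshold
```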
Wrapping up
Get in Touch!
Thomas Bombeke
Big Data Engineer
+32 (0)498 60 66 55
thomas.bombeke@infofarm.be
info@infofarm.be
www.infofarm.be
www.xploregroup.be

Meetup 25/04/19: Big Data