Data Integration
Contents
1 Introduction
2 Data Ingestion
3 Data Processing
4 Data Architectures
5 Workshop
1. Introduction
vision
products
data science
Data access
data
infrastructure
Data Needs
Relational DBs
Log files
Search indexes
NoSQL DBs
Message queue
Monitoring
Data Sources
Data Warehouse
ETL
Data Warehouse Ingestion
Source → Extract → Transform → Load → Sink
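The extract, transform, load sequence on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the CSV text, the `sales` table, and the in-memory SQLite database standing in for the data warehouse are all hypothetical and do not come from the deck.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export with untrimmed, mixed-case values.
RAW_CSV = "user,amount\n alice ,10.5\nBOB,3\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names and convert amounts before loading."""
    return [(r["user"].strip().lower(), float(r["amount"])) for r in rows]


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (user TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(warehouse, transform(extract(RAW_CSV)))
etl_result = warehouse.execute(
    "SELECT user, amount FROM sales ORDER BY user"
).fetchall()
```

In ETL the cleaning happens before the data reaches the sink. An ELT variant would load the raw rows first and run the transformation inside the target system.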
1990 Data Warehousing
- Drop relational assumption
- Programmability
- Open Source
2008 Hadoop + MapReduce
- Batch → Real-time
- Daily → Continuous
2015 Kafka + Streaming data
2. Data Ingestion
From ETL to ELT: Flume, Sqoop, Kafka
Sqoop
Flume
Data Lake
Kafka Producer
Kafka Consumer
Data Lake Ingestion
Kafka
Channel
Channel
Processor
Interceptor #1
Interceptor #N
Source
Sink
Flume Agent
Apache Flume
Avro
Thrift
Kafka
Exec
JMS
Spool dir
Twitter
Netcat
Syslog
HTTP
HDFS
Kafka
Hive
Logger
Avro
Thrift
IRC
HBase
Elastic
RDBMS
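As a rough mental model of the agent diagram (source, interceptors, channel, sink), the following toy pipeline may help. It is not the Flume API: `ToyAgent`, `add_host`, and `drop_debug` are hypothetical names, and a plain in-process queue stands in for the channel.

```python
import queue


def add_host(event):
    """Interceptor #1 (hypothetical): tag each event with a host header."""
    event["headers"]["host"] = "web-1"
    return event


def drop_debug(event):
    """Interceptor #N (hypothetical): drop DEBUG lines by returning None."""
    return None if event["body"].startswith("DEBUG") else event


class ToyAgent:
    """Toy stand-in for an agent: source -> interceptors -> channel -> sink."""

    def __init__(self, interceptors, capacity=100):
        self.interceptors = interceptors
        # The channel buffers events between the source and the sink.
        self.channel = queue.Queue(maxsize=capacity)

    def source(self, lines):
        """Wrap each line in an event, run the interceptors, enqueue survivors."""
        for line in lines:
            event = {"headers": {}, "body": line}
            for interceptor in self.interceptors:
                event = interceptor(event)
                if event is None:
                    break
            if event is not None:
                self.channel.put(event)

    def sink(self):
        """Drain the channel and return the delivered events."""
        delivered = []
        while not self.channel.empty():
            delivered.append(self.channel.get())
        return delivered


agent = ToyAgent([add_host, drop_debug])
agent.source(["INFO start", "DEBUG noise", "ERROR boom"])
delivered = agent.sink()
```

The channel is what decouples the two ends: the source can keep accepting events while the sink is slow, up to the channel capacity.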
Apache Sqoop
Sqoop Tool
Import
Export
Data Pipeline Problem
Inter-process communication channel
Data Pipeline Problem
Metrics
Pub/Sub
A publish/subscribe system
Data Pipeline Problem
Metrics
Pub/Sub
Logging
Pub/Sub
Multiple publish/subscribe systems
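One way to picture the alternative to several separate publish/subscribe systems is a single broker with one topic per data type. The sketch below is an in-memory illustration only; the `Broker` class and the topic names are hypothetical and it makes no claim about how any real broker is implemented.

```python
from collections import defaultdict


class Broker:
    """Minimal topic-based publish/subscribe broker (in-memory, synchronous)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback that receives every message on the topic."""
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to every subscriber of the topic."""
        for callback in self.subscribers[topic]:
            callback(message)


broker = Broker()
dashboard, alerts, log_store = [], [], []

# Two independent consumers of metrics, one consumer of logs.
broker.subscribe("metrics", dashboard.append)
broker.subscribe("metrics", alerts.append)
broker.subscribe("logs", log_store.append)

broker.publish("metrics", {"cpu": 0.93})
broker.publish("logs", "GET /index 200")
```

Publishers only know the topic name, so adding a new consumer does not require touching the application that produces the data.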
Apache Kafka
Broker 1 Broker 2 Broker 3
Kafka Cluster
Consumer
Kafka as reliable Flume channel
Flume + Kafka
Source Sink
Channel
Producer
Flume as Kafka producer/consumer
3. Data Processing
Batch Processing
Data Lake
Batch
Processing
Pageviews: [url, timestamp]
Rollups: [url, hour, count]
DB: {url+hour : count}
MapReduce, MapReduce, Data Analysis
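The pageview rollup on this slide can be imitated in plain Python with a map step and a reduce step. This is a sketch of the idea, not a Hadoop job: the sample records are made up, and the hour format and the `url+hour` string key are assumptions about how the slide's shapes might be encoded.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical pageview records in the [url, timestamp] shape from the slide.
pageviews = [
    ("/home", "2016-05-01T10:05:00"),
    ("/home", "2016-05-01T10:40:00"),
    ("/about", "2016-05-01T10:59:59"),
    ("/home", "2016-05-01T11:00:00"),
]


def map_phase(records):
    """Map: emit ((url, hour), 1) for every pageview."""
    for url, ts in records:
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H")
        yield (url, hour), 1


def reduce_phase(pairs):
    """Reduce: sum the counts of all pairs that share a (url, hour) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Rollups in the [url, hour, count] shape, keyed by (url, hour).
rollups = reduce_phase(map_phase(pageviews))

# Key-value form for the DB: {url+hour : count}.
db = {f"{url}+{hour}": count for (url, hour), count in rollups.items()}
```

A real batch job would run the map and reduce steps in parallel over partitions of the data lake; the grouping by key is the part this sketch preserves.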
Stream Processing
Real Time Technologies
Data Source
Flume
Kafka producer
Events / DB writes
Event Stream
Process Stream
Output Stream
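In contrast to a batch job, a stream processor updates its result as each event arrives and emits an output stream. A minimal sketch with Python generators follows; the events and the running-count logic are hypothetical examples, not taken from the slides.

```python
from collections import Counter


def event_stream():
    """Hypothetical event stream: one (url, hour) event per pageview."""
    yield ("/home", "10")
    yield ("/about", "10")
    yield ("/home", "10")


def process_stream(events):
    """Update a running count per key and emit it after every event."""
    counts = Counter()
    for key in events:
        counts[key] += 1
        yield key, counts[key]


# Materialized here only for inspection; a real stream would be unbounded.
output_stream = list(process_stream(event_stream()))
```

Each input event produces an output immediately, so a consumer of the output stream never waits for a batch boundary.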
4. Data Architectures
Data Processing Architecture
Data Source
Flume, Kafka producer
Data Lake
Batch Processing
Data Analysis
Data Processing Architecture
Data Source
Flume, Kafka producer
Data Lake
Batch Processing
Stream Processing
Data Analysis
Lambda Architecture
New Data Stream
Batch Layer (Batch Processing, MapReduce): Precompute Views
Real-Time Layer (Stream Processing): Process Stream, Increment Views
Serving Layer: Batch Views (Partial Aggregates), Real-Time Views (Real-Time Data)
query, merge, Merged View
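The merge step at query time can be sketched as follows. The two dictionaries and their keys are hypothetical, and the sketch assumes the real-time view only contains events that arrived after the last batch run, so summing the two views does not double count.

```python
# Hypothetical views in a {url+hour : count} shape.
# Batch view: precomputed by the batch layer up to the last batch run.
batch_view = {"/home+10": 120, "/about+10": 7}
# Real-time view: increments seen since that batch run (assumption).
realtime_view = {"/home+10": 3, "/home+11": 5}


def query(key):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# The merged view over every key known to either layer.
merged_view = {k: query(k) for k in sorted(set(batch_view) | set(realtime_view))}
```

When the next batch run finishes, the batch view absorbs those increments and the corresponding real-time entries can be discarded.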
Data Processing Architecture
Data Source
Flume, Kafka producer
Data Lake
Batch Processing
Stream Processing
Serving Layer
Data Analysis
Kappa Architecture
Where everything is a stream
New Data Stream
Data Storage: 1, 2, 3, ..
Real-Time Layer: Stream Processing System (Job Version n, Job Version n+1)
Serving Layer: Serving DB (Output Table n, Output Table n+1)
query
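The job versions and output tables suggest reprocessing by replay: a new job version reads the retained stream from the start and fills a new output table, and queries switch over once it has caught up. The sketch below illustrates that idea under assumptions: the log contents, the two job functions, and the table names are hypothetical.

```python
# Data Storage: a retained, ordered log of events (1, 2, 3, ..).
log = ["/home", "/About", "/home", "/about"]


def job_v_n(events):
    """Job Version n: count URLs exactly as received."""
    table = {}
    for url in events:
        table[url] = table.get(url, 0) + 1
    return table


def job_v_n_plus_1(events):
    """Job Version n+1 (hypothetical change): normalize URL case first."""
    return job_v_n(url.lower() for url in events)


# Output Table n is what queries read at first.
serving_db = {"table_n": job_v_n(log)}
active = "table_n"

# Reprocess: replay the whole log with the new job into a new table.
serving_db["table_n_plus_1"] = job_v_n_plus_1(log)

# Switch queries to the new table, then drop the old one.
active = "table_n_plus_1"
del serving_db["table_n"]
query_result = serving_db[active]
```

There is a single code path for both live processing and reprocessing, which is the main simplification over keeping separate batch and real-time layers.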
5. Workshop
THANKS!
Any questions?
@datiobd
flasheras@datiobd.com rbravo@datiobd.com
datio-big-data
