Large scale ETL with Hadoop

Eric Sammer | Principal Solution Architect
@esammer
Strata + Hadoop World 2012

ETL is like “REST” or “Disaster Recovery”
Everyone defines it differently (and loves to fight about it)
It’s more of a problem/solution space than a thing
Hard to generalize without being lossy in some way
Worst, it’s trivial at face value, complicated in practice

So why is ETL hard?
It’s not because ƒ(A) → B is hard (anymore)
Data integration
Organization and management
Process orchestration and scheduling
Accessibility
How it all fits together

Hadoop is two components
HDFS – Massive, redundant data storage
MapReduce – Batch-oriented data processing at scale

The ecosystem brings additional functionality

Higher level languages and abstractions on MapReduce
Hive, Pig, Cascading, ...

File, relational, and streaming data integration
Flume, Sqoop, WebHDFS, ...

Oozie, Azkaban, ...

Libraries for parsing and text extraction
Tika, ?, ...

...and now low latency query with Impala

To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).

The services of ETL
Process Repository
Metadata Repository
Scheduling
Process Orchestration
Integration Adapters or Channels
Service and Process Instrumentation and Collection
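
As a rough illustration of what a process orchestration service has to do at minimum, the sketch below runs a set of processes in dependency order. The process names and the `runner` callback are hypothetical and are not taken from the talk; a real orchestrator such as Oozie adds retries, scheduling, and state tracking on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical process graph: each key lists the processes it depends on.
# The names are illustrative only.
DEPENDENCIES = {
    "ingest_syslog": set(),
    "ingest_clickstream": set(),
    "sessionize": {"ingest_clickstream"},
    "daily_report": {"sessionize", "ingest_syslog"},
}


def execution_order(dependencies):
    """Return one valid run order in which every process follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


def run_all(dependencies, runner):
    """Run each process once, in dependency order, and collect the results."""
    results = {}
    for name in execution_order(dependencies):
        results[name] = runner(name)
    return results
```

Calling `run_all(DEPENDENCIES, runner)` with any function that takes a process name guarantees that `sessionize` only starts after `ingest_clickstream` has finished.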

What do we have today?
HDFS and MapReduce – The core
Flume – Streaming event data integration
Sqoop – Batch exchange of relational database tables
Oozie – Process orchestration and basic scheduling
Impala – Fast analysis of data quality

MapReduce is the assembly language of data processing
“Simple things are hard, but hard things are possible”
Comparatively low level
Java knowledge required
Use higher level tools where possible
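
To give a feel for why even a simple task is verbose at this level, here is a word count written as explicit map, shuffle, and reduce steps. This is an in-memory imitation of the programming model in plain Python, not the Hadoop API; all function names are made up for the sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three phases over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The same result is a one-line `GROUP BY` in a higher level tool such as Hive, which is the point of the last bullet.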

Data organization in HDFS
Standard file system tricks to make operations atomic
Use a well-defined structure that supports tooling
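
The slide does not say which tricks are meant. One common candidate is to write to a temporary name and then rename into place, so that readers never see a partial file. The sketch below shows that pattern on a local POSIX file system as a stand-in; on HDFS the rename would go through an HDFS client rather than `os.replace`. The function name is hypothetical.

```python
import os
import tempfile


def publish_atomically(final_path, data):
    """Write data to a hidden temporary file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Keep the temporary file in the destination directory so the rename
    # never crosses file systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename is the only step that makes the file visible.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Consumers that ignore names starting with `.` will only ever see complete files.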

Data organization in HDFS – Hierarchy
/intent
  /category
    /application (optional)
      /dataset
        /partitions
          /files

Examples:
/data/fraud/txs/2012-01-01/20120101-00.avro
/data/fraud/txs/2012-01-01/20120101-01.avro
/group/research/model-17/training-txs/part-00000.avro
/group/research/model-17/training-txs/part-00001.avro
/user/esammer/scratch/foo/
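
A fixed hierarchy like this is easy to put behind a small helper so that jobs never assemble paths by hand. The sketch below is one possible shape for such a helper; the function and parameter names are hypothetical, and the mapping of the example paths onto intent, category, application, and dataset is my reading of the slide.

```python
def dataset_path(intent, category, dataset, partition=None,
                 application=None, filename=None):
    """Build /intent/category/[application]/dataset/[partition]/[file]."""
    parts = [intent, category]
    if application:
        parts.append(application)  # optional level, per the hierarchy
    parts.append(dataset)
    if partition:
        parts.append(partition)
    if filename:
        parts.append(filename)
    return "/" + "/".join(parts)
```

For instance, `dataset_path("data", "fraud", "txs", partition="2012-01-01", filename="20120101-00.avro")` reproduces the first example path above.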

A view of data integration

[Diagram: streaming and relational sources flowing into HDFS]
Event
  headers: { app: 1234, type: 321, ts: <epoch> }
  body: <bytes>
Streaming Data: Syslog Events, Application Events, Clickstream Events, Point of Sale Events
Flume Agent
  Flume (Channel 1): /data/ops/syslog/2012-01-01/
  Flume (Channel 2): /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/
  Flume (Channel 3): /data/pos/US/NY/17/2012-01-01/, /data/pos/US/CA/42/2012-01-01/
Relational Data
  Web App Database, Sqoop (Job 1): /data/wdb/<database>/<table>/
  EDW, Sqoop (Job 2): /data/edw/<database>/<table>/
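
The event on this slide (headers `app`, `type`, `ts` plus a byte body) carries enough information to pick its date-partitioned directory. A minimal sketch of that mapping follows. It assumes `ts` is epoch seconds in UTC, which the slide does not specify, and the class and function names are hypothetical rather than Flume's own API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Event:
    headers: dict  # e.g. {"app": 1234, "type": 321, "ts": <epoch seconds>}
    body: bytes


def partition_dir(base, event):
    """Return the daily directory under base for the event's ts header (UTC)."""
    day = datetime.fromtimestamp(event.headers["ts"], tz=timezone.utc)
    return f"{base.rstrip('/')}/{day:%Y-%m-%d}/"
```

An event stamped at midnight UTC on 1 January 2012 and routed to `/data/ops/syslog` lands in `/data/ops/syslog/2012-01-01/`.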

Structure data in tiers
A clear hierarchy of source/derived relationships
One step on the road to proper lineage
Simple “fault and rebuild” processes
Examples
Tier 0 – Raw data from source systems
Tier 1 – Derived from 0, cleansed, normalized
Tier 2 – Derived from 1, aggregated
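
The tiering idea can be sketched in a few lines: each tier is a pure function of the one below it, so any derived tier can be thrown away and rebuilt from tier 0. The record format (`day,event_type` text lines) and all function names here are assumptions made for illustration.

```python
from collections import Counter


def build_tier1(tier0_lines):
    """Tier 1: cleanse and normalize raw lines; skip anything malformed."""
    records = []
    for line in tier0_lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 2 or not fields[0] or not fields[1]:
            continue  # malformed; tier 0 still holds the original line
        records.append((fields[0], fields[1].lower()))
    return records


def build_tier2(tier1_records):
    """Tier 2: aggregate tier 1 into counts per (day, event type)."""
    return dict(Counter(tier1_records))


def rebuild(tier0_lines):
    """Fault and rebuild: regenerate every derived tier from tier 0 alone."""
    tier1 = build_tier1(tier0_lines)
    return tier1, build_tier2(tier1)
```

Because nothing in tier 1 or tier 2 is edited in place, a bad run is fixed by deleting the derived output and calling `rebuild` again.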

[Diagram: tiered data flow in HDFS]
HDFS (Tier 0)
  /data/ops/syslog/2012-01-01/
  /data/web/core/2012-01-01/
  /data/web/retail/2012-01-01/
  /data/pos/US/NY/17/2012-01-01/
  /data/pos/US/CA/42/2012-01-01/
  /data/wdb/<database>/<table>/
  /data/edw/<database>/<table>/
Processes: Sessionization, Event Report Aggregation, Inventory Reconciliation
HDFS (Tier 1)
  /data/reporting/sessions-day/YYYY-MM-DD/
  /data/reporting/events-day/YYYY-MM-DD/
  /data/reporting/events-hour/YYYY-MM-DD/
HDFS (For export)
  /export/edw/inventory/item-diff/<ts>/

There’s a lot to do
Build libraries or services to reveal higher-level interfaces
Data management and lifecycle events
Instrument jobs and services for performance/quality
Metadata, metadata, metadata (metadata)
Process (job) deployment, service location,

To the contributors, potential and current
We have work to do
Still way too much scaffolding work

I’m out of time (for now)
Join me for office hours – 1:40 - 2:20 in Rhinelander
I’m signing copies of Hadoop Operations tonight
