Large scale ETL with Hadoop
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.


Presentation Transcript

  • Large Scale ETL with Hadoop – Eric Sammer, Principal Solution Architect (@esammer) – Strata + Hadoop World 2012
  • ETL is like “REST” or “Disaster Recovery”
    Everyone defines it differently (and loves to fight about it)
    It’s more of a problem/solution space than a thing
    Hard to generalize without being lossy in some way
    Worst of all, it’s trivial at face value but complicated in practice
  • So why is ETL hard?
    It’s not because ƒ(A) → B is hard (anymore)
    Data integration
    Organization and management
    Process orchestration and scheduling
    Accessibility
    How it all fits together
  • Hadoop is two components
    HDFS – Massive, redundant data storage
    MapReduce – Batch-oriented data processing at scale
  • The ecosystem brings additional functionality
    Higher level languages and abstractions on MapReduce – Hive, Pig, Cascading, ...
    File, relational, and streaming data integration – Flume, Sqoop, WebHDFS, ...
    Process orchestration and scheduling – Oozie, Azkaban, ...
    Libraries for parsing and text extraction – Tika, ?, ...
    ...and now low latency query with Impala
  • To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).
  • The services of ETL
    Process Repository
    Metadata Repository
    Scheduling
    Process Orchestration
    Integration Adapters or Channels
    Service and Process Instrumentation and Collection
  • What do we have today?
    HDFS and MapReduce – The core
    Flume – Streaming event data integration
    Sqoop – Batch exchange of relational database tables
    Oozie – Process orchestration and basic scheduling
    Impala – Fast analysis of data quality
  • MapReduce is the assembly language of data processing
    “Simple things are hard, but hard things are possible”
    Comparatively low level
    Java knowledge required
    Use higher level tools where possible
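The “assembly language” point shows up even in the canonical word count: the programmer must think in explicit map, shuffle, and reduce phases. A minimal in-memory Python sketch of the programming model follows; it illustrates the phases only and is not Hadoop itself.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    for key, values in grouped:
        yield (key, sum(values))

counts = dict(reduce_phase(shuffle(map_phase(["a b a", "b c"]))))
# counts == {"a": 2, "b": 2, "c": 1}
```

A Hive or Pig one-liner expresses the same computation declaratively, which is why the slide recommends higher level tools where possible.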
  • Data organization in HDFS
    Standard file system tricks to make operations atomic
    Use a well-defined structure that supports tooling
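The classic such trick is write-to-temp-then-rename: write output under a temporary path and rename it into place only when complete. Rename within a single filesystem (local or HDFS) is atomic, so downstream consumers see either the whole dataset or nothing. A local-filesystem sketch with a hypothetical `publish_atomically` helper (an HDFS version would use the FileSystem API the same way):

```python
import os
import tempfile

def publish_atomically(records, final_path):
    """Write records to a temporary directory, then rename it into place.

    The rename is the atomic "commit": readers polling final_path never
    observe a partially written dataset.
    """
    tmp_dir = tempfile.mkdtemp(dir=os.path.dirname(final_path) or ".")
    tmp_file = os.path.join(tmp_dir, "part-00000")
    with open(tmp_file, "w") as f:
        for record in records:
            f.write(record + "\n")
    os.rename(tmp_dir, final_path)  # atomic on a single filesystem
```

Scheduler triggers can then key off the existence of `final_path` alone, which is exactly the tooling-friendly behavior the slide is after.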
  • Data organization in HDFS – Hierarchy
    /intent/category/application (optional)/dataset/partitions/files
    Examples:
    /data/fraud/txs/2012-01-01/20120101-00.avro
    /data/fraud/txs/2012-01-01/20120101-01.avro
    /group/research/model-17/training-txs/part-00000.avro
    /group/research/model-17/training-txs/part-00001.avro
    /user/esammer/scratch/foo/
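A convention this regular is easy to back with tooling. As a sketch, a hypothetical path builder (the function name is illustrative, not from the slides) that enforces the hierarchy:

```python
def dataset_path(intent, category, dataset, partition, filename, application=None):
    """Build an HDFS path following /intent/category[/application]/dataset/partition/file."""
    parts = ["", intent, category]
    if application:  # the application level is optional in the convention
        parts.append(application)
    parts += [dataset, partition, filename]
    return "/".join(parts)

# Reproduces the first example from the slide:
path = dataset_path("data", "fraud", "txs", "2012-01-01", "20120101-00.avro")
# path == "/data/fraud/txs/2012-01-01/20120101-00.avro"
```

Centralizing path construction like this means ingest jobs, cleanup scripts, and schedulers all agree on where a partition lives.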
  • A view of data integration
    [Diagram: syslog, clickstream, and point-of-sale events flow through Flume agent
    channels into dated HDFS directories such as /data/ops/syslog/2012-01-01/,
    /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/, and
    /data/pos/US/NY/17/2012-01-01/; Sqoop jobs exchange web application database and
    EDW tables with /data/wdb/<database>/<table>/ and /data/edw/<database>/<table>/.
    Event headers carry app, type, and ts (epoch) fields; the body is raw bytes.]
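The routing in the diagram is header-driven: the sink derives each event's output directory from its headers and timestamp, much as Flume's HDFS sink substitutes escape sequences like %Y-%m-%d into its configured path. A Python sketch of that routing logic, using hypothetical `category`/`dataset` header names (the diagram's actual headers are app, type, and ts):

```python
from datetime import datetime, timezone

def target_directory(event):
    """Derive a dated HDFS directory for an event from its headers."""
    h = event["headers"]
    day = datetime.fromtimestamp(h["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
    return f"/data/{h['category']}/{h['dataset']}/{day}/"

event = {
    "headers": {"category": "ops", "dataset": "syslog", "ts": 1325376000},
    "body": b"<raw syslog line>",
}
# target_directory(event) == "/data/ops/syslog/2012-01-01/"
```

Keeping routing metadata in headers rather than parsing bodies is what lets one Flume topology fan events out to many datasets.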
  • Structure data in tiers
    A clear hierarchy of source/derived relationships
    One step on the road to proper lineage
    Simple “fault and rebuild” processes
    Examples:
    Tier 0 – Raw data from source systems
    Tier 1 – Derived from 0, cleansed, normalized
    Tier 2 – Derived from 1, aggregated
  • [Diagram: Tier 0 HDFS data (/data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/,
    /data/web/retail/2012-01-01/, /data/pos/US/NY/17/2012-01-01/,
    /data/pos/US/CA/42/2012-01-01/, /data/wdb/<database>/<table>/,
    /data/edw/<database>/<table>/) feeds sessionization, event report aggregation, and
    inventory reconciliation jobs that write Tier 1 outputs
    (/data/reporting/sessions-day/YYYY-MM-DD/, /data/reporting/events-day/YYYY-MM-DD/,
    /data/reporting/events-hour/YYYY-MM-DD/) and export data
    (/export/edw/inventory/item-diff/<ts>/).]
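“Fault and rebuild” falls straight out of this structure: because each derived partition names its tier-0 inputs, a bad partition can simply be deleted and regenerated. A sketch with a hypothetical `rebuild_plan` helper, using dataset names taken from the diagram:

```python
def derived_path(dataset, day):
    """Tier 1 outputs live under /data/reporting/<dataset>/<YYYY-MM-DD>/."""
    return f"/data/reporting/{dataset}/{day}/"

def rebuild_plan(day):
    """Declare the tier-0 inputs and tier-1 output for one day of sessionization.

    Rebuilding is then just: delete the output partition, rerun the job
    over the same inputs.
    """
    sources = [f"/data/web/core/{day}/", f"/data/web/retail/{day}/"]
    return {"inputs": sources, "output": derived_path("sessions-day", day)}

plan = rebuild_plan("2012-01-01")
```

Recording these input/output pairs per run is also the seed of the lineage tracking the slide alludes to.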
  • There’s a lot to do
    Build libraries or services to reveal higher-level interfaces
    Data management and lifecycle events
    Instrument jobs and services for performance/quality
    Metadata, metadata, metadata (metadata)
    Process (job) deployment, service location, ...
  • To the contributors, potential and current
    We have work to do
    Still way too much scaffolding work
  • I’m out of time (for now)
    Join me for office hours – 1:40–2:20 in Rhinelander
    I’m signing copies of Hadoop Operations tonight