Large Scale ETL with Hadoop
    Eric Sammer | Principal Solution Architect
    @esammer
    Strata + Hadoop World 2012




ETL is like “REST” or “Disaster Recovery”
       Everyone defines it differently (and loves to fight
       about it)
       It’s more of a problem/solution space than a thing
       Hard to generalize without being lossy in some
       way
       Worst of all, it’s trivial at face value but
       complicated in practice

So why is ETL hard?
       It’s not because ƒ(A) → B is hard (anymore)
       Data integration
       Organization and management
       Process orchestration and scheduling
       Accessibility
       How it all fits together


Hadoop is two components
      HDFS – Massive, redundant data storage
      MapReduce – Batch-oriented data processing at
      scale




The ecosystem brings additional functionality
      Higher level languages and abstractions on
      MapReduce
          Hive, Pig, Cascading, ...
      File, relational, and streaming data integration
          Flume, Sqoop, WebHDFS, ...
      Process orchestration and scheduling
          Oozie, Azkaban, ...
      Libraries for parsing and text extraction
          Tika, ?, ...
      ...and now low latency query with Impala

To truly scale ETL, separate infrastructure from
     processes, and make it a macro-level service
     (composed of other services).




The services of ETL
       Process Repository
       Metadata Repository
       Scheduling
       Process Orchestration
       Integration Adapters or Channels
       Service and Process Instrumentation and
       Collection

What do we have today?
       HDFS and MapReduce – The core
       Flume – Streaming event data integration
       Sqoop – Batch exchange of relational database
       tables
       Oozie – Process orchestration and basic
       scheduling
       Impala – Fast analysis of data quality

MapReduce is the assembly language of data
     processing
        “Simple things are hard, but hard things are
        possible”
        Comparatively low level
        Java knowledge required
        Use higher level tools where possible
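
To make "comparatively low level" concrete, here is a minimal sketch of the map, shuffle, and reduce phases for a word count, written in plain Python rather than against the Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and exist only for illustration; a real job would also need input formats, serialization, and job configuration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order over an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Even this toy version needs three functions for what a higher level tool such as Hive expresses as a single GROUP BY query, which is the argument for using those tools where possible.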


Data organization in HDFS
        Standard file system tricks to make operations
        atomic
        Use a well-defined structure that supports tooling
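
The slide does not name the tricks, so the following is an assumption: one common pattern is to write to a temporary name and then rename into place, so readers never see a partial file. The sketch below uses the local file system through Python's os module as a stand-in; on HDFS the same idea relies on rename being an atomic namespace operation. The function name publish_atomically and the ".tmp-" prefix are hypothetical.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data to a temporary file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Keep the temporary file in the target directory so the rename
    # never has to cross file systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename is the single step that makes the file visible.
        os.replace(tmp_path, final_path)
    except BaseException:
        # On failure, remove the partial temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Tooling that scans a dataset directory can then skip any name starting with the temporary prefix and treat everything else as complete.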




Data organization in HDFS – Hierarchy
       /intent
          /category
             /application (optional)
                /dataset
                    /partitions
                       /files

       Examples:
       /data/fraud/txs/2012-01-01/20120101-00.avro
       /data/fraud/txs/2012-01-01/20120101-01.avro
       /group/research/model-17/training-txs/part-00000.avro
       /group/research/model-17/training-txs/part-00001.avro
       /user/esammer/scratch/foo/
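
A well-defined hierarchy is easy to encode in a small helper so that every job builds paths the same way. The sketch below is one possible reading of the levels above; the function name dataset_path and its parameters are hypothetical, and treating both the application and the partition as optional is an assumption drawn from the example paths.

```python
import posixpath
from typing import Optional


def dataset_path(intent: str, category: str, dataset: str, *,
                 application: Optional[str] = None,
                 partition: Optional[str] = None,
                 filename: Optional[str] = None) -> str:
    """Build /intent/category[/application]/dataset[/partition][/file]."""
    parts = [intent, category]
    if application:
        parts.append(application)
    parts.append(dataset)
    if partition:
        parts.append(partition)
    if filename:
        parts.append(filename)
    return posixpath.join("/", *parts)
```

For instance, dataset_path("data", "fraud", "txs", partition="2012-01-01", filename="20120101-00.avro") yields the first example path on the slide.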



A view of data integration




       Event
          headers: {
            app: 1234,
            type: 321
            ts: <epoch>
          },
          body: <bytes>

       Streaming Data (Flume Agent)
          Sources: Syslog Events, Application Events,
                   Clickstream Events, Point of Sale Events
          Flume (Channel 1) -> /data/ops/syslog/2012-01-01/
          Flume (Channel 2) -> /data/web/core/2012-01-01/
                               /data/web/retail/2012-01-01/
          Flume (Channel 3) -> /data/pos/US/NY/17/2012-01-01/
                               /data/pos/US/CA/42/2012-01-01/

       Relational Data
          Web App Database -> Sqoop (Job 1) -> /data/wdb/<database>/<table>/
          EDW              -> Sqoop (Job 2) -> /data/edw/<database>/<table>/

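
The event structure and the dated HDFS directories in the diagram can be tied together with a short sketch. This is not Flume's API: the Event class and the daily_directory function are hypothetical, and the code assumes the "ts" header holds epoch seconds in UTC, which the slide does not specify (divide by 1000 first if your events carry milliseconds).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class Event:
    """An event as drawn on the slide: string headers plus an opaque body."""
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def daily_directory(base: str, event: Event) -> str:
    """Return base/YYYY-MM-DD/ for the event's 'ts' header (epoch seconds, UTC)."""
    ts = int(event.headers["ts"])
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{base.rstrip('/')}/{day}/"
```

Deriving the partition from a header rather than from arrival time keeps late events in the directory for the day they occurred.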
Structure data in tiers
        A clear hierarchy of source/derived relationships
        One step on the road to proper lineage
        Simple “fault and rebuild” processes
        Examples
           Tier 0 – Raw data from source systems
           Tier 1 – Derived from 0, cleansed, normalized
           Tier 2 – Derived from 1, aggregated
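
A "fault and rebuild" process follows directly from recorded source/derived relationships. The sketch below assumes a hypothetical lineage map (DERIVED_FROM, with made-up dataset names) and an acyclic graph; given a faulted dataset, it lists every downstream dataset in an order where each one comes after the datasets it is derived from.

```python
from typing import Dict, List, Set

# Hypothetical lineage: each derived dataset maps to the datasets it is built from.
DERIVED_FROM: Dict[str, List[str]] = {
    "tier1/sessions": ["tier0/web"],
    "tier1/events": ["tier0/web", "tier0/pos"],
    "tier2/events-by-day": ["tier1/events"],
}


def datasets_to_rebuild(faulted: str,
                        derived_from: Dict[str, List[str]]) -> List[str]:
    """Return datasets downstream of `faulted`, sources before dependents."""
    # Step 1: collect everything transitively derived from the faulted dataset.
    affected: Set[str] = {faulted}
    changed = True
    while changed:
        changed = False
        for dataset, sources in derived_from.items():
            if dataset not in affected and affected.intersection(sources):
                affected.add(dataset)
                changed = True

    # Step 2: emit each affected dataset only after its affected sources.
    order: List[str] = []
    done: Set[str] = {faulted}

    def visit(dataset: str) -> None:
        if dataset in done:
            return
        done.add(dataset)
        for source in derived_from.get(dataset, []):
            if source in affected:
                visit(source)
        order.append(dataset)

    for dataset in sorted(affected):
        visit(dataset)
    return order
```

With the map above, a fault in "tier0/pos" leads to rebuilding "tier1/events" and then "tier2/events-by-day", while "tier1/sessions" is left alone.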


HDFS (Tier 0)
     /data/ops/syslog/2012-01-01/
     /data/web/core/2012-01-01/
     /data/web/retail/2012-01-01/
     /data/pos/US/NY/17/2012-01-01/
     /data/pos/US/CA/42/2012-01-01/
     /data/wdb/<database>/<table>/
     /data/edw/<database>/<table>/

Processes
     Sessionization
     Event Report Aggregation
     Inventory Reconciliation

HDFS (Tier 1)
     /data/reporting/sessions-day/YYYY-MM-DD/
     /data/reporting/events-day/YYYY-MM-DD/
     /data/reporting/events-hour/YYYY-MM-DD/

HDFS (For export)
     /export/edw/inventory/item-diff/<ts>/

There’s a lot to do
       Build libraries or services to reveal higher-level
       interfaces
       Data management and lifecycle events
       Instrument jobs and services for performance/
       quality
       Metadata, metadata, metadata (metadata)
       Process (job) deployment, service location,

To the contributors, potential and current
        We have work to do
        Still way too much scaffolding work




I’m out of time (for now)
        Join me for office hours – 1:40 - 2:20 in
        Rhinelander
        I’m signing copies of Hadoop Operations tonight



