Large scale ETL with Hadoop

Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.

Large scale ETL with Hadoop

  1. Large Scale ETL with Hadoop
     Eric Sammer (@esammer) | Principal Solution Architect
     Strata + Hadoop World 2012
  2. ETL is like “REST” or “Disaster Recovery”
     - Everyone defines it differently (and loves to fight about it)
     - It’s more of a problem/solution space than a thing
     - Hard to generalize without being lossy in some way
     - Worst, it’s trivial at face value, complicated in practice
  3. So why is ETL hard?
     - It’s not because ƒ(A) → B is hard (anymore)
     - Data integration
     - Organization and management
     - Process orchestration and scheduling
     - Accessibility
     - How it all fits together
  4. Hadoop is two components
     - HDFS – Massive, redundant data storage
     - MapReduce – Batch-oriented data processing at scale
  5-9. The ecosystem brings additional functionality
     - Higher level languages and abstractions on MapReduce: Hive, Pig, Cascading, ...
     - File, relational, and streaming data integration: Flume, Sqoop, WebHDFS, ...
     - Process orchestration and scheduling: Oozie, Azkaban, ...
     - Libraries for parsing and text extraction: Tika, ?, ...
     - ...and now low latency query with Impala
  10-12. To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).
  13. The services of ETL
     - Process Repository
     - Metadata Repository
     - Scheduling
     - Process Orchestration
     - Integration Adapters or Channels
     - Service and Process Instrumentation and Collection
  14. What do we have today?
     - HDFS and MapReduce – The core
     - Flume – Streaming event data integration
     - Sqoop – Batch exchange of relational database tables
     - Oozie – Process orchestration and basic scheduling
     - Impala – Fast analysis of data quality
  15. MapReduce is the assembly language of data processing
     - “Simple things are hard, but hard things are possible”
     - Comparatively low level
     - Java knowledge required
     - Use higher level tools where possible
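
To make the "assembly language" point concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases in Python. It is not Hadoop code: the function names and the record layout (`<timestamp> <event_type> <payload>`) are assumptions made up for illustration. Even a simple count per event type needs all three phases spelled out, which is the argument for higher level tools.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterator[Tuple[str, int]]]) -> Iterator[Tuple[str, int]]:
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key (what the framework does between phases)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def event_type_mapper(line: str) -> Iterator[Tuple[str, int]]:
    # Hypothetical record layout: "<timestamp> <event_type> <payload>".
    fields = line.split()
    if len(fields) >= 2:  # silently skip malformed lines
        yield fields[1], 1


def count_reducer(key: str, values: List[int]) -> int:
    return sum(values)


def count_event_types(lines: Iterable[str]) -> Dict[str, int]:
    pairs = map_phase(lines, event_type_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)
```

The same result is a one-line `GROUP BY` in Hive or a short script in Pig, both of which compile down to jobs of this shape.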
  16. Data organization in HDFS
     - Standard file system tricks to make operations atomic
     - Use a well-defined structure that supports tooling
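
One common reading of "standard file system tricks" is to write output into a staging directory and then rename it into place, so readers see either nothing or the complete dataset. The sketch below shows that pattern on a local file system with Python's standard library, as a stand-in for HDFS (where a rename is likewise a single metadata operation). The function name `publish_atomically` and the hidden staging-directory naming are assumptions, not something the slides specify.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Mapping


def publish_atomically(final_dir: Path, files: Mapping[str, bytes]) -> None:
    """Write all files to a hidden staging directory, then rename it into place."""
    final_dir = Path(final_dir)
    if final_dir.exists():
        # Assumption for this sketch: published partitions are never overwritten.
        raise FileExistsError(f"{final_dir} is already published")
    final_dir.parent.mkdir(parents=True, exist_ok=True)

    # Stage next to the target so the rename stays on one file system.
    staging = Path(tempfile.mkdtemp(prefix=f".{final_dir.name}.", dir=final_dir.parent))
    try:
        for name, payload in files.items():
            (staging / name).write_bytes(payload)
        # The single rename is the publish step: readers never see partial output.
        os.rename(staging, final_dir)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

Tooling that lists a dataset can then ignore any directory whose name starts with a dot, which is the convention this sketch assumes for in-progress work.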
  17. Data organization in HDFS – Hierarchy
      /intent
        /category
          /application (optional)
            /dataset
              /partitions
                /files
      Examples:
        /data/fraud/txs/2012-01-01/20120101-00.avro
        /data/fraud/txs/2012-01-01/20120101-01.avro
        /group/research/model-17/training-txs/part-00000.avro
        /group/research/model-17/training-txs/part-00001.avro
        /user/esammer/scratch/foo/
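
A well-defined hierarchy is easy to enforce with a small helper, which is what makes it friendly to tooling. The following Python sketch builds paths in the intent/category/application/dataset/partition/file order shown above. The function name `dataset_path` and its parameters are hypothetical, and reading `model-17` as the optional application level in the `/group/research` example is an interpretation rather than something the slide states.

```python
import posixpath
from typing import Optional


def dataset_path(intent: str,
                 category: str,
                 dataset: str,
                 application: Optional[str] = None,
                 partition: Optional[str] = None,
                 filename: Optional[str] = None) -> str:
    """Build /intent/category[/application]/dataset[/partition][/file]."""
    parts = [intent, category]
    if application is not None:
        parts.append(application)
    parts.append(dataset)
    if partition is not None:
        parts.append(partition)
    if filename is not None:
        parts.append(filename)

    for part in parts:
        # Reject empty components and embedded separators so the depth stays fixed.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join("/", *parts)
```

For example, `dataset_path("data", "fraud", "txs", partition="2012-01-01", filename="20120101-00.avro")` returns `/data/fraud/txs/2012-01-01/20120101-00.avro`.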
  18-19. A view of data integration
      [Diagram]
      Event structure: headers: { app: 1234, type: 321, ts: <epoch> }, body: <bytes>
      Streaming data: Syslog, Application, Clickstream, and Point of Sale events pass through a Flume Agent (Flume Channels 1, 2, and 3) into HDFS:
        /data/ops/syslog/2012-01-01/
        /data/web/core/2012-01-01/
        /data/web/retail/2012-01-01/
        /data/pos/US/NY/17/2012-01-01/
        /data/pos/US/CA/42/2012-01-01/
      Relational data: the Web App Database (Sqoop Job 1) and the EDW (Sqoop Job 2) are imported into HDFS:
        /data/wdb/<database>/<table>/
        /data/edw/<database>/<table>/
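
The event shape in the diagram (headers with `app`, `type`, and `ts`, plus an opaque body) is enough to decide where an event lands. The Python sketch below models that with plain data types rather than Flume's API. The `Event` class, the `landing_dir` function, and especially the route table that maps a `type` header to a base directory are assumptions for illustration; the slide does not say which header values correspond to which paths.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping


@dataclass(frozen=True)
class Event:
    headers: Mapping[str, int]  # e.g. {"app": 1234, "type": 321, "ts": <epoch seconds>}
    body: bytes                 # opaque payload, never inspected for routing


def landing_dir(event: Event, routes: Mapping[int, str]) -> str:
    """Pick the date-partitioned HDFS directory for an event from its headers."""
    base = routes[event.headers["type"]]  # KeyError for an unrouted type
    day = datetime.fromtimestamp(event.headers["ts"], tz=timezone.utc)
    return f"{base}/{day.strftime('%Y-%m-%d')}/"


# Hypothetical route table: which "type" value means syslog is invented here.
EXAMPLE_ROUTES = {321: "/data/ops/syslog"}
```

With `EXAMPLE_ROUTES`, an event whose `ts` header is `1325376000` (midnight UTC on 2012-01-01) lands in `/data/ops/syslog/2012-01-01/`. Routing on headers alone keeps the ingest layer independent of the body format.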
  20. Structure data in tiers
     - A clear hierarchy of source/derived relationships
     - One step on the road to proper lineage
     - Simple “fault and rebuild” processes
     - Examples
       - Tier 0 – Raw data from source systems
       - Tier 1 – Derived from 0, cleansed, normalized
       - Tier 2 – Derived from 1, aggregated
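
The tiering idea can be sketched as a small lineage map: each dataset lists the datasets it is derived from, its tier follows from that, and "fault and rebuild" means recomputing everything downstream of a bad dataset in tier order. This Python sketch assumes an acyclic lineage map, and the names `tier_of` and `rebuild_order` are hypothetical.

```python
from typing import Dict, List, Set


def tier_of(name: str, sources: Dict[str, List[str]]) -> int:
    """Tier 0 has no sources; otherwise one more than the highest source tier."""
    parents = sources.get(name, [])
    if not parents:
        return 0
    return 1 + max(tier_of(parent, sources) for parent in parents)


def rebuild_order(faulted: str, sources: Dict[str, List[str]]) -> List[str]:
    """List every dataset derived from `faulted`, lowest tier first."""
    affected: Set[str] = set()
    frontier = [faulted]
    while frontier:
        current = frontier.pop()
        for name, parents in sources.items():
            if current in parents and name not in affected:
                affected.add(name)
                frontier.append(name)
    # A derived dataset always has a higher tier than its sources,
    # so sorting by tier yields a safe rebuild sequence.
    return sorted(affected, key=lambda name: (tier_of(name, sources), name))
```

Given a made-up lineage such as `{"raw": [], "clean": ["raw"], "daily": ["clean"]}`, a fault in `raw` calls for rebuilding `clean` and then `daily`, while the raw data itself is re-ingested from the source system.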
  21. [Diagram: tiered data flow in HDFS]
      HDFS (Tier 0):
        /data/ops/syslog/2012-01-01/
        /data/web/core/2012-01-01/
        /data/web/retail/2012-01-01/
        /data/pos/US/NY/17/2012-01-01/
        /data/pos/US/CA/42/2012-01-01/
        /data/wdb/<database>/<table>/
        /data/edw/<database>/<table>/
      Processes: Sessionization, Event Report Aggregation, Inventory Reconciliation
      HDFS (Tier 1):
        /data/reporting/sessions-day/YYYY-MM-DD/
        /data/reporting/events-day/YYYY-MM-DD/
        /data/reporting/events-hour/YYYY-MM-DD/
      HDFS (For export):
        /export/edw/inventory/item-diff/<ts>/
  22. There’s a lot to do
     - Build libraries or services to reveal higher-level interfaces
     - Data management and lifecycle events
     - Instrument jobs and services for performance/quality
     - Metadata, metadata, metadata (metadata)
     - Process (job) deployment, service location,
  23. To the contributors, potential and current
     - We have work to do
     - Still way too much scaffolding work
  24. I’m out of time (for now)
     - Join me for office hours – 1:40 - 2:20 in Rhinelander
     - I’m signing copies of Hadoop Operations tonight