Large scale ETL with Hadoop

Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.

    1. Large Scale ETL with Hadoop. Eric Sammer | Principal Solution Architect | @esammer. Strata + Hadoop World 2012
    2. ETL is like “REST” or “Disaster Recovery”
    3. ETL is like “REST” or “Disaster Recovery” Everyone defines it differently (and loves to fight about it)
    4. ETL is like “REST” or “Disaster Recovery” Everyone defines it differently (and loves to fight about it) It’s more of a problem/solution space than a thing
    5. ETL is like “REST” or “Disaster Recovery” Everyone defines it differently (and loves to fight about it) It’s more of a problem/solution space than a thing Hard to generalize without being lossy in some way
    6. ETL is like “REST” or “Disaster Recovery” Everyone defines it differently (and loves to fight about it) It’s more of a problem/solution space than a thing Hard to generalize without being lossy in some way Worst of all, it’s trivial at face value, complicated in practice
    7. So why is ETL hard?
    8. So why is ETL hard? It’s not because ƒ(A) → B is hard (anymore)
    9. So why is ETL hard? It’s not because ƒ(A) → B is hard (anymore) Data integration
    10. So why is ETL hard? It’s not because ƒ(A) → B is hard (anymore) Data integration Organization and management
    11. So why is ETL hard? It’s not because ƒ(A) → B is hard (anymore) Data integration Organization and management Process orchestration and scheduling
    12. So why is ETL hard? It’s not because ƒ(A) → B is hard (anymore) Data integration Organization and management Process orchestration and scheduling Accessibility
    13. So why is ETL hard? It’s not because ƒ(A) → B is hard (anymore) Data integration Organization and management Process orchestration and scheduling Accessibility How it all fits together
    14. Hadoop is two components
    15. Hadoop is two components HDFS – Massive, redundant data storage
    16. Hadoop is two components HDFS – Massive, redundant data storage MapReduce – Batch-oriented data processing at scale
    17. The ecosystem brings additional functionality
    18. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce
    19. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce Hive, Pig, Cascading, ...
    20. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce File, relational, and streaming data integration
    21. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce File, relational, and streaming data integration Flume, Sqoop, WebHDFS, ...
    22. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce File, relational, and streaming data integration Process orchestration and scheduling
    23. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce File, relational, and streaming data integration Process orchestration and scheduling Oozie, Azkaban, ...
    24. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce File, relational, and streaming data integration Process orchestration and scheduling Libraries for parsing and text extraction
    25. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce File, relational, and streaming data integration Process orchestration and scheduling Libraries for parsing and text extraction Tika, ?, ...
    26. The ecosystem brings additional functionality Higher level languages and abstractions on MapReduce File, relational, and streaming data integration Process orchestration and scheduling Libraries for parsing and text extraction ...and now low latency query with Impala
    27. To truly scale ETL, separate infrastructure from processes
    28. To truly scale ETL, separate infrastructure from processes, and make it a macro-level service
    29. To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).
    30. The services of ETL
    31. The services of ETL Process Repository
    32. The services of ETL Process Repository Metadata Repository
    33. The services of ETL Process Repository Metadata Repository Scheduling
    34. The services of ETL Process Repository Metadata Repository Scheduling Process Orchestration
    35. The services of ETL Process Repository Metadata Repository Scheduling Process Orchestration Integration Adapters or Channels
    36. The services of ETL Process Repository Metadata Repository Scheduling Process Orchestration Integration Adapters or Channels Service and Process Instrumentation and Collection
    37. What do we have today?
    38. What do we have today? HDFS and MapReduce – The core
    39. What do we have today? HDFS and MapReduce – The core Flume – Streaming event data integration
    40. What do we have today? HDFS and MapReduce – The core Flume – Streaming event data integration Sqoop – Batch exchange of relational database tables
    41. What do we have today? HDFS and MapReduce – The core Flume – Streaming event data integration Sqoop – Batch exchange of relational database tables Oozie – Process orchestration and basic scheduling
    42. What do we have today? HDFS and MapReduce – The core Flume – Streaming event data integration Sqoop – Batch exchange of relational database tables Oozie – Process orchestration and basic scheduling Impala – Fast analysis of data quality
    43. MapReduce is the assembly language of data processing
    44. MapReduce is the assembly language of data processing “Simple things are hard, but hard things are possible”
    45. MapReduce is the assembly language of data processing “Simple things are hard, but hard things are possible” Comparatively low level
    46. MapReduce is the assembly language of data processing “Simple things are hard, but hard things are possible” Comparatively low level Java knowledge required
    47. MapReduce is the assembly language of data processing “Simple things are hard, but hard things are possible” Comparatively low level Java knowledge required Use higher level tools where possible
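To make the "assembly language" point of slides 43-47 concrete: even a trivial keep-or-drop transformation needs a full Java class plus driver boilerplate. The sketch below is not from the deck; it assumes tab-delimited text records whose first field is a record type, and simply passes matching records through unchanged. Class and field names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Map-only job that keeps records whose first tab-delimited field equals "tx". */
public class FilterTransactions {

  public static class FilterMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length == 2 && "tx".equals(fields[0])) {
        context.write(line, NullWritable.get());   // emit the record unchanged
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter-transactions");
    job.setJarByClass(FilterTransactions.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);                       // map-only: no shuffle needed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In Hive or Pig the same filter is a one-liner, which is exactly the slide's advice to use higher level tools where possible.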
    48. Data organization in HDFS
    49. Data organization in HDFS Standard file system tricks to make operations atomic
    50. Data organization in HDFS Standard file system tricks to make operations atomic Use a well-defined structure that supports tooling
    51. Data organization in HDFS – Hierarchy: /intent /category /application (optional) /dataset /partitions /files. Examples: /data/fraud/txs/2012-01-01/20120101-00.avro /data/fraud/txs/2012-01-01/20120101-01.avro /group/research/model-17/training-txs/part-00000.avro /group/research/model-17/training-txs/part-00001.avro /user/esammer/scratch/foo/
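The "standard file system tricks" of slide 49 usually mean writing into a staging directory and renaming it into place, since an HDFS rename is a single metadata operation. A minimal sketch, not from the deck; the staging prefix and dataset names are illustrative, modeled on the hierarchy in slide 51.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Publish a dataset partition atomically: write everything under a temporary
 * directory first, then rename it into its final location in one metadata
 * operation, so downstream jobs never see a half-written partition.
 */
public class PartitionPublisher {

  public static void publish(Configuration conf, String dataset, String partition)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);

    // Staging area and final layout follow the /intent/category/dataset/partition
    // hierarchy from the slide; the _incoming prefix is an assumption.
    Path staging = new Path("/data/_incoming/" + dataset + "/" + partition);
    Path finalDir = new Path("/data/" + dataset + "/" + partition);

    // ... jobs or Flume/Sqoop write their files under `staging` here ...

    fs.mkdirs(finalDir.getParent());
    if (!fs.rename(staging, finalDir)) {      // rename is atomic in HDFS
      throw new IllegalStateException("Failed to publish " + finalDir);
    }
  }
}
```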
    52. A view of data integration
    53. [Diagram: a view of data integration. Streaming sources (syslog, application/clickstream, and point-of-sale events) arrive as Flume events with headers { app: 1234, type: 321, ts: <epoch> } and a byte-array body, and flow through Flume agent channels into HDFS paths such as /data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/, /data/pos/US/NY/17/2012-01-01/, and /data/pos/US/CA/42/2012-01-01/. Relational data is exchanged in batch by Sqoop jobs with the web application database and the EDW, landing in /data/wdb/<database>/<table>/ and /data/edw/<database>/<table>/.]
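For the streaming half of that diagram, a Flume agent is configured rather than coded. The properties sketch below is not from the deck: agent and component names, the listen port, and the local-timestamp setting are assumptions, but the source/channel/sink keys are standard Flume NG configuration, and the sink path mirrors the slide's date-partitioned syslog directory.

```
# Hypothetical Flume NG agent: syslog source -> memory channel -> HDFS sink.
a1.sources  = syslog-in
a1.channels = ch1
a1.sinks    = hdfs-out

a1.sources.syslog-in.type     = syslogtcp
a1.sources.syslog-in.host     = 0.0.0.0
a1.sources.syslog-in.port     = 5140
a1.sources.syslog-in.channels = ch1

a1.channels.ch1.type = memory

a1.sinks.hdfs-out.type                  = hdfs
a1.sinks.hdfs-out.channel               = ch1
# Lands events in a date-partitioned directory like /data/ops/syslog/2012-01-01/
a1.sinks.hdfs-out.hdfs.path             = /data/ops/syslog/%Y-%m-%d/
a1.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
```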
    54. Structure data in tiers
    55. Structure data in tiers A clear hierarchy of source/derived relationships
    56. Structure data in tiers A clear hierarchy of source/derived relationships One step on the road to proper lineage
    57. Structure data in tiers A clear hierarchy of source/derived relationships One step on the road to proper lineage Simple “fault and rebuild” processes
    58. Structure data in tiers A clear hierarchy of source/derived relationships One step on the road to proper lineage Simple “fault and rebuild” processes Examples
    59. Structure data in tiers A clear hierarchy of source/derived relationships One step on the road to proper lineage Simple “fault and rebuild” processes Examples Tier 0 – Raw data from source systems
    60. Structure data in tiers A clear hierarchy of source/derived relationships One step on the road to proper lineage Simple “fault and rebuild” processes Examples Tier 0 – Raw data from source systems Tier 1 – Derived from 0, cleansed, normalized
    61. Structure data in tiers A clear hierarchy of source/derived relationships One step on the road to proper lineage Simple “fault and rebuild” processes Examples Tier 0 – Raw data from source systems Tier 1 – Derived from 0, cleansed, normalized Tier 2 – Derived from 1, aggregated
    62. [Diagram: tiered data flow. Tier 0 raw datasets (/data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/, /data/pos/US/NY/17/2012-01-01/, /data/pos/US/CA/42/2012-01-01/, /data/wdb/<database>/<table>/, /data/edw/<database>/<table>/) feed sessionization, event report aggregation, and inventory reconciliation jobs, producing Tier 1 datasets (/data/reporting/sessions-day/YYYY-MM-DD/, /data/reporting/events-day/YYYY-MM-DD/, /data/reporting/events-hour/YYYY-MM-DD/) and an export area for the EDW (/export/edw/inventory/item-diff/<ts>/).]
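The "fault and rebuild" idea from slide 57 follows from the tier structure: every Tier 1 dataset is derived purely from Tier 0 inputs, so a missing or suspect partition can simply be recomputed. A minimal sketch, not from the deck; the dataset paths mirror the diagram above and the job-launch step is deliberately left as a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Rebuild a derived (Tier 1) partition from its raw (Tier 0) source if it is missing. */
public class TierRebuilder {

  public static void ensureSessionsDay(Configuration conf, String day) throws Exception {
    FileSystem fs = FileSystem.get(conf);

    Path tier0 = new Path("/data/web/core/" + day + "/");               // raw source
    Path tier1 = new Path("/data/reporting/sessions-day/" + day + "/"); // derived

    if (fs.exists(tier1)) {
      return;                      // already built; nothing to do
    }
    // Re-run the sessionization job from the raw Tier 0 partition. The job
    // submission itself (MapReduce, Hive, Pig, ...) is elided here.
    runSessionization(conf, tier0, tier1);
  }

  private static void runSessionization(Configuration conf, Path in, Path out) {
    // placeholder for the actual derived-dataset job
  }
}
```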
    63. There’s a lot to do
    64. There’s a lot to do Build libraries or services to reveal higher-level interfaces
    65. There’s a lot to do Build libraries or services to reveal higher-level interfaces Data management and lifecycle events
    66. There’s a lot to do Build libraries or services to reveal higher-level interfaces Data management and lifecycle events Instrument jobs and services for performance/quality
    67. There’s a lot to do Build libraries or services to reveal higher-level interfaces Data management and lifecycle events Instrument jobs and services for performance/quality Metadata, metadata, metadata (metadata)
    68. There’s a lot to do Build libraries or services to reveal higher-level interfaces Data management and lifecycle events Instrument jobs and services for performance/quality Metadata, metadata, metadata (metadata) Process (job) deployment, service location,
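One common way to instrument jobs for data quality, not specified on the slides but in the spirit of slide 66, is MapReduce counters: they cost almost nothing per record and are visible in the job history, where a metadata or metrics system can pick them up after each run. The counter names and field layout below are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Mapper fragment showing per-job data-quality instrumentation via counters. */
public class CleansingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  // Counter names are illustrative; they appear in the job's counter report.
  enum Quality { EMITTED, MALFORMED, MISSING_TIMESTAMP }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    if (fields.length < 3) {
      context.getCounter(Quality.MALFORMED).increment(1);
      return;                                   // drop the record, but count it
    }
    if (fields[2].isEmpty()) {
      context.getCounter(Quality.MISSING_TIMESTAMP).increment(1);
    }
    context.getCounter(Quality.EMITTED).increment(1);
    context.write(line, NullWritable.get());
  }
}
```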
    69. To the contributors, potential and current
    70. To the contributors, potential and current We have work to do
    71. To the contributors, potential and current We have work to do Still way too much scaffolding work
    72. To the contributors, potential and current We have work to do Still way too much scaffolding work
    73. I’m out of time (for now)
    74. I’m out of time (for now) Join me for office hours – 1:40 - 2:20 in Rhinelander
    75. I’m out of time (for now) Join me for office hours – 1:40 - 2:20 in Rhinelander I’m signing copies of Hadoop Operations tonight
