Large scale ETL with Hadoop

Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.


  • Transcript

    • 1. Large Scale ETL with Hadoop. Eric Sammer | Principal Solution Architect, @esammer. Strata + Hadoop World 2012
    • 2–6. ETL is like “REST” or “Disaster Recovery”
        • Everyone defines it differently (and loves to fight about it)
        • It’s more of a problem/solution space than a thing
        • Hard to generalize without being lossy in some way
        • Worst, it’s trivial at face value, complicated in practice
    • 7–13. So why is ETL hard?
        • It’s not because ƒ(A) → B is hard (anymore)
        • Data integration
        • Organization and management
        • Process orchestration and scheduling
        • Accessibility
        • How it all fits together
    • 14–16. Hadoop is two components
        • HDFS – Massive, redundant data storage
        • MapReduce – Batch-oriented data processing at scale
    • 17–26. The ecosystem brings additional functionality
        • Higher level languages and abstractions on MapReduce: Hive, Pig, Cascading, ...
        • File, relational, and streaming data integration: Flume, Sqoop, WebHDFS, ...
        • Process orchestration and scheduling: Oozie, Azkaban, ...
        • Libraries for parsing and text extraction: Tika, ?, ...
        • ...and now low latency query with Impala
    • 27–29. To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).
    • 30–36. The services of ETL
        • Process Repository
        • Metadata Repository
        • Scheduling
        • Process Orchestration
        • Integration Adapters or Channels
        • Service and Process Instrumentation and Collection
    • 37–42. What do we have today?
        • HDFS and MapReduce – The core
        • Flume – Streaming event data integration
        • Sqoop – Batch exchange of relational database tables
        • Oozie – Process orchestration and basic scheduling
        • Impala – Fast analysis of data quality
    • 43–47. MapReduce is the assembly language of data processing
        • “Simple things are hard, but hard things are possible”
        • Comparatively low level
        • Java knowledge required
        • Use higher level tools where possible
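To make the "simple things are hard" point concrete, here is a minimal in-memory sketch of the three phases a MapReduce word count goes through. It is plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical labels chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases in a single process."""
    return reduce_phase(shuffle(map_phase(lines)))
```

Even this toy version needs three functions for what a higher level tool would typically express as a single group-and-count statement, which is one way to read the advice to use higher level tools where possible.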
    • 48–50. Data organization in HDFS
        • Standard file system tricks to make operations atomic
        • Use a well-defined structure that supports tooling
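The slides do not spell out which file system tricks are meant. One common candidate is "write to a temporary name, then rename into place", so readers never observe a half-written file. The sketch below shows that pattern on a local POSIX file system as a stand-in for HDFS; the function name `publish_atomically` and the `._tmp-` prefix are assumptions made for this example.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data to a temporary file, then rename it to final_path."""
    directory = os.path.dirname(final_path) or "."
    # Create the temporary file in the destination directory so the
    # rename never has to cross a file system boundary.
    fd, tmp_path = tempfile.mkstemp(prefix="._tmp-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic on POSIX: readers see either the old
        # state or the complete new file, never a partial write.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The same idea is often applied at the directory level: a job writes into a staging directory and a single rename publishes the whole partition once the job succeeds.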
    • 51. Data organization in HDFS – Hierarchy
        • /intent
        • /category
        • /application (optional)
        • /dataset
        • /partitions
        • /files
        • Examples:
            • /data/fraud/txs/2012-01-01/20120101-00.avro
            • /data/fraud/txs/2012-01-01/20120101-01.avro
            • /group/research/model-17/training-txs/part-00000.avro
            • /group/research/model-17/training-txs/part-00001.avro
            • /user/esammer/scratch/foo/
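A well-defined hierarchy is easiest to keep consistent when paths are built by one helper instead of by hand. The following is a minimal sketch of such a helper; the function name `dataset_path` and its parameters are hypothetical, and the mapping of the slide's example paths onto the hierarchy levels is an interpretation, not something the slides state.

```python
from typing import Optional, Sequence


def dataset_path(
    intent: str,
    category: str,
    dataset: str,
    *,
    application: Optional[str] = None,
    partitions: Sequence[str] = (),
    filename: Optional[str] = None,
) -> str:
    """Join hierarchy levels into /intent/category[/application]/dataset[/partitions][/file]."""
    parts = [intent, category]
    if application:
        parts.append(application)  # the optional level in the hierarchy
    parts.append(dataset)
    parts.extend(partitions)
    if filename:
        parts.append(filename)
    # Strip stray slashes so callers cannot produce "//" by accident.
    return "/" + "/".join(part.strip("/") for part in parts)
```

Under that interpretation, `dataset_path("data", "fraud", "txs", partitions=["2012-01-01"], filename="20120101-00.avro")` returns `/data/fraud/txs/2012-01-01/20120101-00.avro`, matching the first example on the slide.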
    • 52. A view of data integration
    • 53. [Diagram: data integration into HDFS]
        • Event structure: headers: { app: 1234, type: 321, ts: <epoch> }, body: <bytes>
        • Streaming data sources: Syslog Events, Application Events, Clickstream Events, Point of Sale Events
        • Streaming transport: Flume Agent; Flume (Channel 1), Flume (Channel 2), Flume (Channel 3)
        • Relational data sources: Web App Database, EDW
        • Relational transport: Sqoop (Job 1), Sqoop (Job 2)
        • HDFS destinations:
            • /data/ops/syslog/2012-01-01/
            • /data/web/core/2012-01-01/
            • /data/web/retail/2012-01-01/
            • /data/pos/US/NY/17/2012-01-01/
            • /data/pos/US/CA/42/2012-01-01/
            • /data/wdb/<database>/<table>/
            • /data/edw/<database>/<table>/
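The diagram pairs an event (headers plus body) with date-partitioned HDFS directories. As a rough sketch of how an event's `ts` header could select its partition directory, the code below models the event in plain Python. It does not use the Flume API. The `Event` class and `partition_dir` function are hypothetical, and treating `ts` as epoch seconds interpreted in UTC is an assumption, since the slide only says `<epoch>`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class Event:
    headers: Dict[str, int]  # e.g. {"app": 1234, "type": 321, "ts": <epoch seconds>}
    body: bytes


def partition_dir(base_dir: str, event: Event) -> str:
    """Return the daily partition directory for an event under base_dir."""
    # Assumption: "ts" holds epoch seconds and partitions follow UTC days.
    day = datetime.fromtimestamp(event.headers["ts"], tz=timezone.utc)
    return f"{base_dir.rstrip('/')}/{day.strftime('%Y-%m-%d')}/"
```

With `ts` set to 1325376000 (midnight UTC on 2012-01-01) and a base directory of `/data/ops/syslog`, the function returns `/data/ops/syslog/2012-01-01/`, the same shape as the paths in the diagram.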
    • 54–61. Structure data in tiers
        • A clear hierarchy of source/derived relationships
        • One step on the road to proper lineage
        • Simple “fault and rebuild” processes
        • Examples
            • Tier 0 – Raw data from source systems
            • Tier 1 – Derived from 0, cleansed, normalized
            • Tier 2 – Derived from 1, aggregated
    • 62. [Diagram: tiered data flow in HDFS]
        • HDFS (Tier 0):
            • /data/ops/syslog/2012-01-01/
            • /data/web/core/2012-01-01/
            • /data/web/retail/2012-01-01/
            • /data/pos/US/NY/17/2012-01-01/
            • /data/pos/US/CA/42/2012-01-01/
            • /data/wdb/<database>/<table>/
            • /data/edw/<database>/<table>/
        • Processes: Sessionization, Event Report Aggregation, Inventory Reconciliation
        • HDFS (Tier 1):
            • /data/reporting/sessions-day/YYYY-MM-DD/
            • /data/reporting/events-day/YYYY-MM-DD/
            • /data/reporting/events-hour/YYYY-MM-DD/
        • HDFS (For export):
            • /export/edw/inventory/item-diff/<ts>/
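Once source/derived relationships are explicit, "fault and rebuild" reduces to a graph walk: mark a dataset as faulted, find everything derived from it, and rebuild tier by tier. The sketch below illustrates this with a hypothetical lineage table. The dataset names and the `tierN/` naming scheme are invented for the example and are not taken from the slides.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each derived dataset lists the datasets it is built from.
SOURCES: Dict[str, List[str]] = {
    "tier1/clean_clicks": ["tier0/raw_clicks"],
    "tier1/clean_sales": ["tier0/raw_sales"],
    "tier2/daily_report": ["tier1/clean_clicks", "tier1/clean_sales"],
}


def tier_of(name: str) -> int:
    """Extract the tier number from a name such as 'tier2/daily_report'."""
    return int(name.split("/")[0][len("tier"):])


def rebuild_plan(faulted: str, sources: Dict[str, List[str]] = SOURCES) -> List[str]:
    """List every dataset derived from `faulted`, lowest tier first."""
    # Invert the lineage so each dataset maps to what is derived from it.
    children: Dict[str, List[str]] = {}
    for derived, parents in sources.items():
        for parent in parents:
            children.setdefault(parent, []).append(derived)

    # Breadth-first walk to collect everything downstream of the fault.
    affected = set()
    queue = deque([faulted])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)

    # Each tier is derived only from lower tiers, so ordering by tier
    # number gives a safe rebuild order.
    return sorted(affected, key=lambda name: (tier_of(name), name))
```

For the table above, `rebuild_plan("tier0/raw_clicks")` returns `["tier1/clean_clicks", "tier2/daily_report"]`: the cleansed data is rebuilt first, then the aggregate that depends on it.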
    • 63–68. There’s a lot to do
        • Build libraries or services to reveal higher-level interfaces
        • Data management and lifecycle events
        • Instrument jobs and services for performance/quality
        • Metadata, metadata, metadata (metadata)
        • Process (job) deployment, service location,
    • 69–72. To the contributors, potential and current
        • We have work to do
        • Still way too much scaffolding work
    • 73–75. I’m out of time (for now)
        • Join me for office hours – 1:40 - 2:20 in Rhinelander
        • I’m signing copies of Hadoop Operations tonight