1
Scaling ETL
with Hadoop
Gwen Shapira, Solutions Architect
@gwenshap
gshapira@cloudera.com
2
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
3
Hadoop Is…
• HDFS – Replicated, distributed data storage
• Map-Reduce – Batch-oriented data processing at scale. A parallel programming platform.
4
The Ecosystem
• High-level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low-latency query language
5
6
Why ETL with Hadoop?
Got unstructured data?
• Traditional ETL:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, Protobuf, ORC, Parquet
• Compression
• Office, OpenDocument, iWorks
• PDF, EPUB, RTF
• MIDI, MP3
• JPEG, TIFF
• Java Classes
• mbox, RFC 822
• AutoCAD
• TrueType Parser
• HDF / NetCDF
8
Replace ETL Clusters
• Informatica is:
• Easy to program
• Complete ETL solution
• Hadoop is:
• Cheaper
• MUCH more flexible
• Faster?
• More scalable?
• You can have both
9
Data Warehouse Offloading
• DWH resources are
EXPENSIVE
• Reduce storage costs
• Release CPU capacity
• Scale
• Better tools
10
What I often see
ETL Cluster
ELT in DWH
ETL in Hadoop
11
ETL Cluster
ETL Cluster with some Hadoop
12
We’ll Discuss:
Technologies Speed & Scale Tips & Tricks
Extract
Transform
Load
Workflow
13
14
Extract
Let me count the ways
• From Databases: Sqoop
• Log Data: Flume + CDK
• Or just write data to HDFS
15
Sqoop – The Balancing Act
16
Scale Sqoop Slowly
• Balance between:
• Maximizing network utilization
• Minimizing database impact
• Start with a smallish table (1-10 GB)
• 1 mapper, 2 mappers, 4 mappers
• Where’s the bottleneck?
• vmstat, iostat, mpstat, netstat, iptraf (example below)
17
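Below, a minimal sketch of the "1 mapper, 2 mappers, 4 mappers" experiment, assuming Sqoop is on the PATH and a reachable JDBC source; the connection string, table name and target directories are hypothetical placeholders, not values from the talk.

```python
# Run the same Sqoop import with increasing parallelism and time each run,
# watching vmstat/iostat/mpstat on the database host to find the bottleneck.
import subprocess
import time

JDBC_URL = "jdbc:mysql://dbhost/shop"   # hypothetical source database
TABLE = "orders"                         # smallish table, 1-10 GB

for mappers in (1, 2, 4):
    cmd = [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--table", TABLE,
        "--num-mappers", str(mappers),
        "--target-dir", "/data/staging/orders_m%d" % mappers,
    ]
    start = time.time()
    subprocess.check_call(cmd)
    print("%d mappers: %.0f seconds" % (mappers, time.time() - start))
```

Stop adding mappers when the runtime stops improving or the database starts to suffer.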
When Loading Files:
Same principles apply:
• Parallel Copy
• Add Parallelism
• Find Bottlenecks
• Resolve them
• Avoid self-DDoS
18
Scaling Sqoop
• Split column - match index or partitions
• Compression
• Drivers + Connectors + Direct mode
• Incremental import (example below)
19
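A sketch of an import that uses all four levers above, again with hypothetical connection details; --direct only helps for databases with a native dump path, and the --last-value bookmark would normally come from the previous run (or a Sqoop saved job).

```python
# Scaled-up Sqoop import: indexed split column, compression, direct mode,
# and an incremental import that only pulls rows added since the last run.
import subprocess

subprocess.check_call([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",
    "--table", "orders",
    "--split-by", "order_id",        # matches the table's primary key index
    "--num-mappers", "8",
    "--compress",                    # compress the files written to HDFS
    "--direct",                      # use the database's native dump utility
    "--incremental", "append",
    "--check-column", "order_id",
    "--last-value", "1000000",       # bookmark from the previous import
    "--target-dir", "/data/staging/orders_delta",
])
```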
Ingest Tips
• Use file system tricks to ensure consistency (example below)
• Directory structure:
/intent
/category
/application (optional)
/dataset
/partitions
/files
• Examples:
/data/fraud/txs/2011-01-01/20110101-00.avro
/group/research/model-17/training-tx/part-00000.txt
/user/gshapira/scratch/surge/
20
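One common file system trick (an assumption on my part, not spelled out on the slide) is to land files in a temporary directory and promote them with a rename, so downstream jobs never see a half-written partition. Paths follow the layout above; the local file name is a placeholder.

```python
# Stage the incoming file, then rename the directory into place. In HDFS a
# rename is a metadata-only operation, so readers see either the old state
# or the complete new partition, never a partial one.
import subprocess

tmp_dir = "/data/fraud/txs/_incoming/2011-01-01"
final_dir = "/data/fraud/txs/2011-01-01"

subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", tmp_dir])
subprocess.check_call(["hdfs", "dfs", "-put", "20110101-00.avro", tmp_dir])
subprocess.check_call(["hdfs", "dfs", "-mv", tmp_dir, final_dir])
```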
Ingest Tips
• External tables in Hive (example below)
• Keep raw data
• Trigger workflows on
file arrival
21
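A sketch of the external-table idea: Hive is pointed at the ingest directories without moving or deleting the raw files. The table and column names are made up, and the DDL assumes a Hive version with native Avro support.

```python
# Create an external, partitioned table over the raw ingest directory, then
# register each day's directory as a partition as it arrives.
import subprocess

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS fraud_txs (
  tx_id BIGINT,
  account STRING,
  amount DOUBLE
)
PARTITIONED BY (ds STRING)
STORED AS AVRO
LOCATION '/data/fraud/txs'
"""
subprocess.check_call(["hive", "-e", ddl])

subprocess.check_call([
    "hive", "-e",
    "ALTER TABLE fraud_txs ADD IF NOT EXISTS PARTITION (ds='2011-01-01') "
    "LOCATION '/data/fraud/txs/2011-01-01'",
])
```

Dropping the table later removes only the metadata; the raw data stays where it was written.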
22
Transform
Endless Possibilities
• Map Reduce
(in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
• Morphlines
23
Prototype
24
Partitioning
• Hive
• Directory Structure
• Pre-filter (example below)
• Adds metadata
25
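A tiny illustration of the pre-filter benefit, reusing the hypothetical fraud_txs table from the ingest sketch: restricting on the partition column means Hive only reads the matching directories.

```python
# Partition pruning: only the 2011-01-01 directory is scanned.
import subprocess

subprocess.check_call([
    "hive", "-e",
    "SELECT account, SUM(amount) "
    "FROM fraud_txs "
    "WHERE ds = '2011-01-01' "
    "GROUP BY account",
])
```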
Tune Data Structures
• Joins are expensive
• Disk space is not
• De-normalize (example below)
• Store same data in
multiple formats
26
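A sketch of "join once, store wide": pay the join cost in one batch job and let every downstream query read the denormalized table. Table and column names are hypothetical, and the Parquet copy is just one of the "same data in multiple formats" options.

```python
# Materialize the join of transactions and account attributes into a
# denormalized, columnar table.
import subprocess

subprocess.check_call([
    "hive", "-e",
    "CREATE TABLE txs_denorm STORED AS PARQUET AS "
    "SELECT t.*, a.account_name, a.segment "
    "FROM fraud_txs t JOIN accounts a ON t.account = a.account_id",
])
```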
Map-Reduce
• Assembly language of data processing (example below)
• Simple things are hard, hard things are possible
• Use for:
• Optimization: Do in one MR job what Hive does in 3
• Optimization: Partition the data just right
• GeoSpatial
• Mahout – Map/Reduce machine learning
27
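To make the "assembly language" point concrete, here is a bare-bones Hadoop Streaming mapper in Python (one of the "in any language" options from the Transform slide); the tab-separated field layout is a made-up example, and a matching reducer would sum values per key.

```python
#!/usr/bin/env python
# Streaming mapper: read raw lines on stdin, emit key<TAB>value pairs on
# stdout for the shuffle. No schema, no planner.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                      # skip malformed records instead of failing
    account, amount = fields[1], fields[2]
    print("%s\t%s" % (account, amount))
```

Submitted through the Hadoop Streaming jar with -mapper and -reducer, this runs as an ordinary MapReduce job.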
Parallelism – Unit of Work
• Amdahl’s Law
• Small Units
• That stay small
• One user?
• One day?
• Ten square meters?
28
Remember the Basics
• X GB of reduce output means roughly 3X GB of disk IO and 2X GB of network IO (HDFS writes three replicas; two of them cross the network)
• Fewer jobs = fewer reduces = less IO = faster and scalier
• Know your network and disk throughput
• Have rough idea of ops-per-second
29
Instrumentation
• Optimize the right things
• Right jobs
• Right hardware
• 90% of the time –
it's not the hardware
30
Tips
Avoid the
“old data warehouse guy”
syndrome
31
Tips
• Slowly changing dimensions:
• Load changes
• Merge
• And swap (example below)
• Store intermediate results:
• Performance
• Debugging
• Store Source/Derived relation
32
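A sketch of the load / merge / swap pattern for a slowly changing dimension in Hive (assuming no in-place updates): rebuild the dimension next to the live table, then swap names. All table and column names are hypothetical.

```python
# Merge the day's changes into a fresh copy of the dimension, then swap it in.
# The old table is kept around for debugging and for rebuilding derived data.
import subprocess

merge_sql = """
CREATE TABLE customers_new STORED AS PARQUET AS
SELECT COALESCE(chg.customer_id, cur.customer_id) AS customer_id,
       COALESCE(chg.name, cur.name)               AS name,
       COALESCE(chg.address, cur.address)         AS address
FROM customers_current cur
FULL OUTER JOIN customers_changes chg
  ON cur.customer_id = chg.customer_id;

ALTER TABLE customers_current RENAME TO customers_old;
ALTER TABLE customers_new RENAME TO customers_current;
"""
subprocess.check_call(["hive", "-e", merge_sql])
```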
Fault and Rebuild
• Tier 0 – raw data
• Tier 1 – cleaned data
• Tier 2 – transformations, lookups and denormalization
• Tier 3 - Aggregations
33
Never Metadata I didn’t like
• Metadata is small data
• That changes frequently
• Solutions:
• Oozie
• Directory structure / Partitions
• Outside of HDFS
• HBase
• Cloudera Navigator
34
A Few Words about Real-Time ETL
• What does it even mean?
• Fast reporting?
• No delay from OLTP to DWH?
• Micro-batches make more sense:
• Aggregation
• Economy of scale
• Late data happens
• Near-line solutions
35
36
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle Connectors
• NoSQLs
37
Scaling
• Sqoop works better for some DBs
• Fuse-DFS and DB tools give more control
• Load in parallel
• Directly to FS
• Partitions
• Do all formatting on Hadoop (example below)
• Do you REALLY need to load that?
38
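A sketch of a parallel load back into the warehouse with Sqoop export, on the assumption that the files were already formatted on Hadoop as delimited text; the connection string, table and export directory are placeholders.

```python
# Export pre-formatted aggregate files in parallel; the database only sees
# a plain insert stream per mapper.
import subprocess

subprocess.check_call([
    "sqoop", "export",
    "--connect", "jdbc:oracle:thin:@dwhhost:1521/DWH",
    "--table", "DAILY_AGG",
    "--export-dir", "/data/fraud/daily_agg/2011-01-01",
    "--num-mappers", "8",
    "--input-fields-terminated-by", "\\t",   # files are tab-separated
])
```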
How not to Load
• Most Hadoop customers don’t load data in bulk
• History can stay in Hadoop
• Load only aggregated data
• Or computation results – recommendations, reports.
• Most queries can run in Hadoop
• BI tools often run in Hadoop
39
40
Workflow Management
Tools
• Oozie
• Azkaban
} Native Hadoop
• Pentaho Kettle
• Talend
} Kinda Open Source
• Informatica
41
Scaling Challenges
• Keeping track of:
• Code Components
• Metadata
• Integrations and Adapters
• Reports, results, artifacts
• Scheduling and Orchestration
• Cohesive System View
• Life Cycle
• Instrumentation, Measurement and Monitoring
42
My Toolbox
• Hue + Oozie:
• Scheduling + Orchestration
• Cohesive system view
• Process repository
• Some metadata
• Some instrumentation
• Cloudera Manager for monitoring
• … and way too many home grown scripts
43
Hue + Oozie
44
45
— Josh Wills
A lot is still missing
• Metadata solution
• Instrumentation and monitoring solution
• Lifecycle management
• Lineage tracking
Contributions are welcome
46
47


Editor's Notes

  • #25 Start with a portion of your data for fast iterations. Prototype with Impala / streaming. Start high level – tune as you go.
  • #29 Pregnancy takes 9 months, no matter how many women are assigned to it. But – elephants are pregnant for two years!
  • #32 50% of Hadoop systems end up managed by the data warehouse team. And they want to do things as they always did – Kimball methodologies, time dimensions, etc. Not always a good idea, not always scalable, not always possible.