3. ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
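The three steps above can be sketched in a few lines. This is only an illustration: the CSV source, the `name`/`amount` columns, the `sales` table and the in-memory SQLite target are all assumptions made for the example, not part of the talk.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from files or an external system.
RAW = "name,amount\n alice ,10\nBOB,5\n"

def extract(text):
    """Extract: parse rows from the outside source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names and convert amounts to integers."""
    return [(r["name"].strip().lower(), int(r["amount"])) for r in rows]

def load(rows, conn):
    """Load: write the cleaned rows into the end target."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
```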
4. Hadoop Is…
• HDFS – Replicated, distributed data storage
• Map-Reduce – Batch oriented data processing at scale. Parallel programming platform.
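To make the Map-Reduce programming model concrete, here is a single-process sketch of word count. It only imitates the map, shuffle and reduce phases in plain Python; it is not Hadoop's API, and the function names are made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a cluster the map calls run in parallel across input splits and the shuffle moves data over the network, which is where most of the cost comes from.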
5. The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low latency query language
8. Replace ETL Clusters
• Informatica is:
• Easy to program
• Complete ETL solution
• Hadoop is:
• Cheaper
• MUCH more flexible
• Faster?
• More scalable?
• You can have both
9. Data Warehouse Offloading
• DWH resources are EXPENSIVE
• Reduce storage costs
• Release CPU capacity
• Scale
• Better tools
10. What I often see
• ETL cluster
• ELT in DWH
• ETL in Hadoop
• ETL cluster with some Hadoop
25. Tune Data Structures
• Joins are expensive
• Disk space is not
• De-normalize
• Store same data in multiple formats
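A small sketch of the de-normalization idea: pay for the join once at write time and store the wide record, so later reads need no join. The `orders` and `customers` data and all field names are hypothetical.

```python
# Hypothetical normalized inputs: a fact table and a small dimension table.
orders = [
    {"order_id": 1, "customer_id": 10, "total": 25.0},
    {"order_id": 2, "customer_id": 11, "total": 40.0},
]
customers = {
    10: {"name": "Ann", "country": "US"},
    11: {"name": "Raj", "country": "IN"},
}

def denormalize(orders, customers):
    """Join once at write time and store the wide record."""
    wide = []
    for order in orders:
        customer = customers.get(order["customer_id"], {})
        wide.append({
            **order,
            "customer_name": customer.get("name"),
            "customer_country": customer.get("country"),
        })
    return wide

wide_orders = denormalize(orders, customers)
```

The trade is extra disk space for cheaper queries, which matches the slide's point that joins are expensive and disk space is not.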
26. Map-Reduce
• Assembly language of data processing
• Simple things are hard, hard things are possible
• Use for:
• Optimization: Do in one MR job what Hive does in 3
• Optimization: Partition the data just right
• GeoSpatial
• Mahout – Map/Reduce machine learning
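One way to read the "one job instead of three" optimization is to fuse several aggregations into a single pass over the data. The sketch below illustrates that idea in plain Python; it is not a claim about how Hive plans any particular query, and `multi_aggregate` is a name invented for the example.

```python
from collections import defaultdict

def multi_aggregate(records):
    """Single pass over (key, value) records producing count, sum and max per key."""
    stats = defaultdict(lambda: {"count": 0, "sum": 0, "max": None})
    for key, value in records:
        s = stats[key]
        s["count"] += 1
        s["sum"] += value
        s["max"] = value if s["max"] is None else max(s["max"], value)
    return dict(stats)
```

Computing each aggregate in its own job would read and shuffle the same input three times; a hand-written reducer can keep all three running totals at once.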
27. Parallelism – Unit of Work
• Amdahl’s Law
• Small Units
• That stay small
• One user?
• One day?
• Ten square meters?
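Amdahl's Law, mentioned at the top of this slide, bounds the speedup by the part of the job that cannot be parallelized. A minimal sketch of the formula, with parameter names chosen for the example:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

With 90% of the work parallelizable, 10 workers give a speedup of roughly 5.3, and no number of workers can push it past 10. A single oversized unit of work behaves like extra serial time, which is one reason to pick units that stay small.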
28. Remember the Basics
• Reduce output of size X costs 3X disk IO and 2X network IO (with the default HDFS replication factor of 3)
• Fewer jobs = fewer reduces = less IO = faster and more scalable
• Know your network and disk throughput
• Have a rough idea of ops-per-second
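The rules of thumb above lend themselves to back-of-envelope arithmetic. The sketch below assumes the first replica is written locally and the remaining replicas travel over the network; the function names and the throughput figure in the usage note are illustrative, not measurements.

```python
def reduce_output_io(output_gb, replication=3):
    """Estimate disk and network IO caused by writing reduce output to HDFS.

    Assumes the first replica is written locally and the remaining
    replicas are shipped over the network.
    """
    disk_gb = output_gb * replication
    network_gb = output_gb * (replication - 1)
    return disk_gb, network_gb

def seconds_to_move(gb, throughput_mb_per_s):
    """Rough time to move `gb` gigabytes at a given throughput."""
    return gb * 1024 / throughput_mb_per_s
```

For example, 100 GB of reduce output becomes 300 GB of disk writes and 200 GB of network traffic; feeding those numbers and your own measured throughput into `seconds_to_move` gives a quick sanity check on a job's expected runtime.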
31. Tips
• Slowly changing dimensions:
• Load changes
• Merge
• And swap
• Store intermediate results:
• Performance
• Debugging
• Store Source/Derived relation
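The load, merge and swap steps for a slowly changing dimension can be sketched on a local filesystem as follows. This is an assumption-heavy illustration: it uses a JSON file, an `id` key, and an overwrite-style merge that keeps no history. On a cluster the swap would presumably be a directory rename rather than `os.replace`.

```python
import json
import os
import tempfile

def merge_and_swap(current_path, changes):
    """Merge changed rows into a dimension file, then swap the new file into place."""
    with open(current_path) as f:
        dimension = {row["id"]: row for row in json.load(f)}
    # Merge: changed rows overwrite existing ones, new rows are added.
    for row in changes:
        dimension[row["id"]] = row
    # Write the merged result next to the current file, then swap atomically.
    directory = os.path.dirname(os.path.abspath(current_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(dimension.values(), key=lambda r: r["id"]), f)
    os.replace(tmp_path, current_path)
```

Readers see either the old file or the fully merged one, never a half-written result.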
32. Fault and Rebuild
• Tier 0 – raw data
• Tier 1 – cleaned data
• Tier 2 – transformations, lookups and denormalization
• Tier 3 – aggregations
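The value of the tiers is that every derived tier can be recomputed from the one below it. A minimal sketch, with hypothetical record fields (`user`, `amount`) and a hypothetical `countries` lookup:

```python
def clean(raw):
    """Tier 1: drop malformed records and normalize types."""
    return [{"user": r["user"], "amount": float(r["amount"])}
            for r in raw if r.get("user") and r.get("amount") is not None]

def enrich(cleaned, countries):
    """Tier 2: denormalize by attaching a lookup value."""
    return [{**r, "country": countries.get(r["user"], "unknown")} for r in cleaned]

def aggregate(enriched):
    """Tier 3: total amount per country."""
    totals = {}
    for r in enriched:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

def rebuild(raw, countries):
    """Recompute every derived tier from tier 0."""
    tier1 = clean(raw)
    tier2 = enrich(tier1, countries)
    tier3 = aggregate(tier2)
    return tier1, tier2, tier3
```

If a bug is found in a transformation, only the raw tier has to be trusted; everything above it is disposable.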
33. Never Metadata I didn’t like
• Metadata is small data
• That changes frequently
• Solutions:
• Oozie
• Directory structure / Partitions
• Outside of HDFS
• HBase
• Cloudera Navigator
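Of the options above, the directory structure is the simplest to illustrate: with a `key=value` layout, the path itself carries the metadata. The sketch below assumes that layout; the example path in the usage note is hypothetical.

```python
def parse_partition_path(path):
    """Recover metadata from a key=value directory layout."""
    metadata = {}
    for part in path.strip("/").split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            metadata[key] = value
    return metadata
```

For a path such as `/data/clicks/dt=2013-10-01/country=us/part-00000`, this returns the `dt` and `country` values without consulting any external store.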
34. Few words about Real Time ETL
• What does it even mean?
• Fast reporting?
• No delay from OLTP to DWH?
• Micro-batches make more sense:
• Aggregation
• Economy of scale
• Late data happens
• Near-line solutions
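A rough sketch of micro-batching with late data: events are grouped into fixed windows by event time, and a late event simply marks an already processed window for a rerun. Timestamps are plain integer seconds and both function names are invented for the example.

```python
from collections import defaultdict

def micro_batch(events, batch_seconds=60):
    """Group (event_time, value) pairs into fixed windows keyed by window start."""
    batches = defaultdict(list)
    for event_time, value in events:
        window_start = event_time - (event_time % batch_seconds)
        batches[window_start].append(value)
    return dict(batches)

def windows_to_reprocess(new_events, already_closed, batch_seconds=60):
    """Late data: find closed windows that received new events and need a rerun."""
    touched = set(micro_batch(new_events, batch_seconds))
    return sorted(touched & set(already_closed))
```

Batching by window keeps aggregation cheap, and reprocessing a whole small window is usually simpler than patching individual results.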
37. Scaling
• Sqoop works better for some DBs
• Fuse-FS and DB tools give more control
• Load in parallel
• Directly to FS
• Partitions
• Do all formatting on Hadoop
• Do you REALLY need to load that?
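Loading in parallel by partition can be sketched with a thread pool. The `loader` callable is a stand-in for whatever actually moves one partition (a DB bulk-load tool, a file copy, and so on); it and the partition names are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def load_in_parallel(partitions, loader, max_workers=4):
    """Run `loader` on independent partitions concurrently.

    `partitions` is a list of partition identifiers; `loader` is a
    hypothetical callable that loads one partition and returns a result
    such as a row count.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with partitions.
        results = list(pool.map(loader, partitions))
    return dict(zip(partitions, results))
```

This only helps when partitions are independent and the target can absorb concurrent writes, so `max_workers` is worth tuning against the database rather than the cluster.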
38. How not to Load
• Most Hadoop customers don’t load data in bulk
• History can stay in Hadoop
• Load only aggregated data
• Or computation results – recommendations, reports.
• Most queries can run in Hadoop
• BI tools often run in Hadoop
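"Load only aggregated data" can be pictured as follows: the heavy aggregation stays with the raw history, and only the small result crosses into the warehouse. In this sketch an in-memory SQLite database stands in for the DWH, and the `daily_sales` table and event shape are hypothetical.

```python
import sqlite3
from collections import Counter

def daily_totals(events):
    """Aggregate raw (day, amount) events; in practice this step would run on the cluster."""
    totals = Counter()
    for day, amount in events:
        totals[day] += amount
    return sorted(totals.items())

def load_summary(conn, totals):
    """Load only the small aggregated result into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (day TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", totals)
    conn.commit()
```

Using `INSERT OR REPLACE` keyed on the day means a rerun for the same day overwrites the earlier total instead of duplicating it.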
41. Scaling Challenges
• Keeping track of:
• Code Components
• Metadata
• Integrations and Adapters
• Reports, results, artifacts
• Scheduling and Orchestration
• Cohesive System View
• Life Cycle
• Instrumentation, Measurement and Monitoring
42. My Toolbox
• Hue + Oozie:
• Scheduling + Orchestration
• Cohesive system view
• Process repository
• Some metadata
• Some instrumentation
• Cloudera Manager for monitoring
• … and way too many home grown scripts
45. A lot is still missing
• Metadata solution
• Instrumentation and monitoring solution
• Lifecycle management
• Lineage tracking
Contributions are welcome
• Start with a portion of your data for fast iterations
• Prototype – with Impala / streaming
• Start high level – tune as you go
Pregnancy takes 9 months, no matter how many women are assigned to it. But elephants are pregnant for two years!
50% of Hadoop systems end up managed by the data warehouse team, and they want to do things as they always did: Kimball methodologies, time dimensions, etc. That is not always a good idea, not always scalable, and not always possible.