Scaling ETL with Hadoop – Gwen Shapira
Speaker notes

  • Start with a portion of your data for fast iterations. Prototype with Impala / streaming. Start high level – tune as you go.
  • Pregnancy takes 9 months, no matter how many women are assigned to it. But elephants are pregnant for two years!
  • 50% of Hadoop systems end up managed by the data warehouse team, and they want to do things as they always did – Kimball methodologies, time dimensions, etc. Not always a good idea, not always scalable, not always possible.

Transcript

  • 1. Scaling ETL with Hadoop. Gwen Shapira, Solutions Architect. @gwenshap, gshapira@cloudera.com
  • 2. (title image)
  • 3. ETL is… • Extracting data from outside sources • Transforming it to fit operational needs • Loading it into the end target • (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
  • 4. Hadoop is… • HDFS – Replicated, distributed data storage • Map-Reduce – Batch-oriented data processing at scale. Parallel programming platform.
  • 5. The Ecosystem • High-level languages and abstractions • File, relational and streaming data integration • Process orchestration and scheduling • Libraries for data wrangling • Low-latency query language
  • 6. Why ETL with Hadoop?
  • 7. Got unstructured data? • Traditional ETL: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache logs • Avro, Protobuf, ORC, Parquet • Compression • Office, OpenDocument, iWork • PDF, EPUB, RTF • MIDI, MP3 • JPEG, TIFF • Java classes • Mbox, RFC 822 • AutoCAD • TrueType parser • HDF / NetCDF
  • 8. Replace ETL Clusters • Informatica is: • Easy to program • Complete ETL solution • Hadoop is: • Cheaper • MUCH more flexible • Faster? • More scalable? • You can have both
  • 9. Data Warehouse Offloading • DWH resources are EXPENSIVE • Reduce storage costs • Release CPU capacity • Scale • Better tools
  • 10. What I often see: a progression from a dedicated ETL cluster, to ELT in the DWH, to an ETL cluster with some Hadoop, to ETL in Hadoop
  • 11. (image slide)
  • 12. We’ll discuss technologies, speed & scale, and tips & tricks for each stage: Extract, Transform, Load, Workflow
  • 13. Extract
  • 14. Let me count the ways • From databases: Sqoop • Log data: Flume + CDK • Or just write data to HDFS
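A rough sketch of the first and last of these paths (connection string, credentials, table and directory names are all hypothetical):

    # Pull a table out of a relational database with Sqoop
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user --password-file /user/etl/.db_password \
      --table transactions \
      --target-dir /data/sales/incoming/transactions

    # Or just write files straight into HDFS
    hadoop fs -put /local/exports/transactions.csv /data/sales/incoming/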
  • 15. Sqoop – The Balancing Act
  • 16. Scale Sqoop Slowly • Balance between: • Maximizing network utilization • Minimizing database impact • Start with a smallish table (1-10G) • 1 mapper, 2 mappers, 4 mappers • Where’s the bottleneck? • vmstat, iostat, mpstat, netstat, iptraf
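One way to run that experiment, reusing the hypothetical table from above; time each run and watch the database host with the tools listed while it executes:

    # Step up parallelism and find where throughput stops improving
    for m in 1 2 4; do
      time sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username etl_user --password-file /user/etl/.db_password \
        --table transactions \
        --target-dir /tmp/sqoop_test_${m}_mappers \
        --num-mappers ${m}
    done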
  • 17. When Loading Files: Same principles apply: • Parallel copy • Add parallelism • Find bottlenecks • Resolve them • Avoid self-DDoS
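For parallel copies, distcp is the stock tool; its -m flag caps the number of concurrent map tasks, which doubles as your throttle against self-DDoS (cluster and path names are made up):

    # Parallel, distributed copy; -m limits concurrent copy tasks
    hadoop distcp -m 10 \
      hdfs://source-cluster/data/logs \
      hdfs://etl-cluster/data/logs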
  • 18. Scaling Sqoop • Split column – match index or partitions • Compression • Drivers + connectors + direct mode • Incremental import
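Combined, those options look roughly like this (the split column and last-value are illustrative):

    # Split on an indexed column, compress in flight, pull only new rows
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user --password-file /user/etl/.db_password \
      --table transactions \
      --split-by tx_id \
      --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
      --incremental append \
      --check-column tx_id --last-value 1234567 \
      --target-dir /data/sales/incoming/transactions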
  • 19. Ingest Tips • Use file system tricks to ensure consistency • Directory structure: /intent /category /application (optional) /dataset /partitions /files • Examples: /data/fraud/txs/2011-01-01/20110101-00.avro /group/research/model-17/training-tx/part-00000.txt /user/gshapira/scratch/surge/
  • 20. Ingest Tips • External tables in Hive • Keep raw data • Trigger workflows on file arrival
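An external table lets Hive query the ingested files in place, and dropping the table later leaves the raw data untouched. A minimal sketch (schema and location are hypothetical):

    hive -e "
      CREATE EXTERNAL TABLE raw_transactions (
        tx_id BIGINT, amount DOUBLE, tx_time STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/sales/incoming/transactions';
    "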
  • 21. Transform
  • 22. Endless Possibilities • Map Reduce (in any language) • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java • Morphlines
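As an example of the "any language" point, Hadoop Streaming runs plain scripts as the map and reduce steps (the streaming jar path varies by distribution, and the script names are made up):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -input /data/sales/incoming/transactions \
      -output /data/sales/transactions_clean \
      -mapper clean_record.sh \
      -reducer aggregate.sh \
      -file clean_record.sh -file aggregate.sh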
  • 23. Prototype
  • 24. Partitioning • Hive • Directory structure • Pre-filter • Adds metadata
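In Hive, a partition maps directly to a directory, so queries that filter on the partition column skip everything else. A sketch, reusing the directory layout from slide 19 (columns are hypothetical):

    hive -e "
      CREATE TABLE fraud_txs (tx_id BIGINT, amount DOUBLE)
      PARTITIONED BY (tx_date STRING);

      ALTER TABLE fraud_txs ADD PARTITION (tx_date='2011-01-01')
      LOCATION '/data/fraud/txs/2011-01-01';
    "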
  • 25. Tune Data Structures • Joins are expensive • Disk space is not • De-normalize • Store the same data in multiple formats
  • 26. Map-Reduce • Assembly language of data processing • Simple things are hard, hard things are possible • Use for: • Optimization: do in one MR job what Hive does in 3 • Optimization: partition the data just right • GeoSpatial • Mahout – Map/Reduce machine learning
  • 27. Parallelism – Unit of Work • Amdahl’s Law • Small units • That stay small • One user? • One day? • Ten square meters?
  • 28. Remember the Basics • X bytes of reduce output means 3X disk IO and 2X network IO • Fewer jobs = fewer reduces = less IO = faster and scalier • Know your network and disk throughput • Have a rough idea of ops-per-second
  • 29. Instrumentation • Optimize the right things • Right jobs • Right hardware • 90% of the time, it’s not the hardware
  • 30. Tips: Avoid the “old data warehouse guy” syndrome
  • 31. Tips • Slowly changing dimensions: • Load changes • Merge • And swap • Store intermediate results: • Performance • Debugging • Store source/derived relation
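The load-merge-swap pattern for slowly changing dimensions might look like this in Hive (table and column names are hypothetical):

    # Merge the existing dimension with the day's changes into a new table
    hive -e "
      CREATE TABLE customers_new AS
      SELECT u.id, u.name FROM (
        SELECT c.id AS id, c.name AS name FROM customers c
        LEFT OUTER JOIN customer_changes ch ON c.id = ch.id
        WHERE ch.id IS NULL
        UNION ALL
        SELECT id, name FROM customer_changes
      ) u;
    "
    # Swap: keep the old table around for debugging and rollback
    hive -e "
      ALTER TABLE customers RENAME TO customers_old;
      ALTER TABLE customers_new RENAME TO customers;
    "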
  • 32. Fault and Rebuild • Tier 0 – raw data • Tier 1 – cleaned data • Tier 2 – transformations, lookups and denormalization • Tier 3 – aggregations
  • 33. Never Metadata I Didn’t Like • Metadata is small data • That changes frequently • Solutions: • Oozie • Directory structure / partitions • Outside of HDFS • HBase • Cloudera Navigator
  • 34. A few words about real-time ETL • What does it even mean? • Fast reporting? • No delay from OLTP to DWH? • Micro-batches make more sense: • Aggregation • Economy of scale • Late data happens • Near-line solutions
  • 35. Load
  • 36. Technologies • Sqoop • Fuse-DFS • Oracle connectors • NoSQLs
  • 37. Scaling • Sqoop works better for some DBs • Fuse-FS and DB tools give more control • Load in parallel • Directly to FS • Partitions • Do all formatting on Hadoop • Do you REALLY need to load that?
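A parallel export of pre-aggregated results back to the warehouse might look like this with Sqoop (connection string and table name are hypothetical):

    # Export only the aggregate, already formatted on the Hadoop side
    sqoop export \
      --connect jdbc:oracle:thin:@dwh-host:1521:DWH \
      --username etl_user --password-file /user/etl/.db_password \
      --table DAILY_SALES_AGG \
      --export-dir /data/sales/daily_agg \
      --num-mappers 8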
  • 38. How Not to Load • Most Hadoop customers don’t load data in bulk • History can stay in Hadoop • Load only aggregated data • Or computation results – recommendations, reports • Most queries can run in Hadoop • BI tools often run in Hadoop
  • 39. Workflow Management
  • 40. Tools • Oozie, Azkaban (native Hadoop) • Pentaho Kettle, Talend (kinda open source) • Informatica
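Whatever the tool, the day-to-day interaction is similar; with Oozie it is a CLI against a packaged workflow (server URL and properties file are illustrative):

    # job.properties points at the workflow.xml in HDFS and sets its parameters
    oozie job -oozie http://oozie-server:11000/oozie \
      -config job.properties -run

    # Check status later with the returned job id
    oozie job -oozie http://oozie-server:11000/oozie -info <job-id>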
  • 41. Scaling Challenges • Keeping track of: • Code components • Metadata • Integrations and adapters • Reports, results, artifacts • Scheduling and orchestration • Cohesive system view • Life cycle • Instrumentation, measurement and monitoring
  • 42. My Toolbox • Hue + Oozie: • Scheduling + orchestration • Cohesive system view • Process repository • Some metadata • Some instrumentation • Cloudera Manager for monitoring • … and way too many home-grown scripts
  • 43. Hue + Oozie
  • 44. (quote slide, attributed to Josh Wills)
  • 45. A lot is still missing • Metadata solution • Instrumentation and monitoring solution • Lifecycle management • Lineage tracking • Contributions are welcome
  • 46. (closing slide)