Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Scaling etl with hadoop shapira 3

4,692 views

Published on

Published in: Technology

Scaling etl with hadoop shapira 3

  1. 1. 1 Scaling ETL with Hadoop Gwen Shapira, Solutions Architect @gwenshap gshapira@cloudera.com
  2. 2. 2
  3. 3. ETL is… • Extracting data from outside sources • Transforming it to fit operational needs • Loading it into the end target • (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load) 3
  4. 4. Hadoop Is… • HDFS – Replicated, distributed data storage • Map-Reduce – Batch oriented data processing at scale. Parallel programming platform. 4
  5. 5. The Ecosystem • High level languages and abstractions • File, relational and streaming data integration • Process Orchestration and Scheduling • Libraries for data wrangling • Low latency query language 5
  6. 6. 6 Why ETL with Hadoop?
  7. 7. Got unstructured data? • Traditional ETL: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, ProtoBuffs, ORC, Parquet • Compression • Office, OpenDocument, iWorks • PDF, Epup, RTF • Midi, MP3 • JPEG, Tiff • Java Classes • Mbox, RFC822 • Autocad • TrueType Parser • HFD / NetCDF 8
  8. 8. Replace ETL Clusters • Informatica is: • Easy to program • Complete ETL solution • Hadoop is: • Cheaper • MUCH more flexible • Faster? • More scalable? • You can have both 9
  9. 9. Data Warehouse Offloading • DWH resources are EXPENSIVE • Reduce storage costs • Release CPU capacity • Scale • Better tools 10
  10. 10. What I often see ETL Cluster ELT in DWH ETL in Hadoop 11 ETL Cluster ETL Cluster with some Hadoop
  11. 11. 12
  12. 12. We’ll Discuss: Technologies Speed & Scale Tips & Tricks Extract Transform Load Workflow 13
  13. 13. 14 Extract
  14. 14. Let me count the ways • From Databases: Sqoop • Log Data: Flume + CDK • Or just write data to HDFS 15
  15. 15. Sqoop – The Balancing Act 16
  16. 16. Scale Sqoop Slowly • Balance between: • Maximizing network utilization • Minimizing database impact • Start with smallish table (1-10G) • 1 mapper, 2 mappers, 4 mappers • Where’s the bottleneck? • vmtat, iostat, mpstat, netstat, iptraf 17
  17. 17. When Loading Files: Same principles apply: • Parallel Copy • Add Parallelism • Find Bottlenecks • Resolve them • Avoid Self-DDOS 18
  18. 18. Scaling Sqoop • Split column - match index or partitions • Compression • Drivers + Connectors + Direct mode • Incremental import 19
  19. 19. Ingest Tips • Use file system tricks to ensure consistency • Directory structure: /intent /category /application (optional) /dataset /partitions /files • Examples: /data/fraud/txs/2011-01-01/20110101-00.avro /group/research/model-17/training-tx/part-00000.txt /user/gshapira/scratch/surge/ 20
  20. 20. Ingest Tips • External tables in Hive • Keep raw data • Trigger workflows on file arrival 21
  21. 21. 22 Transform
  22. 22. Endless Possibilites • Map Reduce (in any language) • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java • Morphlines 23
  23. 23. Prototype 24
  24. 24. Partitioning • Hive • Directory Structure • Pre-filter • Adds metadata 25
  25. 25. Tune Data Structures • Joins are expensive • Disk space is not • De-normalize • Store same data in multiple formats 26
  26. 26. Map-Reduce • Assembly language of data processing • Simple things are hard, hard things are possible • Use for: • Optimization: Do in one MR job what Hive does in 3 • Optimization: Partition the data just right • GeoSpatial • Mahout – Map/Reduce machine learning 27
  27. 27. Parallelism –Unit of Work • Amdahl’s Law • Small Units • That stay small • One user? • One day? • Ten square meter? 28
  28. 28. Remember the Basics • X reduce output is 3X disk IO and 2X network IO • Less jobs = Less reduces = Less IO = Faster and Scalier • Know your network and disk throughput • Have rough idea of ops-per-second 29
  29. 29. Instrumentation • Optimize the right things • Right jobs • Right hardware • 90% of the time – its not the hardware 30
  30. 30. Tips Avoid the “old data warehouse guy” syndrome 31
  31. 31. Tips • Slowly changing dimensions: • Load changes • Merge • And swap • Store intermediate results: • Performance • Debugging • Store Source/Derived relation 32
  32. 32. Fault and Rebuild • Tier 0 – raw data • Tier 1 – cleaned data • Tier 2 – transformations, lookups and denormalization • Tier 3 - Aggregations 33
  33. 33. Never Metadata I didn’t like • Metadata is small data • That changes frequently • Solutions: • Oozie • Directory structure / Partitions • Outside of HDFS • HBase • Cloudera Navigator 34
  34. 34. Few words about Real Time ETL • What does it even mean? • Fast reporting? • No delay from OLTP to DWH? • Micro-batches make more sense: • Aggregation • Economy of scale • Late data happens • Near-line solutions 35
  35. 35. 36 Load
  36. 36. Technologies • Sqoop • Fuse-DFS • Oracle Connectors • NoSQLs 37
  37. 37. Scaling • Sqoop works better for some DBs • Fuse-FS and DB tools give more control • Load in parallel • Directly to FS • Partitions • Do all formatting on Hadoop • Do you REALLY need to load that? 38
  38. 38. How not to Load • Most Hadoop customers don’t load data in bulk • History can stay in Hadoop • Load only aggregated data • Or computation results – recommendations, reports. • Most queries can run in Hadoop • BI tools often run in Hadoop 39
  39. 39. 40 Workflow Management
  40. 40. Tools • Oozie • Azkaban • Pentaho Kettle • TalenD • Informatica 41 } Native Hadoop } Kinda Open Source
  41. 41. Scaling Challenges • Keeping track of: • Code Components • Metadata • Integrations and Adapters • Reports, results, artifacts • Scheduling and Orchestration • Cohesive System View • Life Cycle • Instrumentation, Measurement and Monitoring 42
  42. 42. My Toolbox • Hue + Oozie: • Scheduling + Orchestration • Cohesive system view • Process repository • Some metadata • Some instrumentation • Cloudera Manager for monitoring • … and way too many home grown scripts 43
  43. 43. Hue + Oozie 44
  44. 44. 45 — Josh Wills
  45. 45. A lot is still missing • Metadata solution • Instrumentation and monitoring solution • Lifecycle management • Lineage tracking Contributions are welcome 46
  46. 46. 47

×