1
Scaling ETL
with Hadoop
Gwen Shapira, Solutions Architect
@gwenshap
gshapira@cloudera.com
2
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
3
Hadoop Is…
• HDFS – Replicated, distributed data storage
• Map-Reduce – Batch oriented data processing at
scale. Parallel programming platform.
4
The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low latency query language
5
6
Why ETL with Hadoop?
Got unstructured data?
• Traditional ETL:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, Protobuf, ORC, Parquet
• Compression
• Office, OpenDocument, iWork
• PDF, EPUB, RTF
• MIDI, MP3
• JPEG, TIFF
• Java Classes
• Mbox, RFC822
• AutoCAD
• TrueType Parser
• HDF / NetCDF
8
Replace ETL Clusters
• Informatica is:
• Easy to program
• Complete ETL solution
• Hadoop is:
• Cheaper
• MUCH more flexible
• Faster?
• More scalable?
• You can have both
9
Data Warehouse Offloading
• DWH resources are
EXPENSIVE
• Reduce storage costs
• Release CPU capacity
• Scale
• Better tools
10
What I often see
ETL
Cluster
ELT in
DWH
ETL in
Hadoop
11
ETL Cluster
ETL Cluster with
some Hadoop
12
We’ll Discuss:
Technologies Speed & Scale Tips & Tricks
Extract
Transform
Load
Workflow
13
14
Extract
Let me count the ways
• From Databases: Sqoop
• Log Data: Flume + CDK
• Or just write data to HDFS
15
Sqoop – The Balancing Act
16
Scale Sqoop Slowly
• Balance between:
• Maximizing network utilization
• Minimizing database impact
• Start with a smallish table (1-10 GB)
• 1 mapper, 2 mappers, 4 mappers
• Where’s the bottleneck?
• vmstat, iostat, mpstat, netstat, iptraf
17
When Loading Files:
Same principles apply:
• Parallel Copy
• Add Parallelism
• Find Bottlenecks
• Resolve them
• Avoid Self-DDoS
18
Scaling Sqoop
• Split column - match index or partitions
• Compression
• Drivers + Connectors + Direct mode
• Incremental import
19
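A rough sketch of the Sqoop ramp-up and scaling options above. The helper only builds command lines and does not run them. The JDBC URL, table name, split column and target directories in the usage note are hypothetical placeholders; the flags themselves (--num-mappers, --split-by, --compress, --direct) are standard Sqoop 1 import options.

```python
def sqoop_import_cmd(connect, table, target_dir, mappers,
                     split_by=None, compress=False, direct=False):
    """Build (but do not run) a Sqoop 1 import command as an argument list."""
    cmd = [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]
    if split_by:
        # Ideally an indexed or partitioning column, so each mapper
        # reads a cheap, evenly sized range.
        cmd += ["--split-by", split_by]
    if compress:
        cmd.append("--compress")
    if direct:
        # Direct mode is only available for some databases.
        cmd.append("--direct")
    return cmd


def ramp_up(connect, table, base_dir, split_by, steps=(1, 2, 4)):
    """One command per parallelism level, each writing to its own directory."""
    return [
        sqoop_import_cmd(connect, table, f"{base_dir}/m{n}", n, split_by=split_by)
        for n in steps
    ]
```

For example, ramp_up("jdbc:mysql://db.example.com/shop", "orders", "/data/staging/orders", "order_id") yields three commands to time one after another while watching vmstat and iostat on both the database host and the cluster.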
Ingest Tips
• Use file system tricks to ensure consistency
• Directory structure:
/intent
/category
/application (optional)
/dataset
/partitions
/files
• Examples:
/data/fraud/txs/2011-01-01/20110101-00.avro
/group/research/model-17/training-tx/part-00000.txt
/user/gshapira/scratch/surge/
20
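A minimal sketch of the directory convention above, plus one common "file system trick": write under a temporary name, then rename into place so readers never see a half-written file. The rename approach is an assumption about what the slide means, and the example uses the local file system as a stand-in for HDFS. The function names are illustrative.

```python
import os
import tempfile
from pathlib import PurePosixPath


def dataset_path(intent, category, dataset, filename,
                 application=None, partition=None):
    """Compose /intent/category[/application]/dataset[/partition]/file."""
    parts = [intent, category]
    if application:
        parts.append(application)
    parts.append(dataset)
    if partition:
        parts.append(partition)
    parts.append(filename)
    return str(PurePosixPath("/", *parts))


def publish_atomically(data: bytes, final_path: str) -> None:
    """Write to a hidden temp file in the target directory, then rename."""
    final_dir = os.path.dirname(final_path)
    os.makedirs(final_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    # The rename is the commit point: the file is either absent or complete.
    os.replace(tmp_path, final_path)
```

dataset_path("data", "fraud", "txs", "20110101-00.avro", partition="2011-01-01") reproduces the first example path on the slide.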
Ingest Tips
• External tables in Hive
• Keep raw data
• Trigger workflows on
file arrival
21
22
Transform
Endless Possibilities
• Map Reduce
(in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
• Morphlines
23
Prototype
24
Partitioning
• Hive
• Directory Structure
• Pre-filter
• Adds metadata
25
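To illustrate how a partitioned directory structure acts as a pre-filter, here is a small sketch that selects only the date-named partition directories a job needs, before any data is read. It assumes partitions are named as ISO dates (as in the earlier /data/fraud/txs/2011-01-01 example); the function name is illustrative.

```python
from datetime import date


def prune_partitions(partition_dirs, start, end):
    """Keep only date-named partition directories within [start, end]."""
    selected = []
    for path in partition_dirs:
        # The last path component is the partition value, e.g. "2011-01-02".
        name = path.rstrip("/").rsplit("/", 1)[-1]
        day = date.fromisoformat(name)
        if start <= day <= end:
            selected.append(path)
    return selected
```

Hive applies the same idea when a query filters on a partition column: directories outside the filter are skipped entirely.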
Tune Data Structures
• Joins are expensive
• Disk space is not
• De-normalize
• Store same data in
multiple formats
26
Map-Reduce
• Assembly language of data processing
• Simple things are hard, hard things are possible
• Use for:
• Optimization: Do in one MR job what Hive does in 3
• Optimization: Partition the data just right
• GeoSpatial
• Mahout – Map/Reduce machine learning
27
Parallelism – Unit of Work
• Amdahl’s Law
• Small Units
• That stay small
• One user?
• One day?
• Ten square meters?
28
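A short worked sketch of why the unit of work matters. The first function is Amdahl's Law; the second is a simple lower bound on job time showing how one oversized unit (for example, one very active user) caps the benefit of adding workers. The function names and the example numbers are illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def makespan_lower_bound(unit_sizes, workers):
    """No schedule can beat the largest unit or the evenly spread total."""
    return max(max(unit_sizes), sum(unit_sizes) / workers)
```

With 90% of the work parallelizable, 10 workers give a speedup of about 5.3, and no number of workers gets past 10. Likewise, four units of sizes 1, 1, 1 and 97 on four workers still take at least 97 time units, while four units of 25 take 25.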
Remember the Basics
• X reduce output is 3X disk IO and 2X network IO
• Fewer jobs = Fewer reduces = Less IO = Faster and more scalable
• Know your network and disk throughput
• Have a rough idea of ops-per-second
29
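The 3X / 2X rule of thumb follows from HDFS replication: with the default replication factor of 3, each block of reduce output is written to disk three times, and two of those copies travel over the network to other nodes. A back-of-the-envelope sketch, assuming the first replica is written locally (the function name is illustrative):

```python
def reduce_output_io(output_gb, replication=3):
    """Return (disk_write_gb, network_gb) for writing reduce output to HDFS."""
    disk_write_gb = output_gb * replication
    # Every replica except the local one crosses the network.
    network_gb = output_gb * (replication - 1)
    return disk_write_gb, network_gb
```

A job that emits 100 GB from its reducers therefore costs roughly 300 GB of disk writes and 200 GB of network traffic, which is why collapsing three chained jobs into one can pay off.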
Instrumentation
• Optimize the right things
• Right jobs
• Right hardware
• 90% of the time –
it's not the hardware
30
Tips
Avoid the
“old data warehouse guy”
syndrome
31
Tips
• Slowly changing dimensions:
• Load changes
• Merge
• And swap
• Store intermediate results:
• Performance
• Debugging
• Store Source/Derived relation
32
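The load / merge / swap pattern for slowly changing dimensions can be sketched in memory as follows. This is a simplified illustration that overwrites changed rows (a Type 1 style update); the table registry, key name and function names are hypothetical. On Hadoop the "swap" would typically be a directory rename or a change to the table's location.

```python
def merge_changes(current_rows, changed_rows, key="id"):
    """Build a new full copy: changed rows replace or extend current rows."""
    merged = {row[key]: row for row in current_rows}
    for row in changed_rows:
        merged[row[key]] = row
    return list(merged.values())


def load_merge_swap(tables, name, changed_rows, key="id"):
    """Merge into a staged copy, then swap it in; return the old version."""
    staged = merge_changes(tables[name], changed_rows, key)
    previous = tables[name]
    tables[name] = staged  # the swap: readers see old or new, never a mix
    return previous
```

Keeping the returned previous version around is one way to follow the "store intermediate results" tip: it helps with debugging and with rebuilding derived data.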
Fault and Rebuild
• Tier 0 – raw data
• Tier 1 – cleaned data
• Tier 2 – transformations, lookups and denormalization
• Tier 3 – aggregations
33
Never Metadata I didn’t like
• Metadata is small data
• That changes frequently
• Solutions:
• Oozie
• Directory structure / Partitions
• Outside of HDFS
• HBase
• Cloudera Navigator
34
A few words about Real-Time ETL
• What does it even mean?
• Fast reporting?
• No delay from OLTP to DWH?
• Micro-batches make more sense:
• Aggregation
• Economy of scale
• Late data happens
• Near-line solutions
35
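A minimal sketch of micro-batching by event time. Events are grouped into fixed windows according to when they happened, not when they arrived, so a late event lands in its original window and that window can be reprocessed. The five-minute window and the (timestamp, payload) event shape are assumptions for illustration.

```python
from collections import defaultdict


def assign_batches(events, batch_seconds=300):
    """Group (epoch_seconds, payload) events by the start of their window."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        window_start = timestamp - (timestamp % batch_seconds)
        batches[window_start].append(payload)
    return dict(batches)
```

Given events at seconds 0, 299 and 300 followed by a late arrival stamped 10, the late event joins the first window rather than the one currently being filled.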
36
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle Connectors
• NoSQLs
37
Scaling
• Sqoop works better for some DBs
• Fuse-DFS and DB tools give more control
• Load in parallel
• Directly to FS
• Partitions
• Do all formatting on Hadoop
• Do you REALLY need to load that?
38
How not to Load
• Most Hadoop customers don’t load data in bulk
• History can stay in Hadoop
• Load only aggregated data
• Or computation results – recommendations, reports.
• Most queries can run in Hadoop
• BI tools often run in Hadoop
39
40
Workflow Management
Tools
• Oozie
• Azkaban
• Pentaho Kettle
• Talend
• Informatica
41
(Bracket labels on the tool list: Native Hadoop; Kinda Open Source)
Scaling Challenges
• Keeping track of:
• Code Components
• Metadata
• Integrations and Adapters
• Reports, results, artifacts
• Scheduling and Orchestration
• Cohesive System View
• Life Cycle
• Instrumentation, Measurement and Monitoring
42
My Toolbox
• Hue + Oozie:
• Scheduling + Orchestration
• Cohesive system view
• Process repository
• Some metadata
• Some instrumentation
• Cloudera Manager for monitoring
• … and way too many home grown scripts
43
Hue + Oozie
44
45
— Josh Wills
A lot is still missing
• Metadata solution
• Instrumentation and monitoring solution
• Lifecycle management
• Lineage tracking
Contributions are welcome
46
47


Editor's Notes

  1. Start with a portion of your data for fast iterations. Prototype with Impala / streaming. Start high level; tune as you go.
  2. Pregnancy takes 9 months, no matter how many women are assigned to it. But elephants are pregnant for two years!
  3. 50% of Hadoop systems end up managed by the data warehouse team, and they want to do things as they always did: Kimball methodologies, time dimensions, etc. Not always a good idea, not always scalable, not always possible.