The ETL Bottleneck in Big Data Analytics 
The new wave of big data is creating new opportunities and new challenges for businesses across every industry. One of the most urgent data-integration challenges facing IT engineers is incorporating social media and other unstructured data into a traditional BI environment. Apache Hadoop provides a cost-effective, horizontally scalable platform for absorbing big data and preparing it for analysis. Using Hadoop instead of a traditional ETL process can dramatically reduce time to analysis. Running a Hadoop cluster efficiently means selecting an optimal combination of servers, storage systems, networking devices, and software.
A typical ETL process extracts data from multiple sources, then cleanses, formats, and loads it into a data warehouse for analysis. When the source data sets are large, fast-growing, and unstructured, traditional ETL can become the bottleneck: it is too complex, too expensive, and too time-consuming to develop, operate, and execute.
 
Figure 1: The traditional ETL process.
 
Figure 2: ETL offload with Hadoop.
Apache Hadoop for Big Data 
Hadoop is an open-source, Java-based framework that supports storing and processing large data sets in a distributed computing environment. It runs on a cluster of commodity machines. Hadoop lets you store petabytes of data reliably on a large number of servers while increasing performance cost-effectively, simply by adding inexpensive nodes to the cluster. The reason for Apache Hadoop's scalability is its distributed processing framework, known as MapReduce. MapReduce is a method for processing large volumes of data in parallel in which the developer writes only two functions: the mapper and the reducer. In the map phase, MapReduce takes the input data and assigns each data element to a mapper. In the reduce phase, the reducer combines the partial, intermediate outputs from all the mappers and produces the final result. MapReduce is an important advance because it lets developers use parallel programming constructs without having to know the complex details of intra-cluster communication, task monitoring, and failure handling. The system splits the input data set into chunks, and each chunk is assigned to a map task that processes the data in parallel. The map function reads its input as (key, value) pairs and produces a transformed set of (key, value) pairs as output. The outputs of the map tasks are then shuffled and sorted, and the intermediate (key, value) pairs are sent to the reduce tasks, which group the outputs into the final results. To run a MapReduce job, the JobTracker and TaskTracker mechanisms schedule, monitor, and restart any tasks that fail.
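For concreteness, the canonical word-count example below shows the two functions a developer writes. This is a minimal sketch against the standard Hadoop Java API (the org.apache.hadoop.mapreduce classes); the driver class that configures and submits the job is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in this task's input split.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);   // intermediate (key, value) pair
                }
            }
        }

        // Reduce phase: the framework has already shuffled and sorted the
        // intermediate pairs, so each call sees one word with all its counts.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                ctx.write(word, new IntWritable(sum));  // final (word, total) pair
            }
        }
    }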
The Hadoop framework also includes the Hadoop Distributed File System (HDFS), a file system specially designed for streaming access patterns and fault tolerance. HDFS stores large amounts of data by dividing it into blocks (usually 64 or 128 MB) and replicating the blocks across the machines in the cluster; by default, three replicas of each block are maintained. A single NameNode manages the file system namespace, while capacity and performance can be increased by adding DataNodes.
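As a short sketch, a client can observe these properties through the HDFS Java API; the file path below is hypothetical, and the values reported come from whatever cluster configuration the Configuration object picks up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            FileStatus status = fs.getFileStatus(new Path("/data/raw/events.log"));
            System.out.println("Block size (bytes): " + status.getBlockSize());
            System.out.println("Replication factor: " + status.getReplication());
        }
    }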
 
ETL, ELT, and ETLT with Apache Hadoop  
ETL tools migrate data from one place to another by performing three functions: 
• Extract data from sources such as ERP or CRM applications.
In the extract step, data is collected from several source systems and in multiple file formats, such as comma-delimited flat files (CSV) and XML files. Data may also need to be pulled from legacy systems that store it in formats very few people still understand and no one else uses anymore.
• Transform that data into a format that matches other data in the warehouse.
The transformation step involves many data manipulations, such as moving, splitting, translating, merging, sorting, and pivoting.
• Load the data into the data warehouse for analysis.
This step can be performed in batches or row by row, in real time.
These steps sound simple, but together they can take days to complete.
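As a toy illustration of the three functions on a single machine (the file names and the three-field record layout are invented for the example), the following Java sketch extracts rows from a CSV source, transforms them, and loads the result into a target file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class TinyEtl {
        public static void main(String[] args) throws IOException {
            List<String> loaded = new ArrayList<>();
            // Extract: read every row of the (hypothetical) source file.
            for (String row : Files.readAllLines(Paths.get("source.csv"))) {
                String[] fields = row.split(",");
                if (fields.length != 3) continue;      // cleanse: drop malformed rows
                // Transform: normalize the name field to match the warehouse format.
                String name = fields[1].trim().toUpperCase();
                loaded.add(fields[0] + "," + name + "," + fields[2].trim());
            }
            // Load: write the transformed rows to the warehouse staging file.
            Files.write(Paths.get("warehouse_load.csv"), loaded);
        }
    }

A real ETL pipeline wraps connectors, scheduling, and error handling around these same three steps, which is where the complexity and the long runtimes come from.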
 
The Power of Hadoop with ETL
Hadoop brings at least two major advantages to traditional ETL:  
• Ingest huge amounts of data without having to specify a schema on write.
A prime property of Hadoop is “no schema on write”: you do not have to pre-define a data schema before loading data into HDFS. This holds true for structured data (such as point-of-sale transactions, call detail records, ledger transactions, and even call center transactions), for unstructured data (such as user comments, doctors' notes, insurance claim descriptions, and web logs), and for social media data (from sites such as Facebook, LinkedIn, and Twitter). Whether your input data has explicit or implicit structure, you can load it into HDFS quickly, ready for downstream analytic processing; a minimal ingestion sketch appears after this list.
• Offload the transformation of input data through parallel processing at scale.
Once the data is loaded into Hadoop, you can perform traditional ETL tasks such as cleansing, aligning, normalizing, and combining data by exploiting the massive scalability of MapReduce; the second sketch below shows one such step. Hadoop also lets you avoid the transformation bottleneck of typical ETLT by offloading the ingestion, transformation, and integration of unstructured data from the data warehouse. Because Hadoop lets you use more data types than ever before, it enriches your data warehouse in ways that would otherwise not be feasible, and its scalable performance can speed up ETLT jobs appreciably. In addition, since data stored in Hadoop persists for a much longer period, you can provide more granular detail through the EDW for high-fidelity analysis.
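The first sketch shows schema-on-read ingestion through the HDFS Java API: raw files are copied into HDFS as-is, and structure is imposed only later by whatever job reads them. The local and HDFS paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RawIngest {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // No schema is declared: web logs, CSV exports, and social media
            // dumps all land in HDFS the same way, byte for byte.
            fs.copyFromLocalFile(new Path("/var/log/web/access.log"),
                                 new Path("/data/raw/weblogs/access.log"));
        }
    }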
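The second sketch is a hedged example of an offloaded transformation: a map-only MapReduce job (run with zero reducers) that cleanses and normalizes records in parallel across the cluster before they are loaded into the EDW. The three-field record layout is an assumption made for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CleanseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length != 3 || fields[0].trim().isEmpty()) {
                return;                                  // cleanse: skip bad records
            }
            String normalized = fields[0].trim() + ","
                    + fields[1].trim().toLowerCase() + ","  // normalize case
                    + fields[2].trim();
            // With job.setNumReduceTasks(0), these records are written
            // straight back to HDFS, ready for loading into the warehouse.
            ctx.write(NullWritable.get(), new Text(normalized));
        }
    }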
