The ETL Bottleneck in Big Data Analytics 
The new wave of big data is creating new opportunities and new challenges for businesses across every industry. One of the most urgent data-integration challenges facing IT engineers is incorporating social media and other unstructured data into a traditional BI environment. Apache Hadoop provides a cost-effective, horizontally scalable platform for absorbing big data and preparing it for analysis. Using Hadoop instead of a traditional ETL process can dramatically reduce time to analysis. Running a Hadoop cluster efficiently means selecting an optimal combination of servers, storage systems, networking devices, and software.
A typical ETL process extracts data from multiple sources, then cleanses, formats, and loads it into a data warehouse for analysis. When the source data sets are large, fast-growing, and unstructured, traditional ETL can become the bottleneck: it is too complex, too expensive, and too time-consuming to develop, operate, and execute.
 
Figure 1: The traditional ETL process.
 
Figure 2: ETL offload with Hadoop.
Apache Hadoop for Big Data 
Hadoop is an open-source, Java-based framework that supports storing and processing large data sets in a distributed computing environment. It runs on a cluster of commodity machines. Hadoop lets you store petabytes of data reliably on a large number of servers while increasing performance cost-effectively, simply by adding inexpensive nodes to the cluster. The reason for Apache Hadoop's scalability is its distributed processing framework, known as MapReduce. MapReduce is a method for processing large volumes of data in parallel in which the developer writes only two functions: the mapper and the reducer. In the map phase, MapReduce takes the input data and assigns each data element to a mapper. In the reduce phase, the reducer combines the partial, intermediate outputs from all the mappers and produces the final result. MapReduce is an important advance because it lets developers use parallel programming constructs without having to know the complex details of intra-cluster communication, task monitoring, and failure handling. The system splits the input data set into chunks, and each chunk is assigned to a map task that processes the data in parallel. The map function reads its input as (key, value) pairs and produces a transformed set of (key, value) pairs as output. The outputs of the map tasks are then shuffled and sorted, and the intermediate (key, value) pairs are sent to the reduce tasks, which group the outputs into the final results. To run a MapReduce job, the JobTracker and TaskTracker mechanisms schedule, monitor, and restart any tasks that fail.
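For concreteness, the canonical word-count example below shows the two functions a developer writes. This is a minimal sketch against the standard Hadoop Java API (the org.apache.hadoop.mapreduce classes); the driver class that configures and submits the job is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in this task's input split.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);   // intermediate (key, value) pair
                }
            }
        }

        // Reduce phase: the framework has already shuffled and sorted the
        // intermediate pairs, so each call sees one word with all its counts.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                ctx.write(word, new IntWritable(sum));  // final (word, total) pair
            }
        }
    }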
The Hadoop framework also includes the Hadoop Distributed File System (HDFS), a file system specially designed for streaming access patterns and fault tolerance. HDFS stores large amounts of data by dividing it into blocks (usually 64 or 128 MB) and replicating the blocks across the machines in the cluster; by default, three replicas of each block are maintained. A single NameNode manages the file system namespace, while capacity and performance can be increased by adding DataNodes.
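As a short sketch, a client can observe these properties through the HDFS Java API; the file path below is hypothetical, and the values reported come from whatever cluster configuration the Configuration object picks up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            FileStatus status = fs.getFileStatus(new Path("/data/raw/events.log"));
            System.out.println("Block size (bytes): " + status.getBlockSize());
            System.out.println("Replication factor: " + status.getReplication());
        }
    }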
 
ETL, ELT, and ETLT with Apache Hadoop  
ETL tools migrate data from one place to another by performing three functions: 
• Extract data from sources such as ERP or CRM applications.
In the extract step, data is collected from several source systems and in multiple file formats, such as comma-delimited flat files (CSV) and XML files. Data may also need to be pulled from legacy systems that store it in formats very few people still understand and no one else uses anymore.
• Transform that data into a format that matches other data in the warehouse.
The transformation step involves many data manipulations, such as moving, splitting, translating, merging, sorting, and pivoting.
• Load the data into the data warehouse for analysis.
This step can be performed in batches or row by row, in real time.
These steps sound simple, but together they can take days to complete.
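As a toy illustration of the three functions on a single machine (the file names and the three-field record layout are invented for the example), the following Java sketch extracts rows from a CSV source, transforms them, and loads the result into a target file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class TinyEtl {
        public static void main(String[] args) throws IOException {
            List<String> loaded = new ArrayList<>();
            // Extract: read every row of the (hypothetical) source file.
            for (String row : Files.readAllLines(Paths.get("source.csv"))) {
                String[] fields = row.split(",");
                if (fields.length != 3) continue;      // cleanse: drop malformed rows
                // Transform: normalize the name field to match the warehouse format.
                String name = fields[1].trim().toUpperCase();
                loaded.add(fields[0] + "," + name + "," + fields[2].trim());
            }
            // Load: write the transformed rows to the warehouse staging file.
            Files.write(Paths.get("warehouse_load.csv"), loaded);
        }
    }

A real ETL pipeline wraps connectors, scheduling, and error handling around these same three steps, which is where the complexity and the long runtimes come from.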
 
The Power of Hadoop with ETL
Hadoop brings at least two major advantages to traditional ETL:  
• Ingest huge amounts of data without having to specify a schema on write.
A prime property of Hadoop is “no schema on write”: you do not have to pre-define a data schema before loading data into HDFS. This holds true for structured data (such as point-of-sale transactions, call detail records, ledger transactions, and even call center transactions), for unstructured data (such as user comments, doctors' notes, insurance claim descriptions, and web logs), and for social media data (from sites such as Facebook, LinkedIn, and Twitter). Whether your input data has explicit or implicit structure, you can load it into HDFS quickly, ready for downstream analytic processing; a minimal ingestion sketch appears after this list.
• Offload the transformation of input data through parallel processing at scale.
Once the data is loaded into Hadoop, you can perform traditional ETL tasks such as cleansing, aligning, normalizing, and combining data by exploiting the massive scalability of MapReduce; the second sketch below shows one such step. Hadoop also lets you avoid the transformation bottleneck of typical ETLT by offloading the ingestion, transformation, and integration of unstructured data from the data warehouse. Because Hadoop lets you use more data types than ever before, it enriches your data warehouse in ways that would otherwise not be feasible, and its scalable performance can speed up ETLT jobs appreciably. In addition, since data stored in Hadoop persists for a much longer period, you can provide more granular detail through the EDW for high-fidelity analysis.
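The first sketch shows schema-on-read ingestion through the HDFS Java API: raw files are copied into HDFS as-is, and structure is imposed only later by whatever job reads them. The local and HDFS paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RawIngest {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // No schema is declared: web logs, CSV exports, and social media
            // dumps all land in HDFS the same way, byte for byte.
            fs.copyFromLocalFile(new Path("/var/log/web/access.log"),
                                 new Path("/data/raw/weblogs/access.log"));
        }
    }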
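The second sketch is a hedged example of an offloaded transformation: a map-only MapReduce job (run with zero reducers) that cleanses and normalizes records in parallel across the cluster before they are loaded into the EDW. The three-field record layout is an assumption made for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CleanseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length != 3 || fields[0].trim().isEmpty()) {
                return;                                  // cleanse: skip bad records
            }
            String normalized = fields[0].trim() + ","
                    + fields[1].trim().toLowerCase() + ","  // normalize case
                    + fields[2].trim();
            // With job.setNumReduceTasks(0), these records are written
            // straight back to HDFS, ready for loading into the warehouse.
            ctx.write(NullWritable.get(), new Text(normalized));
        }
    }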
