This presentation compares the performance of ETL pipelines built with Spark and Hive on Azure. We examine the features, strengths, and weaknesses of each tool and provide recommendations on which one to use for specific use cases.
2. Outline
Introduction
Objective
Problem statement
Impact and Benefits
Data overview and Description
Methodologies
I. Azure VM Configuration
II. Hadoop Configuration and Architecture
III. Hive Configuration and Architecture
IV. Spark Configuration and Architecture
Test Method and Analysis
Conclusion
3. Introduction
ETL (Extract, Transform, Load) is a crucial process in data warehousing and business intelligence. It involves extracting data from various sources, transforming it into a format suitable for analysis, and loading it into a target system. In recent years, Spark and Hive have emerged as popular ETL tools on Azure.
4. Objectives
General Objectives
• Evaluate the performance of Spark and Hive in the context of an ETL pipeline for large-scale data processing in the Azure environment.
• Compare the efficiency and effectiveness of Spark and Hive in handling ETL tasks, such as data ingestion, transformation, and analysis, across different data-size scenarios.
• Analyze the capabilities of Spark and Hive in terms of processing speed and scalability, to identify their strengths and weaknesses for ETL pipelines in an Azure-based data science study.
Specific Objectives
• Demonstrate the use of Spark and Hive for processing different data sizes in Azure, showcasing their strengths and limitations at each size.
• Evaluate the scalability and efficiency of Apache Spark and Apache Hive for processing large-scale data in Azure, considering factors such as resource limits, data partitioning, and scaling procedures.
• Compare the performance of Spark and Hive for ETL pipelines in an Azure environment, considering factors such as data segmentation, resource allocation, and data processing techniques.
5. Problem Statement
Problem Statement: The purpose of this study is to compare the performance of ETL pipelines using Spark and Hive on an Azure VM. Specifically, we aim to identify which technology performs better in terms of speed, scalability, and ease of use. This comparison matters because it can help organizations make informed decisions about which technology to use for their specific data processing needs.
Impact: The impact of this study is significant because it can help organizations save time and resources by choosing the right technology for their data processing needs. By identifying the strengths and weaknesses of each technology, organizations can make informed decisions about which to adopt.
Benefits: The benefits of this study include improved data processing efficiency, faster time-to-insight, and reduced costs associated with data processing and storage.
6. Data Overview and Description
01 Dataset created using a Python script and saved as CSV files.
02 Five different dataset sizes: 1MB, 10MB, 100MB, 500MB, and 1GB.
03 NumPy and Pandas libraries used for data generation and manipulation.
04 Single-column data seeded with error values (null values, strings, decimal values, negative values).
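As a sketch of how such a dataset could be produced, the following is a hypothetical reconstruction; the study's actual generator script was not shown, and the column name, row count, and exact mix of error values here are assumptions.

```python
# Hypothetical reconstruction of the dataset generator described above; the
# actual script was not shown. Column name "value" and the error mix are
# assumptions. Roughly 1% of rows are overwritten with error values.
import numpy as np
import pandas as pd

def make_dataset(n_rows, error_rate=0.01, seed=42):
    rng = np.random.default_rng(seed)
    values = rng.integers(0, 1_000_000, size=n_rows).astype(object)

    # Pick ~1% of the rows and replace them with error values:
    # nulls, strings, decimal values, and negative numbers.
    n_err = int(n_rows * error_rate)
    err_idx = rng.choice(n_rows, size=n_err, replace=False)
    errors = [None, "invalid", 3.14159, -42]
    for i, idx in enumerate(err_idx):
        values[idx] = errors[i % len(errors)]

    return pd.DataFrame({"value": values})

df = make_dataset(10_000)
df.to_csv("data_sample.csv", index=False)
```

Scaling `n_rows` up (and writing with `chunksize` or in batches) would produce the larger 100MB-1GB files.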
8. Total row count and error-data percentage for each dataset size:
• 1MB: 172,089 rows, 1% error data (1,736 rows)
• 10MB: 1,720,897 rows, 1% error data (17,380 rows)
• 100MB: 17,208,972 rows, 1% error data (173,828 rows)
• 500MB: 86,044,860 rows, 1% error data (869,140 rows)
• 1GB: 167,100,120 rows, 1% error data (1,687,880 rows)
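As a quick sanity check on the figures above, each reported error-row count is indeed about 1% of the corresponding total:

```python
# Sanity check: reported error rows divided by total rows, per dataset size.
counts = {
    "1MB":   (172_089,     1_736),
    "10MB":  (1_720_897,   17_380),
    "100MB": (17_208_972,  173_828),
    "500MB": (86_044_860,  869_140),
    "1GB":   (167_100_120, 1_687_880),
}
for size, (rows, errors) in counts.items():
    print(f"{size}: {100 * errors / rows:.2f}% error rows")  # ~1.01% each
```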
9. Continued
• Data created by the Python script is copied to the Azure virtual machine (VM) into the "data" folder.
• The data transferred to the Azure VM is then loaded into HDFS for further processing.
• Hadoop provides command-line tools, such as the HDFS command-line interface (CLI) and the WebHDFS API, for transferring data to HDFS.
• Data is stored in the "raw" folder in HDFS, a common practice for unprocessed or raw data.
• The data is then processed and transformed using appropriate techniques, such as ETL (Extract, Transform, Load) processes, data cleansing, and data enrichment.
• Processed data is written to the "published" folder in HDFS, making it accessible for further analysis or download for various use cases.
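The transfer step can be sketched as follows. This is an illustration, not the study's actual commands: the slide names only the "data" and "raw" folders, so the local path and file name below are assumptions.

```python
# Build the HDFS CLI commands that create the target folder and upload a file.
# Illustrative only: the local path is an assumption; "/raw" comes from the slide.
def hdfs_put_commands(local_path, hdfs_dir):
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],           # ensure folder exists
        ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], # upload (overwrite)
    ]

cmds = hdfs_put_commands("/home/azureuser/data/data_1mb.csv", "/raw")
for cmd in cmds:
    print(" ".join(cmd))
    # On the VM each command would be run, e.g. subprocess.run(cmd, check=True)
```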
10. Methodologies
01 We set up an Azure virtual machine and configured it with Java 8, Hadoop, Spark, and Hive to enable efficient data analysis for our research study.
02 We then created PySpark and Hive Query Language (HQL) scripts to perform ETL operations on raw data files of various sizes (1MB, 10MB, 100MB, 500MB, and 1GB) stored in HDFS, applying transformations based on our research requirements.
03 The Spark and Hive scripts were executed on the Azure VM established for our study, and the job execution time was recorded for both Hive and Spark at each data size.
04 After thorough theoretical and practical research, we arrived at a recommendation for the optimal choice between Hive and Spark based on their performance on the Azure VM.
11. Azure VM
• Azure VMs offer flexible, on-demand scaling of virtualized computing resources, with a wide range of pre-configured VM sizes or custom sizes.
• They support multiple operating systems, including Linux, Windows, and open-source solutions, enabling OS selection based on specific needs.
• Management and customization are straightforward through the Azure portal, CLI, or REST API, with built-in security features for data protection, including NSGs, VPNs, and ACLs.
• VMs can be configured to run data analysis applications such as Hadoop, Spark, and Hive.
• We configured an Azure VM with appropriate specifications for testing, including a quad-core processor, 16GB of RAM, and a 500GB hard drive.
• We set up the Azure VM through the Azure portal, selecting Ubuntu Linux as the operating system, and configured it with 8GB of memory and 1 virtual CPU for running Hadoop, Spark, and Hive.
• We then enabled communication with the Hadoop, Spark, and Hive services by adding inbound port rules on the VM for HDFS (port 50070) and YARN (port 8088).
12. HADOOP AND ITS ARCHITECTURE
• Hadoop is a distributed computing framework, specially designed for batch processing.
• The Hadoop architecture has three major components: HDFS (storage), MapReduce (processing), and YARN (resource management).
• HDFS (Hadoop Distributed File System) provides distributed file storage for large datasets across multiple nodes in a Hadoop cluster.
• MapReduce is a programming model used in Hadoop for processing large datasets in parallel across distributed clusters, with two main phases, Map and Reduce, to process and analyze data efficiently.
• YARN (Yet Another Resource Negotiator) is the resource management framework in Hadoop that allocates computing resources and schedules tasks across the cluster for efficient data processing.
• To install Hadoop on the Azure VM, we updated the system, installed Java 8 as a prerequisite for Hadoop, Spark, and Hive, then installed Hadoop 2.7.3 and set up HDFS and the YARN cluster.
13. Hive and Its Utilization
• Hive is built on top of Hadoop and provides a SQL-like interface to data in the Hadoop Distributed File System (HDFS). It is designed to handle large datasets and is well suited to data warehousing and data analysis tasks.
• The Hive architecture consists of several components, including the user interface, metastore, driver, compiler, execution engine, and HDFS.
• Hive played a crucial role in our data analysis efforts and helped us derive meaningful insights that contributed to the success of our project.
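To make the Hive side concrete, a minimal HQL sketch of the pipeline might look like the following. This is a hypothetical illustration, not the study's actual script: the table name, column name, and delimiters are assumptions; only the "/raw" and "/published" folders come from the slides.

```sql
-- Hypothetical sketch: expose the raw CSVs as an external table, then clean
-- and publish them. Table and column names are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS raw_values (value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/raw';

-- CAST of a non-numeric string yields NULL in Hive, so the WHERE clause drops
-- both null rows and string rows; negatives are filtered, decimals rounded.
INSERT OVERWRITE DIRECTORY '/published/hive'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT ROUND(CAST(value AS DOUBLE), 2)
FROM raw_values
WHERE CAST(value AS DOUBLE) IS NOT NULL
  AND CAST(value AS DOUBLE) >= 0;
```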
16. Spark and Its Architecture
• Spark is a distributed computing framework that provides fast and efficient processing of large-scale data.
• It is designed to handle both batch and real-time processing workloads and supports a wide range of programming languages, including Java, Scala, and Python.
• One of Spark's key advantages is its in-memory processing capability, which allows it to cache data in memory and avoid disk I/O operations. This results in significantly faster processing times compared to traditional ETL tools like Hive.
17. Continued
• We utilized Spark within the Hadoop ecosystem, leveraging its capabilities to process raw data files of different sizes (1MB, 10MB, 100MB, 500MB, and 1GB) in our ETL pipeline.
• We created the Spark script on the Azure VM, which is used for data processing and transformations.
• We then ran the script multiple times on the Azure VM and recorded the execution time of each run in YARN.
18. Spark Script
• Reading data: used PySpark's csv() function to read the CSV files with the specified options.
• Data cleaning and transformation: performed using functions such as na.drop(), fillna(), round(), cast(), filter(), select(), groupBy(), and agg().
• Data analysis: analyzed the processed data using functions such as sort(), cache(), and withColumn().
• Writing data: used write() to save the processed data to a new CSV file with the specified options.
• Deployment: submitted the job with the spark-submit command using the --deploy-mode client option for distributed processing with Spark's capabilities.
20. Screenshots of the YARN cluster while a job is in the running state, for Spark and Hive
21. Screenshots of the HDFS published folder showing the processed file, for Spark and Hive
22. Average job execution time per file size for Hive and Spark
• Hive vs. Spark execution times: Hive takes longer than Spark at every file size, and the gap widens as file size increases.
• Hive's file-size impact: Hive's execution time falls with smaller files, though not as sharply as Spark's. As files get larger, Hive's execution time rises much faster, which may indicate scalability challenges.
• Spark's file-size impact: Like Hive, Spark's execution time falls with smaller files, and it grows relatively steadily with larger ones; the increase is far less pronounced than Hive's, suggesting better scalability.
• Spark's consistent performance: Spark consistently shows shorter execution times than Hive at all file sizes, indicating better performance on big data workloads.
• Conclusion: Based on the data, Spark performs better and scales better than Hive for large datasets and big data workloads, showing consistently shorter execution times and more stable growth in execution time as files get larger.
23. Research-Based Differences between Hive and Spark
User-friendliness
  Hive: uses a SQL-like language (HQL) that is familiar to SQL users.
  Spark: uses Scala, Java, Python, and R, and provides a rich API.
Complexity
  Hive: based on Apache Hadoop, with a complex architecture and multiple components.
  Spark: a fast, distributed data processing engine with a simpler architecture and fewer components.
Performance
  Hive: optimized for batch processing; may be slower for real-time or interactive queries.
  Spark: optimized for in-memory processing, providing faster processing speeds and lower latencies for certain workloads.
Data Storage
  Hive: uses the Hadoop Distributed File System (HDFS), a distributed file system, for data storage.
  Spark: has no storage layer of its own; it reads from and writes to external storage such as HDFS, cloud object stores, or local file systems.
Fault Tolerance
  Hive: relies on Hadoop's built-in fault tolerance, such as HDFS replication and MapReduce task recovery.
  Spark: provides built-in fault tolerance through RDD lineage information and Spark Streaming's micro-batch processing to recover lost data or failed tasks.
24. Conclusion
Based on the comprehensive analysis of job execution times for Hive and Spark on a YARN
cluster with HDFS storage, it can be concluded that Spark outperforms Hive in terms of job
execution times for all file sizes tested.
Across multiple attempts, Spark consistently showed shorter job execution times compared to
Hive, indicating better performance in processing big data.
The results of this study suggest that Spark is a more efficient framework than Hive for processing big data, especially at larger file sizes, when using a YARN cluster with HDFS storage.
Organizations processing big data on a YARN cluster with HDFS storage may benefit from considering Spark as a faster, more efficient alternative to Hive.