COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE VM
MEGHA SHAH
Outline
 Introduction
 Objectives
 Problem Statement
 Impact and Benefits
 Data Overview and Description
 Methodologies
I. Azure VM Configuration
II. Hadoop Configuration and Architecture
III. Hive Configuration and Architecture
IV. Spark Configuration and Architecture
 Test Method and Analysis
 Conclusion
Introduction
ETL (Extract, Transform, Load) is a crucial process in data warehousing and business intelligence. It involves extracting data from various sources, transforming it into a format suitable for analysis, and loading it into a target system. In recent years, Spark and Hive have emerged as popular ETL tools on Azure.
This presentation compares the performance of an ETL pipeline built with Spark and with Hive on an Azure VM. We examine the features, strengths, and weaknesses of each tool, and provide recommendations on which one to use for specific use cases.
Objectives
General Objectives
• Evaluate the performance of Spark and Hive in the context of an ETL pipeline for large-scale data processing in the Azure environment.
• Compare the efficiency and effectiveness of Spark and Hive in handling ETL tasks, such as data ingestion, data transformation, and data analysis, across different data sizes.
• Analyze the capabilities of Spark and Hive in terms of processing speed and scalability, to identify their strengths and weaknesses for ETL pipelines in an Azure-based data science study.
Specific Objectives
• Demonstrate the use of Spark and Hive for processing different data sizes in Azure, showcasing their strengths and limitations at each size.
• Evaluate the scalability and efficiency of Apache Spark and Apache Hive for processing large-scale data in Azure, considering factors such as resource limits, data partitioning, and scaling procedures.
• Compare the performance of Spark and Hive for ETL pipelines in an Azure environment, considering factors such as data segmentation, resource allocation, and data processing techniques.
Problem Statement
 The purpose of this study is to compare the performance of ETL pipelines using Spark and Hive under Azure VM. Specifically, we aim to identify which technology performs better in terms of speed, scalability, and ease of use.
 This comparison is important because it can help organizations make informed decisions about which technology to use for their specific data processing needs.
Impact
 The impact of this study is significant because it can help organizations save time and resources by choosing the right technology for their data processing needs. By identifying the strengths and weaknesses of each technology, organizations can make informed decisions about which technology to use.
Benefits
 The benefits of this study include improved data processing efficiency, faster time-to-insight, and reduced costs associated with data processing and storage.
Data Overview and Description
01 Dataset created using a Python script and saved as CSV files
02 Five different dataset sizes: 1MB, 10MB, 100MB, 500MB, and 1GB
03 NumPy and Pandas libraries used for data generation and manipulation
04 Single-column data with error values (null values, strings, decimal values, negative values)
Fig 1: Data Creation Script (an illustrative sketch of such a script follows below)
Fig 2: Screenshot of the created data file
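The data creation script is shown only as a screenshot (Fig 1), so the following is a minimal sketch of how such a dataset could be generated with NumPy and Pandas. The column name `value`, the mix of error values, the output path, and the 1% error ratio are illustrative assumptions, not the study's actual script.

```python
import numpy as np
import pandas as pd

def generate_dataset(n_rows, error_ratio=0.01, path="data/dataset_1mb.csv", seed=42):
    """Generate a single-column CSV with roughly error_ratio bad rows.
    Bad rows mix nulls, strings, decimals, and negative values
    (illustrative sketch; column name and error mix are assumptions)."""
    rng = np.random.default_rng(seed)
    values = rng.integers(0, 1_000_000, size=n_rows).astype(object)

    # Pick ~1% of row positions and overwrite them with error values.
    n_errors = int(n_rows * error_ratio)
    error_positions = rng.choice(n_rows, size=n_errors, replace=False)
    error_pool = [None, "invalid", 3.14159, -1]
    for i, pos in enumerate(error_positions):
        values[pos] = error_pool[i % len(error_pool)]

    df = pd.DataFrame({"value": values})
    df.to_csv(path, index=False)
    return df

if __name__ == "__main__":
    # Roughly a 1 MB file; larger sizes are produced by scaling n_rows.
    generate_dataset(n_rows=172_089, path="data/dataset_1mb.csv")
```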
Total row count and error-data percentage for each dataset size:
• 1MB: 172,089 rows with 1% error data (1,736 rows)
• 10MB: 1,720,897 rows with 1% error data (17,380 rows)
• 100MB: 17,208,972 rows with 1% error data (173,828 rows)
• 500MB: 86,044,860 rows with 1% error data (869,140 rows)
• 1GB: 167,100,120 rows with 1% error data (1,687,880 rows)
Continued
• Data created with the Python script is copied to the Azure virtual machine (VM) into the "data" folder.
• The data transferred to the Azure VM is then loaded into HDFS for further processing.
• Hadoop provides command-line tools, such as the HDFS command-line interface (CLI) or the WebHDFS API, for transferring data to HDFS (an illustrative sketch follows below).
• Data is stored in the "raw" folder in HDFS, a common practice for keeping unprocessed or raw data.
• The data is then processed and transformed using appropriate data processing techniques, such as ETL (Extract, Transform, Load) processes, data cleansing, and data enrichment.
• Processed data is then written to the "published" folder in HDFS, making it accessible for further analysis or download for various use cases.
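The slides describe the transfer only at a high level. One way to drive the standard `hdfs dfs` commands from Python is sketched below; the local landing folder and HDFS paths are assumptions that follow the raw/published convention above, not the exact commands used in the study.

```python
import subprocess

# Illustrative paths following the slide's "data"/"raw"/"published" convention.
LOCAL_DATA_DIR = "/home/azureuser/data"   # assumed landing folder on the VM
HDFS_RAW_DIR = "/raw"
HDFS_PUBLISHED_DIR = "/published"

def run(cmd):
    """Run a command and fail loudly if it returns a non-zero exit code."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create the target folders in HDFS (no-op if they already exist).
run(["hdfs", "dfs", "-mkdir", "-p", HDFS_RAW_DIR])
run(["hdfs", "dfs", "-mkdir", "-p", HDFS_PUBLISHED_DIR])

# Copy a generated CSV from the VM's local data folder into /raw.
run(["hdfs", "dfs", "-put", "-f", f"{LOCAL_DATA_DIR}/dataset_1mb.csv", HDFS_RAW_DIR])

# Verify the upload.
run(["hdfs", "dfs", "-ls", HDFS_RAW_DIR])
```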
Methodologies
01. We provisioned an Azure virtual machine from our local machine and configured it with Java 8, Hadoop, Spark, and Hive to enable efficient data analysis for our research study.
02. We then created PySpark and Hive Query Language (HQL) scripts to perform ETL operations on raw data files of various sizes (1MB, 10MB, 100MB, 500MB, and 1GB) stored in HDFS, applying transformations based on our research requirements.
03. The Spark and Hive scripts were executed on the Azure virtual machine (VM) established for our research study, and job execution time was recorded for both Hive and Spark at each data size.
04. After thorough theoretical and practical research, we arrived at a recommendation for the optimal choice between Hive and Spark based on their performance on the Azure Virtual Machine (VM).
Azure VM
• Azure VMs offer flexible, on-demand scaling of virtualized computing resources, with a wide range of pre-configured or custom VM sizes.
• They support multiple operating systems, including Linux, Windows, and open-source options, enabling OS selection based on specific needs.
• VMs are easy to manage and customize through the Azure portal, CLI, or REST API, with built-in security features for data protection, including NSGs, VPNs, and ACLs.
• They can be configured to run data analysis applications such as Hadoop, Spark, and Hive.
• We configured an Azure Virtual Machine (VM) with appropriate specifications for testing, including a quad-core processor, 16GB of RAM, and a 500GB hard drive.
• We set up the Azure VM through the Azure portal, selecting Ubuntu Linux as the operating system, and configured it with 8GB of memory and 1 virtual CPU for running Hadoop, Spark, and Hive.
• We then enabled communication with the Hadoop, Spark, and Hive services by adding inbound port rules on the VM for HDFS on port 50070 and YARN on port 8088 (an illustrative provisioning sketch follows below).
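For illustration only, a VM and the inbound port rules described above could also be provisioned by driving the Azure CLI from Python, as sketched below. The resource group, VM name, image alias, and size are placeholders; the study's VM was created through the Azure portal, not with these commands.

```python
import subprocess

def az(*args):
    """Invoke the Azure CLI; assumes `az login` has already been run."""
    cmd = ["az", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create an Ubuntu VM (resource group, name, image, and size are placeholders).
az("vm", "create",
   "--resource-group", "etl-benchmark-rg",
   "--name", "etl-benchmark-vm",
   "--image", "Ubuntu2204",
   "--size", "Standard_D2s_v3",
   "--admin-username", "azureuser",
   "--generate-ssh-keys")

# Open the inbound ports mentioned on the slide: HDFS NameNode UI and YARN RM UI.
for port, priority in [("50070", "1010"), ("8088", "1020")]:
    az("vm", "open-port",
       "--resource-group", "etl-benchmark-rg",
       "--name", "etl-benchmark-vm",
       "--port", port,
       "--priority", priority)
```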
HADOOP AND ITS ARCHITECTURE
• Hadoop is a distributed computational framework designed primarily for batch processing.
• The Hadoop architecture is divided into three major components: HDFS (storage), MapReduce (processing), and YARN (resource management).
• HDFS (Hadoop Distributed File System) provides distributed storage of large data sets across multiple nodes in a Hadoop cluster.
• MapReduce is a programming model used in Hadoop for processing large data sets in parallel across distributed clusters, with two main phases, Map and Reduce, to process and analyze data efficiently.
• YARN (Yet Another Resource Negotiator) is the resource management framework in Hadoop that allocates computing resources and schedules tasks across the cluster for efficient data processing.
• To install Hadoop on the Azure VM, we updated the system, installed Java 8 as a prerequisite for Hadoop, Spark, and Hive, then installed Hadoop 2.7.3 and set up the HDFS and YARN cluster (a start-up sketch follows below).
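As a rough illustration of the single-node start-up sequence, the HDFS and YARN daemons for Hadoop 2.7.3 can be formatted and launched as below. The install path and Java location are assumptions, not the exact layout used in the study.

```python
import os
import subprocess

# Assumed install locations; adjust to the actual VM layout.
HADOOP_HOME = "/opt/hadoop-2.7.3"
env = dict(os.environ,
           HADOOP_HOME=HADOOP_HOME,
           JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, env=env)

# One-time format of the NameNode metadata directory.
run([f"{HADOOP_HOME}/bin/hdfs", "namenode", "-format", "-nonInteractive"])

# Start HDFS (NameNode + DataNode) and YARN (ResourceManager + NodeManager).
# The start scripts assume passwordless SSH to localhost is configured.
run([f"{HADOOP_HOME}/sbin/start-dfs.sh"])
run([f"{HADOOP_HOME}/sbin/start-yarn.sh"])

# Quick health checks: list running Java daemons and report HDFS capacity.
run(["jps"])
run([f"{HADOOP_HOME}/bin/hdfs", "dfsadmin", "-report"])
```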
Hive & Its Utilization
• Hive is built on top of Hadoop and provides a SQL-like interface to data in the Hadoop Distributed File System (HDFS). It is designed to handle large datasets and is well suited to data warehousing and data analysis tasks.
• The Hive architecture consists of several components, including the User Interface, Metastore, Driver, Compiler, Execution Engine, and the Hadoop Distributed File System (HDFS).
• Hive played a crucial role in our data analysis efforts and helped us derive meaningful insights that contributed to the success of our project.
Hive architecture diagram
Hive script creation and execution in the virtual machine (an illustrative sketch follows below)
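The HQL script itself appears only as a screenshot in the deck. The following is a hedged sketch of the kind of Hive ETL it could contain, written to a file and launched through the Hive CLI so that YARN records the job. The table name, column name, cleaning rules, and output path are assumptions that mirror the raw/published folders described earlier.

```python
import subprocess

# Illustrative HQL - the real script is shown only as a screenshot.
HQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_values (value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/raw';

-- Keep only clean, non-negative integer rows and write them to /published.
INSERT OVERWRITE DIRECTORY '/published/hive'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT CAST(value AS BIGINT)
FROM raw_values
WHERE value IS NOT NULL
  AND value RLIKE '^[0-9]+$';
"""

with open("etl_pipeline.hql", "w") as f:
    f.write(HQL)

# Run the script through the Hive CLI on the VM; YARN tracks the resulting job.
subprocess.run(["hive", "-f", "etl_pipeline.hql"], check=True)
```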
Spark and Its Architecture
• Spark is a distributed computing framework that provides fast and efficient processing of large-scale data.
• It is designed to handle both batch and real-time processing workloads and supports a wide range of programming languages, including Java, Scala, and Python.
• One of the key advantages of Spark is its in-memory processing capability, which allows it to cache data in memory and avoid disk I/O operations. This results in significantly faster processing times compared to traditional ETL tools like Hive.
Continued
• We utilized Spark within the Hadoop ecosystem, leveraging its capabilities for processing raw data of different file sizes (1MB, 10MB, 100MB, 500MB, and 1GB) in our ETL pipeline.
• We created the Spark script on the Azure VM and used it for data processing and transformations.
• We then ran the script multiple times on the Azure VM and recorded each run's execution time in the YARN UI.
 Reading Data: Used PySpark's csv() function to read CSV files with specified options.
 Data Cleaning and Transformation: Performed data cleaning and transformation using functions like na.drop(), fillna(), round(), cast(), filter(), select(), groupBy(), and agg().
 Data Analysis: Analyzed processed data using functions like sort(), cache(), and withColumn().
 Writing Data: Used the write() function to write processed data to a new CSV file with specified options.
 Deployment: Used the spark-submit command with the --deploy-mode client option for distributed processing using Spark's capabilities (a hedged sketch of such a script follows below).
Spark Script
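The PySpark script likewise appears only as a screenshot. Below is a minimal sketch that strings together the functions listed above (csv(), na.drop(), filter(), cast(), round(), cache(), agg(), sort(), write()); the column name, HDFS paths, and exact transformation order are assumptions, not the study's actual script.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-benchmark").getOrCreate()

# Reading Data: load the raw single-column CSV from HDFS.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "false")
       .csv("hdfs:///raw/dataset_1mb.csv"))

# Data Cleaning and Transformation: drop nulls, keep numeric rows,
# round and cast to a numeric type, and discard negatives.
clean = (raw.na.drop()
            .filter(F.col("value").rlike("^[0-9]+(\\.[0-9]+)?$"))
            .withColumn("value", F.round(F.col("value").cast("double")).cast("long"))
            .filter(F.col("value") >= 0))

# Data Analysis: cache the cleaned data and compute a simple aggregate.
clean.cache()
summary = clean.agg(F.count(F.lit(1)).alias("rows"), F.avg("value").alias("avg_value"))
summary.show()

# Writing Data: publish the processed rows back to HDFS as CSV.
(clean.sort("value")
      .write.mode("overwrite")
      .option("header", "true")
      .csv("hdfs:///published/spark"))

spark.stop()
```

On the VM, such a script would then be submitted with something like `spark-submit --deploy-mode client etl_pipeline.py`, matching the deployment step listed above.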
Test Method and Analysis
Screenshots of YARN Cluster While Job
is in Running State for Spark and Hive
Screenshots of HDFS Published folder for the
processed file for Spark and Hive
Average job execution time for each file size for both Hive and Spark
• Hive vs. Spark Execution Times: Hive takes longer to execute
compared to Spark for all file sizes, and the difference in execution
time becomes more significant as the file size increases.
• Hive's File Size Impact: Hive's execution time decreases as the file size decreases, but not as sharply as Spark's. As the file size grows, Hive's execution time increases much faster, which may indicate scalability challenges.
• Spark's File Size Impact: Like Hive, Spark's execution time decreases as the file size decreases, and the trend is relatively stable. Moreover, the increase in execution time with larger file sizes is not as pronounced as Hive's, suggesting better scalability.
• Spark's Consistent Performance: Spark consistently shows shorter
execution times compared to Hive for all file sizes, indicating better
performance in handling big data workloads.
• Conclusion: Based on the data, it appears that Spark performs
better and is more scalable compared to Hive in handling large
datasets and big data workloads, as it consistently shows shorter
execution times and more stable increases in execution time with
larger file sizes.
Research-Based Differences Between Hive and Spark
• User-friendliness: Hive uses a SQL-like language (HQL) that is familiar to SQL users, while Spark offers a rich API in Scala, Java, Python, and R.
• Complexity: Hive is based on Apache Hadoop and has a complex architecture with multiple components, while Spark is a fast, distributed data processing engine with a simpler architecture and fewer components.
• Performance: Hive is optimized for batch processing and may be slower for real-time or interactive queries, while Spark is optimized for in-memory processing, providing faster processing speeds and lower latencies for certain workloads.
• Data Storage: Hive stores data in the Hadoop Distributed File System (HDFS); Spark does not provide its own distributed file system and instead reads from and writes to external storage such as HDFS, object stores, or the local file system.
• Fault Tolerance: Hive relies on Hadoop's built-in fault tolerance features, such as HDFS replication and MapReduce recovery, while Spark provides built-in fault tolerance through RDD lineage information and Spark Streaming's micro-batch processing to recover lost data or failed tasks.
Conclusion
 Based on the comprehensive analysis of job execution times for Hive and Spark on a YARN
cluster with HDFS storage, it can be concluded that Spark outperforms Hive in terms of job
execution times for all file sizes tested.
 Across multiple attempts, Spark consistently showed shorter job execution times compared to
Hive, indicating better performance in processing big data.
 The results of this study suggest that Spark is a more efficient framework for processing big data
compared to Hive, especially for larger file sizes, when using a YARN cluster with HDFS storage.
 Organizations processing big data on a YARN cluster with HDFS storage may benefit from considering Spark as a more efficient and faster alternative to Hive.