FAST DATA PROCESSING WITH APACHE SPARK
A case study of faster data processing using Apache Spark on an
AWS cluster.
Case Study by : Aptus Data Labs | http://www.aptusdatalabs.com/
KEY OBJECTIVES AND SOLUTION APPROACH
The client is an Australia-based
organisation specialising in data
insights. It is responsible for
extracting meaningful patterns
from pharmaceutical data.
The data is collected
regularly from various drug stores
across Australia and contains the
details of the drugs prescribed to each patient.
The task was to process multiple
batches of the pharmaceutical data.
Reduced Processing Time
The new solution processed
1.2 billion records in 1 hour,
a 62% performance boost over
the existing platform.
BIG DATA & ANALYTICS
Each batch could contain up to
a billion records.
Processing the data involved
multiple order by and group by
operations. The result for each record
also depended on the results of the
preceding and succeeding records,
so every record had to be
processed, which was a bottleneck.
The existing solution was running on
a 5-node Vertica cluster, which took
2.2 hours to process a billion records.
The key objectives were to migrate the existing platform to an Apache Spark cluster to
improve the processing time, reduce IT costs, and make it easier to adopt new features
in the future.
PERFORMANCE AND BENEFITS
Reduced IT Costs
The use of open-source technologies
effectively reduced the IT costs.
Fault Tolerant and HA
The solution is able to handle massive
data volumes, and is highly scalable and fault
tolerant. The use of a YARN cluster
ensures the high availability of the
environment.
In order to migrate the environment, several steps were carried out to bring the best
out of Apache Spark. The following methodologies were used in the solution.
The data is ingested from both database and HDFS sources using the Spark Data Source
API.
As the data is in a structured tabular format, DataFrames were used to store the data
instead of traditional RDDs. DataFrames work efficiently with structured relational
data, which helped to reduce the processing time.
Procedures that previously did the processing in Vertica were replaced by UDFs (User
Defined Functions) in Spark.
Spark SQL is used to pass the DataFrames to the UDFs for processing. It is also
used to perform the various join, order by and group by operations faster.
The DataFrame was partitioned so that the processing could run across all the nodes in
parallel.
The current environment is deployed on a 3-node HDP cluster with Apache Spark 1.6 on
AWS. Each node has 4 cores, 30 GB of memory and 80 GB of SSD storage. The YARN resource
manager is used instead of Spark's standalone resource manager to ensure high availability of the cluster. Shell
scripts are used for deploying and automating the Spark jobs.