FAST DATA PROCESSING
WITH APACHE SPARK
A case study of faster data processing using Apache Spark on an AWS cluster.
Case study by: Aptus Data Labs | http://www.aptusdatalabs.com/
KEY OBJECTIVES AND SOLUTION APPROACH
The client is an Australia-based organisation specialising in data insights.
The client's organisation is responsible for extracting meaningful patterns from pharmaceutical data.
The data is collected regularly from drug stores across Australia and contains the details of the drugs prescribed to each patient.
The task was to process multiple batches of this pharmaceutical data.
Reduced Processing Time
The current solution processes 1.2 billion records in 1 hour, which reduces processing time by up to 62%.
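The 62% figure can be checked against the two timings the case study reports: 2.2 hours per billion records on the earlier Vertica cluster, and 1.2 billion records in 1 hour on Spark. The sketch below normalises both to hours per billion records; the variable names are illustrative only.

```python
# Figures reported in the case study.
vertica_hours_per_billion = 2.2   # earlier 5-node Vertica cluster
spark_records_billion = 1.2       # records processed by the Spark solution
spark_hours = 1.0                 # time taken for those records

# Normalise the Spark figure to hours per billion records.
spark_hours_per_billion = spark_hours / spark_records_billion

# Relative reduction in processing time per billion records.
reduction = 1 - spark_hours_per_billion / vertica_hours_per_billion

print(f"Spark: {spark_hours_per_billion:.2f} h per billion records")  # 0.83
print(f"Reduction in processing time: {reduction:.0%}")               # 62%
```

Measured this way, the 62% is a reduction in time per record, not a 62% increase in throughput.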
BIG DATA & ANALYTICS
Each batch could contain up to a billion records.
Processing the data involved multiple ORDER BY and GROUP BY operations. The result for each record also depended on the results of the preceding and succeeding records, so all the records had to be processed, which was a bottleneck.
The existing solution ran on a 5-node Vertica cluster, which took 2.2 hours to process a billion records.
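To make the neighbour dependency concrete, here is a minimal plain-Python sketch of the pattern: order the records, group them, then compute a value for each record from the one before and the one after it. The case study does not describe the actual rule or schema, so the field names (`patient_id`, dispensing date, drug) and the "days between records" calculation are assumptions chosen purely for illustration. In Spark SQL, this kind of lookup is typically expressed with `lag` and `lead` window functions.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (patient_id, dispensed_on, drug). Not from the case study.
records = [
    ("p2", date(2016, 1, 20), "drug_b"),
    ("p1", date(2016, 1, 5), "drug_a"),
    ("p1", date(2016, 2, 4), "drug_a"),
    ("p1", date(2016, 1, 15), "drug_c"),
]

def with_neighbours(rows):
    """Attach days since the previous and until the next record, per patient."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))                 # order by patient, date
    for patient, group in groupby(ordered, key=itemgetter(0)):   # group by patient
        group = list(group)
        for i, (_, day, drug) in enumerate(group):
            # The first record of a group has no predecessor, the last no successor.
            prev_gap = (day - group[i - 1][1]).days if i > 0 else None
            next_gap = (group[i + 1][1] - day).days if i + 1 < len(group) else None
            out.append((patient, day, drug, prev_gap, next_gap))
    return out

result = with_neighbours(records)
for row in result:
    print(row)
```

Because every record needs its neighbours, the data must be fully sorted within each group before any result can be produced, which is consistent with the bottleneck described above.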
The key objectives were to migrate the existing platform to an Apache Spark cluster in order to improve processing time, reduce IT costs, and make it easier to adopt new features in the future.
PERFORMANCE AND BENEFITS
Reduced IT Costs
The use of open-source technologies reduced IT costs.
Fault Tolerant and HA
The solution can handle massive data volumes and is highly scalable and fault tolerant. Running on a YARN cluster ensures high availability of the environment.
To migrate the environment, several steps were carried out to get the best out of Apache Spark. The following methods were used for the solution.
The data is ingested from both database and HDFS sources using the Spark data source API.
Because the data is in a structured, tabular format, DataFrames are used to store it instead of traditional RDDs. DataFrames work efficiently with structured relational data, which helped to reduce the processing time.
Procedures that previously did the processing in Vertica were replaced by UDFs (User Defined Functions) in Spark.
Spark SQL is used to pass the DataFrames to the UDFs for processing. It is also used to perform the various join, order by and group by operations faster.
The DataFrame was partitioned so that processing runs across all the nodes in parallel.
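The partitioning step above can be pictured with a small plain-Python analogy. This is not Spark code: Spark's DataFrame API handles partitioning and scheduling itself, and the case study does not show its implementation. The function names, the sample rows, and the per-key total (standing in for a UDF) are all hypothetical. The point is only that when all rows for a key land in the same partition, each partition can be processed independently.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 3  # assumption: one partition per node, purely illustrative

def partition_by_key(rows, num_partitions):
    """Place all rows that share a key in the same partition."""
    partitions = defaultdict(list)
    for key, value in rows:
        # crc32 gives a stable hash, so the layout is the same on every run.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return list(partitions.values())  # only non-empty partitions

def process_partition(rows):
    """Stand-in for a UDF applied within one partition: total quantity per key."""
    totals = defaultdict(int)
    for key, quantity in rows:
        totals[key] += quantity
    return dict(totals)

# Hypothetical (patient_id, quantity) rows.
rows = [("p1", 2), ("p2", 1), ("p1", 3), ("p3", 4), ("p2", 5)]
partitions = partition_by_key(rows, NUM_PARTITIONS)

# Threads stand in for cluster nodes here; each one handles whole partitions.
with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    partial_results = list(pool.map(process_partition, partitions))

# Keys never span partitions, so the partial results merge without conflicts.
totals = {k: v for part in partial_results for k, v in part.items()}
print(sorted(totals.items()))  # [('p1', 5), ('p2', 6), ('p3', 4)]
```

Partitioning on the grouping key matters for this workload because records that depend on their neighbours must end up on the same node.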
The current environment is deployed on a 3-node HDP cluster running Apache Spark 1.6 on AWS. Each node has 4 cores, 30 GB of memory and 80 GB of SSD storage. The YARN resource manager is used instead of Spark's own resource manager to ensure high availability of the cluster. Shell scripts are used to deploy and automate the Spark jobs.