Apache Spark
Riya Singhal
Agenda
● What is Big Data?
● What is the solution to Big Data?
● How can Apache Spark help us?
● Apache Spark advantages over Hadoop MapReduce
What is Big Data?
● Lots of Data (Terabytes or Petabytes).
● Large and complex.
● Difficult to handle with relational databases.
● Challenges arise in searching, storing, transferring, analysing and visualising the data.
● Requires parallel processing on hundreds of machines.
Hadoop MapReduce
● Allows distributed processing of large datasets across clusters.
● It is an open-source framework (not a database) with scale-out storage and distributed
processing.
● Characteristics:
○ Economical
○ Scalable
○ Reliable
○ Flexible
MapReduce
● Map - Data is converted into tuples (key/value pairs).
● Reduce - Takes the output of map as input and combines the values for each key into a
smaller set of tuples.
● Advantages
○ Scale data
○ Parallel Processing
○ Fast
○ Built-in fault tolerance
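The Map and Reduce steps described above can be sketched in plain Python. This is only a single-process illustration, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and the grouping step in the middle is an assumption about what the framework does between the two phases.

```python
from collections import defaultdict

def map_phase(records):
    # Map: turn every input record into a (key, value) tuple.
    return [(word, 1) for word in records]

def shuffle(pairs):
    # Group all values that share a key (done by the framework between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values of each key into a smaller set of tuples.
    return {key: sum(values) for key, values in groups.items()}

records = ['cat', 'dog', 'cat']
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

On a real cluster, each phase would run in parallel on many machines, with the input split across them.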
Shortcomings of MapReduce
1. Slow for Iterative Jobs.
2. Slow for Interactive Ad-hoc queries.
3. Operations - Forces every task to be expressed as a Map and a Reduce.
4. Difficult to program - Even a simple join operation requires extensive code.
Lacks data sharing. Data is shared between jobs through stable storage (HDFS), which is slow.
It is slow because of replication and disk I/O, but these are essential for fault tolerance.
Can we use memory instead? How will it remain fault tolerant?
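The cost of sharing data through stable storage in an iterative job can be illustrated with a toy model. The class `StableStorage` below is a hypothetical stand-in for HDFS that merely counts reads, and both functions are assumptions made for illustration; the numbers show how often storage is touched, not real timings.

```python
class StableStorage:
    """Hypothetical stand-in for HDFS that counts how often it is read."""

    def __init__(self, data):
        self._data = data
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)

def iterate_from_storage(storage, iterations):
    # MapReduce style: every iteration goes back to stable storage.
    total = 0
    for _ in range(iterations):
        total += sum(storage.read())
    return total

def iterate_from_memory(storage, iterations):
    # In-memory style: read once, then reuse the cached copy.
    cached = storage.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total

disk = StableStorage([1, 2, 3, 4])
disk_total = iterate_from_storage(disk, 5)

memory = StableStorage([1, 2, 3, 4])
memory_total = iterate_from_memory(memory, 5)

print(disk.reads, memory.reads)
```

Both variants compute the same result; only the number of trips to storage differs. This is the gap that in-memory data sharing aims to close.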
Apache Spark
● Developed in 2009 by UC Berkeley.
● Processing engine.
● Used for speed, ease of use, and sophisticated analytics.
● It builds on the MapReduce model and extends it to support more types of
computation.
● In the 2014 Daytona GraySort benchmark, Spark sorted 100 TB of data (1
trillion records) three times faster than Hadoop MapReduce while using ten
times fewer machines.
Apache Spark
● Improves efficiency through
○ In-memory data sharing.
○ General computation graph.
● Improves usability through
○ Rich APIs in Java, Scala, Python.
○ Interactive Shell.
How?
Up to 100x faster in memory
and 10x faster on disk
Up to 2-5x less code
Resilient Distributed Dataset (RDD)
● Fundamental Data Structure of Apache Spark.
● Read-only collection of objects partitioned across a set of machines.
● Performs in-memory computation.
● Built through transformation operations such as map and filter.
● Fault tolerant through lineage.
● Features:
○ Immutable
○ Parallel
○ Cacheable
○ Lazily evaluated
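Fault tolerance through lineage can be sketched as follows. `ToyRDD` is a hypothetical, single-process class written for this illustration and is not the Spark API: each dataset remembers its parent and the step that derived it, so a lost partition can be rebuilt by replaying those steps instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned, with lineage."""

    def __init__(self, source=None, parent=None, step=None):
        self.source = source  # list of partitions, only set on the root dataset
        self.parent = parent  # the dataset this one was derived from
        self.step = step      # function applied to one parent partition

    def map(self, f):
        return ToyRDD(parent=self, step=lambda part: [f(x) for x in part])

    def filter(self, keep):
        return ToyRDD(parent=self, step=lambda part: [x for x in part if keep(x)])

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute_partition(self, i):
        # Walk the lineage back to the source and replay the recorded steps.
        if self.parent is None:
            return list(self.source[i])
        return self.step(self.parent.compute_partition(i))

base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
derived = base.map(lambda x: x * 10).filter(lambda x: x > 20)

# Materialise every partition, then simulate losing the machine holding partition 1.
cache = {i: derived.compute_partition(i) for i in range(derived.num_partitions())}
del cache[1]

# Only the lost partition is rebuilt, using the lineage of `derived`.
cache[1] = derived.compute_partition(1)
print(cache)
```

Because no dataset is ever modified in place, replaying the same steps on the same source partition always gives the same result. This is why immutability and lineage go together.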
Resilient Distributed Dataset (RDD)
Two types of operation can be performed:
● Transformation
○ Create new RDD from existing RDD.
○ Builds up a DAG (directed acyclic graph) of operations.
○ Lazily evaluated.
○ Increases efficiency by not returning large intermediate datasets.
○ E.g. groupByKey, reduceByKey, filter.
● Action
○ Triggers execution of all pending transformations.
○ Performs computation.
○ Returns result to driver program.
○ Eg. collect, count, take.
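The difference between a lazy transformation and an action can be demonstrated without Spark, because Python 3's built-in `map` is lazy too. This is only an analogy, not Spark code; the helper `tag` and the `calls` list are hypothetical names used to observe when work actually happens.

```python
calls = []

def tag(animal):
    # Record every invocation so we can see when the work is really done.
    calls.append(animal)
    return (animal, 1)

data = ['cat', 'dog', 'cat']

# "Transformation": builds a lazy iterator, nothing is computed yet.
pending = map(tag, data)
ran_before_action = len(calls)

# "Action": forcing the result runs the whole pipeline.
result = list(pending)
ran_after_action = len(calls)

print(ran_before_action, ran_after_action, result)
```

In Spark the same idea applies at cluster scale: transformations only extend the DAG, and an action such as collect or count makes the scheduler run it.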
Ready for some programming...
(using Python)
Creating RDD
# Create a list of animals.
animals = ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']
# parallelize() creates an RDD from a list. Here "animalRDD" is created.
# sc is the SparkContext object (predefined in the pyspark shell).
animalRDD = sc.parallelize(animals)
# Since an RDD is lazily evaluated, we need an action to see its contents:
# collect() returns all elements to the driver, where we print them.
print(animalRDD.collect())
Output - ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']
Creating RDD from file
# The file words.txt contains names of animals, one per line; animalsRDD is built from it.
animalsRDD = sc.textFile('/path/to/file/words.txt')
# collect() is the action that triggers the read and returns the lines.
print(animalsRDD.collect())
Map operation on RDD
# To count the frequency of animals, we first make a (key, value) pair, (animal, 1), for
# each animal; a later reduce operation then adds up the values for each key.
# lambda is used to write inline functions in Python.
mapRDD = animalRDD.map(lambda x: (x, 1))
print(mapRDD.collect())
Output - [('cat',1), ('dog',1), ('elephant',1), ('cat',1), ('mouse',1), ('cat',1)]
Reduce operation on RDD
# reduceByKey performs a reduce operation over the values that share a key. The function
# passed as its argument adds two values for the same key, so we get the count of
# each animal.
reduceRDD = mapRDD.reduceByKey(lambda x, y: x + y)
print(reduceRDD.collect())
Output - [('cat',3), ('dog',1), ('elephant',1), ('mouse',1)] (the order of the pairs may vary)
Filter operation on RDD
# Keep only the animals in reduceRDD whose count is greater than 2. x is a tuple
# (animal, count), i.e. x[0] is the animal name and x[1] is its count.
# Therefore we filter reduceRDD on x[1] > 2.
filterRDD = reduceRDD.filter(lambda x: x[1] > 2)
print(filterRDD.collect())
Output - [('cat',3)]
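The outputs shown on these slides can be cross-checked without a cluster. The snippet below is a plain-Python sketch, not PySpark: `Counter` stands in for the map and reduceByKey steps, and a generator expression stands in for filter. It assumes the same `animals` list as above.

```python
from collections import Counter

animals = ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']

# Equivalent of map(lambda x: (x, 1)) followed by reduceByKey(lambda x, y: x + y).
counts = Counter(animals)

# Equivalent of filter(lambda x: x[1] > 2); sorted() gives a deterministic order.
frequent = sorted((animal, n) for animal, n in counts.items() if n > 2)
print(frequent)
```

A local check like this is handy when developing a Spark job on a small sample before running it on the full dataset.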
Please refer to http://spark.apache.org/docs/latest/programming-guide.html for
more about programming in Apache Spark.
APACHE SPARK
VS
HADOOP MAPREDUCE
Spark vs. Hadoop
● Performance
○ Spark is generally faster because it computes in memory.
○ Hadoop is good for one-pass ETL jobs and for data that does not fit in memory.
● Ease of use
○ Spark is easier to program and provides APIs in Java, Scala, R and Python.
○ Spark has an interactive mode.
○ Hadoop MapReduce is more difficult to program, but many tools are available to
make it easier.
● Cost
○ Spark is cost effective according to benchmark, though staffing can be costly.
● Compatibility
○ Compatibility to data types and data sources is the same for both.
Spark vs. Hadoop
● Data Processing
○ Spark can perform both real-time processing and batch processing.
○ Hadoop MapReduce is good for batch processing. Hadoop requires Storm for real-time
processing, Giraph for graph processing, and Mahout for machine learning.
● Fault tolerance
○ Hadoop MapReduce is slightly more fault tolerant.
● Caching
○ Spark can cache the input data.
Applications
Companies that use Hadoop and Spark are:
● Hadoop - Good for static (batch) operations.
○ Dell, IBM, Cloudera, AWS and many more.
● Spark
○ Real-time marketing campaign, online product recommendations etc.
○ eBay, Amazon, Yahoo, Nokia and many more.
○ Data mining 40x faster than Hadoop (Conviva).
○ Traffic Prediction via EM (Mobile Millennium).
○ DNA Sequence Analysis (SNAP).
○ Twitter Spam Classification (Monarch).
Apache Spark helping companies grow in
their business
● Spark Helps Pinterest Identify Trends - Using Spark, Pinterest is able to
identify—and react to—developing trends as they happen.
● Netflix Leans on Spark for Personalization Aid - Netflix uses Spark to support
real-time stream processing for online recommendations and data monitoring.
Libraries of Apache Spark
Spark ships with libraries that make it a general-purpose engine. We can combine these libraries
seamlessly in the same application to provide more functionality.
Libraries provided by Apache Spark are:
1. Spark Streaming - Supports scalable and fault-tolerant processing of
streaming data.
2. Spark SQL - Allows Spark to work with structured data.
3. Spark MLlib - A scalable machine learning library with machine
learning and statistical algorithms.
4. Spark GraphX - Used for graph computation over data.
Refer to http://spark.apache.org/docs/latest/ for more information.
THANK YOU

Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 

[@NaukriEngineering] Apache Spark

  • 2. Agenda
    ● What is Big Data?
    ● What is the solution to Big Data?
    ● How can Apache Spark help us?
    ● Apache Spark advantages over Hadoop MapReduce
  • 3. What is Big Data?
    ● Lots of data (terabytes or petabytes).
    ● Large and complex.
    ● Difficult to handle with relational databases.
    ● Challenges arise in searching, storing, transferring, analysing, and visualising it.
    ● Requires parallel processing on hundreds of machines.
  • 4. Hadoop MapReduce
    ● Allows distributed processing of large datasets across clusters.
    ● It is an open source framework (not a database management system) that combines scale-out storage with distributed processing.
    ● Characteristics:
      ○ Economical
      ○ Scalable
      ○ Reliable
      ○ Flexible
  • 5. MapReduce
    ● Map - Input data is converted into tuples (key/value pairs).
    ● Reduce - Takes the output of the map step and combines it into a smaller set of tuples.
    ● Advantages
      ○ Scales with data
      ○ Parallel processing
      ○ Fast
      ○ Built-in fault tolerance
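The map and reduce steps described above can be sketched in plain Python. This is only an illustration of the idea on a single machine, not Hadoop's API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical, and the explicit shuffle step is an assumption added to show how values reach the reducer grouped by key.

```python
from collections import defaultdict


def map_phase(records):
    """Map: turn each input record into a (key, value) tuple."""
    return [(record, 1) for record in records]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into one smaller result."""
    return {key: sum(values) for key, values in grouped.items()}


words = ["cat", "dog", "cat", "mouse", "cat"]
counts = reduce_phase(shuffle_phase(map_phase(words)))
print(counts)  # {'cat': 3, 'dog': 1, 'mouse': 1}
```

In a real cluster each phase would run in parallel on many machines, with the intermediate results written to disk between phases.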
  • 7. Shortcomings of MapReduce
    1. Slow for iterative jobs.
    2. Slow for interactive ad-hoc queries.
    3. Operations - forces every task to be expressed as a Map or a Reduce.
    4. Difficult to program - even a simple join requires extensive code.
    It lacks efficient data sharing: data is shared between jobs through stable storage (HDFS), which is slow because of replication and disk I/O, although both are essential for fault tolerance.
    Can we use memory instead? How would it remain fault tolerant?
  • 8. Apache Spark
    ● Started in 2009 at UC Berkeley.
    ● A data processing engine.
    ● Used for speed, ease of use, and sophisticated analytics.
    ● It builds on the MapReduce model and extends it to support more types of computation.
    ● In the Daytona GraySort benchmark, Spark sorted 100 TB of data (1 trillion records) three times faster than Hadoop MapReduce while using ten times fewer machines.
  • 9. Apache Spark
    ● Improves efficiency through
      ○ In-memory data sharing.
      ○ General computation graph.
      ○ Result: up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce.
    ● Improves usability through
      ○ Rich APIs in Java, Scala, Python.
      ○ Interactive shell.
      ○ Result: up to 2-5x less code.
    How?
  • 10. Resilient Distributed Dataset (RDD)
    ● Fundamental data structure of Apache Spark.
    ● Read-only collection of objects partitioned across a set of machines.
    ● Performs in-memory computation.
    ● Built through transformation operations such as map and filter.
    ● Fault tolerant through lineage: a lost partition is recomputed from the transformations that produced it.
    ● Features:
      ○ Immutable
      ○ Parallel
      ○ Cacheable
      ○ Lazily evaluated
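Fault tolerance through lineage can be pictured with a small plain-Python sketch. It is not Spark code: `source`, `lineage`, and `compute_partition` are hypothetical names, and the two-partition layout is an assumption chosen for illustration. The point is that a lost partition is rebuilt by replaying the recorded transformations on the immutable input, so no replica of the derived data is needed.

```python
# Hypothetical toy model; plain Python, not Spark's API.
source = [[1, 2, 3], [4, 5, 6]]                 # immutable input, two partitions
lineage = [lambda x: x * 10, lambda x: x + 1]   # recorded transformations


def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    data = source[index]
    for fn in lineage:
        data = [fn(x) for x in data]
    return data


partitions = [compute_partition(i) for i in range(len(source))]
partitions[1] = None                   # simulate losing one partition
partitions[1] = compute_partition(1)   # recompute only the lost partition
print(partitions)  # [[11, 21, 31], [41, 51, 61]]
```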
  • 11. Resilient Distributed Dataset (RDD)
    Two types of operation can be performed:
    ● Transformation
      ○ Creates a new RDD from an existing RDD.
      ○ Adds a step to the DAG.
      ○ Lazily evaluated.
      ○ Increases efficiency by not returning large intermediate datasets to the driver.
      ○ E.g. groupByKey, reduceByKey, filter.
    ● Action
      ○ Triggers execution of all pending transformations.
      ○ Performs the computation.
      ○ Returns the result to the driver program.
      ○ E.g. collect, count, take.
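The difference between a transformation and an action can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in and not part of Spark; it only records `map` and `filter` steps and runs them when `collect()` is called, which is the behaviour the slide describes.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not Spark's API)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # pending transformations, in order

    def map(self, fn):
        # Transformation: return a new dataset, compute nothing yet.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: return a new dataset, compute nothing yet.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: run every pending step and return the result.
        result = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []


def double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has been computed
result = ds.collect()              # the action triggers the computation
print(calls_before_action, result)  # 0 [6, 8]
```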
  • 15. Ready for some programming... (using Python)
  • 16. Creating RDD
        # Create a list of animals.
        animals = ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']
        # parallelize() creates an RDD from a list; here "animalRDD" is created.
        # sc is the SparkContext object (predefined in the pyspark shell).
        animalRDD = sc.parallelize(animals)
        # An RDD is lazily evaluated, so to print it we run an action,
        # collect(), which returns its elements to the driver.
        print(animalRDD.collect())
    Output - ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']
  • 17. Creating RDD from file
        # The file words.txt holds the animal names from which animalsRDD is built,
        # one RDD element per line of the file.
        animalsRDD = sc.textFile('/path/to/file/words.txt')
        # collect() is the action that triggers the read.
        print(animalsRDD.collect())
  • 18. Map operation on RDD
        # To count the frequency of each animal, we build a (key, value) pair
        # (animal, 1) for every animal and later reduce by key to sum the values.
        # lambda is used to write inline functions in Python.
        mapRDD = animalRDD.map(lambda x: (x, 1))
        print(mapRDD.collect())
    Output - [('cat',1), ('dog',1), ('elephant',1), ('cat',1), ('mouse',1), ('cat',1)]
  • 19. Reduce operation on RDD
        # reduceByKey applies a reduce operation to all values that share a key.
        # The function passed in adds two values for the same key,
        # so the result is the count of each animal.
        reduceRDD = mapRDD.reduceByKey(lambda x, y: x + y)
        print(reduceRDD.collect())
    Output (order may vary) - [('cat',3), ('dog',1), ('elephant',1), ('mouse',1)]
  • 20. Filter operation on RDD
        # Keep only the animals in reduceRDD whose count is greater than 2.
        # x is a tuple (animal, count), i.e. x[0] = animal name and x[1] = count,
        # so we filter reduceRDD on x[1] > 2.
        filterRDD = reduceRDD.filter(lambda x: x[1] > 2)
        print(filterRDD.collect())
    Output - [('cat',3)]
  • 21. Please refer to http://spark.apache.org/docs/latest/programming-guide.html for more about programming in Apache Spark.
  • 24. Spark vs. Hadoop
    ● Performance
      ○ Spark performs better because it computes in memory.
      ○ Hadoop is good for one-pass ETL jobs and for data that does not fit in memory.
    ● Ease of use
      ○ Spark is easier to program and provides APIs in Java, Scala, R, and Python.
      ○ Spark has an interactive mode.
      ○ Hadoop MapReduce is more difficult to program, but many tools are available to make it easier.
    ● Cost
      ○ Spark is cost effective according to benchmarks, though staffing can be costly.
    ● Compatibility
      ○ Compatibility with data types and data sources is the same for both.
  • 25. Spark vs. Hadoop
    ● Data processing
      ○ Spark can perform both real-time processing and batch processing.
      ○ Hadoop MapReduce is good for batch processing. Hadoop requires Storm for real-time processing, Giraph for graph processing, and Mahout for machine learning.
    ● Fault tolerance
      ○ Hadoop MapReduce is slightly more fault tolerant.
    ● Caching
      ○ Spark can cache the input data.
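Why caching helps repeated or iterative jobs can be shown with a small plain-Python sketch. `CachedDataset` and `expensive_load` are hypothetical names, not Spark's API; the sketch assumes the load step stands in for reading input from disk. Without the cache every action reloads the data, while with it the data is loaded once and reused.

```python
class CachedDataset:
    """Hypothetical toy, not Spark's API: reloads unless cache() was called."""

    def __init__(self, load):
        self._load = load
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Mark the dataset so the first computed result is kept in memory.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._load()
            return self._cached
        return self._load()


loads = []


def expensive_load():
    loads.append(1)  # count how often the input is read
    return [1, 2, 3]


plain = CachedDataset(expensive_load)
for _ in range(3):
    plain.collect()
loads_without_cache = len(loads)  # one load per action

loads.clear()
cached = CachedDataset(expensive_load).cache()
for _ in range(3):
    cached.collect()
loads_with_cache = len(loads)     # a single load, then reuse

print(loads_without_cache, loads_with_cache)  # 3 1
```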
  • 26. Applications
    Companies that use Hadoop and Spark:
    ● Hadoop - well suited to static operations.
      ○ Dell, IBM, Cloudera, AWS and many more.
    ● Spark
      ○ Real-time marketing campaigns, online product recommendations, etc.
      ○ eBay, Amazon, Yahoo, Nokia and many more.
      ○ Data mining 40x faster than Hadoop (Conviva).
      ○ Traffic prediction via EM (Mobile Millennium).
      ○ DNA sequence analysis (SNAP).
      ○ Twitter spam classification (Monarch).
  • 27. Apache Spark helping companies grow their business
    ● Spark helps Pinterest identify trends - Using Spark, Pinterest is able to identify, and react to, developing trends as they happen.
    ● Netflix leans on Spark for personalization - Netflix uses Spark to support real-time stream processing for online recommendations and data monitoring.
  • 28. Libraries of Apache Spark
    Spark ships with libraries that make it a general-purpose platform. They can be combined seamlessly in the same application. The libraries provided by Apache Spark are:
    1. Spark Streaming - Supports scalable, fault-tolerant processing of streaming data.
    2. Spark SQL - Allows Spark to work with structured data.
    3. Spark MLlib - A scalable library of machine learning and statistical algorithms.
    4. Spark GraphX - Used for graph computation over data.
    Refer to http://spark.apache.org/docs/latest/ for more information.