APACHE SPARK
✓ Need for Spark
✓ Introduction to Apache Spark
✓ Spark features
✓ Spark architecture
✓ What is an RDD?
✓ Transformations & Actions
✓ Spark execution model
✓ Spark ecosystem
Why Spark?
Need for a general-purpose cluster computing system,
because existing engines are each specialized:
➢MapReduce is limited to batch processing
➢Storm is limited to real-time stream processing
➢Impala/Tez are limited to interactive processing
➢Neo4j/Giraph are limited to graph processing
Need for Spark
• Need for a powerful engine that can process
data in real time (streaming) as well as in
batch mode
• Need for a powerful engine that can respond in
under a second and perform in-memory analytics
• Apache Spark is a powerful open-source engine
that provides real-time (stream), interactive,
graph, in-memory, and batch processing
with speed, ease of use, and sophisticated
analytics.
What is Apache Spark?
A lightning-fast, general-purpose cluster
computing system
Introduction to Apache Spark
➢Apache Spark is a lightning-fast cluster computing
tool
➢General-purpose distributed system
➢Up to 100 times faster than Hadoop MapReduce for
some in-memory workloads
➢Written in Scala
➢Provides APIs in Scala, Java, Python and R
➢Integrates with Hadoop and can process existing
Hadoop data
History
• Started at UC Berkeley's AMPLab in 2009
• Open sourced in 2010
• Donated to the Apache Software Foundation in 2013, became a top-level
project in 2014
• Became one of the most active projects at Apache in 2015
Sort Record
Apache Spark features
• Speed
• Ease of use
• Low latency
• Integration with Hadoop
• Rich set of operators
• Fault tolerant
• Generalized execution model
Spark Architecture
• Works in a master/slave fashion
– Master node
– Slave nodes
Spark Nodes
Master node
• Manager node
• Assigns work to the slave nodes and keeps track of
it
• Manages, monitors and maintains the slaves
• Master daemon: runs on the master node
Slave Nodes
• Worker nodes
• Do the work assigned by the master
• Slave daemon: runs on every slave node
Basic Spark Architecture
• The user develops the work/application
• The work is submitted to the master
• The master divides the work
• And submits the pieces to the nodes of the cluster
• Each slave works on its share of the job
– In this manner Spark benefits from distributed
computing and parallel processing
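The divide-and-combine flow described in these steps can be sketched in plain Python. This is only an illustration and not the Spark API: the names split_work, worker_task and run_job are hypothetical, and a local thread pool stands in for the slave nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, num_workers):
    """Master side: divide the input into roughly equal chunks, one per worker."""
    # Ceiling division; max(1, ...) keeps the slice step valid for empty input.
    chunk_size = max(1, -(-len(items) // num_workers))
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def worker_task(chunk):
    """Worker side: process one chunk (here, the sum of squares)."""
    return sum(x * x for x in chunk)


def run_job(items, num_workers=3):
    """Submit each chunk to a worker, then combine the partial results."""
    chunks = split_work(items, num_workers)
    # Assumption: threads on one machine play the role of the slave nodes.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, chunks))
    return sum(partial_results)


total = run_job(list(range(10)))
print(total)  # 285
```

Each chunk is processed independently, which is what allows the sub-works to run in parallel before the partial results are combined.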
Resilient Distributed Dataset
• The core abstraction in Spark
– Resilient: if data is lost, it is recreated
automatically (fault tolerant)
– Distributed: data is stored and processed across the nodes of the cluster
– Dataset: data can come from different data stores
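The "resilient" property can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark's implementation: ToyRDD and lose_partition are hypothetical names, and the root data is assumed to live in a durable data store so that it is never lost. Each derived dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """A toy dataset that remembers how each partition was derived."""

    def __init__(self, source_partitions=None, parent=None, func=None):
        self.parent = parent
        self.func = func
        # Assumption: only the root holds source data, kept in durable storage.
        self._source = source_partitions
        # Computed partitions of a derived dataset; these may be lost.
        self._cache = {}
        if parent is None:
            self.num_partitions = len(source_partitions)
        else:
            self.num_partitions = parent.num_partitions

    def map(self, func):
        """Return a new dataset; the existing one is never modified."""
        return ToyRDD(parent=self, func=func)

    def compute(self, index):
        """Return one partition, rebuilding it from the parent if it is missing."""
        if self.parent is None:
            return list(self._source[index])
        if index not in self._cache:
            parent_rows = self.parent.compute(index)
            self._cache[index] = [self.func(x) for x in parent_rows]
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that drops one computed partition."""
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD(source_partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
before = doubled.collect()
doubled.lose_partition(1)   # partition 1 is gone...
after = doubled.collect()   # ...and is recomputed from its parent
print(before == after)      # True
```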
• An RDD is a simple, immutable collection of
objects
• An RDD can contain any type of Scala, Java,
Python or R object
• Each RDD is split into partitions,
which may be computed on different nodes of
the cluster
What is RDD?
• RDDs are the fundamental unit of data in Spark
• Core Spark abstraction
• Enable parallel processing of a dataset
• Immutable, recomputable, fault tolerant
• Spark programs work by performing
operations on RDDs
• Transformations and actions are used to process
RDDs
RDD operations
• Two types of operations
▪ Transformation
- Creates a new RDD from an existing one
- E.g.: map, flatMap, join, etc.
▪ Action
- Returns a result or writes it to storage
- E.g.: count, collect, saveAsTextFile, etc.
• Lazy evaluation
– transformations are only recorded; execution does not
start until an action is triggered
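The split between transformations and actions, and the lazy evaluation that follows from it, can be sketched in plain Python. LazyDataset is a hypothetical class written for illustration, not the Spark API: map and filter only record a step and return a new object, while collect and count actually run the recorded steps.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)
        self._steps = tuple(steps)

    # Transformations: return a new dataset, do no work yet.
    def map(self, func):
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def _run(self):
        """Apply the recorded steps, in order, to the source data."""
        items = iter(self._data)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    # Actions: trigger execution and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the mapped function actually sees


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has executed yet
result = even_squares.collect()    # the action triggers the pipeline
print(calls_before_action, result)  # 0 [4, 16]
```

Because every transformation returns a new object, the original dataset stays unchanged, which mirrors the immutability of RDDs.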
Spark context
• The Spark context is an object
• Every Spark application requires a Spark context
• Main entry point of a Spark application
• Interacts with the cluster manager
• Tells Spark how to access the cluster
• RDDs are created using the Spark context
Spark execution model
• The developer writes the application/program
• It needs the Spark context object, the main
entry point of a Spark application, which
interacts with the cluster manager
• Data nodes are the slaves of HDFS
• Worker nodes are the slaves of Spark
• The cluster manager interacts with the worker
nodes to obtain resources
• The executor is the distributed agent responsible
for the execution of tasks
The driver program
• The driver program runs the main() function
of the application and is the place where the
Spark context is created
• The driver program schedules the job
execution and negotiates with the cluster
manager; where it runs (the submitting machine or
a node of the cluster) depends on the deploy mode
Executor
• An executor is a distributed agent responsible for
the execution of tasks
• Every Spark application has its own executor
processes
• Executors perform all the data processing
• They read data from and write data to external
sources
• They keep computation results
in memory (cache) or on disk
• They interact with the storage systems
Cluster manager
• An external service responsible for acquiring
resources on the Spark cluster and allocating
them to a Spark job
Spark ecosystem
Spark core
• Main Spark engine
• Kernel of Spark
• In charge of essential I/O functionality
Spark SQL
• Enables users to run SQL queries
• Can handle structured or semi-structured data
• One of the most popular SQL engines in big data
Spark streaming
• Handles live data streams with low latency
• Supports interactive and analytical
applications on streaming data
• Can process near-real-time data from multiple
sources
• Internally converts streams into micro-batches,
processes them in the cluster, and pushes the results to
data stores
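The micro-batch idea can be sketched in plain Python. This is an illustration only, not the Spark Streaming API: micro_batches and process_batch are hypothetical names, events are assumed to be (timestamp in seconds, value) pairs, and time windows that contain no events are simply skipped.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        window = int(timestamp // batch_interval)
        batches[window].append(value)
    # Return the batches in time order; empty windows do not appear.
    return [batches[window] for window in sorted(batches)]


def process_batch(batch):
    """Stand-in for the per-batch job (here, a simple sum)."""
    return sum(batch)


events = [(0.2, 5), (0.9, 1), (1.1, 7), (2.5, 2), (2.7, 3)]
results = [process_batch(b) for b in micro_batches(events, batch_interval=1.0)]
print(results)  # [6, 7, 5]
```

Since a result is only available once its window has closed, latency is bounded below by the batch interval, which is why this style of processing is described as near real time.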
MLlib
• Scalable machine learning library
• Used for advanced analytics
GraphX
• Enables users to process graph data
• Data can be represented as a graph
• E.g.:
– in LinkedIn, degrees of connection: 1st-degree and
2nd-degree connections
– in Facebook, friends of friends
Requirements of this type can be handled efficiently by the
graph engine
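The "friends of friends" query can be sketched on a small in-memory graph in plain Python. This is not the GraphX API: second_degree is a hypothetical helper, the sample names are made up, and friendships are assumed to be stored as an adjacency dictionary of sets.

```python
def second_degree(graph, person):
    """Return the people exactly two hops away: friends of friends who are
    neither direct friends nor the person themselves."""
    direct = graph.get(person, set())
    reachable = set()
    for friend in direct:
        reachable |= graph.get(friend, set())
    return reachable - direct - {person}


# Hypothetical sample data: each key maps to that person's direct friends.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

print(sorted(second_degree(friends, "ann")))  # ['dan', 'eve']
```

On a single machine this is a simple set union; a graph engine becomes useful when the adjacency data is too large for one node and the traversal has to be distributed.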
Storage system
• Spark depends on third-party storage
systems, such as:
– HDFS
– HBase
– Cassandra
– Amazon S3, and so on
Use cases
Companies using Spark
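Disadvantages
• No file management system of its own (relies on external storage)
• Expensive (in-memory processing needs a lot of RAM)
• Only near-real-time processing (streams are handled as micro-batches)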
