Introduction to Apache Spark
Olalekan Fuad Elesin, Data Engineer
https://twitter.com/elesinOlalekan
https://github.com/OElesin
https://www.linkedin.com/in/elesinolalekan
00: Getting Started
Introduction
Necessary downloads and installations
Intro: Achievements
By the end of this session, you will be comfortable
with the following:
• opening a Spark Shell
• exploring data sets loaded from HDFS, etc.
• reviewing Spark SQL and Spark Streaming
• using the Spark Notebook
• finding developer community resources, etc.
• returning to your workplace and demoing the use of Spark!
Intro: Preliminaries
I believe we all have
basic Scala programming skills
01: Getting Started
Installations
hands-on: 5 mins (max)
Installation:
Step 1: Install JDK 7 or 8 on macOS, Windows, or Linux
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads
Step 2: Download Spark 2.0.1 from this URL http://spark.apache.org/downlo
(for this session, please copy all installations from the USB disk or hard drive)
Step 3: Run Spark Shell
We’ll run Spark’s interactive shell:
./bin/spark-shell
Let’s create some data from the Scala REPL prompt:
val data = 1 to 100000
Step 4: Now, let’s create an RDD
val dataRDD = sc.parallelize(data)
Then we keep only the elements smaller than 35:
dataRDD.filter(_ < 35).collect()
Check point 1
What was your result?
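One way to check your answer is sketched below. It repeats Steps 3 and 4 and prints how many elements survive the filter. It assumes Spark 2.x is on the classpath; inside spark-shell a session already exists, so getOrCreate() should simply return it. The app name and the local[*] master are illustrative choices, not part of the slides.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists, so getOrCreate() reuses it.
// The app name and local[*] master are assumptions for running outside the shell.
val spark = SparkSession.builder().appName("checkpoint-1").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val data = 1 to 100000
val dataRDD = sc.parallelize(data)

// filter is a transformation: it only describes the computation.
val small = dataRDD.filter(_ < 35)

// collect is an action: it runs the job and returns the results to the driver.
val result = small.collect()

println(result.length)          // number of elements kept
println(result.mkString(", "))  // the kept elements, 1 through 34
```

The filter keeps the elements that satisfy the predicate, so the result is the 34 integers from 1 to 34.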
02: Why Spark
Why Spark
Talk time: 6 mins (max)
Why Spark
• Most machine learning algorithms are iterative
• A large number of computations on data are also iterative
• With the disk-based approach of Hadoop MapReduce, the output of each iteration is
written to disk, which makes iterative processing very slow
Hadoop Execution Flow: Input Data on Disk → Tuples (On Disk) → Tuples (On Disk) → Tuples (On Disk) → Output Data on Disk
Spark Execution Flow: Input Data on Disk → RDD1 (in memory) → RDD2 (in memory) → RDD3 (in memory) → Output Data on Disk
Source: http://www.wiziq.com/blog/hype-around-apache-spark/
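The in-memory flow can be illustrated with a small iterative computation. The sketch below is a toy example, not taken from the slides: it caches an RDD once and then runs 20 passes of gradient descent over it, so each pass reuses the cached partitions instead of re-reading the input. The data set, learning rate, and iteration count are all assumptions chosen to keep the example short.

```scala
import org.apache.spark.sql.SparkSession

// App name and local[*] master are assumptions for running outside spark-shell.
val spark = SparkSession.builder().appName("iterative-cache").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical data set: the values 1.0 to 1000.0.
val points = sc.parallelize(1 to 1000).map(_.toDouble)

// cache() asks Spark to keep the partitions in memory after the first action,
// so later iterations reuse them instead of recomputing them from the source.
points.cache()

// Toy iterative algorithm: gradient descent on f(m) = mean((x - m)^2),
// which converges to the mean of the data. Every iteration is one Spark job
// over the same cached RDD.
val n = points.count()
val learningRate = 0.25
var estimate = 0.0
for (_ <- 1 to 20) {
  val m = estimate // copy into a val so the closure does not capture the var
  val gradient = points.map(x => 2 * (m - x)).sum() / n
  estimate = m - learningRate * gradient
}

println(estimate) // close to the true mean, 500.5
```

With this learning rate the error halves on every pass, so 20 passes are enough to land very close to the mean.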
03: About Apache Spark
About Apache Spark
Talk time: 4 mins (max)
About Apache Spark
• Started at UC Berkeley in 2009 as a PhD thesis project by Matei Zaharia
• Fast and general-purpose cluster computing system
• Up to 10x (on disk) to 100x (in memory) faster than MapReduce
• Popular for iterative machine learning algorithms, batch and streaming
computations on data, and its SQL interface and DataFrames
• Can also be used for Graph processing
• Provides high level API in:
• Scala
• Java
• Python
• R
• Seamless integration with Hadoop and its Ecosystem. Can also read data from a
number of existing data sources
• More info: http://spark.apache.org/
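As a rough illustration of reading from an existing data source, the sketch below writes a tiny JSON-lines file to a temporary location and reads it back twice: as raw lines with sc.textFile and as structured records with spark.read.json. The sample records and file name are hypothetical. The same calls are expected to accept an HDFS location when given an hdfs:// URI, which is shown only as a placeholder in a comment.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// App name and local[*] master are assumptions for running outside spark-shell.
val spark = SparkSession.builder().appName("data-sources").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Write a small JSON-lines file to a temporary location (hypothetical sample data).
val path = Files.createTempFile("people", ".json")
val sample = Seq(
  """{"name":"Ada","age":36}""",
  """{"name":"Tunde","age":29}""",
  """{"name":"Grace","age":45}"""
).mkString("\n")
Files.write(path, sample.getBytes(StandardCharsets.UTF_8))

// The URI scheme selects the storage system: file:// here. An HDFS location
// would look like hdfs://<namenode>/<path> (placeholder, not a real address).
val uri = path.toUri.toString

// Same file, two views: raw lines as an RDD, parsed records as a DataFrame.
val lines = sc.textFile(uri)
val people = spark.read.json(uri)

println(lines.count())
people.printSchema()
```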
04: Spark Stack
Spark Built-in Libraries
Talk time: 4 mins (max)
Spark Stack
• Spark SQL lets you query
structured data inside Spark
programs, using either SQL or a
familiar DataFrame API.
• Spark Streaming lets you write
streaming jobs the same way you
write batch jobs.
• Spark MLlib & ML: Machine
learning algorithms
• GraphX unifies ETL, exploratory
analysis, and iterative graph
computation within a single system
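The Spark SQL bullet mentions two ways to query structured data. The sketch below runs the same query both ways, once as SQL against a temporary view and once through the DataFrame API. The rows and the column names (name, age) are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// App name and local[*] master are assumptions for running outside spark-shell.
val spark = SparkSession.builder().appName("spark-sql").master("local[*]").getOrCreate()

// Hypothetical sample rows; the column names are assumptions for the example.
val people = spark.createDataFrame(Seq(
  ("Ada", 36), ("Tunde", 29), ("Grace", 45)
)).toDF("name", "age")

// 1) SQL: register the DataFrame as a temporary view and query it.
people.createOrReplaceTempView("people")
val viaSql = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")

// 2) DataFrame API: the same query expressed with method calls.
val viaApi = people.filter(col("age") > 30).select("name").orderBy("name")

val sqlNames = viaSql.collect().map(_.getString(0)).toSeq
val apiNames = viaApi.collect().map(_.getString(0)).toSeq
println(sqlNames)
println(apiNames)
```

Both forms return the same two names, Ada and Grace, so the choice between them is mostly a matter of style.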
05: Spark Execution Flow
Spark Execution Flow
Talk time: 4 mins (max)
Execution Flow
http://spark.apache.org/docs/latest/cluster-overview.html
Image Courtesy: Apache Spark Website
06: Terminology
BuzzWords !!!
Talk time: 6 mins (max)
Term: Meaning
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
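To connect the terms job, stage, and task to something you can run, here is a minimal sketch. It builds a word count whose reduceByKey step needs a shuffle, prints the lineage, and then triggers a single job with collect(). The word list and the partition count of 4 are assumptions picked for the example.

```scala
import org.apache.spark.sql.SparkSession

// App name and local[*] master are assumptions for running outside spark-shell.
val spark = SparkSession.builder().appName("jobs-and-stages").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// 4 partitions, so the first stage of a job over this RDD runs 4 tasks.
val words = sc.parallelize(Seq("spark", "scala", "spark", "rdd", "scala", "spark"), 4)

// Transformations only build the lineage; no job runs yet.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// reduceByKey needs a shuffle, which splits the job into two stages.
// toDebugString prints the lineage, including the shuffle boundary.
println(counts.toDebugString)

// collect() is an action: it triggers one job made of those stages.
val result = counts.collect().toMap
println(result)
```

While this runs, the driver's logs should mention the job and its stages using the same vocabulary as the table above.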
07: Resilient Distributed Dataset
RDD !!!
Talk time: 4 mins (max)
• Resilient Distributed Dataset (RDD) is the basic abstraction in Spark
• Immutable, partitioned collection of elements that can be operated on in parallel
• Basic operations
– map
– filter
– persist
• Extra operations, available through implicit conversions
– PairRDDFunctions: operations on RDDs of key-value pairs, such as groupByKey and join
– DoubleRDDFunctions: operations on RDDs of double values
– SequenceFileRDDFunctions: operations related to SequenceFiles
• RDD main characteristics:
– A list of partitions
– A function for computing each split
– A list of dependencies on other RDDs
– Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash partitioned)
– Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)
• Custom RDDs can also be implemented (by overriding these functions)
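The basic operations and the extra key-value and numeric functions listed above can be tried out as follows. This is a minimal sketch with made-up data: the city names, region labels, and numbers are hypothetical, and MEMORY_ONLY is just one of the available storage levels.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// App name and local[*] master are assumptions for running outside spark-shell.
val spark = SparkSession.builder().appName("rdd-basics").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// A partitioned collection: 1 to 10 split across 2 partitions.
val numbers = sc.parallelize(1 to 10, 2)

// map and filter return new RDDs; the original is never modified.
val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

// persist marks the RDD to be kept (here in memory) once it has been computed.
evenSquares.persist(StorageLevel.MEMORY_ONLY)
val squares = evenSquares.collect().toList

// PairRDDFunctions: key-value operations such as reduceByKey and join.
val sales = sc.parallelize(Seq(("lagos", 10), ("abuja", 5), ("lagos", 7)))
val regions = sc.parallelize(Seq(("lagos", "south-west"), ("abuja", "north-central")))
val totals = sales.reduceByKey(_ + _)
val joined = totals.join(regions).collect().toMap

// DoubleRDDFunctions: numeric helpers such as mean.
val average = numbers.map(_.toDouble).mean()

println(squares)
println(joined)
println(average)
```

None of the extra functions needs an explicit import here: they become available once the element type of the RDD is a pair or a Double.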
Resilient Distributed Dataset
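The RDD slide notes that a custom RDD can be implemented by overriding functions. The sketch below shows what that might look like for a toy RDD that produces the integers 0 until n. The class names RangeSlice and SimpleRangeRDD are hypothetical. Only the two mandatory pieces are overridden (the list of partitions and the function that computes each split); the optional partitioner and preferred locations are left at their defaults. It is written as a standalone script; pasting class definitions like these into spark-shell may run into serialization problems, so compiling them into an application is probably the safer route.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// One partition of the custom RDD: the half-open range [start, end).
class RangeSlice(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical custom RDD producing 0 until n in numSlices partitions.
// It has no parent RDDs, so its list of dependencies is Nil.
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  // "A list of partitions"
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      val start = (i.toLong * n / numSlices).toInt
      val end = ((i + 1).toLong * n / numSlices).toInt
      new RangeSlice(i, start, end)
    }

  // "A function for computing each split"
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val slice = split.asInstanceOf[RangeSlice]
    (slice.start until slice.end).iterator
  }
}

// App name and local[*] master are assumptions for running outside spark-shell.
val spark = SparkSession.builder().appName("custom-rdd").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val custom = new SimpleRangeRDD(sc, 10, 3)

// The custom RDD supports the usual operations once the two methods exist.
val total = custom.map(_ * 2).sum()
val sizes = custom.glom().map(_.length).collect().toList

println(total)
println(sizes)
```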
08: Cluster Deployment
Cluster Deployment
Talk time: 3 mins (max)
• Standalone Deploy Mode
– simplest way to deploy Spark on a private cluster
• Amazon EC2
– EC2 scripts are available
– Very quick to launch a new cluster
• Apache Mesos
• Hadoop YARN
Cluster Deployment
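From the application's point of view, the cluster manager is largely selected through the master URL. The sketch below maps three of the options above, plus local mode, to a master URL and sets it on a SparkConf. The masterUrl helper is hypothetical, and the host names are placeholders rather than real endpoints; the EC2 scripts are a launch tool and are not covered here.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper: maps a deployment choice to a master URL.
// Host names are placeholders, not real endpoints.
def masterUrl(manager: String): String = manager match {
  case "local"      => "local[*]"                 // single JVM, all cores
  case "standalone" => "spark://master-host:7077" // Spark standalone manager
  case "mesos"      => "mesos://mesos-host:5050"  // Apache Mesos
  case "yarn"       => "yarn"                     // Hadoop YARN, located via the Hadoop configuration
  case other        => throw new IllegalArgumentException(s"unknown cluster manager: $other")
}

val conf = new SparkConf().setAppName("deploy-demo").setMaster(masterUrl("standalone"))
println(conf.get("spark.master"))
```

In practice the master is often supplied when the application is submitted rather than hard-coded, which keeps the same jar usable on any of these cluster managers.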
09: Monitoring
Monitoring Spark with
WebUI
Talk time: 2 mins (max)
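A small sketch of how an application can point you at its own web UI and label its work so it is easier to find there. The job description text and the RDD name are made up for the example; the UI address is whatever Spark reports at runtime, typically port 4040 on the driver host when the UI is enabled.

```scala
import org.apache.spark.sql.SparkSession

// App name and local[*] master are assumptions for running outside spark-shell.
val spark = SparkSession.builder().appName("webui-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Address of this application's web UI, if the UI is enabled.
val ui: Option[String] = sc.uiWebUrl
println(ui.getOrElse("web UI is disabled"))

// Labels intended to make the work easier to spot in the UI.
sc.setJobDescription("count cached numbers")
val numbers = sc.parallelize(1 to 1000).setName("numbers").cache()

// The action runs a job, which should then be listed in the UI.
val total = numbers.count()
println(total)
```

Open the printed address in a browser while the application is still running; the UI goes away when the SparkContext stops.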
10: Hands-On Time
Hands-on with Spark-shell
and
Spark Notebook
Talk time: ~ mins (max)

Introduction to Apache Spark :: Lagos Scala Meetup session 2
