SlideShare a Scribd company logo
Introduction to Apache Spark
Olalekan Fuad Elesin, Data Engineer
https://twitter.com/elesinOlalekan
https://github.com/OElesin
https://www.linkedin.com/in/elesinolalekan
00: Getting Started
Introduction
Necessary downloads and installations
Intro: Achievements
By the end of this session, you will be comfortable
with the following:
• open a Spark Shell
• explore data sets loaded from HDFS, etc.
• review Spark SQL, Spark Streaming,
• use the Spark Notebook
• developer community resources, etc.
• return to workplace and demo use of Spark!
Intro: Preliminaries
I believe we all have
basic Scala programming skills
01: Getting Started
Installations
hands-on: 5 mins (max)
Installation:
Step 1: Install JDK 7/8 on MacOs or Windows or Linux
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads
Step 2: Download Spark 2.0.1 from this URL http://spark.apache.org/downlo
(for session, please copy all installations from the USB disk or hard drive )
Step 3: Run Spark Shell
We’ll run Spark’s interactive shell…
./bin/spark-shell
Let’s create some data from the “Scala” REPL prompt
val data = 1 to 100000
Step 4: Now, let’s create some RDD
val dataRDD = sc.parallelize(data)
then we filter out some
dataRDD.filter(_ < 35 ).collect()
Step 3: Run Spark Shell
We’ll run Spark’s interactive shell…
./bin/spark-shell
Let’s create some data from the “Scala” REPL prompt
val data = 1 to 100000
Step 4: Now, let’s create some RDD
val dataRDD = sc.parallelize(data)
then we filter out some
dataRDD.filter(_ < 35 ).collect()Check point 1
What was your result?
02: Why Spark
Why Spark
Talk time: 6 mins (max)
Why Spark
• Most machine learning algorithms are iterative
• A large number of computations on data are also iterative
• With Disk based approached in Hadoop MapReduce, each iteration is written to
disk. This makes process very slow
Input Data
on Disk
Tuples
(On Disk)
Tuples
(On Disk)
Tuples
(On Disk)
Output Data
on Disk
http://www.wiziq.com/blog/hype-around-apache-spark/
Input Data
on Disk
RDD1
(in memory)
RDD2
(in memory)
RDD3
(in memory)
Output Data
on Disk
Hadoop Execution Flow
Spark Execution Flow
03: About Apache Spark
About Apache Spark
Talk time: 4 mins (max)
About Apache Spark
• Initial started at UC Berkeley in 2009 as PhD thesis project by Matei Zarahia
• Fast and general purpose cluster computing system
• 10x (on disk) - 100x (In memory) faster than MapReduce
• Popular for running iterative machine learning algorithms, batch and streaming
computations on data, its SQL interface and data frames.
• Can also be used for Graph processing
• Provides high level API in:
• Scala
• Java
• Python
• R
• Seamless integration with Hadoop and its Ecosystem. Can also read data from a
number of existing data sources
• More info: http://spark.apache.org/
04: Spark Stack
Spark Built-in Libraries
Talk time: 4 mins (max)
Spark Stack
• Spark SQL lets you query
structured data inside Spark
programs, using either SQL or a
familiar DataFrame API.
• Spark Streaming lets you write
streaming jobs the same way you
write batch jobs.
• Spark MLlib & ML: Machine
learning algorithms
• Graphx unifies ETL, exploratory
analysis, and iterative graph
computation within a single system
05: Spark Execution Flow
Spark Execution Flow
Talk time: 4 mins (max)
Execution Flow
http://spark.apache.org/docs/latest/cluster-overview.html
Image Courtesy: Apache Spark Website
06: Terminology
BuzzWords !!!
Talk time: 6 mins (max)
Term Meaning
Application User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar A jar containing the user's Spark application. In some cases users will want to create an
"uber jar" containing their application along with its dependencies. The user's jar should
never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program The process running the main() function of the application and creating the SparkContext
Cluster manager An external service for acquiring resources on the cluster (e.g. standalone manager,
Mesos, YARN)
Deploy mode Distinguishes where the driver process runs. In "cluster" mode, the framework launches
the driver inside of the cluster. In "client" mode, the submitter launches the driver outside
of the cluster.
Worker node Any node that can run application code in the cluster
Executor A process launched for an application on a worker node, that runs tasks and keeps data in
memory or disk storage across them. Each application has its own executors.
Task A unit of work that will be sent to one executor
Job A parallel computation consisting of multiple tasks that gets spawned in response to a
Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage Each job gets divided into smaller sets of tasks called stages that depend on each other
(similar to the map and reduce stages in MapReduce); you'll see this term used in the
driver's logs.
07: Resilient Distributed Dataset
RDD !!!
Talk time: 4 mins (max)
• Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark
• Immutable, Partitioned collection of elements that can be operated in parallel
• Basic Operations
– map
– filter
– persist
• Multiple Implementation
– PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join
– DoubleRDDFunctions : Operation related to double values
– SequenceFileRDDFunctions : Operation related to SequenceFiles
• RDD main characteristics:
– A list of partitions – A function for computing each split
– A list of dependencies on other RDDs
– Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash partitioned)
– Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)
• Custom RDD can be also implemented (by overriding functions)
Resilient Distributed Dataset
08: Cluster Deployment
Cluster Deployment
Talk time: 3 mins (max)
• Standalone Deploy Mode
– simplest way to deploy Spark on a private cluster
• Amazon EC2
– EC2 scripts are available
– Very quick launching a new cluster
• Apache Mesos
• Hadoop YARN
Cluster Deployment
09: Monitoring
Monitoring Spark with
WebUI
Talk time: 2 mins (max)
09: Hand-Ons Time
Hands-on with Spark-shell
and
Spark Notebook
Talk time: ~ mins (max)

More Related Content

What's hot

Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
Sigmoid
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
Evan Chan
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Spark Summit
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Apache Spark
Apache SparkApache Spark
Apache Spark
Uwe Printz
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
UserReport
 
Spark core
Spark coreSpark core
Spark core
Freeman Zhang
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
Sigmoid
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Spark Summit
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
DataWorks Summit
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
Martin Zapletal
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
Alex Moundalexis
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
Spark Summit
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 

What's hot (20)

Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Spark core
Spark coreSpark core
Spark core
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 

Similar to Introduction to Apache Spark :: Lagos Scala Meetup session 2

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
DataWorks Summit
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
Frank Schroeter
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
Universiti Technologi Malaysia (UTM)
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
Saptak Sen
 

Similar to Introduction to Apache Spark :: Lagos Scala Meetup session 2 (20)

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Spark core
Spark coreSpark core
Spark core
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 

More from Olalekan Fuad Elesin

How to go from Idea to First Customer
How to go from Idea to First CustomerHow to go from Idea to First Customer
How to go from Idea to First Customer
Olalekan Fuad Elesin
 
Platform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterprisePlatform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterprise
Olalekan Fuad Elesin
 
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data
Olalekan Fuad Elesin
 
Graphs
GraphsGraphs
Predictive Analytics for Non-programmers
Predictive Analytics for Non-programmersPredictive Analytics for Non-programmers
Predictive Analytics for Non-programmers
Olalekan Fuad Elesin
 
CV Olalekan Elesin (Spanish)
CV Olalekan Elesin (Spanish)CV Olalekan Elesin (Spanish)
CV Olalekan Elesin (Spanish)
Olalekan Fuad Elesin
 

More from Olalekan Fuad Elesin (6)

How to go from Idea to First Customer
How to go from Idea to First CustomerHow to go from Idea to First Customer
How to go from Idea to First Customer
 
Platform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterprisePlatform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterprise
 
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data
 
Graphs
GraphsGraphs
Graphs
 
Predictive Analytics for Non-programmers
Predictive Analytics for Non-programmersPredictive Analytics for Non-programmers
Predictive Analytics for Non-programmers
 
CV Olalekan Elesin (Spanish)
CV Olalekan Elesin (Spanish)CV Olalekan Elesin (Spanish)
CV Olalekan Elesin (Spanish)
 

Recently uploaded

Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 

Recently uploaded (20)

Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 

Introduction to Apache Spark :: Lagos Scala Meetup session 2

  • 1. Introduction to Apache Spark Olalekan Fuad Elesin, Data Engineer https://twitter.com/elesinOlalekan https://github.com/OElesin https://www.linkedin.com/in/elesinolalekan
  • 2. 00: Getting Started Introduction Necessary downloads and installations
  • 3. Intro: Achievements By the end of this session, you will be comfortable with the following: • open a Spark Shell • explore data sets loaded from HDFS, etc. • review Spark SQL, Spark Streaming, • use the Spark Notebook • developer community resources, etc. • return to workplace and demo use of Spark!
  • 4. Intro: Preliminaries I believe we all have basic Scala programming skills
  • 6. Installation: Step 1: Install JDK 7/8 on MacOs or Windows or Linux http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads Step 2: Download Spark 2.0.1 from this URL http://spark.apache.org/downlo (for session, please copy all installations from the USB disk or hard drive ) Step 3: Run Spark Shell We’ll run Spark’s interactive shell… ./bin/spark-shell Let’s create some data from the “Scala” REPL prompt val data = 1 to 100000 Step 4: Now, let’s create some RDD val dataRDD = sc.parallelize(data) then we filter out some dataRDD.filter(_ < 35 ).collect()
  • 7. Step 3: Run Spark Shell We’ll run Spark’s interactive shell… ./bin/spark-shell Let’s create some data from the “Scala” REPL prompt val data = 1 to 100000 Step 4: Now, let’s create some RDD val dataRDD = sc.parallelize(data) then we filter out some dataRDD.filter(_ < 35 ).collect()Check point 1 What was your result?
  • 8. 02: Why Spark Why Spark Talk time: 6 mins (max)
  • 9. Why Spark • Most machine learning algorithms are iterative • A large number of computations on data are also iterative • With Disk based approached in Hadoop MapReduce, each iteration is written to disk. This makes process very slow Input Data on Disk Tuples (On Disk) Tuples (On Disk) Tuples (On Disk) Output Data on Disk http://www.wiziq.com/blog/hype-around-apache-spark/ Input Data on Disk RDD1 (in memory) RDD2 (in memory) RDD3 (in memory) Output Data on Disk Hadoop Execution Flow Spark Execution Flow
  • 10. 03: About Apache Spark About Apache Spark Talk time: 4 mins (max)
  • 11. About Apache Spark • Initial started at UC Berkeley in 2009 as PhD thesis project by Matei Zarahia • Fast and general purpose cluster computing system • 10x (on disk) - 100x (In memory) faster than MapReduce • Popular for running iterative machine learning algorithms, batch and streaming computations on data, its SQL interface and data frames. • Can also be used for Graph processing • Provides high level API in: • Scala • Java • Python • R • Seamless integration with Hadoop and its Ecosystem. Can also read data from a number of existing data sources • More info: http://spark.apache.org/
  • 12. 04: Spark Stack Spark Built-in Libraries Talk time: 4 mins (max)
  • 13. Spark Stack • Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. • Spark Streaming lets you write streaming jobs the same way you write batch jobs. • Spark MLlib & ML: Machine learning algorithms • Graphx unifies ETL, exploratory analysis, and iterative graph computation within a single system
  • 14. 05: Spark Execution Flow Spark Execution Flow Talk time: 4 mins (max)
  • 17. Term Meaning Application User program built on Spark. Consists of a driver program and executors on the cluster. Application jar A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime. Driver program The process running the main() function of the application and creating the SparkContext Cluster manager An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN) Deploy mode Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. Worker node Any node that can run application code in the cluster Executor A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. Task A unit of work that will be sent to one executor Job A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. Stage Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
  • 18. 07: Resilient Distributed Dataset RDD !!! Talk time: 4 mins (max)
  • 19. • Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark • Immutable, Partitioned collection of elements that can be operated in parallel • Basic Operations – map – filter – persist • Multiple Implementation – PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join – DoubleRDDFunctions : Operation related to double values – SequenceFileRDDFunctions : Operation related to SequenceFiles • RDD main characteristics: – A list of partitions – A function for computing each split – A list of dependencies on other RDDs – Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash partitioned) – Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) • Custom RDD can be also implemented (by overriding functions) Resilient Distributed Dataset
  • 20. 08: Cluster Deployment Cluster Deployment Talk time: 3 mins (max)
  • 21. • Standalone Deploy Mode – simplest way to deploy Spark on a private cluster • Amazon EC2 – EC2 scripts are available – Very quick launching a new cluster • Apache Mesos • Hadoop YARN Cluster Deployment
  • 22. 09: Monitoring Monitoring Spark with WebUI Talk time: 2 mins (max)
  • 23. 09: Hand-Ons Time Hands-on with Spark-shell and Spark Notebook Talk time: ~ mins (max)