Introduction to Spark
Sriram and Amritendu
DOS Lab, IIT Madras
“Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons
Attribution 4.0 International License.
Motivation
• In Hadoop, the programmer writes a job using the MapReduce abstraction
• The runtime distributes work and handles fault tolerance
Makes analysis of large data sets easy and reliable
Emerging Class of Applications
Machine learning
• K-means clustering, etc.
Graph Algorithms
• PageRank, etc.
Nature of the emerging class of applications
• Iterative computation
• Intermediate results are reused across multiple computations
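Problem with Hadoop MapReduce
[Figure: iteration 1 reads (R) its input from HDFS and writes (W) its results to HDFS; iteration 2 reads those results from HDFS and writes its own results back to HDFS]
• Results are written to HDFS
• A new job is launched for each iteration
Incurs substantial storage and job launch overheads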
Can we do away with these overheads?
• Persist intermediate results in memory
• Memory is 10-100X faster than disk/network
[Figure: input is loaded (L) from HDFS; iterations 1 and 2 read (R) and write (W) intermediate results in memory; one node is marked as failed (X)]
What if a node fails?
Challenge: how to handle faults efficiently?
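Approaches to handle faults
• Replication
– Can tolerate ‘r-1’ failures
[Figure: a master (M) with Replica 1 and Replica 2, serving writes (W) and reads (R); one node is marked as failed (X)]
Issues:
– Requires more storage
– More network traffic
• Using Lineage
– Log the operation
– Re-compute lost partitions using lineage information
[Figure: lineage graph over partitions D1, D2, D3 and C1, C2; one partition is lost (X) and is re-computed from D2, D3 and C2; part of the graph is labelled "Wide dependencies"]
Issues:
Recovery time can be high if re-computation is very costly
– high iteration time
– wide dependencies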
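The lineage approach can be illustrated with a toy model in plain Scala (no Spark involved). The names `Dataset`, `Source`, `Mapped`, `compute` and `simulateFailure` are hypothetical and exist only in this sketch. Each derived dataset logs its parent and the operation that produced it, so a lost in-memory result is rebuilt by replaying that operation instead of being restored from a replica.

```scala
// Toy model of lineage-based recovery; this is not how Spark is implemented.
object LineageSketch {
  // A dataset is either stable input or the logged result of an operation on a parent.
  sealed trait Dataset { def compute(): Vector[Int] }

  // Stable storage that is always available (stands in for a file on HDFS).
  final case class Source(records: Vector[Int]) extends Dataset {
    def compute(): Vector[Int] = records
  }

  // Derived data: only the parent and the operation are logged.
  final case class Mapped(parent: Dataset, op: Int => Int) extends Dataset {
    private var cached: Option[Vector[Int]] = None // in-memory copy, may be lost

    def compute(): Vector[Int] = cached.getOrElse {
      val rebuilt = parent.compute().map(op) // replay the logged operation
      cached = Some(rebuilt)
      rebuilt
    }

    // Drop the in-memory copy, as a node failure would.
    def simulateFailure(): Unit = { cached = None }
  }

  val d1 = Source(Vector(1, 2, 3))
  val c1 = Mapped(d1, _ * 10)
}
```

Calling `c1.simulateFailure()` and then `c1.compute()` again returns the same records, at the price of re-running the operation. That price is what makes recovery slow when the recomputation is costly.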
Spark
• RDD – Resilient Distributed Datasets
– Read-only, partitioned collection of records
– Supports only coarse-grained operations
• e.g. map and group-by transformations, reduce action
– Uses lineage graph to recover from faults
[Figure: an RDD split into 3 partitions: D11, D12, D13]
Spark contd.
• Control placement of partitions of RDD
– can specify number of partitions
– can partition based on a key in each record
• useful in joins
• In-memory storage
– Up to 100X speedup over Hadoop for iterative
applications
• Spark can run on Hadoop YARN and read files
from HDFS
• Spark is written in Scala
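Why partitioning on a key is useful in joins can be seen in a small plain-Scala sketch (this is not Spark code). `partitionByKey` and `joinPartitions` are hypothetical helpers, and the sample records are made up. If both datasets are split with the same hash function and the same number of partitions, records with equal keys land in the partition with the same index, so each pair of partitions can be joined locally.

```scala
object PartitionSketch {
  // Assign each (key, value) record to a partition by hashing its key.
  def partitionByKey[V](records: Seq[(String, V)], numPartitions: Int): Vector[Seq[(String, V)]] = {
    val grouped = records.groupBy { case (k, _) => Math.floorMod(k.hashCode, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[(String, V)]))
  }

  // Join two co-partitioned datasets one pair of partitions at a time.
  def joinPartitions[A, B](left: Vector[Seq[(String, A)]],
                           right: Vector[Seq[(String, B)]]): Seq[(String, (A, B))] =
    left.zip(right).flatMap { case (l, r) =>
      for ((k1, a) <- l; (k2, b) <- r if k1 == k2) yield (k1, (a, b))
    }

  // Made-up sample data.
  val users  = Seq("u1" -> "Asha", "u2" -> "Ravi", "u3" -> "Meena")
  val visits = Seq("u1" -> 3, "u3" -> 7)

  val joined = joinPartitions(partitionByKey(users, 2), partitionByKey(visits, 2))
}
```

No record has to be compared with a record from a partition of a different index, which is the property a distributed join can exploit to avoid moving data between nodes.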
Scala overview
• Functional programming meets object orientation
• “No side effects” aids concurrent programming
• Every value is an object
• Every function is a value
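The last two bullets can be checked directly in a few lines. This is a minimal sketch; the names `three`, `text`, `inc`, `four` and `incTwice` are chosen only for illustration.

```scala
object ValuesSketch {
  // Every value is an object: even an Int has methods.
  val three: Int = (1).+(2) // same as 1 + 2
  val text: String = 42.toString

  // Every function is a value: it can be stored, passed around, and called via apply.
  val inc: Int => Int = x => x + 1
  val four: Int = inc.apply(3) // same as inc(3)
  val incTwice: Int => Int = inc.andThen(inc)
}
```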
Variables and Functions
var obj: java.lang.String = "Hello"
var x = new A()   // A is a class defined elsewhere
def square(x: Int): Int = {   // the ": Int" after the parameter list is the return type
  x * x
}
Execution of a function
def square(x: Int): Int = {
  x * x
}
scala> square(2)
res0: Int = 4
scala> square(square(6))
res1: Int = 1296
Nested Functions
def factorial(i: Int): Int = {
  def fact(i: Int, acc: Int): Int = {
    if (i <= 1)
      acc
    else
      fact(i - 1, i * acc)
  }
  fact(i, 1)
}
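The inner helper is tail recursive: the recursive call is the last thing it does, so the accumulator carries the partial product forward. A hedged variant is sketched below. It uses the standard `scala.annotation.tailrec` annotation to have the compiler verify this, and it switches to `BigInt`, which is a choice of this sketch rather than of the slides, because an `Int` result overflows for arguments above 12.

```scala
import scala.annotation.tailrec

object FactorialSketch {
  def factorial(i: Int): BigInt = {
    // The compiler rejects @tailrec if the recursive call is not in tail position.
    @tailrec
    def fact(i: Int, acc: BigInt): BigInt =
      if (i <= 1) acc
      else fact(i - 1, acc * i)

    fact(i, BigInt(1))
  }
}
```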
Higher-order functions: map
val add = (x: Int) => x + 1
val lst = List(1, 2, 3)
lst.map(add)          // List(2, 3, 4)
lst.map(x => x + 1)   // List(2, 3, 4)
lst.map(_ + 1)        // List(2, 3, 4)
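`map` is one of several higher-order methods on Scala collections. The short sketch below shows two more, `filter` and `reduce`, together with a user-defined higher-order function. `applyTwice` is a hypothetical name introduced only for this example.

```scala
object HigherOrderSketch {
  // A higher-order function: it takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  val lst = List(1, 2, 3)
  val evens = lst.filter(_ % 2 == 0) // keep the elements that satisfy a predicate
  val total = lst.reduce(_ + _)      // combine the elements pairwise
  val bumped = applyTwice(_ + 1, 5)  // applies the increment two times
}
```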
Defining Objects
object Example {
  def main(args: Array[String]): Unit = {
    // sc (a SparkContext) and logFile (an input path) are assumed to be defined
    val logData = sc.textFile(logFile, 2).cache()
    // ...
  }
}
Example.main(Array("master", "noOfMap", "noOfReducer"))
Spark: Filter transformation in RDD
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
Give me those lines which contain ‘a’
Input (logData):
Here is a example of filter
Transformation, you can
notice that the filter method
will be applied on each line
and return a new RDD
test
Output (numAs):
Here is a example of filter
Transformation, you can
notice that the filter method
will be applied on each line
and return a new RDD
Count
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
numAs.count()   // returns 5
Input (logData):
Here is a example of filter
Transformation, you can
notice that the filter method
will be applied on each line
and return a new RDD
test
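The result of 5 can be reproduced without a cluster by running the same predicate over an ordinary Scala `List` holding the six input lines. This is a local stand-in, not Spark code. On an RDD, `filter` is a lazy transformation and `count` is the action that triggers the computation, whereas a `List` is filtered eagerly.

```scala
object FilterCountSketch {
  // The six input lines shown on the slide.
  val lines = List(
    "Here is a example of filter",
    "Transformation, you can",
    "notice that the filter method",
    "will be applied on each line",
    "and return a new RDD",
    "test"
  )

  // Keep only the lines that contain the letter "a".
  val withA = lines.filter(line => line.contains("a"))

  // Local counterpart of numAs.count().
  val numAs = withA.size
}
```

Only the line "test" is dropped, since it is the only one without an "a".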
flatMap
val logData = sc.textFile(logFile, 2).cache()
val words = logData.flatMap(line => line.split(" "))
Take each line, split it on spaces, and flatten the resulting arrays into a single collection of words
"Here is a example of filter map" → (Here, is, a, example, of, filter, map)
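The difference between `map` and `flatMap` is easiest to see side by side on a local `List`. This sketch uses plain Scala collections rather than an RDD, and the second input line, "test line", is made up for the example.

```scala
object FlatMapSketch {
  val lines = List("Here is a example of filter map", "test line")

  // map keeps one output element per input line: a list of arrays.
  val nested: List[Array[String]] = lines.map(line => line.split(" "))

  // flatMap concatenates those arrays into a single list of words.
  val words: List[String] = lines.flatMap(line => line.split(" "))
}
```

Here `nested` has one entry per line, while `words` has one entry per word.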
Wordcount Example in Spark
// [sparkHome] and [jars] are optional arguments
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://[input_path_to_textfile]")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://[output_path]")
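The same pipeline can be traced on a local collection to see what each stage produces. This is a plain-Scala stand-in, not Spark code: the input lines are made up, and `groupBy` followed by a per-key `reduce` plays the role that `reduceByKey` plays on an RDD.

```scala
object WordCountSketch {
  // Made-up input standing in for the text file.
  val lines = List("to be or not to be", "to do")

  val counts: Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // split the lines into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy { case (word, _) => word } // bring equal keys together
      .map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) } // sum per key
}
```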
Limitations
• RDDs are not suitable for applications that
require fine-grained updates
– e.g. web storage system
DOS Lab, IIT Madras

More Related Content

What's hot

Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Introduction To R Language
Introduction To R LanguageIntroduction To R Language
Introduction To R LanguageGaurang Dobariya
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O Sri Ambati
 
Alerting mechanism and algorithms introduction
Alerting mechanism and algorithms introductionAlerting mechanism and algorithms introduction
Alerting mechanism and algorithms introductionFEG
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupSri Ambati
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14Sri Ambati
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
R basics
R basicsR basics
R basicsFAO
 

What's hot (20)

Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Introduction To R Language
Introduction To R LanguageIntroduction To R Language
Introduction To R Language
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Alerting mechanism and algorithms introduction
Alerting mechanism and algorithms introductionAlerting mechanism and algorithms introduction
Alerting mechanism and algorithms introduction
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Spark
SparkSpark
Spark
 
4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Scalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven ApplicationsScalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven Applications
 
R basics
R basicsR basics
R basics
 
Chapter 10 ds
Chapter 10 dsChapter 10 ds
Chapter 10 ds
 
Ch13
Ch13Ch13
Ch13
 

Viewers also liked

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Ortho Molecular Product ortho Biotic
Ortho Molecular Product ortho BioticOrtho Molecular Product ortho Biotic
Ortho Molecular Product ortho BioticDyna Smith
 
2016 February Announcements
2016 February Announcements2016 February Announcements
2016 February AnnouncementsWayne Irwin
 
Shawn 1 30-13a short
Shawn 1 30-13a shortShawn 1 30-13a short
Shawn 1 30-13a shortsunger01
 
Hurricane Katrina - America's Most Destructive Hurricane
Hurricane Katrina - America's Most Destructive HurricaneHurricane Katrina - America's Most Destructive Hurricane
Hurricane Katrina - America's Most Destructive HurricanePeter Killcommons
 
페이스북개발 트렌드 130313
페이스북개발 트렌드 130313페이스북개발 트렌드 130313
페이스북개발 트렌드 130313Seong Whan Park
 
திறம்பட கற்றல்
திறம்பட கற்றல்திறம்பட கற்றல்
திறம்பட கற்றல்Kaviarasi Selvaraju
 
Motivational quotations
Motivational quotations Motivational quotations
Motivational quotations Sarwan Singh
 
Atividades
AtividadesAtividades
Atividadesblog2012
 
Análisis de de textos revisados en la construcción de la historia del arte de...
Análisis de de textos revisados en la construcción de la historia del arte de...Análisis de de textos revisados en la construcción de la historia del arte de...
Análisis de de textos revisados en la construcción de la historia del arte de...cediel1952
 
A Historical Glimpse at Jerusalem’s Western Wall
A Historical Glimpse at Jerusalem’s Western WallA Historical Glimpse at Jerusalem’s Western Wall
A Historical Glimpse at Jerusalem’s Western WallLeib Tropper
 
Исследование производной
Исследование производнойИсследование производной
Исследование производнойagafonovalv
 
Animal classification based on Job 39
Animal classification based on Job 39Animal classification based on Job 39
Animal classification based on Job 39Kathy Page-Applebee
 
Student induction 2013-14
Student induction 2013-14Student induction 2013-14
Student induction 2013-14doogstone
 
2013 03-08 [開発中] node-sacloud
2013 03-08 [開発中] node-sacloud2013 03-08 [開発中] node-sacloud
2013 03-08 [開発中] node-sacloudYuki KAN
 

Viewers also liked (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
1082016
10820161082016
1082016
 
Ortho Molecular Product ortho Biotic
Ortho Molecular Product ortho BioticOrtho Molecular Product ortho Biotic
Ortho Molecular Product ortho Biotic
 
2016 February Announcements
2016 February Announcements2016 February Announcements
2016 February Announcements
 
Shawn 1 30-13a short
Shawn 1 30-13a shortShawn 1 30-13a short
Shawn 1 30-13a short
 
Hurricane Katrina - America's Most Destructive Hurricane
Hurricane Katrina - America's Most Destructive HurricaneHurricane Katrina - America's Most Destructive Hurricane
Hurricane Katrina - America's Most Destructive Hurricane
 
페이스북개발 트렌드 130313
페이스북개발 트렌드 130313페이스북개발 트렌드 130313
페이스북개발 트렌드 130313
 
திறம்பட கற்றல்
திறம்பட கற்றல்திறம்பட கற்றல்
திறம்பட கற்றல்
 
LandReformsekta
LandReformsektaLandReformsekta
LandReformsekta
 
Rupee & dollar
Rupee & dollarRupee & dollar
Rupee & dollar
 
Motivational quotations
Motivational quotations Motivational quotations
Motivational quotations
 
Atividades
AtividadesAtividades
Atividades
 
Análisis de de textos revisados en la construcción de la historia del arte de...
Análisis de de textos revisados en la construcción de la historia del arte de...Análisis de de textos revisados en la construcción de la historia del arte de...
Análisis de de textos revisados en la construcción de la historia del arte de...
 
A Historical Glimpse at Jerusalem’s Western Wall
A Historical Glimpse at Jerusalem’s Western WallA Historical Glimpse at Jerusalem’s Western Wall
A Historical Glimpse at Jerusalem’s Western Wall
 
Исследование производной
Исследование производнойИсследование производной
Исследование производной
 
Animal classification based on Job 39
Animal classification based on Job 39Animal classification based on Job 39
Animal classification based on Job 39
 
Student induction 2013-14
Student induction 2013-14Student induction 2013-14
Student induction 2013-14
 
2013 03-08 [開発中] node-sacloud
2013 03-08 [開発中] node-sacloud2013 03-08 [開発中] node-sacloud
2013 03-08 [開発中] node-sacloud
 

Similar to Introduction to Spark

Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...Bikash Chandra Karmokar
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache SparkMarcoYuriFujiiMelo
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark Summit
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Sameer Farooqui
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 

Similar to Introduction to Spark (20)

SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
RDD
RDDRDD
RDD
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Introduction to Spark

  • 1. Introduction to Spark Sriram and Amritendu DOS Lab, IIT Madras “Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons Attribution 4.0 International License.
  • 2. Motivation • In Hadoop, programmer writes job using Map Reduce abstraction • Runtime distributes work and handles fault-tolerance Makes analysis of large-data sets easy and reliable Emerging Class of Applications Machine learning • K-means clustering . . Graph Algorithms • Page-rank . . DOS Lab, IIT Madras
  • 3. Intermediate results are reused across multiple computations Nature of the emerging class of applications Iterative Computation DOS Lab, IIT Madras
  • 4. Problem with Hadoop MapReduce HDFS R R R Iteration 1W W W HDFS R R R HDFS W W W Iteration 2 Results are written to HDFS New job is launched for each iteration Incurs substantial storage and job launch overheads DOS Lab, IIT Madras
  • 5. Can we do away with these overheads? Persist intermediate results in memory What if a node fails? HDFS L L L Iteration 1 Memory is 10-100X faster than disk/network Iteration 2 X Challenge: how to handle faults efficiently? W R R R W W W W W RR R DOS Lab, IIT Madras
  • 6. Approaches to handle faults • Replication Issues: – Requires more storage – More network traffic – Log the operation – Re-compute lost partitions using lineage information Master W M R Replica 1 R Replica 2 X Can tolerate ‘r-1’ failures • Using Lineage D1 D2 D3 C1 C2 X D2 D3 C2 Issues: Recovery time can be high if re- computation is very costly – high iteration time – wide dependencies Wide dependencies DOS Lab, IIT Madras
  • 7. Spark
      • RDD – Resilient Distributed Datasets
          – Read-only, partitioned collection of records
          – Supports only coarse-grained operations
              • e.g. map and group-by transformations, reduce action
          – Uses lineage graph to recover from faults
      [Figure: an RDD with 3 partitions, D11, D12 and D13]
  • 8. Spark contd.
      • Control placement of partitions of an RDD
          – can specify number of partitions
          – can partition based on a key in each record
              • useful in joins
      • In-memory storage
          – Up to 100X speedup over Hadoop for iterative applications
      • Spark can run on Hadoop YARN and read files from HDFS
      • Spark is written in Scala
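Why key-based partitioning helps joins can be sketched with a simple hash scheme in plain Scala. The names `partitionOf` and `partitionByKey` are hypothetical, and this is only an illustration of the idea, not Spark's partitioner.

```scala
object PartitionSketch {
  // Map a key to a partition index in 0 until numPartitions (non-negative hash modulo).
  def partitionOf(key: String, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Group key-value records by the partition their key maps to.
  def partitionByKey[V](records: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    records.groupBy { case (key, _) => partitionOf(key, numPartitions) }
}
```

If two datasets are partitioned with the same function and the same number of partitions, records with equal keys land in the same partition index, so a join can match them partition by partition.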
  • 9. Scala overview
      • Functional programming meets object orientation
      • “No side effects” aids concurrent programming
      • Every value is an object
      • Every function is a value
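  • 10. Variables and Functions
      var obj: java.lang.String = "Hello"   // explicit type annotation
      var x = new A()                       // type inferred; A is a class defined elsewhere
      def square(x: Int): Int = { x * x }   // the Int after the parameter list is the return type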
  • 11. Execution of a function
      def square(x: Int): Int = { x * x }
      scala> square(2)
      res0: Int = 4
      scala> square(square(6))
      res1: Int = 1296
  • 12. Nested Functions
      def factorial(i: Int): Int = {
        // fact is visible only inside factorial; acc carries the running product
        def fact(i: Int, acc: Int): Int = {
          if (i <= 1) acc
          else fact(i - 1, i * acc)
        }
        fact(i, 1)
      }
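The inner function on this slide calls itself in tail position, so it can be annotated with `@tailrec` from the standard library; the compiler then rejects the code if the recursion cannot be turned into a loop. A minimal variant, with `BigInt` swapped in (an assumption of this sketch, not on the slide) because `Int` overflows from 13! onward:

```scala
import scala.annotation.tailrec

object FactorialSketch {
  def factorial(i: Int): BigInt = {
    // Accumulator-passing helper; the recursive call is the last operation.
    @tailrec
    def fact(i: Int, acc: BigInt): BigInt =
      if (i <= 1) acc else fact(i - 1, acc * BigInt(i))
    fact(i, BigInt(1))
  }
}
```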
  • 14. Higher order map functions
      val add = (x: Int) => x + 1
      val lst = List(1, 2, 3)
      lst.map(add)          // List(2, 3, 4)
      lst.map(x => x + 1)   // List(2, 3, 4)
      lst.map(_ + 1)        // List(2, 3, 4)
  • 15. Defining Objects
      object Example {
        def main(args: Array[String]): Unit = {
          // sc (a SparkContext) and logFile (a path) are defined elsewhere
          val logData = sc.textFile(logFile, 2).cache()
          // ...
        }
      }
      Example.main(Array("master", "noOfMap", "noOfReducer"))
  • 16. Spark: Filter transformation in RDD
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = logData.filter(line => line.contains("a"))
      The filter is applied to each line and returns a new RDD holding only the lines that contain ‘a’.
      [Figure: the input file holds the text “Here is a example of filter Transformation, you can notice that the filter method will be applied on each line and return a new RDD” followed by the line “test”; the output RDD holds the same text without “test”]
  • 17. Count
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = logData.filter(line => line.contains("a"))
      numAs.count()   // returns 5 for the slide's sample input
      [Figure: the sample input text “Here is a example of filter Transformation, you can notice that the filter method will be applied on each line and return a new RDD” followed by the line “test”]
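The filter and count steps can be tried without a cluster on an ordinary Scala `List`. This is a local analogue, not Spark code: the `lines` below are hypothetical sample data, and a `List` is filtered eagerly, whereas an RDD `filter` is a lazy transformation that only runs when an action such as `count()` is called.

```scala
object FilterCountSketch {
  // Hypothetical sample lines, standing in for the contents of logFile.
  val lines: List[String] = List(
    "Here is a example of filter",
    "transformation",
    "test"
  )

  // Same predicate as on the slide: keep the lines that contain "a".
  val withA: List[String] = lines.filter(line => line.contains("a"))

  // Local counterpart of numAs.count().
  val numAs: Int = withA.size
}
```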
  • 18. flatMap
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = logData.flatMap(line => line.split(" "))
      Takes each line, splits it on spaces, and flattens the resulting arrays into a single RDD of words.
      Input: “Here is a example of filter map”
      Output: (Here, is, a, example, of, filter, map)
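The difference between `map` and `flatMap` is easiest to see side by side on a local `List`. A small sketch with hypothetical sample lines:

```scala
object FlatMapSketch {
  val lines: List[String] = List("Here is a example", "of filter map")

  // map keeps one result per line: a list of word lists.
  val nested: List[List[String]] = lines.map(line => line.split(" ").toList)

  // flatMap concatenates the per-line results into one flat list of words.
  val words: List[String] = lines.flatMap(line => line.split(" ").toList)
}
```

Here `nested` has two elements (one per line), while `words` has seven (one per word).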
  • 19. Wordcount Example in Spark
      // bracketed constructor arguments are optional
      val spark = new SparkContext(master, appName, [sparkHome], [jars])
      val file = spark.textFile("hdfs://[input_path_to_textfile]")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://[output_path]")
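The same pipeline can be mimicked on local Scala collections to check the logic before running it on a cluster. This is an analogue under stated assumptions: `wordCount` is a hypothetical helper, `groupBy` plus a per-key `reduce` stands in for `reduceByKey`, and the empty-string filter is an addition not on the slide.

```scala
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").toSeq)   // one element per word
      .filter(_.nonEmpty)                        // drop blanks from empty lines or double spaces
      .map(word => (word, 1))                    // pair each word with a count of 1
      .groupBy { case (word, _) => word }        // gather pairs that share a word
      .map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) }  // sum per word
}
```

For example, `WordCountSketch.wordCount(Seq("a b a", "b c"))` yields a map with `a -> 2`, `b -> 2` and `c -> 1`.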
  • 20. Limitations
      • RDDs are not suitable for applications that require fine-grained updates
          – e.g. a storage system for a web application
  • 21. References
      • http://www.slideshare.net/tpunder/a-brief-intro-to-scala
      • Joshua D. Suereth, “Scala in Depth”
      • Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing”, in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12), USENIX Association, Berkeley, CA, USA, 2012.
      • Pictures:
          – http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
          – http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg