Spark : Cluster Computing
with Working Sets
1
Matei Zaharia, Mosharaf Chowdhury,
Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
This paper focuses on one such class of
applications: those that reuse a working
set of data across multiple parallel
operations. This includes many iterative
machine learning algorithms, as well as
interactive data analysis tools. We
propose a new framework called Spark that
supports these applications while retaining
the scalability and fault tolerance of
MapReduce, built around an abstraction
called resilient distributed datasets
(RDDs)
2
Introduction
A class of applications that reuse a
working set of data across multiple
parallel operations. This includes two use cases
where we have seen Hadoop users report
that MapReduce is deficient:
• iterative jobs
• interactive analytics
3
Programming Model
4
Programming Model
• Driver Program: implements the
high-level control flow of the
application and launches various
operations in parallel.
5
Resilient distributed datasets
(RDDs)
Represented by a Scala object and constructed in four ways:
• From a file in a shared filesystem (e.g., HDFS)
• By “parallelizing” a Scala collection (e.g., an
array) in the driver program
• By “transforming” an existing distributed
dataset (e.g., using a user-provided function
A => List[B])
• By “changing” the persistence of an existing RDD
- the cache action
- the save action
6
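The four construction paths above can be sketched with the paper-era Scala API. This is an illustrative sketch, not the paper's own listing: `spark` stands for the cluster connection object, and the path and method names are assumptions based on the text.

```scala
// Four ways to construct an RDD (paper-era API, names illustrative)
val lines  = spark.textFile("hdfs://...")           // 1. from a file in a shared filesystem
val nums   = spark.parallelize(Array(1, 2, 3, 4))   // 2. by parallelizing a driver collection
val words  = lines.flatMap(_.split(" "))            // 3. by transforming an existing dataset (A => List[B])
val cached = lines.cache()                          // 4. by changing persistence (cache / save)
```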
Parallel operations
• reduce: Combines dataset elements using
an associative function to produce a
result at the driver program.
• collect: Sends all elements of the
dataset to the driver program.
• foreach: Passes each element through a
user-provided function.
7
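The three operations can be sketched on a small dataset. Again an illustrative sketch of the paper-era API, with `spark` as the assumed cluster connection object:

```scala
val nums  = spark.parallelize(Array(1, 2, 3, 4))
val total = nums.reduce((a, b) => a + b)  // combine elements into one result at the driver (10)
val local = nums.collect()                // ship every element back to the driver program
nums.foreach(n => println(n))             // run a user-provided function on each element, on the workers
```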
Shared Variables
Broadcast variables: Used for large
read-only data (e.g., a lookup table)
shared across multiple parallel operations
Accumulators: Variables that workers can
only “add” to using an associative
operation, and that only the driver program
can read
8
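A minimal sketch of both shared-variable types, assuming the paper-era API (`spark` is the cluster connection object; the lookup table and log file are invented for illustration):

```scala
// Broadcast variable: ship a read-only lookup table to each worker once
val lookup = spark.broadcast(Map("a" -> 1, "b" -> 2))
val lines  = spark.textFile("hdfs://...")
val ids    = lines.map(w => lookup.value.getOrElse(w, 0))

// Accumulator: workers may only add; the driver alone reads the result
val errors = spark.accumulator(0)
lines.foreach(line => if (line.contains("ERROR")) errors += 1)
println(errors.value)  // legal only in the driver program
```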
Examples
9
Text Search
Suppose that we wish to count the lines
containing errors in a large log file
stored in HDFS. This can be implemented by
starting with a file dataset object as
follows:
10
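The listing on this slide, reproduced approximately from the paper (`spark` is the cluster connection object; the HDFS path is elided as in the original):

```scala
val file  = spark.textFile("hdfs://...")           // file dataset object backed by HDFS
val errs  = file.filter(_.contains("ERROR"))       // lazily derived dataset of error lines
val ones  = errs.map(_ => 1)                       // map each error line to the value 1
val count = ones.reduce(_ + _)                     // parallel operation: sum the ones at the driver
```

Note that `filter` and `map` are lazy transformations; no cluster work happens until the `reduce` parallel operation runs.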
Text Search
11
[Figure: the driver ships tasks to three workers; each worker reads a block (Block 1–3) from HDFS, keeps it in its cache (Cache 1–3), and returns results to the driver.]
Logistic Regression
Goal: find best line separating two sets
of points
12
Logistic Regression
13
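Slide 13 shows the paper's logistic regression listing; a close approximation follows. `parsePoint`, `Vector`, `D`, and `ITERATIONS` are helpers and constants assumed as in the paper, and `spark` is the cluster connection object:

```scala
// Read points from a text file and cache them across iterations
val points = spark.textFile("hdfs://...").map(parsePoint).cache()
// Initialize w to a random D-dimensional vector
var w = Vector.random(D)
// Run multiple iterations, accumulating the gradient on the workers
for (i <- 1 to ITERATIONS) {
  val grad = spark.accumulator(new Vector(D))
  for (p <- points) { // runs in parallel on the cluster
    val s = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    grad += s * p.x
  }
  w -= grad.value
}
```

Because `points` is cached, only the first iteration pays the cost of reading and parsing the file; later iterations reuse the in-memory dataset.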
Alternating Least Squares
Collaborative Filtering: Predict movie
ratings for a set of users based on their
past ratings
14
Implementation
Spark is built on Mesos, allowing it to
run alongside existing cluster computing
frameworks, such as Mesos ports of Hadoop
and MPI, and share data with them.
15
[Figure: Spark, Hadoop, and MPI running side by side on top of Mesos, across a row of cluster nodes.]
Implementation
Each dataset is stored as a chain of
objects capturing the lineage of its
RDD.
16
Implementation
17
Result
• Logistic Regression: using a 29 GB
dataset on 20 “m1.xlarge” EC2 nodes
with 4 cores each.
18
Hadoop: 127 s per iteration
Spark: first iteration 174 s,
further iterations 6 s
Result
Alternating Least Squares: We found that without
using broadcast variables, the time to resend the
ratings matrix R on each iteration dominated the
job’s running time. Furthermore, with a naïve
implementation of broadcast (using HDFS or NFS),
the broadcast time grew linearly with the number
of nodes, limiting the scalability of the job. We
implemented an application-level multicast system
to mitigate this. However, even with fast
broadcast, resending R on each iteration is costly.
Caching R in memory on the workers using a
broadcast variable improved performance by 2.8x in
an experiment with 5000 movies and 15000 users on
a 30-node EC2 cluster.
19
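The caching described above can be sketched with a broadcast variable, roughly following the paper's ALS listing. `updateUser`, `updateMovie`, `u`, `m`, `R`, `U`, and `M` are the paper's hypothetical helpers and matrices, assumed rather than defined here:

```scala
// Ship the ratings matrix R to the workers once, instead of
// resending it inside every closure on every iteration.
val Rb = spark.broadcast(R)
for (i <- 1 to ITERATIONS) {
  U = spark.parallelize(0 until u)
           .map(j => updateUser(j, Rb, M))  // each worker reads R from Rb
           .collect()
  M = spark.parallelize(0 until m)
           .map(j => updateMovie(j, Rb, U))
           .collect()
}
```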
Result
• Interactive Spark: We used the Spark
interpreter to load a 39 GB dump of
Wikipedia in memory across 15
“m1.xlarge” EC2 machines and query it
interactively. The first time the
dataset is queried, it takes roughly 35
seconds, comparable to running a Hadoop
job on it. However, subsequent queries
take only 0.5 to 1 seconds, even if they
scan all the data. This provides a
qualitatively different experience,
comparable to working with local data.
20
Discussion
Spark provides three simple data
abstractions for programming clusters:
resilient distributed datasets (RDDs),
and two restricted types of shared
variables: broadcast variables and
accumulators. While these abstractions
are limited, we have found that they are
powerful enough to express several
applications that pose challenges for
existing cluster computing frameworks,
including iterative and interactive
computations.
21
Future Work
In future work, we plan to focus on four
areas:
• Formally characterize the properties of RDDs
and Spark’s other abstractions
• Enhance the RDD abstraction to allow
programmers to trade off between storage cost
and reconstruction cost
• Define new operations to transform RDDs (e.g.,
shuffle)
• Provide higher-level interactive interfaces on
top of the Spark interpreter, such as SQL and
R shells
22
Questions?
23