Resilient Distributed Datasets
A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Motivation
• RDDs are motivated by two types of applications that current computing
frameworks handle inefficiently:
1. Iterative algorithms:
- iterative machine learning
- graph algorithms
2. Interactive data mining:
- ad-hoc queries
• In MapReduce, the only way to share data across jobs is through stable storage,
which is slow
Examples
Slow due to replication and disk I/O, but
necessary for fault tolerance
Goal: In-Memory Data Sharing
Solution: Resilient
Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
-- Immutable, partitioned collections of records
-- Can only be built through coarse-grained deterministic
transformations (map, filter, join, …)
• Efficient fault recovery using lineage
-- Log one operation to apply to many elements
-- Recompute lost partitions on failure
-- No cost if nothing fails
Solution: Resilient
Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory
for efficient reuse
• Retain the attractive properties of MapReduce
– Fault tolerance, data locality, scalability
• Support a wide range of applications
• Control of each RDD’s partitioning (layout
across nodes) and persistence (storage in
RAM, on disk, etc.)
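The lineage idea from the two "Solution" slides can be illustrated with a toy model in plain Scala. This is only a sketch under stated assumptions: `ToyRdd`, `computePartition`, and the sample data are hypothetical names invented for illustration, and nothing here is Spark's actual implementation.

```scala
// Toy model of lineage-based recovery; not Spark's implementation.
object LineageSketch {
  // A dataset is described by its source partitions plus the chain of
  // coarse-grained, deterministic functions applied to every element.
  final case class ToyRdd(source: Vector[Vector[Int]], lineage: List[Int => Int]) {
    // A transformation returns a new immutable description; one function is
    // logged no matter how many elements it will later be applied to.
    def map(f: Int => Int): ToyRdd = copy(lineage = lineage :+ f)

    // Rebuild any single partition from the source by replaying the lineage.
    def computePartition(i: Int): Vector[Int] =
      lineage.foldLeft(source(i))((part, f) => part.map(f))
  }

  val base: ToyRdd = ToyRdd(Vector(Vector(1, 2), Vector(3, 4)), Nil)
  val derived: ToyRdd = base.map(_ + 1).map(_ * 10)

  // Pretend partition 1 was held by a node that failed: only that
  // partition is recomputed, and partition 0 is left alone.
  val recovered: Vector[Int] = derived.computePartition(1)

  def main(args: Array[String]): Unit = println(recovered)
}
```

Because the logged functions are deterministic and the source is immutable, replaying them always yields the same partition, which is why no extra work is needed while nothing fails.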
RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
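The split between transformations and actions matters because Spark evaluates transformations lazily: nothing is computed until an action asks for a result. The sketch below mimics that behaviour with a plain Scala collection view rather than Spark itself; the object name and the `calls` counter are hypothetical and exist only for illustration.

```scala
// Lazy "transformations" versus an eager "action", using a Scala view.
object LazyOpsSketch {
  // Counts how many times the mapping function actually runs.
  var calls = 0

  val data: Vector[Int] = Vector(1, 2, 3, 4)

  // "Transformations": a view only records what to do, much as an RDD
  // transformation only defines a new dataset.
  val transformed = data.view.map { x => calls += 1; x * 2 }.filter(_ > 2)

  // Nothing has been computed yet.
  val callsBeforeAction: Int = calls

  // "Action": asking for a value forces the pipeline to run.
  val total: Int = transformed.sum
  val callsAfterAction: Int = calls
}
```

Here `callsBeforeAction` stays at zero, and the mapping function only runs once `sum` forces the view.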
Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: a driver exchanging tasks and results with three workers, each holding one input block (Block 1-3) and one cache (Cache 1-3)]

Result: full-text search of Wikipedia in <1 sec (vs
20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
Fault Recovery
• RDDs track the graph of transformations that
built them (their lineage) to rebuild lost data
Example: PageRank
Optimizing Placement
links & ranks are repeatedly joined
Can co-partition them (e.g., hash
both on URL) to avoid shuffles
Can also use app knowledge,
e.g., hash on DNS name
links = links.partitionBy(new URLPartitioner())
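Why co-partitioning avoids a shuffle can be seen in a small plain-Scala sketch. It assumes a simple hash partitioner over the URL key; `partitionOf`, `partitionBy`, and the sample data are hypothetical stand-ins and not the `URLPartitioner` from the slide.

```scala
// Co-partitioned join sketch: both datasets are bucketed by the same
// function of the key, so matching records always share a bucket.
object CoPartitionSketch {
  val numPartitions = 3

  // Hypothetical hash partitioner: the index depends only on the key.
  def partitionOf(url: String): Int =
    java.lang.Math.floorMod(url.hashCode, numPartitions)

  // Split a keyed dataset into buckets indexed by partition number.
  def partitionBy[V](data: Seq[(String, V)]): Map[Int, Seq[(String, V)]] =
    data.groupBy { case (url, _) => partitionOf(url) }

  val links: Seq[(String, Seq[String])] = Seq(
    "a.com" -> Seq("b.com"),
    "b.com" -> Seq("a.com", "c.com"),
    "c.com" -> Seq("a.com")
  )
  val ranks: Seq[(String, Double)] =
    Seq("a.com" -> 1.0, "b.com" -> 1.0, "c.com" -> 1.0)

  val linkParts = partitionBy(links)
  val rankParts = partitionBy(ranks)

  // Join bucket by bucket: no record needs to leave its partition.
  val joined: Seq[(String, (Seq[String], Double))] =
    (0 until numPartitions).flatMap { p =>
      val localRanks = rankParts.getOrElse(p, Seq.empty[(String, Double)]).toMap
      linkParts.getOrElse(p, Seq.empty[(String, Seq[String])]).flatMap {
        case (url, outs) => localRanks.get(url).map(rank => (url, (outs, rank)))
      }
    }
}
```

Hashing on the DNS name instead of the full URL would follow the same pattern with a different `partitionOf`, keeping pages from one site in the same bucket.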
PageRank Performance
Representing RDDs
• a set of partitions, which are atomic pieces of the
dataset
• a set of dependencies on parent RDDs
• a function for computing the dataset based on its
parents
• metadata about its partitioning scheme
• data placement
04/25/14
Representing RDDs
Operation                   Meaning
partitions()                Return a list of Partition objects
preferredLocations(p)       List nodes where partition p can be accessed faster due to data locality
dependencies()              Return a list of dependencies
iterator(p, parentIters)    Compute the elements of partition p given iterators for its parent partitions
partitioner()               Return metadata specifying whether the RDD is hash/range partitioned

Interface used to represent RDDs in Spark
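The five operations in the table can be written down as a Scala trait to show how they fit together. This is a sketch with simplified, assumed signatures: `ToyRdd`, `LocalRdd`, `MappedRdd`, and the string-valued `partitioner` are hypothetical and do not reproduce Spark's real classes.

```scala
// Toy version of the five-operation interface; signatures are assumptions.
object RddInterfaceSketch {
  final case class Partition(index: Int)

  trait ToyRdd[T] {
    def partitions: Seq[Partition]
    def preferredLocations(p: Partition): Seq[String]
    def dependencies: Seq[ToyRdd[_]]
    def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]
    def partitioner: Option[String] // e.g. Some("hash"), Some("range"), None
  }

  // A source dataset held in local memory: no parents, no preferred nodes.
  final class LocalRdd[T](chunks: Vector[Vector[T]]) extends ToyRdd[T] {
    def partitions: Seq[Partition] = chunks.indices.map(i => Partition(i))
    def preferredLocations(p: Partition): Seq[String] = Seq.empty
    def dependencies: Seq[ToyRdd[_]] = Seq.empty
    def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T] =
      chunks(p.index).iterator
    def partitioner: Option[String] = None
  }

  // A mapped dataset: same partitions as its parent, one dependency.
  final class MappedRdd[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
    def partitions: Seq[Partition] = parent.partitions
    def preferredLocations(p: Partition): Seq[String] = parent.preferredLocations(p)
    def dependencies: Seq[ToyRdd[_]] = Seq(parent)
    // The cast is safe here because the only parent yields elements of type A.
    def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[B] =
      parentIters.head.asInstanceOf[Iterator[A]].map(f)
    def partitioner: Option[String] = None
  }

  val base = new LocalRdd(Vector(Vector(1, 2), Vector(3)))
  val doubled = new MappedRdd[Int, Int](base, _ * 2)
  val p: Partition = doubled.partitions(1)
  val result: List[Int] =
    doubled.iterator(p, Seq(base.iterator(p, Seq.empty))).toList
}
```

A new transformation only needs to say which partitions it has, which parents it depends on, and how to turn parent iterators into its own elements.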
Dependencies
• Narrow dependencies
--- each partition of the parent RDD is used by
at most one partition of the child RDD
• Wide dependencies
--- a partition of the parent RDD may be used by multiple child partitions
• For example
--- map leads to a narrow dependency,
--- while join leads to wide dependencies (unless the parents are
hash-partitioned)
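The definition can be checked mechanically. The sketch below assumes a dependency is encoded as a map from each child partition to the set of parent partitions it reads; the `Reads` type and the sample values are hypothetical and serve only to illustrate the rule.

```scala
// Classify a dependency as narrow or wide from its partition-level reads.
object DependencySketch {
  // For each child partition, the parent partitions it reads (assumed encoding).
  type Reads = Map[Int, Set[Int]]

  // Narrow: every parent partition is read by at most one child partition.
  def isNarrow(reads: Reads): Boolean = {
    val readersByParent = reads.toSeq
      .flatMap { case (child, parents) => parents.toSeq.map(parent => parent -> child) }
      .groupBy(_._1)
    readersByParent.values.forall(_.size <= 1)
  }

  // map: child partition i reads only parent partition i.
  val mapReads: Reads = Map(0 -> Set(0), 1 -> Set(1), 2 -> Set(2))

  // A shuffle: every child partition reads every parent partition.
  val shuffleReads: Reads = Map(0 -> Set(0, 1, 2), 1 -> Set(0, 1, 2))

  // Merging partitions: a child reads several parents, each used only once.
  val mergeReads: Reads = Map(0 -> Set(0, 1), 1 -> Set(2))
}
```

Note that `mergeReads` still counts as narrow: the test is how many children use a parent partition, not how many parents a child reads.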
Dependencies
Examples of narrow and wide dependencies. Each box is an RDD, with
partitions shown as shaded rectangles
Narrow vs. Wide Dependencies
• Narrow dependencies
--- allow for pipelined execution on one cluster node, which can compute all the
parent partitions
--- make recovery after a node failure more efficient: only the lost parent partitions
need to be recomputed, and they can be recomputed in parallel on different nodes
• Wide dependencies
--- require data from all parent partitions to be available and to be shuffled across
the nodes using a MapReduce-like operation
--- in a lineage graph, a single failed node might cause the loss of some partition
from all the ancestors of an RDD, requiring a complete re-execution
Job Scheduler
• Similar to Dryad’s, but takes into account which partitions of persistent
RDDs are available in memory
• When the user runs an action (e.g., count or save) on an RDD, the scheduler
examines that RDD’s lineage graph to build a DAG of stages to execute
• Each stage contains as many pipelined transformations with narrow
dependencies as possible
• Stage boundaries are:
--- shuffle operations required for wide dependencies
--- any already computed partitions, which can short-circuit the computation of a
parent RDD
• The scheduler then launches tasks to compute missing partitions from
each stage until it has computed the target RDD
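How stages fall out of the lineage can be sketched for the simplest case, a linear chain of operations. The code assumes each step is flagged as narrow or wide; `Step`, `buildStages`, and the example chain are hypothetical, and the second kind of boundary (already computed partitions) is not modelled.

```scala
// Cut a linear lineage chain into stages at wide dependencies.
object StageSketch {
  // `wide` marks a shuffle dependency on the previous step. Real lineage
  // is a DAG; a chain keeps the sketch short.
  final case class Step(name: String, wide: Boolean)

  // Start a new stage at every wide dependency so that each stage holds
  // only steps that can be pipelined. Placing the wide step at the head of
  // the new stage is a simplification of where the shuffle really happens.
  def buildStages(lineage: List[Step]): List[List[String]] =
    lineage.foldLeft(List.empty[List[String]]) {
      case (Nil, step)                 => List(List(step.name))
      case (stages, step) if step.wide => stages :+ List(step.name)
      case (stages, step)              => stages.init :+ (stages.last :+ step.name)
    }

  val lineage: List[Step] = List(
    Step("textFile", wide = false),
    Step("map", wide = false),
    Step("groupByKey", wide = true),
    Step("mapValues", wide = false),
    Step("join", wide = true)
  )

  val stages: List[List[String]] = buildStages(lineage)
}
```

For this chain the result is three stages: `textFile` and `map`, then `groupByKey` and `mapValues`, then `join`.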
Job Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Locality & data reuse aware
• Partitioning-aware to avoid shuffles
Task Assignment
• The scheduler assigns tasks to machines based on data locality
using delay scheduling
--- if a task needs to process a partition that is available in
memory on a node, the scheduler sends it to that node
--- otherwise, if the task processes a partition for which the
containing RDD provides preferred locations (e.g., an HDFS
file), the scheduler sends it to those locations
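The two placement rules amount to a short decision function. The sketch below is a plain-Scala illustration only: `candidateNodes` and its parameters are hypothetical, the fallback to any node is an assumption the slide does not state, and the waiting behaviour of delay scheduling is not modelled.

```scala
// Choose candidate nodes for the task that computes one partition.
object TaskPlacementSketch {
  // cachedOn:  nodes that hold the partition in memory
  // preferred: the RDD's preferred locations (for example, HDFS block hosts)
  // allNodes:  assumed fallback when no locality information exists
  def candidateNodes(
      cachedOn: Seq[String],
      preferred: Seq[String],
      allNodes: Seq[String]
  ): Seq[String] =
    if (cachedOn.nonEmpty) cachedOn
    else if (preferred.nonEmpty) preferred
    else allNodes

  val cluster: Seq[String] = Seq("node1", "node2", "node3")

  val cachedCase: Seq[String] = candidateNodes(Seq("node2"), Seq("node1"), cluster)
  val hdfsCase: Seq[String] = candidateNodes(Seq.empty, Seq("node1", "node3"), cluster)
  val anyCase: Seq[String] = candidateNodes(Seq.empty, Seq.empty, cluster)
}
```

In-memory copies win over preferred locations because reading from a cache avoids both recomputation and disk I/O.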
Memory Management
Three options for storing persistent RDDs:
• In-memory storage as deserialized Java objects
--- The first option provides the fastest performance, because the Java
VM can access each RDD element natively
• In-memory storage as serialized data
--- The second option lets users choose a more memory-efficient
representation than Java object graphs when space is limited, at the
cost of lower performance
• On-disk storage
--- The third option is useful for RDDs that are too large to keep in RAM
but costly to recompute on each use
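The difference between the first two options is mainly where the decoding cost is paid. The sketch below uses standard Java serialization purely for illustration; it is an assumption, not the serializer Spark uses, it does not measure memory use, and `StorageSketch` and its members are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Deserialized versus serialized in-memory storage of the same records.
object StorageSketch {
  val records: Array[String] = Array("ERROR disk full", "ERROR timeout")

  // Option 1: keep the objects as they are; access is a plain memory read.
  val deserialized: Array[String] = records

  // Option 2: keep a single byte array; every access pays for decoding.
  def serialize(data: Array[String]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(data)
    out.close()
    buffer.toByteArray
  }

  def deserialize(bytes: Array[Byte]): Array[String] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[Array[String]]
    finally in.close()
  }

  val serialized: Array[Byte] = serialize(records)
  val restored: Array[String] = deserialize(serialized)
}
```

The third option would write the same bytes to a file instead of holding them in RAM, trading access speed for capacity.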
Not Suitable for RDDs
• RDDs are best suited for batch applications that apply the same
operation to all elements of a dataset
• RDDs would be less suitable for applications that make asynchronous
fine-grained updates to shared state, such as a storage system for a web
application or an incremental web crawler
Programming Models Implemented on Spark
RDDs can express many existing parallel models
Open Source Community
15 contributors, 5+ companies using Spark,
3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
Conclusion
• RDDs offer a simple and efficient programming model for a broad range of
applications (their immutability and coarse-grained transformations suit
a wide class of applications)
• They leverage the coarse-grained nature of many parallel algorithms for
low-overhead recovery
• They let users control each RDD’s partitioning (layout across nodes) and
persistence (storage in RAM, on disk, etc.)

More Related Content

What's hot

An Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsAn Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsNikolaos Konstantinou
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFSKavyaGo
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose themDatio Big Data
 
MapReduce
MapReduceMapReduce
MapReduceKavyaGo
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONijdms
 
Distributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDistributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDaniel Marcous
 
A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915Dan Han
 
Cassandra advanced part-ll
Cassandra advanced part-llCassandra advanced part-ll
Cassandra advanced part-llachudhivi
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveSharjeel Imtiaz
 
assignment3
assignment3assignment3
assignment3Kirti J
 
Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...Nikolaos Konstantinou
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...Kiruthikak14
 

What's hot (18)

An Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsAn Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF Graphs
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
C0312023
C0312023C0312023
C0312023
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFS
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose them
 
MapReduce
MapReduceMapReduce
MapReduce
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
 
Distributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDistributed Databases - Concepts & Architectures
Distributed Databases - Concepts & Architectures
 
A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915
 
Cppt
CpptCppt
Cppt
 
Cassandra advanced part-ll
Cassandra advanced part-llCassandra advanced part-ll
Cassandra advanced part-ll
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
 
assignment3
assignment3assignment3
assignment3
 
Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
 

Similar to Resilient Distributed Datasets (RDDs) Fault-Tolerant Abstraction

Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache SparkGao Yunzhong
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaAtif Akhtar
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsrishavkumar1402
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 

Similar to Resilient Distributed Datasets (RDDs) Fault-Tolerant Abstraction (20)

Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 

Recently uploaded

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 

Recently uploaded (20)

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 

Resilient Distributed Datasets (RDDs) Fault-Tolerant Abstraction

  • 1. Resilient Distributed Datasets A Fault­­Tolerant Abstraction for In­Memory Cluster Computing
  • 2. Motivation •RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: 1. Iterative algorithms: ­iterative machine learning ­graph algorithms 2. Interative data mining ­ad­hoc query •In MapReduce, the only way to share data across jobs is stable storage slow!
  • 3. Examples Slow due to replication and disk I/O, but necessary for fault tolerance
  • 5. Solution: Resilient Distributed Datasets (RDDs) •Restriced form of distributed shared memory ­­ Immutable,partitioned collections of records ­­ Can only be built through coarse­grained derterminstic transformations(map,filter,join,…) •Efficient fault recovery using lineage ­­log one operation to apply to many elenments ­­Recompute lost partitions on failure ­­No cost if nonthing fails
  • 6. Solution: Resilient Distributed Datasets (RDDs) • Allow apps to keep working sets in memory for efficient reuse • Retain the attractive properties of MapReduce – Fault tolerance, data locality, scalability • Support a wide range of applications • Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM,on disk,etc)
  • 7. RDD Operations Transformations (define a new RDD) map filter sample groupByKey reduceByKey sortByKey flatMap union join cogroup cross mapValues Actions (return a result to driver program) collect reduce count save lookupKey
  • 8. Example: Log Mining lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘t’)(2)) cachedMsgs = messages.cache() Block 1 Block 2 Block 3 Worker Worker Worker Driver cachedMsgs.filter(_.contains(“foo”)).count cachedMsgs.filter(_.contains(“bar”)).count . . . tasks results Cache 1 Cache 2 Cache 3 Base RDDTransformed RDD Action Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data) Load error messages from a log into memory, then interactively search for various patterns
  • 9. 9 Fault Recovery • RDD track the grapth of transformations that built them (their lineage) to rebuild lost data
  • 11. Optimizing Placement links & ranks repeatedly joined Can co-partition them (e.g.hash both on URL) to avoid shuffles Can also use app knowledge, e.g.,hash on DNS name links = links.partitionBy(new URLPartitioner())
  • 13. Representing RDDs • a set of partitions, which are atomic pieces of the dataset • a set of dependencies on parent RDDs • a function for computing the dataset based on its parents • metadata about its partitioning scheme • data placement 04/25/14
  • 14. Representing RDDs 04/25/14 Operation Meanning partitions() Return a list of Partition objects preferredLocations(p) List nodes where partition p can be accessed faster due to data locality dependencies() Return a list of dependencies iterator(p, parentIters) Compute the elements of partition p given iterators for its parent partitions partitioner() Return metadata specifying whether the RDD is hash/range partitioned Interface used to represent RDDs in Spark
  • 15. Dependencies • narrow dependencies ---where each partition of the parent RDD is used by at most one partition of the child RDD • wide dependencies ---where multiple child partitions may depend on it. • For example ---map leads to a narrow dependency, ---while join leads to wide dependencies (unless the parents are hash-partitioned) 04/25/14
  • 16. Dependencies 04/25/14 Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles
  • 17. Narrow VS Wide dependencies • Narrow dependencies ---allow for pipelined execution on one cluster node, which can compute all the parent partitions. ---recovery after a node failure is more efficient, as only the lost parent partitions need to be recomputed, can be recomputed in parallel on different nodes • Wide dependencies --- require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation --- in a lineage graph, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution 04/25/14
  • 18. Job Scheduler • Similar to Dryad’s, but takes into account which partitions of persistent RDDS available in memory • When runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD’s lineage graph to build a DAG of stages to execute • Each stage contains as many pipelined transformations with narrow dependencies as possible Boundary of the stages ---shuffle operations required for wide dependencies ---any already computed partitions(shortcircuit the computation of a parent RDD) • The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD 04/25/14
  • 19. Job Scheduler 04/25/14 Dryad-like DAGs Pipelines functions within a stage Locality & data reuse aware Partitioning-aware to avoid shuffles
  • 20. Task Assignment
      • The scheduler assigns tasks to machines based on data locality using delay scheduling
        --- if a task needs to process a partition that is available in memory on a node, the scheduler sends it to that node
        --- otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), the scheduler sends it to those locations
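The two placement rules on slide 20 amount to a short fallback chain. The sketch below is a hypothetical plain-Scala model: `placeTask`, `cachedOn`, and `preferred` are illustrative names, node names are plain strings, and the waiting behavior of delay scheduling is deliberately not modeled.

```scala
object LocalitySketch {
  // Pick a node for the task that computes `partition`:
  // 1) a node that already holds the partition in memory, else
  // 2) the first of the RDD's preferred locations, else
  // 3) None, meaning the task has no locality preference.
  def placeTask(partition: Int,
                cachedOn: Map[Int, String],
                preferred: Map[Int, Seq[String]]): Option[String] =
    cachedOn.get(partition)
      .orElse(preferred.getOrElse(partition, Nil).headOption)
}
```

A partition cached on one node is placed there even when its RDD lists other preferred locations, since the in-memory rule is tried first.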
  • 21. Memory Management
      Spark provides three options for storing persistent RDDs:
      • In-memory storage as deserialized Java objects
        --- gives the fastest performance, because the Java VM can access each RDD element natively
      • In-memory storage as serialized data
        --- lets users choose a more memory-efficient representation than Java object graphs when space is limited, at the cost of lower performance
      • On-disk storage
        --- useful for RDDs that are too large to keep in RAM but costly to recompute on each use
  • 22. Not Suitable for RDDs
      • RDDs are best suited for batch applications that apply the same operation to all elements of a dataset
      • RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler
  • 23. Programming Models Implemented on Spark
      RDDs can express many existing parallel models
  • 24. Open Source Community
      • 15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley
      • User applications:
        » Data mining 40x faster than Hadoop (Conviva)
        » Exploratory log analysis (Foursquare)
        » Traffic prediction via EM (Mobile Millennium)
        » Twitter spam classification (Monarch)
        » DNA sequence analysis (SNAP)
  • 25. Conclusion
      • RDDs offer a simple and efficient programming model for a broad range of applications (their immutable nature and coarse-grained transformations suit a wide class of applications)
      • They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
      • They let users control each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Editor's Notes

  1. Key idea: add “variables” to the “functions” in functional programming
  2. Pipelined execution: for example, one can apply a map followed by a filter on an element-by-element basis.
  3. Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1’s output RDD is already in RAM, so we run stage 2 and then 3.
  4. My own summary:
      1. Simple, efficient, and broadly applicable.
      2. Lowers the cost of fault recovery for coarse-grained parallel algorithms.
      3. The user decides which data will be reused and therefore should be persisted, and with what storage strategy. The user can also control how data is partitioned in order to avoid shuffles and improve efficiency (e.g., co-partitioning; a shuffle is a slow, time-consuming operation).
      4. More general than typical models. Most existing models are specialized designs that address MapReduce's poor performance in particular domains, such as Google's Pregel. Pregel's data sharing model is implicit and tailored to graph computation, whereas RDDs provide a more general data sharing model: they can express Pregel's computation model and can also be used in other application scenarios, which makes them more general and more flexible.

     Differences from Pregel: A third class of systems provide high-level interfaces for specific classes of applications requiring data sharing. For example, Pregel [22] supports iterative graph applications, while Twister [11] and HaLoop [7] are iterative MapReduce runtimes. However, these frameworks perform data sharing implicitly for the pattern of computation they support, and do not provide a general abstraction that the user can employ to share data of her choice among operations of her choice. For example, a user cannot use Pregel or Twister to load a dataset into memory and then decide what query to run on it. RDDs provide a distributed storage abstraction explicitly and can thus support applications that these specialized systems do not capture, such as interactive data mining.

     Differences from MapReduce (summarized from the Shark paper):
      1. Like Dryad and Tenzing [17, 9], it supports general computation DAGs, not just the two-stage MapReduce topology.
      2. It provides an in-memory storage abstraction called Resilient Distributed Datasets (RDDs) that lets applications keep data in memory across queries, and automatically reconstructs it after failures [33].
      3. The engine is optimized for low latency. It can efficiently manage tasks as short as 100 milliseconds on clusters of thousands of cores, while engines like Hadoop incur a latency of 5–10 seconds to launch each task.

     Four characteristics of RDDs (summarized from the Shark paper): The RDD model offers several key benefits in our large-scale in-memory computing setting. First, RDDs can be written at the speed of DRAM instead of the speed of the network, because there is no need to replicate each byte written to another machine for fault tolerance. DRAM in a modern server is over 10× faster than even a 10-Gigabit network. Second, Spark can keep just one copy of each RDD partition in memory, saving precious memory over a replicated system, since it can always recover lost data using lineage. Third, when a node fails, its lost RDD partitions can be rebuilt in parallel across the other nodes, allowing speedy recovery. Fourth, even if a node is just slow (a “straggler”), we can recompute necessary partitions on other nodes because RDDs are immutable, so there are no consistency concerns with having two copies of a partition.
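The lineage-based recovery described in these notes can be sketched in a few lines of plain Scala. Everything here is a hypothetical model rather than Spark's implementation: `Lineage`, `recompute`, and the sample data are illustrative, the base data stands in for input on stable storage, and each logged step is a coarse-grained function applied to a whole partition.

```scala
object LineageSketch {
  // Hypothetical model: base data per partition plus the logged coarse-grained steps.
  final case class Lineage(base: Map[Int, Seq[Int]], steps: List[Seq[Int] => Seq[Int]]) {
    // Rebuild a single partition by replaying every logged step on its base data.
    def recompute(partition: Int): Seq[Int] =
      steps.foldLeft(base(partition))((data, step) => step(data))
  }

  val lineage: Lineage = Lineage(
    base = Map(0 -> Seq(1, 2, 3), 1 -> Seq(4, 5, 6)),
    steps = List(
      (xs: Seq[Int]) => xs.map(_ * 10),   // logged once, applies to many elements
      (xs: Seq[Int]) => xs.filter(_ > 10)
    )
  )

  // One in-memory copy of each partition; no replica is kept.
  val cache: Map[Int, Seq[Int]] = Map(0 -> lineage.recompute(0), 1 -> lineage.recompute(1))

  // Simulate losing the node that held partition 1, then rebuild only that partition.
  val afterFailure: Map[Int, Seq[Int]] = cache - 1
  val recovered: Map[Int, Seq[Int]] = afterFailure + (1 -> lineage.recompute(1))
}
```

Only the lost partition is replayed, and because the steps are deterministic and the data immutable, the rebuilt copy equals the original.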