RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING
Matei Zaharia et al.
University of California, Berkeley
Alessandro Menabò, Politecnico di Torino, Italy
INTRODUCTION
Motivations
Interactive (real-time) data mining
Reuse of intermediate results (iterative algorithms)
Examples:
Machine learning
K-means clustering
PageRank
Limitations of current frameworks
Data reuse usually through disk storage
Disk I/O latency and serialization
Too high-level abstractions
Implicit memory management
Implicit work distribution
Fault tolerance through data replication and logging
High network traffic
Goals
Keep frequently used data in main memory
Efficient fault recovery
Log data transformations rather than data itself
User control
RESILIENT DISTRIBUTED DATASETS (RDDs)
What is an RDD?
Read-only, partitioned collection of records (optionally key-value pairs)
Created through transformations
From stored data or other RDDs
Coarse-grained: same operation on the whole dataset
Examples: map, filter, join
Lineage: sequence of transformations that created the RDD
Key to efficient fault recovery
Used through actions
Return a result or store data
Examples: count, collect, save
What is an RDD? (cont’d)
Lazy computation
RDDs are computed only when the first action is invoked
Persistence control
Choose RDDs to be reused, and how to store them (e.g. in memory)
Partitioning control
Define how to distribute RDDs across cluster nodes
Minimize inter-node communication
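
The ideas on the two slides above (transformations, lineage, actions, lazy computation, persistence) can be illustrated with a minimal single-machine sketch. `ToyRDD` and all of its members are hypothetical names invented for this illustration; this is not Spark's API and it ignores partitioning and distribution entirely.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source: Optional[List] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: str = "source",
                 fn: Optional[Callable[[List], List]] = None):
        self._source = source    # stored data, only set for root datasets
        self._parent = parent    # dataset this one was derived from
        self._op = op            # transformation name, kept for the lineage
        self._fn = fn            # deferred coarse-grained function
        self._persist = False    # set by persist()
        self._cache = None       # filled on first computation if persisted
        self.compute_count = 0   # how often this dataset was materialised

    # Transformations: build a new read-only dataset, compute nothing yet.
    def map(self, f):
        return ToyRDD(parent=self, op="map",
                      fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, op="filter",
                      fn=lambda rows: [r for r in rows if pred(r)])

    def persist(self):
        self._persist = True
        return self

    def lineage(self) -> List[str]:
        """Sequence of transformations that created this dataset."""
        chain, node = [], self
        while node is not None:
            chain.append(node._op)
            node = node._parent
        return list(reversed(chain))

    def _compute(self) -> List:
        if self._cache is not None:
            return self._cache
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._fn(self._parent._compute())
        if self._persist:
            self._cache = rows
        return rows

    # Actions: these trigger the actual computation.
    def count(self) -> int:
        return len(self._compute())

    def collect(self) -> List:
        return list(self._compute())


numbers = ToyRDD(source=list(range(10)))
evens = numbers.filter(lambda x: x % 2 == 0).persist()
squares = evens.map(lambda x: x * x)   # nothing has been computed so far
```

Calling `squares.collect()` is the first action and materialises the whole chain; a later `squares.count()` reuses the persisted `evens` instead of filtering again. A lost result could be rebuilt by replaying `squares.lineage()` from the source, which is the recovery idea the slides describe.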
Implementation
Apache Spark cluster computing framework
Open source
Reads from and writes to the Hadoop Distributed File System (HDFS) (by Apache)
Scala programming language
Interoperable with Java, compiles to Java bytecode
Object-oriented and functional programming
Statically typed, efficient and concise
Spark programming interface
Driver program
Defines and invokes actions on RDDs
Tracks RDDs’ lineage
Assigns workload to workers
Workers
Persistent processes on cluster nodes
Perform actions on data
Can store partitions of RDDs in RAM
Example: PageRank
Iterative algorithm
Updates document rank based on contributions from documents that link to it
Example: PageRank (cont’d)
The lineage graph grows with the number of iterations
Replicate some intermediate results to speed up fault recovery
Reduce communication overhead
Partition both links and ranks by URL in the same way
Joining them can be done on the same node
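
A minimal pure-Python sketch of the co-partitioning idea: links and ranks are split with the same hash partitioner, so the join of a page's links with its rank never leaves a partition, and only the rank contributions are exchanged between partitions. The partition count, the damping factor of 0.85, the function names and the requirement that every page has at least one outgoing link are all assumptions made for this illustration; they are not taken from the slides.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # assumed number of nodes for the illustration


def partition_of(url: str) -> int:
    """Deterministic hash partitioner shared by links and ranks."""
    return zlib.crc32(url.encode("utf-8")) % NUM_PARTITIONS


def partition_by_url(pairs):
    """Split (url, value) pairs into NUM_PARTITIONS dicts keyed by URL."""
    parts = [dict() for _ in range(NUM_PARTITIONS)]
    for url, value in pairs:
        parts[partition_of(url)][url] = value
    return parts


def pagerank(links, iterations=10, damping=0.85):
    """links maps each URL to a non-empty list of URLs it points to."""
    n = len(links)
    link_parts = partition_by_url(links.items())
    rank_parts = partition_by_url((url, 1.0 / n) for url in links)
    for _ in range(iterations):
        contribs = [defaultdict(float) for _ in range(NUM_PARTITIONS)]
        for p in range(NUM_PARTITIONS):
            # Local join: a URL's links and its rank live in the same partition.
            for url, outgoing in link_parts[p].items():
                share = rank_parts[p][url] / len(outgoing)
                for dest in outgoing:
                    # Only contributions cross partition boundaries.
                    contribs[partition_of(dest)][dest] += share
        rank_parts = [
            {url: (1 - damping) / n + damping * contribs[p][url]
             for url in link_parts[p]}
            for p in range(NUM_PARTITIONS)
        ]
    return {url: rank for part in rank_parts for url, rank in part.items()}


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this three-page graph, `c` receives contributions from both `a` and `b`, so it ends up ranked above `b`.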
RDD representation
Goals
Easily track lineage
Support a rich set of transformations
Keep system as simple as possible (uniform interface, avoid ad-hoc logic)
Graph-based structure
Set of partitions (pieces of the dataset)
Set of dependencies on parent RDDs
Function for computing the dataset from parent RDDs
Metadata about partitioning and data location
Dependencies
Narrow dependencies
Each partition of the parent is used by at most one partition of the child
Example: map, filter, union
Wide dependencies
Each partition of the parent may be used by many partitions of the child
Example: groupByKey, join (when the inputs are not partitioned in the same way)
Dependencies (cont’d)
Normal execution
Narrow: pipelined (e.g. map + filter one element at a time)
Wide: serial (all parents need to be available before computation starts)
Fault recovery
Narrow: fast (only one parent partition has to be recomputed)
Wide: full (one failed node may require all parents to be recomputed)
OVERVIEW OF SPARK
Scheduling
Tracks in-memory partitions
On action request:
Examines lineage and builds a DAG of execution stages
Each stage contains as many transformations with narrow dependencies as possible
Stage boundaries correspond to wide dependencies, or already computed partitions
Launches tasks to compute missing partitions until desired RDD is computed
Tasks assigned according to in-memory data locality
Otherwise assign to RDD’s preferred location (user-specified)
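
The stage-building rule above can be sketched as a walk over the lineage from the requested RDD back to its sources. The lineage encoding (a dict from RDD name to its parents and dependency kinds), the `cached` set and the function name are assumptions made for this illustration, and the sketch only handles tree-shaped lineage; it is not Spark's scheduler.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical encoding: rdd name -> list of (parent name, "narrow" | "wide")
Lineage = Dict[str, List[Tuple[str, str]]]


def build_stages(lineage: Lineage, target: str,
                 cached: FrozenSet[str] = frozenset()) -> List[List[str]]:
    """Group RDDs into stages, cutting at wide dependencies and at RDDs
    that are already computed. Stages are returned in execution order."""
    stages: List[List[str]] = []

    def visit(rdd: str, stage: List[str]) -> None:
        if rdd not in stage:
            stage.append(rdd)
        for parent, dep in lineage.get(rdd, []):
            if parent in cached:
                continue                  # already computed: stage boundary
            if dep == "narrow":
                visit(parent, stage)      # pipelined within the same stage
            else:
                parent_stage: List[str] = []
                visit(parent, parent_stage)
                stages.append(parent_stage)  # wide dependency: earlier stage

    final_stage: List[str] = []
    visit(target, final_stage)
    stages.append(final_stage)
    # Each stage was collected child-first; reverse it to run parents first.
    return [list(reversed(stage)) for stage in stages]


lineage = {
    "lines": [],
    "pairs": [("lines", "narrow")],     # e.g. a map
    "grouped": [("pairs", "wide")],     # e.g. a groupByKey
    "result": [("grouped", "narrow")],  # e.g. another map
}
```

For this lineage, `build_stages(lineage, "result")` yields two stages split at the wide dependency, and marking `pairs` as cached leaves only the final stage to run.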
Scheduling (cont’d)
On task failure, re-run it on another node if all parents are still available
If stages become unavailable, re-run parent tasks in parallel
Scheduler failures not addressed
Replicate lineage graph?
Interactivity
Desirable given low-latency in-memory capabilities
Scala shell integration
Each line is compiled into a Java class and run in JVM
Bytecode shipped to workers via HTTP
Memory management
Persistent RDDs storage modes:
In-memory, deserialized object: fastest (native support by JVM)
In-memory, serialized object: more memory-efficient, but slower
On-disk: if RDD does not fit into RAM, but too costly to recompute every time
LRU eviction policy at the level of RDDs: when a new partition does not fit into RAM, a partition of the least recently used RDD is evicted
Unless the new partition belongs to the LRU RDD
Separate memory space on each node
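
The eviction rule can be sketched with an ordered dictionary that keeps RDDs in least-recently-used order. Measuring capacity as a number of partitions rather than bytes, and the class and method names, are simplifying assumptions for this illustration.

```python
from collections import OrderedDict


class PartitionCache:
    """Hypothetical per-node cache: RDD-level LRU, partition-level eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # assumed unit: number of partitions
        self.rdds = OrderedDict()  # rdd id -> partition ids, LRU RDD first

    def size(self) -> int:
        return sum(len(parts) for parts in self.rdds.values())

    def touch(self, rdd_id) -> None:
        """Record an access, making rdd_id the most recently used RDD."""
        if rdd_id in self.rdds:
            self.rdds.move_to_end(rdd_id)

    def put(self, rdd_id, partition) -> bool:
        """Try to store a partition; return False if it was not cached."""
        while self.size() >= self.capacity:
            victim = next(iter(self.rdds), None)
            if victim is None or victim == rdd_id:
                # Never evict from the RDD the new partition belongs to:
                # this avoids cycling one RDD's partitions in and out.
                return False
            self.rdds[victim].pop()
            if not self.rdds[victim]:
                del self.rdds[victim]
        self.rdds.setdefault(rdd_id, []).append(partition)
        self.rdds.move_to_end(rdd_id)
        return True


cache = PartitionCache(capacity=3)
cache.put("A", 0)
cache.put("A", 1)
cache.put("B", 0)
```

At this point the cache is full. Storing a partition of a new RDD `C` evicts one partition of `A`, the least recently used RDD, whereas a further partition of `A` itself would be rejected as long as `A` remains the LRU RDD.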
Checkpointing
Save intermediate RDDs to disk (replication)
Speed up recovery of RDDs with long lineage or wide dependencies
Pointless with short lineage or narrow dependencies (recomputing partitions in parallel is less costly than replicating the whole RDD)
Not strictly required, but nice to have
Easy because RDDs are read-only
No consistency issues or distributed coordination required
Done in the background, programs do not have to be suspended
Controlled by the user, no automatic checkpointing yet
EVALUATION
Testing environment
Amazon Elastic Compute Cloud (EC2)
m1.xlarge nodes
4 cores / node
15 GB of RAM / node
HDFS with 256 MB blocks
Iterative machine learning
10 iterations on 100 GB of data
Run on 25, 50, 100 nodes
Iterative machine learning (cont’d)
Different algorithms
K-means is more compute-intensive
Logistic regression is more sensitive to I/O and deserialization
Minimum overhead in Spark
25.3× / 20.7× with logistic regression
3.2× / 1.9× with K-means
Outperforms even HadoopBinMem (in-memory binary data)
PageRank
10 iterations on a 54 GB Wikipedia dump
Approximately 4 million articles
Run on 30 and 60 nodes
Linear speedup with number of nodes
2.4× with in-memory storage only
7.4× with controlled partitioning too
Fault recovery
10 iterations of K-means with 100 GB of data on 75 nodes
Failure at 6th iteration
Fault recovery (cont’d)
Loss of tasks and partitions on failed node
Tasks rescheduled on different nodes
Missing partitions recomputed in parallel
Lineage graphs less than 10 KB
Checkpointing would require
Running several iterations again
Replicating all 100 GB over the network
Consuming twice the memory or writing all 100 GB to disk
Low memory
Logistic regression with various amounts of RAM
Graceful degradation with less space
Interactive data mining
1 TB of Wikipedia page view logs (2 years of data)
Run on 100 m2.4xlarge nodes
8 cores and 68 GB of RAM per node
True interactivity (less than 7 s)
Querying from disk took 170 s
CONCLUSIONS
Applications
Nothing new under the sun
In-memory computing, lineage tracking, partitioning and fast recovery are already available in other frameworks (separately)
RDDs can provide all these features in a single framework
RDDs can express existing cluster programming models
Same output, better performance
Examples: MapReduce, SQL, Google’s Pregel, batched stream processing (periodically updating results with new data)
Advantages
Dramatic speedup with reused data (depending on application)
Fast fault recovery thanks to lightweight logging of transformations
Efficiency under control of user (storage, partitioning)
Graceful performance degradation with low RAM
High expressivity
Versatility
Interactivity
Open source
Limitations
Not suited for fine-grained transformations
Overhead from logging too many lineage graphs
Traditional data logging and checkpointing perform better
Thanks!

Resilient Distributed Datasets

  • 1. RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING MateiZahariaet al. Universityof California, Berkeley
  • 4. Motivations Interactive (real-time) data mining Reuseof intermediate results(iterative algorithms) Examples: Machine learning K-meansclustering PageRank
  • 5. Limitationsof currentframeworks Data reuseusuallythroughdisk storage Disk IO latencyand serialization Too high-levelabstractions Implicitmemorymanagement Implicitwork distribution Fault tolerancethroughdata replicationand logging High network traffic
  • 6. Goals Keepfrequentlyuseddata in mainmemory Efficientfault recovery Log data transformationsratherthandata itself User control
  • 8. Whatisan RDD? Read-only, partitionedcollectionof recordsin key-valueform Createdthroughtransformations From storeddata or otherRDDs Coarse-grained: sameoperationon the wholedataset Examples: map, filter, join Lineage: sequenceof transformationsthatcreatedthe RDD Keyto efficientfault recovery Usedthroughactions Return a resultor storedata Examples: count, collect, save
  • 9. Whatisan RDD? (cont’d) Lazycomputation RDDsare computedonlywhenthe first actionisinvoked Persistencecontrol ChooseRDDsto be reused, and howto storethem(e.g. in memory) Partitioningcontrol Definehowto distributeRDDsacrosscluster nodes Minimizeinter-nodecommunication
  • 10. Implementation Apache Sparkcluster computingframework Open source Basedon HadoopDistributed File System (HDFS) (by Apache) Scala programminglanguage Derivedfrom Java, compilesto Java bytecode Object-orientedand functionalprogramming Staticallytyped, efficientand concise
  • 11. Sparkprogramminginterface Driver program Definesand invokesactionson RDDs TracksRDDs’ lineage Assignsworkloadto workers Workers Persistentprocesseson cluster nodes Performactionson data Can storepartitionsof RDDsin RAM
  • 12. Example: PageRank Iterative algorithm Updatesdocumentrankbasedon contributionsfrom documentsthatlink to it
  • 13. Example: PageRank(cont’d) The graphgrowswith the numberof iterations Replicate some intermediate resultsto speedupfault recovery Reduce communicationoverhead Partitionbothlinksand ranksby URL in the sameway Joiningthemcan be doneon the samenode
  • 14. RDD representation. Goals: easily track lineage; support a rich set of transformations; keep the system as simple as possible (uniform interface, avoid ad-hoc logic). Graph-based structure: a set of partitions (pieces of the dataset); a set of dependencies on parent RDDs; a function for computing the dataset from the parent RDDs; metadata about partitioning and data location.
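The four items of the graph-based structure map naturally onto a small record type. The following is a sketch under assumptions: `RDDNode`, its field names, the in-memory `blocks` and the host names are all hypothetical and do not mirror Spark's internal classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RDDNode:
    """Hypothetical per-RDD record mirroring the four items on the slide."""
    num_partitions: int                                      # set of partitions
    parents: List["RDDNode"] = field(default_factory=list)   # dependencies
    # compute(partition_index, parents) -> records of that partition
    compute: Optional[Callable[[int, List["RDDNode"]], List]] = None
    # Metadata: partitioning scheme and preferred host for each partition.
    partitioner: Optional[str] = None
    preferred_locations: Dict[int, str] = field(default_factory=dict)

    def partition(self, index: int) -> List:
        return self.compute(index, self.parents)


# A source RDD backed by two in-memory "blocks" (made-up data and hosts).
blocks = [[1, 2, 3], [4, 5, 6]]
source = RDDNode(
    num_partitions=2,
    compute=lambda i, _parents: list(blocks[i]),
    preferred_locations={0: "node-a", 1: "node-b"},
)

# A map-style child: partition i depends only on partition i of its parent.
doubled = RDDNode(
    num_partitions=source.num_partitions,
    parents=[source],
    compute=lambda i, parents: [x * 2 for x in parents[0].partition(i)],
)
```

Every RDD, whatever transformation produced it, exposes the same small interface, which is what lets a scheduler treat all of them uniformly.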
  • 15. Dependencies. Narrow dependencies: each partition of the parent is used by at most one partition of the child (examples: map, filter, union). Wide dependencies: each partition of the parent may be used by many partitions of the child (examples: join, groupByKey).
  • 16. Dependencies (cont’d) Normal execution: narrow dependencies are pipelined (e.g. map + filter applied one element at a time); wide dependencies are serial (all parents need to be available before computation starts). Fault recovery: narrow is fast (only one parent partition has to be recomputed); wide may be full (one failed node may require all parents to be recomputed).
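The difference in recovery cost can be made concrete with a small counting sketch. It assumes a linear lineage chain, one-to-one narrow dependencies, the same number of partitions at every level, and nothing cached or checkpointed; the function name is hypothetical.

```python
def partitions_to_recompute(chain, num_partitions, lost=1):
    """Count partitions that must be rebuilt to restore `lost` partitions.

    `chain` lists the dependency kind ("narrow" or "wide") of each RDD on
    its parent, starting at the RDD that lost data and walking back toward
    the stored source data (which is not counted).
    """
    needed = lost
    total = 0
    for kind in chain:
        if kind not in ("narrow", "wide"):
            raise ValueError(f"unknown dependency kind: {kind}")
        total += needed              # rebuild `needed` partitions at this level
        if kind == "wide":
            needed = num_partitions  # a wide dependency needs every parent partition
    return total


# Lost one partition of: source -> map -> groupByKey -> map, 4 partitions each.
mixed_cost = partitions_to_recompute(["narrow", "wide", "narrow"], num_partitions=4)
# Same chain length, narrow dependencies only.
narrow_cost = partitions_to_recompute(["narrow", "narrow", "narrow"], num_partitions=4)
```

With narrow dependencies only, the cost stays proportional to the chain length; a single wide dependency multiplies everything upstream of it by the partition count.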
  • 18. Scheduling. The scheduler tracks in-memory partitions. On an action request it examines the lineage and builds a DAG of execution stages: each stage contains as many transformations with narrow dependencies as possible; stage boundaries correspond to wide dependencies or to already computed partitions. It then launches tasks to compute missing partitions until the desired RDD is computed. Tasks are assigned according to in-memory data locality; otherwise they are assigned to the RDD’s preferred location (user-specified).
  • 19. Scheduling (cont’d) On task failure, re-run it on another node if all parents are still available. If stages become unavailable, re-run the parent tasks in parallel. Scheduler failures are not addressed (replicate the lineage graph?).
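Cutting a lineage graph into stages can be sketched as a walk from the requested RDD back toward its sources. This is a simplified illustration, not Spark's scheduler: `build_stages`, the dictionary encoding of dependencies and the RDD names are hypothetical, and the sketch assumes a tree-shaped lineage in which each RDD has a single child.

```python
def build_stages(deps, final, cached=frozenset()):
    """Group RDDs into stages, starting from the RDD an action was called on.

    deps maps a child RDD name to a list of (parent name, kind) pairs, where
    kind is "narrow" or "wide". RDDs in `cached` are treated as already
    computed and are not visited. The stage containing `final` comes first
    in the result; a stage can run only after the stages it cut off.
    """
    stages = []

    def visit(rdd, stage):
        stage.append(rdd)
        for parent, kind in deps.get(rdd, []):
            if parent in cached:
                continue                  # boundary: already computed
            if kind == "narrow":
                visit(parent, stage)      # pipeline into the same stage
            else:
                new_stage = []            # boundary: wide dependency
                stages.append(new_stage)
                visit(parent, new_stage)

    root = []
    stages.append(root)
    visit(final, root)
    return stages


# Made-up word-count style lineage: lines -> words -> pairs -> counts -> top
deps = {
    "top": [("counts", "narrow")],
    "counts": [("pairs", "wide")],
    "pairs": [("words", "narrow")],
    "words": [("lines", "narrow")],
}
all_stages = build_stages(deps, "top")
with_cache = build_stages(deps, "top", cached={"pairs"})
```

When `pairs` is already in memory, the whole upstream stage disappears from the plan, which is the effect of treating computed partitions as stage boundaries.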
  • 20. Interactivity. Desirable given the low-latency in-memory capabilities. Scala shell integration: each line is compiled into a Java class and run in the JVM; bytecode is shipped to workers via HTTP.
  • 21. Memory management. Storage modes for persistent RDDs: in-memory as deserialized objects (fastest, native support by the JVM); in-memory as serialized objects (more memory-efficient, but slower); on disk (if the RDD does not fit into RAM, but is too costly to recompute every time). LRU eviction policy at the level of entire RDDs: when a new partition does not fit into RAM, a partition of the least recently used RDD is evicted, unless the new partition belongs to that same RDD. Separate memory space on each node.
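One reading of the eviction rule is that recency is tracked per RDD and partitions are evicted from the least recently used RDD, except when the incoming partition belongs to that same RDD. The sketch below follows that reading; `PartitionCache`, the capacity measured in number of partitions, and the choice to reject the new partition in the exception case are assumptions made for illustration.

```python
from collections import OrderedDict


class PartitionCache:
    """Toy per-node cache with LRU tracked per RDD (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity    # maximum number of partitions held
        self.rdds = OrderedDict()   # rdd_id -> partition ids, LRU RDD first

    def size(self):
        return sum(len(parts) for parts in self.rdds.values())

    def put(self, rdd_id, partition):
        """Store a partition; return False if it could not be cached."""
        while self.size() >= self.capacity:
            victim = next(iter(self.rdds), None)
            # Nothing to evict, or the LRU RDD is the one being written:
            # keep the old partitions to avoid cycling the same RDD in and out.
            if victim is None or victim == rdd_id:
                return False
            self.rdds[victim].pop()
            if not self.rdds[victim]:
                del self.rdds[victim]
        self.rdds.setdefault(rdd_id, []).append(partition)
        self.rdds.move_to_end(rdd_id)   # this RDD is now the most recently used
        return True


cache = PartitionCache(capacity=3)
cache.put("A", 0)
cache.put("A", 1)
cache.put("B", 0)                      # cache is now full: A has 2, B has 1
stored_b = cache.put("B", 1)           # evicts one partition of A (the LRU RDD)
stored_a = cache.put("A", 5)           # A is the LRU RDD itself: rejected
```

Rejecting the write in the last step keeps partitions of the same RDD from pushing each other out, at the price of recomputing the uncached partition on its next use.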
  • 22. Checkpointing. Save intermediate RDDs to disk (replication). Speeds up recovery of RDDs with long lineage or wide dependencies; pointless with short lineage or narrow dependencies (recomputing partitions in parallel is less costly than replicating the whole RDD). Not strictly required, but nice to have. Easy because RDDs are read-only: no consistency issues or distributed coordination required; done in the background, programs do not have to be suspended. Controlled by the user, no automatic checkpointing yet.
  • 24. Testing environment. Amazon Elastic Compute Cloud (EC2): m1.xlarge nodes with 4 cores and 15 GB of RAM per node; HDFS with 256 MB blocks.
  • 25. Iterative machine learning. 10 iterations on 100 GB of data, run on 25, 50 and 100 nodes.
  • 26. Iterative machine learning (cont’d) Different algorithms: K-means is more compute-intensive; logistic regression is more sensitive to I/O and deserialization. Minimum overhead in Spark: speedups of 25.3× / 20.7× with logistic regression and 3.2× / 1.9× with K-means. Outperforms even HadoopBinMem (in-memory binary data).
  • 27. PageRank. 10 iterations on a 54 GB Wikipedia dump (approximately 4 million articles), run on 30 and 60 nodes. Linear speedup with the number of nodes: 2.4× with in-memory storage only, 7.4× with controlled partitioning as well.
  • 28. Fault recovery. 10 iterations of K-means with 100 GB of data on 75 nodes; failure at the 6th iteration.
  • 29. Fault recovery (cont’d) Loss of tasks and partitions on the failed node: tasks are rescheduled on different nodes and missing partitions are recomputed in parallel. Lineage graphs take less than 10 KB. Checkpointing would instead require running several iterations again, replicating all 100 GB over the network, and consuming twice the memory or writing all 100 GB to disk.
  • 30. Low memory. Logistic regression with various amounts of RAM: graceful degradation with less space.
  • 31. Interactive data mining. 1 TB of Wikipedia page view logs (2 years of data), run on 100 m2.4xlarge nodes with 8 cores and 68 GB of RAM per node. True interactivity (less than 7 s); querying from disk took 170 s.
  • 33. Applications. Nothing new under the sun: in-memory computing, lineage tracking, partitioning and fast recovery are already available in other frameworks (separately); RDDs can provide all these features in a single framework. RDDs can express existing cluster programming models with the same output and better performance. Examples: MapReduce, SQL, Google’s Pregel, batched stream processing (periodically updating results with new data).
  • 34. Advantages: dramatic speedup with reused data (depending on the application); fast fault recovery thanks to lightweight logging of transformations; efficiency under the control of the user (storage, partitioning); graceful performance degradation with low RAM; high expressivity; versatility; interactivity; open source.
  • 35. Limitations: not suited for fine-grained transformations, where logging too many lineage graphs adds overhead and traditional data logging and checkpointing perform better.