RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING 
Matei Zaharia et al.
University of California, Berkeley
Alessandro Menabò, Politecnico di Torino, Italy
INTRODUCTION
Motivations
Interactive (real-time) data mining
Reuse of intermediate results (iterative algorithms)
Examples:
Machine learning
K-means clustering
PageRank
Limitations of current frameworks
Data reuse usually through disk storage
Disk I/O latency and serialization
Too high-level abstractions
Implicit memory management
Implicit work distribution
Fault tolerance through data replication and logging
High network traffic
Goals
Keep frequently used data in main memory
Efficient fault recovery
Log data transformations rather than data itself
User control
RESILIENT DISTRIBUTED DATASETS (RDDs)
What is an RDD?
Read-only, partitioned collection of records (often key-value pairs)
Created through transformations
From stored data or other RDDs
Coarse-grained: same operation on the whole dataset
Examples: map, filter, join
Lineage: sequence of transformations that created the RDD
Key to efficient fault recovery
Used through actions
Return a result or store data
Examples: count, collect, save
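A minimal sketch of the transformation/action split, assuming a SparkContext named sc and hypothetical HDFS paths; nothing is executed until the actions at the end run.

    // Build RDDs lazily through transformations; only the actions trigger work.
    val lines  = sc.textFile("hdfs://namenode/logs")          // hypothetical input path
    val errors = lines.filter(_.startsWith("ERROR"))          // transformation: filter
    val pairs  = errors.map(line => (line.split("\t")(0), 1)) // transformation: map to key-value pairs
    val total  = errors.count()                               // action: computes and returns a result
    pairs.saveAsTextFile("hdfs://namenode/error-pairs")       // action: stores data (hypothetical path)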
What is an RDD? (cont'd)
Lazy computation
RDDs are computed only when the first action is invoked
Persistence control
Choose RDDs to be reused, and how to store them (e.g. in memory)
Partitioning control
Define how to distribute RDDs across cluster nodes
Minimize inter-node communication
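A short sketch of persistence and partitioning control, assuming a SparkContext sc as before; cache() and HashPartitioner are standard Spark calls, while the input path and partition count are illustrative.

    import org.apache.spark.HashPartitioner

    // Persistence control: mark a reused RDD to be kept in memory after its first computation.
    val ranks = sc.textFile("hdfs://namenode/ranks")                    // hypothetical input
      .map { line => val f = line.split("\t"); (f(0), f(1).toDouble) }  // (url, rank) pairs
      .cache()                                                          // keep in memory for reuse

    // Partitioning control: hash-partition by key so later joins on the same key stay node-local.
    val partitioned = ranks.partitionBy(new HashPartitioner(100))       // 100 partitions, illustrative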
Implementation
Apache Spark cluster computing framework
Open source
Uses the Hadoop Distributed File System (HDFS, by Apache) for storage
Scala programming language
Interoperates with Java, compiles to Java bytecode
Object-oriented and functional programming
Statically typed, efficient and concise
Spark programming interface
Driver program
Defines RDDs and invokes actions on them
Tracks the RDDs' lineage
Assigns workload to workers
Workers
Persistent processes on cluster nodes
Perform operations on the data
Can store partitions of RDDs in RAM
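For illustration, a minimal driver program under the modern Spark API (SparkConf/SparkContext); the application name and paths are hypothetical. The driver only builds the lineage and invokes the actions; the workers execute the tasks and can cache partitions.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val counts = sc.textFile("hdfs://namenode/input")   // hypothetical input path
          .flatMap(_.split("\\s+"))                         // split lines into words
          .map(word => (word, 1))
          .reduceByKey(_ + _)                               // aggregated on the workers
        counts.saveAsTextFile("hdfs://namenode/output")     // action: triggers the whole job
        sc.stop()
      }
    }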
Example: PageRank
Iterative algorithm
Updates document rank based on contributions from documents that link to it
Example: PageRank (cont'd)
The lineage graph grows with the number of iterations
Replicate some intermediate results to speed up fault recovery
Reduce communication overhead:
Partition both links and ranks by URL in the same way
Joining them can then be done on the same node
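A sketch of the PageRank loop, close to the version in the RDD paper, assuming a SparkContext sc and a hypothetical link file with "source destination" pairs per line; co-partitioning links and ranks with the same HashPartitioner keeps the per-iteration join local to each node.

    import org.apache.spark.HashPartitioner

    // Adjacency lists, hash-partitioned once and kept in memory across iterations.
    val links = sc.textFile("hdfs://namenode/links")                 // hypothetical input
      .map { line => val p = line.split("\\s+"); (p(0), p(1)) }      // (source, destination)
      .groupByKey()
      .partitionBy(new HashPartitioner(100))                         // partition count is illustrative
      .cache()

    var ranks = links.mapValues(_ => 1.0)                            // initial rank for every URL

    for (_ <- 1 to 10) {
      // Each page sends rank / out-degree to its neighbours; the join reuses the links partitioning.
      val contribs = links.join(ranks).flatMap { case (_, (dests, rank)) =>
        dests.map(dest => (dest, rank / dests.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _) // damping factor 0.85
    }
    ranks.saveAsTextFile("hdfs://namenode/ranks")                    // action: materializes the result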
RDD representation
Goals
Easily track lineage
Support rich set of transformations
Keep system as simple as possible (uniform interface, avoid ad-hoc logic)
Graph-based structure
Set of partitions (pieces of the dataset)
Set of dependencies on parent RDDs
Function for computing the dataset from parent RDDs
Metadata about partitioning and data location
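The graph-based structure above corresponds to the common RDD interface described in the paper; below is a paraphrase as a Scala trait with placeholder types. The names are illustrative, not the actual Spark source.

    // Placeholder types so the sketch is self-contained.
    trait Partition { def index: Int }
    trait Dependency[T]
    trait Partitioner

    // Paraphrased RDD interface: partitions, dependencies, a compute function, and metadata.
    trait RDD[T] {
      def partitions: Seq[Partition]                          // pieces of the dataset
      def dependencies: Seq[Dependency[_]]                    // links to the parent RDDs
      def compute(split: Partition): Iterator[T]              // derive a partition from its parents
      def preferredLocations(split: Partition): Seq[String]   // data-locality hints (e.g. HDFS block hosts)
      def partitioner: Option[Partitioner]                    // how records are partitioned, if at all
    }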
Dependencies
Narrow dependencies
Each partition of the parent is used by at most one partition of the child
Examples: map, filter, union
Wide dependencies
Each partition of the parent may be used by many partitions of the child
Examples: join, groupByKey
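A two-line illustration of the difference, assuming a SparkContext sc: mapValues keeps each child partition tied to a single parent partition, while groupByKey may pull records from every parent partition (a shuffle).

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val narrow = pairs.mapValues(_ * 2)   // narrow: each child partition reads one parent partition
    val wide   = pairs.groupByKey()       // wide: each child partition may read all parent partitions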
Dependencies (cont'd)
Normal execution
Narrow: pipelined (e.g. map + filter one element at a time)
Wide: serial (all parents need to be available before computation starts)
Fault recovery
Narrow: fast (only one parent partition has to be recomputed)
Wide: full (one failed node may require all parents to be recomputed)
OVERVIEW OF SPARK
Scheduling
Tracks in-memory partitions
On action request:
Examines lineage and builds a DAG of execution stages
Each stage contains as many transformations with narrow dependencies as possible
Stage boundaries correspond to wide dependencies, or to already computed partitions
Launches tasks to compute missing partitions until the desired RDD is computed
Tasks assigned according to in-memory data locality
Otherwise assigned to the RDD's preferred location (user-specified)
Scheduling (cont'd)
On task failure, re-run it on another node if all parents are still available
If stages become unavailable, re-run parent tasks in parallel
Scheduler failures not addressed
Replicate the lineage graph?
Interactivity
Desirable given low-latency in-memory capabilities
Scala shell integration
Each line is compiled into a Java class and run in the JVM
Bytecode shipped to workers via HTTP
Memory management
Persistent RDD storage modes:
In-memory, deserialized objects: fastest (native support by the JVM)
In-memory, serialized objects: more memory-efficient, but slower
On-disk: if the RDD does not fit into RAM, but is too costly to recompute every time
LRU eviction policy of entire RDDs when a new partition does not fit into RAM
Unless the new partition belongs to the LRU RDD
Separate memory space on each node
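The three storage modes map onto Spark storage levels roughly as sketched below (modern API names; an RDD can be given only one level, so three separate RDDs with hypothetical inputs are shown).

    import org.apache.spark.storage.StorageLevel

    val a = sc.textFile("hdfs://namenode/a")   // hypothetical inputs
    val b = sc.textFile("hdfs://namenode/b")
    val c = sc.textFile("hdfs://namenode/c")

    a.persist(StorageLevel.MEMORY_ONLY)        // deserialized objects in RAM: fastest access
    b.persist(StorageLevel.MEMORY_ONLY_SER)    // serialized bytes in RAM: denser but slower to read
    c.persist(StorageLevel.DISK_ONLY)          // on disk: too costly to recompute, does not fit in RAM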
Checkpointing
Save intermediate RDDs to disk (replication)
Speeds up recovery of RDDs with long lineage or wide dependencies
Pointless with short lineage or narrow dependencies (recomputing partitions in parallel is less costly than replicating the whole RDD)
Not strictly required, but nice to have
Easy because RDDs are read-only
No consistency issues or distributed coordination required
Done in the background, programs do not have to be suspended
Controlled by the user, no automatic checkpointing yet
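A user-controlled checkpointing sketch using the standard Spark calls; the checkpoint directory, input path and iteration counts are illustrative.

    sc.setCheckpointDir("hdfs://namenode/checkpoints")        // reliable storage for checkpoints

    var ranks = sc.textFile("hdfs://namenode/ranks")          // hypothetical input
      .map(line => (line, 1.0))
    for (i <- 1 to 30) {
      ranks = ranks.mapValues(_ * 0.85)                       // lineage grows with every iteration
      if (i % 10 == 0) ranks.checkpoint()                     // user decides when to truncate the lineage
    }
    ranks.count()                                             // the next action materializes the checkpoint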
EVALUATION
Testing environment
Amazon Elastic Compute Cloud (EC2)
m1.xlarge nodes
4 cores / node
15 GB of RAM / node
HDFS with 256 MB blocks
Iterative machine learning
10 iterations on 100 GB of data
Run on 25, 50, and 100 nodes
Iterative machine learning (cont'd)
Different algorithms
K-means is more compute-intensive
Logistic regression is more sensitive to I/O and deserialization
Minimal overhead in Spark
25.3× / 20.7× speedup over Hadoop / HadoopBinMem with logistic regression
3.2× / 1.9× with K-means
Outperforms even HadoopBinMem (in-memory binary data)
PageRank
10 iterations on a 54 GB Wikipedia dump
Approximately 4 million articles
Run on 30 and 60 nodes
Linear speedup with the number of nodes
2.4× speedup over Hadoop with in-memory storage only
7.4× with partition control as well
Fault recovery
10 iterations of K-means with 100 GB of data on 75 nodes
Failure at the 6th iteration
Fault recovery (cont'd)
Loss of tasks and partitions on the failed node
Tasks rescheduled on different nodes
Missing partitions recomputed in parallel
Lineage graphs less than 10 KB
Checkpointing would instead require:
Running several iterations again
Replicating all 100 GB over the network
Consuming twice the memory, or writing all 100 GB to disk
Low memory
Logistic regression with various amounts of RAM
Graceful degradation with less space
Interactive data mining
1 TB of Wikipedia page view logs (2 years of data)
Run on 100 m2.4xlarge nodes
8 cores and 68 GB of RAM per node
True interactivity (less than 7 s)
Querying from disk took 170 s
CONCLUSIONS
Applications
Nothing new under the sun
In-memory computing, lineage tracking, partitioning and fast recovery are already available in other frameworks (separately)
RDDs can provide all these features in a single framework
RDDs can express existing cluster programming models
Same output, better performance
Examples: MapReduce, SQL, Google's Pregel, batched stream processing (periodically updating results with new data)
Advantages
Dramatic speedup with reused data (depending on application)
Fast fault recovery thanks to lightweight logging of transformations
Efficiency under control of the user (storage, partitioning)
Graceful performance degradation with low RAM
High expressivity
Versatility
Interactivity
Open source
Limitations
Not suited for fine-grained transformations
Overhead from logging too many lineage graphs
Traditional data logging and checkpointing perform better
Thanks!
