RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING 
Matei Zaharia et al.
University of California, Berkeley
Alessandro Menabò, Politecnico di Torino, Italy
INTRODUCTION
Motivations
Interactive (real-time) data mining
Reuse of intermediate results (iterative algorithms)
Examples:
Machine learning
K-means clustering
PageRank
Limitations of current frameworks
Data reuse usually through disk storage
Disk I/O latency and serialization
Too high-level abstractions
Implicit memory management
Implicit work distribution
Fault tolerance through data replication and logging
High network traffic
Goals
Keep frequently used data in main memory
Efficient fault recovery
Log data transformations rather than data itself
User control
RESILIENT DISTRIBUTED DATASETS (RDDs)
What is an RDD?
Read-only, partitioned collection of records (often key-value pairs)
Created through transformations
From stored data or other RDDs
Coarse-grained: same operation on the whole dataset
Examples: map, filter, join
Lineage: sequence of transformations that created the RDD
Key to efficient fault recovery
Used through actions
Return a result or store data
Examples: count, collect, save
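A minimal sketch of the transformation/action split, assuming a SparkContext named sc and hypothetical HDFS paths; nothing is executed until the actions at the end run.

    // Build RDDs lazily through transformations; only the actions trigger work.
    val lines  = sc.textFile("hdfs://namenode/logs")          // hypothetical input path
    val errors = lines.filter(_.startsWith("ERROR"))          // transformation: filter
    val pairs  = errors.map(line => (line.split("\t")(0), 1)) // transformation: map to key-value pairs
    val total  = errors.count()                               // action: computes and returns a result
    pairs.saveAsTextFile("hdfs://namenode/error-pairs")       // action: stores data (hypothetical path)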
What is an RDD? (cont'd)
Lazy computation
RDDs are computed only when the first action is invoked
Persistence control
Choose RDDs to be reused, and how to store them (e.g. in memory)
Partitioning control
Define how to distribute RDDs across cluster nodes
Minimize inter-node communication
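A short sketch of persistence and partitioning control, assuming a SparkContext sc as before; cache() and HashPartitioner are standard Spark calls, while the input path and partition count are illustrative.

    import org.apache.spark.HashPartitioner

    // Persistence control: mark a reused RDD to be kept in memory after its first computation.
    val ranks = sc.textFile("hdfs://namenode/ranks")                    // hypothetical input
      .map { line => val f = line.split("\t"); (f(0), f(1).toDouble) }  // (url, rank) pairs
      .cache()                                                          // keep in memory for reuse

    // Partitioning control: hash-partition by key so later joins on the same key stay node-local.
    val partitioned = ranks.partitionBy(new HashPartitioner(100))       // 100 partitions, illustrative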
Implementation
Apache Spark cluster computing framework
Open source
Uses the Hadoop Distributed File System (HDFS, by Apache) for storage
Scala programming language
Interoperates with Java, compiles to Java bytecode
Object-oriented and functional programming
Statically typed, efficient and concise
Spark programming interface
Driver program
Defines RDDs and invokes actions on them
Tracks the RDDs' lineage
Assigns workload to workers
Workers
Persistent processes on cluster nodes
Perform operations on the data
Can store partitions of RDDs in RAM
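For illustration, a minimal driver program under the modern Spark API (SparkConf/SparkContext); the application name and paths are hypothetical. The driver only builds the lineage and invokes the actions; the workers execute the tasks and can cache partitions.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val counts = sc.textFile("hdfs://namenode/input")   // hypothetical input path
          .flatMap(_.split("\\s+"))                         // split lines into words
          .map(word => (word, 1))
          .reduceByKey(_ + _)                               // aggregated on the workers
        counts.saveAsTextFile("hdfs://namenode/output")     // action: triggers the whole job
        sc.stop()
      }
    }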
Example: PageRank
Iterative algorithm
Updates document rank based on contributions from documents that link to it
Example: PageRank (cont'd)
The lineage graph grows with the number of iterations
Replicate some intermediate results to speed up fault recovery
Reduce communication overhead:
Partition both links and ranks by URL in the same way
Joining them can then be done on the same node
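A sketch of the PageRank loop, close to the version in the RDD paper, assuming a SparkContext sc and a hypothetical link file with "source destination" pairs per line; co-partitioning links and ranks with the same HashPartitioner keeps the per-iteration join local to each node.

    import org.apache.spark.HashPartitioner

    // Adjacency lists, hash-partitioned once and kept in memory across iterations.
    val links = sc.textFile("hdfs://namenode/links")                 // hypothetical input
      .map { line => val p = line.split("\\s+"); (p(0), p(1)) }      // (source, destination)
      .groupByKey()
      .partitionBy(new HashPartitioner(100))                         // partition count is illustrative
      .cache()

    var ranks = links.mapValues(_ => 1.0)                            // initial rank for every URL

    for (_ <- 1 to 10) {
      // Each page sends rank / out-degree to its neighbours; the join reuses the links partitioning.
      val contribs = links.join(ranks).flatMap { case (_, (dests, rank)) =>
        dests.map(dest => (dest, rank / dests.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _) // damping factor 0.85
    }
    ranks.saveAsTextFile("hdfs://namenode/ranks")                    // action: materializes the result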
RDD representation
Goals
Easily track lineage
Support rich set of transformations
Keep system as simple as possible (uniform interface, avoid ad-hoc logic)
Graph-based structure
Set of partitions (pieces of the dataset)
Set of dependencies on parent RDDs
Function for computing the dataset from parent RDDs
Metadata about partitioning and data location
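The graph-based structure above corresponds to the common RDD interface described in the paper; below is a paraphrase as a Scala trait with placeholder types. The names are illustrative, not the actual Spark source.

    // Placeholder types so the sketch is self-contained.
    trait Partition { def index: Int }
    trait Dependency[T]
    trait Partitioner

    // Paraphrased RDD interface: partitions, dependencies, a compute function, and metadata.
    trait RDD[T] {
      def partitions: Seq[Partition]                          // pieces of the dataset
      def dependencies: Seq[Dependency[_]]                    // links to the parent RDDs
      def compute(split: Partition): Iterator[T]              // derive a partition from its parents
      def preferredLocations(split: Partition): Seq[String]   // data-locality hints (e.g. HDFS block hosts)
      def partitioner: Option[Partitioner]                    // how records are partitioned, if at all
    }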
Dependencies
Narrow dependencies
Each partition of the parent is used by at most one partition of the child
Examples: map, filter, union
Wide dependencies
Each partition of the parent may be used by many partitions of the child
Examples: join, groupByKey
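A two-line illustration of the difference, assuming a SparkContext sc: mapValues keeps each child partition tied to a single parent partition, while groupByKey may pull records from every parent partition (a shuffle).

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val narrow = pairs.mapValues(_ * 2)   // narrow: each child partition reads one parent partition
    val wide   = pairs.groupByKey()       // wide: each child partition may read all parent partitions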
Dependencies (cont'd)
Normal execution
Narrow: pipelined (e.g. map + filter one element at a time)
Wide: serial (all parents need to be available before computation starts)
Fault recovery
Narrow: fast (only one parent partition has to be recomputed)
Wide: full (one failed node may require all parents to be recomputed)
OVERVIEW OF SPARK
Scheduling
Tracks in-memory partitions
On action request:
Examines lineage and builds a DAG of execution stages
Each stage contains as many transformations with narrow dependencies as possible
Stage boundaries correspond to wide dependencies, or to already computed partitions
Launches tasks to compute missing partitions until the desired RDD is computed
Tasks assigned according to in-memory data locality
Otherwise assigned to the RDD's preferred location (user-specified)
Scheduling (cont'd)
On task failure, re-run it on another node if all parents are still available
If stages become unavailable, re-run parent tasks in parallel
Scheduler failures not addressed
Replicate the lineage graph?
Interactivity
Desirable given low-latency in-memory capabilities
Scala shell integration
Each line is compiled into a Java class and run in the JVM
Bytecode shipped to workers via HTTP
Memory management
Persistent RDD storage modes:
In-memory, deserialized objects: fastest (native support by the JVM)
In-memory, serialized objects: more memory-efficient, but slower
On-disk: if the RDD does not fit into RAM, but is too costly to recompute every time
LRU eviction policy of entire RDDs when a new partition does not fit into RAM
Unless the new partition belongs to the LRU RDD
Separate memory space on each node
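The three storage modes map onto Spark storage levels roughly as sketched below (modern API names; an RDD can be given only one level, so three separate RDDs with hypothetical inputs are shown).

    import org.apache.spark.storage.StorageLevel

    val a = sc.textFile("hdfs://namenode/a")   // hypothetical inputs
    val b = sc.textFile("hdfs://namenode/b")
    val c = sc.textFile("hdfs://namenode/c")

    a.persist(StorageLevel.MEMORY_ONLY)        // deserialized objects in RAM: fastest access
    b.persist(StorageLevel.MEMORY_ONLY_SER)    // serialized bytes in RAM: denser but slower to read
    c.persist(StorageLevel.DISK_ONLY)          // on disk: too costly to recompute, does not fit in RAM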
Checkpointing
Save intermediate RDDs to disk (replication)
Speeds up recovery of RDDs with long lineage or wide dependencies
Pointless with short lineage or narrow dependencies (recomputing partitions in parallel is less costly than replicating the whole RDD)
Not strictly required, but nice to have
Easy because RDDs are read-only
No consistency issues or distributed coordination required
Done in the background, programs do not have to be suspended
Controlled by the user, no automatic checkpointing yet
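A user-controlled checkpointing sketch using the standard Spark calls; the checkpoint directory, input path and iteration counts are illustrative.

    sc.setCheckpointDir("hdfs://namenode/checkpoints")        // reliable storage for checkpoints

    var ranks = sc.textFile("hdfs://namenode/ranks")          // hypothetical input
      .map(line => (line, 1.0))
    for (i <- 1 to 30) {
      ranks = ranks.mapValues(_ * 0.85)                       // lineage grows with every iteration
      if (i % 10 == 0) ranks.checkpoint()                     // user decides when to truncate the lineage
    }
    ranks.count()                                             // the next action materializes the checkpoint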
EVALUATION
Testing environment
Amazon Elastic Compute Cloud (EC2)
m1.xlarge nodes
4 cores / node
15 GB of RAM / node
HDFS with 256 MB blocks
Iterative machine learning
10 iterations on 100 GB of data
Run on 25, 50, and 100 nodes
Iterative machine learning (cont'd)
Different algorithms
K-means is more compute-intensive
Logistic regression is more sensitive to I/O and deserialization
Minimal overhead in Spark
25.3× / 20.7× speedup over Hadoop / HadoopBinMem with logistic regression
3.2× / 1.9× with K-means
Outperforms even HadoopBinMem (in-memory binary data)
PageRank
10 iterations on a 54 GB Wikipedia dump
Approximately 4 million articles
Run on 30 and 60 nodes
Linear speedup with the number of nodes
2.4× speedup over Hadoop with in-memory storage only
7.4× with partition control as well
Fault recovery
10 iterations of K-means with 100 GB of data on 75 nodes
Failure at the 6th iteration
Fault recovery (cont'd)
Loss of tasks and partitions on the failed node
Tasks rescheduled on different nodes
Missing partitions recomputed in parallel
Lineage graphs less than 10 KB
Checkpointing would instead require:
Running several iterations again
Replicating all 100 GB over the network
Consuming twice the memory, or writing all 100 GB to disk
Low memory
Logistic regression with various amounts of RAM
Graceful degradation with less space
Interactive data mining
1 TB of Wikipedia page view logs (2 years of data)
Run on 100 m2.4xlarge nodes
8 cores and 68 GB of RAM per node
True interactivity (less than 7 s)
Querying from disk took 170 s
CONCLUSIONS
Applications
Nothing new under the sun
In-memory computing, lineage tracking, partitioning and fast recovery are already available in other frameworks (separately)
RDDs can provide all these features in a single framework
RDDs can express existing cluster programming models
Same output, better performance
Examples: MapReduce, SQL, Google's Pregel, batched stream processing (periodically updating results with new data)
Advantages
Dramatic speedup with reused data (depending on application)
Fast fault recovery thanks to lightweight logging of transformations
Efficiency under control of the user (storage, partitioning)
Graceful performance degradation with low RAM
High expressivity
Versatility
Interactivity
Open source
Limitations
Not suited for fine-grained transformations
Overhead from logging too many lineage graphs
Traditional data logging and checkpointing perform better
Thanks!
