Presentation slides for the paper on Resilient Distributed Datasets, written by Matei Zaharia et al. at the University of California, Berkeley.
The paper is not my work.
These slides were made for the course on Advanced, Distributed Systems held by Prof. Bratsberg at NTNU (Norwegian University of Science and Technology, Trondheim, Norway).
13. Example: PageRank (cont’d)
The graph grows with the number of iterations
Replicate some intermediate results to speed up fault recovery
Reduce communication overhead
Partition both links and ranks by URL in the same way
Joining them can be done on the same node
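
The co-partitioning idea above can be sketched in plain Python (this is not Spark code). The names `NUM_PARTITIONS`, `partition_of`, `partition_by_url` and `local_join` are hypothetical helpers introduced only for illustration, and the CRC32 hash is an assumed stand-in for whatever partitioner a real system would use.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical number of nodes/partitions


def partition_of(url):
    """Deterministic hash partitioner (assumed; any stable hash would do)."""
    return zlib.crc32(url.encode("utf-8")) % NUM_PARTITIONS


def partition_by_url(pairs):
    """Place each (url, value) pair into the partition chosen by its URL."""
    parts = [dict() for _ in range(NUM_PARTITIONS)]
    for url, value in pairs:
        parts[partition_of(url)][url] = value
    return parts


def local_join(links_parts, ranks_parts):
    """Join partition i of links only with partition i of ranks (no shuffle)."""
    joined = []
    for links_p, ranks_p in zip(links_parts, ranks_parts):
        joined.append({url: (links_p[url], ranks_p[url])
                       for url in links_p if url in ranks_p})
    return joined


links = [("a.com", ["b.com", "c.com"]), ("b.com", ["a.com"]), ("c.com", ["a.com"])]
ranks = [("a.com", 1.0), ("b.com", 1.0), ("c.com", 1.0)]

joined = local_join(partition_by_url(links), partition_by_url(ranks))
```

Because both datasets use the same partitioner, every URL finds its rank in the partition with the same index, so no data has to move between nodes for the join.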
14. RDD representation
Goals
Easily track lineage
Support rich set of transformations
Keep system as simple as possible (uniform interface, avoid ad-hoc logic)
Graph-based structure
Set of partitions (pieces of the dataset)
Set of dependencies on parent RDDs
Function for computing the dataset from parent RDDs
Metadata about partitioning and data location
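
A minimal sketch of how the four pieces listed above could fit behind one uniform interface. This is a toy model in plain Python, not Spark's actual classes: `SketchRDD`, `from_lists` and `map_rdd` are hypothetical names, and the sketch assumes one-to-one dependencies only (child partition p reads parent partition p).

```python
class SketchRDD:
    """Toy RDD: partitions, parent dependencies, a compute function, metadata."""

    def __init__(self, num_partitions, compute, parents=(),
                 partitioner=None, locations=None):
        self.num_partitions = num_partitions  # set of partitions
        self.parents = list(parents)          # dependencies on parent RDDs
        self.compute = compute                # builds a partition from its parents
        self.partitioner = partitioner        # metadata: partitioning scheme
        self.locations = locations or {}      # metadata: partition -> node names

    def partitions(self):
        return list(range(self.num_partitions))

    def preferred_locations(self, p):
        return self.locations.get(p, [])

    def iterator(self, p):
        # Assumption: child partition p depends only on parent partition p.
        parent_iters = [parent.iterator(p) for parent in self.parents]
        return self.compute(p, parent_iters)


def from_lists(chunks):
    """Source dataset with no parents; each inner list is one partition."""
    return SketchRDD(len(chunks), lambda p, _parents: iter(chunks[p]))


def map_rdd(parent, f):
    """Derived dataset whose lineage is just a reference to its parent."""
    return SketchRDD(parent.num_partitions,
                     lambda p, parents: (f(x) for x in parents[0]),
                     parents=[parent])


base = from_lists([[1, 2], [3, 4]])
doubled = map_rdd(base, lambda x: x * 2)
result = [list(doubled.iterator(p)) for p in doubled.partitions()]
```

Nothing is stored in `doubled`: its contents are rebuilt on demand from `base`, which is what makes lineage tracking cheap.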
15. Dependencies
Narrow dependencies
Each partition of the parent is used by at most one partition of the child
Example: map, filter, union
Wide dependencies
Each partition of the parent may be used by many partitions of the child
Example: join, groupByKey
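
The two definitions above can be checked mechanically. The sketch below assumes a dependency is described as a mapping from each child partition to the parent partitions it reads; `classify_dependency` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def classify_dependency(child_to_parents):
    """Return 'narrow' if every parent partition feeds at most one child partition."""
    uses = Counter(parent
                   for parents in child_to_parents.values()
                   for parent in set(parents))
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"


# map/filter: child partition i reads only parent partition i
map_dep = {0: [0], 1: [1], 2: [2]}

# groupByKey: every child partition may read from every parent partition
group_dep = {0: [0, 1, 2], 1: [0, 1, 2]}

map_kind = classify_dependency(map_dep)
group_kind = classify_dependency(group_dep)
```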
16. Dependencies (cont’d)
Normal execution
Narrow: pipelined (e.g. map + filter one element at a time)
Wide: serial (all parents need to be available before computation starts)
Fault recovery
Narrow: fast (only one parent partition has to be recomputed)
Wide: full (one failed node may require all parents to be recomputed)
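
Pipelining of narrow dependencies can be illustrated with Python generators, used here as an assumed stand-in for per-partition iterators. The `trace` list and the `traced_map` / `traced_filter` helpers are hypothetical and exist only to record the order in which elements are processed.

```python
def traced_map(f, items, trace):
    """Apply f lazily, recording each element as it enters the map stage."""
    for x in items:
        trace.append(("map", x))
        yield f(x)


def traced_filter(pred, items, trace):
    """Filter lazily, recording each element as it enters the filter stage."""
    for x in items:
        trace.append(("filter", x))
        if pred(x):
            yield x


trace = []
partition = [1, 2, 3]

# map + filter over one partition, chained without materializing the map output
out = list(traced_filter(lambda x: x > 2,
                         traced_map(lambda x: x * 2, partition, trace),
                         trace))
```

The trace alternates between the two stages: each element passes through both map and filter before the next element is read, so no intermediate dataset is ever built.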
21. Memory management
Persistent RDDs storage modes:
In-memory, deserialized object: fastest (native support by JVM)
In-memory, serialized object: more memory-efficient, but slower
On-disk: if RDD does not fit into RAM, but too costly to recompute every time
LRU eviction policy of entire RDD when new partition does not fit into RAM
Unless the new partition belongs to the LRU RDD
Separate memory space on each node
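
A rough sketch of the eviction rule above for a single node. Several things here are assumptions of the sketch rather than details from the slides: capacity is counted in partitions instead of bytes, victims are removed one partition at a time from the least recently used RDD, and the `RddCache` class is a hypothetical name.

```python
from collections import OrderedDict


class RddCache:
    """Per-node cache that evicts from the least recently used RDD."""

    def __init__(self, capacity):
        self.capacity = capacity   # max cached partitions (assumed >= 1)
        self.rdds = OrderedDict()  # rdd_id -> {partition_id: data}, LRU first

    def size(self):
        return sum(len(parts) for parts in self.rdds.values())

    def get(self, rdd_id, part_id):
        parts = self.rdds.get(rdd_id)
        if parts is None or part_id not in parts:
            return None
        self.rdds.move_to_end(rdd_id)  # mark the whole RDD as recently used
        return parts[part_id]

    def put(self, rdd_id, part_id, data):
        while self.size() >= self.capacity:
            lru_id = next(iter(self.rdds))
            if lru_id == rdd_id:
                # Exception from the slide: never evict the RDD that the
                # new partition belongs to; the new partition is not cached.
                return False
            victims = self.rdds[lru_id]
            victims.popitem()
            if not victims:
                del self.rdds[lru_id]
        self.rdds.setdefault(rdd_id, {})[part_id] = data
        self.rdds.move_to_end(rdd_id)
        return True


cache = RddCache(capacity=2)
stored = [cache.put("A", 0, "a0"), cache.put("A", 1, "a1")]
same_rdd = cache.put("A", 2, "a2")   # cache full, LRU RDD is A itself
other_rdd = cache.put("B", 0, "b0")  # cache full, evicts one partition of A
```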
22. Checkpointing
Save intermediate RDDs to disk (replication)
Speed up recovery of RDDs with long lineage or wide dependencies
Pointless with short lineage or narrow dependencies (recomputing partitions in parallel is less costly than replicating the whole RDD)
Not strictly required, but nice to have
Easy because RDDs are read-only
No consistency issues or distributed coordination required
Done in the background, programs do not have to be suspended
Controlled by the user, no automatic checkpointing yet
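
Why checkpointing pays off only for long lineage can be seen in a toy model. The sketch assumes a linear lineage chain of element-wise steps and an in-memory dict standing in for stable storage; `recover`, `steps` and `checkpoints` are hypothetical names.

```python
def recover(source, steps, checkpoints):
    """Rebuild the final dataset, replaying only the steps after the latest checkpoint.

    checkpoints maps a step index to the dataset saved right after that step.
    Returns the rebuilt data and the number of steps that had to be replayed.
    """
    start, data = 0, list(source)
    if checkpoints:
        latest = max(checkpoints)
        start, data = latest + 1, list(checkpoints[latest])
    replayed = 0
    for step in steps[start:]:
        data = [step(x) for x in data]
        replayed += 1
    return data, replayed


source = [0, 1]
steps = [lambda x: x + 1] * 10  # a lineage of 10 transformations

# Without a checkpoint the whole lineage is replayed.
full_data, full_replayed = recover(source, steps, {})

# With the result of step index 7 saved, only the last 2 steps are replayed.
ckpt_data, ckpt_replayed = recover(source, steps, {7: [8, 9]})
```

With a short lineage the saving shrinks to a step or two, which is the case the slide calls pointless.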
31. Interactive data mining
1 TB of Wikipedia page view logs (2 years of data)
Run on 100 m2.4xlarge nodes
8 cores and 68 GB of RAM per node
True interactivity (less than 7 s)
Querying from disk took 170 s
33. Applications
Nothing new under the sun
In-memory computing, lineage tracking, partitioning and fast recovery are already available in other frameworks (separately)
RDDs can provide all these features in a single framework
RDDs can express existing cluster programming models
Same output, better performance
Examples: MapReduce, SQL, Google’s Pregel, batched stream processing (periodically updating results with new data)
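
As an illustration of the MapReduce case, the model can be written as a flatMap followed by a groupByKey and a per-key map. The sketch below uses plain Python lists in place of distributed datasets; `flat_map`, `group_by_key` and `map_reduce` are hypothetical helpers that only mimic the corresponding RDD transformations.

```python
from collections import defaultdict


def flat_map(f, records):
    """Apply f to each record and concatenate the resulting lists."""
    return [out for record in records for out in f(record)]


def group_by_key(pairs):
    """Collect all values that share a key (the wide-dependency step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def map_reduce(records, mapper, reducer):
    """MapReduce expressed as flat_map -> group_by_key -> per-key map."""
    grouped = group_by_key(flat_map(mapper, records))
    return {key: reducer(key, values) for key, values in grouped.items()}


# Word count, the usual MapReduce example
counts = map_reduce(
    ["a b a", "b c"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda _word, ones: sum(ones),
)
```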