
Resilient Distributed Datasets

Presentation slides for the paper on Resilient Distributed Datasets, written by Matei Zaharia et al. at the University of California, Berkeley.

The paper is not my work.

These slides were made for the Advanced Distributed Systems course taught by Prof. Bratsberg at NTNU (Norwegian University of Science and Technology, Trondheim, Norway).

Resilient Distributed Datasets

  1. RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING. Matei Zaharia et al., University of California, Berkeley
  2. Alessandro Menabò, Politecnico di Torino, Italy
  3. INTRODUCTION
  4. Motivations
     - Interactive (real-time) data mining
     - Reuse of intermediate results (iterative algorithms)
     - Examples: machine learning, K-means clustering, PageRank
  5. Limitations of current frameworks
     - Data reuse usually goes through disk storage: disk I/O latency and serialization costs
     - Abstractions are too high-level: implicit memory management, implicit work distribution
     - Fault tolerance through data replication and logging: high network traffic
  6. Goals
     - Keep frequently used data in main memory
     - Efficient fault recovery: log data transformations rather than the data itself
     - User control
  7. RESILIENT DISTRIBUTED DATASETS (RDDs)
  8. What is an RDD?
     - Read-only, partitioned collection of records in key-value form
     - Created through transformations, from stored data or from other RDDs
       - Coarse-grained: the same operation is applied to the whole dataset
       - Examples: map, filter, join
     - Lineage: the sequence of transformations that created the RDD; the key to efficient fault recovery
     - Used through actions, which return a result or store data
       - Examples: count, collect, save
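
A minimal sketch of transformations versus actions, assuming an existing SparkContext named sc and a hypothetical log file in HDFS:

```scala
// Transformations build the RDD lazily; nothing runs until an action is invoked.
val lines  = sc.textFile("hdfs://.../errors.log")          // RDD from stored data
val errors = lines.filter(_.contains("ERROR"))             // transformation: filter
val pairs  = errors.map(line => (line.split("\t")(0), 1))  // transformation: map to key-value form

val total = errors.count()                                 // action: returns a result to the driver
pairs.saveAsTextFile("hdfs://.../error-keys")              // action: stores data
```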
  9. What is an RDD? (cont'd)
     - Lazy computation: RDDs are computed only when the first action is invoked
     - Persistence control: choose which RDDs will be reused, and how to store them (e.g. in memory)
     - Partitioning control: define how to distribute RDDs across cluster nodes, to minimize inter-node communication
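
A small sketch of the persistence and partitioning controls, again assuming an existing SparkContext named sc; the input path, partition count and choice of HashPartitioner are illustrative:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.textFile("hdfs://.../events")
  .map(line => (line.split("\t")(0), 1))

// Persistence control: mark the RDD for reuse; it is still computed lazily
// and kept in memory only once the first action materializes it.
val cached = pairs.persist()

// Partitioning control: hash-partition by key so that records with the same
// key end up on the same node, reducing shuffling in later joins.
val byKey = cached.partitionBy(new HashPartitioner(100))
```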
  10. Implementation
     - Apache Spark cluster computing framework, open source
     - Based on the Hadoop Distributed File System (HDFS, by Apache)
     - Scala programming language: derived from Java and compiled to Java bytecode; object-oriented and functional; statically typed, efficient and concise
  11. Spark programming interface
     - Driver program: defines and invokes actions on RDDs, tracks the RDDs' lineage, assigns workload to workers
     - Workers: persistent processes on cluster nodes; perform actions on data; can store partitions of RDDs in RAM
  12. Example: PageRank
     - Iterative algorithm: updates each document's rank based on contributions from the documents that link to it
  13. Example: PageRank (cont'd)
     - The lineage graph grows with the number of iterations: replicate some intermediate results to speed up fault recovery
     - Reduce communication overhead: partition both links and ranks by URL in the same way, so that joining them can be done on the same node
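
A condensed Scala sketch of this loop, in the spirit of the paper's PageRank example; the input path, partition count, iteration count and 0.85 damping factor are illustrative assumptions:

```scala
import org.apache.spark.HashPartitioner

// links: (url, outgoing links), hash-partitioned by URL and kept in memory,
// so that each join with ranks can be computed locally on one node.
val links = sc.textFile("hdfs://.../links")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts.drop(1).toSeq)
  }
  .partitionBy(new HashPartitioner(100))
  .persist()

var ranks = links.mapValues(_ => 1.0)            // initial rank for every URL

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).flatMap {     // contribution sent along each outgoing link
    case (_, (outLinks, rank)) =>
      outLinks.map(dest => (dest, rank / outLinks.size))
  }
  ranks = contribs.reduceByKey(_ + _)            // sum contributions per URL
    .mapValues(v => 0.15 + 0.85 * v)             // re-weight with the damping factor
}
```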
  14. RDD representation
     - Goals: easily track lineage; support a rich set of transformations; keep the system as simple as possible (uniform interface, avoid ad-hoc logic)
     - Graph-based structure: a set of partitions (pieces of the dataset); a set of dependencies on parent RDDs; a function for computing the dataset from the parent RDDs; metadata about partitioning and data location
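
A simplified Scala sketch of such a uniform, per-RDD interface; the trait and type names are illustrative rather than Spark's actual internals, but the information exposed matches the slide:

```scala
// Illustrative types, not Spark's real internal API.
trait Partition { def index: Int }
trait Dependency

trait RDDInterface[T] {
  def partitions: Seq[Partition]                          // pieces of the dataset
  def dependencies: Seq[Dependency]                       // links to parent RDDs
  def compute(split: Partition): Iterator[T]              // derive a partition from its parents
  def preferredLocations(split: Partition): Seq[String]   // data-placement metadata (e.g. HDFS hosts)
  def partitioner: Option[AnyRef]                         // how records are hash/range partitioned
}
```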
  15. Dependencies
     - Narrow dependencies: each partition of the parent is used by at most one partition of the child (examples: map, filter, union)
     - Wide dependencies: each partition of the parent may be used by many partitions of the child (examples: join, groupByKey)
  16. Dependencies (cont'd)
     - Normal execution: narrow dependencies can be pipelined (e.g. map + filter, one element at a time); wide dependencies run serially (all parents need to be available before computation starts)
     - Fault recovery: narrow is fast (only one parent partition has to be recomputed); wide can require a full recomputation (one failed node may require all parents to be recomputed)
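
A minimal illustration of the contrast, assuming an existing SparkContext named sc; the first chain uses only narrow dependencies and runs as one pipelined stage, while groupByKey introduces a wide dependency and a shuffle:

```scala
// Narrow dependencies: each child partition reads from at most one parent
// partition, so filter and map are pipelined element by element in one stage.
val squares = sc.parallelize(1 to 1000000)
  .filter(_ % 2 == 0)
  .map(x => x.toLong * x)

// Wide dependency: every child partition may need records from every parent
// partition, so groupByKey forces a shuffle and a stage boundary.
val grouped = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .groupByKey()
```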
  17. OVERVIEW OF SPARK
  18. Scheduling
     - Tracks in-memory partitions
     - On an action request: examines the lineage and builds a DAG of execution stages; each stage contains as many transformations with narrow dependencies as possible; stage boundaries correspond to wide dependencies, or to already computed partitions
     - Launches tasks to compute missing partitions until the desired RDD is computed
     - Tasks are assigned according to in-memory data locality, otherwise to the RDD's preferred locations (user-specified)
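
One way to see the stage structure the scheduler will build is the lineage printout that Spark's RDD API exposes; a small sketch, assuming an existing SparkContext named sc:

```scala
// toDebugString prints the lineage of an RDD; shuffle (wide) dependencies
// show up as new indentation levels, which correspond to stage boundaries.
val wordCounts = sc.parallelize(Seq("a b", "b c", "a a"))
  .flatMap(_.split(" "))        // narrow: same stage
  .map(word => (word, 1))       // narrow: same stage
  .reduceByKey(_ + _)           // wide: new stage after the shuffle

println(wordCounts.toDebugString)
```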
  19. Scheduling (cont'd)
     - On task failure, the task is re-run on another node if all its parents are still available
     - If whole stages become unavailable, the parent tasks are re-run in parallel
     - Scheduler failures are not addressed (replicate the lineage graph?)
  20. Interactivity
     - Desirable given the low-latency, in-memory capabilities
     - Scala shell integration: each line is compiled into a Java class and run in the JVM; bytecode is shipped to the workers via HTTP
  21. Memory management
     - Storage modes for persistent RDDs:
       - In-memory, deserialized objects: fastest (native support by the JVM)
       - In-memory, serialized objects: more memory-efficient, but slower
       - On-disk: for RDDs that do not fit into RAM but are too costly to recompute every time
     - LRU eviction of entire RDDs when a new partition does not fit into RAM, unless the new partition belongs to the LRU RDD itself
     - Separate memory space on each node
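
These three modes correspond to storage levels in Spark's API; a minimal sketch assuming an existing SparkContext named sc (an RDD can be assigned only one storage level, hence three separate RDDs):

```scala
import org.apache.spark.storage.StorageLevel

val a = sc.parallelize(1 to 1000000)
val b = sc.parallelize(1 to 1000000)
val c = sc.parallelize(1 to 1000000)

a.persist(StorageLevel.MEMORY_ONLY)      // deserialized objects in RAM: fastest access
b.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes in RAM: denser, slower to read
c.persist(StorageLevel.DISK_ONLY)        // on disk: when recomputation is too costly and RAM is short
```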
  22. Checkpointing
     - Save intermediate RDDs to disk (replication)
     - Speeds up recovery of RDDs with long lineage or wide dependencies
     - Pointless with short lineage or narrow dependencies (recomputing partitions in parallel is less costly than replicating the whole RDD)
     - Not strictly required, but nice to have
     - Easy because RDDs are read-only: no consistency issues or distributed coordination required; done in the background, so programs do not have to be suspended
     - Controlled by the user; no automatic checkpointing yet
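
A small sketch of user-controlled checkpointing as Spark exposes it; the checkpoint directory and the synthetic long-lineage RDD are illustrative:

```scala
sc.setCheckpointDir("hdfs://.../checkpoints")   // reliable storage for checkpoint files

// Build an RDD with a deliberately long lineage (50 chained maps).
val longLineage = (1 to 50).foldLeft(sc.parallelize(1 to 1000)) {
  (rdd, _) => rdd.map(_ + 1)
}

longLineage.checkpoint()   // mark the RDD; its data is saved when it is next computed
longLineage.count()        // the action triggers both the computation and the checkpoint
```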
  23. EVALUATION
  24. Testing environment
     - Amazon Elastic Compute Cloud (EC2), m1.xlarge nodes
     - 4 cores and 15 GB of RAM per node
     - HDFS with 256 MB blocks
  25. Iterative machine learning
     - 10 iterations on 100 GB of data
     - Run on 25, 50, and 100 nodes
  26. Iterative machine learning (cont'd)
     - Different algorithms: K-means is more compute-intensive; logistic regression is more sensitive to I/O and deserialization
     - Spark's overhead is minimal: 25.3× / 20.7× speedup with logistic regression, 3.2× / 1.9× with K-means (versus Hadoop / HadoopBinMem)
     - Outperforms even HadoopBinMem (in-memory binary data)
  27. PageRank
     - 10 iterations on a 54 GB Wikipedia dump (approximately 4 million articles)
     - Run on 30 and 60 nodes
     - Linear speedup with the number of nodes
     - 2.4× speedup with in-memory storage only, 7.4× when also controlling partitioning
  28. Fault recovery
     - 10 iterations of K-means with 100 GB of data on 75 nodes
     - Failure at the 6th iteration
  29. Fault recovery (cont'd)
     - Loss of the tasks and partitions on the failed node: tasks are rescheduled on different nodes, missing partitions are recomputed in parallel
     - Lineage graphs are less than 10 KB
     - Checkpointing would instead require running several iterations again, replicating all 100 GB over the network, and consuming twice the memory or writing all 100 GB to disk
  30. Low memory
     - Logistic regression with various amounts of RAM
     - Graceful degradation with less space
  31. Interactive data mining
     - 1 TB of Wikipedia page view logs (2 years of data)
     - Run on 100 m2.4xlarge nodes (8 cores and 68 GB of RAM per node)
     - True interactivity (less than 7 s); querying from disk took 170 s
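
The queries were typed interactively into the Spark shell; a sketch of such an ad-hoc query, with a hypothetical log path and field layout:

```scala
// Load the page-view logs once and keep them in memory for repeated queries.
val views = sc.textFile("hdfs://.../pagecounts").persist()

// e.g. total views of pages whose title contains a given word
// (assumed layout per line: project, title, view count, ...)
val hits = views
  .map(_.split(" "))
  .filter(fields => fields.length >= 3 && fields(1).contains("Spark"))
  .map(fields => fields(2).toLong)
  .sum()
```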
  32. CONCLUSIONS
  33. Applications
     - Nothing new under the sun: in-memory computing, lineage tracking, partitioning and fast recovery are already available in other frameworks, but separately
     - RDDs provide all of these features in a single framework
     - RDDs can express existing cluster programming models with the same output and better performance: MapReduce, SQL, Google's Pregel, batched stream processing (periodically updating results with new data)
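
As an illustration of expressing MapReduce with RDD operations, a minimal word count sketch, assuming an existing SparkContext named sc and hypothetical input and output paths:

```scala
// Word count: flatMap/map play the role of the map phase,
// reduceByKey plays the role of the reduce phase.
val counts = sc.textFile("hdfs://.../input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://.../word-counts")
```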
  34. Advantages
     - Dramatic speedup with reused data (depending on the application)
     - Fast fault recovery thanks to lightweight logging of transformations
     - Efficiency under user control (storage, partitioning)
     - Graceful performance degradation with low RAM
     - High expressivity, versatility, interactivity
     - Open source
  35. Limitations
     - Not suited for fine-grained transformations (many small updates to shared state): the overhead of logging so many lineage graphs would dominate, and traditional data logging and checkpointing perform better
  36. Thanks!

