



Small presentation about Spark



  1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Presentation by Mário Almeida
  2. Outline: Motivation; RDDs Overview; Spark; Data Sharing; Example - Log Mining; Fault Tolerance; Example - Logistic Regression; RDD Representation; Evaluation; Conclusion
  3. Motivation: How to perform large-scale data analytics? ● MapReduce ● Dryad. Problem? ● Reuse of intermediate results? Through a DFS? Overhead! ● Pregel? No abstraction for general reuse! ● How to provide fault tolerance efficiently? Shared memory? Key-value stores? Piccolo? Fine-grained updates!
  4. RDDs Overview: Read-only, partitioned collection of records. Created through transformations on data in stable storage or on other RDDs. Has information on the lineage of transformations. Control over partitioning and persistence (e.g. non-serialized in-memory storage).
  5. Spark: Exposes RDDs through a language-integrated API. RDDs can be used in actions, which return a value or export it to a storage system (e.g. count, collect and save). The persist method indicates which RDDs to reuse (default: stored in memory).
  6. Data Sharing in MapReduce: Overhead from replication, serialization and disk I/O!
  7. Data Sharing in Spark: 10-100x faster than network and disk
  8. Example - Log Mining: Load error messages into memory and search for patterns. 1 TB in 5-7 sec (170 sec for on-disk data).
  9. Fault Tolerance: RDDs keep information about the transformations used to build them. This lineage can be used to recover lost data.
  10. Example - Logistic Regression: Data is loaded into memory only once! Repeated MapReduce steps calculate the gradient. Many machine learning algorithms are iterative in nature because they run iterative optimization procedures!
  11. Logistic Regression Performance: 30 GB dataset; 20 * 4 cores w/ 15 GB. Hadoop: 127 s/iteration. Spark: 1st iteration 174 s, afterwards 6 s.
  12. Representing RDDs: Wide dependencies are harder to recover! Wide dependencies require data from all parents. Narrow dependencies allow pipelined execution.
  13. Evaluation - Iteration times: Computation intensive. Extra MR job to convert to binary. Heartbeat protocol.
  14. Evaluation - Number of machines (chart labels: 1.9x, 25.3x, 3.2x, 20.7x)
  15. Evaluation - Partitioning: PageRank algorithm on a 54 GB dataset that builds a link graph of 4 million articles.
  16. Evaluation - Failures: 100 GB working set
  17. Conclusion: Spark is up to 20x faster than Hadoop for iterative applications (I/O and serialization). Can interactively scan 1 TB (5-7 s latency). Quick recovery (rebuilds lost RDD partitions). Pregel/HaLoop can be built on top of Spark. Good for batch applications that apply the same operation to all elements of a dataset.
  18. References: ● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing ● slideshare :/Hadoop_Summit/spark-and-shark
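
The lineage idea from the RDDs Overview, Log Mining and Fault Tolerance slides can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's API: `ToyRDD`, `from_partitions` and `lose_partition` are hypothetical names, every partition is a list inside one process, only narrow dependencies (map, filter) are modeled, and the log lines are made-up sample data.

```python
# Toy model of RDD lineage in plain Python. NOT Spark's API:
# ToyRDD and all of its methods are hypothetical, for illustration only.

class ToyRDD:
    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: how to derive partition i
        self.parent = parent      # the RDD this one was transformed from
        self._persist = False
        self._cache = {}          # partition index -> materialized records

    @classmethod
    def from_partitions(cls, partitions):
        # Stand-in for reading partitioned data from stable storage.
        data = [list(p) for p in partitions]
        return cls(len(data), lambda i: list(data[i]))

    # Transformations are lazy: they only record how to compute a partition.
    def map(self, fn):
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)],
                      parent=self)

    def persist(self):
        # Mark this RDD for reuse; partitions are kept in memory once computed.
        self._persist = True
        return self

    def partition(self, i):
        if i in self._cache:
            return self._cache[i]
        records = self._compute(i)   # (re)build from lineage
        if self._persist:
            self._cache[i] = records
        return records

    # Actions force evaluation and return a value.
    def count(self):
        return sum(len(self.partition(i)) for i in range(self.num_partitions))

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        # Simulate a node failure that drops one cached partition.
        self._cache.pop(i, None)


# Log mining on made-up sample data split into two partitions.
logs = ToyRDD.from_partitions([
    ["ERROR disk full", "INFO started"],
    ["ERROR timeout", "WARN slow", "ERROR disk full"],
])
errors = logs.filter(lambda line: line.startswith("ERROR")).persist()
total_errors = errors.count()       # first action: computes and caches

errors.lose_partition(1)            # simulated failure
disk_errors = errors.filter(lambda line: "disk" in line).count()
```

Because each `ToyRDD` stores only how to derive a partition from its parent, losing a cached partition costs a recomputation of that one partition, which corresponds to the narrow-dependency case on the Representing RDDs slide. A wide dependency would need data from every parent partition, so this sketch deliberately leaves it out.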