Slide 1: Motivation
How to perform large-scale data analytics?
● MapReduce
● Dryad
Problem: overhead! To reuse intermediate results, these frameworks must write them to a distributed file system; there is no abstraction for general reuse. Specialized systems such as Pregel support only specific computation patterns.
How to provide fault tolerance efficiently? Distributed shared memory and key-value stores (e.g. Piccolo) offer only fine-grained updates, which make fault tolerance expensive!
Slide 2: RDDs Overview
● Read-only, partitioned collection of records
● Created through transformations on data in stable storage or on other RDDs
● Carries information on the lineage of transformations used to build it
● Gives control over partitioning and persistence (e.g. non-serialized in-memory storage)
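The bullets above can be sketched in code. This is a minimal plain-Python illustration of the RDD idea (not Spark itself, and all names here are invented for illustration): an immutable, partitioned collection where a derived RDD stores only its parent and the transformation applied, so any partition can be recomputed by replaying that lineage.

```python
# Minimal sketch of an RDD-like abstraction: derived RDDs are defined
# by lineage (parent + transformation), not by materialized data.
class RDD:
    def __init__(self, partitions=None, parent=None, transform=None):
        # Either a base RDD with concrete partitions (from "stable
        # storage"), or a derived RDD defined by parent + transform.
        self._partitions = partitions
        self.parent = parent
        self.transform = transform

    def map(self, f):
        # Transformations only record lineage; nothing is computed yet.
        return RDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return RDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def compute_partition(self, i):
        # Materialize partition i by replaying the lineage chain.
        if self._partitions is not None:
            return self._partitions[i]
        return self.transform(self.parent.compute_partition(i))

    def num_partitions(self):
        r = self
        while r._partitions is None:
            r = r.parent
        return len(r._partitions)

    def collect(self):
        # An action: computes and returns all records.
        return [x for i in range(self.num_partitions())
                for x in self.compute_partition(i)]

base = RDD(partitions=[[1, 2, 3], [4, 5, 6]])
squares = base.map(lambda x: x * x).filter(lambda x: x > 10)
print(squares.collect())  # [16, 25, 36]
```

Because records are read-only and transformations are recorded, each partition of `squares` can be rebuilt independently from `base`.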
Slide 3: Spark
● Exposes RDDs through a language-integrated API
● RDDs can be used in actions, which return a value or export it to a storage system (e.g. count, collect, save)
● The persist method indicates which RDDs to reuse (by default they are stored in memory)
Slide 4: Data Sharing in MapReduce
● Overhead: replication, serialization, and disk I/O!
Slide 5: Data Sharing in Spark
● In-memory data sharing: 10-100x faster than network and disk
Slide 6: Example: Log Mining
● Load error messages from a log into memory, then interactively search for patterns
● 1 TB scanned in 5-7 s (vs. 170 s for on-disk data)
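The workflow on this slide can be sketched as follows. This is a plain-Python stand-in (the log lines are hypothetical; in Spark the filter/persist steps would be RDD operations over a file in HDFS): the error lines are extracted once and kept in memory, and every interactive query then touches only that cached subset.

```python
# Log-mining sketch: cache the (much smaller) error subset in memory,
# then run repeated interactive queries against it.
log_lines = [                      # hypothetical log file contents
    "INFO starting up",
    "ERROR hdfs write failed",
    "ERROR timeout contacting master",
    "INFO done",
]

# In Spark this step would be a filter transformation plus persist();
# here we just build the in-memory error set once.
errors = [l for l in log_lines if l.startswith("ERROR")]

# Interactive queries now scan only the cached errors, not the full log:
hdfs_errors = [l for l in errors if "hdfs" in l]
timeouts = [l for l in errors if "timeout" in l]
print(len(errors), len(hdfs_errors), len(timeouts))  # 2 1 1
```

The reported 5-7 s interactive latency comes from this pattern: after the first pass, queries never re-read the on-disk log.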
Slide 7: Fault Tolerance
● RDDs keep information about the transformations used to build them; this lineage can be used to recover lost data
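Recovery via lineage can be sketched concretely (a toy illustration with invented data, not Spark's scheduler): when a computed partition is lost, only that partition is rebuilt by replaying the recorded transformations against the corresponding base partition; no checkpoint or replica of the derived data is needed.

```python
# Lineage-based recovery sketch: a lost partition is recomputed by
# replaying the transformation chain from the base data.
base_partitions = [["INFO a", "ERROR b"], ["ERROR c", "INFO d"]]

# Recorded lineage: base -> map(strip) -> filter(startswith "ERROR")
lineage = [
    ("map", str.strip),
    ("filter", lambda l: l.startswith("ERROR")),
]

def rebuild(partition_index):
    """Recompute one partition by replaying the lineage."""
    records = base_partitions[partition_index]
    for kind, fn in lineage:
        if kind == "map":
            records = [fn(r) for r in records]
        else:  # filter
            records = [r for r in records if fn(r)]
    return records

derived = [rebuild(i) for i in range(len(base_partitions))]
derived[1] = None        # simulate losing partition 1 (machine failure)
derived[1] = rebuild(1)  # recover ONLY the lost partition
print(derived)  # [['ERROR b'], ['ERROR c']]
```

Because each partition is rebuilt independently, recovery work is proportional to the data lost, not to the size of the whole dataset.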
Slide 8: Example: Logistic Regression
● Data is loaded into memory one time; repeated MapReduce-style steps calculate the gradient
● Many machine learning algorithms are iterative in nature because they run iterative optimization procedures
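A toy version of the iteration pattern (invented 1-D data; pure Python rather than Spark): the training points are loaded once, then every gradient step performs a "map" (per-point gradient) and "reduce" (sum) over the same in-memory collection. In plain MapReduce, each of these steps would reload the points from disk.

```python
import math

# Toy logistic regression: repeated gradient steps over data that is
# loaded into memory exactly once.
points = [((-1.0,), 0), ((-0.5,), 0), ((0.5,), 1), ((1.0,), 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 0.0    # single weight for the 1-D toy data
lr = 0.5
for _ in range(200):
    # "map": per-point gradient contribution; "reduce": sum them up.
    grad = sum((sigmoid(w * x) - y) * x for (x,), y in points)
    w -= lr * grad

# The learned weight separates the two toy classes:
print(sigmoid(w * 1.0) > 0.5, sigmoid(w * -1.0) < 0.5)  # True True
```

The body of the loop is the only part that repeats, which is why keeping `points` in memory dominates end-to-end performance for iterative algorithms like this.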
Slide 15: Conclusion
● Spark is up to 20x faster than Hadoop for iterative applications (less I/O and serialization)
● Can interactively scan 1 TB with 5-7 s latency
● Quick recovery (rebuilds only the lost RDD partitions)
● Pregel and HaLoop can be built on top of Spark
● Good for batch applications that apply the same operation to all elements of a dataset
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
● SlideShare: /Hadoop_Summit/spark-and-shark