Zaharia spark-scala-days-2012

Speaker notes:
  • Point out that Scala is a modern PL, etc. Mention DryadLINQ (but we go beyond it with RDDs). Point out that interactive use and iterative use go hand in hand, because both require small tasks and dataset reuse.
  • Each iteration is, for example, a MapReduce job.
  • RDDs = a first-class way to manipulate and persist intermediate datasets.
  • You write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are pretty standard; the new thing is that they can be reused across ops. Variables in the driver program can be used in parallel ops; accumulators are useful for sending information back, and cached vars are an optimization. Mention cached vars are useful for some workloads that won't be shown here. Mention it's all designed to be easy to distribute in a fault-tolerant fashion.
  • Key idea: add "variables" to the "functions" in functional programming.
  • Note that the dataset is reused on each gradient computation.
  • Key idea: add "variables" to the "functions" in functional programming.
  • 100 GB of data on 50 m1.xlarge EC2 machines.
  • Mention it's designed to be fault-tolerant.
  • NOT a modified version of Hadoop.
  • Transcript of "Zaharia spark-scala-days-2012"

    1. Spark in Action: Fast Big Data Analytics using Scala
       Matei Zaharia, University of California, Berkeley
       www.spark-project.org

    2. My Background
       Grad student in the AMP Lab at UC Berkeley
        » 50-person lab focusing on big data
       Committer on Apache Hadoop
       Started Spark in 2009 to provide a richer, Hadoop-compatible computing engine

    3. Spark Goals
       Extend the MapReduce model to support more types of applications efficiently
        » Spark can run 40x faster than Hadoop for iterative and interactive applications
       Make jobs easier to program
        » Language-integrated API in Scala
        » Interactive use from the Scala interpreter

    4. Why Go Beyond MapReduce?
       MapReduce simplified big data analysis by giving a reliable programming model for large clusters
       But as soon as it got popular, users wanted more:
        » More complex, multi-stage applications
        » More interactive ad-hoc queries
    5. Why Go Beyond MapReduce?
       Complex jobs and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing
       [Figure: a multi-stage iterative algorithm and interactive data-mining queries]

    6. Why Go Beyond MapReduce?
       Complex jobs and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing
       In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS) -> slow!
       [Figure: the same iterative algorithm and interactive queries, sharing data through storage]

    7. Examples
       [Figure: an iterative job reading from and writing to HDFS between each iteration; interactive queries each re-reading the input from HDFS]
       I/O and serialization can take 90% of the time

    8. Goal: In-Memory Data Sharing
       [Figure: iterations and queries sharing the input through distributed memory after one-time processing]
       10-100× faster than network and disk
    9. Solution: Resilient Distributed Datasets (RDDs)
       Distributed collections of objects that can be stored in memory for fast reuse
       Automatically recover lost data on failure
       Support a wide range of applications

    10. Outline
        Spark programming model
        User applications
        Implementation
        Demo
        What’s next

    11. Programming Model
        Resilient distributed datasets (RDDs)
         » Immutable, partitioned collections of objects
         » Can be cached in memory for efficient reuse
        Transformations (e.g. map, filter, groupBy, join)
         » Build RDDs from other RDDs
        Actions (e.g. count, collect, save)
         » Return a result or write it to storage
    12. Example: Log Mining
        Load error messages from a log into memory, then interactively search for various patterns:

          lines = spark.textFile("hdfs://...")
          errors = lines.filter(_.startsWith("ERROR"))
          messages = errors.map(_.split('\t')(2))
          cachedMsgs = messages.cache()

          cachedMsgs.filter(_.contains("foo")).count
          cachedMsgs.filter(_.contains("bar")).count
          . . .

        [Figure: the driver ships tasks to workers, which read input blocks and cache the transformed dataset (base RDD -> transformed RDD -> cached RDD, then actions)]
        Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

    13. RDD Fault Tolerance
        RDDs track the series of transformations used to build them (their lineage) to recompute lost data
        E.g.:

          messages = textFile(...).filter(_.contains("error"))
                                  .map(_.split('\t')(2))

        [Figure: lineage chain HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))]
    14. Example: Logistic Regression
        Goal: find the best line separating two sets of points
        [Figure: two point clouds, a random initial line, and the target separating line]

    15. Example: Logistic Regression

          val data = spark.textFile(...).map(readPoint).cache()
          var w = Vector.random(D)
          for (i <- 1 to ITERATIONS) {
            val gradient = data.map(p =>
              (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
            ).reduce(_ + _)
            w -= gradient
          }
          println("Final w: " + w)
    16. Logistic Regression Performance
        [Chart: running time (s) vs number of iterations (1 to 30), Hadoop vs Spark]
        Hadoop: 110 s / iteration
        Spark: first iteration 80 s, further iterations 6 s

    17. Spark Users

    18. User Applications
        In-memory analytics on Hive data (Conviva)
        Interactive queries on data streams (Quantifind)
        Exploratory log analysis (Foursquare)
        Traffic estimation w/ GPS data (Mobile Millennium)
        Algorithms for DNA sequence analysis (SNAP)
        ...

    19. Conviva GeoReport
        [Chart: job time in hours — Hive: 20, Spark: 0.5]
        Group aggregations on many keys with the same WHERE clause
        40× gain over Apache Hive comes from avoiding repeated reading, deserialization and filtering

    20. Quantifind Stream Analysis
        [Figure: data feeds -> parsed documents -> extracted entities -> relevant time series -> insights]
        Load new documents every few minutes
        Compute an in-memory table of time series
        Let users query interactively via web app

    21. Implementation
        Runs on the Apache Mesos cluster manager to coexist w/ Hadoop
        Supports any Hadoop storage system (HDFS, HBase, ...)
        Easy local mode and EC2 launch scripts
        No changes to Scala
        [Figure: Spark, Hadoop and MPI sharing nodes under Mesos]

    22. Task Scheduler
        Runs general DAGs
        Pipelines functions within a stage
        Cache-aware data reuse & locality
        Partitioning-aware to avoid shuffles
        [Figure: example DAG of RDDs A-G split into three stages at groupBy, join and union boundaries, with cached data partitions marked]
    23. Language Integration
        Scala closures are Serializable Java objects
         » Serialize on master, load & run on workers
        Not quite enough
         » Nested closures may reference the entire outer scope, pulling in non-Serializable variables not used inside
         » Solution: bytecode analysis + reflection
        Interpreter integration
         » Some magic tracks the variables, defs, etc. that each line depends on and automatically ships them to workers

    24. Demo

    25. What’s Next?

    26. Hive on Spark (Shark)
        Compatible port of the SQL-on-Hadoop engine that can run 40x faster on existing Hive data
        Scala UDFs for statistics and machine learning
        Alpha coming really soon

    27. Streaming Spark
        Extend Spark to perform streaming computations
        Run as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
        Alpha expected by June

          tweetStream.flatMap(_.toLower.split)
                     .map(word => (word, 1))
                     .reduceByWindow(5, _ + _)

        [Figure: stream batches at T=1, T=2, ... flowing through map and reduceByWindow]
    28. Conclusion
        Spark offers a simple, efficient and powerful programming model for a wide range of apps
        Shark and Spark Streaming coming soon
        Download and docs: www.spark-project.org
        @matei_zaharia / matei@berkeley.edu

    29. Related Work
        DryadLINQ
         » Build queries through language-integrated SQL operations on lazy datasets
         » Cannot have a dataset persist across queries
        Relational databases
         » Lineage/provenance, logical logging, materialized views
        Piccolo
         » Parallel programs with shared distributed hash tables; similar to distributed shared memory
        Iterative MapReduce (Twister and HaLoop)
         » Cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively

    30. Related Work
        Distributed shared memory (DSM)
         » Very general model allowing random reads/writes, but hard to implement efficiently (needs logging or checkpointing)
        RAMCloud
         » In-memory storage system for web applications
         » Allows random reads/writes and uses logging like DSM
        Nectar
         » Caching system for DryadLINQ programs that can reuse intermediate results across jobs
         » Does not provide caching in memory, explicit control over which data is cached, or control over partitioning
        SMR (functional Scala API for Hadoop)

    31. Behavior with Not Enough RAM
        [Chart: iteration time (s) vs % of working set in cache]
        Cache disabled: 68.8 s; 25% cached: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s