Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data

Spark is an open source cluster computing framework that can outperform Hadoop by 30x through a combination of in-memory computation and a richer execution engine. Shark is a port of Apache Hive onto Spark, which provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. This talk will cover how both Spark and Shark are being used at various companies to accelerate big data analytics, the architecture of the systems, and where they are heading. We will also discuss the next major feature we are developing, Spark Streaming, which adds support for low-latency stream processing to Spark, giving users a unified interface for batch and real-time analytics.

Speaker notes:
  • Each iteration is, for example, a MapReduce job
  • Key idea: add “variables” to the “functions” in functional programming
  • This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • This will show interactive search on 50 GB of Wikipedia data
  • Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join
  • Streaming Spark offers similar speed while providing fault-tolerance and consistency guarantees that these systems lack

    1. Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
       Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin
       UC Berkeley — spark-project.org
    2. What is Spark?
       Not a modified version of Hadoop
       A separate, fast, MapReduce-like engine
         » In-memory data storage for very fast iterative queries
         » General execution graphs and powerful optimizations
         » Up to 40x faster than Hadoop
       Compatible with Hadoop’s storage APIs
         » Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
    3. What is Shark?
       A port of Apache Hive to run on Spark
       Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
       Similar speedups of up to 40x
    4. Project History
       Spark project started in 2009, open sourced in 2010
       Shark started in summer 2011, alpha released April 2012
       In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others
       250+ member meetup, 500+ watchers on GitHub
    5. This Talk
       Spark programming model
       User applications
       Shark overview
       Demo
       Next major addition: Streaming Spark
    6. Why a New Programming Model?
       MapReduce greatly simplified big data analysis
       But as soon as it got popular, users wanted more:
         » More complex, multi-stage applications (graph algorithms, machine learning)
         » More interactive ad-hoc queries
         » More real-time online processing
       All three of these applications require fast data sharing across parallel jobs
    7. Data Sharing in MapReduce
       [Diagram: each iteration and each query reads its input from HDFS and writes its output back to HDFS]
       Slow due to replication, serialization, and disk I/O
    8. Data Sharing in Spark
       [Diagram: input is loaded into distributed memory once, then shared by iterations and queries]
       10-100× faster than network and disk
    9. Spark Programming Model
       Key idea: resilient distributed datasets (RDDs)
         » Distributed collections of objects that can be cached in memory across cluster nodes
         » Manipulated through various parallel operators
         » Automatically rebuilt on failure
       Interface
         » Clean language-integrated API in Scala
         » Can be used interactively from the Scala console
    10. Example: Log Mining
        Load error messages from a log into memory, then interactively search for various patterns

        lines = spark.textFile("hdfs://...")
        errors = lines.filter(_.startsWith("ERROR"))
        messages = errors.map(_.split('\t')(2))
        cachedMsgs = messages.cache()

        cachedMsgs.filter(_.contains("foo")).count
        cachedMsgs.filter(_.contains("bar")).count
        . . .

        [Diagram: the driver ships tasks to workers, which cache blocks of the RDD in memory]
        Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
    11. Fault Tolerance
        RDDs track the series of transformations used to build them (their lineage) to recompute lost data
        E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
        [Lineage diagram: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
    12. Example: Logistic Regression

        val data = spark.textFile(...).map(readPoint).cache()   // Load data in memory once
        var w = Vector.random(D)                                // Initial parameter vector

        for (i <- 1 to ITERATIONS) {
          val gradient = data.map(p =>
            (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
          ).reduce(_ + _)
          w -= gradient                                         // Repeated MapReduce steps to do gradient descent
        }

        println("Final w: " + w)
    13. Logistic Regression Performance
        [Chart: running time (s) vs. number of iterations (1-30). Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations.]
    14. Supported Operators
        map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
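        To give a feel for how these operators compose, here is a minimal sketch (not from the talk; the input path and field layout are made up) that chains a few of them on a key-value RDD, assuming a SparkContext named spark as in the earlier examples:

            // Count ERROR lines per source field and fetch a few results
            val lines  = spark.textFile("hdfs://.../app.log")          // hypothetical input path
            val errors = lines.filter(_.contains("ERROR"))
            val counts = errors.map(line => (line.split(" ")(0), 1))   // (source, 1) pairs
                               .reduceByKey(_ + _)                     // sum per key in parallel
            counts.take(10)                                            // bring a small sample back to the driver

        Operators such as join, cogroup, and partitionBy follow the same pattern on key-value RDDs.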
    15. Other Engine Features
        General graphs of operators (→ efficiency)
        [DAG diagram: RDDs A-G connected by groupBy, map, join, and union operators, grouped into Stages 1-3, with cached data marked]
    16. Other Engine Features
        Controllable data partitioning to minimize communication
        [Chart: PageRank iteration time (s) — Hadoop 171, basic Spark 72, Spark + controlled partitioning 23]
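        The slide only shows the resulting numbers; as a rough illustration of what controlled partitioning means (a sketch under assumptions, not the talk's code: parseLinks is a made-up helper, and the HashPartitioner name follows later Apache Spark releases), a PageRank-style loop can hash-partition the link table once so that each iteration's join avoids reshuffling it:

            val partitioner = new HashPartitioner(100)                 // assumed partitioner class

            val links = spark.textFile("hdfs://.../links")             // hypothetical input
                             .map(parseLinks)                          // (url, neighbors) pairs; parseLinks is hypothetical
                             .partitionBy(partitioner)
                             .cache()                                  // partitioned link table stays in memory

            var ranks = links.mapValues(_ => 1.0)                      // inherits the same partitioning

            for (i <- 1 to 10) {
              // links and ranks are co-partitioned, so this join sends no link data over the network
              val contribs = links.join(ranks).flatMap { case (url, (neighbors, rank)) =>
                neighbors.map(dest => (dest, rank / neighbors.size))
              }
              ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
            }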
    17. User Applications
    18. Spark Users
    19. Applications
        In-memory analytics & anomaly detection (Conviva)
        Interactive queries on data streams (Quantifind)
        Exploratory log analysis (Foursquare)
        Traffic estimation w/ GPS data (Mobile Millennium)
        Twitter spam classification (Monarch)
        ...
    20. Conviva GeoReport
        [Chart: time (hours) — Hive 20, Spark 0.5]
        Group aggregations on many keys with the same filter
        40× gain over Hive from avoiding repeated reading, deserialization, and filtering
    21. Quantifind Feed Analysis
        [Pipeline diagram: data feeds → parsed documents → extracted entities → in-memory time series (Spark), queried by a web app]
        Load data feeds, extract entities, and compute in-memory tables every few minutes
        Let users drill down interactively from an AJAX app
    22. Mobile Millennium Project
        Estimate city traffic from crowdsourced GPS data
        Iterative EM algorithm scaling to 160 nodes
        Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
    23. Shark: Hive on Spark
    24. Motivation
        Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
        Scala is good for programmers, but many data users only know SQL
        Can we extend Hive to run on Spark?
    25. Hive Architecture
        [Diagram: client (CLI, JDBC) → driver (SQL parser, query optimizer, physical plan execution) with metastore → MapReduce → HDFS]
    26. Shark Architecture
        [Diagram: client (CLI, JDBC) → driver (cache manager, SQL parser, query optimizer, physical plan execution) with metastore → Spark → HDFS]
        [Engle et al., SIGMOD 2012]
    27. Efficient In-Memory Storage
        Simply caching Hive records as Java objects is inefficient due to high per-object overhead
        Instead, Shark employs column-oriented storage using arrays of primitive types
        Row storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4)
        Column storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
    28. Efficient In-Memory Storage
        Simply caching Hive records as Java objects is inefficient due to high per-object overhead
        Instead, Shark employs column-oriented storage using arrays of primitive types
        Row storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4)
        Column storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
        Benefit: similarly compact size to serialized data, but >5x faster to access
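        To make the row-versus-column layout concrete, here is a small self-contained sketch (illustrative only, not Shark's implementation; the Row class and field names are invented) of the same three records stored both ways:

            // Row layout: one JVM object per record, with object headers and boxed fields
            case class Row(id: Int, name: String, score: Double)
            val rows = Array(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

            // Column layout: one array per column; primitives are stored without per-record objects
            val ids    = Array(1, 2, 3)
            val names  = Array("john", "mike", "sally")
            val scores = Array(4.1, 3.5, 6.4)

            // A scan over a single column touches only that array
            val total = scores.sum

        Because each column is a primitive array, a scan avoids per-object overhead and deserialization.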
    29. Using Shark
        CREATE TABLE mydata_cached AS SELECT …
        Run standard HiveQL on it, including UDFs
          » A few esoteric features are not yet supported
        Can also call from Scala to mix with Spark
        Early alpha release at shark.cs.berkeley.edu
    30. Benchmark Query 1
        SELECT * FROM grep WHERE field LIKE '%XYZ%';
        [Chart: execution time (secs) — Shark (cached) 12 s, Shark 182 s, Hive 207 s]
    31. Benchmark Query 2
        SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
        FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL
        WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
        GROUP BY V.sourceIP
        ORDER BY earnings DESC
        LIMIT 1;
        [Chart: execution time (secs) — Shark (cached) 126 s, Shark 270 s, Hive 447 s]
    32. Demo
    33. What’s Next
    34. Streaming Spark
        Many key big data apps must run in real time
          » Live event reporting, click analysis, spam filtering, …
        Event-passing systems (e.g. Storm) are low-level
          » Users must worry about fault tolerance, state, and consistency
          » Programming model is different from batch, so each app must be written twice
        Can we give streaming a Spark-like interface?
    35. Our Idea
        Run streaming computations as a series of very short (<1 second) batch jobs
          » “Discretized stream processing”
        Keep state in memory as RDDs (automatically recover from any failure)
        Provide a functional API similar to Spark
    36. Spark Streaming API
        Functional operators on discretized streams (D-streams)
        New “stateful” operators for windowing

        pageViews = readStream("...", "1s")
        ones = pageViews.map(ev => (ev.url, 1))
        counts = ones.runningReduce(_ + _)

        sliding = ones.reduceByWindow("5s", _ + _, _ - _)   // sliding window reduce with “add” and “subtract” functions

        [Diagram: pageViews, ones, and counts D-streams shown as series of RDD partitions at t = 1, t = 2, …, connected by map and reduce transformations]
    37. Streaming + Batch + Ad-Hoc
        Combining D-streams with historical data:
          pageViews.join(historicCounts).map(...)
        Interactive ad-hoc queries on stream state:
          counts.slice("21:00", "21:05").topK(10)
    38. How Fast Can It Go?
        Can process 4 GB/s (42M records/s) of data on 100 nodes at sub-second latency
        Recovers from failures within 1 sec
        [Charts: cluster throughput (GB/s) vs. number of nodes (0-100) for Grep and TopKWords — maximum throughput possible with 1 s or 2 s latency]
    39. Performance vs. Storm
        [Charts: Grep and TopK throughput (MB/s/node) for Spark vs. Storm at record sizes of 100 and 10,000 bytes]
        Storm limited to 10,000 records/s/node
        Also tried S4: 7,000 records/s/node
    40. Streaming Roadmap
        Alpha release expected in August
        Spark engine changes already in the “dev” branch
    41. Conclusion
        Spark & Shark speed up your interactive, complex, and (soon) streaming analytics on Hadoop data
        Download and docs: www.spark-project.org
          » Easy local mode and deploy scripts for EC2
        User meetup: meetup.com/spark-users
        Training camp at Berkeley in August!
        matei@berkeley.edu / @matei_zaharia
    42. Behavior with Not Enough RAM
        [Chart: iteration time (s) vs. % of working set in memory — cache disabled 68.8, 25% cached 58.1, 50% 40.7, 75% 29.7, fully cached 11.5]
    43. Software Stack
        [Diagram: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming, and other libraries on top of Spark, which runs on local mode, Apache Mesos, EC2, or YARN]
