20130912 YTC_Reynold Xin_Spark and Shark

In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications.

These systems have been adopted by many organizations, large and small (e.g., Yahoo, Intel, Adobe, Alibaba, Tencent), to implement data-intensive applications such as ETL, interactive SQL, and machine learning.

Slide notes:
  • FT = Fourier Transform
  • NOT a modified version of Hadoop
  • Note that dataset is reused on each gradient computation
  • Key idea: add “variables” to the “functions” in functional programming
  • 100 GB of data on 50 m1.xlarge EC2 machines
  • Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join

    1. 1. Spark and Shark: High-speed Analytics over Hadoop Data Sep 12, 2013 @ Yahoo! Tech Conference Reynold Xin (辛湜), AMPLab, UC Berkeley
    2. 2. The Spark Ecosystem Spark Shark SQL HDFS / Hadoop Storage Mesos/YARN Resource Manager Spark Streaming GraphX MLBase
    3. 3. Today’s Talk Spark Shark SQL HDFS / Hadoop Storage Mesos/YARN Resource Manager Spark Streaming GraphX MLBase
    4. 4. Apache Spark Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: » In-memory computing primitives » General computation graphs Improves usability through: » Rich APIs in Scala, Java, Python » Interactive shell Up to 100× faster (2-10× on disk) Often 5× less code
    5. 5. Shark An analytic query engine built on top of Spark » Support both SQL and complex analytics » Can be 100X faster than Apache Hive Compatible with Hive data, metastore, queries » HiveQL » UDF / UDAF » SerDes » Scripts
    6. 6. Community 3000 people attended online training 1100+ meetup members 80+ GitHub contributors
    7. 7. Today’s Talk Spark Shark SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
    8. 8. Why a New Framework? MapReduce greatly simplified big data analysis But as soon as it got popular, users wanted more: » More complex, multi-pass analytics (e.g. ML, graph) » More interactive ad-hoc queries » More real-time stream processing
    9. 9. Hadoop MapReduce Iterative: Input → iter. 1 → iter. 2 → …, with a filesystem read and a filesystem write around each iteration Interactive: Input → select subset → query 1, query 2, … Slow due to replication and disk I/O, but necessary for fault tolerance
    10. 10. Spark Programming Model Key idea: resilient distributed datasets (RDDs) » Distributed collections of objects that can optionally be cached in memory across cluster » Manipulated through parallel operators » Automatically recomputed on failure Programming interface » Functional APIs in Scala, Java, Python » Interactive use from Scala shell
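A minimal sketch of the RDD model just described, using the Scala API of that era (SparkContext, parallelize, filter, map, cache); the data and names are illustrative, not from the talk:

    import org.apache.spark.SparkContext

    object RddSketch {
      def main(args: Array[String]) {
        // Local SparkContext; on a cluster this would point at the master URL
        val sc = new SparkContext("local", "RddSketch")

        // A distributed collection of objects (an RDD) built from a local range
        val nums = sc.parallelize(1 to 1000000)

        // Transformations are lazy: they record lineage rather than run immediately
        val evens = nums.filter(_ % 2 == 0).map(_ * 2)

        // Optionally keep the dataset in memory across the cluster for reuse
        evens.cache()

        // Actions trigger computation; lost partitions are recomputed from lineage
        println(evens.count())
        println(evens.take(5).mkString(", "))

        sc.stop()
      }
    }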
    11. 11. Example: Log Mining Exposes RDDs through a functional API in Scala Usable interactively from the Scala shell lines = spark.textFile(“hdfs://...”) (base RDD) errors = lines.filter(_.startsWith(“ERROR”)) (transformed RDD) errors.persist() errors.filter(_.contains(“foo”)).count() (action) errors.filter(_.contains(“bar”)).count() [diagram: the master ships tasks to workers holding cached Errors partitions of Blocks 1-3 and collects the results] Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: 1 TB data in 5 sec (vs 170 sec for on-disk data)
    12. 12. Data Sharing in Spark Iterative: Input → iter. 1 → iter. 2 → … Interactive: Input → select → query 1, query 2, … 10-100x faster than network/disk
    13. 13. Fault Tolerance RDDs track the series of transformations used to build them (their lineage) to recompute lost data E.g.: messages = textFile(...).filter(_.contains(“error”)) .map(_.split('\t')(2)) HadoopRDD path = hdfs://… FilteredRDD func = _.contains(...) MappedRDD func = _.split(…)
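As a hedged illustration of lineage, the same pipeline can be written in the Scala shell and inspected with RDD.toDebugString, which prints the chain of parent RDDs; the path is illustrative, and the availability of toDebugString in this exact release is an assumption:

    // In the Spark shell, where sc is the provided SparkContext
    val messages = sc.textFile("hdfs://namenode/logs")   // HadoopRDD (illustrative path)
                     .filter(_.contains("error"))        // FilteredRDD
                     .map(_.split('\t')(2))              // MappedRDD

    // Print the lineage Spark would replay to rebuild lost partitions
    println(messages.toDebugString)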
    14. 14. Spark: Expressive API map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ... and broadcast variables, accumulators…
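A short sketch exercising a few of the operators listed above, plus a broadcast variable and an accumulator; the dataset and names are made up for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // brings in pair-RDD operators such as join and reduceByKey

    object OperatorsSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "OperatorsSketch")

        val sales  = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 5)))
        val prices = sc.parallelize(Seq(("apple", 1.0), ("pear", 2.5)))

        // reduceByKey and join, two of the operators named on the slide
        val revenue = sales.reduceByKey(_ + _)
                           .join(prices)
                           .map { case (item, (qty, price)) => (item, qty * price) }

        // A broadcast variable ships a read-only value to every worker once
        val taxRate = sc.broadcast(0.1)
        // An accumulator gathers a running count back on the driver
        val rows = sc.accumulator(0)

        val withTax = revenue.map { case (item, r) =>
          rows += 1
          (item, r * (1 + taxRate.value))
        }

        withTax.collect().foreach(println)
        println("rows processed: " + rows.value)
        sc.stop()
      }
    }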
    15. 15. public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
    16. 16. Word Count val docs = sc.textFile("hdfs://…") docs.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b)
    17. 17. Word Count val docs = sc.textFile("hdfs://…") docs.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) docs.flatMap(_.split(" ")) .map((_, 1)) .reduceByKey(_ + _)
    18. 18. Spark in Java and Python Python API lines = spark.textFile(…) errors = lines.filter( lambda s: "ERROR" in s) errors.count() Java API JavaRDD<String> lines = spark.textFile(…); errors = lines.filter( new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("ERROR"); } }); errors.count()
    19. 19. Scheduler General task DAGs Pipelines functions within a stage Cache-aware data locality & reuse Partitioning-aware to avoid shuffles Launch jobs in ms [DAG figure: RDDs A-G combined through map, groupBy, union, and join and split into Stages 1-3; shaded boxes mark previously computed partitions]
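A hedged sketch of the partitioning-awareness mentioned above: hash-partitioning a large, reused dataset once with partitionBy and caching it lets later joins reuse that partitioning instead of reshuffling it; data and names are illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.HashPartitioner

    object PartitioningSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "PartitioningSketch")

        // Large, reused table: hash-partition it once and keep it in memory
        val userInfo = sc.parallelize(1 to 100000).map(id => (id, "profile-" + id))
                         .partitionBy(new HashPartitioner(8))
                         .cache()

        // Smaller per-job data joins against it; only the small side is shuffled
        val clicks = sc.parallelize(Seq((1, "home"), (42, "search"), (7, "cart")))
        val joined = userInfo.join(clicks)

        println(joined.collect().mkString("\n"))
        sc.stop()
      }
    }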
    20. 20. Example: Logistic Regression Goal: find the best line separating two sets of points [figure: two point sets, a random initial line, and the target separating line]
    21. 21. Logistic Regression: Prep the Data val sc = new SparkContext(“spark://master”, “LRExp”) def loadDict(): Map[String, Double] = { … } val dict = sc.broadcast( loadDict() ) def parser(s: String): (Array[Double], Double) = { val parts = s.split('\t') val y = parts(0).toDouble val x = parts.drop(1).map(dict.value.getOrElse(_, 0.0)) (x, y) } val data = sc.textFile(“hdfs://d”).map(parser).cache() Load data in memory once Start a new Spark context Build a dictionary for file parsing Broadcast the Dictionary to the Cluster Line Parser which uses the Dictionary
    22. 22. Logistic Regression: Grad. Desc. val sc = new SparkContext(“spark://master”, “LRExp”) // Parsing … val data: Spark.RDD[(Array[Double],Double)] = … var w = Vector.random(D) val lambda = 1.0/data.count for (i <- 1 to ITERATIONS) { val gradient = data.map { case (x, y) => (1.0/(1+exp(-y*(w dot x)))-1)*y*x }.reduce( (g1, g2) => g1 + g2 ) w -= (lambda / sqrt(i) ) * gradient } Initial parameter vector Iterate Set Learning Rate Apply Gradient on Master Compute Gradient Using Map & Reduce
    23. 23. Logistic Regression Performance [chart: running time (s) vs. number of iterations (1 to 30) for Hadoop and Spark] Hadoop: 110 s / iteration Spark: first iteration 80 s, further iterations 1 s
    24. 24. Spark ML Lib (Spark 0.8) Classification: Logistic Regression, Linear SVM (+L1, L2) Regression: Linear Regression (+Lasso, Ridge) Collaborative Filtering: Alternating Least Squares Clustering: K-Means Optimization: SGD, Parallel Gradient
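A small sketch of calling the logistic regression listed above, assuming the Spark 0.8-era MLlib API (LabeledPoint with an Array[Double] feature vector and LogisticRegressionWithSGD.train); treat the exact class and method names as approximate rather than definitive:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    object MLlibSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "MLlibSketch")

        // Toy training set: a label followed by two features
        val training = sc.parallelize(Seq(
          LabeledPoint(1.0, Array(2.0, 3.0)),
          LabeledPoint(0.0, Array(-1.0, -2.0)),
          LabeledPoint(1.0, Array(1.5, 2.5)),
          LabeledPoint(0.0, Array(-2.0, -1.0))
        )).cache()

        // Train with stochastic gradient descent for a fixed number of iterations
        val model = LogisticRegressionWithSGD.train(training, 50)

        println(model.predict(Array(2.0, 2.0)))
        sc.stop()
      }
    }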
    25. 25. Spark Spark Shark SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
    26. 26. Shark Spark Shark SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
    27. 27. Shark A data analytics system that » builds on Spark, » scales out and tolerates worker failures, » supports low-latency, interactive queries through (optional) in-memory computation, » supports both SQL and complex analytics, » is compatible with Hive (storage, serdes, UDFs, types, metadata).
    28. 28. Hive Architecture [architecture diagram: Client (CLI, JDBC) → Driver (SQL Parser, Query Optimizer, Physical Plan Execution) → MapReduce, with the Metastore and HDFS alongside]
    29. 29. Shark Architecture [architecture diagram: Client (CLI, JDBC) → Driver (SQL Parser, Query Optimizer, Cache Mgr., Physical Plan Execution) → Spark, with the Metastore and HDFS alongside]
    30. 30. Shark Query Language Hive QL and a few additional DDL add-ons: Creating an in-memory table: » CREATE TABLE src_cached AS SELECT * FROM …
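A hedged sketch of using such an in-memory table from Scala: sql2rdd appears later in this deck, while the sql(...) call and the table name here are assumptions about the Shark context API of that era, not confirmed details:

    // Assumes a Shark context `sc` (e.g. from the Shark shell)
    // Cache an existing Hive table in memory via CREATE TABLE ... AS SELECT (note the _cached suffix)
    sc.sql("CREATE TABLE logs_cached AS SELECT * FROM logs")

    // Query the cached table; sql2rdd returns the result as an RDD of rows
    // that can feed directly into Spark code (see the Spark Integration slide)
    val errors = sc.sql2rdd("SELECT * FROM logs_cached WHERE status = 'ERROR'")
    println(errors.count())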
    31. 31. Engine Features Columnar Memory Store Data Co-partitioning & Co-location Partition Pruning based on Statistics (range, bloom filter and others) Spark Integration …
    32. 32. Columnar Memory Store Simply caching Hive records as JVM objects is inefficient. Shark employs column-oriented storage. [figure: the same three records (1 john 4.1, 2 mike 3.5, 3 sally 6.4) laid out as row storage vs. column storage] Benefit: compact representation, CPU-efficient compression, cache locality.
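A toy Scala sketch of the row-versus-column idea above (purely illustrative, not Shark's actual memory store): columns stored as primitive arrays pack densely and scan well, while rows stored as one JVM object per record do not:

    // Row layout: one JVM object per record, fields boxed and scattered on the heap
    case class UserRow(id: Int, name: String, score: Double)
    val rows = Seq(UserRow(1, "john", 4.1), UserRow(2, "mike", 3.5), UserRow(3, "sally", 6.4))

    // Column layout: one array per column, densely packed and compression-friendly
    val ids    = Array(1, 2, 3)
    val names  = Array("john", "mike", "sally")
    val scores = Array(4.1, 3.5, 6.4)

    // A whole-column aggregate touches one contiguous array instead of every row object
    val avgRowLayout    = rows.map(_.score).sum / rows.size
    val avgColumnLayout = scores.sum / scores.length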
    33. 33. Spark Integration Unified system for query processing and Spark analytics Query processing and ML share the same set of workers and caches def logRegress(points: RDD[Point]): Vector = { var w = Vector(D, _ => 2 * rand.nextDouble - 1) for (i <- 1 to ITERATIONS) { val gradient = points.map { p => val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w } val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid") val features = users.mapRows { row => new Vector(extractFeature1(row.getInt("age")), extractFeature2(row.getStr("country")), ...)} val trainedVector = logRegress(features.cache())
    34. 34. Performance [bar chart: runtime (seconds, 0-100) for queries Q1-Q4 under Shark, Shark (disk), and Hive; the labels 1.1, 0.8, 0.7, 1.0 mark the in-memory Shark bars] 1.7 TB Real Warehouse Data on 100 EC2 nodes
    35. 35. More Information Download and docs: www.spark-project.org » Easy to run locally, on EC2, or on Mesos/YARN Email: rxin@apache.org Twitter: @rxin
    36. 36. Behavior with Insufficient RAM [chart: iteration time (s) vs. percent of working set in memory: 68.8 s at 0%, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s at 100%]
    37. 37. Example: PageRank 1. Start each page with a rank of 1 2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i| links = // RDD of (url, neighbors) pairs ranks = // RDD of (url, rank) pairs for (i <- 1 to ITERATIONS) { ranks = links.join(ranks).flatMap { case (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) }.reduceByKey(_ + _) }
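To make the PageRank fragment above self-contained, here is a sketch that also builds the two elided RDDs from a toy edge list; the data is made up, and the update rule is the simplified one on the slide (no damping factor):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object PageRankSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "PageRankSketch")
        val ITERATIONS = 10

        // Toy edge list: (url, outgoing neighbor) pairs
        val edges = sc.parallelize(Seq(("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")))

        // RDD of (url, neighbors) pairs, cached because it is reused every iteration
        val links = edges.groupByKey().cache()
        // RDD of (url, rank) pairs, starting every page at rank 1
        var ranks = links.mapValues(_ => 1.0)

        for (i <- 1 to ITERATIONS) {
          ranks = links.join(ranks).flatMap { case (url, (neighbors, rank)) =>
            neighbors.map(dest => (dest, rank / neighbors.size))
          }.reduceByKey(_ + _)
        }

        ranks.collect().foreach(println)
        sc.stop()
      }
    }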
