
Apache Spark

A technical introduction to Apache Spark, the Swiss Army knife of Big Data analytics tools.

The talk was given at the Big Data User Group in Mannheim, Germany, on 24.11.2014.


  1. Apache Spark (uweseiler)
  2. About me: Big Data Nerd, Hadoop Trainer, NoSQL Fan Boy, Photography Enthusiast, Travel Pirate
  3. About us: Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus, Enterprise Java Specialists, Performance Geeks. Join us!
  4. Agenda: • Why? • How? • What else? • Who? • Future?
  5. Agenda: • Why? • How? • What else? • Who? • Future?
  6. Spark: In a tweet – “Spark … is what you might call a Swiss Army knife of Big Data analytics tools” – Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
  7. Spark: In a nutshell
      • Fast and general engine for large-scale data processing
      • Advanced DAG execution engine with support for in-memory storage, data locality, and (micro-)batch streaming
      • Improves usability via rich APIs in Scala, Java, and Python, plus an interactive shell
      • Runs standalone, on YARN, on Mesos, and on Amazon EC2
  8. Spark is also…
      • Came out of the AMPLab at UC Berkeley in 2009
      • A top-level Apache project as of 2014 – http://spark.apache.org
      • Backed by a commercial entity: Databricks
      • A toolset for Data Scientists / Analysts
      • An implementation of the Resilient Distributed Dataset (RDD) in Scala
      • Hadoop compatible
  9. Spark: Trends – search-interest comparison of Apache Drill, Apache Storm, Apache Spark, Apache YARN, and Apache Tez, generated using http://www.google.com/trends/
  10. Spark: Community – https://github.com/apache/spark/pulse
  11. Spark: Performance – 3X faster using 10X fewer machines
      http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html
      http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
  12. Spark: Ecosystem
      • HDFS – redundant, reliable storage
      • MapReduce – cluster resource mgmt. + data processing
      • Spark Core, with Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), SparkR (R on Spark), GraphX (graph computation), and BlinkDB on top
  13. Agenda: • Why? • How? • What else? • Who? • Future?
  14. Spark: Core Concept – Resilient Distributed Dataset (RDD)
      Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors.
      • Read-only collection of objects spread across a cluster
      • Built through parallel transformations and actions
      • Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
      • Automatically rebuilt on failure
      • Controllable persistence
  15. Spark: RDD Example
      // Base RDD from HDFS
      val lines = spark.textFile("hdfs://...")
      val errors = lines.filter(_.startsWith("Error"))
      val messages = errors.map(_.split('\t')(2))
      messages.cache()   // RDD is kept in memory
      // Iterative processing
      for (str <- Array("foo", "bar"))
        messages.filter(_.contains(str)).count()
  16. Spark: Transformations – create new datasets from existing ones, e.g. map
  17. Spark: Transformations – create new datasets from existing ones:
      map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func),
      union(otherDataset), intersection(otherDataset), distinct([numTasks]), groupByKey([numTasks]),
      sortByKey([ascending], [numTasks]), reduceByKey(func, [numTasks]),
      aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]), join(otherDataset, [numTasks]),
      cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]),
      coalesce(numPartitions), sample(withReplacement, fraction, seed), repartition(numPartitions)
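      As a minimal sketch of chaining a few of these transformations (assuming a SparkContext named sc, as provided by spark-shell):
      val nums    = sc.parallelize(1 to 10)          // source RDD
      val evens   = nums.filter(_ % 2 == 0)          // keep even numbers
      val squares = evens.map(n => n * n)            // square each element
      val paired  = squares.map(n => (n % 3, n))     // build key-value pairs
      val summed  = paired.reduceByKey(_ + _)        // sum the values per key
      Nothing has been computed at this point – these calls only extend the lineage DAG.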
  18. Spark: Actions – return a value to the client after running a computation on the dataset, e.g. reduce
  19. Spark: Actions – return a value to the client after running a computation on the dataset:
      reduce(func), collect(), count(), first(), countByKey(), foreach(func), take(n),
      takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path),
      saveAsSequenceFile(path) (Java and Scala only), saveAsObjectFile(path) (Java and Scala only)
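      A short sketch of a few actions (again assuming a SparkContext named sc); unlike transformations, each of these triggers an actual computation and returns a result to the driver:
      val nums = sc.parallelize(1 to 10)
      nums.count()                      // => 10
      nums.take(3)                      // => Array(1, 2, 3)
      nums.reduce(_ + _)                // => 55
      nums.collect().foreach(println)   // fetch all elements to the driver and print them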
  20. Spark: Dataflow – all transformations in Spark are lazy and are only computed when an action requires them.
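      A small illustration of this laziness (hypothetical input path, assumes a SparkContext sc):
      val lines  = sc.textFile("hdfs:///tmp/input.txt")   // nothing is read yet
      val errors = lines.filter(_.contains("ERROR"))      // still nothing is read
      errors.count()                                      // only this action triggers reading and filtering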
  21. Spark: Persistence – one of the most important capabilities in Spark is caching a dataset in memory across operations
      • cache()   – stores at the MEMORY_ONLY level
      • persist() – without arguments, also defaults to MEMORY_ONLY
  22. Spark: Storage Levels – persist(StorageLevel)
      • MEMORY_ONLY – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
      • MEMORY_AND_DISK – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
      • MEMORY_ONLY_SER – Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
      • MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
      • DISK_ONLY – Store the RDD partitions only on disk.
      • MEMORY_ONLY_2, MEMORY_AND_DISK_2, … – Same as the levels above, but replicate each partition on two cluster nodes.
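      A short sketch of choosing an explicit storage level (path is a placeholder, assumes a SparkContext sc):
      import org.apache.spark.storage.StorageLevel
      val logs = sc.textFile("hdfs:///tmp/logs")        // placeholder path
      logs.persist(StorageLevel.MEMORY_AND_DISK_SER)    // serialized in memory, spilled to disk if needed
      logs.count()                                      // the first action materializes and caches the RDD
      logs.unpersist()                                  // drop it from the cache when no longer needed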
  23. Spark: Parallelism – can be specified in a number of different ways
      • RDD partition number: sc.textFile(input, minSplits = 10), sc.parallelize(1 to 10000, numSlices = 10)
      • Mapper-side parallelism: usually inherited from the parent RDD(s)
      • Reducer-side parallelism: rdd.reduceByKey(_ + _, numPartitions = 10), rdd.reduceByKey(partitioner = p, _ + _)
      • "Zoom in/out": rdd.repartition(numPartitions: Int), rdd.coalesce(numPartitions: Int, shuffle: Boolean)
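      A small sketch of inspecting and changing the partition count (assumes a SparkContext sc):
      val data = sc.parallelize(1 to 10000, numSlices = 4)       // start with 4 partitions
      data.partitions.size                                       // => 4
      val wide   = data.repartition(16)                          // shuffle into 16 partitions
      val narrow = wide.coalesce(2)                              // shrink to 2 partitions without a full shuffle
      val sums   = data.map(n => (n % 10, n)).reduceByKey(_ + _, 8)   // 8 partitions on the reduce side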
  24. Spark: Example – text processing: top words by frequency
  25. Spark: Frequency Example – create an RDD from external data
      Data sources supported via Hadoop I/O (optional): HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, …
      // Step 1. Create RDD from Hadoop text files
      val docs = spark.textFile("hdfs://docs/")
  26. Spark: Frequency Example – function map
      RDD[String]: "Hello World", "This is Spark", "Spark", "The end"
      .map(line => line.toLowerCase)
      RDD[String]: "hello world", "this is spark", "spark", "the end"
  27. Spark: Frequency Example – function map
      .map(line => line.toLowerCase)  =  .map(_.toLowerCase)
  28. Spark: Frequency Example – function map
      // Step 2. Convert lines to lower case
      val lower = docs.map(line => line.toLowerCase)
  29. Spark: Frequency Example – map vs. flatMap
      RDD[String]: "hello world", "this is spark", "spark", "the end"
      .map(_.split("\\s+"))
      RDD[Array[String]]: [hello, world], [this, is, spark], [spark], [the, end]
  30. Spark: Frequency Example – map vs. flatMap
      RDD[Array[String]]: [hello, world], [this, is, spark], [spark], [the, end]
      .flatten*  (*conceptually – RDDs have no flatten)
      RDD[String]: hello, world, this, is, spark, spark, the, end
  31. Spark: Frequency Example – map vs. flatMap
      .map(_.split("\\s+")) followed by .flatten*  =  .flatMap(line => line.split("\\s+"))
  32. Spark: Frequency Example – map vs. flatMap
      // Step 3. Split lines into words
      val words = lower.flatMap(line => line.split("\\s+"))
  33. Spark: Frequency Example – key-value pairs
      RDD[String]: hello, world, this, is, spark, spark, the, end
      .map(word => Tuple2(word, 1))
      RDD[(String, Int)]: (hello, 1), (world, 1), (this, 1), (is, 1), (spark, 1), (spark, 1), (the, 1), (end, 1)
  34. Spark: Frequency Example – key-value pairs
      .map(word => Tuple2(word, 1))  =  .map(word => (word, 1))
  35. Spark: Frequency Example – key-value pairs
      // Step 4. Convert into tuples
      val counts = words.map(word => (word, 1))
  36. Spark: Frequency Example – shuffling
      RDD[(String, Int)]: (hello, 1), (world, 1), (this, 1), (is, 1), (spark, 1), (spark, 1), (the, 1), (end, 1)
      .groupByKey
      RDD[(String, Iterable[Int])]: (end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1]), …
  37. Spark: Frequency Example – shuffling
      .groupByKey followed by .mapValues(_.reduce((a, b) => a + b))
      RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1), …
  38. Spark: Frequency Example – shuffling
      .groupByKey + .mapValues(_.reduce((a, b) => a + b))  =  .reduceByKey((a, b) => a + b)
  39. Spark: Frequency Example – shuffling
      // Step 5. Count all words
      val freq = counts.reduceByKey(_ + _)
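      A runnable sketch of the equivalence above (assumes a SparkContext sc); reduceByKey is usually preferable because it pre-aggregates values on the map side before the shuffle:
      val pairs = sc.parallelize(Seq(("spark", 1), ("hello", 1), ("spark", 1)))
      val viaGroup  = pairs.groupByKey().mapValues(_.reduce(_ + _))   // shuffles every (word, 1) pair
      val viaReduce = pairs.reduceByKey(_ + _)                        // combines per partition, then shuffles
      viaReduce.collect()   // => Array((hello,1), (spark,2)) in some order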
  40. Spark: Frequency Example – Top N (prepare data)
      RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1)
      .map(_.swap)
      RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
      // Step 6. Swap tuples (partial code)
      freq.map(_.swap)
  41. Spark: Frequency Example – Top N (first attempt)
      RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
      .sortByKey
      RDD[(Int, String)]: (2, spark), (1, end), (1, hello), (1, world)
  42. Spark: Frequency Example – Top N
      RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
      .top(N) – each partition computes a local top N
  43. Spark: Frequency Example – Top N
      .top(N) – local top N per partition, then a reduction on the driver
      Array[(Int, String)]: (2, spark), (1, end)   // for N = 2
  44. Spark: Frequency Example – Top N
      // Step 6. Swap tuples (complete code)
      val top = freq.map(_.swap).top(N)
  45. Spark: Frequency Example – complete code
      val spark = new SparkContext()
      // Create RDD from Hadoop text file
      val docs = spark.textFile("hdfs://docs/")
      // Split lines into words and process
      val lower = docs.map(line => line.toLowerCase)
      val words = lower.flatMap(line => line.split("\\s+"))
      val counts = words.map(word => (word, 1))
      // Count all words
      val freq = counts.reduceByKey(_ + _)
      // Swap tuples and get top results
      val top = freq.map(_.swap).top(N)   // N = desired number of top words
      top.foreach(println)
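      The same pipeline can also be packaged as a standalone application instead of being typed into the shell; a sketch for Spark 1.x, with a hypothetical object name and the top-N size chosen arbitrarily:
      import org.apache.spark.{SparkConf, SparkContext}

      object WordFrequency {                                  // hypothetical application name
        def main(args: Array[String]): Unit = {
          val conf  = new SparkConf().setAppName("WordFrequency")
          val spark = new SparkContext(conf)
          val n     = 10                                      // top-N size, chosen arbitrarily here
          val freq  = spark.textFile(args(0))                 // input path passed as the first argument
            .flatMap(_.toLowerCase.split("\\s+"))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
          freq.map(_.swap).top(n).foreach(println)
          spark.stop()
        }
      }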
  46. Agenda: • Why? • How? • What else? • Who? • Future?
  47. Spark: Streaming
      • Real-time computation, similar to Apache Storm
      • Streaming input is split into sliding windows of RDDs
      • Input is replicated in memory for fault tolerance
      • Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
  48. Spark: Streaming – discretized streams and windowed computations
  49. Spark: Streaming
      TwitterUtils.createStream()
        .filter(_.getText.contains("Spark"))
        .countByWindow(Seconds(5))
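      The snippet above is abbreviated; a more complete sketch of the same windowed pattern using a socket source instead of Twitter (host, port, and checkpoint directory are placeholders), assuming Spark Streaming 1.x:
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(1))        // 1-second micro-batches
      ssc.checkpoint("/tmp/streaming-checkpoint")              // windowed counting needs a checkpoint directory
      val lines = ssc.socketTextStream("localhost", 9999)      // placeholder host and port
      val mentions = lines.filter(_.contains("Spark"))
        .countByWindow(Seconds(30), Seconds(5))                // 30-second window, sliding every 5 seconds
      mentions.print()
      ssc.start()
      ssc.awaitTermination()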
  50. Spark: SQL
      • Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala
      • Uses SchemaRDDs composed of Row objects (comparable to a table in a traditional RDBMS)
      • A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive
      • Supports a domain-specific language for writing queries
  51. Spark: SQL
      registerFunction("LEN", (_: String).length)
      val queryRdd = sql("""
        SELECT * FROM counts
        WHERE LEN(word) = 10
        ORDER BY total DESC LIMIT 10
      """)
      queryRdd
        .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
        .collect()
        .foreach(println)
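      For context, a sketch of how a table like counts could be registered before running the query above, using the Spark 1.x SQLContext API (names are illustrative and reuse the freq RDD from the word-count example):
      import org.apache.spark.sql.SQLContext

      val sqlContext = new SQLContext(sc)               // assumes an existing SparkContext sc
      import sqlContext.createSchemaRDD                 // implicit conversion: RDD of case classes -> SchemaRDD

      case class WordCount(word: String, total: Int)    // illustrative schema
      val countsRdd = freq.map { case (word, total) => WordCount(word, total) }
      countsRdd.registerTempTable("counts")             // now queryable via sql("SELECT ... FROM counts")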
  52. Spark: GraphX
      • GraphX is the Spark API for graphs and graph-parallel computation
      • APIs to join and traverse graphs
      • Optimally partitions and indexes vertices and edges (represented as RDDs)
      • Supports PageRank, connected components, triangle counting, …
  53. Spark: GraphX
      val graph = Graph(userIdRDD, assocRDD)
      val ranks = graph.pageRank(0.0001).vertices
      val userRDD = sc.textFile("graphx/data/users.txt")
      val users = userRDD.map { line =>
        val fields = line.split(",")
        (fields(0).toLong, fields(1))
      }
      val ranksByUsername = users.join(ranks).map {
        case (id, (username, rank)) => (username, rank)
      }
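      Other built-in algorithms follow the same pattern; a short sketch of connected components on the same graph (variable names as above):
      // Each vertex gets labeled with the smallest vertex ID in its connected component
      val components = graph.connectedComponents().vertices
      val usersByComponent = users.join(components).map {
        case (id, (username, componentId)) => (componentId, username)
      }
      usersByComponent.take(10).foreach(println)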
  54. Spark: MLlib
      • Machine learning library similar to Apache Mahout
      • Supports statistics, regression, decision trees, clustering, PCA, gradient descent, …
      • Iterative algorithms are much faster thanks to in-memory processing
  55. Spark: MLlib
      import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
      import org.apache.spark.mllib.linalg.Vectors

      val data = sc.textFile("data.txt")
      val parsedData = data.map { line =>
        val parts = line.split(',')
        LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
      }
      val model = LinearRegressionWithSGD.train(parsedData, 100)
      val valuesAndPreds = parsedData.map { point =>
        val prediction = model.predict(point.features)
        (point.label, prediction)
      }
      val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
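      Clustering follows the same style; a minimal k-means sketch, assuming a SparkContext sc and a whitespace-separated numeric input file (the path is a placeholder):
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      val raw    = sc.textFile("hdfs:///tmp/kmeans_data.txt")   // placeholder path
      val points = raw.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()
      val model  = KMeans.train(points, k = 3, maxIterations = 20)
      val cost   = model.computeCost(points)                    // within-cluster sum of squared errors
      println(s"Centers: ${model.clusterCenters.mkString(", ")}, WSSSE: $cost")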
  56. Agenda: • Why? • How? • What else? • Who? • Future?
  57. Use Case: Yahoo Native Ads
      Logistic regression algorithm
      • 120 LOC in Spark/Scala
      • 30 min. for model creation on 100M samples and 13K features
      • Initial version launched within 2 hours of the Spark-on-YARN announcement (compared to several days for hardware acquisition, system setup, and data movement)
      http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  58. Use Case: Yahoo Mobile Ads
      Learning from mobile search ad click data
      • 600M labeled examples on HDFS, 100M sparse features
      • Spark programs for gradient-boosted decision trees
      • 6 hours for model training with 100 workers
      • Model accuracy very close to heavily manually tuned logistic regression models
      http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  59. Agenda: • Why? • How? • What else? • Who? • Future?
  60. Spark-on-YARN (Current) – Hadoop 2 with Spark as a YARN app
      HDFS (redundant, reliable storage) – YARN (cluster resource management) – MapReduce and Tez execution engines (serving Pig, Hive, …) alongside Spark (in-memory) and streaming engines (Storm, …), all running as YARN applications
  61. Spark-on-YARN (Future) – Hadoop 2 with Spark as an execution engine
      HDFS – YARN – MapReduce, Tez, and Spark as execution engines underneath Pig, Hive, Mahout, …, plus streaming engines (Storm, …) managed via Slider
  62. Spark: Future work
      • Spark Core
        • Focus on maturity, optimization, and pluggability
        • Enable long-running services (Slider); give resources back to the cluster when idle
        • Integrate with Hadoop enhancements: Timeline Server, ORC file format
      • Spark ecosystem
        • Focus on adding capabilities
  63. One more thing… Let’s get started with Spark!
  64. Hortonworks Sandbox 2.2 – http://hortonworks.com/hdp/downloads/
  65. Hortonworks Sandbox 2.2
      // 1. Download
      wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      // 2. Untar
      tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      // 3. Start Spark Shell
      ./bin/spark-shell
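      Once the shell is up, a quick sanity check that the SparkContext works (sc is the context the shell creates for you):
      val sample = sc.parallelize(1 to 1000)
      sample.filter(_ % 7 == 0).count()   // => 142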
  66. Thanks for listening!
      Twitter: @uweseiler
      Mail: uwe.seiler@codecentric.de
      XING: https://www.xing.com/profile/Uwe_Seiler
