
Intro to Spark - for Denver Big Data Meetup

  1. Introduction to Spark. Gwen Shapira, Solutions Architect
  2. Spark is next-generation MapReduce
  3. MapReduce has been around for a while. It made distributed computation easier. But can we do better?
  4. MapReduce Issues • Launching mappers and reducers takes time • One MR job can rarely do a full computation • Writing to disk (in triplicate!) between each job • Going back to the queue between jobs • No in-memory caching • No iterations • Very high latency • Not the greatest APIs either
  5. Spark: Easy to Develop, Fast to Run
  6. Spark Features • In-memory cache • General execution graphs • APIs in Scala, Java and Python • Integrates with but does not depend on Hadoop
  7. Why is it better? • (Much) faster than MR • Iterative programming – a must-have for ML • Interactive – allows rapid exploratory analytics • Flexible execution graph: map, map, reduce, reduce, reduce, map • High productivity compared to MapReduce
  8. Word Count. Remember MapReduce WordCount?
      val file = spark.textFile("hdfs://…")
      file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
  9. Agenda • Concepts • Examples • Streaming • Summary
  10. Concepts
  11. AMP Lab BDAS (Berkeley Data Analytics Stack)
  12. CDH5 (simplified): stack diagram with HDFS plus an in-memory cache and YARN at the base, and Spark, MR, Impala, and Spark Streaming with MLlib running on top
  13. How Spark runs on a Cluster: diagram of a Driver coordinating several Workers, each holding its partition of the data in RAM
  14. Workflow • SparkContext in the driver connects to the Master • Master allocates resources for the app on the cluster • SC acquires executors on worker nodes • SC sends the app code (JAR) to the executors • SC sends tasks to the executors
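      To make that workflow concrete, here is a minimal sketch of a driver program creating its SparkContext, assuming the SparkConf-style API; the master URL, app name, and JAR path are placeholders:

      import org.apache.spark.{SparkConf, SparkContext}

      // Driver program: constructing the SparkContext connects to the Master,
      // which allocates resources and acquires executors on the worker nodes.
      val conf = new SparkConf()
        .setMaster("spark://master-host:7077")   // placeholder master URL
        .setAppName("MyApp")                     // placeholder app name
        .setJars(Seq("target/myapp.jar"))        // app code shipped to executors
      val sc = new SparkContext(conf)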
  15. RDD – Resilient Distributed Dataset • Collection of elements • Read-only • Partitioned • Fault-tolerant • Supports parallel operations
  16. RDD Types • Parallelized collections: parallelize(Seq) • HDFS files: text, sequence, or any InputFormat • Both support the same operations
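      An illustrative sketch of the two RDD types; the data and the HDFS path are made up:

      // Parallelized collection: distribute a local Seq across the cluster
      val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

      // HDFS file: one element per line; any InputFormat works similarly
      val lines = sc.textFile("hdfs://namenode/data/input.txt")

      // Both types support the same operations
      nums.map(_ * 2).reduce(_ + _)
      lines.map(_.length).reduce(_ + _)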
  17. Operations. Transformations: map, filter, sample, join, reduceByKey, groupByKey, distinct. Actions: reduce, collect, count, first/take, saveAs, countByKey
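      A hypothetical sketch of the split: transformations only describe new RDDs, while actions trigger computation and return results (paths are placeholders):

      val lines = sc.textFile("hdfs://…/events")

      // Transformations: lazily build new RDDs, nothing runs yet
      val pairs  = lines.map(line => (line.split(",")(0), 1))
      val counts = pairs.reduceByKey(_ + _)

      // Actions: run the job and return results to the driver
      val total = lines.count()
      val first = counts.first()
      counts.saveAsTextFile("hdfs://…/counts")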
  18. Transformations are lazy
  19. Lazy transformation, step by step: find all lines that mention "MySQL"; keep only the timestamp portion of each line; set the date and hour as key, 1 as value; reduce by key and sum the values; return the result as an Array so it can be printed. Only at that last step does Spark say: "Aha! Finally something to do!"
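      A sketch of that pipeline in Scala; the field positions and timestamp parsing are invented for illustration. Every line except the last is lazy; collect() is the action that finally triggers work:

      val lines  = sc.textFile("hdfs://…/app.log")
      val mysql  = lines.filter(_.contains("MySQL"))    // lazy
      val stamps = mysql.map(_.split(" ")(0))           // timestamp portion (assumed field 0)
      val byHour = stamps.map(ts => (ts.take(13), 1))   // date and hour as key, 1 as value
      val counts = byHour.reduceByKey(_ + _)            // still lazy
      val result = counts.collect()                     // "Aha! Finally something to do!"
      result.foreach(println)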
  20. Persistence / Caching • Store an RDD in memory for later use • Each node persists a partition • persist() marks an RDD for caching • It will be cached the first time an action is performed • Use for iterative algorithms
  21. Caching – Storage Levels • MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2…
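      A minimal sketch of picking a storage level explicitly; the input path is a placeholder, and note that a level can only be assigned to an RDD once:

      import org.apache.spark.storage.StorageLevel

      val data = sc.textFile("hdfs://…/input")

      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
      val hot = data.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY)

      // For data that may not fit in RAM: spill partitions to local disk
      // instead of recomputing them from source
      val big = data.map(_.split("\t")).persist(StorageLevel.MEMORY_AND_DISK)

      hot.count()   // first action materializes the cache
      hot.first()   // reuses the cached partitions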
  22. Fault Tolerance • Lost partitions can be re-computed from source data • Because we remember all transformations
      msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                      .map(lambda s: s.split("\t")[2]))
      (Lineage diagram: HDFS file → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD)
  23. Examples
  24. Word Count. Remember MapReduce WordCount?
      val file = spark.textFile("hdfs://…")
      file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
  25. Log Mining • Load error messages from a log into memory • Interactively search for patterns
  26. Log Mining
      val lines = spark.textFile("hdfs://…")              // base RDD
      val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
      val messages = errors.map(_.split("\t")(2))
      val cachedMsgs = messages.cache()
      cachedMsgs.filter(_.contains("foo")).count          // action
      cachedMsgs.filter(_.contains("bar")).count
      …
  27. Logistic Regression • Read two sets of points • Look for a plane w that separates them • Perform gradient descent: start with a random w; on each iteration, sum a function of w over the data; move w in a direction that improves it
  28. Intuition (figure)
  29. Logistic Regression
      val points = spark.textFile(…).map(parsePoint).cache()
      var w = Vector.random(D)
      for (i <- 1 to ITERATIONS) {
        val gradient = points.map(p =>
          (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
      }
      println("Final separating plane: " + w)
  30. Conviva Use Case • Monitor online video consumption • Analyze trends. Need to run tens of queries like this a day:
      SELECT videoName, COUNT(1)
      FROM summaries
      WHERE date = '2011_12_12' AND customer = 'XYZ'
      GROUP BY videoName;
  31. Conviva with Spark
      val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)
      val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache
      val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1) }
      val reduceFn: (Long, Long) => Long = { (a, b) => a + b }
      val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
  32. Streaming
  33. What is it? • An extension of the Spark API • For high-throughput, fault-tolerant processing of live data streams
  34. Sources & Outputs. Sources: Kafka, Flume, Twitter, JMS queues, TCP sockets. Outputs: HDFS, databases, dashboards
  35. Architecture: diagram showing input flowing into the Streaming Context, which sits on top of the Spark Context
  36. DStreams • The stream is broken down into micro-batches • Each micro-batch is an RDD • This means any Spark function or library can be applied to a stream • Including MLlib, graph processing, etc.
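      A sketch of why this matters in practice: transform() and foreachRDD() expose each micro-batch as an ordinary RDD, so existing RDD code applies unchanged. The socket source and batch interval here are placeholders:

      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc    = new StreamingContext(sc, Seconds(1))
      val events = ssc.socketTextStream("localhost", 9999)

      // Each micro-batch is an RDD, so plain RDD functions apply directly
      val deduped = events.transform(rdd => rdd.distinct())
      deduped.foreachRDD(rdd => println("batch size: " + rdd.count()))

      ssc.start()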
  37. Processing DStreams
  38. Processing DStreams - Stateless
  39. Processing DStreams - Stateful
  40. DStream Operators • Transformations produce a DStream from one or more parent streams • Stateless (independent per interval): map, reduce • Stateful (share data across intervals): window, incremental aggregation, time-skewed join • Output operators write data to an external system (e.g., save an RDD to HDFS): save, foreach
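      An illustrative sketch contrasting a stateless per-batch reduce with a stateful windowed one; the socket source and the batch, window, and slide intervals are made up:

      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import StreamingContext._

      val ssc   = new StreamingContext(sc, Seconds(1))
      val pairs = ssc.socketTextStream("localhost", 9999)
                     .flatMap(_.split(" ")).map(w => (w, 1))

      // Stateless: each 1-second batch is reduced independently
      val perBatch = pairs.reduceByKey(_ + _)

      // Stateful: counts over a sliding 30-second window, updated every 10 seconds
      val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

      perBatch.print()
      windowed.print()
      ssc.start()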
  41. Fault Recovery • Input from TCP, Flume or Kafka is stored on 2 nodes • In case of failure, missing RDDs are re-computed from surviving nodes • RDDs are deterministic, so any re-computation leads to the same result • Transformations can guarantee exactly-once semantics, even through failure
  42. Key Question: how fast can the system recover?
  43. Example – Streaming WordCount
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import StreamingContext._
      ...
      // Create the context and set up a network input stream
      val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
      val lines = ssc.socketTextStream(args(1), args(2).toInt)
      // Split the lines into words, count them,
      // and print some of the counts on the master
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      // Start the computation
      ssc.start()
  44. Shark
  45. Shark Architecture • Identical to Hive: same CLI, JDBC, SQL parser, metastore • Replaced the optimizer, plan generator and execution engine • Added a cache manager • Generates Spark code instead of MapReduce
  46. Hive Compatibility • MetaStore • HQL • UDF / UDAF • SerDes • Scripts
  47. Dynamic Query Plans • Hive metadata often lacks statistics • Join types often require hinting • Shark gathers statistics per partition while materializing map output: partition sizes, record counts, skew, histograms • It alters the plan accordingly
  48. Columnar Memory Store • Better compression • CPU efficiency • Cache locality
  49. Spark + Shark Integration
      val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
      val features = users.mapRows { row =>
        new Vector(extractFeature1(row.getInt("age")),
                   extractFeature2(row.getStr("country")), ...)
      }
      val trainedVector = logRegress(features.cache())
  50. Summary
  51. Why Spark? • Flexible • High performance • Machine learning and iterative algorithms • Interactive data exploration • Developer productivity
  52. Why not Spark? • Still immature • Uses *lots* of memory • Equivalent functionality exists in Impala, Storm, etc.
  53. How Spark Works • RDDs – resilient distributed datasets • Lazy transformations • Fault-tolerant caching • Streams – micro-batches of RDDs
