Intro to Spark - for Denver Big Data Meetup

Notes
  • Transformations create new datasets from existing ones. Actions return a value to the driver after running a computation on the dataset.
  • When you apply transformations to an RDD, they don’t happen right away. Instead they are remembered and computed only when an action requires a result.
  • Spark’s storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one:
      • If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
      • If not, try MEMORY_ONLY_SER and select a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
      • Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
      • Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
      • To define your own storage level (say, with a replication factor of 3 instead of 2), use the factory method apply() of the StorageLevel singleton object (see the sketch after these notes).
  • Count the number of words received from a network server every second. socketTextStream returns a DStream of lines received from a TCP socket source. The lines DStream is transformed into a DStream of words using the flatMap operation, where each line is split into words. The words DStream is then mapped to a DStream of (word, 1) pairs, which is finally reduced to get the word counts. wordCounts.print() prints the first ten counts generated every second.
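
  A minimal Scala sketch of that last point, assuming the Spark 1.x StorageLevel.apply(useDisk, useMemory, deserialized, replication) signature and an existing SparkContext named sc; the RDD itself is a placeholder:

      import org.apache.spark.storage.StorageLevel

      // Hypothetical custom level: in-memory, deserialized, replicated on 3 nodes
      // instead of the 2 used by MEMORY_ONLY_2.
      val memoryOnly3 = StorageLevel(false, true, true, 3)

      val someRdd = sc.parallelize(1 to 1000)   // placeholder RDD
      someRdd.persist(memoryOnly3)
      someRdd.count()                           // the first action materializes the replicated cache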

    1. 1. 1 Introduction to Spark Gwen Shapira, Solutions Architect
    2. 2. Spark is next-generation Map Reduce 2
    3. 3. MapReduce has been around for a while. It made distributed computation easier. But can we do better? 3
    4. 4. MapReduce Issues • Launching mappers and reducers takes time • One MR job can rarely do a full computation • Writing to disk (in triplicate!) between each job • Going back to queue between jobs • No in-memory caching • No iterations • Very high latency • Not the greatest APIs either 4
    5. 5. Spark: Easy to Develop, Fast to Run 5
    6. 6. Spark Features • In-memory cache • General execution graphs • APIs in Scala, Java and Python • Integrates but does not depend on Hadoop 6
    7. 7. Why is it better? • (Much) Faster than MR • Iterative programming – Must have for ML • Interactive – allows rapid exploratory analytics • Flexible execution graph: • Map, map, reduce, reduce, reduce, map • High productivity compared to MapReduce 7
    8. 8. Word Count
          file = spark.textFile("hdfs://…")
          file.flatMap(line => line.split(" "))
              .map(word => (word, 1))
              .reduceByKey(_ + _)
          Remember MapReduce WordCount? 8
    9. 9. Agenda • Concepts • Examples • Streaming • Summary 9
    10. 10. 10 Concepts
    11. 11. AMP Lab BDAS 11
    12. 12. CDH5 (simplified) 12 [Diagram: HDFS + in-memory cache and YARN underneath; MR, Impala, Spark and Spark Streaming with MLlib running on top]
    13. 13. How Spark runs on a Cluster 13 [Diagram: a Driver connected to three Workers, each holding Data in RAM]
    14. 14. Workflow • SparkContext in driver connects to Master • Master allocates resources for app on cluster • SC acquires executors on worker nodes • SC sends the app code (JAR) to executors • SC sends tasks to executors 14
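
       A minimal driver-side sketch of the workflow above, assuming a Spark 1.x-era API; the app name, master URL and JAR path are placeholders:

          import org.apache.spark.{SparkConf, SparkContext}

          // The driver builds a SparkContext, which connects to the master,
          // acquires executors on the workers, and ships the application JAR to them.
          val conf = new SparkConf()
            .setAppName("IntroToSpark")                    // placeholder app name
            .setMaster("spark://master-host:7077")         // placeholder master URL
            .setJars(Seq("target/intro-to-spark.jar"))     // placeholder application JAR

          val sc = new SparkContext(conf)
          // Tasks are shipped to the executors whenever an action runs on an RDD.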
    15. 15. RDD – Resilient Distributed Dataset • Collection of elements • Read-only • Partitioned • Fault-tolerant • Supports parallel operations 15
    16. 16. RDD Types • Parallelized Collection • Parallelize(Seq) • HDFS files • Text, Sequence or any InputFormat • Both support same operations 16
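
       A short sketch of both RDD types, reusing the SparkContext sc from the previous sketch; the HDFS path is a placeholder:

          // 1. Parallelized collection
          val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

          // 2. HDFS file (textFile shown here; any Hadoop InputFormat works via hadoopFile/sequenceFile)
          val lines = sc.textFile("hdfs://namenode:8020/data/events.txt")   // placeholder path

          // Both support the same operations.
          nums.map(_ * 2).collect()
          lines.filter(_.nonEmpty).count()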
    17. 17. Operations • Transformations: Map, Filter, Sample, Join, ReduceByKey, GroupByKey, Distinct • Actions: Reduce, Collect, Count, First/Take, SaveAs, CountByKey 17
    18. 18. Transformations are lazy 18
    19. 19. Lazy transformation 19 Find all lines that mention “MySQL” Only the timestamp portion of the line Set the date and hour as key, 1 as value Now reduce by key and sum the values Return the result as Array so I can print Find lines, get timestamp… Aha! Finally something to do!
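
       A Scala sketch of the pipeline narrated on this slide, assuming the SparkContext sc from earlier and a hypothetical log format whose first 13 characters are "yyyy-MM-dd HH"; nothing executes until collect() is called:

          val logLines = sc.textFile("hdfs://namenode:8020/logs/app.log")   // placeholder path

          val counts = logLines
            .filter(_.contains("MySQL"))       // find all lines that mention "MySQL"
            .map(_.take(13))                   // keep only the date-and-hour prefix (hypothetical format)
            .map(hour => (hour, 1))            // date and hour as key, 1 as value
            .reduceByKey(_ + _)                // reduce by key and sum the values

          // Aha! Finally something to do: the action triggers the whole chain.
          val result: Array[(String, Int)] = counts.collect()
          result.foreach(println)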
    20. 20. Persistence / Caching • Store RDD in memory for later use • Each node persists a partition • Persist() marks an RDD for caching • It will be cached first time an action is performed • Use for iterative algorithms 20
    21. 21. Caching – Storage Levels • MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2… 21
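
       A small sketch of marking an RDD with a non-default level, under the same assumptions as the earlier examples; MEMORY_ONLY_SER pairs well with a fast serializer such as Kryo:

          import org.apache.spark.storage.StorageLevel

          // persist() only marks the RDD; it is cached the first time an action runs.
          val sessions = sc.textFile("hdfs://namenode:8020/data/sessions")   // placeholder path
            .map(_.split("\t"))

          sessions.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory: smaller, a bit slower to read
          sessions.count()                                 // first action populates the cache

          // Optional, set on the SparkConf before the SparkContext is created:
          //   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")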
    22. 22. Fault Tolerance • Lost partitions can be re-computed from source data • Because we remember all transformations 22
          msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                         .map(lambda s: s.split("\t")[2])
          Lineage: HDFS File -> Filtered RDD (filter, func = startswith(…)) -> Mapped RDD (map, func = split(…))
    23. 23. 23 Examples
    24. 24. Word Count
          file = spark.textFile("hdfs://…")
          file.flatMap(line => line.split(" "))
              .map(word => (word, 1))
              .reduceByKey(_ + _)
          Remember MapReduce WordCount? 24
    25. 25. Log Mining • Load error messages from a log into memory • Interactively search for patterns 25
    26. 26. Log Mining
          lines = spark.textFile("hdfs://…")                 // Base RDD
          errors = lines.filter(_.startsWith("ERROR"))       // Transformed RDD
          messages = errors.map(_.split("\t")(2))
          cachedMsgs = messages.cache()
          cachedMsgs.filter(_.contains("foo")).count         // Action
          cachedMsgs.filter(_.contains("bar")).count
          … 26
    27. 27. Logistic Regression • Read two sets of points • Looks for a plane W that separates them • Perform gradient descent: • Start with random W • On each iteration, sum a function of W over the data • Move W in a direction that improves it 27
    28. 28. Intuition 28
    29. 29. Logistic Regression
          val points = spark.textFile(…).map(parsePoint).cache()
          var w = Vector.random(D)
          for (i <- 1 to ITERATIONS) {
            val gradient = points.map { p =>
              (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
            }.reduce(_ + _)
            w -= gradient
          }
          println("Final separating plane: " + w)
          29
    30. 30. Conviva Use-Case • Monitor online video consumption • Analyze trends Need to run tens of queries like this a day: SELECT videoName, COUNT(1) FROM summaries WHERE date='2011_12_12' AND customer='XYZ' GROUP BY videoName; 30
    31. 31. Conviva With Spark
          val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)
          val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache
          val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1) }
          val reduceFn: (Long, Long) => Long = { (a, b) => a + b }
          val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
          31
    32. 32. 32 Streaming
    33. 33. What is it? • Extension of Spark API • For high-throughput fault-tolerant processing of live data streams 33
    34. 34. Sources & Outputs • Kafka • Flume • Twitter • JMS Queues • TCP sockets • HDFS • Databases • Dashboards 34
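
       A hedged sketch of wiring one of these sources (Kafka, via the spark-streaming-kafka artifact's receiver-based KafkaUtils.createStream) into Spark Streaming; the master URL, ZooKeeper quorum, group id and topic are placeholders:

          import org.apache.spark.streaming.{Seconds, StreamingContext}
          import org.apache.spark.streaming.kafka.KafkaUtils

          val ssc = new StreamingContext("spark://master-host:7077", "KafkaExample", Seconds(1))

          val kafkaLines = KafkaUtils.createStream(
              ssc,
              "zk-host:2181",            // ZooKeeper quorum (placeholder)
              "intro-to-spark-group",    // consumer group id (placeholder)
              Map("events" -> 1)         // topic -> number of receiver threads (placeholder)
            ).map(_._2)                  // keep only the message value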
    35. 35. Architecture 35 [Diagram: Input -> Streaming Context -> Spark Context]
    36. 36. DStreams • Stream is broken down into micro-batches • Each micro-batch is an RDD • This means any Spark function or library can apply to a stream • Including ML-Lib, graph processing, etc. 36
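
       Because each micro-batch is an ordinary RDD, regular Spark code applies per batch. A sketch continuing the streaming example above (any DStream of strings would do in place of kafkaLines):

          // Per-batch RDD transformation via transform()
          val upper = kafkaLines.transform(rdd => rdd.map(_.toUpperCase))

          upper.foreachRDD { rdd =>
            // Any RDD API (or MLlib, graph processing, etc.) can run here on the current batch.
            println("records in this batch: " + rdd.count())
          }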
    37. 37. Processing DStreams 37
    38. 38. Processing DStreams - Stateless 38
    39. 39. Processing DStreams - Stateful 39
    40. 40. DStream Operators • Transformations produce a DStream from one or more parent streams • Stateless (independent per interval): map, reduce • Stateful (share data across intervals): window, incremental aggregation, time-skewed join • Output operators write data to an external system (e.g. save an RDD to HDFS): save, foreach 40
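
       A sketch of the three operator families, continuing the streaming example above; the window and slide durations and the output path are placeholders:

          // Stateless: applied independently to each 1-second batch
          val words = kafkaLines.flatMap(_.split(" "))
          val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
          counts.print()

          // Stateful: a sliding window shares data across intervals
          // (counts over the last 30 seconds, recomputed every 10 seconds)
          val windowed = words.map(w => (w, 1))
            .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

          // Output: write each window's result to an external system (HDFS here)
          windowed.saveAsTextFiles("hdfs://namenode:8020/streaming/counts")   // placeholder prefix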
    41. 41. Fault Recovery • Input from TCP, Flume or Kafka is stored on 2 nodes • In case of failure, missing RDDs are re-computed from the surviving nodes • RDDs are deterministic • So any recomputation leads to the same result • Transformations can guarantee exactly-once semantics • Even through failure 41
    42. 42. Key Question - How fast can the system recover? 42
    43. 43. Example – Streaming WordCount
          import org.apache.spark.streaming.{Seconds, StreamingContext}
          import StreamingContext._
          ...
          // Create the context and set up a network input stream
          val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
          val lines = ssc.socketTextStream(args(1), args(2).toInt)
          // Split the lines into words, count them,
          // and print some of the counts on the master
          val words = lines.flatMap(_.split(" "))
          val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
          wordCounts.print()
          // Start the computation
          ssc.start()
          43
    44. 44. 44 Shark
    45. 45. Shark Architecture • Identical to Hive • Same CLI, JDBC, SQL parser, metastore • Replaced the optimizer, plan generator and execution engine • Added a cache manager • Generates Spark code instead of MapReduce 45
    46. 46. Hive Compatibility • MetaStore • HQL • UDF / UDAF • SerDes • Scripts 46
    47. 47. Dynamic Query Plans • Hive metadata often lacks statistics • Join types often require hinting • Shark gathers statistics per partition • While materializing map output • Partition sizes, record counts, skew, histograms • Alters the plan accordingly 47
    48. 48. Columnar Memory Store • Better compression • CPU efficiency • Cache Locality 48
    49. 49. Spark + Shark Integration
          val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
          val features = users.mapRows { row =>
            new Vector(extractFeature1(row.getInt("age")),
                       extractFeature2(row.getStr("country")), ...)
          }
          val trainedVector = logRegress(features.cache())
          49
    50. 50. 50 Summary
    51. 51. Why Spark? • Flexible • High performance • Machine learning, iterative algorithms • Interactive data explorations • Developer productivity 51
    52. 52. Why not Spark? • Still immature • Uses *lots* of memory • Equivalent functionality in Impala, Storm, etc 52
    53. 53. How Spark Works? • RDDs – resilient distributed data • Lazy transformations • Fault tolerant caching • Streams – micro-batches of RDDs 53
    54. 54. 54
