
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Spark Streaming Kinesis Machine Learning Graph Processing Lambda Architecture Approximations

  1. Spark and Friends: Spark Streaming, Machine Learning, Graph Processing, Lambda Architecture, Approximations Chris Fregly East Bay Java User Group Oct 2014 Kinesis Streaming
  2. Who am I? Former Netflix’er (netflix.github.io) Spark Contributor (github.com/apache/spark) Founder (fluxcapacitor.com) Author (effectivespark.com, sparkinaction.com)
  3. Spark Streaming-Kinesis Jira
  4. Not-so-Quick Poll • Spark, Spark Streaming? • Hadoop, Hive, Pig? • Parquet, Avro, RCFile, ORCFile? • EMR, DynamoDB, Redshift? • Flume, Kafka, Kinesis, Storm? • Lambda Architecture? • PageRank, Collaborative Filtering, Recommendations? • Probabilistic Data Structs, Bloom Filters, HyperLogLog?
  5. Quick Poll Raiders or 49ers?
  6. “Streaming” Kinesis Streaming Video Streaming Piping Big Data Streaming
  7. Agenda • Spark, Spark Streaming • Use Cases • API and Libraries • Machine Learning • Graph Processing • Execution Model • Fault Tolerance • Cluster Deployment • Monitoring • Scaling and Tuning • Lambda Architecture • Probabilistic Data Structures • Approximations
  8. Timeline of Spark Evolution
  9. Spark and Berkeley AMP Lab Berkeley Data Analytics Stack (BDAS)
  10. Spark Overview • Based on the 2007 Microsoft Dryad paper • Written in Scala • Supports Java, Python, SQL, and R • Data fits in memory when possible, but not required • Improved efficiency over MapReduce – 100x in-memory, 2-10x on-disk • Compatible with Hadoop – File formats, SerDes, and UDFs
  11. Spark Use Cases • Ad hoc, exploratory, interactive analytics • Real-time + Batch Analytics – Lambda Architecture • Real-time Machine Learning • Real-time Graph Processing • Approximate, Time-bound Queries
  12. Explosion of Specialized Systems
  13. Unified High-level Spark Libraries • Spark SQL (Data Processing) • Spark Streaming (Streaming) • MLlib (Machine Learning) • GraphX (Graph Processing) • BlinkDB (Approximate Queries) • Statistics (Correlations, Sampling, etc.) • Others – Shark (Hive on Spark) – Spork (Pig on Spark)
  14. Benefits of Unified Libraries • Advancements in higher-level libraries are pushed down into core and vice-versa • Examples – Spark Streaming • GC and memory management improvements – Spark GraphX • IndexedRDD for random, hash-based access within a partition versus scanning the entire partition – Spark Core • Sort-based Shuffle
  15. Resilient Distributed Dataset (RDD)
  16. RDD Overview • Core Spark abstraction • Represents a dataset partitioned across the cluster’s nodes • Enables parallel processing on data sets • Partitions can be in-memory or on-disk • Immutable, recomputable, fault tolerant • Contains the transformation history (“lineage”) of the whole data set
  17. Types of RDDs • Spark Out-of-the-Box – HadoopRDD – FilteredRDD – MappedRDD – PairRDD – ShuffledRDD – UnionRDD – DoubleRDD – JdbcRDD – JsonRDD – SchemaRDD – VertexRDD – EdgeRDD • External – CassandraRDD (DataStax) – GeoRDD (Esri) – EsSpark (ElasticSearch)
  18. RDD Partitions
  19. RDD Lineage Examples
  20. Demo! • Load user data • Load gender data • Join user data with gender data • Analyze lineage
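A minimal Scala sketch of this demo, assuming hypothetical users.csv and genders.csv files keyed by a shared userId in the first column:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair RDD operations like join()

    val sc = new SparkContext(new SparkConf().setAppName("JoinDemo"))

    // Load user data as (userId, name) pairs
    val users = sc.textFile("users.csv").map(_.split(",")).map(f => (f(0), f(1)))

    // Load gender data as (userId, gender) pairs
    val genders = sc.textFile("genders.csv").map(_.split(",")).map(f => (f(0), f(1)))

    // Join on userId, then inspect the lineage of the joined RDD
    val joined = users.join(genders)
    println(joined.toDebugString)  // prints the RDD lineage graph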
  21. Join Optimizations • When joining a large dataset with a small dataset (reference data) • Broadcast the small dataset to each node/partition of the large dataset (one broadcast per node)
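A hedged sketch of the broadcast optimization, reusing the users and genders RDDs from the demo above; the gender table is assumed small enough to collect to the driver:

    // Collect the small reference dataset and ship one read-only copy
    // to each node, avoiding a shuffle of the large dataset
    val genderMap = genders.collectAsMap()
    val genderBC  = sc.broadcast(genderMap)

    // Map-side join: each task looks up the local broadcast copy
    val enriched = users.map { case (userId, name) =>
      (userId, name, genderBC.value.getOrElse(userId, "unknown"))
    }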
  22. Spark API
  23. Spark API Overview • Richer, more expressive than MapReduce • Native support for Java, Scala, Python, SQL, and R (mostly) • Unified API across all libraries • Operations – Transformations (lazy evaluation) – Actions (execute transformations)
  24. Transformations
  25. Actions
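To make the lazy-evaluation point concrete, a small sketch (the file name is hypothetical; sc is the SparkContext from the sketches above). Transformations only build up lineage; nothing runs until an action is invoked:

    val lines   = sc.textFile("data.txt")       // transformation (lazy)
    val words   = lines.flatMap(_.split(" "))   // transformation (lazy)
    val lengths = words.map(_.length)           // transformation (lazy)

    val total = lengths.reduce(_ + _)           // action: triggers the whole lineage
    val first = words.take(5)                   // action: computes only what it needs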
  26. Spark Execution Model
  27. Job Scheduling • Job – Contains many stages – Each stage contains many tasks • FIFO – Long-running jobs may starve resources for other jobs • Fair – Round-robin to prevent resource starvation • Fair Scheduler Pools – High-priority pool for important jobs – Separate user pools – FIFO within each pool – Modeled after the Hadoop Fair Scheduler
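A hedged sketch of switching on fair scheduling; the pool name is hypothetical, and pool definitions live in the XML file referenced by spark.scheduler.allocation.file:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("FairSchedulingExample")
      .set("spark.scheduler.mode", "FAIR")   // default is FIFO

    val sc = new SparkContext(conf)

    // Jobs submitted from this thread run in the named pool
    sc.setLocalProperty("spark.scheduler.pool", "highPriority")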
  28. Spark Execution Model Overview • Parallel, distributed • DAG-based • Lazy evaluation • Allows optimizations – Reduce disk I/O – Reduce shuffle I/O – Single pass through dataset – Parallel execution – Task pipelining • Data locality and rack awareness • Worker node fault tolerance using RDD lineage graphs per partition
  29. Execution Optimizations
  30. Task Pipelining
  31. Spark Cluster Deployment
  32. Types of Cluster Deployments • Spark Standalone (default) • YARN – Allows hierarchies of resources – Kerberos integration – Multiple workloads from disparate execution frameworks • Hive, Pig, Spark, MapReduce, Cascading, etc. • Mesos – Coarse-grained • A single long-running Mesos task runs Spark mini-tasks – Fine-grained • A new Mesos task for each Spark task • Higher overhead • Not good for long-running Spark jobs like Spark Streaming
  33. Spark Standalone Deployment
  34. Spark YARN Deployment
  35. Master High Availability • Multiple Master Nodes • ZooKeeper maintains current Master • Existing applications and workers will be notified of new Master election • New applications and workers need to explicitly specify current Master • Alternatives (Not recommended) – Local filesystem – NFS Mount
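A sketch of the ZooKeeper-backed recovery configuration, set on each Master node; hostnames and paths are placeholders:

    # spark-env.sh on every Master node
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"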
  36. Spark After Dark (sparkafterdark.com)
  37. Spark After Dark ~ Tinder
  38. Mutual Match!
  39. Goal: Increase Matches (1/2) • Top Influencers – PageRank – Most desirable people • People You May Know – Shortest Path – Facebook-enabled • Recommendations – Alternating Least Squares (ALS)
  40. Goal: Increase Matches (2/2) • Cluster on Interests, Height, Religion, etc. – K-Means – Nearest Neighbor • Textual Analysis of Profile Text – TF/IDF – N-gram – LDA Topic Extraction
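A hedged MLlib sketch of the ALS recommender, assuming a hypothetical ratings.csv of userId,profileId,rating rows; the hyperparameters are illustrative, not the talk's actual values:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, profile, rating) = line.split(",")
      Rating(user.toInt, profile.toInt, rating.toDouble)
    }

    // Factorize the user x profile matrix: rank 10, 10 iterations, lambda 0.01
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Predicted preference of user 42 for profile 17
    val score = model.predict(42, 17)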
  41. Demo!
  42. Spark Streaming
  43. Spark Streaming Overview • Low latency, high throughput, fault-tolerance (mostly) • Long-running Spark application • Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc. • Graceful shutdown, in-flight message draining • Uses Spark Core, DAG Execution Model, and Fault Tolerance • Submits micro-batch jobs to the cluster
  44. Spark Streaming Overview
  45. Spark Streaming Use Cases • ETL and enrichment of streaming data on ingest • Operational dashboards • Lambda Architecture
  46. Discretized Stream (DStream) • Core Spark Streaming abstraction • Micro-batches of RDDs • Operations similar to RDDs – they operate on the underlying RDDs
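A minimal DStream sketch; the socket source and 1-second batch interval are chosen for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair DStream operations

    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))      // 1s micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)   // DStream[String]
    val pairs  = lines.flatMap(_.split(" ")).map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)                  // applied to each micro-batch RDD
    counts.print()

    ssc.start()
    ssc.awaitTermination()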
  47. Spark Streaming API
  48. Spark Streaming API Overview • Rich, expressive API similar to core • Operations – Transformations (lazy) – Actions (execute transformations) • Window and State Operations • Requires checkpointing to snip long-running DStream lineage • Register DStream as a Spark SQL table for querying?! Wow.
  49. DStream Transformations
  50. DStream Actions
  51. Window and State DStream Operations
  52. DStream Window and State Example
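A hedged sketch of the window and state operators, reusing the pairs DStream from the word-count sketch above; the window sizes and checkpoint path are illustrative:

    ssc.checkpoint("hdfs:///spark/checkpoints")  // required to snip window/state lineage

    // Sliding counts over the last 30s, recomputed every 10s
    val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    // Running count per word across all batches so far
    val running = pairs.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + batch.sum)
    }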
  53. Spark Streaming Cluster Deployment
  54. Spark Streaming Cluster Deployment
  55. Adding Receivers
  56. Adding Workers
  57. Spark Streaming + Kinesis
  58. Spark Streaming + Kinesis Architecture
  59. Throughput and Pricing
  60. Demo! Kinesis Streaming https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/… Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
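Condensed from the KinesisWordCountASL example linked above, a sketch of attaching a Kinesis receiver; it assumes the ssc StreamingContext from the earlier sketch, and the stream name, endpoint, and intervals are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

    val stream = KinesisUtils.createStream(
      ssc,
      "myKinesisStream",                            // Kinesis stream name
      "https://kinesis.us-east-1.amazonaws.com",    // endpoint URL
      Seconds(2),                                   // KCL checkpoint interval (to DynamoDB)
      InitialPositionInStream.LATEST,               // where to start reading each shard
      StorageLevel.MEMORY_AND_DISK_2)               // replicate to 2 nodes on ingest

    // The stream yields raw bytes; decode before processing
    val words = stream.map(bytes => new String(bytes)).flatMap(_.split(" "))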
  61. Spark Streaming Fault Tolerance
  62. Characteristics of Sources • Buffered – Flume, Kafka, Kinesis – Allows replay and back pressure • Batched – Flume, Kafka, Kinesis – Improves throughput at expense of duplication on failure • Checkpointed – Kafka, Kinesis – Allows replay from specific checkpoint
  63. Message Delivery Guarantees • Exactly once [1] – No loss – No redeliver – Perfect delivery – Incurs higher latency for transactional semantics – Spark default per batch using DStream lineage – Degrades to weaker guarantees depending on source • At least once [1..n] – No loss – Possible redeliver • At most once [0,1] – Possible loss – No redeliver – *Best configuration if some data loss is acceptable • Ordered – Per partition: Kafka, Kinesis – Global across all partitions: Hard to scale
  64. Types of Checkpoints • Spark – Checkpointing of StreamingContext DStreams and metadata – Lineage of state and window DStream operations • Kinesis – The Kinesis Client Library (KCL) checkpoints the current position within each shard – Checkpoint info is stored in DynamoDB per Kinesis application, keyed by shard
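A hedged sketch of the Spark side: checkpoint the StreamingContext's metadata so a restarted driver can recover; the checkpoint path is a placeholder:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("CheckpointedApp")
      val ssc  = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint("hdfs:///spark/checkpoints")  // DStream metadata + state checkpoints
      // ... define sources and DStream operations here ...
      ssc
    }

    // On a clean start this creates a new context; after a driver failure
    // it rebuilds the context from the checkpoint directory instead
    val ssc = StreamingContext.getOrCreate("hdfs:///spark/checkpoints", createContext _)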
  65. Fault Tolerance • Points of Failure – Receiver – Driver – Worker/Processor • Possible Solutions – Use HDFS File Source for durability – Data Replication – Secondary/Backup Nodes – Checkpoints • Stream, Window, and State info
  66. Streaming Receiver Failure • Use a backup receiver • Use multiple receivers pulling from multiple shards – Use a checkpoint-enabled, sharded streaming source (i.e., Kafka and Kinesis) • Data is replicated to 2 nodes immediately upon ingestion – Will spill to disk if it doesn’t fit in memory • Possible loss of most-recent batch • Possible at-least-once delivery of batch • Use buffered sources for replay – Kafka and Kinesis
  67. Streaming Driver Failure • Use a backup Driver – Use DStream metadata checkpoint info to recover • Single point of failure – interrupts stream processing • Streaming Driver is a long-running Spark application – Schedules long-running stream receivers • State and Window RDD checkpoints to HDFS help avoid data loss (mostly)
  68. Stream Worker/Processor Failure • No problem! • DStream RDD partitions will be recalculated from lineage • Causes a blip in processing during node failover
  69. Spark Streaming Monitoring and Tuning
  70. Monitoring • Monitor driver, receiver, worker nodes, and streams • Alert upon failure or unusually high latency • Spark Web UI – Streaming tab • Ganglia, CloudWatch • StreamingListener callback
  71. Spark Web UI
  72. Tuning • Batch interval – High: reduces the overhead of submitting new tasks for each batch – Low: keeps latencies low – Sweet spot: DStream job time (scheduling + processing) is steady and less than the batch interval • Checkpoint interval – High: reduces checkpointing overhead – Low: reduces the amount of data lost on failure – Recommendation: 5-10x the sliding window interval • Use DStream.repartition() to increase the parallelism of DStream processing across the cluster • Use spark.streaming.unpersist=true to let the streaming framework decide when to unpersist old RDDs • Use the CMS GC for consistent processing times
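A sketch of a few of these knobs in code, reusing the pairs DStream from the earlier sketch; the partition count is illustrative:

    val conf = new SparkConf()
      .setAppName("TunedStreamingApp")
      .set("spark.streaming.unpersist", "true")  // framework unpersists old RDDs
      .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")  // CMS GC

    // Spread downstream processing across more cores than the receiver used
    val repartitioned = pairs.repartition(16)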
  73. Lambda Architecture
  74. Lambda Architecture Overview • Batch Layer – Immutable, batch read, append-only write – Source of truth – e.g., HDFS • Speed Layer – Mutable, random read/write – Most complex – Recent data only – e.g., Cassandra • Serving Layer – Immutable, random read, batch write – e.g., ElephantDB
  75. Spark + AWS + Lambda
  76. Spark + Lambda + GraphX + MLlib
  77. Demo! • Load JSON • Convert to Parquet file • Save Parquet file to disk • Query Parquet file directly
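A sketch of this demo with the Spark SQL API of that era; the users.json file and its name/age fields are hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load JSON; the schema is inferred automatically
    val users = sqlContext.jsonFile("users.json")

    // Convert to Parquet and save to disk
    users.saveAsParquetFile("users.parquet")

    // Query the Parquet file directly
    val parquetUsers = sqlContext.parquetFile("users.parquet")
    parquetUsers.registerTempTable("users")
    sqlContext.sql("SELECT name FROM users WHERE age >= 21").collect().foreach(println)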
  78. Approximations
  79. Approximation Overview • Required for scaling • Speed up analysis of large datasets • Reduce size of working dataset • Data is messy • Collection of data is messy • Exact isn’t always necessary • “Approximate is the new Exact”
  80. Some Approximation Methods • Approximate time-bound queries – BlinkDB • Bernoulli and Poisson Sampling – RDD: sample(), takeSample() • HyperLogLog – PairRDD: countApproxDistinctByKey() • Count-min Sketch – Spark Streaming and Twitter Algebird • Bloom Filters – Everywhere!
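Two of these in a short sketch, assuming a hypothetical events.log of user,page rows:

    import org.apache.spark.SparkContext._  // pair RDD operations

    val events = sc.textFile("events.log").map { line =>
      val Array(user, page) = line.split(",")
      (user, page)
    }

    // HyperLogLog under the hood: approximate distinct pages per user,
    // with ~5% relative standard deviation
    val approxDistinct = events.countApproxDistinctByKey(0.05)

    // 10% Bernoulli sample without replacement (fixed seed for repeatability)
    val sampled = events.sample(false, 0.1, 42L)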
  81. Approximations In Action Figure: Memory Savings with Approximation Techniques (http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
  82. Spark Statistics Library • Correlations – Dependence between 2 random variables – Pearson, Spearman • Hypothesis Testing – Measure of statistical significance – Chi-squared test • Stratified Sampling – Sample separately from different sub-populations – Bernoulli and Poisson sampling – With and without replacement • Random data generator – Uniform, standard normal, and Poisson distribution
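A minimal sketch of the correlation API with toy data:

    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    val x: RDD[Double] = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
    val y: RDD[Double] = sc.parallelize(Seq(2.0, 4.0, 6.0, 9.0))

    // Pearson measures linear dependence; Spearman measures rank (monotonic) dependence
    val pearson  = Statistics.corr(x, y, "pearson")
    val spearman = Statistics.corr(x, y, "spearman")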
  83. Summary • Spark, Spark Streaming Overview • Use Cases • API and Libraries • Machine Learning • Graph Processing • Execution Model • Fault Tolerance • Cluster Deployment • Monitoring • Scaling and Tuning • Lambda Architecture • Probabilistic Data Structures • Approximations http://effectivespark.com http://sparkinaction.com Thanks!! Chris Fregly @cfregly
