
Stream, stream, stream: Different streaming methods with Spark and Kafka


Going into different streaming methods, we will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will also present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* “Streaming” over Data Lake using Kafka

Published in: Data & Analytics


  1. 1. Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka - Itai Yaffe & Ron Tevel, Nielsen
  2. 2. Introduction ● Ron Tevel ○ Big Data developer ○ Developing Big Data infrastructure solutions ● Itai Yaffe ○ Big Data Tech Lead ○ Dealing with Big Data challenges since 2012
  3. 3. Introduction - part 2 (or: “your turn…”) ● Data engineers? Data architects? Something else? ● Working with Spark? Planning to? ● Working with Kafka? Planning to? ● Cloud deployments? On-prem?
  4. 4. Agenda ● Nielsen Marketing Cloud (NMC) ○ About ○ High-level architecture ● Data flow - past and present ● Spark Streaming ○ “Stateless” and “stateful” use-cases ● Spark Structured Streaming ● “Streaming” over our Data Lake
  5. 5. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen in March 2015 ● A data company ● Machine learning models for insights ● Targeting ● Business decisions
  6. 6. Nielsen Marketing Cloud - questions we try to answer ● How many users of a certain profile can we reach? E.g. a campaign for fancy women's sneakers ● How many hits did a specific web page get in a date range?
  7. 7. NMC high-level architecture
  10. 10. Data flow in the old days... Events flow from our Serving system, and we need to ETL the data into our data stores (DB, DWH, etc.) ● In the past, events were written to CSV files ○ Some fields had double quotes, e.g.: 2014-07-17,12:55:38,204,400,US|FL|daytona beach|32114,cdde7b60a3117cc4c539b10faad665a9," 2F%3Fp%3D204%26g%3D400%26buid%3D6989098507373987292%26j%3D0","http%3A%2F%2Fwww rentals%2Fflorida%2Forlando.html",2,2,0,"1619691,9995","","","1",,"Windows 7","Chrome" ● Processed by a standalone Java process ● This architecture had many problems ○ Truncated lines in input files ○ Couldn't enforce a schema ○ Had to scale the processes "manually"
  11. 11. That's one small step for [a] man... Moved to Spark (and Scala) in 2014 ● Spark ○ An engine for large-scale data processing ○ Distributed, scalable ○ Unified framework for batch, streaming, machine learning, etc. ○ Was gaining a lot of popularity in the Big Data community ○ Built on RDDs (Resilient Distributed Datasets) ■ A fault-tolerant collection of elements that can be operated on in parallel ● Scala ○ Combines object-oriented and functional programming ○ A first-class citizen in Spark ● Converted the standalone Java processes to Spark batch jobs ○ Solved the scaling issues ○ Still faced the CSV-related issues
  12. 12. Data flow - the modern way Introducing Kafka ● Open-source stream-processing platform ○ Highly scalable ○ Publish/Subscribe (a.k.a. pub/sub) ○ Schema enforcement ○ Much more ● Originally developed by LinkedIn ● Graduated from the Apache Incubator in late 2012 ● Quickly became the de facto standard in the industry ● Today commercial development is led by Confluent
  13. 13. Data flow - the modern way (cont.) … Along with Spark Streaming ● A natural evolution of our Spark batch jobs (unified framework - remember?) ● Introduced the DStream concept ○ Continuous stream of data ○ Represented by a continuous series of RDDs ● Works in micro-batches ○ Each RDD in a DStream contains data from a certain interval (e.g. 5 minutes)
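The micro-batch idea behind DStreams can be illustrated with a few lines of plain Python. This is a toy model only, not Spark code; the function and data names are made up:

```python
# Toy illustration of the DStream model: a continuous stream of timestamped
# events is cut into fixed-interval micro-batches, and each batch is then
# processed as one unit (an RDD, in Spark's case).

def micro_batches(events: list, interval: int) -> dict:
    # Group (timestamp, payload) events into interval-aligned batches.
    batches: dict = {}
    for ts, payload in events:
        batches.setdefault(ts // interval, []).append(payload)
    return batches

stream = [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (11, "e")]
batches = micro_batches(stream, interval=5)
# -> {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e']}
```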
  14. 14. Spark Streaming - “stateless” app use-case We started with Spark Streaming over Kafka (in 2015) ● Our Streaming apps were “stateless”, i.e.: ○ Reading a batch of messages from Kafka ○ Performing simple transformations on each message (no aggregations) ○ Writing each batch to persistent storage (S3) ● Stateful operations (aggregations) were performed in batch on files, either by ○ Spark jobs ○ ETLs in our DB/DWH
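The stateless pattern above (per-message transformation, no state carried between batches) can be sketched in plain Python. All names and the sample data are hypothetical stand-ins for the Kafka batches and S3 writes, not the actual Nielsen code:

```python
# Conceptual sketch of a "stateless" streaming app: each micro-batch is
# transformed message-by-message and persisted, with no state carried
# over between batches.

def transform(message: dict) -> dict:
    # A simple per-message transformation (no aggregation): extract the
    # country code from a pipe-delimited geo field.
    return {"user": message["user"], "country": message["geo"].split("|")[0]}

def process_batch(batch: list, storage: list) -> None:
    # "Write" the transformed batch to persistent storage (S3 in the talk).
    storage.append([transform(m) for m in batch])

storage: list = []
batch = [{"user": "u1", "geo": "US|FL|daytona beach"},
         {"user": "u2", "geo": "IL|TA|tel aviv"}]
process_batch(batch, storage)
```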
  15. 15. Spark Streaming - “stateless” app use-case
  16. 16. The need for stateful streaming Fast forward a few months... ● New requirements were being raised ● Specific use-case : ○ To take the load off of the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark streaming app
  17. 17. Stateful streaming via “local” aggregations ● The way to achieve it was: ○ Read messages from Kafka ○ Aggregate the messages of the current micro-batch ○ Combine the results with those of the previous micro-batches (stored on the cluster’s HDFS) ○ Write the results back to HDFS ○ Every X batches: ■ Update the DB with the aggregated data (some sort of UPSERT) ■ Delete the aggregated files from HDFS ● UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL) ○ For example, given t1 with columns a (the key) and b (starting from an empty table) ■ INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2 ■ INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
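The UPSERT semantics in the t1 example above can be simulated with a plain dict. This is a toy sketch; `upsert` is a made-up helper, not MySQL or Spark code:

```python
# Simulating MySQL's INSERT ... ON DUPLICATE KEY UPDATE b = b + VALUES(b),
# as in the slide's t1 example: column a is the key, column b accumulates.

def upsert(table: dict, a: int, b: int) -> None:
    # On a duplicate key, add the new value to the existing one;
    # otherwise insert the row as-is.
    table[a] = table.get(a, 0) + b

t1: dict = {}
upsert(t1, 1, 2)   # -> a=1, b=2
upsert(t1, 1, 5)   # -> a=1, b=7
```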
  18. 18. Stateful streaming via “local” aggregations
  19. 19. Stateful streaming via “local” aggregations - cons ● Required us to manage the state on our own ● Error-prone ○ E.g. what if my cluster is terminated and data on HDFS is lost? ● Complicates the code ○ Mixed input sources for the same app (Kafka + files) ● Possible performance impact ○ Might cause the Kafka consumer to lag ● Obviously not the perfect way (but that’s what we had…)
  20. 20. Structured Streaming - to the rescue? Spark 2.0 introduced Structured Streaming ● Enables running continuous, incremental processes ○ Basically manages the state for you ● Built on Spark SQL ○ DataFrame/Dataset API ○ Catalyst Optimizer ● Allows handling event-time and late data ● Ensures end-to-end exactly-once fault-tolerance ● Was in ALPHA mode in 2.0 and 2.1
  21. 21. Structured Streaming - basic concepts
  22. 22. Structured Streaming - WordCount example
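The WordCount example on this slide was shown as an image. As a rough stand-in, here is a plain-Python analogue of the idea (illustrative only, not the Spark API): each micro-batch of input lines incrementally updates a running "Result Table" of word counts, as Structured Streaming would do for this query in Complete output mode.

```python
from collections import Counter

# Running "Result Table" of word counts, updated incrementally.
result_table: Counter = Counter()

def process_micro_batch(lines: list) -> Counter:
    # Split each incoming line into words and fold the counts into the
    # running Result Table, mimicking an incremental aggregation.
    for line in lines:
        result_table.update(line.split())
    return result_table

process_micro_batch(["apache spark", "apache kafka"])
process_micro_batch(["spark streaming"])
```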
  23. 23. Structured Streaming - stateful app use-case
  24. 24. Structured Streaming in production So we started moving to Structured Streaming:
      Use case | Previous architecture | Old flow | New architecture | New flow
      Existing Spark app | Periodic Spark batch job | Read Parquet from S3 -> Transform -> Write Parquet to S3 | Stateless Structured Streaming | Read from Kafka -> Transform -> Write Parquet to S3
      Existing Java app | Periodic standalone Java process (“manual” scaling) | Read CSV -> Transform and aggregate -> Write to RDBMS | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
      New app | N/A | N/A | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
  25. 25. Structured Streaming - known issues & tips ● 3 major issues we had in 2.1.0 (solved in 2.1.1): ○ ○ ○ ● Using EMRFS consistent view when checkpointing to S3 ○ Recommended for stateless apps ○ For stateful apps, we encountered sporadic issues possibly related to the metadata store (i.e. DynamoDB)
  26. 26. Structured Streaming - strengths and weaknesses (IMO) ● Strengths: ○ Running incremental, continuous processing ○ End-to-end exactly-once fault-tolerance (if you implement it correctly) ○ Increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation) ○ Massive efforts are invested in it ● Weaknesses: ○ Maturity ○ Inability to perform multiple actions on the exact same Dataset (by-design?) ■ Seems to be resolved in the upcoming Spark 2.4 (but then you get at-least-once)
  27. 27. Back to the future - DStream revived for “stateful” app use-case
  28. 28. So what’s wrong with our DStream to Druid application?
  29. 29. So what’s wrong with our DStream to Druid application? ● Kafka needs to read 300M messages from disk ● ConcurrentModificationException when working with Spark Streaming on Kafka 0.10 ○ Forced us to use 1 core per executor to avoid it ○ Supposedly solved in Spark 2.4.0 (possibly solving a related issue as well) ● We wish we could run it less frequently
  30. 30. Enter “streaming” over RDR
  31. 31. What is RDR? RDR (Raw Data Repository) is our Data Lake ● Messages from Kafka topics are saved to S3 as Parquet files ● RDR Loaders - Spark Streaming applications ● Applications can read from RDR and run analytics on the data Can we leverage our Data Lake as the data source instead of Kafka?
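The "streaming over RDR" idea can be sketched in plain Python, with a dict standing in for S3 and a list standing in for the Kafka topic that carries file paths. All names here are hypothetical, and this is a conceptual model rather than the actual infrastructure:

```python
# Conceptual sketch of "streaming" over a Data Lake: the RDR loader writes a
# Parquet file to S3 and publishes only its *path* to a Kafka topic; a
# periodic Spark batch job then consumes the tiny path messages and reads
# the actual data from S3.

s3: dict = {}           # stand-in for S3: path -> file contents
paths_topic: list = []  # stand-in for the Kafka topic carrying S3 paths

def rdr_loader(path: str, records: list) -> None:
    s3[path] = records          # write the file to the "Data Lake"
    paths_topic.append(path)    # publish only the path to "Kafka"

def batch_job(offset: int):
    # Read the paths published since the last committed offset,
    # then load the referenced files from the "Data Lake".
    new_paths = paths_topic[offset:]
    records = [r for p in new_paths for r in s3[p]]
    return records, len(paths_topic)

rdr_loader("s3://rdr/topic/part-0.parquet", [1, 2])
rdr_loader("s3://rdr/topic/part-1.parquet", [3])
records, offset = batch_job(0)
```

Because the batch job tracks only an offset into the paths topic, it can run as often (or as rarely) as desired, which is exactly the flexibility described on the following slides.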
  32. 32. The Idea of How to Stream RDR files
  33. 33. How do we “stream” RDR Files
  34. 34. How do we use the new RDR “streaming” infrastructure?
  35. 35. The Scheduler - Apache Airflow ● “A platform to programmatically author, schedule and monitor workflows” ● Developed by Airbnb, now part of the Apache Incubator
  36. 36. Did we solve the problems? No longer a streaming application - no longer an idle cluster:
      Name | Day 1 | Day 2 | Day 3
      Old App to Druid | 1007.68$ | 1007.68$ | 1007.68$
      New App to Druid | 150.08$ | 198.73$ | 174.68$
  37. 37. Did we solve the problems? ● Still reads old messages from Kafka disk, but instead of 300M messages we just read 1K path messages per hour ● Doesn’t depend on the integration of Spark Streaming with Kafka - no more weird Kafka exceptions ● We can run the Spark batch application as (in)frequently as we’d like
  38. 38. Summary ● Started with Spark Streaming for “stateless” use-cases ○ Replaced CSV files with Kafka (de facto standard in the industry) ○ Already had Spark batch in production (Spark as a unified framework) ● Tried Spark Streaming for stateful use-cases (via “local” aggregations) ○ Not the optimal solution ● Moved to Structured Streaming (for all use-cases) ○ Pros include: ■ Enables running continuous, incremental processes ■ Built on Spark SQL ○ Cons include: ■ Maturity ■ Inability to perform multiple actions on the exact same Dataset (by-design?)
  39. 39. Summary - cont. ● Moved (back) to Spark Streaming ○ Aggregations are done per micro-batch (in Spark) and daily (in Druid) ○ Still not perfect ■ Performance penalty in Kafka for long micro-batches ■ Concurrency issue with the Kafka 0.10 consumer in Spark ■ Under-utilized Spark clusters ● Introduced “streaming” over our Data Lake ○ Spark Streaming apps (a.k.a. “RDR loaders”) write files to S3 and paths to Kafka ○ Spark batch apps read S3 paths from Kafka (and the actual files from S3) ■ Airflow for scheduling and monitoring ■ Meant for apps that don’t require real time ○ Pros: ■ Eliminated the performance penalty we had in Kafka ■ Spark clusters are much better utilized
  40. 40. QUESTIONS? Join us! We’re hiring: Big Data Group Leader, Big Data Team Leader, and more...
  41. 41. THANK YOU!
  42. 42. Structured Streaming - additional slides
  43. 43. Structured Streaming - basic concepts
  44. 44. Structured Streaming - basic terms ● Input sources: ○ File ○ Kafka ○ Socket, Rate (for testing) ● Output modes: ○ Append (default) ○ Complete ○ Update (added in Spark 2.1.1) ○ Different types of queries support different output modes ■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all unaggregated data in the Result Table ● Output sinks: ○ File ○ Kafka (added in Spark 2.2.0) ○ Foreach ○ Console, Memory (for debugging) ○ Different types of sinks support different output modes
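The difference between the Complete and Update output modes for an aggregation query can be illustrated with a toy Python model (this only mimics the semantics; it is not the Spark API, and `trigger` is a made-up name):

```python
from collections import Counter

# Toy model of output modes for an aggregated word-count query:
# Complete re-emits the entire Result Table on every trigger, while
# Update emits only the rows that changed in the current micro-batch.

result_table: Counter = Counter()

def trigger(batch_words: list, mode: str) -> dict:
    changed = Counter(batch_words)
    result_table.update(changed)
    if mode == "complete":
        return dict(result_table)                      # entire Result Table
    if mode == "update":
        return {w: result_table[w] for w in changed}   # changed rows only
    raise ValueError(mode)

trigger(["a", "b"], "update")   # -> {'a': 1, 'b': 1}
out = trigger(["a"], "update")  # -> {'a': 2}, row 'b' not re-emitted
```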
  45. 45. Handling event-time and late data
  48. 48. Handling event-time and late data

      // Group the data by window and word and compute the count of each group
      val windowedCounts = words.groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word"
      ).count()

      // The same query with a watermark, dropping data more than 10 minutes late
      val windowedCounts = words
        .withWatermark("timestamp", "10 minutes")
        .groupBy(
          window($"timestamp", "10 minutes", "5 minutes"),
          $"word")
        .count()
  49. 49. Fault tolerance ● The goal - end-to-end exactly-once semantics ● The means: ○ Trackable sources (i.e. offsets) ○ Checkpointing ○ Idempotent sinks

      aggDF
        .writeStream
        .outputMode("complete")
        .option("checkpointLocation", "path/to/HDFS/dir")
        .format("memory")
        .start()
  50. 50. Monitoring ● Interactive APIs: ○ streamingQuery.lastProgress()/status() ○ Output example ● Asynchronous API:

      val spark: SparkSession = ...
      spark.streams.addListener(new StreamingQueryListener() {
        override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
          println("Query started: " + queryStarted.id)
        }
        override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
          println("Query terminated: " + queryTerminated.id)
        }
        override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
          println("Query made progress: " + queryProgress.progress)
        }
      })