Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka


At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.

In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth. We will share our experience as early adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs. Topics include:

● Kafka and Spark Streaming for stateless and stateful use-cases
● Spark Structured Streaming as a possible alternative
● Combining Spark Streaming with batch ETLs
● “Streaming” over a Data Lake using Kafka

1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics

2. Stream, Stream, Stream - Different Streaming Methods with Spark and Kafka
Itai Yaffe, Nielsen
#UnifiedDataAnalytics #SparkAISummit

3. Introduction
Itai Yaffe
● Tech Lead, Big Data group
● Dealing with Big Data challenges since 2012

4. Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Working with Spark? Planning to?
● Working with Kafka? Planning to?

5. Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
● Data flow - past and present
● Spark Streaming
○ “Stateless” and “stateful” use-cases
● Spark Structured Streaming
● “Streaming” over our Data Lake

6. Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A data company
● Machine learning models for insights
● Targeting
● Business decisions

7. Nielsen Marketing Cloud - questions we try to answer
1. How many unique users of a certain profile can we reach?
   E.g. a campaign for young women who love tech
2. How many impressions did a campaign receive?

8. Nielsen Marketing Cloud - high-level architecture

9. Data flow in the old days…
(diagram: in-DB aggregation → OLAP)

10. Data flow in the old days… What’s wrong with that?
● CSV-related issues, e.g.:
○ Truncated lines in input files
○ Can’t enforce schema
● Scale-related issues, e.g.:
○ Had to “manually” scale the processes

11. That's one small step for [a] man… (2014)
“Apache Spark is the Taylor Swift of big data software” (Derrick Harris, Fortune.com, 2015)
(diagram: in-DB aggregation → OLAP)

12. Why just a small step?
● Solved the scaling issues
● Still faced the CSV-related issues

13. Data flow - the modern way
(Photography copyright: NBC)

14. Spark Streaming - “stateless” app use-case (2015)
(diagram: read messages → in-DB aggregation → OLAP)

15. The need for stateful streaming
Fast forward a few months…
● New requirements were being raised
● Specific use-case:
○ To take the load off the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark Streaming app

16. Stateful streaming via “local” aggregations
1. Read messages
2. Aggregate current micro-batch
3. Write combined aggregated data
4. Read aggregated data of previous micro-batches from HDFS
5. Upsert aggregated data into the OLAP store (every X micro-batches)

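The “local” aggregation loop above can be sketched in plain Scala. This is a simplified illustration, not the production code: per-key counts stand in for the real aggregates, and an in-memory map stands in for the state persisted to HDFS.

```scala
// Simplified sketch of "local" aggregation across micro-batches:
// each micro-batch is aggregated on its own, then merged with the
// aggregated state of previous micro-batches.

// Step 2: aggregate the current micro-batch (count events per key)
def aggregateBatch(messages: Seq[String]): Map[String, Long] =
  messages.groupBy(identity).map { case (k, v) => (k, v.size.toLong) }

// Steps 3-4: combine with the aggregated data of previous micro-batches
def mergeState(prev: Map[String, Long], cur: Map[String, Long]): Map[String, Long] =
  cur.foldLeft(prev) { case (acc, (k, n)) => acc + (k -> (acc.getOrElse(k, 0L) + n)) }

val batch1 = Seq("u1", "u2", "u1")
val batch2 = Seq("u2", "u3")
val state  = mergeState(mergeState(Map.empty, aggregateBatch(batch1)), aggregateBatch(batch2))
// state now holds the combined aggregates; in the real pipeline it is
// upserted to the OLAP store every X micro-batches
```

Even in this toy form, the pain points of the next slide are visible: the merge logic, the persistence of `state`, and failure recovery are all on the application's shoulders.
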
17. Stateful streaming via “local” aggregations
● Required us to manage the state on our own
● Error-prone
○ E.g. what if my cluster is terminated and data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag

18. Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
● Enables running continuous, incremental processes
○ Basically manages the state for you
● Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
● Many other features
● Was in ALPHA mode in 2.0 and 2.1

19. Structured Streaming - stateful app use-case
1. Read messages
2. Aggregate current window
3. Checkpoint (state and offsets) handled internally by Spark
4. Upsert aggregated data into the OLAP store (on window end)

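Step 2's windowed aggregation boils down to bucketing events by timestamp. A minimal pure-Scala sketch (hypothetical 10-minute windows, epoch-second timestamps, counts instead of the real aggregates) of what Spark computes for a `groupBy(window(...), key)` query:

```scala
// Assign each event to a fixed window by flooring its timestamp,
// then aggregate per (window, key).

val windowSizeSec = 600L  // assumed 10-minute windows

def windowStart(ts: Long): Long = ts - (ts % windowSizeSec)

def aggregateByWindow(events: Seq[(Long, String)]): Map[(Long, String), Long] =
  events.groupBy { case (ts, key) => (windowStart(ts), key) }
        .map { case (wk, evs) => (wk, evs.size.toLong) }

val events = Seq((0L, "u1"), (30L, "u1"), (700L, "u1"))
val agg = aggregateByWindow(events)
// two windows for key "u1": [0, 600) with 2 events, [600, 1200) with 1
```

The difference from the previous slide is who owns this state: Structured Streaming checkpoints the per-window aggregates and offsets internally, so the application only declares the query.
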
20. Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1):
○ https://issues.apache.org/jira/browse/SPARK-19517
○ https://issues.apache.org/jira/browse/SPARK-19677
○ https://issues.apache.org/jira/browse/SPARK-19407
● Checkpointing to S3 wasn’t straightforward
○ Tried using EMRFS consistent view
■ Worked for stateless apps
■ Encountered sporadic issues for stateful apps

21. Structured Streaming - strengths and weaknesses (IMO)
● Strengths include:
○ Running incremental, continuous processing
○ Increased performance (e.g. via the Catalyst SQL optimizer)
○ Massive efforts are invested in it
● Weaknesses were mostly related to maturity

22. Back to the future - Spark Streaming revived for “stateful” app use-case
1. Read messages
2. Aggregate current micro-batch
3. Write files
4. Load data into the OLAP store

23. Cool, so… why can’t we stop here?
● Significantly underutilized cluster resources = wasted $$$

24. Cool, so… why can’t we stop here? (cont.)
● Extreme load on the Kafka brokers’ disks
○ Each micro-batch needs to read ~300M messages, and Kafka can’t store it all in memory
● ConcurrentModificationException when using the Spark Streaming + Kafka 0.10 integration
○ Forced us to use 1 core per executor to avoid it
○ https://issues.apache.org/jira/browse/SPARK-19185, supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it even less frequently
○ Remember - longer micro-batches result in a better aggregation ratio

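The last bullet - longer micro-batches give a better aggregation ratio - can be shown with a toy calculation (synthetic numbers, not production figures): the same stream aggregated in fewer, longer batches emits fewer output rows per input message, because repeated keys collapse within a batch.

```scala
// Toy illustration of aggregation ratio vs. batch length: count the
// output rows produced when a keyed stream is aggregated per batch.
def outputRows(stream: Seq[String], batchSize: Int): Int =
  stream.grouped(batchSize).map(_.distinct.size).sum

val stream = Seq("a", "b", "a", "b", "a", "b", "a", "b")
val shortBatches = outputRows(stream, 2)  // 4 batches, repeats survive across batches
val oneLongBatch = outputRows(stream, 8)  // 1 batch, repeats collapse
```
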
25. Introducing RDR
RDR (or Raw Data Repository) is our Data Lake
● Kafka topic messages are stored on S3 in Parquet format, partitioned by date (e.g. date=2019-10-17)
● RDR Loaders - stateless Spark Streaming applications
● Applications can read data from RDR for various use-cases
○ E.g. analyzing the data of the last 1 day or 30 days
Can we leverage our Data Lake and use it as the data source (instead of Kafka)?

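With this layout, reading “the last N days” from RDR is just enumerating date partitions under a topic's S3 prefix. A sketch (the bucket name and prefix scheme are made up for illustration):

```scala
import java.time.LocalDate

// Build the S3 partition paths a batch job would read for the last
// N days of a topic (hypothetical bucket/prefix layout).
def partitionPaths(topic: String, endDate: LocalDate, days: Int): Seq[String] =
  (0 until days).map(d => s"s3://rdr-bucket/$topic/date=${endDate.minusDays(d.toLong)}")

val paths = partitionPaths("impressions", LocalDate.parse("2019-10-16"), 3)
// yields the date=2019-10-16, date=2019-10-15, date=2019-10-14 partitions
```
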
26. Potentially yes…
(diagram: 1. read RDR files from the last day (date=2019-10-14, date=2019-10-15, date=2019-10-16) from S3; 2. process files)

27. … but
● This ignores late-arriving events

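Why a plain daily read over date partitions misses late arrivals: an event lands in the partition of the day it *arrived*, so a job that reads only yesterday's partition never sees events that occurred yesterday but arrived today. A toy model (field names are illustrative):

```scala
// Toy model: RDR partitions are keyed by arrival date. A daily batch
// that reads only one day's partition misses events that *occurred*
// that day but *arrived* later (late-arriving events).
case class Event(eventDate: String, arrivalDate: String)

val events = Seq(
  Event("2019-10-15", "2019-10-15"),  // on time
  Event("2019-10-15", "2019-10-16")   // late: arrived a day later
)

// Daily job for 2019-10-15, reading only that day's partition:
val seenByDailyJob = events.filter(_.arrivalDate == "2019-10-15")
val missed = events.filter(e => e.eventDate == "2019-10-15" && e.arrivalDate != "2019-10-15")
```
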
28. Enter “streaming” over RDR

29. How do we “stream” RDR files - producer side
1. Read messages
2. Write files to S3 (RDR)
3. Write the files’ paths to dedicated Kafka topics (files’ paths as messages)

30. How do we “stream” RDR files - consumer side
1. Read files’ paths from Kafka
2. Read the RDR files from S3
3. Process files

31. How do we “stream” RDR files - producer & consumers
1. Read messages (topic with raw data)
2. Write files to S3 (RDR)
3. Write the files’ paths (topic with files’ paths)
4. Read files’ paths
5. Read the RDR files
6. Process files

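The whole pattern fits in a few lines of plain Scala, with in-memory stand-ins for S3 and for the paths topic (everything here is illustrative): the RDR Loader writes a file and publishes only its path; a consumer reads the paths, then fetches and processes the actual files.

```scala
import scala.collection.mutable

// In-memory stand-ins for S3 (path -> file contents) and for the
// Kafka topic that carries files' paths.
val s3         = mutable.Map[String, Seq[String]]()
val pathsTopic = mutable.Queue[String]()

// Producer side (RDR Loader): write the file, then publish its path.
def loaderWrite(path: String, records: Seq[String]): Unit = {
  s3(path) = records        // 2. write file
  pathsTopic.enqueue(path)  // 3. write the file's path
}

// Consumer side: read paths, then read and process the actual files.
def consumeAll(): Seq[String] =
  pathsTopic.dequeueAll(_ => true).flatMap(p => s3(p))

loaderWrite("date=2019-10-16/part-0.parquet", Seq("e1", "e2"))
loaderWrite("date=2019-10-16/part-1.parquet", Seq("e3"))
val processed = consumeAll()
```

The key property is that the paths topic is tiny (one message per file instead of one per event), which is what relieves the Kafka brokers in the slides that follow.
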
32. How do we use the new RDR “streaming” infrastructure?
1. Read files’ paths
2. Read the RDR files
3. Aggregate the current batch and write files
4. Load data into the OLAP store

33. Did we solve the aforementioned problems?
● EMR clusters are now transient - no more idle clusters
(chart: daily cluster usage over Day 1-Day 3, showing an 80% reduction)

34. Did we solve the aforementioned problems? (cont.)
● No more extreme load on the Kafka brokers’ disks
○ We still read old messages from Kafka, but now we only read about 1K messages per hour (rather than ~300M)
● The new infra doesn’t depend on the integration of Spark Streaming with Kafka
○ No more weird exceptions…
● We can run the Spark batch applications as (in)frequently as we’d like
● Built-in handling of late-arriving events

35. Summary
● Initially replaced standalone Java with Spark & Scala
○ Still faced CSV-related issues
● Introduced Spark Streaming & Kafka for “stateless” use-cases
○ Quickly needed to handle stateful use-cases as well
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Required us to manage the state on our own
● Moved to Structured Streaming (for all use-cases)
○ Cons were mostly related to maturity

36. Summary (cont.)
● Went back to Spark Streaming (with Druid as OLAP)
○ Performance penalty in Kafka for long micro-batches
○ Under-utilized Spark clusters
○ Etc.
● Introduced “streaming” over our Data Lake
○ Eliminated the Kafka performance penalty
○ Spark clusters are much better utilized = $$$ saved
○ And more…

37. Want to know more?
● Women in Big Data
○ A world-wide program that aims:
■ To inspire, connect, grow, and champion success of women in the Big Data & analytics field
■ To grow women's representation in the Big Data field to over 25% by 2020
○ Over 20 chapters and 14,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you - https://www.womeninbigdata.org/wibd-structure/
● Counting Unique Users in Real-Time: Here's a Challenge for You!
○ Big Data LDN, November 13th 2019, https://tinyurl.com/y5ffvlqk
● NMC Tech Blog - https://medium.com/nmc-techblog

38. QUESTIONS

39. THANK YOU
Itai Yaffe

40. DON’T FORGET TO RATE AND REVIEW THE SESSIONS - SEARCH SPARK + AI SUMMIT

41. Structured Streaming - additional slides

42. Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
A data stream is treated as an unbounded table: new data in the stream = new rows appended to the unbounded table.

43. Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts

44. Structured Streaming - WordCount example
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts

45. Structured Streaming - basic terms
● Input sources:
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes:
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes
■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all unaggregated data in the Result Table
● Output sinks:
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes

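The difference between Complete and Update can be shown with a toy running count (a conceptual simulation in plain Scala, not the Spark API): after each trigger, Complete emits the entire result table, while Update emits only the rows that changed in that trigger.

```scala
// Toy simulation of Complete vs. Update output modes for a running
// word count: per trigger, record what each mode would emit.
def runBatches(batches: Seq[Seq[String]]): (Seq[Map[String, Int]], Seq[Map[String, Int]]) = {
  var state = Map[String, Int]()
  val complete = Seq.newBuilder[Map[String, Int]]
  val update   = Seq.newBuilder[Map[String, Int]]
  for (batch <- batches) {
    val changed = batch.groupBy(identity).map { case (w, ws) =>
      (w, state.getOrElse(w, 0) + ws.size)
    }
    state = state ++ changed
    complete += state   // Complete: emit the whole result table
    update   += changed // Update: emit only the rows changed in this trigger
  }
  (complete.result(), update.result())
}

val (complete, update) = runBatches(Seq(Seq("cat", "dog"), Seq("dog")))
// after the 2nd trigger, Complete re-emits both counts; Update emits only "dog"
```
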
46. Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means:
○ Trackable sources (i.e. offsets)
○ Checkpointing
○ Idempotent sinks

aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()

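The sink side of exactly-once relies on idempotence: replaying a micro-batch after a failure must not double-count. A minimal sketch of an idempotent keyed upsert (illustrative, in-memory; real sinks key on something like a batch ID in the target store):

```scala
import scala.collection.mutable

// Idempotent sink sketch: results are upserted by (batchId, key), so
// replaying the same batch after recovery overwrites rather than
// double-counts.
val sink = mutable.Map[(Long, String), Long]()

def writeBatch(batchId: Long, rows: Map[String, Long]): Unit =
  rows.foreach { case (k, v) => sink((batchId, k)) = v }

writeBatch(7L, Map("u1" -> 10L))
writeBatch(7L, Map("u1" -> 10L)) // replay after failure: no duplication
```
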
47. Monitoring
● Interactive APIs:
○ streamingQuery.lastProgress()/status()
○ Output example
● Asynchronous API:

val spark: SparkSession = ...
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
    println("Query started: " + queryStarted.id)
  }
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + queryTerminated.id)
  }
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }
})

48. Structured Streaming in production
So we started moving to Structured Streaming:

Use case           | Previous architecture                               | Old flow                                                  | New architecture               | New flow
Existing Spark app | Periodic Spark batch job                            | Read Parquet from S3 -> Transform -> Write Parquet to S3  | Stateless Structured Streaming | Read from Kafka -> Transform -> Write Parquet to S3
Existing Java app  | Periodic standalone Java process (“manual” scaling) | Read CSV -> Transform and aggregate -> Write to RDBMS     | Stateful Structured Streaming  | Read from Kafka -> Transform and aggregate -> Write to RDBMS
New app            | N/A                                                 | N/A                                                       | Stateful Structured Streaming  | Read from Kafka -> Transform and aggregate -> Write to RDBMS
