
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join datasets for Personalization"


Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered, and written out in real time with an ease matching that of batch processing. However, the real challenge in matching the prowess of batch ETL lies in doing joins, maintaining state, and pausing or resting the data dynamically.

Netflix has a microservices architecture. Different microservices serve and record different kinds of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when we want to combine the events coming from one high-traffic microservice with those from another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations.

Historically, we have joined these large-volume datasets in batch. However, we asked ourselves: if the data is being generated in real time, why must it not be processed downstream in real time? Why wait a full day to get information from an event that was generated a few minutes ago? In this talk, we will share how we solved a complex join of two high-volume event streams using Flink. We will talk about maintaining large state, fault tolerance of a stateful application, and strategies for failure recovery.

Published in: Technology


  1. Shriya Arora, Data Engineering & Infrastructure: Taming large state for Personalization
  2. What is this talk about?
     ● Deriving signal from high-volume real-time events
     ● Using Flink state management to achieve a real-time join
     ● Operations and path to production
     ● Challenges and learnings
  3. What are Merched Impressions?
  4. What is a Play-Impression Take-rate?
     ● Number of merched impressions per user play
     ● Attributes of impressions leading to the play
     ● Attributes of the play coming from different impressions
  5. What do we use it for?
     ● Ranking Videos
     ● Targeting and Reach
     ● Content Promotion
     ● Asset Personalization
  6. Volume and Scale:
     ● 130M members
     ● ~10B Impressions
     ● ~2.5B Play Events
     ● 140M Play hours/day
  7. Why do we need a streaming solution for take-rate?
     ● Model training on fresher data
        ○ Reduce time delay between event generation and signal
        ○ Faster feedback around launches
        ○ Event relevance is temporal in nature
     ● Long turnaround time on error correction
        ○ Long-running batch jobs have all-or-none failure modes
        ○ Lack of real-time auditing delays error detection
  8. What are the challenges we will need to solve?
     ● High-volume input streams
     ● Out-of-order and late-arriving events
     ● Large state
        ○ ~1 TB of state per region
  9. Approaches:
     #1 Window Joins
        ○ Events are delayed independently of each other
     #2 Aggregation over Windows followed by Join
        ○ Streams can be reduced as they are held in state
  10. Approaches:
     #3 CoProcess Function with a single MapState
        ○ High variance in stream volumes and logic
     #4 CoProcess Function with two ValueStates
        ○ Each stream gets its own ValueState
  11. A tale of two states
     ● CoProcess Function
     ● Save each keyed stream into its own ValueState
     ● For each event in a stream, reduce state on duplicates
     ● For each event in either stream, cross-query across states
     ● Use timerService to expire events from state
  12. Data Flow Architecture (diagram; labels: Play stream, Impressions stream, keyBy, Co-process Fn, State 1, State 2, F(x) + Ts, F(y) + Ts, Output)
  13. Anatomy of CoProcess Function
     def processElement1(value: T, ctx: Context, ..)
        Access elements of the first stream, update and reduce state, look up state 2 for out-of-order joins, apply timer
     def processElement2(value: K, ctx: Context, ..)
        Access elements of the second stream, look up and join to state 1, apply timer
     def onTimer(ts: Long, ...)
        Clear up state based on event time ts
  14. State management
  15. A tale of two states
  16. Challenges with Operations
     ● Visibility into application event-time progression
        ○ Flink UI bug: FLINK-8949
  17. Challenges with Operations (cont.)
     ● Visibility into state size
        ○ RocksDB statistics have to be logged manually
  18. Future Work
     ● State migration
     ● Data restatement and recovery
  19. Questions? Follow us! @netflixdata @shriyarora
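
The "tale of two states" design on slides 11 to 13 can be illustrated with a small plain-Python simulation. This is only a sketch under stated assumptions, not the code from the talk and not the Flink API: the class `TwoStateJoin`, its method names, the `ttl_ms` retention parameter, and the choice of impressions as the first stream and plays as the second are all hypothetical. Dictionaries stand in for the two per-key ValueStates, and `on_watermark` stands in for Flink's timer service firing event-time timers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class KeyedState:
    """Per-key state: one container per input stream, plus pending timers."""
    impressions: Dict[str, int] = field(default_factory=dict)  # impression_id -> event ts
    plays: Dict[str, int] = field(default_factory=dict)        # play_id -> event ts
    timers: Set[int] = field(default_factory=set)              # registered event-time timers


class TwoStateJoin:
    """Hypothetical stand-in for a keyed co-process function with two states."""

    def __init__(self, ttl_ms: int) -> None:
        self.ttl_ms = ttl_ms  # assumed retention period for events held in state
        self.state: Dict[str, KeyedState] = {}
        self.output: List[Tuple[str, str, str]] = []  # (key, impression_id, play_id)

    def _state_for(self, key: str) -> KeyedState:
        return self.state.setdefault(key, KeyedState())

    def process_element1(self, key: str, impression_id: str, ts: int) -> None:
        """First stream (assumed: impressions)."""
        s = self._state_for(key)
        is_duplicate = impression_id in s.impressions
        # Reduce on duplicates: keep one entry per impression with its latest timestamp.
        s.impressions[impression_id] = max(ts, s.impressions.get(impression_id, ts))
        if not is_duplicate:
            # Cross-query the other state so plays that arrived earlier still join.
            for play_id in s.plays:
                self.output.append((key, impression_id, play_id))
        s.timers.add(ts + self.ttl_ms)

    def process_element2(self, key: str, play_id: str, ts: int) -> None:
        """Second stream (assumed: plays)."""
        s = self._state_for(key)
        if play_id in s.plays:
            return  # duplicate play, already joined
        s.plays[play_id] = ts
        for impression_id in s.impressions:
            self.output.append((key, impression_id, play_id))
        s.timers.add(ts + self.ttl_ms)

    def on_timer(self, key: str, timer_ts: int) -> None:
        """Drop every event whose event time is at or before timer_ts - ttl_ms."""
        s = self.state[key]
        cutoff = timer_ts - self.ttl_ms
        s.impressions = {i: t for i, t in s.impressions.items() if t > cutoff}
        s.plays = {p: t for p, t in s.plays.items() if t > cutoff}

    def on_watermark(self, watermark: int) -> None:
        """Fire all timers at or below the watermark, then drop empty keys."""
        for key, s in list(self.state.items()):
            due = {t for t in s.timers if t <= watermark}
            for timer_ts in due:
                self.on_timer(key, timer_ts)
            s.timers -= due
            if not (s.impressions or s.plays or s.timers):
                del self.state[key]
```

In this sketch each (impression, play) pair is emitted once, when the later-arriving of the two events is processed, so out-of-order arrival in either direction still produces the join. Whether the real job restricts joins by event-time ordering, or how long it retains state, is not stated on the slides.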