Streaming datasets for personalization

Streaming applications have historically been complex to design and implement because of the significant infrastructure investment they require. However, recent active development across streaming platforms has made the transition to stream processing much easier, enabling analytics applications and experiments to consume near-real-time data without massive development cycles. In this session, we present our experience with stream processing of unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that serve as feedback for all personalization algorithms. When ultimately consumed by our machine learning models, these datasets directly affect the customer’s personalized experience. We’ll talk about the experiments we ran to compare Apache Spark and Apache Flink, and the challenges we faced.

Notes
  • Total number of processed events is much higher (~1.5T) because of duplicates and redundancy; thousands of shows in every country, ma…
  • Post-processing required
Transcript

    1. Shriya Arora: Streaming datasets for Personalization
    2. What is Netflix’s Mission? Entertainment by allowing you to stream content anywhere, anytime
    3. What is Netflix’s Mission? Entertainment by allowing you to stream personalized content anywhere, anytime
    4. How much data do we process to have a personalized Netflix for everyone?
       ● 125M hours/day
       ● 86M active members
       ● 450B unique events/day
       ● 600+ Kafka topics
    5. Data infrastructure (diagram): the Keystone Ingestion Pipeline feeds both stream processing (Spark, Flink, …) and raw data storage (S3/HDFS); raw data is consumed by batch processing (Spark/Pig/Hive/MR), and both paths produce processed data (tables/indexers) that is served to application instances.
    6. What do we solve with streaming that we can’t solve with batch ETL?
       ● Business wins
         ○ Algorithms become more dynamic/responsive
         ○ Enables research by reducing the time delay between event generation and consumption
         ○ Creates opportunity for new types of algorithms
       ● Technical wins
         ○ Fewer moving parts means fewer places for error
         ○ Save on storage costs
         ○ Avoid long-running jobs
           ■ Reduces processing resources
           ■ Shortens turnaround times
    7. Picking a stream processing engine? Things to consider:
       ● Problem scope/requirements
         ○ Event-based pure streaming or micro-batches?
         ○ Do you want to implement Lambda?
       ● Existing internal technologies
         ○ Streaming infrastructure: what are other teams using?
         ○ ETL ecosystem: what about teams that don’t consume out of Kafka?
       ● What’s your team’s learning curve?
         ○ Do you know Spark?
         ○ Do you know Scala?
    8. Getting started with Spark Streaming: micro-batches
       ● Data is received in DStreams, which are easily converted to RDDs
       ● Supports all fundamental RDD operations like map/flatMap/reduce/reduceByKey
       ● Basic time-based windowing
       ● Checkpointing support for resilience to failures
    9. Writing a basic Spark Streaming app
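    The transcript does not capture the code from this slide. Below is a minimal sketch of such an app, assuming a hypothetical Kafka topic "playback-events", the spark-streaming-kafka-0-10 connector, and illustrative broker/group settings; it is not the code shown in the talk.

        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.kafka010._

        object PlaybackEventCounter {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("PlaybackEventCounter")
            // Micro-batch interval: slide 10 calls this the most important parameter
            val ssc = new StreamingContext(conf, Seconds(30))
            ssc.checkpoint("hdfs:///tmp/playback-checkpoints") // resilience to failures

            val kafkaParams = Map[String, Object](
              "bootstrap.servers"  -> "kafka:9092", // hypothetical broker
              "key.deserializer"   -> classOf[StringDeserializer],
              "value.deserializer" -> classOf[StringDeserializer],
              "group.id"           -> "playback-counter"
            )

            // Direct stream: one DStream partition per Kafka partition
            val stream = KafkaUtils.createDirectStream[String, String](
              ssc,
              LocationStrategies.PreferConsistent,
              ConsumerStrategies.Subscribe[String, String](Seq("playback-events"), kafkaParams)
            )

            // Fundamental RDD-style operations run on each micro-batch
            stream
              .map(record => (record.value, 1L))
              .reduceByKey(_ + _)
              .print()

            ssc.start()
            ssc.awaitTermination()
          }
        }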
    10. Performance tuning your Spark Streaming application
        ● Choice of micro-batch interval
          ○ The most important parameter
        ● Cluster memory
          ○ Large batch intervals need more memory
        ● Parallelism
          ○ DStreams are naturally partitioned to match Kafka partitions
          ○ Repartitioning can increase parallelism at the cost of a shuffle
        ● Number of CPUs
          ○ <= the number of tasks
          ○ Depends on how computationally intensive your processing is
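    A sketch of how these knobs surface in code; every value here is illustrative, not a recommendation from the talk:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val conf = new SparkConf()
          .setAppName("TunedStreamingApp")
          .set("spark.executor.memory", "8g") // larger batch intervals buffer more data in memory
          .set("spark.executor.cores", "4")   // keep total cores <= tasks per batch

        // The micro-batch interval is the single most important knob
        val ssc = new StreamingContext(conf, Seconds(10))

        // DStreams inherit Kafka's partitioning; repartition() buys extra
        // parallelism downstream at the cost of a network shuffle:
        //   stream.repartition(48)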
    11. Getting started with Flink
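    Slide 11’s content is not in the transcript. For comparison with the Spark sketch above, here is a minimal event-at-a-time Flink job under the same hypothetical topic and settings; connector class and package names vary across Flink versions:

        import java.util.Properties
        import org.apache.flink.api.common.serialization.SimpleStringSchema
        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.windowing.time.Time
        import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

        object PlaybackEventJob {
          def main(args: Array[String]): Unit = {
            val env = StreamExecutionEnvironment.getExecutionEnvironment

            val props = new Properties()
            props.setProperty("bootstrap.servers", "kafka:9092") // hypothetical broker
            props.setProperty("group.id", "playback-counter")

            // Event-at-a-time source: no micro-batch interval to choose
            val events: DataStream[String] = env.addSource(
              new FlinkKafkaConsumer[String]("playback-events", new SimpleStringSchema(), props))

            // Count events per key over tumbling 30-second windows
            events
              .map(e => (e, 1L))
              .keyBy(_._1)
              .timeWindow(Time.seconds(30))
              .sum(1)
              .print()

            env.execute("PlaybackEventJob")
          }
        }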
    12. Performance tuning your Flink application (yet to be productionized)
        ● Persistent data storage for checkpointing
          ○ Fault-tolerant, highly available system
          ○ Supports high throughput for frequent state updates
        ● Parallelism
          ○ Optimized for the number of Kafka partitions
          ○ Optimal number of slots per CPU
        ● Size of cluster
          ○ A function of your incoming stream
          ○ What is your bottleneck: network, memory, or computation?
        ● Code optimization
          ○ Build an optimal DAG with the least network shuffle
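    As a sketch, the checkpointing and parallelism points above map onto the following Flink configuration calls; the HDFS path, interval, and parallelism are assumptions for illustration:

        import org.apache.flink.runtime.state.filesystem.FsStateBackend
        import org.apache.flink.streaming.api.CheckpointingMode
        import org.apache.flink.streaming.api.scala._

        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Checkpoint every 60s into a fault-tolerant, high-throughput store
        env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE)
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))

        // Align parallelism with the number of Kafka partitions, e.g. 24
        env.setParallelism(24)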
    13. Challenges with Spark
        ● Not a ‘pure’ event-streaming system
          ○ Minimum latency of one batch interval
          ○ Unintuitive for stream design
        ● Choice of batch interval is a little too critical
          ○ Everything can go wrong if you choose it wrong
          ○ Build-up of scheduling delay can lead to data loss
        ● Only time-based windowing
          ○ Cannot be used to solve session-stitching use cases or trigger-based event aggregations*
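    To make the session-stitching gap concrete: Flink can express it directly with session windows, which Spark Streaming’s purely time-based windows cannot. A sketch, assuming event-time timestamps and watermarks have already been assigned upstream:

        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
        import org.apache.flink.streaming.api.windowing.time.Time

        // events: (memberId, 1L) pairs; a session closes after 30 idle minutes
        def sessionCounts(events: DataStream[(String, Long)]): DataStream[(String, Long)] =
          events
            .keyBy(_._1)
            .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
            .sum(1)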
    14. Challenges with Flink
        ● Non-trivial to bring up a basic app; newer concepts to adjust to
          ○ Complex (though powerful) concepts like watermarking, checkpointing, and custom windows
        ● Insufficient monitoring and debugging tools
        ● Documentation is basic, and online community support is not yet as widespread
    15. Challenges with streaming
        ● Pioneer tax: batch.getInfrastructure >= streaming.getInfrastructure
          ○ Analytics has historically been batch; it is instinctively easier to formulate analytical problems in batch frameworks like MR, Pig, Hive, etc.
          ○ Deployments are non-trivial
        ● Moving towards unbounded === moving towards “on call”
          ○ Batch failures have to be addressed urgently; streaming failures have to be addressed immediately
        ● Streaming outages are more critical than batch outages
          ○ In batch, it is easy/cheap to recover from outages (as long as the data isn’t lost)
          ○ In streaming, data recovery (beyond the fault-tolerant limits of the system) can be exhausting
    16. Questions?
