Going Real-Time: Creating Frequently-Updating Datasets for Personalization: Spark Summit East talk by Shriya Arora

Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment they require. With the advent of Spark, however, an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles.

Our decision to evaluate Spark as our stream processing engine was driven primarily by the following considerations: 1) ease of development for the team (already familiar with Spark for batch), 2) the scope and requirements of our problem, 3) reusability of code from Spark batch jobs, and 4) Spark support from infrastructure teams within the company.

In this session, we will present our experience using Spark to stream-process unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors that are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means the impact is high and the tolerance for failure is low. We’ll talk about the experiments we ran to compare Spark with other streaming solutions such as Apache Flink, the impact we had on our customers, and, most importantly, the challenges we faced.

Take-aways for the audience:
1) A great example of stream processing large personalization datasets at scale.
2) An increased awareness of the costs and requirements of making the transition from batch to streaming successfully.
3) Exposure to some of the technical challenges that should be expected along the way.

  1. Shriya Arora, Senior Data Engineer, Personalization Analytics. Streaming datasets for Personalization
  2. What is Netflix’s Mission? Entertaining you by allowing you to stream content anywhere, anytime
  3. What is Netflix’s Mission? Entertaining you by allowing you to stream personalized content anywhere, anytime
  4. How much data do we process to have a personalized Netflix for everyone?
     ● 93M+ active members
     ● 125M hours/day
     ● 190 countries with unique catalogs
     ● 450B unique events/day
     ● 600+ Kafka topics
     Image credit: http://www.bigwisdom.net/
  5. Data Infrastructure: Keystone ingestion pipeline → raw data (S3/HDFS) → stream processing (Spark, Flink, …) and batch processing (Spark/Pig/Hive/MR) → processed data (tables/indexers) → application instances. (See Rohan’s talk tomorrow @ 12:20.)
  6. A user watches a video on Netflix, and the data flows through Netflix servers. Our problem: using user plays for feature generation, discovery, clustering, …
  7. Why have data later when you can have it now?
     ● Business wins
       ○ Algorithms can be trained with the latest and greatest data
       ○ Enhances research
       ○ Creates opportunity for new types of algorithms
     ● Technical wins
       ○ Save on storage costs
       ○ Avoid long-running jobs
  8. Source of Discovery / Source of Play
  9. Source-of-Discovery pipeline: playback sessions feed a Spark Streaming app, which enriches them against the discovery service REST API and video metadata, with a backup path. (An illustrative sketch of the enrichment step follows.)
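The slide names the enrichment step but does not show it. As a purely illustrative sketch, here is one common Spark Streaming pattern for calling a REST service per partition, in Scala. Everything specific is an assumption: the endpoint URL, the Apache HttpClient usage, and the DStream[String] of video IDs all stand in for whatever the real pipeline does.

    import org.apache.http.client.methods.HttpGet
    import org.apache.http.impl.client.HttpClients
    import org.apache.http.util.EntityUtils
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical enrichment: look up the source-of-discovery for each
    // play via a REST call. `sessions` stands in for the playback stream.
    def enrich(sessions: DStream[String]): Unit =
      sessions.foreachRDD { rdd =>
        rdd.foreachPartition { plays =>
          // One HTTP client per partition, reused across records, so we
          // avoid opening a fresh connection for every event.
          val client = HttpClients.createDefault()
          try {
            plays.foreach { videoId =>
              // Placeholder endpoint, not the real discovery service.
              val resp = client.execute(new HttpGet(s"http://discovery-service/lookup/$videoId"))
              try {
                val body = EntityUtils.toString(resp.getEntity)
                // ... merge `body` with video metadata, write downstream
              } finally resp.close()
            }
          } finally client.close()
        }
      }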
  10. Spark Streaming
      ● Needs a StreamingContext and a batch duration
      ● Data is received in DStreams, which are easily converted to RDDs
      ● Supports all fundamental RDD transformations and operations
      ● Time-based windowing
      ● Checkpointing support for resilience to failures
      ● Deployment
      (A minimal skeleton sketch follows.)
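To make those bullets concrete, here is a minimal Spark Streaming skeleton in Scala against the 1.6-era API the speaker notes using (see the footnote on slide 14). The topic name, broker address, and checkpoint path are placeholders, not values from the talk.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object PlaybackStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("playback-sessions")
        // Every micro-batch covers 30 seconds of events (illustrative).
        val ssc = new StreamingContext(conf, Seconds(30))
        // Checkpointing gives resilience to failures (path is made up).
        ssc.checkpoint("hdfs:///checkpoints/playback-sessions")

        val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
        // Direct stream: one DStream partition per Kafka partition.
        val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("playback_events"))

        // DStreams support the fundamental RDD operations per micro-batch.
        events.map(_._2)         // keep the event payload, drop the key
              .filter(_.nonEmpty)
              .foreachRDD(rdd => rdd.take(5).foreach(println))

        ssc.start()
        ssc.awaitTermination()
      }
    }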
  11. Performance tuning your Spark Streaming application
  12. Performance tuning your Spark Streaming application
      ● Choice of micro-batch interval
        ○ The most important parameter
      ● Cluster memory
        ○ Large batch intervals need more memory
      ● Parallelism
        ○ DStreams are naturally partitioned to match Kafka partitions
        ○ Repartitioning can increase parallelism at the cost of a shuffle
      ● Number of CPUs
        ○ <= the number of tasks
        ○ Depends on how computationally intensive your processing is
      (A configuration sketch follows.)
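A sketch of where those knobs live in code, assuming the skeleton from slide 10; every number here (interval, memory size, rate cap, partition count) is a placeholder for illustration, not a recommendation from the talk.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("tuned-playback-stream")
      // Cluster memory: larger batch intervals buffer more data per batch.
      .set("spark.executor.memory", "8g")
      // Cap per-partition Kafka ingest so one slow batch cannot snowball
      // into unbounded scheduling delay.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    // Micro-batch interval: the most important parameter. Per-batch
    // processing time must stay below this value.
    val ssc = new StreamingContext(conf, Seconds(30))

    // Parallelism: DStream partitions mirror Kafka partitions by default;
    // repartitioning raises parallelism at the cost of a shuffle, e.g.
    //   val widened = events.repartition(64)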
  13. Performance tuning your Spark Streaming application
  14. Challenges with Spark
      ● Not a ‘pure’ event streaming system
        ○ Minimum latency of one batch interval
        ○ Unintuitive to design for a stream-only world
      ● Choice of batch interval is a little too critical
        ○ Everything can go wrong if you choose this wrong
        ○ Build-up of scheduling delay can lead to data loss
      ● Only time-based windowing*
        ○ Cannot be used to solve session-stitching use cases or trigger-based event aggregations
      * I used 1.6.1
      (A windowing sketch follows.)
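For contrast with the session-stitching gap noted above, this is what purely time-based windowing looks like in that API, assuming `events` is the DStream[String] of video IDs from the earlier skeleton; the window and slide durations are illustrative.

    import org.apache.spark.streaming.{Minutes, Seconds}
    import org.apache.spark.streaming.dstream.DStream

    // Sliding time window: plays per video over the last 10 minutes,
    // recomputed every 30 seconds. Windows are defined only by wall-clock
    // batch time, which is why event-driven session stitching does not fit.
    def playsPerVideo(events: DStream[String]): DStream[(String, Long)] =
      events.map(videoId => (videoId, 1L))
            .reduceByKeyAndWindow(_ + _, Minutes(10), Seconds(30))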
  15. Challenges with Streaming
      ● Pioneer tax
        ○ Training ML models on streaming data is new ground
      ● Increased criticality of outages
        ○ Batch failures have to be addressed urgently; streaming failures have to be addressed immediately
      ● Infrastructure investment
        ○ Monitoring, alerts
        ○ Non-trivial deployments
      There are two kinds of pain...
  16. Questions? Stay in touch! @NetflixData