
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013


Providing a great media consumption experience to customers is crucial to maximizing audience engagement. To do that, it is important that you make content available for consumption anytime, anywhere, on any device, with a personalized and interactive experience. This session explores the power of big data log analytics (real-time and batched), using technologies like Spark, Shark, Kafka, Amazon Elastic MapReduce, Amazon Redshift and other AWS services. Such analytics are useful for content personalization, recommendations, personalized dynamic ad-insertions, interactivity, and streaming quality.
This session also includes a discussion from Netflix, which explores personalized content search and discovery with the power of metadata.

Published in: Technology, Business

  1. 1. Maximizing Audience Engagement in Media Delivery • Usman Shakeel, Amazon Web Services • Shobana Radhakrishnan, Engineering Manager at Netflix • November 14th 2013 © 2013, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of, Inc.
  2. 2. Consumers Want … • To watch content that matters to them – From anywhere – For free • Content to be easily accessible – Nicely organized according to their taste – Instantly accessible from all different devices anywhere • Content to be of high quality – Without any “irrelevant” interruptions – Personalized ads • To multitask while watching content – Interactivity, second screens, social media, share
  3. 3. Ultimate Customer Experience • Ultimate content discovery • Ultimate content delivery
  4. 4. Content Discovery
  5. 5. Content Choices Evolution
  6. 6. … And More +
  7. 7. Content Discovery Evolution
  8. 8. … And More Personalized Row Display Unified Search Similarity Algorithms
  9. 9. Content Delivery …
  10. 10. User Engagement in Online Video [Source: Conviva Viewer Experience Report – 2013]
  11. 11. Personalized Ads?
  12. 12. Mountains of raw data …
  13. 13. Data Sources • Content discovery – Metadata – Session logs • Content delivery – Video logs – Page-click event logs – CDN logs – Application logs • Computed along several high-cardinality dimensions • Very large datasets for a specific time frame
  14. 14. Mountains of Raw Data … Some numbers from Netflix – Over 40 million customers – More than 50 countries and territories – Translates to hundreds of billions of events in a short period of time – Over 100 Billion Meta-data operations a day – Over 1 Billion viewing hours per month
  15. 15. … To Useful Information ASAP • Historical data – Batch Analysis • Live data – Real-time Analysis
  16. 16. Historic vs. Real-time Analytics • 100% Dynamic – Always computing on the fly – Flexible but slow – Scale is very hard – Content Delivery • 100% Precomputed – Superfast lookups – Rigid – Does not cover all the use cases – Content Discovery
  17. 17. The Challenge Real-time Processing Dashboards/ User Personalization/ User Experience Ingest & Stream Mountains of Raw Data Storage/DWH Back-end Processing
  18. 18. Agenda • Ultimate Content Discovery – How Netflix creates personalized content and the power of Metadata • Ultimate Content Delivery – The toolset for real-time big data processing
  19. 19. Content Discovery Personalized Experience
  20. 20. Why Is Personalization Important? • Key streaming metrics – Median viewing hours – Net subscribers • Personalization consistently improves these • Over 75% of what people watch comes from recommendations
  21. 21. What Is Video Metadata? • Genre • Cast • Rating • Contract information • Streaming and deployment information • Subtitles, dubbing, trailers, stills, actual content • Hundreds of such attributes
  22. 22. Used For.. • User-specific choices, e.g., language, viewing behavior, taste preferences • Recommendation algorithms • Device-specific rendering and playback • CDN deployment • Billboards, trailer display, streaming • Original programming • Basically everything!
  23. 23. Our Solution • Data snapshots to Amazon S3 (~10) • Metadata publishing engine (one EC2 instance per country) generates Amazon S3 facets (~10 per country) • Metadata cache reads Amazon S3 periodically and serves high-availability apps deployed on EC2 (~2000 m2.4xl and m2.2xl) [Diagram labels: Mountains of Video Metadata; Batch Processing; Relevant Organized Data for Consumption]
  24. 24. Metadata Platform Architecture [Diagram labels, in the order they appear: Various Metadata Generation and Entry Tools; Put Snapshot Files (one per source) to Amazon S3; Get Snapshot Files; Publishing Engine (netflix.vms.blob.file.instancetype.region); Put Facets (10 per country per cycle) to Amazon S3; Get Blobs (~7 GB files, 10 gets per instance refresh); … Playback, Devices, API, Algorithms (EC2 instances, m2.2xl or m2.4xl); Offline Metadata Processing]
  25. 25. Data Entry and Encoding Tools Persistent Storage AmazonS3 Publishing Engine In-memory Cache Metadata entered Hourly data snapshots Check for snapshots Generate, write artifacts • Efficient resource utilization • Quick data propagation Periodic cache refresh • Low footprint • Quick startup/refresh Java API calls Apps
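The periodic cache refresh on this slide can be pictured with a minimal sketch. It assumes the cache holds one immutable snapshot and swaps in a fresh one on each refresh; `SnapshotCache` and its `loader` are hypothetical names standing in for the Amazon S3 read, not Netflix's actual API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of a periodically refreshed in-memory cache.
// SnapshotCache and its loader are illustrative names, not Netflix's API.
class SnapshotCache {
    // Stands in for "fetch and parse the latest artifact from Amazon S3".
    private final Supplier<Map<String, String>> loader;
    // Readers always see one complete, immutable snapshot.
    private final AtomicReference<Map<String, String>> current = new AtomicReference<>();

    SnapshotCache(Supplier<Map<String, String>> loader) {
        this.loader = loader;
        refresh(); // load once at startup so lookups never leave memory
    }

    // Build the new snapshot off to the side, then swap it in atomically.
    void refresh() {
        Map<String, String> copy = new HashMap<>(loader.get());
        current.set(Collections.unmodifiableMap(copy));
    }

    // All lookups are served from memory.
    String get(String key) {
        return current.get().get(key);
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("title:1", "genre=drama");
        SnapshotCache cache = new SnapshotCache(() -> source);

        source.put("title:1", "genre=comedy");    // upstream data changes
        System.out.println(cache.get("title:1")); // still the old snapshot
        cache.refresh();                          // periodic refresh picks it up
        System.out.println(cache.get("title:1"));
    }
}
```

In a long-running service, `refresh()` would typically be driven by a timer such as `ScheduledExecutorService.scheduleAtFixedRate`; the sketch calls it by hand to stay deterministic.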
  26. 26. Initially.. ~2000 EC2 Instances
  27. 27. Target Application Scale • File size 2 GB–15 GB • ~10 per country (20 total) • ~2000 instances (m2.2xl or m2.4xl) accessing these files once an hour via cache refresh from Amazon S3 • Availability goal: Streaming: 99.99%, sign-ups: 99.9% – 100% of metadata access in memory – Autoscaling to efficiently manage, startup time
  28. 28. And Then.. 6000+ EC2 Instances
  29. 29. Target Application Scale • File size 2 GB–15 GB • ~10 per country (500 total, up from 20) • ~6000 instances (up from ~2000; m2.2xl or m2.4xl) accessing via cache refresh from Amazon S3 • 100% of access in-memory to achieve high availability • Autoscaling to efficiently manage, startup time
  30. 30. Effects • Slower file writes • Longer publish time • Slower startup and cache refresh
  31. 31. Amazon S3 Tricks That Helped • Fewer writes – Region-based publishing engine instead of per-country – Blob images rather than facets – 10 Amazon S3 writes per cycle (down from 500) • Smaller file sizes – Deduping moved to prewrite processing – Compression: zipped data snapshot files from source • Multipart writes
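The two "smaller file sizes" tricks can be illustrated with a short sketch: remove duplicate records before writing, then compress what is left. This is only an illustration under assumptions. `SnapshotPacker`, the one-record-per-line layout, and the use of gzip are hypothetical choices, and the actual Amazon S3 upload (including multipart writes) is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: dedupe records before the write, then compress them.
class SnapshotPacker {
    // Drop duplicate records, keeping the first occurrence and the original order.
    static List<String> dedupe(List<String> records) {
        return new ArrayList<>(new LinkedHashSet<>(records));
    }

    // Gzip the records, one per line, into a byte array ready to upload.
    static byte[] gzip(List<String> records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            for (String record : records) {
                gz.write(record.getBytes(StandardCharsets.UTF_8));
                gz.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("genre=drama", "cast=a,b", "genre=drama");
        List<String> unique = dedupe(raw);
        byte[] packed = gzip(unique);
        System.out.println(unique.size() + " unique records, " + packed.length + " bytes compressed");
    }
}
```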
  32. 32. Results • Significant reduction in average memory footprint • Significant reduction in application startup times • Shorter publish times
  33. 33. What We Learned • In-memory cache (NetflixGraph) effective for high availability • Startup time important when using autoscaling • Use Amazon S3 best practices • Circuit breakers
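The slide lists circuit breakers without detail. As a rough illustration of the general pattern only (not Netflix's implementation), the sketch below stops calling a failing dependency after a number of consecutive failures and serves a fallback until a cool-down has passed. The class name, thresholds, and injected clock are all assumptions made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker; names and thresholds are illustrative.
class CircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private final LongSupplier clock; // injected so the behavior is testable
    private int consecutiveFailures = 0;
    private long openedAt = -1;       // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long coolDownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
        this.clock = clock;
    }

    // Open means: recently tripped, so callers get the fallback immediately.
    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < coolDownMillis;
    }

    // Run the action unless the circuit is open; return the fallback on failure.
    <T> T call(Supplier<T> action, T fallback) {
        if (isOpen()) {
            return fallback;
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // trip (or re-trip) the breaker
        }
    }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(2, 1000, System::currentTimeMillis);
        Supplier<String> failing = () -> { throw new RuntimeException("dependency down"); };
        System.out.println(breaker.call(failing, "cached value"));
        System.out.println(breaker.call(failing, "cached value"));
        System.out.println("open: " + breaker.isOpen());
    }
}
```

Once the cool-down has elapsed, the next call is attempted again; a success closes the circuit and a failure trips it for another cool-down period.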
  34. 34. Future Architecture • Dynamically controlled cache • Dynamic code inclusion • Parallelized publishing engine
  35. 35. Content Delivery Toolset for real-time big-data processing
  36. 36. The Challenge Real-time Processing Dashboards/ User Personalization/ User Experience Ingest & Stream Mountains of Raw Data Storage/DWH Backend Processing
  37. 37. Ingest and Stream Amazon Kinesis Kafka Amazon DynamoDB Amazon SQS Amazon S3
  38. 38. Amazon Kinesis: Enabling real-time ingestion & processing of streaming data [Diagram labels: User Data Sources PUT through the AWS VIP into Amazon Kinesis; Amazon Kinesis Enabled Applications GET from it: User App.1 (Aggregate & DeDuplicate), User App.2 (Metric Extraction), User App.3 (Sliding Window); Amazon S3, DynamoDB, Amazon Redshift; Control Plane]
  39. 39. A Quick Intro to Amazon Kinesis • Producers – Generate a stream of data – Data records from producers are put into a stream using a developer-supplied partition key, which places each record within a specific shard • Kinesis cluster – A managed service that captures and transports data streams – Multiple shards; each supports 1 MB/sec – Developer controls the number of shards; all shards are stored for 24 hours – High availability and data durability through 3-way replication (3 AZs) – Each data record has a Kinesis-assigned sequence number • Workers – Each shard is processed by a worker running on EC2 instances that the developer owns and controls [Diagram: producers feed shards S0–S4 of the Kinesis cluster; workers W1–W6 run on EC2 instances]
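To make the partition key and shard relationship concrete, here is a small sketch. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch assumes the shards split the hash space into equal ranges, which holds only for an evenly provisioned stream. `ShardRouter` and `shardsNeeded` are hypothetical helpers, not part of any AWS SDK, and the sizing helper considers only the slide's 1 MB/sec per-shard figure.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of partition-key routing; not an AWS SDK class.
class ShardRouter {
    // MD5 of the partition key, read as an unsigned 128-bit integer.
    static BigInteger hashKey(String partitionKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    // Assumption: shardCount shards own equal slices of the 2^128 hash space.
    // floor(hash * shardCount / 2^128) is the index of the owning slice.
    static int shardFor(String partitionKey, int shardCount) {
        return hashKey(partitionKey)
                .multiply(BigInteger.valueOf(shardCount))
                .shiftRight(128)
                .intValue();
    }

    // Shards needed for a given ingest rate at 1 MB/sec per shard (from the slide).
    static int shardsNeeded(double ingestMbPerSec) {
        return (int) Math.ceil(ingestMbPerSec / 1.0);
    }

    public static void main(String[] args) {
        int shards = shardsNeeded(4.2);
        System.out.println("shards needed: " + shards);
        System.out.println("viewer-42 goes to shard " + shardFor("viewer-42", shards));
    }
}
```

Because the mapping depends only on the key, all records with the same partition key land on the same shard and are seen in order by that shard's worker.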
  40. 40. Processing Amazon EMR Amazon Redshift
  41. 41. A Quick Intro to Storm • Similar to a Hadoop cluster • Topology vs. jobs (Storm vs. Hadoop) – A topology runs forever (unless you kill it): storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2 • Streams – Unbounded sequence of tuples – E.g., a stream of tweets transformed into a stream of trending topics – Spout: source of streams (e.g., connects to a log API and emits a stream of logs) – Bolt: consumes any number of input streams, does some processing, and emits new streams (e.g., filters, unions, compute) *Source:
  42. 42. A Quick Intro to Storm Example: Get the count of ads that were clicked on and watched in a stream
      import backtype.storm.drpc.LinearDRPCTopologyBuilder;
      import backtype.storm.tuple.Fields;
      // GetStream, GetViewers, PartialUniquer, and CountAggregator are bolts defined elsewhere.

      // Create a topology
      LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
      // Get the streams that showed an ad: transforms [id, ad] to [id, stream]
      builder.addBolt(new GetStream(), 3);
      // Get the viewers of each stream: transforms [id, stream] to [id, viewer]
      builder.addBolt(new GetViewers(), 12).shuffleGrouping();
      // Partition by (id, viewer) so each task counts the unique viewers in its subset
      builder.addBolt(new PartialUniquer(), 6).fieldsGrouping(new Fields("id", "viewer"));
      // Sum the partial counts per request id to get the total unique viewers
      builder.addBolt(new CountAggregator(), 2).fieldsGrouping(new Fields("id"));
      *Adapted from source:
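For readers who do not know Storm, the result this topology computes can be written as ordinary single-machine Java. The sketch uses no Storm APIs. The two lookup maps are hypothetical in-memory stand-ins for whatever the `GetStream` and `GetViewers` bolts would query, and the set plays the role that `PartialUniquer` and `CountAggregator` split across tasks.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical single-machine equivalent of the unique-viewer count.
class AdReach {
    static int uniqueViewers(String adId,
                             Map<String, List<String>> streamsByAd,
                             Map<String, List<String>> viewersByStream) {
        Set<String> unique = new HashSet<>();
        // ad -> streams that showed it (the GetStream step)
        for (String stream : streamsByAd.getOrDefault(adId, Collections.emptyList())) {
            // stream -> viewers (the GetViewers step); the set removes duplicates
            unique.addAll(viewersByStream.getOrDefault(stream, Collections.emptyList()));
        }
        return unique.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> streamsByAd = new HashMap<>();
        streamsByAd.put("ad-1", Arrays.asList("stream-a", "stream-b"));
        Map<String, List<String>> viewersByStream = new HashMap<>();
        viewersByStream.put("stream-a", Arrays.asList("alice", "bob"));
        viewersByStream.put("stream-b", Arrays.asList("bob", "carol"));
        System.out.println(uniqueViewers("ad-1", streamsByAd, viewersByStream));
    }
}
```

The topology exists because this loop does not scale: spreading the lookups over many tasks and grouping by viewer lets each task deduplicate its own subset in parallel.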
  43. 43. Putting It Together … [Diagram labels: User Data Sources PUT through the AWS VIP into Amazon Kinesis; an Amazon Kinesis Enabled Application GETs from it: User App.1 (Aggregate & DeDuplicate), User App.2 (Metric Extraction), User App.3 (Sliding Window); Control Plane]
  44. 44. A Quick Intro to Spark • Language-integrated interface in Scala • General-purpose programming interface that can be used for interactive data mining on clusters • Example (count buffer events from a streaming log)
      val lines = spark.textFile("hdfs://...")           // define the base dataset
      val errors = lines.filter(_.startsWith("BUFFER"))  // keep only buffer events
      errors.persist()                                   // keep the filtered dataset in memory (cache() is shorthand for this)
      errors.count()                                     // count all buffer events
      errors.filter(_.contains("Stream1")).count()       // count buffer events for Stream1, reusing the in-memory data
      *Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
  45. 45. A Quick Intro to
  46. 46. Logistic Regression Performance • 30 GB dataset on an 80-core cluster • Hadoop: 127 s / iteration • Spark: first iteration 174 s, further iterations 6 s • Up to 20x faster than Hadoop for interactive jobs • Scans a 1 TB dataset with 5–7 sec latency *Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
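A quick worked calculation shows why the iteration figures matter. It reads 127 s as Hadoop's cost for every iteration, and 174 s and 6 s as Spark's first and subsequent iterations once the data is held in memory. That reading is an assumption about the chart labels, and `IterationCost` is a hypothetical helper.

```java
// Hypothetical helper that totals the per-iteration times quoted on the slide.
class IterationCost {
    // Assumed reading: every Hadoop iteration rereads the data at 127 s.
    static long hadoopSeconds(int iterations) {
        return 127L * iterations;
    }

    // Assumed reading: Spark pays 174 s to load and cache, then 6 s per iteration.
    static long sparkSeconds(int iterations) {
        return iterations == 0 ? 0 : 174L + 6L * (iterations - 1);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 10, 30}) {
            System.out.println(n + " iterations: Hadoop " + hadoopSeconds(n)
                    + " s, Spark " + sparkSeconds(n) + " s");
        }
    }
}
```

Under these assumptions a single pass is slower with caching (174 s against 127 s), the second iteration already breaks even (180 s against 254 s), and ten iterations take 228 s against 1270 s.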
  47. 47. Conviva GeoReport Time (hours) • Aggregations on many keys w/ same WHERE clause • 40× gain comes from: – Not rereading unused columns or filtered records – Avoiding repeated decompression – In-memory storage of deserialized objects
  48. 48. Back-end Storage Amazon Redshift HDFS on Amazon EMR Amazon DynamoDB Amazon RDS Amazon S3
  49. 49. Batch Processing • EMR – Hive on EMR – Custom UDF (user-defined functions) needs for data warehouse Amazon EMR • Redshift – More traditional data warehousing workload Amazon Redshift
  50. 50. The Challenge Real-time Processing Dashboards/ User Personalization/ User Experience Ingest & Stream Mountains of Raw Data Storage/DWH Back-end Processing
  51. 51. The Solution • Mountains of Raw Data • Ingest & Stream: Amazon Kinesis, Kafka, Amazon DynamoDB, Amazon SQS, Amazon S3 • Real-time Processing: Storm, Spark, Amazon EMR, Amazon Redshift • Storage/DWH: Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift • Back-end Processing: Amazon EMR • Audience Engagement
  52. 52. What’s Next … • On-the-fly adaptive bit rate; in the future, frame rate and resolution • Dynamic personalized ad insertion • Session analytics for more monetization opportunities • Social media chat
  53. 53. Pick up the remote Start watching Ultimate entertainment experience…
  54. 54. Please give us your feedback on this presentation MED303 As a thank you, we will select prize winners daily for completed surveys!