
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark


Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermúdez-Hernández

Published in: Data & Analytics

  1. 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  2. 2. Josh McNutt and Keria Bermúdez-Hernández Showtime Networks Inc. Data-Driven Transformation: Leveraging Big Data at SHOWTIME with Apache Spark #UnifiedAnalytics #SparkAISummit
  3. 3. Showtime Networks Inc. • Premium cable network known for bold and wholly unique programming • In 2018, SHOWTIME had four of the top six scripted hour-long series on premium cable: HOMELAND, SHAMELESS, RAY DONOVAN and BILLIONS • The launch of the SHOWTIME standalone streaming service in July 2015 made available an unprecedented level of user-level data
  4. 4. Big Data at SHOWTIME • SHOWTIME standalone streaming service • Direct relationship with our customers • Rich data captured on every customer interaction: – Viewing History – Customer Journey Information – CRM interactions ... • Billions of records • Extremely high level of granularity
  5. 5. Big Data at SHOWTIME: Questions
  6. 6. What is the lifetime value of a subscriber who signs up to watch Ray Donovan? How many free trial signups have been generated by Shameless? What is the probability a subscriber will begin watching Billions for the first time in the next 7 days?
  7. 7. Big Data at SHOWTIME: Questions vs. Capacity to Answer
  8. 8. Big Data at SHOWTIME • Strong data science skills needed to interact directly with data lake • Basic questions answered via operational reporting • Complex questions required bespoke analyses, sometimes taking weeks or longer to develop (diagram labels: Raw Data, Research/Business Analyst)
  9. 9. Big Data at SHOWTIME: Questions vs. Capacity to Answer. How to reduce this gap?
  10. 10. Data Strategy Team: Small team of data scientists working to… • Democratize data and analytics • Understand and predict subscriber behaviors • Support data-driven programming & scheduling
  11. 11. Democratizing Data and Analytics: We’re working to bring the data closer to the users and the users closer to the data • Augmenting Data Supply Chain: Making the data more suitable for analysis (diagram labels: Raw Data, Business/Research Analyst)
  12. 12. Democratizing Data and Analytics: Master Viewers Table • Foundational customer data representation • Captures 1000s of metrics and behaviors in Free Trial, Paid Month 1, lifetime • Tracks relationship between each user and series • Supports several use cases: – machine learning – reporting/dashboarding – ad hoc analysis
  13. 13. Democratizing Data and Analytics: We’re working to bring the data closer to the users and the users closer to the data • Augmenting Data Supply Chain: Making the data more suitable for analysis • Data Skills Training (Tableau, SQL): Creating conditions leading to a chain reaction of curiosity, exploration, analysis and insight (diagram labels: Raw Data, Business/Research Analyst)
  14. 14. Understanding and Predicting Subscriber Behaviors: We build actionable machine learning models to gain an intimate understanding of how our users behave and how to strengthen their relationship to SHOWTIME • XGBoost • Random Forests • Put into production: – Churn Propensity – Resubscription Propensity – Future Customer Value – Series Viewership Propensity
  15. 15. Supporting Data-Driven Programming and Scheduling: We are employing data and analytics to attribute revenue and subscribers to content • Free trial signups • Resubscribers • Reduction in Churn among viewers • Renewal Decisions • Schedule Optimization • Subscriber Growth • Revenue Growth (charts: Churn % by Paid Month; Mean subscriber lifetime value (LTV))
  16. 16. How to stitch everything together? FLASHBACK: In order to build this capability, we needed to get basic infrastructure in place
  17. 17. Configuration Challenges • Team composed exclusively of data scientists, without a dedicated DevOps engineer • A lot of time spent troubleshooting cluster configuration • Launching clusters was slow and clumsy • Debugging our bootstrap script (to install Python libraries) was nontrivial • Any software or hardware upgrades were scary and generally avoided for as long as possible #struggle
  18. 18. Databricks Unified Analytics Platform: Back to focusing on data science! • Booting and managing clusters is now easy, fast and reliable • Simple to attach/detach notebooks to clusters • Databricks handles installation of Python libraries • Easy to connect data sources to Tableau • We can confidently experiment with new software or hardware in an effort to improve our workflows
  19. 19. Data Pipeline
  20. 20. Optimizing Pipeline Solutions • Apache Airflow • Code optimization tactics: – Replacing RDD and Pandas transformations with PySpark DataFrame transformations – Multithreading using futures module
  21. 21. Apache Airflow • Produced different tables by using the dbutils.notebook API (code fragment on slide: "notebook-name", 60, {"argument": "data", "argument2": "data2", ...})) – While technically you can run notebooks concurrently, complex dependencies and concurrent jobs are harder to manage in a single notebook and job • Apache Airflow was the solution for scheduling and managing pipelines (code on slide: notebook_task = DatabricksSubmitRunOperator(task_id='notebook_task', dag=dag, json=notebook_task_params)) – Airflow and Databricks integration – Manages dependencies – Each task's job can be run on a different cluster (diagram: job1 through job6, shown as a workflow in a single notebook and as a workflow in Apache Airflow)
  22. 22. Code Optimization Tactics • User-level RDD with subscriber status and viewership records • Combinations of methods from custom classes
  23. 23. Code Optimization Tactics
  24. 24. Code Optimization Tactics
  25. 25. Code Optimization Tactics • To reduce task complexity, we saved intermediate tables
  26. 26. Different Levels of Concurrency (panels: BEFORE, AFTER)
  27. 27. Tracking Data Quality
  28. 28. Tracking Data Quality
  29. 29. Using Delta to Update Tables • Without Delta: – Ingesting daily data – Updating fields – Rewriting data • With Delta: – Ingesting daily data – Updating fields where necessary – Rewriting data • MERGE INTO viewership AS hv USING temp_tc AS t ON hv.title = t.title WHEN MATCHED AND (hv.category IS NULL OR hv.category != t.category) THEN UPDATE SET hv.category = t.category • Chart (hours): Without Delta 3.03, With Delta 0.47
  30. 30. Optimization Summary • Apache Airflow • Code optimizations • MLflow for tracking data quality • Delta for updating tables
  31. 31. Final Thoughts • Unified platform allows us to democratize data and analytics • We continue to engage in ongoing innovation and optimization • Exciting cultural shift
  32. 32. Questions?
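
Slide 20 lists multithreading with the futures module as one of the code optimization tactics. The pattern might look roughly like the sketch below, which uses Python's standard concurrent.futures. The function build_table, the job names and the worker count are hypothetical stand-ins and do not come from the talk. In a real pipeline each function would presumably trigger a Spark action or a notebook run rather than sleep.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def build_table(name: str) -> str:
    """Hypothetical stand-in for one independent pipeline step."""
    time.sleep(0.1)  # simulates waiting on a cluster job or I/O
    return f"{name}: done"


def build_tables_concurrently(names, max_workers=4):
    """Run independent steps in threads and collect results keyed by name."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the step it belongs to.
        futures = {pool.submit(build_table, n): n for n in names}
        for fut in as_completed(futures):
            # fut.result() re-raises any exception raised inside the step.
            results[futures[fut]] = fut.result()
    return results


if __name__ == "__main__":
    print(build_tables_concurrently(["job1", "job2", "job3"]))
```

Threads suit this kind of work because the driver mostly waits on the cluster, so several independent steps can be in flight at once. Steps that depend on each other still need explicit ordering, which is the role the talk assigns to Apache Airflow.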