Big Data Day LA 2016 / Use Case Driven track - Reliable Media Reporting in an Ever-changing Data Landscape
Rachel Kelley, Project Manager, Josh Andrews, Data & Analytics Architect, OnPrem & Eric Avila, Senior Anti-Piracy Technologist, NBCUniversal

OnPrem Solution Partners worked with NBCU to profile in-house data to determine data quality, and recommend process and quality improvements. We present our process for data import, improvements we want to make, and lessons learned regarding various tools used, including MariaDB, ElasticSearch, Cassandra, and others.

Published in: Technology

  1. Reliable Media Reporting in an Ever-Changing Data Landscape
  2. Presenters
     • Eric Avila, NBCU: NBCU Senior Technologist, Creative Content Protection Team
     • Rachel Kelley, OnPrem: Senior Project Manager, Data & Analytics Practice
     • Josh Andrews, OnPrem: Data Technology Lead/Architect, Data & Analytics Practice
  3. Agenda
     • Introduction: NBCU and OnPrem, Problem Statement
     • Approach: Background, Methodology, Data Analysis
     • Outcome: Recommendations, Next Steps
     • Q&A
  4. NBCU CCP Overview
     • NBCU is one of the world's largest entertainment companies
     • Cable Television, Broadcast Television, Digital, Parks, Film
     • Responsibilities of NBCU’s Creative Content Protection Group (CCP)
     • CCP creates & manages technological solutions to these needs
  5. OnPrem Solution Partners
     • Media & Entertainment Technology Consulting Firm
     • Business Consulting, Technology Leadership, Applied Innovation
     • Business Strategy, Product Roadmap, Process Improvement, Change Management, CRM, Data & Analytics, Digital Supply Chain, PMO & SI Services, Custom Solutions, Enterprise App Development, QA & Support, UX/UI
     • Los Angeles, New York, Austin
  6. Problem Statement
     Problem Statement:
     • NBCU CCP wanted to obtain a better view of their data flow and process to manage asset identification and analytics
     Scope:
     • Data from streaming services regarding NBCU-owned content
     • Priority data solutions in place within CCP and other NBCU teams
     Objectives:
     • Develop a data strategy around streaming services metadata
     • Investigate/define initial taxonomy, initiate data profiling, and develop data source list
  7. Agenda
     • Introduction: NBCU and OnPrem, Problem Statement
     • Approach: Background, Methodology, Data Analysis
     • Outcome: Recommendations, Next Steps
     • Q&A
  8. Project Background
     • Lightweight digital identifier, easily referenced against fingerprints generated from other assets of its kind
     • Sent to vendors/partners & verified against uploaded content
     • Example: titles
     • Common data problem across industries:
       – Duplicates
       – Language
       – Quality
     • “Truth” changes over time and by business need
     • Oh hey, that’s my content you’ve got there…
     • Streaming services are triggered to associate content in video to ownership of reference asset
  9. Systems in Place
     • Time series data
     • Analytic summaries
     • Title metadata and fingerprinting
     • Fingerprint, title and analytic data
     Solutions which allow full Proof-of-Concept testing before full implementation, without licensing or contract constraints, have been easier to employ
  10. Methodology
      Data Profiling Methodology:
      • Identify relevant systems and tables from stakeholders & obtain access to databases
      • Determine table purpose and population source
      • Generate fundamental metrics for all columns, using proprietary data profiling methodology, e.g.: Datatype, Scale, Cardinality
      • Review metrics for outstanding measures
      • Generate further questions for investigation
      Project Stats:
      • 18+ data systems encountered
      • 9 stakeholder interviews
      • 32 data profiling reports run
      • 8 weeks
  11. Data Flow Diagram
      [figure: data flow among 1. CCP Internal Systems, 2. NBCU Systems, 3. External Data Sources, 4. Vendors and Partners; labeled elements: CCP SQL Server, APIs, Release Dates]
  12. Analysis Performed: SQL Server
      Because the system uses external metadata to “patch” together title data received from various systems into a more reliable dataset, our primary concerns were:
      • Data Quality & Source Integrity
      • Update Frequency
      • Data Complexity
      [diagram elements: Upstream Metadata Sources, CCP SQL Server, Downstream Reporting Capabilities]
  13. Analysis Results: Metadata Staging
      Table: RELEASE_DATES

      Column Name               Is Nullable  Min       Max       Cardinality  Effective Cardinality  % NULL
      Release_Date_ID           no           N/A       N/A       100%         100%                   0%
      Prefix                    yes          N/A       N/A       0%           NULL                   100%
      Title_ID                  no           N/A       N/A       4%           4%                     0%
      Release_Date_Category_ID  no           N/A       N/A       0%           0%                     0%
      Country_ID                yes          N/A       N/A       0%           0%                     29%
      Language_ID               yes          N/A       N/A       0%           0%                     85%
      Original_Network_Code     yes          N/A       N/A       0%           NULL                   100%
      Licensee_ID               yes          N/A       N/A       0%           NULL                   100%
      Season_Number             yes          1         2015      0%           0%                     73%
      Episode_Name              yes          N/A       N/A       19%          69%                    72%
      Episode_Number            yes          0         2210      0%           2%                     72%
      Episode_Length            yes          N/A       N/A       0%           NULL                   100%
      Comment                   yes          N/A       N/A       1%           5%                     86%
      Date                      no           1/1/1900  1/1/3000  14%          14%                    0%
      Is_Special                yes          N/A       N/A       0%           0%                     97%

      Data Quality:
      • Irregular Season/Episode naming conventions
      • Improperly populated Release Dates
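The Cardinality, Effective Cardinality, and % NULL columns in the RELEASE_DATES table could be produced by a per-column profile along the following lines. This is only a sketch: the deck describes its profiling methodology as proprietary, so the definitions used here (distinct values over all rows, distinct values over non-null rows, nulls over all rows) are assumptions that happen to be roughly consistent with the figures shown, and `profile_column` is a hypothetical helper name.

```python
from typing import Any, Optional, Sequence


def profile_column(values: Sequence[Optional[Any]]) -> dict:
    """Compute basic profiling metrics for one column.

    Assumed definitions (not confirmed by the slides):
      cardinality           = distinct non-null values / all rows
      effective_cardinality = distinct non-null values / non-null rows
      pct_null              = null rows / all rows
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    return {
        "rows": total,
        "pct_null": (total - len(non_null)) / total if total else None,
        "cardinality": distinct / total if total else None,
        # Undefined for an entirely null column, like the NULL cells in the table.
        "effective_cardinality": distinct / len(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# Illustrative values only; 2015 mimics a year stored in a season-number field.
season_number = [1, 1, 2, None, None, None, None, 2015]
profile = profile_column(season_number)
print(profile)
```

A maximum such as 2015 in a season-number column is the kind of value this profile surfaces for follow-up questions.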
  14. Analysis Results: Metadata Staging
      General Observations:
      • Looked at grain of title, country, language, category, season and episode, and others
      • Records pulled from multiple sources lead to complexity…
        – Duplicate release dates within titles
        – Conflicting records within titles
      [chart: # of Records Ingested by External Source (Release Date); System 1: 51K, System 2: 48K, System 3: 40K, System 4: 14K]
      [chart: External Sources Per Title; categories 1, 2, 3, 4; values shown: 2,584, 3,417, 352]
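The duplicate and conflicting release dates noted on this slide can be detected by grouping records at a chosen grain. The sketch below is illustrative, not the presenters' method: the field names (`title_id`, `country_id`, `category_id`, `date`, `source`), the chosen grain, and the sample rows are all assumptions.

```python
from collections import defaultdict


def find_release_date_issues(records):
    """Group records by an assumed grain and report two kinds of issue.

    duplicates: the same date appears more than once for one grain key
    conflicts:  more than one distinct date appears for one grain key
    """
    by_grain = defaultdict(list)
    for rec in records:
        key = (rec["title_id"], rec["country_id"], rec["category_id"])
        by_grain[key].append(rec["date"])

    duplicates, conflicts = [], []
    for key, dates in by_grain.items():
        if len(dates) > len(set(dates)):
            duplicates.append(key)
        if len(set(dates)) > 1:
            conflicts.append(key)
    return duplicates, conflicts


# Hypothetical rows ingested from several external systems.
records = [
    {"title_id": 1, "country_id": "US", "category_id": "theatrical", "date": "2015-06-12", "source": "System 1"},
    {"title_id": 1, "country_id": "US", "category_id": "theatrical", "date": "2015-06-12", "source": "System 2"},
    {"title_id": 2, "country_id": "US", "category_id": "theatrical", "date": "2014-01-01", "source": "System 1"},
    {"title_id": 2, "country_id": "US", "category_id": "theatrical", "date": "2014-02-01", "source": "System 3"},
    {"title_id": 3, "country_id": "GB", "category_id": "theatrical", "date": "2013-05-05", "source": "System 1"},
]
duplicates, conflicts = find_release_date_issues(records)
print("duplicates:", duplicates)
print("conflicts:", conflicts)
```

Duplicates can usually be collapsed automatically, while conflicts need a rule for which source to trust, which is why the two are reported separately.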
  15. Analysis Performed: MariaDB
      Considerations:
      • Overall, this is an analysis of viewership and hits
      • Account for matches against official, whitelisted, and licensed videos
      • Outliers were not removed, because doing so would have expunged a large percentage of the match data
      • Summary statistics indicated a left-leaning data set
      [diagram elements: Title Information, MariaDB, CCP SQL Server, Cassandra, Summarized Copyright Match Information]
      [histogram: X: Viewers per Video; Y: Count of Cases in Bucket]
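One way to check how lopsided a viewership distribution is, without removing outliers, is to compare mean and median and compute a moment-based skewness. The sketch below assumes that "left-leaning" means most videos fall in the low-viewer buckets with a long tail of very high counts; the slide does not define the term, and the sample numbers are invented for illustration.

```python
import statistics


def skewness(values):
    """Population (moment-based) skewness; positive means a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # all values identical: no asymmetry to measure
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical viewers-per-video sample: many small values, one very large one.
views = [0, 1, 2, 2, 3, 5, 8, 10_000]

print("mean:", statistics.fmean(views))
print("median:", statistics.median(views))
print("skewness:", skewness(views))
```

When the mean sits far above the median, as it does here, the median and percentile buckets tend to describe typical viewership better than the mean.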
  16. Analysis Results: Hits
      Table: SMART_MATCH (copyright match data)

      Column Name         Datatype   Nullable  % Non-Null  Standard Deviation  Min        Max
      claim_type          varchar    YES       76.77%
      asset_name          varchar    YES       92.78%
      asset_type          varchar    YES       100.00%
      video_title         varchar    YES       76.08%
      reference_status    varchar    YES       65.68%
      reference_length    int        YES       65.68%      3065.065455         18         18746
      content_type        varchar    YES       65.68%
      view_count          int        YES       76.08%      919084.1099         0          1.05E+09
      duration            int        YES       76.08%      2228.763702         0          192887
      video_total_match   int        YES       74.66%      1554.598552         0          38385
      channel_title       varchar    YES       76.08%
      claim_date          datetime   YES       100.00%                         12/7/2007  11/16/2015
      video_upload_date   datetime   YES       76.08%                          8/9/2005   11/16/2015
      licensed_content    tinyint    YES       76.08%      0.140196747         0          1
      privacy             varchar    YES       76.08%
      policy_name         varchar    YES       95.33%
      match_percentage    int        YES       56.95%      48.10774725         0          32388
      channel_comments    int        YES       55.04%      3625.070764         0          1213995
      channel_videos      int        YES       55.04%      1518.373624         0          228113
      season              int        YES       10.49%      72.64507792         1          2015
      episode             int        YES       10.70%      78.35203659         1          4601
      last_updated        timestamp  NO        100.00%                         11/16/2015 11/16/2015
      Whitelisted         tinyint    YES       100.00%     0.099045453         0          1
      official            tinyint    YES       100.00%     0.076134446         0          1
      owner               varchar    YES       100.00%

      Data Discrepancy:
      • Reference length longer than actual video length
      Data Limitation:
      • Only the most recent upload date is displayed, and the value may actually be the date of publishing or of being made public
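The discrepancy noted on this slide lends itself to a simple row-level check. The sketch below assumes that `reference_length` and `duration` are stored in the same unit and that these are the two columns the slide is comparing; it also adds a range check on `match_percentage`, on the assumption that a percentage should fall between 0 and 100 (the table shows a maximum of 32388). The function name and sample rows are hypothetical.

```python
def flag_suspect_rows(rows):
    """Return (video_title, reasons) for rows that fail basic sanity checks."""
    flagged = []
    for row in rows:
        reasons = []
        ref_len = row.get("reference_length")
        duration = row.get("duration")
        # Nullable columns: only compare when both values are present.
        if ref_len is not None and duration is not None and ref_len > duration:
            reasons.append("reference_length > duration")
        pct = row.get("match_percentage")
        # Assumption: a valid match percentage lies in the range 0-100.
        if pct is not None and not 0 <= pct <= 100:
            reasons.append("match_percentage outside 0-100")
        if reasons:
            flagged.append((row.get("video_title"), reasons))
    return flagged


# Invented sample rows using column names from the SMART_MATCH table.
rows = [
    {"video_title": "clip A", "reference_length": 1800, "duration": 1200, "match_percentage": 40},
    {"video_title": "clip B", "reference_length": 600, "duration": 900, "match_percentage": 32388},
    {"video_title": "clip C", "reference_length": None, "duration": 300, "match_percentage": None},
    {"video_title": "clip D", "reference_length": 600, "duration": 600, "match_percentage": 100},
]
flagged = flag_suspect_rows(rows)
for title, reasons in flagged:
    print(title, reasons)
```

Checks like these could run as part of an extract step so that suspect rows are flagged before they reach reporting, rather than being silently dropped.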
  17. Agenda
      • Introduction: NBCU and OnPrem, Problem Statement
      • Approach: Background, Methodology, Data Analysis
      • Outcome: Recommendations, Next Steps
      • Q&A
  18. Key Findings & Recommendations
      Challenge categories: Data Challenges, Tech Challenges, Organizational Challenges
      • Finding: Gaps in metadata make it difficult to understand and utilize collected data effectively
        Recommendation: Streamline the metadata gathering and cleaning process, leveraging other metadata systems
      • Finding: Daily quotas and thresholds limit and distort the data pulled
        Recommendation: Selectively pull data to circumvent daily quotas and potentially improve data integrity
      • Finding: Data integrity from some sources is questionable, and the incentive to improve it varies
        Recommendation: Improve data processes, e.g., add data cleaning to certain data extract and aggregation (ETL) processes
      • Finding: Brand-specific workflows and fringe use cases hinder the ability to acquire metadata & accurately map references
        Recommendation: Roadmap of brand and title match data cleanup for reporting needs, and a process to maintain data integrity
  19. Data Project Principles & Pitfalls
      Maintenance is the Monster
      • Initial creation of data solutions is often easier than long-term maintenance
      Common issues (Technical and Non-Technical)
      • Rapidly changing platforms, frameworks, and methodologies
      • Need for continuous maintenance and verification of data quality
      • Incentives and cultures vary across departments and companies
      • Establishing and disseminating a “data stewardship” mentality
      • Data “truth” changes over time and by business need
      • Ongoing changes in individual consumer behavior, options for copyright owners
  20. Go Forward Plan
      NBCU Next Steps
      • Increased focus on and automation of data matching & data cleanup
      • Enable better business unit segmentation of enterprise data
      • Transition from organic to directed architecture
      • Increased internal outreach
