Reliable Media Reporting in an Ever-Changing Data Landscape
Presenters
Eric Avila, NBCU
• NBCU Senior Technologist, Creative Content Protection Team
Rachel Kelley, OnPrem
• Senior Project Manager, Data & Analytics Practice
Josh Andrews, OnPrem
• Data Technology Lead/Architect, Data & Analytics Practice
Agenda
Introduction
NBCU and OnPrem
Problem Statement
Approach
Background
Methodology
Data Analysis
Outcome
Recommendations
Next Steps
Q&A
NBCU CCP Overview
NBCU is one of the world’s largest entertainment companies
Responsibilities of NBCU’s Creative Content Protection Group (CCP)
CCP creates & manages technological solutions to these needs
Cable Television
Broadcast Television
Digital
Parks
Film
OnPrem Solution Partners
Media & Entertainment Technology Consulting Firm
Business Consulting
Technology Leadership
Applied Innovation
Business Strategy
Product Roadmap
Process Improvement
Change Management
CRM
Data & Analytics
Digital Supply Chain
PMO & SI Services
Custom Solutions
Enterprise App Development
QA & Support
UX/UI
Los Angeles
New York
Austin
Problem Statement
Problem Statement:
• NBCU CCP wanted to obtain a better view of its data flow and process to manage asset identification and analytics
Scope:
• Data from streaming services regarding NBCU-owned content
• Priority data solutions in place within CCP and other NBCU teams
Objectives:
• Develop a data strategy around streaming services metadata
• Investigate/define initial taxonomy, initiate data profiling, and develop data source list
Project Background
• Lightweight digital identifier, easily referenced against fingerprints generated from other assets of its kind
• Sent to vendors/partners & verified against uploaded content
• Example: titles
• Common data problem across industries:
– Duplicates
– Language
– Quality
• “Truth” changes over time and by business need
• Oh hey, that’s my content you’ve got there…
• Streaming services are triggered to associate content in video to ownership of reference asset
Systems in Place
• Time series data
• Analytic summaries
• Title metadata and fingerprinting
• Fingerprint, title and analytic data
Solutions which allow full Proof-of-Concept testing before full implementation, without licensing or contract constraints, have been easier to employ
Methodology
Data Profiling Methodology
• Identify relevant systems and tables from stakeholders & obtain access to databases
• Determine table purpose and population source
• Generate fundamental metrics for all columns, using proprietary data profiling methodology, e.g.: Datatype, Scale, Cardinality
• Review metrics for outstanding measures
• Generate further questions for investigation
Project Stats
• 18+ data systems encountered
• 9 stakeholder interviews
• 32 data profiling reports run
• 8 weeks
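The column-level metrics named in the methodology (datatype, cardinality) can be illustrated with a minimal Python sketch. This is not the proprietary profiling method used on the project. The function names, the sample rows, and the definitions used here (cardinality as distinct values over all rows, effective cardinality as distinct values over non-null rows) are assumptions made for illustration.

```python
from collections import Counter


def profile_column(values):
    """Profile one column given as a list; None stands in for NULL."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    types = Counter(type(v).__name__ for v in non_null)
    return {
        # Most common Python type among the non-null values
        "datatype": types.most_common(1)[0][0] if types else None,
        "pct_null": 1 - len(non_null) / total if total else None,
        # Assumed definition: distinct values over all rows
        "cardinality": distinct / total if total else None,
        # Assumed definition: distinct values over non-null rows
        "effective_cardinality": distinct / len(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = rows[0].keys()
    return {col: profile_column([r.get(col) for r in rows]) for col in columns}


# Hypothetical sample rows, not project data
rows = [
    {"title_id": 1, "season_number": 1, "episode_name": "Pilot"},
    {"title_id": 1, "season_number": 1, "episode_name": None},
    {"title_id": 2, "season_number": None, "episode_name": None},
    {"title_id": 2, "season_number": 2015, "episode_name": "Finale"},
]
profile = profile_table(rows)
for column, metrics in profile.items():
    print(column, metrics)
```

Reporting cardinality both ways keeps a mostly empty column from looking low-cardinality when its few populated values are in fact nearly unique.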
Data Flow Diagram
1. CCP Internal Systems
2. NBCU Systems
3. External Data Sources
4. Vendors and Partners
Diagram labels: CCP SQL Server, APIs, Release Dates
Analysis Performed: SQL Server
As the system takes external metadata and uses it to “patch” together title
data received from various systems to create a more reliable dataset, our
primary concerns were:
• Data Quality & Source Integrity
• Update Frequency
• Data Complexity
Diagram: Upstream Metadata Sources → CCP SQL Server → Downstream Reporting Capabilities
Analysis Results: Metadata Staging
Table: RELEASE_DATES
Column Name               Is Nullable  Min       Max       Cardinality  Effective Cardinality  % NULL
Release_Date_ID           no           N/A       N/A       100%         100%                   0%
Prefix                    yes          N/A       N/A       0%           NULL                   100%
Title_ID                  no           N/A       N/A       4%           4%                     0%
Release_Date_Category_ID  no           N/A       N/A       0%           0%                     0%
Country_ID                yes          N/A       N/A       0%           0%                     29%
Language_ID               yes          N/A       N/A       0%           0%                     85%
Original_Network_Code     yes          N/A       N/A       0%           NULL                   100%
Licensee_ID               yes          N/A       N/A       0%           NULL                   100%
Season_Number             yes          1         2015      0%           0%                     73%
Episode_Name              yes          N/A       N/A       19%          69%                    72%
Episode_Number            yes          0         2210      0%           2%                     72%
Episode_Length            yes          N/A       N/A       0%           NULL                   100%
Comment                   yes          N/A       N/A       1%           5%                     86%
Date                      no           1/1/1900  1/1/3000  14%          14%                    0%
Is_Special                yes          N/A       N/A       0%           0%                     97%
Data Quality:
• Irregular Season/Episode naming conventions
• Improperly populated Release Dates
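The Date range of 1/1/1900 to 1/1/3000 and the Season_Number maximum of 2015 in the RELEASE_DATES profile suggest placeholder dates and years entered as season numbers. A minimal sketch of a row-level check follows. The plausible date window, the season ceiling, and the dictionary keys are assumptions chosen for illustration, not rules from the project.

```python
from datetime import date

# Assumed bounds; a real check would take these from the business
PLAUSIBLE_DATES = (date(1920, 1, 1), date(2030, 12, 31))
MAX_PLAUSIBLE_SEASON = 100


def flag_release_row(row):
    """Return a list of data quality issues found in one release date row."""
    issues = []
    earliest, latest = PLAUSIBLE_DATES
    if not earliest <= row["date"] <= latest:
        issues.append("implausible_date")
    season = row.get("season_number")
    if season is not None and season > MAX_PLAUSIBLE_SEASON:
        issues.append("season_looks_like_year")
    return issues


# Hypothetical rows, not project data
sample = [
    {"release_date_id": 1, "date": date(2014, 9, 22), "season_number": 2},
    {"release_date_id": 2, "date": date(1900, 1, 1), "season_number": None},
    {"release_date_id": 3, "date": date(3000, 1, 1), "season_number": 2015},
]
flagged = {row["release_date_id"]: flag_release_row(row) for row in sample}
print(flagged)
```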
Analysis Results: Metadata Staging
General Observations:
• Looked at grain of title, country, language, category, season and episode,
and others
• Records pulled from multiple sources lead to complexity…
– Duplicate release dates within titles
– Conflicting records within titles
Chart: # of Records Ingested by External Source (Release Date): System 1: 51K, System 2: 48K, System 3: 40K, System 4: 14K
Chart: External Sources Per Title (x-axis: 1, 2, 3, 4; bar labels shown: 2,584, 3,417, 352)
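One way to separate the duplicate and conflicting records noted above is to group release dates at a chosen grain and compare the dates within each group. The sketch below is a simplified illustration. The grain columns, the record layout, and the sample values are assumptions, and the actual analysis also considered language, category, and other attributes.

```python
from collections import defaultdict

# Assumed grain; extend with language, category, etc. as needed
GRAIN = ("title_id", "country_id", "season_number", "episode_number")


def classify_release_dates(records):
    """Split grain keys into exact duplicates and conflicting dates."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(col) for col in GRAIN)
        groups[key].append(rec)

    duplicates, conflicts = [], []
    for key, recs in groups.items():
        dates = {rec["date"] for rec in recs}
        if len(dates) > 1:
            conflicts.append(key)   # sources disagree on the date
        elif len(recs) > 1:
            duplicates.append(key)  # same date ingested more than once
    return duplicates, conflicts


# Hypothetical records, not project data
records = [
    {"title_id": 10, "country_id": 1, "season_number": None,
     "episode_number": None, "date": "2015-06-12", "source": "System 1"},
    {"title_id": 10, "country_id": 1, "season_number": None,
     "episode_number": None, "date": "2015-06-12", "source": "System 2"},
    {"title_id": 11, "country_id": 1, "season_number": None,
     "episode_number": None, "date": "2014-01-01", "source": "System 1"},
    {"title_id": 11, "country_id": 1, "season_number": None,
     "episode_number": None, "date": "2014-02-01", "source": "System 3"},
    {"title_id": 12, "country_id": 1, "season_number": None,
     "episode_number": None, "date": "2013-05-05", "source": "System 4"},
]
duplicates, conflicts = classify_release_dates(records)
print("duplicates:", duplicates)
print("conflicts:", conflicts)
```

Duplicates can usually be collapsed safely, while conflicts need a rule for which source to trust.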
Analysis Performed: MariaDB
CONSIDERATIONS
• Overall, this is an analysis of viewership and hits
• Account for matches against official, whitelisted, and licensed videos
• Outliers were not removed because doing so would expunge a large percentage of the match data
• Summary statistics indicated a left-leaning data set
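The trade-off behind keeping outliers can be sketched by measuring how much data a standard rule would discard. The example below uses the common 1.5 × IQR rule, which is an assumption and not necessarily the rule the team evaluated, on hypothetical view counts. It also compares mean and median: if "left leaning" is read as most values clustering at the low end with a long tail of high values, the mean would sit well above the median.

```python
import statistics


def outlier_share(values, k=1.5):
    """Share of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < low or v > high]
    return len(flagged) / len(values)


def mean_median_gap(values):
    """Positive when a long tail of high values pulls the mean up."""
    return statistics.mean(values) - statistics.median(values)


# Hypothetical view counts, not project data
views = [0, 2, 3, 5, 8, 12, 20, 40, 900, 15000]
print("share flagged as outliers:", outlier_share(views))
print("mean minus median:", mean_median_gap(views))
```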
Diagram labels: Title Information, MariaDB, CCP SQL Server, Cassandra, Summarized Copyright Match Information
Histogram axes: X: Viewers per Video, Y: Count of Cases in Bucket
Analysis Results: Hits
Table: SMART_MATCH (copyright match data)
Column Name         Datatype   Nullable  % Non-Null  Standard Deviation  Min         Max
claim_type          varchar    YES       76.77%
asset_name          varchar    YES       92.78%
asset_type          varchar    YES       100.00%
video_title         varchar    YES       76.08%
reference_status    varchar    YES       65.68%
reference_length    int        YES       65.68%      3065.065455         18          18746
content_type        varchar    YES       65.68%
view_count          int        YES       76.08%      919084.1099         0           1.05E+09
duration            int        YES       76.08%      2228.763702         0           192887
video_total_match   int        YES       74.66%      1554.598552         0           38385
channel_title       varchar    YES       76.08%
claim_date          datetime   YES       100.00%                         12/7/2007   11/16/2015
video_upload_date   datetime   YES       76.08%                          8/9/2005    11/16/2015
licensed_content    tinyint    YES       76.08%      0.140196747         0           1
privacy             varchar    YES       76.08%
policy_name         varchar    YES       95.33%
match_percentage    int        YES       56.95%      48.10774725         0           32388
channel_comments    int        YES       55.04%      3625.070764         0           1213995
channel_videos      int        YES       55.04%      1518.373624         0           228113
season              int        YES       10.49%      72.64507792         1           2015
episode             int        YES       10.70%      78.35203659         1           4601
last_updated        timestamp  NO        100.00%                         11/16/2015  11/16/2015
Whitelisted         tinyint    YES       100.00%     0.099045453         0           1
official            tinyint    YES       100.00%     0.076134446         0           1
owner               varchar    YES       100.00%
Data Discrepancy:
• Reference length longer than actual video length
Data Limitation:
• Only most recent upload date is displayed, and the value may actually be the date of publishing or being made public
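The reference length discrepancy can be surfaced with a simple row-level check. The sketch below compares the reference_length and duration columns from SMART_MATCH, assuming both are in the same unit. It also flags match_percentage values outside 0 to 100, an extra check suggested by the profiled maximum of 32388 rather than one named on the slide. The `id` key and the sample rows are hypothetical.

```python
def find_discrepancies(rows):
    """Return (row id, issue) pairs for suspicious SMART_MATCH rows."""
    issues = []
    for row in rows:
        ref_len = row.get("reference_length")
        duration = row.get("duration")
        pct = row.get("match_percentage")
        # Assumes reference_length and duration share a unit
        if ref_len is not None and duration is not None and ref_len > duration:
            issues.append((row["id"], "reference_longer_than_video"))
        # Extra check, not from the slide: percentages should be 0 to 100
        if pct is not None and not 0 <= pct <= 100:
            issues.append((row["id"], "match_percentage_out_of_range"))
    return issues


# Hypothetical rows, not project data
matches = [
    {"id": "a", "reference_length": 1200, "duration": 1300, "match_percentage": 92},
    {"id": "b", "reference_length": 2600, "duration": 300, "match_percentage": 100},
    {"id": "c", "reference_length": None, "duration": 45, "match_percentage": 32388},
]
issues = find_discrepancies(matches)
print(issues)
```

NULL values are skipped rather than flagged, since roughly a third of reference_length values are not populated.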
Key Findings & Recommendations
• Finding: Gaps in metadata make it difficult to understand and utilize collected data effectively
– Recommendation: Streamline the metadata gathering and cleaning process, leveraging other metadata systems
• Finding: Daily quotas and thresholds limit and distort the data pulled
– Recommendation: Selectively pull data to circumvent daily quotas and potentially improve data integrity
• Finding: Data integrity from some sources is questionable, and incentives to improve it vary
– Recommendation: Improve data processes, e.g., add data cleaning to certain data extract and aggregation (ETL) processes
• Finding: Brand-specific workflows and fringe use cases hinder the ability to acquire metadata & accurately map references
– Recommendation: Build a roadmap of brand and title match data cleanup for reporting needs, and a process to maintain data integrity
Challenge categories: Data Challenges, Tech Challenges, Organizational Challenges
Data Project Principles & Pitfalls
Maintenance is the Monster
• Initial creation of data solutions is often easier than long-term maintenance
Common issues
• Rapidly changing platforms, frameworks, and methodologies
• Need for continuous maintenance and verification of data quality
• Incentives and cultures vary across departments and companies
• Establishing and disseminating a “data stewardship” mentality
• Data “truth” changes over time and by business need
• Ongoing changes in individual consumer behavior, options for copyright owners
Issue types: Technical, Non-Technical
Go Forward Plan
NBCU Next Steps
• Increased focus on and automation of data matching & data cleanup
• Enable better business unit segmentation of enterprise data
• Transition from organic to directed architecture
• Increased internal outreach

Big Data Day LA 2016 / Use Case Driven track - Reliable Media Reporting in an Ever-changing Data Landscape
Rachel Kelley, Project Manager, and Josh Andrews, Data & Analytics Architect, OnPrem; Eric Avila, Senior Anti-Piracy Technologist, NBCUniversal
