2. Agenda
• Who is Time Warner Cable & Time Warner Cable Media
• What is Audience Measurement?
• Challenges With Legacy Architecture
• Next Generation Architecture
• Lessons Learned
3. Who Am I?
• 10+ years in E-commerce
• Focused on Data Warehousing for the last 5 years
• Certifications
– Cloudera Certified Administrator for Apache Hadoop (CCAH)
– Cloudera Certified Developer for Apache Hadoop (CCDH)
– Teradata Certified Professional
– IBM Certified Specialist - PureData System for Analytics
– Tableau Server Certified Professional
– MicroStrategy Certified Engineering Principal
– Certified ScrumMaster
– Certified SAFe Agilist
• College sports fan – Go Noles!
4. Time Warner Cable & Time Warner Cable Media
Time Warner Cable is among the largest providers of video, high-speed data and voice services in the U.S., connecting more than 15 million customers to entertainment, information and each other.
• Serves customers in 29 states
• More than 50,000 employees across the U.S.
Time Warner Cable Media, the advertising arm of Time Warner Cable, provides national, regional and local marketers and agencies with innovative, strategic and cost-effective advertising solutions.
5. What is Audience Measurement?
The Audience Measurement platform enables census reporting of subscriber viewership and allows us to answer the Five W’s:
Who is watching?
– Anonymized demographics, consumer behaviors
What are they watching?
– Station, program information, advertisements
When are they watching?
– Day of week, daypart, time-shifted
Where are they watching?
– Set-top box, TWC TV apps
Why are they watching?
– Program metadata
6. Audience Measurement By the Numbers
Viewership Data
• Set-top box
– Processing more than 500 million events per day
– Largest table is Program Tuning Event Fact: 75 TB of raw data, 180+ billion records
• TWC TV app (iPad, iPhone, Android, Xbox, etc.)
• Video On Demand (VOD)
Ads Data
• TWC Media and 3rd party spots
Reference Data
• Household demographics
• Program data
• Automotive data
• Political affiliation
Users
• 5 Heavy Analytical Users
• 200 Audience Finder Users
• 50 Tableau Consumers
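To put the set-top box figures in perspective, the short calculation below derives an average ingest rate and an average raw record size from the numbers on this slide. It assumes events arrive evenly across the day and that "TB" means decimal terabytes; neither assumption comes from the slides, so the results are rough orders of magnitude only.

```python
# Figures quoted on the slide.
EVENTS_PER_DAY = 500_000_000   # "more than 500 million events per day"
RAW_BYTES = 75 * 10**12        # 75 TB, assuming decimal terabytes
RECORDS = 180 * 10**9          # "180+ billion records"

SECONDS_PER_DAY = 24 * 60 * 60

# Average sustained rate if events were spread evenly over 24 hours.
# This is a lower bound, since the slide says "more than" 500 million.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# Average raw size per record in the largest table.
# This is an upper bound, since the slide says "180+" billion records.
bytes_per_record = RAW_BYTES / RECORDS

print(f"{events_per_second:,.0f} events per second")
print(f"{bytes_per_record:,.0f} bytes per record")
```

Real traffic is unlikely to be uniform, so peak rates would be higher than the daily average.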
7. Video Viewership Analyzer (VVA)
• Around 100 Tableau Workbooks
– Authored by the business and IT
• Numerous ad hoc queries
8. Audience Finder
• Custom application that enables complex audience definition by the user community
– Date range
– Geography (DMA, Ad Zone or Zip Code)
– Platform (Classic, IPTV)
– Audience Definition
  • Daypart
  • Station and/or Program
  • Demographics (includes line-of-business, propensities, Tribes and automotive)
  • Platform usage (VOD, IPTV, high-speed data)
  • Custom segmentation
• Output includes ranked list of stations and some high-level metrics
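The idea of defining an audience and getting back a ranked list of stations can be sketched in a few lines. This is only an illustration: the `TuningEvent` fields, the `AudienceDefinition` class and the `rank_stations` function are hypothetical names, the sample data is invented, and the date-range and custom-segmentation criteria are left out for brevity. It does not describe how the actual application is built.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningEvent:
    household_id: str  # anonymized household key (hypothetical field)
    station: str
    daypart: str
    dma: str           # geography, here a DMA code
    platform: str      # "Classic" or "IPTV"
    seconds: int       # viewing duration


@dataclass(frozen=True)
class AudienceDefinition:
    dma: str
    platform: str
    dayparts: frozenset
    households: frozenset  # households that match the demographic criteria

    def matches(self, event: TuningEvent) -> bool:
        return (
            event.dma == self.dma
            and event.platform == self.platform
            and event.daypart in self.dayparts
            and event.household_id in self.households
        )


def rank_stations(events, audience):
    """Return (station, total_seconds) pairs, most-watched first."""
    totals = Counter()
    for event in events:
        if audience.matches(event):
            totals[event.station] += event.seconds
    # Sort by seconds descending, then by name so ties are deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sample data.
events = [
    TuningEvent("h1", "Station A", "prime", "DMA-1", "IPTV", 1800),
    TuningEvent("h2", "Station B", "prime", "DMA-1", "IPTV", 600),
    TuningEvent("h1", "Station B", "prime", "DMA-1", "IPTV", 900),
    TuningEvent("h3", "Station A", "prime", "DMA-1", "IPTV", 3600),    # not in audience
    TuningEvent("h2", "Station A", "daytime", "DMA-1", "IPTV", 1200),  # wrong daypart
]
audience = AudienceDefinition(
    dma="DMA-1",
    platform="IPTV",
    dayparts=frozenset({"prime"}),
    households=frozenset({"h1", "h2"}),
)
ranking = rank_stations(events, audience)
print(ranking)
```

Keeping the audience definition as a plain value object makes it easy to store, share and re-run a definition against a different set of events.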
11. Legacy Platform Architecture
• 3rd Party application ingests raw data and performs anonymization, correlation and some enrichment/mediation
• TWC ingests files provided by 3rd Party and performs additional enrichment as well as applying business rules and stitching logic
– Executed in Netezza using SQL and shell scripts
• Two Netezza appliances
– TwinFin 36 used for ELT processing
– TwinFin 72 used for BI and customer-facing workloads
Diagram labels: Source Data; Core Logic (Anonymization, Correlation, Mediation, Enrichment); TWC Media Business Logic (Stitching, Filtering, Zombie Logic)
12. Challenges With Legacy Architecture
Collection
• Inconsistency around reliability and availability of source and reference data
Processing
• Slow catch-up process
• Architecture does not promote speed to market for new features
Data Storage + Delivery
• Platform instability
• Does not support concurrent users
Analysis + Presentation
• Limited exploration and interactive capabilities
Metrics to Assess
• SLAs for T-3 and T-14
• Frequency of reprocessing
• Reference data quality
• Duration of reprocessing
• Team velocity when introducing ETL changes
• Platform availability
• Query response times
• Response time SLAs during mixed workload
• User satisfaction with the interface
• Customer dependency on IT for changes
13. Technical Criteria
• Performance
• Supports batch and streaming
• Leverage software engineering patterns
• Open source momentum
• “-ilities”
– Scalability
– Elasticity
– Availability
– Durability
– Extensibility
• Enables DevOps to complement Agile adoption
– Automated testing
– Test-driven Development (TDD)
– Continuous Integration (CI)
• Strong foundation for Data Lake
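The automated testing and TDD criteria are easiest to meet when processing logic is written as small pure functions that can be tested without a cluster. The sketch below shows the shape of such a test. The `drop_short_tunes` rule and its 5-second threshold are hypothetical, chosen only to have something to test; the slides do not define any specific filtering rule.

```python
import unittest


def drop_short_tunes(events, min_seconds=5):
    """Keep only tuning events lasting at least `min_seconds`.

    Hypothetical business rule used for illustration only.
    """
    return [e for e in events if e["seconds"] >= min_seconds]


class DropShortTunesTest(unittest.TestCase):
    def test_removes_events_below_threshold(self):
        events = [
            {"station": "A", "seconds": 2},
            {"station": "B", "seconds": 30},
        ]
        self.assertEqual(
            drop_short_tunes(events), [{"station": "B", "seconds": 30}]
        )

    def test_keeps_event_exactly_at_threshold(self):
        events = [{"station": "A", "seconds": 5}]
        self.assertEqual(drop_short_tunes(events), events)

    def test_empty_input_gives_empty_output(self):
        self.assertEqual(drop_short_tunes([]), [])


def run_suite():
    """Run the tests programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DropShortTunesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a continuous integration setup, a suite like this would run on every commit, for example through `python -m unittest`.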
14. Technologies Selected
Data Integration
Apache Spark is a more appropriate solution for set-top box processing logic:
• Reduces complexity, simplifies code maintenance, improves defect resolution time and improves run-time.
• Can be applied in batch or near real-time with modest changes, which positions us for T-x data availability (where ‘x’ is limited only by the availability of reference data).
• Enables use of Agile development principles (test-driven development and continuous integration), thereby improving time-to-market and code quality and radically reducing QA costs and time.
Hadoop
Hadoop/HDFS for storing large historical data positions the organization to leverage the evolving open source big data analytics technologies (machine learning, SQL on Hadoop, graph processing, etc.).
Data Warehouse
Teradata will allow for large volumes of tuning event data to be secure, easily accessible, and highly available to large numbers of users and at reasonable cost.
Visualization
Tableau enables self-service analytics, including advanced algorithms, against the audience measurement data, then presents information to various consumers in meaningful ways.
Event Persistence
Kafka is a high-performance, fault-tolerant, real-time messaging platform that will allow us to keep a history of tuning events for faster reprocessing. This component is critical once we are performing near real-time streaming of events.
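The point about keeping a history of tuning events for faster reprocessing can be illustrated with a toy, in-memory log. This is not the Kafka API: `EventLog` and `count_by_station` are hypothetical names, and the sketch only shows the underlying idea that records kept in an append-only log addressed by offset can be read again from any earlier position when the processing logic changes.

```python
class EventLog:
    """Append-only, in-memory log addressed by integer offsets (toy model)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Yield (offset, record) pairs starting at `offset`."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


def count_by_station(log, start_offset=0, normalise=lambda s: s):
    """Count events per station, reading the log from `start_offset`."""
    counts = {}
    for _, event in log.read_from(start_offset):
        station = normalise(event["station"])
        counts[station] = counts.get(station, 0) + 1
    return counts


log = EventLog()
for station in ["A", "b", "A"]:
    log.append({"station": station})

# First pass with the original logic.
first_pass = count_by_station(log)

# The logic is corrected (station names are upper-cased), so the retained
# history is replayed from offset 0 instead of re-collecting the source data.
reprocessed = count_by_station(log, start_offset=0, normalise=str.upper)

# Incremental processing simply starts from a later offset.
latest_only = count_by_station(log, start_offset=2)
```

Because the log is never modified in place, the same events can feed both the original and the corrected logic.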
15. Initial Nextgen Architecture
• Replace MicroStrategy with Tableau to enable self-service
• Replace Netezza for customer-facing workloads with Teradata, improving platform stability, enabling sandboxes (e.g., Data Labs) and workload management tools which assist in managing to performance SLAs
• Replace 3rd Party application and Netezza ELT with Spark for Collection and Processing logic (anonymization, correlation, enrichment, filtering, stitching, and zombie logic)
Diagram labels: Source Data; Core Logic (Anonymization, Correlation, Mediation, Enrichment); TWC Media Business Logic (Stitching, Filtering, Zombie Logic)
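As one example of what moving the processing logic into ordinary code might look like, the sketch below replaces a raw device identifier with a keyed hash. The slides do not say how anonymization is actually performed, so this technique (HMAC-SHA256 pseudonymization), the `device_id` and `device_token` field names and both function names are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(device_id: str, secret_key: bytes) -> str:
    """Return a stable token for a device identifier using HMAC-SHA256."""
    return hmac.new(
        secret_key, device_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def anonymize_events(events, secret_key):
    """Return copies of the events with `device_id` replaced by a token."""
    result = []
    for event in events:
        cleaned = {k: v for k, v in event.items() if k != "device_id"}
        cleaned["device_token"] = pseudonymize(event["device_id"], secret_key)
        result.append(cleaned)
    return result


# Invented sample data; a real key would come from a secrets store.
sample = [
    {"device_id": "stb-1", "station": "A"},
    {"device_id": "stb-1", "station": "B"},
    {"device_id": "stb-2", "station": "A"},
]
anonymized = anonymize_events(sample, secret_key=b"example-key")
```

The same device always maps to the same token, so later steps such as correlation can still join events from one device without seeing the raw identifier.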
16. Long-Term Architecture
• Implement an enterprise Data Lake to enable non-Media use cases
• Migrate to Spark Streaming and Kafka to enable near real-time use cases
• Evaluate dedicated infrastructure for more predictable performance
Diagram labels: Event Data; Reference Data; Data Lake (Anonymization, Correlation, Mediation, Enrichment); Business Logic (Stitching, Filtering, Zombie Logic)
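Spark Streaming handles a stream as a sequence of small batches, which is part of why batch logic can be reused for near real-time work. The plain-Python sketch below shows that grouping step on its own. It is not Spark code: `micro_batches` and the `ts` field (event time in seconds) are hypothetical, and the input is assumed to be already ordered by time.

```python
from itertools import groupby


def micro_batches(events, interval_seconds):
    """Group time-ordered events into consecutive fixed-length windows.

    Yields (window_start, events_in_window) pairs. Events must already
    be sorted by their "ts" value.
    """
    def window_start(event):
        return event["ts"] - event["ts"] % interval_seconds

    for start, group in groupby(events, key=window_start):
        yield start, list(group)


# Invented sample stream: timestamps in seconds.
stream = [{"ts": 0}, {"ts": 3}, {"ts": 10}, {"ts": 12}, {"ts": 25}]
batches = list(micro_batches(stream, interval_seconds=10))
for start, batch in batches:
    # Each batch can be handed to the same function used for batch jobs.
    print(start, len(batch))
```

A shorter interval lowers latency but produces more, smaller batches to schedule.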
17. Lessons Learned
• Have Executive support
• Infrastructure is critical
– Node sizes
– Network
• Leverage the open source community
– Enhancements
– Extensions (Spark Packages)
• Talent is hard to find
– Consider abstractions