Applying Big Data To TV Advertising
1. Early Lessons Learned in Applying Big Data To TV Advertising
IAB ITV for Agencies Day
Dave Morgan, CEO, Simulmedia
2. About Us
Who We Are: We are a New York based start-up. We are venture backed by Avalon Ventures, Union Square Ventures and Time-Warner.
Where We Have Been: Our 35 person team has veterans of:
What We Believe: Television is still the most powerful advertising medium in the world. While addressability will come, we’re not waiting for it. We’ve taken a few strategies we learned from the Internet and are applying them to linear TV advertising, today.
How We Do It: Through partnerships with major data providers, we have assembled the world’s largest set of actionable television data.
How We Make Money: We sell television advertising. With inventory in over 106 million US households, we can cost-effectively extend reach into high-value target audiences across virtually any advertiser category. We use big data and science to do this.
3. Why Did We Leave The Web?
Television remains the dominant consumer medium
(a) Nielsen US TV Viewing Audience. Traditional Live-Only TV based on average monthly viewing during 1Q2011. Internet and Online Video based on average monthly consumption during July 2011.
Video on Demand based on consumption during May 2011.
6. Campaign Reach Is Declining
Impossible for measurement and planning tools to keep pace
Source: Simulmedia analysis of data from SQAD, Nielsen and TVB
8. Big Data Is Driving Growth
“We are on the cusp of a tremendous wave of
innovation, productivity and growth, as well as
new modes of competition and value-capture –
all driven by Big Data.”
- McKinsey Global Institute, May 2011
“For CMOs, Big Data is a very big deal.”
- Alfredo Gangotena, CMO, Mastercard, July 2011
16. But Big Data Is More Than Size
BIG DATA
What happened? -> Why did it happen? -> What’s going to happen next?
Time: Past -> Future
Focus: Reporting -> Prediction
Supports: Human decisions -> Machine decisions
Data: Structured, Aggregated -> Unstructured, Unaggregated
Human Skills: Dashboards, Excel -> Discovery, Visualization, Statistics & Physics
17. Accelerating The Push To Big Data
Hadoop, cloud computing, Facebook, Yahoo,
quants, Bittorrent, machine learning, Stanford,
large hadron collider, Wal-Mart, text
processing, Amazon S3 & EC2, open source
intelligence, NoSQL, social media, Google,
commodity hardware, Hive, fraud detection,
trading desks, MapReduce, natural language
processing
18. What Can It Mean For TV Advertising?
Big data drove the rise of web & search advertising
• Accumulation of high volume of direct measurement
of media consumption
• Better predictions about consumer interests
• Real time return path
• Automation
• Interim step for addressability
• More diligence around consumer privacy
• Media buyers and sellers rethinking their approach to
audience packaging, campaign planning, technology,
data assembly and people
19. Post Modern Architecture
Have we reached the limits of classic data storage architecture?
Data Warehouses
• Yahoo!: 700 TB (1)
• Australian Bureau of Statistics: 250 TB (1)
• AT&T: 250 TB (1)
• Nielsen: 45 TB (1)
• Adidas: 13 TB (1)
• Wal-Mart: 1 PB (2)
Data Lakes
• Facebook: 30 PB (3) (7x compression)
• Yahoo: 22 PB (4)
• Google: ???
(1) Oracle F1Q10 Earnings Call September 16, 2009 Transcript
(2) Stair, Principles of Information Systems, 2009, p 181
(3) Dhruba Borthakur, Facebook, December 2010, http://www.facebook.com/note.php?note_id=468211193919
(4) Simulmedia estimate
20. Our Idea of Big Data
Bringing the data set together in a single platform
Set Top Boxes
• 17+ million boxes
• Completely anonymous viewing
• Live
• DVR
• VOD
• Pay channels
Program
• 3 different sets of schedule data
• Proprietary metadata
Public
• US census
• Military
• Business
Ad Occurrence
• What ads ran?
• Where did they run?
Client Proprietary
• Business Development Indices (BDI)
• Commercial Development Indices (CDI)
• Regional sales data
Nielsen Ratings
• All Minute Respondent Level Data (AMRLD)
Our (comparatively modest) data set:
• 200 tb (approx. 7x compression)
• 113,858,592 daily events
• Approximately 402,301 weekly ads
• Double capacity every 6 months
…And we don’t load every data point across all data sets, yet
21. Rethinking Media Data Architecture
Applying big data to television required us to rethink what our
technical architecture should be
Commodity Hardware
• No clouds allowed (ISO compliance)
• Expect hardware failure
Open Source Software
• Learn from those who have done it
• Participate in the Open Source community
Write Your Own Software
• ELT (Extract, Load, Transform)
• Meddle
Science
• Machine learning
• Advanced statistical techniques
• Experimentation
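The ELT bullet above (Extract, Load, Transform) could look roughly like the sketch below: raw rows are loaded untouched and only then cast and aggregated inside the data store. This is a minimal illustration, not Simulmedia's pipeline. The table names, column names and the use of SQLite are all assumptions made for the example.

```python
import sqlite3


def run_elt(raw_rows):
    """Load raw rows as-is, then transform them inside the database.

    raw_rows: iterable of (box_id, network, seconds) tuples, all strings.
    The schema is hypothetical and exists only for this illustration.
    """
    db = sqlite3.connect(":memory:")
    # Load: store the extracted events exactly as they arrived, with no cleaning.
    db.execute("CREATE TABLE raw_events (box_id TEXT, network TEXT, seconds TEXT)")
    db.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_rows)
    # Transform: cast and aggregate with SQL after the data has landed.
    db.execute(
        "CREATE TABLE viewing_by_network AS "
        "SELECT network, SUM(CAST(seconds AS INTEGER)) AS total_seconds "
        "FROM raw_events GROUP BY network"
    )
    result = dict(db.execute("SELECT network, total_seconds FROM viewing_by_network"))
    db.close()
    return result
```

Keeping the raw table around means a later transform can be rewritten and rerun without re-extracting anything.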
23. The People We Needed
A different approach required different skill sets
• New core skills for everyone in the company
• Pattern recognition
• Visualization
• Technology
• Experimentation
• Where do you find hard to find tech skills?
• You don’t find them. You make them.
• A dedicated Science team
• Non traditional researchers (Brain imaging, bioinformatics,
economic modeling, genetics)
• People who watch a lot of television
25. Some Things To Know, First
• Live viewing unless otherwise noted
• Time shifting lessons is a whole other presentation
• Time shifting + live viewing lessons is a whole other other presentation
• Video on demand is a whole other other other presentation
• We name names and provide numbers where clients and data
partners permit
• Client confidentiality is important to us
• None of this work would’ve been possible without the help of
our clients and partners
Read me…
This box will contain important information about the graphs on each page.
26. 60% of TV Viewers Watch
90% of TV
Highly Confidential
27. Where The Other 40% Are
[Chart: ratio of Heavier Viewer impressions to Lighter Viewer impressions, by network]
Vertical: Ratio of Heavy Viewers to light viewer impressions.
Horizontal: Low rated to highly rated networks.
Call outs: Ratio is the number of Heavier Viewer impressions you would deliver to reach a Lighter Viewer on a given network.
Networks with relatively fewer lighter viewer impressions:
• TCM: 13.6
• HALLMARK: 13.7
• ADSWIM: 14.0
• NICKNITE: 14.3
• CNBC: 15.7
• FOX NEWS: 18.0
Networks with relatively more lighter viewer impressions:
• OXYGEN: 7.4
• WE: 7.6
• PLANET GREEN: 7.7
• OVATION: 7.8
• STYLE: 7.8
• MTV2: 7.8
• SUNDANCE: 7.9
• IFC: 7.9
Sources: Nielsen & Simulmedia’s a7
28. Where The Other 40% Are
To capture light viewers, media planning and measurement
tools must quickly apply new methods to emerging data sets
30. When Data Goes Missing
Automation of error checking/quality control is essential
Reuse the data to solve other problems
Occasionally observe missing data
Three choices:
• Pick up the phone
• Estimate missing fields
• Work around the missing data
Time series of SYFY network. 10645 observations from 2010.02.28 at 7:00pm Eastern to 2010.10.14 at 12:30pm Eastern
Source: Simulmedia’s a7
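The "estimate missing fields" choice can be illustrated with a small sketch that finds holes in a regularly spaced time series and fills them by linear interpolation. The half-hour grid and the choice of linear interpolation are assumptions for the example. The slide does not say which estimation method was used.

```python
from datetime import datetime, timedelta


def fill_gaps(observations, step=timedelta(minutes=30)):
    """Fill holes in a regular time series by linear interpolation.

    observations: dict mapping datetime -> numeric value, with some grid
    points missing. The 30-minute step is an assumed default.
    Returns a list of (timestamp, value, was_estimated) tuples.
    """
    times = sorted(observations)
    if not times:
        return []
    filled = []
    for prev, nxt in zip(times, times[1:]):
        filled.append((prev, observations[prev], False))
        # Number of grid points that should exist between two observations.
        missing = round((nxt - prev) / step) - 1
        for k in range(1, missing + 1):
            weight = k / (missing + 1)
            estimate = observations[prev] + weight * (observations[nxt] - observations[prev])
            filled.append((prev + k * step, estimate, True))
    filled.append((times[-1], observations[times[-1]], False))
    return filled
```

The `was_estimated` flag keeps imputed points distinguishable from observed ones, which matters when the same data is later reused for other problems.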
33. The Revolution of Simple Methods
More data beats better algorithms.
The best performing algorithm underperforms the worst algorithm when the worst is given an order of magnitude more data.
Simple algorithms at very large scale can help better predict audience movement.
Peter Norvig | Internet Scale Data Analysis | June 21, 2010
Original graph sourced from: Banko & Brill, 2001. Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing
34. Packaging Reach
Very large data sets better predict TV audience movements
Peter Norvig | Internet Scale Data Analysis | June 21, 2010
35. The Cost Of More Data
More data drives better results but there are costs
• All data online. All the time.
• Less expensive hardware
• Extremely flexible

• All data online. All the time.
• More expensive talent
• Physicists & statisticians ain’t cheap
• Hard to find programmers
• Not everything meets your needs
• Evolving technologies in mission critical functions
36. The Data Isn’t Biased Just
Because It Comes From A
Set Top Box
37. Applying Simple Methods At Scale
High correlation of a7 measures and Nielsen estimates.
Either bias is insignificant or Nielsen data and our data share the same bias.
Multiple methods yield similar results
Regression analysis of Nielsen Household Cume Rating against Simulmedia’s a7 cume rating. 20 Primetime Network shows with HAWAII FIVE-0. Fall 2010.
Sources: Nielsen & Simulmedia’s a7
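A regression of one rating against another, and the r2 value that summarizes the fit, can be sketched as below. This is ordinary least squares on a single predictor, written out by hand. The function name is hypothetical, and no ratings from the deck are reproduced.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    x could hold a7 cume ratings and y the Nielsen cume ratings for the
    same shows (an assumed pairing for illustration).
    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot
```

A high r2 alone does not separate "no bias" from "shared bias", which is why the slide states both possibilities.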
38. And Then We Kept Going
We measured program Tune-In, Spot Tune-In, Campaign Reach, Campaign Rating using multiple slices of our data set using two different sample sets and time frames
How we sliced it
• Entire a7 data set
• Cross correlated individual data sets contained in a7 aggregate data set
• Aggregate cross geographies (DMA to DMA)
Two samples
1. Sample 1: Fall 2010: 20 Primetime broadcast series launches + promos
2. Sample 2: Jan 2011: 15 Primetime cable series premieres + promos (Plus one multi-season/year primetime broadcast premiere + promos)
• Hand selected programs
• Mix of genres
• Mix of new vs. returning shows
Observations
• Sample 1 average r2>0.85
• Sample 2 average r2>0.93
40. Closing The Loop On Program Promotion
Spring 2010 broadcast premiere promotion.
Horizontal: Left to right moves back in time. 0 is the premiere time.
Vertical: Conversion rate is measured in percent.
Size of the bubble represents total conversions for a given spot.
Sources: Simulmedia’s a7
41. Closing The Loop On Program Promotion
42. Closing The Loop
Long held beliefs and rules of thumb in planning may or may
not be supported by data
TV marketers now have more options for show promotion
44. Time Series: Broadcast: CBS
60 networks. High correlation between Nielsen large sample measurement and a7 measures
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
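Z score plots put two series with different units on a common scale, so that their shapes can be compared directly. A minimal sketch of that standardization follows. The use of the population standard deviation is an assumption, since the deck does not say which variant was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a series: subtract its mean, divide by its standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        # A flat series has no variation to scale by.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Two hourly series that move together produce nearly overlapping z score lines even when one is a raw box count and the other a rating estimate.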
45. Time Series: Broadcast: Fox
46. Time Series: Broadcast: ABC
47. Time Series: Cable: Investigation Discovery
48. Time Series: Cable: Golf
49. Time Series: Cable: Bravo
50. Time Series: Cable: ESPN2
51. Time Series: Cable: Speed
53. When You Look Closer
54. High Frequency Time Series: ABC Family
Volatility in dayparts, low rated networks, demographics….
Unrated networks “don’t exist.” Did NOT look at local.
Sample graph from High Frequency (Second and Minute level) Time Series Analysis of 45 networks on January 19th 2011.
Legend: a7 = Simulmedia a7 Sample (Second by Second to Minute); Nielsen = Nielsen Sample (Minute by Minute)
Sources: Nielsen & Simulmedia’s a7
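Comparing a second-by-second series with a minute-by-minute one requires rolling the finer series up to minutes first. The sketch below does that by averaging. Averaging, as opposed to taking a peak or a last value, is an assumption, and the input layout is hypothetical.

```python
from collections import defaultdict


def to_minutes(second_level):
    """Roll a second-level series up to minute level by averaging.

    second_level: iterable of (second_of_day, value) pairs.
    Returns a dict mapping minute_of_day -> mean value in that minute.
    """
    buckets = defaultdict(list)
    for second, value in second_level:
        buckets[second // 60].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}
```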
56. Gender Driven Geographic Variation
Viewing by zip code among women across markets is more varied than among men in the same zip codes
Panels: Women 18-54, Men 18-54
Fraction of view time for ages 18-54 as fraction of view time for all TV viewers. Week 2 vs. the same fraction for week 1 (last two weeks in January). Three markets: Philadelphia (blue), Atlanta (red) and Chicago (green). Each point represents a zip code in one of these markets.
Source: Simulmedia’s a7
57. Gender Driven Geographic Variation
Planning tactics for female targeted campaigns should be different than male targeted campaigns
PS…Also a good case for geo based creative versioning
59. Privacy By Design
• All marketing data companies need to
care
• Make consumer privacy protection part
of the business from the beginning
• Anonymous, aggregated data only
• No personal data or data that can
be related to particular individuals
or devices
• Broad marketing segmentations,
not profiling
• No sensitive data
Don’t be creepy
61. Fragmentation Effects On Frequency
Each segment was above 70% reach but the frequency distribution was nearly identical
Percent of audience reached for major animated motion picture campaign 2011. Two weeks prior to release. Each stacked bar is a different audience segment. Each color within the stacked bar represents the frequency of ad view for each segment.
Source: Nielsen & Simulmedia’s a7
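Reach and the frequency distribution shown in a stacked bar can both be derived from a log of ad exposures. A minimal sketch follows, assuming one record per ad view keyed by an anonymous viewer id. The function name, the cap on frequency buckets and the input layout are illustrative only.

```python
from collections import Counter


def reach_and_frequency(exposures, audience_size, cap=5):
    """Compute reach and the frequency distribution for one audience segment.

    exposures: iterable of viewer ids, one entry per ad view.
    audience_size: number of viewers in the segment.
    cap: views at or above this count share one bucket (assumed bucketing).
    Returns (reach, shares) where shares[f] is the fraction of the segment
    that saw the ad f times.
    """
    views_per_viewer = Counter(exposures)
    distribution = Counter(min(n, cap) for n in views_per_viewer.values())
    reach = len(views_per_viewer) / audience_size
    shares = {f: distribution.get(f, 0) / audience_size for f in range(1, cap + 1)}
    return reach, shares
```

The shares sum to the reach, so each stacked bar tops out at the percent of that segment reached.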
62. Fragmentation Effects On Frequency
Fragmentation is affecting all high reach campaigns.
Percent of audience reached for insurance advertisers September to October 2010. Approximately 8000 ads. Each stacked bar is a different audience segment. Each color within the stacked bar represents the frequency of ad view for each segment.
Source: Nielsen & Simulmedia’s a7
63. Fragmentation Effects On Frequency
The TV advertising market can’t continue to support this
64. 40% Of The Audience Is
Getting 85% Of The
Impressions
65. Fragmentation Rears Its Head Again
Campaign impressions increasingly concentrated against heavy viewers.
[Stacked bar: Total US Television Audience, by quintile]
Average Frequency Per Quintile | % of Total Impressions Per Quintile
0.0 | 0.0%
1.4 | 3.6%
4.3 | 10.8%
9.1 | 23.0%
24.8 | 62.6%
Percent of audience reached for a different major animated motion picture campaign 2011. Two weeks prior to release. The stacked bar represents quintiles. Blue labels are average frequency per respective quintile. Red labels are % of total campaign impressions by respective quintile.
Source: Nielsen & Simulmedia’s a7
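The quintile shares on this slide can be added up to check how concentrated the campaign is. The short sketch below sums the impression shares of the two heaviest quintiles, which together make up 40% of the audience. Ordering the quintiles from heaviest to lightest by their impression share is an assumption about how the chart is laid out.

```python
# Share of total campaign impressions per quintile, heaviest first (from the slide).
impression_share = [62.6, 23.0, 10.8, 3.6, 0.0]


def cumulative_share(shares, top_n):
    """Percent of impressions delivered to the top_n heaviest quintiles."""
    return round(sum(shares[:top_n]), 1)


top_40_percent = cumulative_share(impression_share, 2)
print(top_40_percent)  # 85.6
```

The result lines up with the earlier headline that 40% of the audience is getting about 85% of the impressions.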
68. Choices
• If fragmentation is causing declining campaign reach and
frequency imbalances, marketers must make choices.
• Reduce reach
• Do nothing
• Use other channels
• Stabilize or improve reach
• Re-aggregate audiences using big data
What do you think?
69. Jack Smith
jack@simulmedia.com
@simulmedia
@jkellonsmith
70. About Our Science Team
• Krishna Balasubramanian, Chief Scientist
• Previously: Chief Scientist, Tacoda. Chief Scientist, Real Media.
• Doctoral Candidate, Physics. (Condensed Matter Physics) The Ohio State University
• MS, Computer & Information Systems. The Ohio State University
• MSc, Physics. Indian Institute of Technology, Kanpur
• Yuliya Torosjan, Scientist
• Previously: Clinical Research (Brain Imaging), Mount Sinai College of Medicine
• MA, Statistics. Columbia University
• BSE, Computer Science & Engineering. University of Pennsylvania
• BA, Psychology. University of Pennsylvania
• Mario Morales, Scientist
• Previously: Lecturer, Bioinformatics, New York University. Senior Consultant, Weiser LLP.
• MS, Statistics. Hunter College
• MS, Bioinformatics. New York University
• Dr. Sidd Mukherjee, Scientist
• Previously, Visiting Scholar (Atomic Scattering experiments), The Ohio State University
• Post doctoral research, Heat capacity of Helium-4. Pennsylvania State University
• PhD, Physics. (Thesis: Measurements of Diffuse and Specular Scattering of 4He Atoms from
4He Films), Ohio State University
• MS, Computer & Information Systems. The Ohio State University
• BSc, Physics & Mathematics. University of Bombay
Editor's Notes
The revolution will be televised.
Audience fragmentation is going from bad to worse.
This fragmentation is wrecking effective campaign reach and creating a massive frequency imbalance.
Audience re-aggregation will be key for brand advertisers to maintain scale.
TV is not going to the web. The web is going to television.
The Huntington copy is one of eleven surviving copies printed on vellum, and one of three such copies in the United States. An additional thirty-six copies printed on paper also survive.
Our claim of the world's largest actionable set of TV viewing data at 75tb would be hard for anyone to challenge. The fact that we link schedule information, set-top box data and ratings data makes it even more difficult to challenge. The most interesting discovery was that we're 3x larger than Nielsen's biggest single instance transactional datastore. (Netezza has similar kinds of multiplying factors as our data storage scheme, Hadoop.)

The Numbers:
Wal-Mart: 1 petabyte (800 million transactions/day across 7000 stores globally) (3) (This is probably in a combination of HP Neoview and Teradata.)
Yahoo!: 700 terabytes (1) (Doesn't include their Hadoop cluster which is approx 15 petabytes.)
Australian Bureau of Statistics: 250 terabytes (1)
AT&T: 250 terabytes (1)
AC Nielsen: Largest single instances: Netezza: 20 tera, Oracle: 10 tera (500 terabytes TOTAL in Netezza, 45 tera in Oracle). Most are distributed databases with client data. (1)(2)
Adidas: 13 terabytes
Largest Hadoop cluster (4):
Facebook: 30 petabytes of storage

The fine print
NOTES:
(1) From Oracle F1Q10 Earnings Call September 16, 2009 5:00 pm ET Transcript (Charles E. Phillips Jr.)
Yahoo!: 700 terabytes
Australian Bureau of Statistics: 250 terabytes
AT&T: 250 terabytes
AC Nielsen: 45-terabyte data [mart], they called it
Adidas: 13 terabytes
(2) DBMS2: September 29, 2009. What Nielsen really uses in data warehousing DBMS
In its latest earnings call, Oracle made a reference to The Nielsen Company that was, to put it politely, rather confusing. I just plopped down in a chair next to Greg Goff, who evidently runs data warehousing at Nielsen, and had a quick chat. Here’s the real story.
The Nielsen Company has over half a petabyte of data on Netezza in the US. This installation is growing.
The Nielsen Company indeed has 45 terabytes or whatever of data on Oracle in its European (Customer) Information Factory. This is not particularly growing.
Nielsen’s Oracle data warehouse has been built up over the past 9 years. It’s not new. It’s certainly not on Exadata, nor planned to move to Exadata.
These are not single-instance databases. Nielsen’s biggest single Netezza database is 20 terabytes or so of user data, and its biggest single Oracle database is 10 terabytes or so.
Much (most?) of the rest of the installations are customer data marts and the like, based in each case on the “big” central database. (That’s actually a classic data mart use case.) Greg said that Netezza’s capabilities to spin out those databases seemed pretty good.
That 10 terabyte Oracle data warehouse instance requires a lot of partitioning effort and so on in the usual way.
Nielsen has no immediate plans to replace Oracle with Netezza.
Nielsen actually has 800 terabytes or so of Netezza equipment. Some of that is kept more lightly loaded, for performance.
(3) Stair, Principles of Information Systems, 2009, p 181.
(4) Dhruba Borthakur, who is the Hadoop Engineer for Facebook. 30 petabytes in December 2010. This is really interesting.... http://www.facebook.com/note.php?note_id=468211193919
In May 2010:
The Datawarehouse Hadoop cluster at Facebook has become the largest known Hadoop storage cluster in the world. Here are some of the details about this single HDFS cluster:
21 PB of storage in a single HDFS cluster
2000 machines
12 TB per machine (a few machines have 24 TB each)
1200 machines with 8 cores each + 800 machines with 16 cores each
32 GB of RAM per machine
15 map-reduce tasks per machine
That's a total of more than 21 PB of configured storage capacity! This is larger than the previously known Yahoo!'s cluster of 14 PB. Here are the cluster statistics from the HDFS cluster at Facebook:
Two reasons for light viewing:
Modality. People have busy lives.
Fragmentation to lower measured networks.
The heaviest viewers watch 3X the volume of television of the average viewer.
The lightest viewers watch 5% the volume of television of the average viewer.
60% of the television audience accounts for 90% of television viewing (and therefore ad impressions). Call them the Heavier Viewers.
The remaining 40% of the viewers account for only 10% of total attention to television. These Lighter Viewers’ attention to television generates less than 1/10 the volume of impressions that a Heavier Viewer does.
Without careful planning based on the best possible data resource, every 12 impressions an advertiser buys will yield one unit of reach against the 40% of the audience that are Lighter Viewers.
Ratio of Heavier Viewer viewing to Lighter Viewer viewing varies by network. Networks with a relatively greater share of viewing attributable to heavier viewers will tend to accumulate audience more slowly than networks with a lower share of viewing attributable to heavier viewers. All else equal, impressions on networks with more heavier viewer viewing will create more frequency and less reach than networks with less heavier viewer viewing.
SYFY 2010.02.28 7:00:00PM to 2010.10.14 12:30PM
10645 Observations for 514 stations
Sometimes easy to spot
Files corrupted
What about inconsistency in field level data?
Possibly a logging problem at the STB level?
Possibly an aggregation problem?
Learning the difference between “bank” of a river vs “bank” as a place where you put your money.
In search we called this the “Madonna problem”: Madonna the religious icon vs Madonna the pop culture icon.
Nielsen has Over The Air, Analog, Digital
Nielsen has Over The Air, Analog, Digital
Imputed Nielsen’s numbers
The first chart shows the fraction of view time for women of ages 18-54 (F18-54) as fraction of view time for all TV viewers for week 2 vs the same fraction for week 1 (two weeks in January). The data is for three markets: Philadelphia in blue, Atlanta in red and Chicago in green. Each point represents a zip code in one of these markets. The second chart is similar but for men 18-54 (M18-54).
The distance of a point away from the diagonal line represents the variation from one week to the next for that zip code. The separation along the diagonal line represents the varying fraction of adult women between the zip codes. As an example, if there had been no change from the first week to the second, all points would have been along the diagonal.
We see strong overlap of all three markets and they can't be separated in these views. However, we see significant spread of the fraction of the F18-54 group and M18-54 group between the zip codes that compose these markets. Women appear to show more geographic variation in their viewing habits.
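The "distance from the diagonal" reading described above can be made concrete. For a point (week 1 fraction, week 2 fraction), the perpendicular distance to the line y = x is |week2 - week1| / sqrt(2). The sketch below computes it per zip code and averages it per group. The function names and the use of a plain mean as the summary are assumptions for illustration.

```python
import math


def distance_from_diagonal(week1, week2):
    """Perpendicular distance of the point (week1, week2) from the line y = x."""
    return abs(week2 - week1) / math.sqrt(2)


def mean_week_to_week_variation(points):
    """Average distance from the diagonal over (week1, week2) pairs.

    points: one pair of view-time fractions per zip code.
    """
    distances = [distance_from_diagonal(w1, w2) for w1, w2 in points]
    return sum(distances) / len(distances)
```

A zip code whose fraction did not change between the two weeks sits exactly on the diagonal and contributes a distance of zero.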