ACMS TV Ratings Midterm Angelini
Television Show Cancelation Model
Purposeful Selection of Covariates
Brandon Angelini
Topics in Statistics – Logistic Modeling
ACMS 40950
March 4th
2016
Introduction
The model’s goal is to predict whether a TV show will be canceled from a variety of
publicly available information about the show. The data set includes a binary outcome
(cancelled or renewed); four continuous covariates (18-49 demographic viewership,
previous year 18-49 demo viewership, overall viewership, and previous year overall
viewership); the categorical variable Network (ABC, CBS, NBC, etc.); and the binary
variables Scripted and Broadcast. The data are an aggregation of Nielsen TV ratings, as
archived by TVSeriesFinale.com and compiled by the researcher (Brandon Angelini).
Data Set Variables:
Title – Show Title
Cancel – 1 if cancelled, 0 if renewed/other
ID – Arbitrary numerical ID assigned to show
Demo – Ratings for adults 18-49 demographic
PrevDemo – Ratings for previous year for adults 18-49 demographic (0 for new shows)
Viewers – overall viewership
PrevViewers – overall viewership for previous year (0 for new shows)
Scripted – 0 if unscripted show (reality), 1 if scripted show (drama, comedy, etc.)
Broadcast – 0 if cable, 1 if broadcast network (ABC, CBS, FOX, and NBC typically have
higher distribution because they’re broadcast networks)
Network – Network show airs on
ABC – 1
CBS – 2
CW – 3
FOX – 4
Freeform – 5
FX – 6
MTV – 7
NBC – 8
SyFy – 9
TNT – 10
Often the cancelling and renewing of TV shows is looked at as a black box in which TV
executives choose which shows live and die for a variety of reasons, but this model attempts
to simplify the cancelation decisions to ratings data, and the categories a show occupies.
Production companies, media companies, and advertisers could use the model to understand
how to allocate resources among shows, allowing them to predict what production decisions
executives may make.
3. Purposeful Selection of Covariates
Step 1:
Create a univariable logistic regression model for each covariate
Covariate              Coeff.   Std. Err.  Odds Ratio  G         p
Demo                   -1.389   0.238      0.24935     46.57     0.00
PrevDemo               -2.592   0.429      0.07487     91.10     0.000
Viewers                -0.293   0.052      0.74602     42.92     0.000
PrevViewers            -0.568   0.101      0.56666     87.32     0.00
Script                  1.999   0.612      7.3817      17.51     0.001
Broadcast              -0.013   0.310      0.98708     0.00179   0.966
as.factor(Network)2    -0.581   0.427      0.55934     6.89      0.174
as.factor(Network)3    -0.899   0.812      0.40698     6.89      0.268
as.factor(Network)4    -0.150   0.455      0.86071     6.89      0.742
as.factor(Network)5     0.112   0.905      1.1185      6.89      0.901
as.factor(Network)6    -0.398   0.709      0.67166     6.89      0.574
as.factor(Network)7     0.295   0.776      1.3431      6.89      0.704
as.factor(Network)8     0.208   0.373      1.2312      6.89      0.578
as.factor(Network)9    -0.111   0.647      0.89494     6.89      0.864
as.factor(Network)10    0.623   0.660      1.8645      6.89      0.345
Every covariate except Broadcast has at least one term with p < 0.25 at this stage (for
Network, only the CBS level, at p = 0.174), so a model will be fit with all variables
except ‘Broadcast’.
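The screening in Step 1 can be sketched in R as follows. This is only an illustration on simulated data: the data frame `shows`, its values, and the helper `univariable_row()` are hypothetical stand-ins, not the report's data or code. G is computed as the drop in deviance from the intercept-only model.

```r
# Simulated stand-in for the show data (hypothetical values, not the real ratings)
set.seed(1)
n <- 200
shows <- data.frame(
  Demo      = runif(n, 0.2, 3),
  Viewers   = runif(n, 0.5, 12),
  Script    = rbinom(n, 1, 0.7),
  Broadcast = rbinom(n, 1, 0.5)
)
shows$Cancel <- rbinom(n, 1, plogis(1 - 1.2 * shows$Demo))

# Fit one single-covariate logistic model and summarise it in one row
univariable_row <- function(var, data) {
  fit <- glm(reformulate(var, response = "Cancel"), family = binomial, data = data)
  est <- coef(summary(fit))[2, ]           # row 2 is the covariate, row 1 the intercept
  G   <- fit$null.deviance - fit$deviance  # likelihood ratio statistic
  data.frame(
    Covariate = var,
    Coeff     = unname(est["Estimate"]),
    StdErr    = unname(est["Std. Error"]),
    OddsRatio = exp(unname(est["Estimate"])),
    G         = G,
    p_LR      = pchisq(G, df = fit$df.null - fit$df.residual, lower.tail = FALSE)
  )
}

screen <- do.call(rbind, lapply(c("Demo", "Viewers", "Script", "Broadcast"),
                                univariable_row, data = shows))
print(screen)

# Covariates carried forward at the 25% screening level
screen$Covariate[screen$p_LR < 0.25]
```

Note that this sketch screens on the likelihood ratio p-value, while the p column in the table above appears to be the Wald p-value for each term; the two usually agree closely but are not identical.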
Step 2:
Fit a multivariable model that contains all covariates that are significant in univariable
analysis at the 25% level.
Covariate              Coeff.   Std. Err.  Z        Pr(>|z|)
Demo                   -2.230   1.112      -2.006   0.04489 *
PrevDemo               -2.641   1.564      -1.689   0.09120 .
Viewers                -0.771   0.302      -2.554   0.01064 *
PrevViewers             0.025   0.347       0.072   0.94257
Script                  2.646   0.836       3.163   0.00156 **
as.factor(Network)2     2.054   0.953       2.154   0.03121 *
as.factor(Network)3    -6.661   1.331      -5.004   5.62e-07 ***
as.factor(Network)4    -1.636   0.837      -1.954   0.05073 .
as.factor(Network)5    -6.038   1.4209     -4.250   2.14e-05 ***
as.factor(Network)6    -6.900   1.3068     -5.280   1.29e-07 ***
as.factor(Network)7    -6.615   1.3516     -4.894   9.89e-07 ***
as.factor(Network)8    -0.730   0.6582     -1.110   0.26698
as.factor(Network)9    -7.299   1.3013     -5.609   2.03e-08 ***
as.factor(Network)10   -4.744   1.212      -3.915   9.03e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
PrevDemo, PrevViewers, and Network 8 don’t appear to be significant at the 5% level, so we
use the likelihood ratio test to decide whether each can be dropped. For PrevDemo the
p-value is .0923 (>.05) and for PrevViewers it is .94275 (>.05), so both are excluded from
the model at this point. Excluding Network gives a p-value of 1.589e-12, so Network should
be kept in the model.
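A likelihood ratio test of this kind compares the full model with the model that omits one covariate. The sketch below shows one way to do it with `anova()` on simulated data; the data frame `d` and its values are made up for illustration, and only three covariates are used to keep it short.

```r
# Simulated stand-in data (hypothetical, not the report's data)
set.seed(2)
n <- 250
d <- data.frame(
  Demo     = runif(n, 0.2, 3),
  PrevDemo = runif(n, 0, 3),
  Network  = factor(sample(1:4, n, replace = TRUE))
)
d$Cancel <- rbinom(n, 1, plogis(1.5 - 1.5 * d$Demo + 0.8 * (d$Network == "2")))

full    <- glm(Cancel ~ Demo + PrevDemo + Network, family = binomial, data = d)
no_prev <- update(full, . ~ . - PrevDemo)  # drop one continuous covariate
no_net  <- update(full, . ~ . - Network)   # drop the whole factor (all dummy columns)

# G = deviance(reduced) - deviance(full), referred to a chi-square distribution
lrt_prev <- anova(no_prev, full, test = "Chisq")
lrt_net  <- anova(no_net,  full, test = "Chisq")

lrt_prev$`Pr(>Chi)`[2]  # p-value for dropping PrevDemo (1 df)
lrt_net$`Pr(>Chi)`[2]   # p-value for dropping Network (levels - 1 df)
```

Testing Network as a block, rather than reading the Wald p-value of a single level such as Network 8, is what justifies keeping or dropping the categorical variable as a whole.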
The new reduced model doesn’t include the previous year covariates (insignificant in the
multivariable model in step 2) or the Broadcast indicator (insignificant in the
univariable model in step 1).
Reduced Model
Covariate              Coeff.    Std. Err.  Z        Pr(>|z|)
Demo                   -2.7532   0.8836     -3.116   0.00183 **
Viewers                -0.8748   0.2209     -3.959   7.51e-05 ***
Script                  3.0433   0.7953      3.827   0.00013 ***
as.factor(Network)2     2.0340   0.8442      2.409   0.01598 *
as.factor(Network)3    -7.2073   1.2492     -5.769   7.95e-09 ***
as.factor(Network)4    -2.1870   0.7657     -2.856   0.00429 **
as.factor(Network)5    -6.7336   1.3414     -5.020   5.18e-07 ***
as.factor(Network)6    -7.3584   1.2303     -5.981   2.22e-09 ***
as.factor(Network)7    -7.0426   1.2802     -5.501   3.77e-08 ***
as.factor(Network)8    -0.9525   0.5869     -1.623   0.10460
as.factor(Network)9    -7.6447   1.2329     -6.201   5.63e-10 ***
as.factor(Network)10   -5.2617   1.1354     -4.634   3.58e-06 ***
Step 3:
Check to see if covariates removed from the model in step 2 confound or are needed to
adjust the effects of the covariates remaining in the model.
With the removal of PrevDemo, coefficients for Demo (21%), Network 4 (33%), and
Network 8 (29%) change by >20%, and the removal of PrevViewers results in the
coefficients for Viewers (58%), and Script (24%) changing by >20%, meaning that both
PrevDemo and PrevViewers are adjusters, and should be left in the model. Additionally, it
makes sense intuitively that a show may not only be judged on that year’s performance, but
also on the previous year, so both will be kept in the model.
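The percent change used in Step 3 is 100 × (reduced coefficient − full coefficient) / full coefficient, computed for every coefficient that appears in both models. A minimal sketch follows, on simulated data; `pct_change()` is a hypothetical helper and the two-covariate model is only a stand-in for the report's models.

```r
# Percent change in each shared coefficient when moving from the full to the reduced model
pct_change <- function(b_reduced, b_full) {
  shared <- intersect(names(b_reduced), names(b_full))
  100 * (b_reduced[shared] - b_full[shared]) / b_full[shared]
}

# Simulated data in which PrevDemo is correlated with Demo (hypothetical values)
set.seed(3)
n <- 300
Demo     <- runif(n, 0.2, 3)
PrevDemo <- Demo + rnorm(n, sd = 0.3)
Cancel   <- rbinom(n, 1, plogis(2 - 1 * Demo - 1 * PrevDemo))

full    <- glm(Cancel ~ Demo + PrevDemo, family = binomial)
reduced <- glm(Cancel ~ Demo, family = binomial)

delta <- pct_change(coef(reduced), coef(full))
round(delta, 1)

# Coefficients that move by more than 20% flag the dropped covariate as an adjuster
names(delta)[abs(delta) > 20]
```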
Step 4:
In univariable analysis (step 1) the Broadcast covariate was deemed to be insignificant
(p=0.966). The addition of Broadcast to the current model results in a small change to some
of the Network categorical coefficients, but because broadcast is a classification for network
variables (ABC, CBS, Fox, and NBC are ‘broadcast channels’, others are not) the effects of
Broadcast are likely captured through the inclusion of the network categorical variables, so
Broadcast will be left out of the model.
Preliminary Main Effects Model:
Covariate              Coeff.   Std. Err.  Z        Pr(>|z|)
Demo                   -2.230   1.112      -2.006   0.04489 *
PrevDemo               -2.641   1.564      -1.689   0.09120 .
Viewers                -0.771   0.302      -2.554   0.01064 *
PrevViewers             0.025   0.347       0.072   0.94257
Script                  2.646   0.836       3.163   0.00156 **
as.factor(Network)2     2.054   0.953       2.154   0.03121 *
as.factor(Network)3    -6.661   1.331      -5.004   5.62e-07 ***
as.factor(Network)4    -1.636   0.837      -1.954   0.05073 .
as.factor(Network)5    -6.038   1.4209     -4.250   2.14e-05 ***
as.factor(Network)6    -6.900   1.3068     -5.280   1.29e-07 ***
as.factor(Network)7    -6.615   1.3516     -4.894   9.89e-07 ***
as.factor(Network)8    -0.730   0.6582     -1.110   0.26698
as.factor(Network)9    -7.299   1.3013     -5.609   2.03e-08 ***
as.factor(Network)10   -4.744   1.212      -3.915   9.03e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Step 5:
Check the linearity of the remaining continuous covariates (Demo, PrevDemo, Viewers, and
PrevViewers) through the use of lowess plots, and account for any nonlinearity.
Lowess Plots
[Figure: lowess plots in four panels: Demo, PrevDemo, Viewers, PrevViewers]
The lowess plots for all 4 continuous covariates appear to be non-linear, so it appears
necessary to use quartile design variables, fractional polynomials, or linear splines. We
first tried linear splines to get the best possible fit. The resulting model appeared to be
overfit and yielded huge coefficients and standard errors, so we’ll try fractional
polynomials instead.
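One common way to draw such a plot is to smooth the 0/1 outcome against the covariate with `lowess()` and then view the smooth on the logit scale, where a straight line supports a linear term. The sketch below does this for one simulated covariate; the data and the clipping bounds 0.01 and 0.99 are assumptions made for illustration, not the settings behind the report's plots.

```r
# Simulated stand-in for one continuous covariate (hypothetical values)
set.seed(4)
n <- 300
Demo   <- runif(n, 0.2, 3)
Cancel <- rbinom(n, 1, plogis(2 - 1.5 * Demo))

# Smooth the binary outcome against the covariate
sm <- lowess(Demo, Cancel, f = 2/3)

# Clip the smoothed proportions away from 0 and 1 so the logit stays finite
p_smooth     <- pmin(pmax(sm$y, 0.01), 0.99)
logit_smooth <- qlogis(p_smooth)

plot(sm$x, logit_smooth, type = "l",
     xlab = "Demo", ylab = "Smoothed logit of Cancel",
     main = "Lowess check of linearity in the logit")
```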
Rejected linear splines model
Covariate                 Estimate     Std. Err.   Z value      P values
Demo.splinesx.l1          -2.411e+15   9.919e+07   -24305729    <2e-16 ***
Demo.splinesx.l2          -5.640e+14   2.299e+07   -24526083    <2e-16 ***
Demo.splinesx.l3           2.843e+15   4.082e+07    69649357    <2e-16 ***
Demo.splinesx.l4          -2.774e+14   4.816e+07   -5760509     <2e-16 ***
PrevDemo.splinesx.l1       1.982e+15   7.734e+07    25626005    <2e-16 ***
PrevDemo.splinesx.l2      -8.001e+14   2.453e+07   -32612936    <2e-16 ***
PrevDemo.splinesx.l3      -9.628e+14   5.144e+07   -18715833    <2e-16 ***
PrevDemo.splinesx.l4      -9.876e+14   8.760e+07   -11274429    <2e-16 ***
Viewers.splinesx.l1        NA          NA           NA          NA
Viewers.splinesx.l2        1.401e+13   8.712e+06    1608118     <2e-16 ***
Viewers.splinesx.l3       -7.136e+14   5.992e+06   -119077111   <2e-16 ***
Viewers.splinesx.l4       -8.728e+14   2.426e+07   -35975439    <2e-16 ***
PrevViewers.splinesx.l1   -1.852e+15   1.407e+08   -13159211    <2e-16 ***
PrevViewers.splinesx.l2   -1.942e+14   7.531e+06   -25786173    <2e-16 ***
PrevViewers.splinesx.l3    1.676e+14   6.477e+06    25883316    <2e-16 ***
PrevViewers.splinesx.l4    1.424e+15   2.930e+07    48600026    <2e-16 ***
Script                     1.468e+15   1.268e+07    115751505   <2e-16 ***
as.factor(Network)2        9.463e+14   1.464e+07    64632365    <2e-16 ***
as.factor(Network)3       -2.331e+15   2.670e+07   -87309082    <2e-16 ***
as.factor(Network)4       -6.593e+14   1.634e+07   -40359985    <2e-16 ***
as.factor(Network)5       -2.173e+15   3.583e+07   -60636932    <2e-16 ***
as.factor(Network)6       -2.439e+15   2.918e+07   -83604516    <2e-16 ***
as.factor(Network)7       -3.225e+15   3.421e+07   -94264401    <2e-16 ***
as.factor(Network)8       -2.763e+14   1.215e+07   -22745070    <2e-16 ***
as.factor(Network)9       -2.602e+15   3.389e+07   -76767104    <2e-16 ***
as.factor(Network)10      -1.791e+15   2.764e+07   -64791116    <2e-16 ***
To determine the fractional polynomials, we’ll attempt a fit using the mfp package in R:
library(mfp)
fit1 <- mfp(Cancel ~ fp(Demo) + fp(PrevDemo) + fp(Viewers) + fp(PrevViewers) + Script
+ as.factor(Network), family = binomial, data = data1, verbose = TRUE)
With fractional polynomials, none of the 4 variables yields a p-value (from G-statistic
testing) that merits adding a fractional polynomial term of any degree, so the continuous
covariates could reasonably be treated as linear. However, the plots do not appear linear,
so we’ll try adding 1-term fractional polynomials as a compromise between a more complex
but statistically insignificant transformation and the linear form. The fractional
polynomial analysis gives p1 values of Demo=0.5, PrevDemo=1, Viewers=1, and
PrevViewers=-2.
Covariate              Estimate   Std. Error   z value   Pr(>|z|)
Demo                     7.4589      4.4412     1.679    0.093061 .
DemoSQRT               -23.7592      9.9072    -2.398    0.016477 *
PrevDemo                -0.1829      1.9413    -0.094    0.924937
Viewers                  2.3239      1.3430     1.730    0.083550 .
ViewersSQ               -0.2950      0.1372    -2.151    0.031488 *
PrevViewers             -0.5409      0.5159    -1.048    0.294423
Script                  19.1208   1137.0261     0.017    0.986583
as.factor(Network)2      3.5713      1.6306     2.190    0.028510 *
as.factor(Network)3     -5.5146      1.6626    -3.317    0.000911 ***
as.factor(Network)4     -1.2444      0.9568    -1.301    0.193417
as.factor(Network)5     -4.4888      1.7850    -2.515    0.011913 *
as.factor(Network)6     -5.7524      1.6927    -3.398    0.000678 ***
as.factor(Network)7     -4.9793      1.8046    -2.759    0.005793 **
as.factor(Network)8     -0.5257      0.6967    -0.755    0.450501
as.factor(Network)9     -7.1581      1.7966    -3.984    6.77e-05 ***
as.factor(Network)10    -5.2485      1.4327    -3.663    0.000249 ***
The introduction of fractional polynomials has changed the coefficients on Script and
PrevDemo substantially and made them statistically insignificant (the standard error on
Script rises to 1137.03), so at this point it seems rational to conclude that the
fractional polynomials are overcomplicating the model. The linear splines model had
similar problems. As a result, while we’d like to account for the apparent non-linearity
illustrated in the lowess plots, we cannot do so without severely altering the model or
its coefficients. The fractional polynomials will be removed from the model, and the
continuous covariates will continue to be treated as linear.
Step 6:
Explore possible interactions among main effects
15 models were fit, one for each pairwise interaction among the 6 covariates, and none of
the interactions was significant at the 5% level. The smallest p-values were for Demo and
Network [6.7%] and for PrevViewers and Network [5.2%], but neither interaction makes sense
in context. If demo viewership and network interact, that should happen every year, and we
don’t see that for Network and PrevDemo; likewise, an interaction between PrevViewers and
Network would suggest one between Viewers and Network, which we don’t see either. Based on
intuition and the lack of significance at the 5% level, we won’t include any interactions
in the model.
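With 6 covariates there are choose(6, 2) = 15 pairwise interactions, so the 15 fits can be generated in a loop instead of by hand. The sketch below adds one interaction at a time to a main effects model and records the likelihood ratio p-value. It runs on simulated data with a 3-level Network factor; all values are hypothetical, and the report's own code (mod.int1 through mod.int15) may have tested each interaction differently.

```r
# Simulated stand-in data with the same covariate names (hypothetical values)
set.seed(5)
n <- 400
d <- data.frame(
  Demo        = runif(n, 0.2, 3),
  PrevDemo    = runif(n, 0, 3),
  Viewers     = runif(n, 0.5, 12),
  PrevViewers = runif(n, 0, 12),
  Script      = rbinom(n, 1, 0.7),
  Network     = factor(sample(1:3, n, replace = TRUE))
)
d$Cancel <- rbinom(n, 1, plogis(0.5 - 0.8 * d$Demo - 0.1 * d$Viewers + 0.7 * d$Script))

main <- glm(Cancel ~ Demo + PrevDemo + Viewers + PrevViewers + Script + Network,
            family = binomial, data = d)

# Every unordered pair of covariates: a 2 x 15 character matrix
pair_list <- combn(c("Demo", "PrevDemo", "Viewers", "PrevViewers", "Script", "Network"), 2)

# Add one interaction at a time and test it against the main effects model
p_int <- apply(pair_list, 2, function(pr) {
  fit <- update(main, as.formula(paste(". ~ . +", paste(pr, collapse = ":"))))
  anova(main, fit, test = "Chisq")$`Pr(>Chi)`[2]
})
names(p_int) <- apply(pair_list, 2, paste, collapse = ":")

sort(p_int)[1:3]  # the interactions with the smallest p-values
```

Because 15 tests are run, a p-value just above 5% is weak evidence on its own, which is consistent with the decision to leave the interactions out.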
Preliminary final model
Covariate              Coeff.   Std. Err.  Z        Pr(>|z|)
Demo                   -2.230   1.112      -2.006   0.04489 *
PrevDemo               -2.641   1.564      -1.689   0.09120 .
Viewers                -0.771   0.302      -2.554   0.01064 *
PrevViewers             0.025   0.347       0.072   0.94257
Script                  2.646   0.836       3.163   0.00156 **
as.factor(Network)2     2.054   0.953       2.154   0.03121 *
as.factor(Network)3    -6.661   1.331      -5.004   5.62e-07 ***
as.factor(Network)4    -1.636   0.837      -1.954   0.05073 .
as.factor(Network)5    -6.038   1.4209     -4.250   2.14e-05 ***
as.factor(Network)6    -6.900   1.3068     -5.280   1.29e-07 ***
as.factor(Network)7    -6.615   1.3516     -4.894   9.89e-07 ***
as.factor(Network)8    -0.730   0.6582     -1.110   0.26698
as.factor(Network)9    -7.299   1.3013     -5.609   2.03e-08 ***
as.factor(Network)10   -4.744   1.212      -3.915   9.03e-05 ***
Diagnostics and Results
Hosmer-Lemeshow Statistic
A test of goodness of fit based on agreement between the observed outcomes and the fitted
values. The H-L statistic has a null hypothesis that the model fits and an alternative that
the fit is poor; the null is rejected when the p-value falls below the chosen level
(typically 5%).
H-L Values: X-squared = 4.2516, df = 8, p-value = 0.8337
We fail to reject the null and conclude that there is adequate agreement between the
observed outcomes and the fitted values.
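The reported p-value can be checked directly from the statistic. With g = 10 groups (the setting used in the appendix code) the reference distribution is chi-square with g − 2 = 8 degrees of freedom, and the p-value is the upper tail area beyond 4.2516. The variable names below are only for illustration.

```r
# Reported Hosmer-Lemeshow statistic and the number of groups from the appendix call
hl_stat <- 4.2516
g       <- 10
hl_df   <- g - 2

# Upper-tail chi-square probability
hl_p <- pchisq(hl_stat, df = hl_df, lower.tail = FALSE)
round(hl_p, 4)
```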
Cook’s Distances
Cook’s distances measure the effect that individual, possibly outlying, data points have on
the coefficients. The distances are computed, the observation with the maximum distance is
identified, and the model is refit without it to determine the relative change in each
coefficient. If removing a data point changes any coefficient by >20%, the point is removed
and the model is fit again.
Observation 115 has the maximum distance, and upon its removal none of the coefficients
changes by the required 20%, the largest change being 3.93% for PrevViewers.
K-Folds Validation
To evaluate the generalizability of the model we’ll test it on a variety of subsets of the data.
To do so, we’ll start by splitting the data set into 5 subsets or ‘folds’, of sizes 56, 56, 56, 57,
and 57. The mean error rate over the five ‘folds’ is 0.3196115, which is acceptable for this
type of model.
Fold 1 [1:56]   Fold 2 [57:112]   Fold 3 [113:168]   Fold 4 [169:225]   Fold 5 [226:282]
0.3035714       0.1428571         0.6428571          0.3333333          0.1754386
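The procedure can be written compactly as a loop over folds. The sketch below uses the same consecutive fold sizes (56, 56, 56, 57, 57) on simulated data, with `predict()` in place of the matrix arithmetic in the appendix; the data frame `d`, the two-covariate model, and the 0.5 classification cutoff are assumptions for illustration. The last lines recompute the mean of the five error rates reported above.

```r
# Simulated stand-in data with n = 282 rows (hypothetical values)
set.seed(7)
n <- 282
d <- data.frame(Demo = runif(n, 0.2, 3), Viewers = runif(n, 0.5, 12))
d$Cancel <- rbinom(n, 1, plogis(1.5 - 1.5 * d$Demo))

# Consecutive folds of sizes 56, 56, 56, 57, 57
fold <- rep(1:5, times = c(56, 56, 56, 57, 57))

# For each fold: train on the other four, classify the held-out rows at 0.5
err <- sapply(1:5, function(k) {
  fit  <- glm(Cancel ~ Demo + Viewers, family = binomial, data = d[fold != k, ])
  phat <- predict(fit, newdata = d[fold == k, ], type = "response")
  mean((phat > 0.5) != d$Cancel[fold == k])
})
err
mean(err)

# Mean of the five error rates reported in the table above
reported <- c(0.3035714, 0.1428571, 0.6428571, 0.3333333, 0.1754386)
mean(reported)
```

Consecutive folds are only representative if the rows are in no particular order; if the data are sorted (by network or by season, for example), assigning rows to folds at random is the usual safeguard.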
AUC
Performing an Area Under the Curve (AUC) analysis yields a value of 0.8687, which
indicates very good discrimination between cancelled and renewed shows.
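The AUC equals the probability that a randomly chosen cancelled show receives a higher fitted probability than a randomly chosen renewed show. That reading gives a short rank-based formula that needs no extra package. The sketch below is one way to compute it; `auc_rank()` is a hypothetical helper, the data are simulated, and the report's value may have been obtained with a different tool.

```r
# AUC from ranks (Mann-Whitney form): y is 0/1, score is the fitted probability
auc_rank <- function(y, score) {
  r  <- rank(score)   # average ranks handle tied scores
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Small hand-checkable case: 3 of the 4 (cancelled, renewed) pairs are ordered correctly
auc_rank(c(0, 1, 0, 1), c(0.1, 0.2, 0.3, 0.4))  # 0.75

# Applied to a fitted logistic model on simulated data (hypothetical values)
set.seed(8)
n <- 282
Demo   <- runif(n, 0.2, 3)
Cancel <- rbinom(n, 1, plogis(1.5 - 1.5 * Demo))
fit    <- glm(Cancel ~ Demo, family = binomial)
auc_rank(Cancel, fitted(fit))
```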
Conclusions
The final model seems to describe the TV cancelations for this data set well, but the
model may not be generalizable, because there may be groupings to the data points that
weren’t taken into account. However, the favorable AUC and 5-fold validation are positive
signs that it may still discriminate well in outside data sets.
Potential issues with grouping in the model arise in that it may matter what other
shows are in competition in any given year, as it’s unclear if shows are fighting for a fixed
share of viewers in a year. It seems possible that shows have the potential to increase
viewership overall and fight for ratings independently of other shows, but more likely
ratings depend on other shows in the same year, and shows are fighting for a ‘slice of the
pie’ of viewers.
Additionally, I would’ve liked to add some type of variable for genre of a show, as
that’d help tie similarities between types of shows in addition to types of channels as the
categorical network variable does. Currently the scripted variable starts to do this, but there
are more classifications like ‘drama’ and ‘comedy’ that may have expected viewership
levels, that if they don’t achieve they’ll be canceled. Finally, show air time seems like it
would be an interesting statistic to keep on the shows, as it would help normalize ratings and
create a “good rating for that time” type of idea, so that shows that air at times that are
traditionally associated with lower viewership are evaluated properly.
Overall, the model fit is very successful, and has applications that could prove useful
to the general television production community.
Applications
The television cancelation model’s coefficients can help us draw some interesting
conclusions from the effect that each coefficient has in the model (positive/negative, and
overall magnitude).
1. Scripted shows are more likely to be canceled than unscripted shows, as shown by
the positive coefficient on Script (2.64558). This could likely be related to the fact
that there are more scripted shows that air in a given year, but the reality remains,
scripted shows are more likely to be canceled.
2. Previous year demographic ratings have the largest estimated ratings effect on
cancelation (PrevDemo -2.64102), followed by current year demographic ratings
(Demo -2.22992), overall viewers (Viewers -0.77149) and previous year overall
viewers (PrevViewers 0.02498). This is interesting, as it points to executives caring
more about attracting the 18-49 demographic that advertisers often target than
overall viewership. This could have broader implications for television, and provide
reasoning for the common criticism of television that ‘minorities are
underrepresented’ by showing that studios have to appeal to the majority voices in
the 18-49 demographic, as dictated by advertisers.
3. It matters what network a show airs on. Some shows may be able to avoid
cancelation by airing on a specific network, as it appears each network has its own
standards for ratings resulting in cancelation, shown by the significance of the
network categorical variable.
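The signs and magnitudes above are easier to read as odds ratios, obtained by exponentiating each coefficient: a value above 1 raises the odds of cancelation per one-unit increase in the covariate, and a value below 1 lowers them. The sketch below applies this to the coefficients quoted in the list; the vector name `b` is only for illustration.

```r
# Coefficients quoted in the list above
b <- c(Script = 2.64558, PrevDemo = -2.64102, Demo = -2.22992,
       Viewers = -0.77149, PrevViewers = 0.02498)

# Odds ratio for a one-unit increase in each covariate, other covariates held fixed
odds_ratio <- exp(b)
round(odds_ratio, 3)

# Percent change in the odds of cancelation per one-unit increase
round(100 * (odds_ratio - 1), 1)
```

Because Demo and Viewers are measured on different scales, a one-unit increase does not mean the same thing for each, so the magnitudes are not directly comparable across covariates without rescaling.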
coef(summary(mod.int12))[16,4]
coef(summary(mod.int13))[16,4]
coef(summary(mod.int14))[16,4]
coef(summary(mod.int15))[16,4]
Diagnostics
Hosmer-Lemeshow
#Perform Hosmer-Lemeshow (HL) test via R package ResourceSelection
library(ResourceSelection)
mod <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network),
family=binomial)
#g=10 groups gives df = g - 2 = 8
hoslem.test(Cancel,mod$fitted.values,g=10)
Cook’s distances
#Design matrix from the fitted model (intercept, covariates, and Network dummy columns)
X <- model.matrix(mod)
#Fitted probabilities for every observation. The original code dropped the first value
#with [-1], which shifts every index by one.
p.hats <- fitted(mod)
V <- diag(p.hats*(1-p.hats))
#Leverages: diagonal of the hat matrix
hs <- diag(sqrt(V)%*%X%*%solve(t(X)%*%V%*%X)%*%t(X)%*%sqrt(V))
#Pearson residuals
rs <- (Cancel-p.hats)/sqrt(p.hats*(1-p.hats))
#obtain delta chi-squares
delta.chis <- rs^2/(1-hs)
#obtain delta betas (cook's distances)
delta.beta <- rs^2*hs/(1-hs)^2
##obtain delta deviances
#first obtain deviance residuals
ds <- resid(mod)
delta.D <- ds^2/(1-hs)
#Examine observations with large values of diagnostics
which.max(delta.beta)
#The original run returned 114 for what is observation 115, most likely because of the
#dropped first fitted value
delta.beta[115]
#Refit without observation 115 and compute the percent change in each coefficient
mod2 <- glm(Cancel[-115]~Demo[-115]+PrevDemo[-115]+Viewers[-115]+PrevViewers[-115]
+Script[-115]+as.factor(Network)[-115], family=binomial)
100*(mod2$coefficients-mod$coefficients)/mod$coefficients
K-fold cross validation
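##Create folds from original data
#Assumption: X here is the n x 6 matrix of covariates with no intercept column, since
#glm() adds its own intercept and cbind(1, ...) supplies it for prediction. Network
#enters as its numeric code.
X <- cbind(Demo,PrevDemo,Viewers,PrevViewers,Script,Network)
#First k=5 folds, i.e., cut X and y into fifths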
X.f1 <- as.matrix(X[1:56,1:6])
X.f2 <- as.matrix(X[57:112,1:6])
X.f3 <- as.matrix(X[113:168,1:6])
X.f4 <- as.matrix(X[169:225,1:6])
X.f5 <- as.matrix(X[226:282,1:6])
y.f1 <- Cancel[1:56]
y.f2 <- Cancel[57:112]
y.f3 <- Cancel[113:168]
y.f4 <- Cancel[169:225]
y.f5 <- Cancel[226:282]
##Next, create training sets. When using Fold 1 as validation set (X.f1 and y.f1), then all
##other folds combined are training set, and so on.
X.t1 <- rbind(X.f2,X.f3,X.f4,X.f5)
X.t2 <- rbind(X.f1,X.f3,X.f4,X.f5)
X.t3 <- rbind(X.f1,X.f2,X.f4,X.f5)
X.t4 <- rbind(X.f1,X.f2,X.f3,X.f5)
X.t5 <- rbind(X.f1,X.f2,X.f3,X.f4)
y.t1 <- c(y.f2,y.f3,y.f4,y.f5)
y.t2 <- c(y.f1,y.f3,y.f4,y.f5)
y.t3 <- c(y.f1,y.f2,y.f4,y.f5)
y.t4 <- c(y.f1,y.f2,y.f3,y.f5)
y.t5 <- c(y.f1,y.f2,y.f3,y.f4)
###Now, use each training set to fit a regression model and each Fold as a validation set,
###recording the error rate each time
##Fold 1 as validation
mod1 <- glm(y.t1~X.t1,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi1 <- plogis(cbind(1,X.f1)%*%mod1$coefficients)
yhat1 <- round(pi1)
#Compute error rate, i.e., disagreement between predictions and actual y values in
#validation set
err.rate1 <- length(which(y.f1 != yhat1))/length(y.f1)
err.rate1
##Fold 2 as validation
mod2 <- glm(y.t2~X.t2,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi2 <- plogis(cbind(1,X.f2)%*%mod2$coefficients)
yhat2 <- round(pi2)
#Compute error rate, i.e., disagreement between predictions and actual y values in
#validation set
err.rate2 <- length(which(y.f2 != yhat2))/length(y.f2)
err.rate2
##Fold 3 as validation
mod3 <- glm(y.t3~X.t3,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi3 <- plogis(cbind(1,X.f3)%*%mod3$coefficients)
yhat3 <- round(pi3)
#Compute error rate, i.e., disagreement between predictions and actual y values in
#validation set
err.rate3 <- length(which(y.f3 != yhat3))/length(y.f3)
err.rate3
##Fold 4 as validation
mod4 <- glm(y.t4~X.t4,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi4 <- plogis(cbind(1,X.f4)%*%mod4$coefficients)
yhat4 <- round(pi4)
#Compute error rate, i.e., disagreement between predictions and actual y values in
#validation set
err.rate4 <- length(which(y.f4 != yhat4))/length(y.f4)
err.rate4
##Fold 5 as validation
mod5 <- glm(y.t5~X.t5,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi5 <- plogis(cbind(1,X.f5)%*%mod5$coefficients)
yhat5 <- round(pi5)
#Compute error rate, i.e., disagreement between predictions and actual y values in
#validation set
err.rate5 <- length(which(y.f5 != yhat5))/length(y.f5)
err.rate5
#compute mean error rate over five folds
mean(c(err.rate1,err.rate2,err.rate3,err.rate4,err.rate5))