Television Show Cancelation Model
Purposeful Selection of Covariates
Brandon Angelini
Topics in Statistics – Logistic Modeling
ACMS 40950
March 4th
2016
Introduction
The model’s goal is to predict whether a TV show will be canceled from a variety of
publicly available information about the show. The data set includes a binary outcome
(canceled or renewed), the continuous covariates 18-49 demographic viewership,
previous year 18-49 demo viewership, overall viewership, and previous year overall viewership,
the categorical variable Network (ABC, CBS, NBC, etc.), and the binary variables
Scripted and Broadcast. The data are an aggregation of Nielsen TV ratings, as archived by
TVSeriesFinale.com, and compiled by the researcher (Brandon Angelini).
Data Set Variables:
Title – Show Title
Cancel – 1 if cancelled, 0 if renewed/other
ID – Arbitrary numerical ID assigned to show
Demo – Ratings for adults 18-49 demographic
PrevDemo – Ratings for previous year for adults 18-49 demographic (0 for new shows)
Viewers – overall viewership
PrevViewers – overall viewership for previous year (0 for new shows)
Scripted – 0 if unscripted show (reality), 1 if scripted show (drama, comedy, etc.)
Broadcast – 0 if cable, 1 if broadcast network (ABC, CBS, FOX, and NBC typically have
higher distribution because they’re broadcast networks)
Network – Network show airs on
ABC – 1
CBS – 2
CW – 3
FOX – 4
Freeform – 5
FX – 6
MTV – 7
NBC – 8
SyFy – 9
TNT – 10
Often the cancelling and renewing of TV shows is looked at as a black box in which TV
executives choose which shows live and die for a variety of reasons, but this model attempts
to simplify the cancelation decisions to ratings data, and the categories a show occupies.
Production companies, media companies, and advertisers could use the model to understand
how to allocate resources among shows, allowing them to predict what production decisions
executives may make.
Purposeful Selection of Covariates
Step 1:
Create a univariable logistic regression model for each covariate
Covariate Coeff. Std. Err. Odds Ratio G p
Demo -1.389 0.238 0.24935 46.57 0.00
PrevDemo -2.592 0.429 0.07487 91.10 0.000
Viewers -0.293 0.052 0.74602 42.92 0.000
PrevViewers -0.568 0.101 0.56666 87.32 0.00
Script 1.999 0.612 7.3817 17.51 0.001
Broadcast -0.013 0.310 0.98708 0.00179 0.966
as.factor(Network)2 -0.581 0.427 0.55934 6.89 0.174
as.factor(Network)3 -0.899 0.812 0.40698 6.89 0.268
as.factor(Network)4 -0.150 0.455 0.86071 6.89 0.742
as.factor(Network)5 0.112 0.905 1.1185 6.89 0.901
as.factor(Network)6 -0.398 0.709 0.67166 6.89 0.574
as.factor(Network)7 0.295 0.776 1.3431 6.89 0.704
as.factor(Network)8 0.208 0.373 1.2312 6.89 0.578
as.factor(Network)9 -0.111 0.647 0.89494 6.89 0.864
as.factor(Network)10 0.623 0.660 1.8645 6.89 0.345
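Demo, PrevDemo, Viewers, PrevViewers, and Script all appear to be significant at this stage,
while Broadcast clearly is not (p = 0.966). Network is carried forward as well, so a model
will be fit with all variables except ‘Broadcast’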
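The G statistics and odds ratios in the Step 1 table can be computed along these lines. This is a minimal sketch on simulated data, not the show data: `g_test`, `x`, and `y` are illustrative names that do not appear in the original analysis.

```r
# Sketch: likelihood ratio (G) test and odds ratio for a univariable
# logistic model. Data are simulated; g_test is a hypothetical helper.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 - 1.2 * x))

g_test <- function(model) {
  G  <- model$null.deviance - model$deviance      # drop in deviance
  df <- model$df.null - model$df.residual         # parameters added
  c(G = G, df = df, p = pchisq(G, df, lower.tail = FALSE))
}

fit <- glm(y ~ x, family = binomial)
res <- g_test(fit)
odds_ratio <- exp(coef(fit)["x"])   # odds ratio per one-unit increase in x
print(res)
print(odds_ratio)
```

For a factor such as Network, `df` equals the number of dummy variables, so the G statistic is referred to a chi-square distribution with that many degrees of freedom.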
Step 2:
Fit a multivariable model that contains all covariates that are significant in univariable
analysis at the 25% level.
Covariate Coeff. Std. Err. Z Pr(>|z|)
Demo -2.230 1.112 -2.006 0.04489*
PrevDemo -2.641 1.564 -1.689 0.09120.
Viewers -0.771 0.302 -2.554 0.01064*
PrevViewers 0.025 0.347 0.072 0.94257
Script 2.646 0.836 3.163 0.00156**
as.factor(Network)2 2.054 0.953 2.154 0.03121*
as.factor(Network)3 -6.661 1.331 -5.004 5.62e-07***
as.factor(Network)4 -1.636 0.837 -1.954 0.05073.
as.factor(Network)5 -6.038 1.4209 -4.250 2.14e-05***
as.factor(Network)6 -6.900 1.3068 -5.280 1.29e-07***
as.factor(Network)7 -6.615 1.3516 -4.894 9.89e-07***
as.factor(Network)8 -0.730 0.6582 -1.110 0.26698
as.factor(Network)9 -7.299 1.3013 -5.609 2.03e-08***
as.factor(Network)10 -4.744 1.212 -3.915 9.03e-05***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
PrevDemo, PrevViewers, and Network 8 don’t appear to be significant in the Wald tests above,
so we use the likelihood ratio test to decide whether each can be dropped. For PrevDemo the
p-value is .0923 (>.05) and for PrevViewers it is .94275 (>.05), so both should be excluded
from the model at this point. Excluding Network results in a p-value of 1.589e-12, so
Network should be kept in the model.
The new reduced model doesn’t include the previous-year covariates (insignificant in the
multivariable model in step 2) or the indicator for whether the show is on broadcast TV
(insignificant in the univariable model in step 1).
Reduced Model
Covariate Coeff. Std. Err. Z Pr(>|z|)
Demo -2.7532 0.8836 -3.116 0.00183**
Viewers -0.8748 0.2209 -3.959 7.51e-05***
Script 3.0433 0.7953 3.827 0.00013***
as.factor(Network)2 2.0340 0.8442 2.409 0.01598*
as.factor(Network)3 -7.2073 1.2492 -5.769 7.95e-09***
as.factor(Network)4 -2.1870 0.7657 -2.856 0.00429**
as.factor(Network)5 -6.7336 1.3414 -5.020 5.18e-07***
as.factor(Network)6 -7.3584 1.2303 -5.981 2.22e-09***
as.factor(Network)7 -7.0426 1.2802 -5.501 3.77e-08***
as.factor(Network)8 -0.9525 0.5869 -1.623 0.10460
as.factor(Network)9 -7.6447 1.2329 -6.201 5.63e-10***
as.factor(Network)10 -5.2617 1.1354 -4.634 3.58e-06***
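The likelihood ratio tests in Step 2 compare a model with and without the covariate in question. A minimal sketch of that comparison, on simulated data with illustrative variable names (`x1`, `x2`, `y` are assumptions, not columns of the show data):

```r
# Sketch: likelihood ratio test for dropping one covariate from a
# logistic model. Simulated data; variable names are illustrative.
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)                       # unrelated to the outcome by design
y  <- rbinom(n, 1, plogis(0.3 - 1.0 * x1))

full    <- glm(y ~ x1 + x2, family = binomial)
reduced <- glm(y ~ x1,      family = binomial)

# anova() on nested glm fits reports the deviance difference and,
# with test = "Chisq", its chi-square p-value
lrt <- anova(reduced, full, test = "Chisq")
G   <- lrt$Deviance[2]
p   <- lrt[["Pr(>Chi)"]][2]
print(lrt)
```

A large p-value here would support dropping `x2`, which mirrors the decision rule applied to PrevDemo and PrevViewers.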
Step 3:
Check to see if covariates removed from the model in step 2 confound or are needed to
adjust the effects of the covariates remaining in the model.
With the removal of PrevDemo, coefficients for Demo (21%), Network 4 (33%), and
Network 8 (29%) change by >20%, and the removal of PrevViewers results in the
coefficients for Viewers (58%), and Script (24%) changing by >20%, meaning that both
PrevDemo and PrevViewers are adjusters, and should be left in the model. Additionally, it
makes sense intuitively that a show may not only be judged on that year’s performance, but
also on the previous year, so both will be kept in the model.
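The 20% rule in Step 3 compares each remaining coefficient with and without the removed covariate. A minimal sketch, assuming the usual percent-change definition 100 × (reduced − full) / full; the data are simulated and `pct_change` is a hypothetical helper:

```r
# Sketch: percent change in shared coefficients when a covariate is
# dropped. Simulated data; pct_change is a hypothetical helper.
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)            # correlated with x1, so a confounder
y  <- rbinom(n, 1, plogis(-0.2 - 0.8 * x1 - 0.8 * x2))

full    <- glm(y ~ x1 + x2, family = binomial)
reduced <- glm(y ~ x1,      family = binomial)

pct_change <- function(full, reduced) {
  # Match coefficients by name so the dropped covariate is skipped
  shared <- setdiff(names(coef(reduced)), "(Intercept)")
  100 * (coef(reduced)[shared] - coef(full)[shared]) / coef(full)[shared]
}

delta <- pct_change(full, reduced)
print(round(delta, 1))
print(abs(delta) > 20)               # TRUE flags a possible confounder or adjuster
```

Matching by name rather than by position avoids comparing a coefficient with the wrong counterpart once a term has been removed.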
Step 4:
In univariable analysis (step 1) the Broadcast covariate was deemed to be insignificant
(p=0.966). The addition of Broadcast to the current model results in a small change to some
of the Network categorical coefficients, but because broadcast is a classification for network
variables (ABC, CBS, Fox, and NBC are ‘broadcast channels’, others are not) the effects of
Broadcast are likely captured through the inclusion of the network categorical variables, so
Broadcast will be left out of the model.
Preliminary Main Effects Model:
Covariate Coeff. Std. Err. Z Pr(>|z|)
Demo -2.230 1.112 -2.006 0.04489*
PrevDemo -2.641 1.564 -1.689 0.09120.
Viewers -0.771 0.302 -2.554 0.01064*
PrevViewers 0.025 0.347 0.072 0.94257
Script 2.646 0.836 3.163 0.00156**
as.factor(Network)2 2.054 0.953 2.154 0.03121*
as.factor(Network)3 -6.661 1.331 -5.004 5.62e-07***
as.factor(Network)4 -1.636 0.837 -1.954 0.05073.
as.factor(Network)5 -6.038 1.4209 -4.250 2.14e-05***
as.factor(Network)6 -6.900 1.3068 -5.280 1.29e-07***
as.factor(Network)7 -6.615 1.3516 -4.894 9.89e-07***
as.factor(Network)8 -0.730 0.6582 -1.110 0.26698
as.factor(Network)9 -7.299 1.3013 -5.609 2.03e-08***
as.factor(Network)10 -4.744 1.212 -3.915 9.03e-05***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Step 5:
Check the linearity of the remaining continuous covariates (Demo, PrevDemo, Viewers, and
PrevViewers) through the use of lowess plots, and account for any nonlinearity.
Lowess Plots
[Figure: lowess plots of the logit against Demo, PrevDemo, Viewers, and PrevViewers]
The lowess plots for all 4 continuous covariates appear to be non-linear, so it appears that it
will be necessary to use quartile design variables, fractional polynomials, or linear splines.
We first tried linear splines and tested the fit. The linear splines model appeared to be
overfit, with coefficient estimates on the order of 1e+15 and standard errors on the order of
1e+07, so we’ll try fractional polynomials instead.
Rejected linear splines model
Covariate Estimate Std. Err. Z value P values
Demo.splinesx.l1 -2.411e+15 9.919e+07 -24305729 <2e-16***
Demo.splinesx.l2 -5.640e+14 2.299e+07 -24526083 <2e-16***
Demo.splinesx.l3 2.843e+15 4.082e+07 69649357 <2e-16***
Demo.splinesx.l4 -2.774e+14 4.816e+07 -5760509 <2e-16***
PrevDemo.splinesx.l1 1.982e+15 7.734e+07 25626005 <2e-16***
PrevDemo.splinesx.l2 -8.001e+14 2.453e+07 -32612936 <2e-16***
PrevDemo.splinesx.l3 -9.628e+14 5.144e+07 -18715833 <2e-16***
PrevDemo.splinesx.l4 -9.876e+14 8.760e+07 -11274429 <2e-16***
Viewers.splinesx.l1 NA NA NA NA
Viewers.splinesx.l2 1.401e+13 8.712e+06 1608118 <2e-16***
Viewers.splinesx.l3 -7.136e+14 5.992e+06 -119077111 <2e-16***
Viewers.splinesx.l4 -8.728e+14 2.426e+07 -35975439 <2e-16***
PrevViewers.splinesx.l1 -1.852e+15 1.407e+08 -13159211 <2e-16***
PrevViewers.splinesx.l2 -1.942e+14 7.531e+06 -25786173 <2e-16***
PrevViewers.splinesx.l3 1.676e+14 6.477e+06 25883316 <2e-16***
PrevViewers.splinesx.l4 1.424e+15 2.930e+07 48600026 <2e-16***
Script 1.468e+15 1.268e+07 115751505 <2e-16***
as.factor(Network)2 9.463e+14 1.464e+07 64632365 <2e-16***
as.factor(Network)3 -2.331e+15 2.670e+07 -87309082 <2e-16***
as.factor(Network)4 -6.593e+14 1.634e+07 -40359985 <2e-16***
as.factor(Network)5 -2.173e+15 3.583e+07 -60636932 <2e-16***
as.factor(Network)6 -2.439e+15 2.918e+07 -83604516 <2e-16***
as.factor(Network)7 -3.225e+15 3.421e+07 -94264401 <2e-16***
as.factor(Network)8 -2.763e+14 1.215e+07 -22745070 <2e-16***
as.factor(Network)9 -2.602e+15 3.389e+07 -76767104 <2e-16***
as.factor(Network)10 -1.791e+15 2.764e+07 -64791116 <2e-16***
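The four spline terms per covariate in the rejected model correspond to a piecewise-linear basis with three knots. The sketch below shows one conventional construction; `linear_spline4` is a hypothetical stand-in, since the helper `my.4splines` used in the appendix is not shown and may be defined differently. The knots are the Demo knots listed in the appendix.

```r
# Sketch: four-piece linear spline basis with three knots.
# linear_spline4 is a hypothetical stand-in; the helper my.4splines
# used in the appendix is not shown in the source and may differ.
linear_spline4 <- function(x, knots) {
  stopifnot(length(knots) == 3, !is.unsorted(knots))
  cbind(
    l1 = pmin(x, knots[1]),
    l2 = pmax(pmin(x, knots[2]), knots[1]) - knots[1],
    l3 = pmax(pmin(x, knots[3]), knots[2]) - knots[2],
    l4 = pmax(x - knots[3], 0)
  )
}

x <- c(0.2, 1.0, 3.0, 5.0)
basis <- linear_spline4(x, knots = c(0.5, 2.2, 3.5))
print(basis)
```

Each row of the basis sums to the original value of `x`, so the four terms split the covariate into the portion that falls within each interval. A segment that contains very few observations can leave its slope poorly determined, which is one possible source of unstable estimates.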
To determine the fractional polynomials, we’ll attempt to fit using the mfp package in R:
fit1 <- mfp(Cancel ~ fp(Demo) + fp(PrevDemo) + fp(Viewers) + fp(PrevViewers) + Script +
as.factor(Network), family=binomial, data=data1, verbose=T)
Upon trying fractional polynomials, none of the 4 variables results in a p-value that merits
the addition of any degree of fractional polynomial (determined through G-statistic/p-value
testing), so the continuous covariates could reasonably be left as linear terms.
However, the plots do not appear linear, so we’ll try adding 1-term fractional polynomials as
a compromise between a more complex but statistically insignificant form and the linear one.
The output of the fractional polynomial analysis gives the following values of the power p1:
Demo=0.5, PrevDemo=1, Viewers=1, and PrevViewers=-2.
Covariate Estimate Std. Error z value Pr(>|z|)
Demo 7.4589 4.4412 1.679 0.093061.
DemoSQRT -23.7592 9.9072 -2.398 0.016477*
PrevDemo -0.1829 1.9413 -0.094 0.924937
Viewers 2.3239 1.3430 1.730 0.083550.
ViewersSQ -0.2950 0.1372 -2.151 0.031488*
PrevViewers -0.5409 0.5159 -1.048 0.294423
Script 19.1208 1137.0261 0.017 0.986583
as.factor(Network)2 3.5713 1.6306 2.190 0.028510*
as.factor(Network)3 -5.5146 1.6626 -3.317 0.000911***
as.factor(Network)4 -1.2444 0.9568 -1.301 0.193417
as.factor(Network)5 -4.4888 1.7850 -2.515 0.011913*
as.factor(Network)6 -5.7524 1.6927 -3.398 0.000678***
as.factor(Network)7 -4.9793 1.8046 -2.759 0.005793**
as.factor(Network)8 -0.5257 0.6967 -0.755 0.450501
as.factor(Network)9 -7.1581 1.7966 -3.984 6.77e-05***
as.factor(Network)10 -5.2485 1.4327 -3.663 0.000249***
The introduction of fractional polynomials has changed the coefficients on Script and
PrevDemo substantially, and made them statistically insignificant, so at this point it seems
reasonable to conclude that the addition of fractional polynomials is overcomplicating the
model. The linear splines model attempted earlier had similar problems, so
while we’d like to account for the apparent non-linearity illustrated in the lowess plots, we
cannot do so without severely altering the model or its
coefficients. The fractional polynomials will be removed from the model, and the
continuous covariates will be treated as linear.
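A one-term fractional polynomial search can be sketched by hand as follows. This is an illustration on simulated data, not a reproduction of the mfp output in this report; `fp1` is a hypothetical helper, and the power set is the conventional one, with power 0 standing for log(x).

```r
# Sketch: first-degree fractional polynomial (FP1) search by deviance.
# Simulated data; fp1 and the power set are the conventional choices,
# not output from the mfp run in this report.
set.seed(4)
x <- runif(300, 0.2, 5)              # FP transforms need positive x
y <- rbinom(300, 1, plogis(1.5 - 2 * sqrt(x)))

fp1 <- function(x, p) if (p == 0) log(x) else x^p   # p = 0 means log(x)
powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)

dev <- sapply(powers, function(p) deviance(glm(y ~ fp1(x, p), family = binomial)))
names(dev) <- powers
best <- powers[which.min(dev)]
# Deviance gained by the best FP1 term over the linear term (p = 1)
G <- dev["1"] - min(dev)
print(round(dev, 2))
print(best)
```

Because these transformations require positive values, covariates coded 0 for new shows (PrevDemo and PrevViewers) would need to be shifted before any log or negative power is applied.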
Step 6:
Explore possible interactions among main effects
15 models were fit, one for each pairwise interaction between covariates, and none was
significant at the 5% level. The closest to significance were the interactions between Demo
and Network (p = 0.067) and between PrevViewers and Network (p = 0.052), but neither makes
sense in context. If demo viewership and network interact, that should happen every year,
and we don’t see it for Network and PrevDemo; likewise, we don’t see an interaction between
Viewers and Network, as the interaction between PrevViewers and Network would suggest.
Based on intuition and the lack of significance at the 5% level, we won’t include any
interactions in the model.
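The 15 interaction models (six covariates give choose(6, 2) = 15 pairs) can be generated in a loop rather than written out one by one. A minimal sketch on simulated data with three illustrative covariates (`a`, `b`, and a factor `g`, none of which are columns of the show data):

```r
# Sketch: likelihood ratio test for every pairwise interaction.
# Simulated data; variable names are illustrative only.
set.seed(5)
n <- 400
d <- data.frame(a = rnorm(n), b = rnorm(n),
                g = factor(sample(1:3, n, replace = TRUE)))
d$y <- rbinom(n, 1, plogis(0.2 + 0.8 * d$a - 0.5 * d$b))

main  <- glm(y ~ a + b + g, family = binomial, data = d)
pairs <- combn(c("a", "b", "g"), 2, simplify = FALSE)

p_values <- sapply(pairs, function(v) {
  term <- paste(v, collapse = ":")
  fit  <- update(main, as.formula(paste(". ~ . +", term)))
  # One test per interaction, covering all of its terms jointly
  anova(main, fit, test = "Chisq")[["Pr(>Chi)"]][2]
})
names(p_values) <- sapply(pairs, paste, collapse = ":")
print(round(p_values, 3))
```

An interaction with a factor such as Network adds one coefficient per non-reference level, so a joint test of this kind covers all of those coefficients rather than a single row of the summary table.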
Preliminary final model
Covariate Coeff. Std. Err. Z Pr(>|z|)
Demo -2.230 1.112 -2.006 0.04489*
PrevDemo -2.641 1.564 -1.689 0.09120.
Viewers -0.771 0.302 -2.554 0.01064*
PrevViewers 0.025 0.347 0.072 0.94257
Script 2.646 0.836 3.163 0.00156**
as.factor(Network)2 2.054 0.953 2.154 0.03121*
as.factor(Network)3 -6.661 1.331 -5.004 5.62e-07***
as.factor(Network)4 -1.636 0.837 -1.954 0.05073.
as.factor(Network)5 -6.038 1.4209 -4.250 2.14e-05***
as.factor(Network)6 -6.900 1.3068 -5.280 1.29e-07***
as.factor(Network)7 -6.615 1.3516 -4.894 9.89e-07***
as.factor(Network)8 -0.730 0.6582 -1.110 0.26698
as.factor(Network)9 -7.299 1.3013 -5.609 2.03e-08***
as.factor(Network)10 -4.744 1.212 -3.915 9.03e-05***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Diagnostics and Results
Hosmer-Lemeshow Statistic
A test of goodness of fit based on agreement between the observed outcomes and the fitted
values. The H-L statistic has a null hypothesis that the model fits and an alternative that
the fit is poor; the null is rejected when the p-value falls below the chosen level
(typically 5%).
H-L Values: X-squared = 4.2516, df = 8, p-value = 0.8337
We fail to reject the null, and conclude that there is adequate agreement between the
observed outcomes and the fitted values.
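The Hosmer-Lemeshow statistic can be sketched by hand as follows: observations are grouped by quantiles of the fitted probability, and observed and expected event counts are compared within each group. The data are simulated and `hl_stat` is a hypothetical helper; it is not claimed to match the grouping used by `hoslem.test` exactly.

```r
# Sketch: Hosmer-Lemeshow statistic computed by hand.
# Simulated data; hl_stat is a hypothetical helper.
set.seed(6)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-0.3 + x))
fit <- glm(y ~ x, family = binomial)

hl_stat <- function(y, phat, g = 10) {
  # Group observations by quantiles of the fitted probability
  breaks <- quantile(phat, probs = seq(0, 1, length.out = g + 1))
  grp    <- cut(phat, breaks = breaks, include.lowest = TRUE)
  obs    <- tapply(y, grp, sum)       # observed events per group
  expd   <- tapply(phat, grp, sum)    # expected events per group
  n_g    <- tapply(y, grp, length)
  chisq  <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
  c(chisq = chisq, df = g - 2, p = pchisq(chisq, g - 2, lower.tail = FALSE))
}

hl <- hl_stat(y, fitted(fit))
print(hl)
```

With g = 10 groups the reference distribution has 8 degrees of freedom, which agrees with the df reported above.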
Cook’s Distances
A way to determine the effect on the coefficients created by “outlier” data points. Cook’s
distances are computed, a maximum distance is determined, and the relative effects on the
model’s coefficients are determined. If any data point has >20% effect, it is removed and the
model is fit again.
Observation 115 is the max, and upon removal none of the changes have the required 20%
effect, with the max being 3.93% for PrevViewers.
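The remove-and-refit check described above can be sketched with base R’s `cooks.distance()`. Note the assumptions: the data are simulated, and `cooks.distance()` is scaled differently from the delta-beta formula used in the appendix, so the values themselves need not match.

```r
# Sketch: influence check with base R's cooks.distance().
# Simulated data; the 20% threshold follows the text above.
set.seed(7)
d <- data.frame(x = rnorm(150))
d$y <- rbinom(150, 1, plogis(0.5 - d$x))
fit <- glm(y ~ x, family = binomial, data = d)

cd    <- cooks.distance(fit)
worst <- which.max(cd)                       # most influential observation

refit <- glm(y ~ x, family = binomial, data = d[-worst, ])
pct   <- 100 * (coef(refit) - coef(fit)) / coef(fit)
print(worst)
print(round(pct, 2))
print(any(abs(pct) > 20))                    # TRUE would suggest removing it
```

Passing a reduced data frame to `data =` keeps the coefficient names identical between the two fits, which makes the percent-change comparison easier to read.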
K-Folds Validation
To evaluate the generalizability of the model we’ll test it on a variety of subsets of the data.
To do so, we’ll start by splitting the data set into 5 subsets or ‘folds’, of sizes 56, 56, 56, 57,
and 57. Each fold is held out in turn while the model is fit to the other four. The mean error
rate over the five ‘folds’ is 0.3196115, which is acceptable for this type of model, although
the error rate varies widely across folds (from 0.143 to 0.643).
Fold 1 [1:56] Fold 2 [57:112] Fold 3 [113:168] Fold 4 [169:225] Fold 5 [226:282]
0.3035714 0.1428571 0.6428571 0.3333333 0.1754386
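The fold-by-fold procedure can be written as a single loop. The sketch below uses simulated data of the same size (282 rows) and a hypothetical helper `cv_error`; fold sizes differ by at most one, though which folds receive the extra row may differ from the split above.

```r
# Sketch: k-fold cross-validation with contiguous folds, as in the text.
# Simulated data; cv_error is a hypothetical helper.
set.seed(8)
n <- 282
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.4 - 1.1 * d$x1 + 0.6 * d$x2))

cv_error <- function(data, k = 5) {
  # Contiguous fold labels 1..k, sizes differing by at most one
  fold <- sort(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    train <- data[fold != i, ]
    test  <- data[fold == i, ]
    fit   <- glm(y ~ x1 + x2, family = binomial, data = train)
    phat  <- predict(fit, newdata = test, type = "response")
    mean((phat > 0.5) != test$y)     # misclassification rate at 0.5 cutoff
  })
}

err <- cv_error(d)
print(err)
print(mean(err))
```

If the rows of a data set are ordered in some systematic way, contiguous folds can differ noticeably from one another; shuffling the rows before assigning folds is a common precaution.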
AUC
Performing an Area Under the Curve analysis yields a value of 0.8687, which indicates very good
discrimination between canceled and renewed shows.
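The AUC is the probability that a randomly chosen canceled show receives a higher fitted probability than a randomly chosen renewed show. A minimal sketch using the rank-sum identity, where `auc_rank` is a hypothetical helper and the inputs are toy values:

```r
# Sketch: AUC from the rank-sum (Mann-Whitney) identity.
# auc_rank is a hypothetical helper; y is 0/1, p is the fitted probability.
auc_rank <- function(y, p) {
  r  <- rank(p)                      # midranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

perfect <- auc_rank(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))
mixed   <- auc_rank(c(0, 1, 0, 1), c(0.1, 0.2, 0.8, 0.9))
print(c(perfect = perfect, mixed = mixed))
```

In the second toy case three of the four (event, non-event) pairs are ordered correctly, giving an AUC of 0.75.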
Conclusions
The final model seems to describe the TV cancelations for this data set well, but the
model may not be generalizable, because there may be groupings to the data points that
weren’t taken into account. However, the favorable AUC and 5-fold validation are positive
signs that it may still discriminate well in outside data sets.
Potential issues with grouping in the model arise in that it may matter what other
shows are in competition in any given year, as it’s unclear if shows are fighting for a fixed
share of viewers in a year. It seems possible that shows have the potential to increase
viewership overall and fight for ratings independently of other shows, but more likely
ratings depend on other shows in the same year, and shows are fighting for a ‘slice of the
pie’ of viewers.
Additionally, I would’ve liked to add some type of variable for the genre of a show, as
that would tie together similar types of shows in the way that the categorical network
variable ties together types of channels. Currently the scripted variable starts to do this,
but there are more classifications like ‘drama’ and ‘comedy’ that may have expected viewership
levels, and shows that fall short of them may be canceled. Finally, show air time seems like it
would be an interesting statistic to keep on the shows, as it would help normalize ratings and
create a “good rating for that time” type of idea, so that shows that air at times that are
traditionally associated with lower viewership are evaluated properly.
Overall, the model fit is very successful, and has applications that could prove useful
to the general television production community.
Applications
The television cancelation model’s coefficients can help us draw some interesting
conclusions from the effect that each coefficient has in the model
(positive/negative, and overall magnitude).
1. Scripted shows are more likely to be canceled than unscripted shows, as shown by
the positive coefficient on Script (2.64558). This could likely be related to the fact
that there are more scripted shows that air in a given year, but the reality remains,
scripted shows are more likely to be canceled.
2. Previous year demographic ratings have the largest estimated ratings effect on
cancelation (PrevDemo -2.64102), followed by current year demographic ratings
(Demo -2.22992), overall viewers (Viewers -0.77149) and previous year overall
viewers (PrevViewers 0.02498). This is interesting, as it points to executives caring
more about attracting the 18-49 demographic that advertisers often target than
overall viewership. This could have broader implications for television, and provide
reasoning for the common criticism of television that ‘minorities are
underrepresented’ by showing that studios have to appeal to the majority voices in
the 18-49 demographic, as dictated by advertisers.
3. It matters what network a show airs on. Some shows may be able to avoid
cancelation by airing on a specific network, as it appears each network has its own
standards for ratings resulting in cancelation, shown by the significance of the
network categorical variable.
Appendix
Load Data
data1 <- read.delim(file.choose(),header=T)
attach(data1)
Final Model
model <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+
Script+as.factor(Network),family=binomial)
summary(model)
Step 1:
mod.demo <-glm(Cancel~Demo,family=binomial)
mod.prevdemo <-glm(Cancel~PrevDemo,family=binomial)
mod.viewers <-glm(Cancel~Viewers,family=binomial)
mod.prevviewers <-glm(Cancel~PrevViewers,family=binomial)
mod.script <-glm(Cancel~Script,family=binomial)
mod.broadcast <-glm(Cancel~Broadcast,family=binomial)
mod.network <-glm(Cancel~as.factor(Network),family=binomial)
#Examine p-values
round(coef(summary(mod.demo)),3)
round(coef(summary(mod.prevdemo)),3)
round(coef(summary(mod.viewers)),3)
round(coef(summary(mod.prevviewers)),3)
round(coef(summary(mod.script)),3)
round(coef(summary(mod.broadcast)),3)
round(coef(summary(mod.network)),3)
G.demo <- mod.demo$null.deviance-mod.demo$deviance
G.prevdemo <- mod.prevdemo$null.deviance-mod.prevdemo$deviance
G.viewers <- mod.viewers$null.deviance-mod.viewers$deviance
G.prevviewers <- mod.prevviewers$null.deviance-mod.prevviewers$deviance
G.script <- mod.script$null.deviance-mod.script$deviance
G.broadcast <- mod.broadcast$null.deviance-mod.broadcast$deviance
G.network <- mod.network$null.deviance-mod.network$deviance
Step 2:
mod.1.reduce<- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script,
family=binomial)
summary(mod.1.reduce)
Step 3:
mod.1.reduce <- glm(Cancel~Demo+Viewers+Script+as.factor(Network), family=binomial)
mod.1.test <- glm(Cancel~Demo+PrevDemo+Viewers+Script+as.factor(Network),
family=binomial)
#Compare the coefficients shared by both models (drop the intercept)
betas.wo.PrevDemo <- mod.1.reduce$coefficients[-1]
betas.withPrevDemo <- mod.1.test$coefficients[names(betas.wo.PrevDemo)]
#Percent change in each coefficient when PrevDemo is removed
100*(betas.wo.PrevDemo-betas.withPrevDemo)/betas.withPrevDemo
Step 5:
##taken from http://thestatsgeek.com/2014/09/13/checking-functional-form-in-logistic-regression-using-loess/
logitloess <- function(x, y, s) {
  logit <- function(pr) {
    log(pr/(1-pr))
  }
  if (missing(s)) {
    locspan <- 0.7
  } else {
    locspan <- s
  }
  loessfit <- predict(loess(y~x,span=locspan))
  pi <- pmax(pmin(loessfit,0.9999),0.0001)
  logitfitted <- logit(pi)
  plot(x, logitfitted, ylab="logit")
}
logitloess(Demo,Cancel,0.8)
logitloess(PrevDemo,Cancel,0.8)
logitloess(Viewers,Cancel,0.8)
logitloess(PrevViewers,Cancel,0.8)
Linear Splines
#Failed method – not used in model
#my.4splines is a helper function that is not shown in this appendix
#Create knots for both demo and viewers
knotsdem <- c(.5,2.2,3.5)
knotsview <- c(.2,5.5,13)
Demo.splines <- my.4splines(Demo,knotsdem)
PrevDemo.splines <- my.4splines(PrevDemo,knotsdem)
Viewers.splines <- my.4splines(Viewers,knotsview)
PrevViewers.splines <- my.4splines(PrevViewers,knotsview)
mod.linearsplines <-
glm(Cancel~Demo.splines+PrevDemo.splines+Viewers.splines+PrevViewers.splines+Script+
as.factor(Network),family=binomial)
Fractional Polynomials
mod<- mfp(Cancel~fp(Demo)+fp(PrevDemo)+fp(Viewers)+fp(PrevViewers)+Script+
as.factor(Network), family=binomial, data=data1, verbose=T)
mod<- glm(Cancel~Demo+DemoSQRT+PrevDemo+Viewers+ViewersSQ+PrevViewers
+Script+ as.factor(Network), family=binomial)
Step 6:
#Fit Models for all possible interactions
mod.int01 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Demo*PrevDemo, family=binomial)
mod.int02 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Demo*Viewers, family=binomial)
mod.int03 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Demo*PrevViewers, family=binomial)
mod.int04 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Demo*Script, family=binomial)
mod.int05 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Demo*as.factor(Network), family=binomial)
mod.int06 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+PrevDemo*Viewers, family=binomial)
mod.int07 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+PrevDemo*PrevViewers, family=binomial)
mod.int08 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+PrevDemo*Script, family=binomial)
mod.int09 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+PrevDemo*as.factor(Network), family=binomial)
mod.int10 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Viewers*PrevViewers, family=binomial)
mod.int11 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Viewers*Script, family=binomial)
mod.int12 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Viewers*as.factor(Network), family=binomial)
mod.int13 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+PrevViewers*Script, family=binomial)
mod.int14 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+PrevViewers*as.factor(Network), family=binomial)
mod.int15 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
as.factor(Network)+Script*as.factor(Network), family=binomial)
coef(summary(mod.int01))[16,4]
coef(summary(mod.int02))[16,4]
coef(summary(mod.int03))[16,4]
coef(summary(mod.int04))[16,4]
coef(summary(mod.int05))[16,4]
coef(summary(mod.int06))[16,4]
coef(summary(mod.int07))[16,4]
coef(summary(mod.int08))[16,4]
coef(summary(mod.int09))[16,4]
coef(summary(mod.int10))[16,4]
coef(summary(mod.int11))[16,4]
coef(summary(mod.int12))[16,4]
coef(summary(mod.int13))[16,4]
coef(summary(mod.int14))[16,4]
coef(summary(mod.int15))[16,4]
Diagnostics
Hosmer-Lemeshow
#Perform Hosmer-Lemeshow (HL) test via R package ResourceSelection
require(ResourceSelection)
mod<- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network),
family=binomial)
hoslem.test(Cancel,mod$fitted.values,g=10)
Cook’s distances
#Design matrix from the fitted model (intercept plus one column per coefficient)
X <- model.matrix(mod)
#Fitted probabilities for all observations, so indices match the data rows
p.hats <- mod$fitted.values
V <- diag(p.hats*(1-p.hats))
hs <- diag(sqrt(V)%*%X%*%solve(t(X)%*%V%*%X)%*%t(X)%*%sqrt(V))
rs <- (Cancel-p.hats)/sqrt(p.hats*(1-p.hats))
delta.chis <- rs^2/(1-hs)
#obtain delta betas (cook's distances)
delta.beta <- rs^2*hs/(1-hs)^2
##obtain delta deviances
#first obtain deviance residuals
ds <- resid(mod)
delta.D <- ds^2/(1-hs)
#Examine observations with large values of diagnostics
which.max(delta.beta)
#Observation 115 is the max; refit without it and compute percent changes in coefficients
delta.beta[115]
mod2 <- glm(Cancel[-115]~Demo[-115]+PrevDemo[-115]+Viewers[-115]+PrevViewers[-115]+Script[-115]+as.factor(Network)[-115], family=binomial)
100*(mod2$coefficients-mod$coefficients)/mod$coefficients
K-fold cross validation
##Create folds from original data
#First k=5 folds, i.e., cut X and y into fifths
#Columns 2:6 of X are Demo, PrevDemo, Viewers, PrevViewers, and Script;
#the intercept column is left out because glm adds its own
X.f1 <- as.matrix(X[1:56,2:6])
X.f2 <- as.matrix(X[57:112,2:6])
X.f3 <- as.matrix(X[113:168,2:6])
X.f4 <- as.matrix(X[169:225,2:6])
X.f5 <- as.matrix(X[226:282,2:6])
y.f1 <- Cancel[1:56]
y.f2 <- Cancel[57:112]
y.f3 <- Cancel[113:168]
y.f4 <- Cancel[169:225]
y.f5 <- Cancel[226:282]
##Next, create training sets. When using Fold 1 as validation set (X.f1 and y.f1), then all
##other folds combined are training set, and so on.
X.t1 <- rbind(X.f2,X.f3,X.f4,X.f5)
X.t2 <- rbind(X.f1,X.f3,X.f4,X.f5)
X.t3 <- rbind(X.f1,X.f2,X.f4,X.f5)
X.t4 <- rbind(X.f1,X.f2,X.f3,X.f5)
X.t5 <- rbind(X.f1,X.f2,X.f3,X.f4)
y.t1 <- c(y.f2,y.f3,y.f4,y.f5)
y.t2 <- c(y.f1,y.f3,y.f4,y.f5)
y.t3 <- c(y.f1,y.f2,y.f4,y.f5)
y.t4 <- c(y.f1,y.f2,y.f3,y.f5)
y.t5 <- c(y.f1,y.f2,y.f3,y.f4)
###Now, use each training set to fit a regression model and each Fold as a validation set,
###recording the error rate each time
##Fold 1 as validation
mod1 <- glm(y.t1~X.t1,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi1 <- plogis(cbind(1,X.f1)%*%mod1$coefficients)
yhat1 <- round(pi1)
#Compute error rate, i.e., disagreement between predictions and actual y values in validation set
err.rate1 <- length(which (y.f1 != yhat1))/length(y.f1)
err.rate1
##Fold 2 as validation
mod2 <- glm(y.t2~X.t2,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi2 <- plogis(cbind(1,X.f2)%*%mod2$coefficients)
yhat2 <- round(pi2)
#Compute error rate, i.e., disagreement between predictions and actual y values in validation set
err.rate2 <- length(which (y.f2 != yhat2))/length(y.f2)
err.rate2
##Fold 3 as validation
mod3 <- glm(y.t3~X.t3,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi3 <- plogis(cbind(1,X.f3)%*%mod3$coefficients)
yhat3 <- round(pi3)
#Compute error rate, i.e., disagreement between predictions and actual y values in validation set
err.rate3 <- length(which (y.f3 != yhat3))/length(y.f3)
err.rate3
##Fold 4 as validation
mod4 <- glm(y.t4~X.t4,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi4 <- plogis(cbind(1,X.f4)%*%mod4$coefficients)
yhat4 <- round(pi4)
#Compute error rate, i.e., disagreement between predictions and actual y values in validation set
err.rate4 <- length(which (y.f4 != yhat4))/length(y.f4)
err.rate4
##Fold 5 as validation
mod5 <- glm(y.t5~X.t5,family=binomial)
#Compute fitted values for validation set data using coefficients from training model
pi5 <- plogis(cbind(1,X.f5)%*%mod5$coefficients)
yhat5 <- round(pi5)
#Compute error rate, i.e., disagreement between predictions and actual y values in validation set
err.rate5 <- length(which (y.f5 != yhat5))/length(y.f5)
err.rate5
#compute mean error rate over five folds
mean(c(err.rate1,err.rate2,err.rate3,err.rate4,err.rate5))
Video lectures for b.tech
 
Statistics project2
Statistics project2Statistics project2
Statistics project2
 
MNIST 10-class Classifiers
MNIST 10-class ClassifiersMNIST 10-class Classifiers
MNIST 10-class Classifiers
 
Report
ReportReport
Report
 
Six sigma pedagogy
Six sigma pedagogySix sigma pedagogy
Six sigma pedagogy
 
Six sigma
Six sigma Six sigma
Six sigma
 
VARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIP
VARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIPVARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIP
VARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIP
 
Neal-DeignReport
Neal-DeignReportNeal-DeignReport
Neal-DeignReport
 
Advanced Econometrics L13-14.pptx
Advanced Econometrics L13-14.pptxAdvanced Econometrics L13-14.pptx
Advanced Econometrics L13-14.pptx
 
VARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIP
VARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIPVARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIP
VARIATION-FREE WATERMARKING TECHNIQUE BASED ON SCALE RELATIONSHIP
 

ACMS TV Ratings Midterm Angelini

  • 3. Purposeful Selection of Covariates

Step 1: Create a univariable logistic regression model for each covariate.

Covariate              Coeff.   Std. Err.  Odds Ratio  G        p
Demo                   -1.389   0.238      0.24935     46.57    0.00
PrevDemo               -2.592   0.429      0.07487     91.10    0.000
Viewers                -0.293   0.052      0.74602     42.92    0.000
PrevViewers            -0.568   0.101      0.56666     87.32    0.00
Script                  1.999   0.612      7.3817      17.51    0.001
Broadcast              -0.013   0.310      0.98708     0.00179  0.966
as.factor(Network)2    -0.581   0.427      0.55934     6.89     0.174
as.factor(Network)3    -0.899   0.812      0.40698     6.89     0.268
as.factor(Network)4    -0.150   0.455      0.86071     6.89     0.742
as.factor(Network)5     0.112   0.905      1.1185      6.89     0.901
as.factor(Network)6    -0.398   0.709      0.67166     6.89     0.574
as.factor(Network)7     0.295   0.776      1.3431      6.89     0.704
as.factor(Network)8     0.208   0.373      1.2312      6.89     0.578
as.factor(Network)9    -0.111   0.647      0.89494     6.89     0.864
as.factor(Network)10    0.623   0.660      1.8645      6.89     0.345

(G for Network is the single statistic for the whole factor, repeated on each row; the p-values on the Network rows are per-level.)

All appear to be significant at this stage except for Broadcast, so a model will be fit with all variables except Broadcast.

Step 2: Fit a multivariable model that contains all covariates that are significant in univariable analysis at the 25% level.

Covariate              Coeff.   Std. Err.  Z        Pr(>|z|)
Demo                   -2.230   1.112      -2.006   0.04489 *
PrevDemo               -2.641   1.564      -1.689   0.09120 .
Viewers                -0.771   0.302      -2.554   0.01064 *
PrevViewers             0.025   0.347       0.072   0.94257
Script                  2.646   0.836       3.163   0.00156 **
as.factor(Network)2     2.054   0.953       2.154   0.03121 *
as.factor(Network)3    -6.661   1.331      -5.004   5.62e-07 ***
as.factor(Network)4    -1.636   0.837      -1.954   0.05073 .
as.factor(Network)5    -6.038   1.4209     -4.250   2.14e-05 ***
as.factor(Network)6    -6.900   1.3068     -5.280   1.29e-07 ***
as.factor(Network)7    -6.615   1.3516     -4.894   9.89e-07 ***
as.factor(Network)8    -0.730   0.6582     -1.110   0.26698
as.factor(Network)9    -7.299   1.3013     -5.609   2.03e-08 ***
as.factor(Network)10   -4.744   1.212      -3.915   9.03e-05 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

PrevDemo, PrevViewers, and Network level 8 do not appear significant at the 5% level, so each is checked with a likelihood ratio test. The likelihood ratio p-value is 0.0923 for PrevDemo and 0.94275 for PrevViewers (both > 0.05), so both should be excluded from the model at this point. Excluding Network gives a p-value of 1.589e-12, so Network should be kept in the model.
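The likelihood ratio comparisons in Step 2 can be sketched as follows. This is a minimal illustration on simulated data, not the show data: the data frame `sim` and its columns `x1`, `x2`, and `y` are placeholders. The statistic G is the drop in deviance between the nested models, referred to a chi-square distribution.

```r
# Simulated stand-in data (assumption: not the TV ratings data set)
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1))

# Full model and the reduced model that drops x2
full    <- glm(y ~ x1 + x2, family = binomial, data = sim)
reduced <- glm(y ~ x1, family = binomial, data = sim)

# Likelihood ratio statistic: difference in residual deviances
G  <- reduced$deviance - full$deviance
df <- reduced$df.residual - full$df.residual
p_value <- pchisq(G, df = df, lower.tail = FALSE)

# The same test through anova()
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)
```

A large p-value suggests the dropped covariate adds little once the others are in the model; whether to drop it still depends on the confounding check in Step 3.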
  • 4. The new reduced model does not include the previous-year covariates (insignificant in the multivariable model in Step 2) or whether the show is on broadcast TV (insignificant in the univariable model in Step 1).

Reduced Model

Covariate              Coeff.    Std. Err.  Z        Pr(>|z|)
Demo                   -2.7532   0.8836     -3.116   0.00183 **
Viewers                -0.8748   0.2209     -3.959   7.51e-05 ***
Script                  3.0433   0.7953      3.827   0.00013 ***
as.factor(Network)2     2.0340   0.8442      2.409   0.01598 *
as.factor(Network)3    -7.2073   1.2492     -5.769   7.95e-09 ***
as.factor(Network)4    -2.1870   0.7657     -2.856   0.00429 **
as.factor(Network)5    -6.7336   1.3414     -5.020   5.18e-07 ***
as.factor(Network)6    -7.3584   1.2303     -5.981   2.22e-09 ***
as.factor(Network)7    -7.0426   1.2802     -5.501   3.77e-08 ***
as.factor(Network)8    -0.9525   0.5869     -1.623   0.10460
as.factor(Network)9    -7.6447   1.2329     -6.201   5.63e-10 ***
as.factor(Network)10   -5.2617   1.1354     -4.634   3.58e-06 ***

Step 3: Check whether covariates removed from the model in Step 2 confound, or are needed to adjust, the effects of the covariates remaining in the model.

With the removal of PrevDemo, the coefficients for Demo (21%), Network 4 (33%), and Network 8 (29%) change by more than 20%. Removing PrevViewers changes the coefficients for Viewers (58%) and Script (24%) by more than 20%. Both PrevDemo and PrevViewers are therefore adjusters and should be left in the model. It also makes intuitive sense that a show is judged not only on the current year's performance but also on the previous year's, so both will be kept in the model.

Step 4: In univariable analysis (Step 1) the Broadcast covariate was deemed insignificant (p = 0.966). Adding Broadcast to the current model changes some of the Network coefficients slightly, but Broadcast is a classification of the networks (ABC, CBS, FOX, and NBC are broadcast channels; the others are not), so its effect is likely already captured by the Network categorical variable. Broadcast will be left out of the model.

Preliminary Main Effects Model:

Covariate              Coeff.   Std. Err.  Z        Pr(>|z|)
Demo                   -2.230   1.112      -2.006   0.04489 *
PrevDemo               -2.641   1.564      -1.689   0.09120 .
Viewers                -0.771   0.302      -2.554   0.01064 *
PrevViewers             0.025   0.347       0.072   0.94257
Script                  2.646   0.836       3.163   0.00156 **
as.factor(Network)2     2.054   0.953       2.154   0.03121 *
as.factor(Network)3    -6.661   1.331      -5.004   5.62e-07 ***
as.factor(Network)4    -1.636   0.837      -1.954   0.05073 .
as.factor(Network)5    -6.038   1.4209     -4.250   2.14e-05 ***
as.factor(Network)6    -6.900   1.3068     -5.280   1.29e-07 ***
as.factor(Network)7    -6.615   1.3516     -4.894   9.89e-07 ***
as.factor(Network)8    -0.730   0.6582     -1.110   0.26698
as.factor(Network)9    -7.299   1.3013     -5.609   2.03e-08 ***
as.factor(Network)10   -4.744   1.212      -3.915   9.03e-05 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • 5. Step 5: Check the linearity of the remaining continuous covariates (Demo, PrevDemo, Viewers, and PrevViewers) using lowess plots, and account for any nonlinearity.

[Figure: lowess plots of the logit against Demo, PrevDemo, Viewers, and PrevViewers]

The lowess plots for all 4 continuous covariates appear to be non-linear, so it appears necessary to use quartile design variables, fractional polynomials, or linear splines. We first try linear splines to get the best possible fit, and test that fit. The linear splines model appeared to be overfit and yielded huge standard errors, so we'll try fractional polynomials instead.
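The linear spline terms tried in Step 5 can be sketched as below. The helper name `linear_spline_basis` and the example vector `x` are assumptions for illustration; the knots 0.5, 2.2, 3.5 are the ones the appendix lists for Demo. The source's own helper, `my.4splines`, is not shown, so this may differ from it in detail.

```r
# Hypothetical helper: piecewise-linear basis with three knots.
# Each column carries the part of x that falls inside one interval,
# so a separate slope can be estimated per interval.
linear_spline_basis <- function(x, knots) {
  stopifnot(length(knots) == 3, !is.unsorted(knots))
  cbind(
    l1 = pmin(x, knots[1]),
    l2 = pmax(pmin(x, knots[2]) - knots[1], 0),
    l3 = pmax(pmin(x, knots[3]) - knots[2], 0),
    l4 = pmax(x - knots[3], 0)
  )
}

x <- c(0.2, 1.0, 3.0, 4.1)   # made-up rating values
B <- linear_spline_basis(x, knots = c(0.5, 2.2, 3.5))
print(B)
```

The four columns add back up to `x`, which is a quick sanity check on the construction. In a logistic fit, estimates many orders of magnitude larger than the data typically signal that the fitting algorithm did not converge, rather than a real effect.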
  • 6. Rejected linear splines model

Covariate                 Estimate     Std. Err.   Z value      P value
Demo.splinesx.l1          -2.411e+15   9.919e+07   -24305729    <2e-16 ***
Demo.splinesx.l2          -5.640e+14   2.299e+07   -24526083    <2e-16 ***
Demo.splinesx.l3           2.843e+15   4.082e+07    69649357    <2e-16 ***
Demo.splinesx.l4          -2.774e+14   4.816e+07   -5760509     <2e-16 ***
PrevDemo.splinesx.l1       1.982e+15   7.734e+07    25626005    <2e-16 ***
PrevDemo.splinesx.l2      -8.001e+14   2.453e+07   -32612936    <2e-16 ***
PrevDemo.splinesx.l3      -9.628e+14   5.144e+07   -18715833    <2e-16 ***
PrevDemo.splinesx.l4      -9.876e+14   8.760e+07   -11274429    <2e-16 ***
Viewers.splinesx.l1        NA          NA           NA          NA
Viewers.splinesx.l2        1.401e+13   8.712e+06    1608118     <2e-16 ***
Viewers.splinesx.l3       -7.136e+14   5.992e+06   -119077111   <2e-16 ***
Viewers.splinesx.l4       -8.728e+14   2.426e+07   -35975439    <2e-16 ***
PrevViewers.splinesx.l1   -1.852e+15   1.407e+08   -13159211    <2e-16 ***
PrevViewers.splinesx.l2   -1.942e+14   7.531e+06   -25786173    <2e-16 ***
PrevViewers.splinesx.l3    1.676e+14   6.477e+06    25883316    <2e-16 ***
PrevViewers.splinesx.l4    1.424e+15   2.930e+07    48600026    <2e-16 ***
Script                     1.468e+15   1.268e+07    115751505   <2e-16 ***
as.factor(Network)2        9.463e+14   1.464e+07    64632365    <2e-16 ***
as.factor(Network)3       -2.331e+15   2.670e+07   -87309082    <2e-16 ***
as.factor(Network)4       -6.593e+14   1.634e+07   -40359985    <2e-16 ***
as.factor(Network)5       -2.173e+15   3.583e+07   -60636932    <2e-16 ***
as.factor(Network)6       -2.439e+15   2.918e+07   -83604516    <2e-16 ***
as.factor(Network)7       -3.225e+15   3.421e+07   -94264401    <2e-16 ***
as.factor(Network)8       -2.763e+14   1.215e+07   -22745070    <2e-16 ***
as.factor(Network)9       -2.602e+15   3.389e+07   -76767104    <2e-16 ***
as.factor(Network)10      -1.791e+15   2.764e+07   -64791116    <2e-16 ***

To determine the fractional polynomials, we'll attempt a fit using the mfp package in R:

    fit1 <- mfp(Cancel ~ fp(Demo) + fp(PrevDemo) + fp(Viewers) + fp(PrevViewers) +
                Script + as.factor(Network),
                family=binomial, data=data1, verbose=T)

With fractional polynomials, none of the 4 variables yields a p-value (from G-statistic testing) that merits adding a fractional polynomial of any degree, so the continuous covariates could reasonably remain linear. However, the lowess plots do not appear linear, so we'll try adding one-term fractional polynomials as a compromise between a more complex but statistically insignificant form and the linear form. The fractional polynomial output gives a first power (p1) of 0.5 for Demo, 1 for PrevDemo, 1 for Viewers, and -2 for PrevViewers.

  • 7.
Covariate              Estimate    Std. Error   z value   Pr(>|z|)
Demo                     7.4589    4.4412        1.679    0.093061 .
DemoSQRT               -23.7592    9.9072       -2.398    0.016477 *
PrevDemo                -0.1829    1.9413       -0.094    0.924937
Viewers                  2.3239    1.3430        1.730    0.083550 .
ViewersSQ               -0.2950    0.1372       -2.151    0.031488 *
PrevViewers             -0.5409    0.5159       -1.048    0.294423
Script                  19.1208    1137.0261     0.017    0.986583
as.factor(Network)2      3.5713    1.6306        2.190    0.028510 *
as.factor(Network)3     -5.5146    1.6626       -3.317    0.000911 ***
as.factor(Network)4     -1.2444    0.9568       -1.301    0.193417
as.factor(Network)5     -4.4888    1.7850       -2.515    0.011913 *
as.factor(Network)6     -5.7524    1.6927       -3.398    0.000678 ***
as.factor(Network)7     -4.9793    1.8046       -2.759    0.005793 **
as.factor(Network)8     -0.5257    0.6967       -0.755    0.450501
as.factor(Network)9     -7.1581    1.7966       -3.984    6.77e-05 ***
as.factor(Network)10    -5.2485    1.4327       -3.663    0.000249 ***

The introduction of fractional polynomials has changed the coefficients on Script and PrevDemo substantially and made them statistically insignificant, so at this point it seems rational to conclude that the fractional polynomials are overcomplicating the model. A linear splines model was attempted with similar results. While we'd like to account for the apparent non-linearity shown in the lowess plots, we cannot do so without severely altering the model or its coefficients. The fractional polynomials will be removed from the model, and the continuous covariates will remain assumed to be linear.

Step 6: Explore possible interactions among main effects.

15 models were fit, one for each pairwise interaction between covariates, and none was significant at the 5% level. The smallest p-values were for Demo and Network (6.7%) and for PrevViewers and Network (5.2%), but neither makes sense in context. If demo ratings interacted with network, that should hold in every year, yet we don't see it for PrevDemo and Network. Likewise, an interaction between PrevViewers and Network would suggest one between Viewers and Network, which we don't see either. Given this reasoning and the lack of significance at the 5% level, we won't include any interactions in the model.
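The interaction screen in Step 6 can be sketched as a loop that adds one pairwise interaction at a time and tests it against the main effects model with a likelihood ratio test. This is a minimal illustration on simulated data; `sim`, `a`, `b`, and `g` are placeholder names. Testing the whole interaction at once is an assumption here, and it differs from reading off a single coefficient row, which matters when one of the terms is a factor with several levels.

```r
# Simulated stand-in data with two continuous covariates and one factor
set.seed(2)
n <- 300
sim <- data.frame(a = rnorm(n), b = rnorm(n),
                  g = factor(sample(1:3, n, replace = TRUE)))
sim$y <- rbinom(n, 1, plogis(0.3 * sim$a - 0.4 * sim$b))

# Main effects model
main <- glm(y ~ a + b + g, family = binomial, data = sim)

# Every pair of main effects; six covariates would give choose(6, 2) = 15 pairs
pairs <- combn(c("a", "b", "g"), 2, simplify = FALSE)

p_values <- sapply(pairs, function(v) {
  # Add the single interaction term v[1]:v[2] to the main effects formula
  f <- update(formula(main), paste(". ~ . +", paste(v, collapse = ":")))
  fit <- glm(f, family = binomial, data = sim)
  # Likelihood ratio test of the interaction as a whole
  anova(main, fit, test = "Chisq")[2, "Pr(>Chi)"]
})
names(p_values) <- sapply(pairs, paste, collapse = ":")
print(round(p_values, 4))
```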
  • 8. Preliminary final model

Covariate              Coeff.   Std. Err.  Z        Pr(>|z|)
Demo                   -2.230   1.112      -2.006   0.04489 *
PrevDemo               -2.641   1.564      -1.689   0.09120 .
Viewers                -0.771   0.302      -2.554   0.01064 *
PrevViewers             0.025   0.347       0.072   0.94257
Script                  2.646   0.836       3.163   0.00156 **
as.factor(Network)2     2.054   0.953       2.154   0.03121 *
as.factor(Network)3    -6.661   1.331      -5.004   5.62e-07 ***
as.factor(Network)4    -1.636   0.837      -1.954   0.05073 .
as.factor(Network)5    -6.038   1.4209     -4.250   2.14e-05 ***
as.factor(Network)6    -6.900   1.3068     -5.280   1.29e-07 ***
as.factor(Network)7    -6.615   1.3516     -4.894   9.89e-07 ***
as.factor(Network)8    -0.730   0.6582     -1.110   0.26698
as.factor(Network)9    -7.299   1.3013     -5.609   2.03e-08 ***
as.factor(Network)10   -4.744   1.212      -3.915   9.03e-05 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
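Written out, the preliminary final model has the form below. The symbols \beta_0 and \gamma are notation introduced here for illustration; the intercept \beta_0 is not reported in the table, so it is left symbolic. Because R's default coding treats the first factor level as the reference, ABC (Network 1) has \gamma_1 = 0, and each other \gamma_k is the tabulated Network coefficient (for example \gamma_2 = 2.054 for CBS).

```latex
\operatorname{logit}\,\Pr(\text{Cancel}=1)
  = \beta_0
  - 2.230\,\text{Demo}
  - 2.641\,\text{PrevDemo}
  - 0.771\,\text{Viewers}
  + 0.025\,\text{PrevViewers}
  + 2.646\,\text{Script}
  + \gamma_{\text{Network}},
\qquad
\operatorname{logit}(p) = \log\frac{p}{1-p}
```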
  • 9. Diagnostics and Results

Hosmer-Lemeshow Statistic
A goodness-of-fit test that compares observed outcomes with fitted probabilities. The null hypothesis is that the model fits, and the alternative is that the fit is poor; the null is rejected when the p-value falls below the chosen level (typically 5%).
H-L values: X-squared = 4.2516, df = 8, p-value = 0.8337
We fail to reject the null and conclude that there is adequate agreement between the observed and fitted values.

Cook's Distances
A way to determine the effect of "outlier" data points on the coefficients. Cook's distances are computed, the observation with the maximum distance is identified, and its relative effect on the model's coefficients is determined. If any data point has a >20% effect, it is removed and the model is fit again. Observation 115 has the maximum distance, and upon its removal none of the coefficients changes by the required 20%; the largest change is 3.93%, for PrevViewers.

K-Folds Validation
To evaluate the generalizability of the model we test it on subsets of the data. The data set is split into 5 subsets or 'folds', of sizes 56, 56, 56, 57, and 57. The mean error rate over the five folds is 0.3196115, which is acceptable for this type of model.

Fold 1 [1:56]   Fold 2 [57:112]   Fold 3 [113:168]   Fold 4 [169:225]   Fold 5 [226:282]
0.3035714       0.1428571         0.6428571          0.3333333          0.1754386

AUC
An Area Under the Curve analysis yields a value of 0.8687, which indicates very good discrimination between cancelled and renewed shows.
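The source does not show how the AUC was computed. One way to obtain it in base R is the rank formulation below, which gives the probability that a randomly chosen cancelled show receives a higher fitted probability than a randomly chosen renewed one. The function name `auc_rank` and the small example vectors are assumptions for illustration.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity.
# y is a 0/1 outcome vector, p the fitted probabilities.
auc_rank <- function(y, p) {
  r  <- rank(p)                 # average ranks handle tied probabilities
  n1 <- sum(y == 1)             # number of events
  n0 <- sum(y == 0)             # number of non-events
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up outcomes and fitted probabilities
y <- c(0, 1, 0, 1)
p <- c(0.1, 0.2, 0.8, 0.9)
auc_rank(y, p)
```

For a fitted `glm` object this would be called as `auc_rank(Cancel, fitted(model))`, assuming the outcome vector and model names used in the appendix.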
  • 10. Conclusions

The final model seems to describe the TV cancelations in this data set well, but it may not be generalizable, because there may be groupings in the data points that weren't taken into account. However, the favorable AUC and 5-fold validation are positive signs that it may still discriminate well on outside data sets.

Potential issues with grouping arise because it may matter what other shows are in competition in a given year, and it's unclear whether shows are fighting for a fixed share of viewers. It seems possible that shows can increase overall viewership and compete for ratings independently of other shows, but more likely ratings depend on other shows in the same year, and shows are fighting for a 'slice of the pie' of viewers.

Additionally, I would've liked to add a variable for the genre of a show, as that would help tie together similar types of shows in the way the categorical Network variable does for types of channels. The Script variable starts to do this, but finer classifications like 'drama' and 'comedy' may have expected viewership levels, and shows that miss them may be canceled. Finally, show air time would be an interesting statistic to record, as it would help normalize ratings and create a "good rating for that time" measure, so that shows airing at times traditionally associated with lower viewership are evaluated properly.

Overall, the model fit is very successful, and it has applications that could prove useful to the general television production community.

Applications

The model's coefficients support some interesting conclusions, based on the sign and magnitude of each coefficient.

1. Scripted shows are more likely to be canceled than unscripted shows, as shown by the positive coefficient on Script (2.64558). This could be related to the fact that more scripted shows air in a given year, but the result remains: scripted shows are more likely to be canceled.

2. Previous year demographic ratings have the largest coefficient among the ratings covariates (PrevDemo -2.64102), followed by current year demographic ratings (Demo -2.22992), overall viewers (Viewers -0.77149), and previous year overall viewers (PrevViewers 0.02498). This is interesting, as it points to executives caring more about attracting the 18-49 demographic that advertisers often target than about overall viewership. This could have broader implications for television, and may offer reasoning for the common criticism that 'minorities are underrepresented', by showing that studios have to appeal to the majority voices in the 18-49 demographic, as dictated by advertisers.

3. It matters what network a show airs on. Some shows may be able to avoid cancelation by airing on a specific network, as each network appears to have its own ratings threshold for cancelation, shown by the significance of the Network categorical variable.
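To read the coefficients quoted above on the odds scale, each one can be exponentiated. The sketch below uses only the five values listed in the Applications section; the vector name `coefs` is introduced here for illustration. For example, the Script coefficient of 2.64558 corresponds to an odds ratio of about 14, holding the other covariates fixed.

```r
# Coefficients as quoted in the Applications section
coefs <- c(Demo = -2.22992, PrevDemo = -2.64102, Viewers = -0.77149,
           PrevViewers = 0.02498, Script = 2.64558)

# Odds ratio for a one-unit increase in each covariate
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

Values below 1 (Demo, PrevDemo, Viewers) mean higher ratings go with lower odds of cancelation; the covariates are on different scales, so the magnitudes are not directly comparable across rows.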
  • 11. Appendix

Load Data
    data1 <- read.delim(file.choose(), header=TRUE)
    attach(data1)

Final Model
    model <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+
                 Script+as.factor(Network), family=binomial)
    summary(model)

Step 1:
    mod.demo        <- glm(Cancel~Demo, family=binomial)
    mod.prevdemo    <- glm(Cancel~PrevDemo, family=binomial)
    mod.viewers     <- glm(Cancel~Viewers, family=binomial)
    mod.prevviewers <- glm(Cancel~PrevViewers, family=binomial)
    mod.script      <- glm(Cancel~Script, family=binomial)
    mod.broadcast   <- glm(Cancel~Broadcast, family=binomial)
    mod.network     <- glm(Cancel~as.factor(Network), family=binomial)
    #Examine p-values
    round(coef(summary(mod.demo)),3)
    round(coef(summary(mod.prevdemo)),3)
    round(coef(summary(mod.viewers)),3)
    round(coef(summary(mod.prevviewers)),3)
    round(coef(summary(mod.script)),3)
    round(coef(summary(mod.broadcast)),3)
    round(coef(summary(mod.network)),3)
    #G statistics: null deviance minus residual deviance
    G.demo        <- mod.demo$null.deviance-mod.demo$deviance
    G.prevdemo    <- mod.prevdemo$null.deviance-mod.prevdemo$deviance
    G.viewers     <- mod.viewers$null.deviance-mod.viewers$deviance
    G.prevviewers <- mod.prevviewers$null.deviance-mod.prevviewers$deviance
    G.script      <- mod.script$null.deviance-mod.script$deviance
    G.broadcast   <- mod.broadcast$null.deviance-mod.broadcast$deviance
    G.network     <- mod.network$null.deviance-mod.network$deviance

Step 2:
    mod.1.reduce <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script,
                        family=binomial)
    summary(mod.1.reduce)

Step 3:
    #Reduced model without PrevDemo and PrevViewers
    mod.1.reduce <- glm(Cancel~Demo+Viewers+Script+as.factor(Network),
                        family=binomial)
    #Same model with PrevDemo added back
    mod.1.test <- glm(Cancel~Demo+PrevDemo+Viewers+Script+as.factor(Network),
                      family=binomial)
  • 12.
    #Coefficients (intercept dropped) with and without PrevDemo
    betas.withPrevDemo <- mod.1.test$coefficients[-1]
    betas.wo.PrevDemo  <- mod.1.reduce$coefficients[-1]

Step 5:
    ##taken from http://thestatsgeek.com/2014/09/13/checking-functional-form-in-logistic-regression-using-loess/
    logitloess <- function(x, y, s) {
      logit <- function(pr) {
        log(pr/(1-pr))
      }
      if (missing(s)) {
        locspan <- 0.7
      } else {
        locspan <- s
      }
      loessfit <- predict(loess(y~x,span=locspan))
      #keep smoothed probabilities strictly inside (0,1) before taking the logit
      pi <- pmax(pmin(loessfit,0.9999),0.0001)
      logitfitted <- logit(pi)
      plot(x, logitfitted, ylab="logit")
    }
    logitloess(Demo,Cancel,0.8)
    logitloess(PrevDemo,Cancel,0.8)
    logitloess(Viewers,Cancel,0.8)
    logitloess(PrevViewers,Cancel,0.8)

Linear Splines
    #Failed method – not used in model
    #my.4splines is a helper whose definition is not included here
    #Create knots for both demo and viewers
    knotsdem  <- c(.5,2.2,3.5)
    knotsview <- c(.2,5.5,13)
    Demo.splines        <- my.4splines(Demo,knotsdem)
    PrevDemo.splines    <- my.4splines(PrevDemo,knotsdem)
    Viewers.splines     <- my.4splines(Viewers,knotsview)
    PrevViewers.splines <- my.4splines(PrevViewers,knotsview)
    mod.linearsplines <- glm(Cancel~Demo.splines+PrevDemo.splines+Viewers.splines+
                             PrevViewers.splines+Script+as.factor(Network),
                             family=binomial)

Fractional Polynomials
    library(mfp)
    mod <- mfp(Cancel~fp(Demo)+fp(PrevDemo)+fp(Viewers)+fp(PrevViewers)+Script+
               as.factor(Network), family=binomial, data=data1, verbose=T)
    #Assumption: DemoSQRT and ViewersSQ are the square root of Demo and the
    #square of Viewers; their construction is not shown in the original
    DemoSQRT  <- sqrt(Demo)
    ViewersSQ <- Viewers^2
    mod <- glm(Cancel~Demo+DemoSQRT+PrevDemo+Viewers+ViewersSQ+PrevViewers+
               Script+as.factor(Network), family=binomial)
  • 13. Step 6:
    #Fit models for all possible pairwise interactions
    mod.int01 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Demo*PrevDemo, family=binomial)
    mod.int02 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Demo*Viewers, family=binomial)
    mod.int03 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Demo*PrevViewers, family=binomial)
    mod.int04 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Demo*Script, family=binomial)
    mod.int05 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Demo*as.factor(Network), family=binomial)
    mod.int06 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+PrevDemo*Viewers, family=binomial)
    mod.int07 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+PrevDemo*PrevViewers, family=binomial)
    mod.int08 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+PrevDemo*Script, family=binomial)
    mod.int09 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+PrevDemo*as.factor(Network), family=binomial)
    mod.int10 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Viewers*PrevViewers, family=binomial)
    mod.int11 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Viewers*Script, family=binomial)
    mod.int12 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Viewers*as.factor(Network), family=binomial)
    mod.int13 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+PrevViewers*Script, family=binomial)
    mod.int14 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+PrevViewers*as.factor(Network), family=binomial)
    mod.int15 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+
                     as.factor(Network)+Script*as.factor(Network), family=binomial)
    #p-value in row 16, the first interaction row after the 15 main-effect rows
    #(for interactions with Network this is only the first of several rows)
    coef(summary(mod.int01))[16,4]
    coef(summary(mod.int02))[16,4]
    coef(summary(mod.int03))[16,4]
    coef(summary(mod.int04))[16,4]
    coef(summary(mod.int05))[16,4]
    coef(summary(mod.int06))[16,4]
    coef(summary(mod.int07))[16,4]
    coef(summary(mod.int08))[16,4]
    coef(summary(mod.int09))[16,4]
    coef(summary(mod.int10))[16,4]
    coef(summary(mod.int11))[16,4]
  • 14.
    coef(summary(mod.int12))[16,4]
    coef(summary(mod.int13))[16,4]
    coef(summary(mod.int14))[16,4]
    coef(summary(mod.int15))[16,4]

Diagnostics

Hosmer-Lemeshow
    #Perform Hosmer-Lemeshow (HL) test via R package ResourceSelection
    require(ResourceSelection)
    mod <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network),
               family=binomial)
    hoslem.test(Cancel,mod$fitted.values,g=10)

Cook's distances
    #Design matrix of the fitted model (Network expanded into dummy columns)
    X <- model.matrix(mod)
    #Use all fitted values; dropping the first one with [-1] shifts every index
    #by one, which likely explains the 114-versus-115 offset seen originally
    p.hats <- mod$fitted.values
    V <- diag(p.hats*(1-p.hats))
    hs <- diag(sqrt(V)%*%X%*%solve(t(X)%*%V%*%X)%*%t(X)%*%sqrt(V))
    rs <- (Cancel-p.hats)/sqrt(p.hats*(1-p.hats))
    delta.chis <- rs^2/(1-hs)
    #obtain delta betas (Cook's distances)
    delta.beta <- rs^2*hs/(1-hs)^2
    ##obtain delta deviances
    #first obtain deviance residuals
    ds <- resid(mod)
    delta.D <- ds^2/(1-hs)
    #Examine observations with large values of diagnostics
    which.max(delta.beta)
    delta.beta[which.max(delta.beta)]
    #Refit without observation 115 and compute percent change in coefficients
    mod2 <- glm(Cancel[-115]~Demo[-115]+PrevDemo[-115]+Viewers[-115]+
                PrevViewers[-115]+Script[-115]+as.factor(Network)[-115],
                family=binomial)
    100*(mod2$coefficients-mod$coefficients)/mod$coefficients

K-fold cross validation
    ##Create folds from original data
    #First k=5 folds, i.e., cut X and y into fifths
    #Columns 2:6 are Demo through Script; the intercept column is left out
    #because glm() adds its own, and Network is not part of these fold models
    X.f1 <- as.matrix(X[1:56,2:6])
    X.f2 <- as.matrix(X[57:112,2:6])
    X.f3 <- as.matrix(X[113:168,2:6])
    X.f4 <- as.matrix(X[169:225,2:6])
    X.f5 <- as.matrix(X[226:282,2:6])
    y.f1 <- Cancel[1:56]
    y.f2 <- Cancel[57:112]
    y.f3 <- Cancel[113:168]
    y.f4 <- Cancel[169:225]
    y.f5 <- Cancel[226:282]
    ##Next, create training sets. When using Fold 1 as validation set
    ##(X.f1 and y.f1), all other folds combined are the training set, and so on.
    X.t1 <- rbind(X.f2,X.f3,X.f4,X.f5)
    X.t2 <- rbind(X.f1,X.f3,X.f4,X.f5)
    X.t3 <- rbind(X.f1,X.f2,X.f4,X.f5)
    X.t4 <- rbind(X.f1,X.f2,X.f3,X.f5)
    X.t5 <- rbind(X.f1,X.f2,X.f3,X.f4)
    y.t1 <- c(y.f2,y.f3,y.f4,y.f5)
    y.t2 <- c(y.f1,y.f3,y.f4,y.f5)
    y.t3 <- c(y.f1,y.f2,y.f4,y.f5)
    y.t4 <- c(y.f1,y.f2,y.f3,y.f5)
    y.t5 <- c(y.f1,y.f2,y.f3,y.f4)
  • 15.
    ###Now, use each training set to fit a regression model and each fold as a
    ###validation set, recording the error rate each time
    ##Fold 1 as validation
    mod1 <- glm(y.t1~X.t1,family=binomial)
    #Compute fitted values for validation set using coefficients from training model
    pi1 <- plogis(cbind(1,X.f1)%*%mod1$coefficients)
    yhat1 <- round(pi1)
    #Compute error rate, i.e., share of validation predictions that differ from actual y
    err.rate1 <- length(which(y.f1 != yhat1))/length(y.f1)
    err.rate1
    ##Fold 2 as validation
    mod2 <- glm(y.t2~X.t2,family=binomial)
    pi2 <- plogis(cbind(1,X.f2)%*%mod2$coefficients)
    yhat2 <- round(pi2)
    err.rate2 <- length(which(y.f2 != yhat2))/length(y.f2)
    err.rate2
    ##Fold 3 as validation
    mod3 <- glm(y.t3~X.t3,family=binomial)
    pi3 <- plogis(cbind(1,X.f3)%*%mod3$coefficients)
    yhat3 <- round(pi3)
    err.rate3 <- length(which(y.f3 != yhat3))/length(y.f3)
    err.rate3
    ##Fold 4 as validation
    mod4 <- glm(y.t4~X.t4,family=binomial)
    pi4 <- plogis(cbind(1,X.f4)%*%mod4$coefficients)
    yhat4 <- round(pi4)
    err.rate4 <- length(which(y.f4 != yhat4))/length(y.f4)
    err.rate4
    ##Fold 5 as validation
    mod5 <- glm(y.t5~X.t5,family=binomial)
    pi5 <- plogis(cbind(1,X.f5)%*%mod5$coefficients)
    yhat5 <- round(pi5)
    err.rate5 <- length(which(y.f5 != yhat5))/length(y.f5)
    err.rate5
    #compute mean error rate over five folds
    mean(c(err.rate1,err.rate2,err.rate3,err.rate4,err.rate5))
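The fold-by-fold code above can be written more compactly as a loop. The sketch below is a variant rather than a reproduction: it assigns rows to folds at random instead of using contiguous blocks, which may matter if the rows of the data file are ordered (for example by network), and it runs on simulated data. The names `sim`, `x1`, `x2`, `fold_id`, and `err_rates` are placeholders.

```r
# Simulated stand-in data with the same number of rows as the show data
set.seed(3)
n <- 282
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.3 - 1.0 * sim$x1 + 0.5 * sim$x2))

# Randomly assign each row to one of k folds of near-equal size
k <- 5
fold_id <- sample(rep(1:k, length.out = n))

err_rates <- sapply(1:k, function(f) {
  train <- sim[fold_id != f, ]
  test  <- sim[fold_id == f, ]
  fit <- glm(y ~ x1 + x2, family = binomial, data = train)
  # Predicted probabilities for the held-out fold, classified at 0.5
  p_hat <- predict(fit, newdata = test, type = "response")
  mean((p_hat > 0.5) != test$y)
})

print(err_rates)
mean(err_rates)
```

Using `predict()` with `newdata` keeps factor covariates such as Network usable in the validation step without building design matrices by hand.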