The document analyzes a movie dataset from IMDB containing over 5,000 movie titles and attributes to predict movie success based on characteristics like Facebook likes. Two multiple linear regression models were created, one standard and one using stepwise variable selection. Sensitivity analysis found that total cast Facebook likes was most influential on gross revenue, while IMDB score and director Facebook likes were least influential. The analysis can help movie professionals and audiences predict success and spending.
IMDB Movie Dataset Analysis Predicts Box Office Success
IMDB Dataset
Aaron McClellan, Management & Strategic Leadership, Business Analytics
Introduction
For our final project, I have chosen to analyze a movie dataset. The dataset lists over 5,000 movie titles
with several different inputs to assist in the analysis. What I will be extracting from the dataset is the significance of
the attributes that result in a large gross revenue for a movie. The goal of analyzing this dataset is to figure out
which attributes are the most significant when determining the future success of a movie title before it is released. Critics
and human instinct, when it comes to movies, are sometimes unreliable. I want to be able to accurately predict which
attributes influence movie success based on several characteristics in specific areas such as Facebook and the IMDB site.
Background
Creating a predictive model for this dataset is not vital to human existence; however, it would be useful for some
movie-goers. This analysis pertains to the entertainment/movie industry. It can help producers, actors, actresses, directors,
film investors, and movie-goers determine how successful a proposed movie will be. Without predictive modeling,
there would only be gut decisions and personal preferences about how a movie will turn out. Not everyone thinks that a
certain actor or actress is amazing and therefore that the entire movie is amazing. Putting it in terms of analytical
processing makes the prediction more stable and unbiased. This project would be significant to the group of
people mentioned previously because it provides an unbiased predictive dataset that can be used to determine gross
revenue. Every producer and director believes their movie will be one of the greatest, and they will do everything in their
power to make it the greatest. However, the majority of the time, this turns out to be false. They can take this dataset and
implement it into their thought process when planning their movie. On the flip side, I often hear that people
will go see a movie and say "I just wasted x amount of money to see that horrible film!" Movie-goers can use this
dataset to make the same predictions once a movie is announced with its primary and supporting actors/actresses. It
could possibly save movie-goers money when debating whether to go see a movie or not.
Goals
There are several goals that I wish to achieve with this dataset:
Assist directors and producers in maximizing the potential revenue of a proposed film
Help movie-goers save money or spend money wisely when debating on seeing a new film
Gain practice in using multiple linear regression
Develop more skill in pre-processing techniques such as data partitioning and handling missing data
Learn more about the post-processing technique of sensitivity analysis
Literature Review
There are other people like me who have had the same idea of analyzing a movie database. One group
worked on the analysis of temporal multivariate networks derived from IMDB. They used methods such as (p,q)-core and
4-ring to identify subgraphs and short cycles [1]. Another group from Stony Brook University analyzed a movie
dataset using regression and k-nearest neighbor methods [2]. Another individual wanted to see how his movie preferences
correlated with Oscar-winning titles. He used a linear regression model for his analysis [3].
Methodology
I obtained my original dataset from the data science website Kaggle [4]. The original dataset contained 28 different
variables, both categorical and numerical. When I first started working on this dataset, I
wanted to include the majority of the variables in my analysis; however, I ran into a problem when I
tried to transform my categorical data into numerical data. XLMiner is a great software program that allows for this
type of transformation; however, my dataset contained numerous attributes with more than 30 different categories. For
example, there were more than 30 directors and more than 30 actors/actresses. Figure 1 shows a sample of the actors/actresses.
Figure 1 Figure 2
XLMiner has a limit of 30 distinct categories for a categorical variable. Because of this, I was forced to do one of two
things. The first was to use the Reduce Categories handle of XLMiner for all the categorical data. The only negative of this
is that it cuts out a lot of data and forces it into a category. Knowing that this isn't what I wanted to become of my
dataset, I had to take the other route, which was to pick and choose which attributes I deemed acceptable to use in my
analysis. So, I did not choose attributes such as director name and actor/actress name. I will explain further in the
pre-processing portion of this report.
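
The 30-category limit described above can be checked before any dummy variables are created. The following is a minimal sketch in Python rather than XLMiner, assuming the data are held as a list of dictionaries; the column names and sample rows are hypothetical and are not taken from the actual dataset.

```python
# Hypothetical sample rows; the real dataset has far more records and columns.
rows = [
    {"director_name": "Director A", "color": "Color"},
    {"director_name": "Director B", "color": "Color"},
    {"director_name": "Director C", "color": "Black and White"},
]

MAX_CATEGORIES = 30  # category limit as stated in the text


def distinct_counts(rows, columns):
    """Count the distinct non-empty values in each categorical column."""
    return {
        col: len({row[col] for row in rows if row.get(col) not in (None, "")})
        for col in columns
    }


def too_many_categories(rows, columns, limit=MAX_CATEGORIES):
    """Return the columns whose number of distinct values exceeds the limit."""
    counts = distinct_counts(rows, columns)
    return [col for col in columns if counts[col] > limit]


counts = distinct_counts(rows, ["director_name", "color"])
# A limit of 2 is used here only so the tiny sample triggers the check.
flagged = too_many_categories(rows, ["director_name", "color"], limit=2)
```

Columns returned by `too_many_categories` are the ones that would need either category reduction or removal from the analysis.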
Pre-processing
As stated before, I had to pick and choose which attributes to use in my modeling. There were a couple of attributes that
I thought were interesting and wanted to test for significance. They had to do with the number of likes on the social
media website Facebook, and the majority of the attributes in my analysis relate to this. In addition to not choosing some
attributes because of the categorical cap on creating dummies, I did not choose attributes that were really a
"make-or-break" attribute when it comes to the success of a film. The following attributes were eliminated from my analysis
during pre-processing: color, director name, actor 1 name, actor 2 name, actor 3 name, movie title, number of voted
users, face number in poster, plot keywords, movie IMDB link, number of users for reviews, language, country, content
rating, title year, budget, and aspect ratio. Figure 2 lists the attributes that I kept for the
analysis.
Once I determined the attributes to use, I started working with the data. I first noticed that there was a lot of
missing data in the dataset. I decided that missing data made an entire record unusable, because an incomplete
record would negatively affect my model, so getting rid of those records was my only option. I used the Missing Data
handle feature of XLMiner and deleted those records. After using this feature, the number of records in my dataset
decreased from 5,043 to 3,879.
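
The record deletion described above can be illustrated outside XLMiner. This is a minimal sketch in Python, assuming each record is a dictionary and that a missing value appears as `None` or an empty string; the column names and values are hypothetical.

```python
def drop_incomplete(records, columns):
    """Keep only the records that have a value for every listed column."""

    def is_missing(value):
        # Treat None and blank strings as missing (an assumption of this sketch).
        return value is None or (isinstance(value, str) and value.strip() == "")

    return [r for r in records if not any(is_missing(r.get(c)) for c in columns)]


# Hypothetical records, two of which are incomplete.
records = [
    {"gross": 1000.0, "imdb_score": 7.1},
    {"gross": None, "imdb_score": 6.4},
    {"gross": 2500.0, "imdb_score": ""},
    {"gross": 400.0, "imdb_score": 5.0},
]
complete = drop_incomplete(records, ["gross", "imdb_score"])
```

Deleting whole records is the simplest option; it trades sample size for a dataset with no gaps.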
With a new dataset free of missing data, I then partitioned the data. I used a 60/40 split, with 60%
assigned to training and 40% to validation. I chose to partition my data because I felt that it would help during
the performance evaluation. Partitioning the data into segments that are easily preserved and retrieved made my
performance evaluation run smoothly.
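
A 60/40 random partition like the one described above can be sketched as follows. This is an illustrative Python version, not XLMiner's Data Partition handle; the seed value and the use of integer stand-ins for records are assumptions made so the example is reproducible.

```python
import random


def partition(records, train_fraction=0.6, seed=12345):
    """Shuffle a copy of the records and split it into training and validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for the 3,879 complete records.
records = list(range(3879))
train, validation = partition(records)
```

Every record lands in exactly one of the two sets, so the validation set can be used to judge the model on data it was not fitted to.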
Model #1
For my first model, I chose to create a standard Multiple Linear Regression analysis to see which attributes were the
most sensitive when outputting gross revenue. When I first ran my analysis, I had included the variable budget.
After looking at my model, I saw that budget could be deemed an outlier that would skew my results when
determining the most sensitive attribute. Therefore, I decided that it did not fit with the rest of the variables and would
not be compared with the attributes listed in Figure 2. My output was gross revenue. For this first model, I did not use
any variable selection method, because I wanted to compare this model with my next model, which did use a variable selection
method. I took the data generated by XLMiner's Multiple Linear Regression handle and began a sensitivity analysis for
post-processing.
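
To show what a standard multiple linear regression fit computes, here is a minimal sketch in plain Python that solves the normal equations. It is not XLMiner's implementation, and the example data are made up so that the true coefficients are known.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values; a column of ones is added for
    the intercept. Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; X'X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up data generated from y = 5 + 2*x1 + 3*x2 with no noise.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 7]]
y = [5 + 2 * x1 + 3 * x2 for x1, x2 in X]
coefficients = fit_linear_regression(X, y)
```

The intercept and coefficients returned here play the same role as the values later used in the sensitivity analysis. For predictors on very different scales, a library routine based on QR or SVD would be numerically safer than the normal equations.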
Model #2
For my second model, I generated another Multiple Linear Regression. However, this time I used a variable selection
method to see how it would compare with a standard Multiple Linear Regression. I used the stepwise variable
selection method because it is a combination of the backward elimination and forward
selection methods, and I believed that it would give me a more accurate prediction. For this model, I used the
default values for FOUT (2.71) and FIN (3.84), with the same variables and the same output as my first model. After
running the model, I chose the last subset that was generated because it had the lowest Cp value as well as the highest
adjusted R-squared and probability. I again took the data generated and worked on a sensitivity analysis.
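
For reference, the quantities named above have the following standard textbook forms. XLMiner's internal formulas are not shown in this report, so these are the usual definitions, stated under the assumption that the software follows them. Here n is the number of training records, p is the number of predictors in the candidate subset, SSE_p is that subset's residual sum of squares, SST is the total sum of squares, and the variance estimate is the mean squared error of the model with all candidate predictors.

```latex
% Partial F statistic for adding (or removing) one predictor,
% comparing a model with p predictors to the one with p-1 predictors:
F = \frac{SSE_{p-1} - SSE_{p}}{SSE_{p} / (n - p - 1)}

% Adjusted R-squared for a subset with p predictors:
R^2_{adj} = 1 - \frac{SSE_{p} / (n - p - 1)}{SST / (n - 1)}

% Mallows' C_p for a subset with p predictors plus an intercept:
C_p = \frac{SSE_{p}}{\hat{\sigma}^2} - n + 2(p + 1)
```

Under the usual stepwise rule, a predictor enters the model when its partial F exceeds FIN (3.84 here) and is removed when its partial F falls below FOUT (2.71 here).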
Results
Both models that I created predict a continuous output. To analyze the models further, I needed a
post-processing method that corresponded with my methods. I chose to do a sensitivity analysis for both models to see how
the attributes in the standard Multiple Linear Regression compared with those in the stepwise Multiple Linear
Regression. I took the means, minimums, maximums, and steps of the original data and ran them through a
what-if analysis using 10 steps for each attribute. To compare the attributes, I then took the standard
deviation of the resulting outputs. I created three graphs after generating the standard deviations: 1) Sensitivity About the Mean,
2) Most Sensitive Attribute, and 3) Least Sensitive Attribute.
Performance Measures of Model #1
For my first model, I first listed the coefficients for each of the attributes as well as the intercept. I then gathered the
mean, minimum, maximum, and step for each of the attributes from the new dataset (after using the Missing Data handle). I
then calculated an output of gross revenue by taking the intercept plus the sum of the products of each attribute coefficient
and its mean. I then put this gross revenue number into the data tables for the what-if analysis. Before I could run
the what-if analysis, I had to insert values for each attribute in the data table. Each value was calculated as
the minimum plus the step number minus one, multiplied by the calculated step value: minimum + (step number - 1) x step.
Now my data table was ready for the what-if analysis. I used each attribute mean as the column input for the analysis.
After generating values for gross revenue, I took the standard deviation of those values to compare them. Figure 3 displays
the resulting standard deviations.
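
The what-if procedure described above can be sketched in Python. This is an illustration rather than the Excel workbook itself: the intercept, coefficients, and attribute statistics below are hypothetical, and the step size is assumed to be (maximum - minimum) / (steps - 1) so that 10 values span the full range. `statistics.stdev` is the sample standard deviation, which corresponds to Excel's STDEV.S.

```python
import statistics


def predict(intercept, coefs, inputs):
    """Intercept plus the sum of coefficient * input over all attributes."""
    return intercept + sum(coefs[a] * inputs[a] for a in coefs)


def sweep_values(minimum, maximum, steps=10):
    """Values minimum + (step number - 1) * step for step numbers 1..steps."""
    step = (maximum - minimum) / (steps - 1)  # assumed definition of the step
    return [minimum + (i - 1) * step for i in range(1, steps + 1)]


def sensitivity(intercept, coefs, stats, steps=10):
    """Vary one attribute at a time, holding the others at their means,
    and return the sample standard deviation of the predicted outputs."""
    means = {a: stats[a]["mean"] for a in coefs}
    result = {}
    for a in coefs:
        outputs = []
        for value in sweep_values(stats[a]["min"], stats[a]["max"], steps):
            inputs = dict(means)
            inputs[a] = value
            outputs.append(predict(intercept, coefs, inputs))
        result[a] = statistics.stdev(outputs)
    return result


# Hypothetical model and attribute statistics, for illustration only.
intercept = 100.0
coefs = {"cast_total_facebook_likes": 2.0, "imdb_score": 10.0}
stats = {
    "cast_total_facebook_likes": {"min": 0.0, "max": 900.0, "mean": 300.0},
    "imdb_score": {"min": 1.0, "max": 10.0, "mean": 6.0},
}
sens = sensitivity(intercept, coefs, stats)
most_sensitive = max(sens, key=sens.get)
```

Because the model is linear, each standard deviation equals the absolute coefficient times the standard deviation of the swept values, so attributes with very wide ranges, such as Facebook like counts, tend to rank as more sensitive than attributes on a narrow scale, such as a 1 to 10 score.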
Figure 3
Next, I looked at this graph to see which attributes were the most and least sensitive when it came to gross
revenue. As seen from the graph, cast total Facebook likes was the most sensitive attribute. It is hard to see from this
graph which attribute was the least sensitive; however, it was the IMDB score. It turns out that the IMDB score
has little influence on the gross revenue of a film.
Performance Measures of Model #2
The same process described above went into creating the sensitivity analysis for my second model. This time, the
stepwise Multiple Linear Regression model showed some changes. The first change was that it had a lower gross revenue
output. The second change was that it had a different least sensitive attribute. Figure 4 shows the standard
deviations of each attribute compared to one another.
Figure 4
[Two charts titled "Sensitivity About The Mean"; x-axis: Attribute; y-axis: STDEV.S]
As shown in the graph, the title of most sensitive attribute was closely contested. The actor 1 Facebook likes came in a very
close second and almost took over as the most sensitive attribute. However, the most sensitive attribute was again the cast
total Facebook likes. It is important to realize how close these deviations were, because you do not simply want to
disregard the number of Facebook likes for the primary actor in the film. The least sensitive attribute in this model was
the number of Facebook likes for the director. By this measure, it does not matter much who the director of the film is.
Conclusion
In conclusion, this analysis compares certain attributes from Facebook and the IMDB site against the gross revenue of a
film. A higher number of Facebook likes for the primary actor and supporting actors plays a significant role in
generating revenue from a film. Both models and their sensitivity analyses support
this conclusion. Directors and producers can take this dataset and implement it into their thought process when
planning their movie. Movie-goers can use this dataset to make the same predictions once a movie is announced with
its primary and supporting actors/actresses. It could possibly save movie-goers money when debating whether to go see
a movie or not.
This project has helped me substantially in practicing running analyses on a topic and generating a result. It
has developed my skill in Excel and XLMiner through using the Missing Data handle, the Reduce Categories handle, the Data
Partition handle, the Multiple Linear Regression handle, and a sensitivity analysis. Overall, this
project was very useful for me in preparing for my career. I can take this project as proof of knowledge in these
areas as well as of the associated terms.