AUTOMATED DETECTION OF MOVIE SCRIPT GENRES
Machine Learning Presentation – May 19, 2014
Graham Sack 
gas2117@columbia.edu 
Michael Jiang 
jzlpku2009@gmail.com 
Katherine Guo 
kgv2107@columbia.edu
Contents
• Introduction
• Genre Association
• TFIDF and Modified TFIDF
• Topic Modeling
Objective
Given a movie script, predict what genre it is (e.g., Drama, Comedy, Sci-Fi, Horror, etc.)
Corpus
• Built a web-scraper to collect screenplays from internet sources
• Collected 962 screenplays
• Each screenplay has:
  • Full text
  • Multiple genre labels (e.g., Action, Horror, Sci-Fi, Thriller, Romance, Comedy)
Example: Screenplay Text 
Scene Headings: Generally identifiable as text starting with “INT.” or “EXT.” and/or text that is formatted in all caps and left-justified (for example, “INT. MORGAN'S HOUSE - DAY”). Headings facilitate feature extraction: from them it is possible to extract whether the scene is interior or exterior, where the scene is located, and the time when it occurs. Scene headings may also contain tags indicating whether the scene is a “FLASHBACK.”
Scene Content: Identifiable as text located in between scene headings.
Action: Identifiable as non-caps text with narrow margins. Actions may also contain the names of non-speaking characters, which are usually indicated in all caps.
Speaking Character: Identifiable as all-caps text that is located by itself and center-justified. Character tags also contain extractable information about whether the character is speaking in voiceover.
Dialogue: Identifiable as regular-caps text that is located by itself and center-justified with wide margins (in between the margin width of action and the margin width of characters). Dialogue typically follows the name of the speaking character, enabling matching of speaker and dialogue.
Interlocutor: Characters that speak back-to-back in scenes can be assumed to be interlocutors engaged in a dialogue.
Scene Length: Number of lines or quantity of page space devoted to the scene. Can be used to estimate the running time of the scene (e.g., assume that 1 page = 1 minute of screen time).
Example: Screenplay Text 
Voice-over: Identifiable as “(V.O.)” 
Non-Speaking Characters: Primary identifier is 
proper names, professional designations, etc. 
appearing in all caps in action paragraphs. 
Shots: Primary identifier is text that matches a limited 
library of common phrases that are used to indicate 
shots (e.g., “PAN,” “CLOSE UP,” etc.) 
Scene Transitions: Primary identifier is text in all-caps 
that is right-justified. Secondary identifier is text 
that matches a limited library of common phrases that 
are used to indicate transition (e.g., “CUT,” “MATCH 
CUT,” “FADE OUT”, etc.) 
Character Attributes: When a major character is introduced, screenplays frequently specify key attributes such as the character’s age, physical features, and basic personality. These attributes can be automatically identified and extracted as part of a character profile.
Key Objects: Primary identifier is nouns appearing in 
all caps in action paragraphs. These objects generally 
play a key role in the scene.
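The formatting cues above can be turned into simple line-level rules. The following is a minimal sketch, assuming plain-text screenplays; `classify_line`, the regular expression, and the short transition library are hypothetical illustrations, not the project's code. It ignores margin width, so it cannot separate action from dialogue.

```python
import re

# Hypothetical helper: the patterns follow the cues listed on the slides,
# and real screenplays will need more robust rules.
SCENE_HEADING = re.compile(r"^(INT\.|EXT\.)")
TRANSITIONS = ("CUT", "MATCH CUT", "FADE OUT")  # small illustrative library


def classify_line(line):
    """Guess the structural role of one screenplay line."""
    text = line.strip()
    if not text:
        return "blank"
    if SCENE_HEADING.match(text):
        return "scene_heading"
    if text.isupper():
        # Check transitions first so "CUT TO:" is not read as a character name.
        if any(text.startswith(t) for t in TRANSITIONS):
            return "transition"
        # An all-caps line on its own is treated as a speaking-character tag.
        return "voice_over_speaker" if "(V.O.)" in text else "speaker"
    # Margin width is not inspected, so action and dialogue are not separated.
    return "action_or_dialogue"
```

A fuller version would also use indentation to tell dialogue from action, as the slides describe.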
Text Pre-Processing
• Remove standard stop-words:
  • Prepositions: “of”, “in”, “to”, “at”, etc.
  • Pronouns: “he”, “she”, “it”, “they”, “me”, “my”, etc.
  • Helper verbs: “are”, “have”, “had”, etc.
• Remove common film terms:
  • Camera / editing: “cut”, “close”, “shot”, “pan”, “fade”, “angle”
  • Scene headings: “INT.”, “EXT.”, “day”, “night”, “later”
  • Dialogue instructions: “cont’d”, “V.O.”, “O.S.”, “omit”
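The filtering step can be sketched as follows. This is an illustration under stated assumptions: the word sets contain only the examples named on the slide (the full lists are not given), and `preprocess` and its tokenization rule are hypothetical.

```python
import re

# Illustrative subsets only; the slide's lists end in "etc.".
STOP_WORDS = {"of", "in", "to", "at", "he", "she", "it", "they", "me", "my",
              "are", "have", "had"}
# Periods are stripped before matching, so "V.O." is stored as "vo".
FILM_TERMS = {"cut", "close", "shot", "pan", "fade", "angle",
              "int", "ext", "day", "night", "later",
              "cont'd", "vo", "os", "omit"}


def preprocess(text):
    """Lower-case, tokenize, and drop stop-words and film terms."""
    normalized = text.lower().replace("’", "'").replace(".", "")
    tokens = re.findall(r"[a-z']+", normalized)
    return [t for t in tokens if t not in STOP_WORDS and t not in FILM_TERMS]
```

For example, `preprocess("INT. MORGAN'S HOUSE - NIGHT")` keeps only the location words.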
Genre Distribution
Genre Scripts
Drama 504
Musical 18
Comedy 299
Crime 173
Sci-Fi 142
Mystery 87
Adventure 143
Fantasy 98
Romance 168
Thriller 322
Sport 2
Action 253
Family 36
War 23
Horror 131
Animation 29
Biography 3
Music 4
Western 10
History 3
Film-Noir 4
Short 2
• Total # of Scripts with Full Text: 962
• Genre Labels are NOT mutually exclusive
Genre Association Rules 
Rule Confidence Support
Drama --> Comedy 0.230158730159 0.125813449024
Comedy --> Drama 0.387959866221 0.125813449024
Drama --> Crime 0.214285714286 0.117136659436
Crime --> Drama 0.624277456647 0.117136659436
Drama --> Romance 0.236111111111 0.129067245119
Romance --> Drama 0.708333333333 0.129067245119
Drama --> Thriller 0.313492063492 0.17136659436
Thriller --> Drama 0.490683229814 0.17136659436
Crime --> Thriller 0.589595375723 0.110629067245
Thriller --> Crime 0.316770186335 0.110629067245
Thriller --> Action 0.400621118012 0.139913232104
Action --> Thriller 0.509881422925 0.139913232104
• Most of the screenplays have multiple genre labels. This allows us to analyze the association between genres:
  • Which genre labels tend to be related to one another?
  • What rules could we generate from them?
• Can adapt metrics from Association Rules:
  • Support
  • Confidence
  • Interest
Genre Association Rules
$$\mathrm{Interest}(G_1, G_2) = \frac{\Pr(G_1, G_2)}{\Pr(G_1)\,\Pr(G_2)}$$
• > 1 (positively dependent)
• = 1 (independent)
• < 1 (negatively dependent)
Drama Musical Comedy Crime Sci-Fi Mystery Adventure Fantasy Romance Thriller Sport Action Family War Horror Animation Biopic Music Western History Film-Noir Short
Drama 0 0.71 0.71 1.14 0.49 0.82 0.54 0.58 1.3 0.9 0.91 0.64 0.46 1.51 0.43 0.32 1.83 1.83 0.91 1.83 1.8 0.91 
Musical 0.71 0 1.54 0.3 0.36 0 1.07 2.09 1.22 0 0 0 9.96 0 1.17 10.6 0 0 0 0 0 0 
Comedy 0.71 1.54 0 0.84 0.52 0.32 0.73 1.1 1.54 0.33 1.54 0.46 1.97 0.13 0.54 2.23 0 0 0.31 0 0 0 
Crime 1.14 0.3 0.84 0 0.26 1.35 0.22 0.38 0.51 1.69 0 1.24 0 0 0.37 0 0 0 0.53 0 1.3 0 
Sci-Fi 0.49 0.36 0.52 0.26 0 0.97 2.27 1.33 0.39 1.37 0 2.16 0.54 0 1.69 0.9 0 0 0.65 0 0 0 
Mystery 0.82 0 0.32 1.35 0.97 0 0.67 0.65 0.5 2.11 0 0.46 0 0 2.18 0 0 0 1.06 0 0 0 
Adventure 0.54 1.07 0.73 0.22 2.27 0.67 0 2.57 0.5 0.74 0 2.22 2.69 1.4 0.49 4.22 0 0 1.93 0 0 0 
Fantasy 0.58 2.09 1.1 0.38 1.33 0.65 2.57 0 1.12 0.5 0 1.6 3.92 0 1.58 3.89 0 0 0 0 0 0 
Romance 1.3 1.22 1.54 0.51 0.39 0.5 0.5 1.12 0 0.39 0 0.33 1.07 1.19 0.29 0.57 0 0 0.55 1.83 0 0 
Thriller 0.9 0 0.33 1.69 1.37 2.11 0.74 0.5 0.39 0 1.43 1.46 0.08 0.5 1.95 0 0 0 0.57 0 0.7 0 
Sport 0.91 0 1.54 0 0 0 0 0 0 1.43 0 1.82 12.81 0 0 0 0 0 0 0 0 0 
Action 0.64 0 0.46 1.24 2.16 0.46 2.22 1.6 0.33 1.46 1.82 0 0.71 1.58 1 0.63 0 0 1.46 0 0 0 
Family 0.46 9.96 1.97 0 0.54 0 2.69 3.92 1.07 0.08 12.8 0.71 0 0 0.2 15.01 0 0 0 0 0 0 
War 1.51 0 0.13 0 0 0 1.4 0 1.19 0.5 0 1.58 0 0 0 0 13.4 10 0 0 0 0 
Horror 0.43 1.17 0.54 0.37 1.69 2.18 0.49 1.58 0.29 1.95 0 1 0.2 0 0 0.49 0 0 0 0 0 0 
Animation 0.32 10.6 2.23 0 0.9 0 4.22 3.89 0.57 0 0 0.63 15.01 0 0.49 0 0 0 0 0 0 0 
Biography 1.83 0 0 0 0 0 0 0 0 0 0 0 0 13.36 0 0 0 154 0 0 0 0 
Music 1.83 0 0 0 0 0 0 0 0 0 0 0 0 10.02 0 0 154 0 0 0 0 0 
Western 0.91 0 0.31 0.53 0.65 1.06 1.93 0 0.55 0.57 0 1.46 0 0 0 0 0 0 0 0 0 0 
History 1.83 0 0 0 0 0 0 0 1.83 0 0 0 0 0 0 0 0 0 0 0 0 0 
Film-Noir 1.83 0 0 1.33 0 0 0 0 0 0.72 0 0 0 0 0 0 0 0 0 0 0 0 
Short 0.91 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
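Support, confidence, and interest can all be computed from the per-script label sets. The sketch below assumes each script is represented as a set of genre names; `association_metrics` and the toy data are hypothetical and are not the code that produced the figures above.

```python
from itertools import permutations


def association_metrics(label_sets):
    """Return {(g1, g2): (support, confidence, interest)} for ordered genre pairs."""
    n = len(label_sets)
    genres = sorted(set().union(*label_sets))
    count = {g: sum(g in s for s in label_sets) for g in genres}
    metrics = {}
    for g1, g2 in permutations(genres, 2):
        both = sum(g1 in s and g2 in s for s in label_sets)
        support = both / n                 # Pr(G1, G2)
        confidence = both / count[g1]      # Pr(G2 | G1) for the rule G1 --> G2
        interest = support / ((count[g1] / n) * (count[g2] / n))
        metrics[(g1, g2)] = (support, confidence, interest)
    return metrics


# Toy corpus of four scripts (made-up labels, for illustration only).
scripts = [{"Drama", "Romance"}, {"Drama"}, {"Comedy", "Romance"}, {"Drama", "Crime"}]
support, confidence, interest = association_metrics(scripts)[("Drama", "Romance")]
```

Support and interest are symmetric in the two genres, while confidence depends on the direction of the rule, which matches the paired rows in the rule table.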
Genre Association Rules Example
• Big genres like Drama and Comedy are more “mixable” with (adaptive to) other genres, so their interest values are mostly clustered around 1
• However, niche genres like Family show extremes of interest:
  • Lots of very low interest (near 0) → ‘Family’ is incompatible with some genres (like Mystery, Horror, and War)
  • Lots of very high interest (>>1) → ‘Family’ is highly compatible with others (like Comedy, Animation, and …)
Drama Musical Comedy Crime Sci-Fi Mystery Adventure Fantasy Romance Thriller Sport Action Family War Horror Animation Biopic Music Western History Film-Noir Short
Drama 0 0.71 0.71 1.14 0.49 0.82 0.54 0.58 1.3 0.9 0.91 0.64 0.46 1.51 0.43 0.32 1.83 1.83 0.91 1.83 1.8 0.91
Comedy 0.71 1.54 0 0.84 0.52 0.32 0.73 1.1 1.54 0.33 1.54 0.46 1.97 0.13 0.54 2.23 0 0 0.31 0 0 0
Family 0.46 9.96 1.97 0 0.54 0 2.69 3.92 1.07 0.08 12.8 0.71 0 0 0.2 15.01 0 0 0 0 0 0
TF-IDF
• In total, there are 95,341 unlemmatized word types (features), which is too many to process
• Basic idea: extract 10 keywords for each movie and combine them into a keyword list (library), which is later used as the feature list
• For example, the keywords for Braveheart are: {broadsword, william, barn, knights, king, …}
• In total, after TFIDF, only 4,498 word types remain
• For new incoming movies, the counts of the keywords then form the feature vector
• Bag-of-words assumption
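The keyword-library idea can be sketched in a few lines. This is a sketch under assumptions: the slides do not state the exact weighting, so the common form "raw count × log(N / document frequency)" is used here, and `top_keywords` and `build_feature_vectors` are hypothetical names.

```python
import math
from collections import Counter


def top_keywords(docs, k=10):
    """docs maps a title to its token list; returns the k best TF-IDF words per title."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    keywords = {}
    for title, tokens in docs.items():
        term_freq = Counter(tokens)
        # Assumed weighting: count * log(N / df); the slides do not specify one.
        scores = {w: c * math.log(n_docs / doc_freq[w]) for w, c in term_freq.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[title] = ranked[:k]
    return keywords


def build_feature_vectors(docs, k=10):
    """Union the per-movie keywords into one library, then count them per movie."""
    keywords = top_keywords(docs, k)
    vocabulary = sorted({w for words in keywords.values() for w in words})
    vectors = {}
    for title, tokens in docs.items():
        counts = Counter(tokens)
        vectors[title] = [counts[w] for w in vocabulary]
    return vocabulary, vectors
```

Because the library is the union of per-movie keywords, its size grows with the corpus but stays far below the full vocabulary.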
Naïve Bayes Classifier TFIDF Features
Genre TP FP TN FN Accuracy Recall Precision Specificity F
Drama 78 48 42 17 0.65 0.82 0.62 0.47 0.71 
Musical 1 2 179 3 0.97 0.25 0.33 0.99 0.29 
Comedy 46 39 78 22 0.67 0.68 0.54 0.67 0.6 
Crime 17 50 111 7 0.69 0.71 0.25 0.69 0.37 
Scifi 17 19 144 5 0.87 0.77 0.47 0.88 0.59 
Mystery 10 25 143 7 0.83 0.59 0.29 0.85 0.38 
Adventure 19 21 138 7 0.85 0.73 0.48 0.87 0.58 
Fantasy 12 21 144 8 0.84 0.6 0.36 0.87 0.45 
Romance 28 51 98 8 0.68 0.78 0.35 0.66 0.49 
Thriller 41 27 96 21 0.74 0.66 0.6 0.78 0.63 
Sport - - 184 1 0.99 - NaN 1 NaN 
Action 38 19 121 7 0.86 0.84 0.67 0.86 0.75 
Family 4 3 176 2 0.97 0.67 0.57 0.98 0.62 
War 2 8 174 1 0.95 0.67 0.2 0.96 0.31 
Horror 22 24 128 11 0.81 0.67 0.48 0.84 0.56 
Animation 6 4 175 - 0.98 1 0.6 0.98 0.75 
Biography - - 185 - 1 NaN NaN 1 NaN 
Music - - 185 - 1 NaN NaN 1 NaN 
Western 1 2 182 - 0.99 1 0.33 0.99 0.5 
History - - 185 - 1 NaN NaN 1 NaN 
Film-Noir - - 183 2 0.99 - NaN 1 NaN 
Short - - 184 1 0.99 - NaN 1 NaN 
Grand Total 21.4 22.7 147.0 7.6 0.88 0.72 0.45 0.88 0.54
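Each derived column follows from the four confusion-matrix counts. A minimal sketch using the standard definitions (`classification_metrics` is a hypothetical helper name); undefined ratios come out as NaN, as in the rows for genres with no positive predictions:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the derived columns from true/false positive/negative counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "specificity": ratio(tn, tn + fp),
        "f": ratio(2 * precision * recall, precision + recall),
    }


# Drama row of the table: TP=78, FP=48, TN=42, FN=17
drama = classification_metrics(78, 48, 42, 17)
```

Rounded to two decimals, the Drama values reproduce the row in the table (0.65, 0.82, 0.62, 0.47, 0.71).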
Boosting Classifier TFIDF Features
• Tuning parameter K = number of trees in forest
[Figure: one series per genre (Drama through Western) plotted against K = 20, 100, 1000, 10000; vertical axis from 0.6 to 1.05]
Boosting Classifier TFIDF Features
Genre K=20 K=100 K=1000 K=10000 NB
Drama 0.66 0.66 0.67 0.68 0.65 
Musical 0.98 0.98 0.98 0.98 0.97 
Comedy 0.73 0.75 0.77 0.79 0.67 
Crime 0.80 0.83 0.83 0.83 0.69 
Sci-Fi 0.89 0.88 0.90 0.90 0.87 
Mystery 0.88 0.90 0.91 0.90 0.83 
Adventure 0.85 0.87 0.88 0.88 0.85 
Fantasy 0.86 0.88 0.89 0.89 0.84 
Romance 0.81 0.82 0.84 0.84 0.68 
Thriller 0.72 0.72 0.74 0.75 0.74 
Sport 1.00 1.00 1.00 1.00 0.99 
Action 0.79 0.81 0.83 0.84 0.86 
Family 0.96 0.96 0.96 0.96 0.97 
War 0.97 0.97 0.97 0.97 0.95 
Horror 0.87 0.90 0.91 0.91 0.81 
Animation 0.97 0.96 0.97 0.97 0.98 
Biography 1.00 1.00 1.00 1.00 1.00 
Music 0.99 0.99 0.99 0.99 1.00 
Western 0.99 0.99 0.99 0.99 0.99 
History 1.00 1.00 1.00 1.00 1.00 
Film-Noir 0.99 0.99 0.99 0.99 0.99 
Short 1.00 1.00 1.00 1.00 0.99
Modified TFIDF
• Normal TFIDF penalizes a word according to the number of documents in which it appears
• However, we don’t want to penalize a word for appearing in documents of the same genre; instead, we penalize only its appearances in out-of-genre documents
• Modify it as:
Example of Modified TFIDF
• 6 docs
  • doc1 = [‘man’, ‘star’, ‘ship’, ‘laser’, ‘star’, ‘star’, ‘star’, ‘star’, ‘star’, ‘star’, ‘star’, ‘star’] (Sci-Fi)
  • doc2 = [‘man’, ‘steve’, ‘ship’, ‘water’, ‘water’, ‘water’, ‘water’, ‘water’, ‘water’, ‘water’] (Sci-Fi)
  • doc3 = [‘man’, ‘john’, ‘ship’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’] (Sci-Fi)
  • doc4 = [‘man’, ‘john’, ‘ship’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’] (Sci-Fi)
  • doc5 = [‘man’, ‘john’, ‘ship’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’, ‘diamond’] (Sci-Fi)
  • doc6 = [‘man’, ‘man’, ‘man’, ‘man’, ‘man’, ‘man’, ‘man’, ‘man’, ‘man’, ‘bed’, ‘chair’, ‘lamp’] (Family)
• When we extract keywords for doc1:
  • Normal TFIDF:
    • Score(star) = 9/1 = 9
    • Score(ship) = 1/5 = 0.2
  • Modified TFIDF:
    • Score(star) = log2(9+1) * 1/(0+1) = 3.3
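The worked example suggests a score of the form log2(tf + 1) / (out-of-genre document count + 1). The sketch below implements that reading; the formula is an assumption inferred from the single Score(star) calculation, and `modified_tfidf` with its one-genre-per-document simplification is hypothetical (the real corpus is multi-label).

```python
import math


def modified_tfidf(word, doc_index, docs, genres):
    """Score a word in one document, penalizing only out-of-genre documents.

    Assumed form, read off the worked example:
    log2(tf + 1) / (number of out-of-genre documents containing the word + 1)
    """
    tf = docs[doc_index].count(word)
    out_of_genre_df = sum(
        1
        for tokens, genre in zip(docs, genres)
        if genre != genres[doc_index] and word in tokens
    )
    return math.log2(tf + 1) / (out_of_genre_df + 1)


# The six toy documents from the example.
docs = [
    ["man", "star", "ship", "laser"] + ["star"] * 8,
    ["man", "steve", "ship"] + ["water"] * 7,
    ["man", "john", "ship"] + ["diamond"] * 6,
    ["man", "john", "ship"] + ["diamond"] * 6,
    ["man", "john", "ship"] + ["diamond"] * 6,
    ["man"] * 9 + ["bed", "chair", "lamp"],
]
genres = ["Sci-Fi"] * 5 + ["Family"]
star_score = modified_tfidf("star", 0, docs, genres)
```

Under this reading, "ship" is no longer penalized for recurring across the Sci-Fi documents, while "man" is still discounted because it also appears in the Family document.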
Topic Modeling
• Idea: Use topics for dimension reduction of raw word frequencies
  • Instead of tens of thousands of word frequencies, cluster words into a few hundred topics
  • Topic prevalence becomes features for learning algorithms
• Two phases of Machine Learning:
  1. Unsupervised learning to cluster words into topics
  2. Supervised learning of models relating topic distributions to genre labels
Topic Modeling with LDA
Source: David Blei, “Probabilistic Topic Models” (Communications of the ACM, 2012)
Implementation Details
• Extensive Text Pre-Processing:
  • Remove stop-words and film terms
  • Remove character names (more on this…)
  • Convert files to word-count vectors
• Used Gensim implementation of Latent Dirichlet Allocation to extract topics
  • Varied num_topics = 32, 64, 128, 256, 512
  • Which is best? Looked at…
    • Change in model performance (accuracy, F-measure) as number of topics increased
    • Manually inspected topics to see if they were crystallizing into intelligible themes
  • 128 or 256 seem best and roughly equivalent
• Several different supervised learning algorithms:
  • Logistic regression
  • Naïve Bayes
  • SVM RBF
Example: Interpretable Topics
Num_Topics = 256
Words in each topic, listed as word weight:

Topic #126: Sports
game 0.039, ball 0.027, player 0.02, field 0.017, team 0.016, play 0.012, coach 0.011, football 0.01, guy 0.01, one 0.009, locker 0.009, hit 0.009, run 0.009, look 0.009, big 0.009, back 0.008, line 0.008, got 0.008, first 0.007, get 0.007, right 0.006, three 0.006, like 0.006, walk 0.006, stadium 0.006, mcginty 0.005, throw 0.005, two 0.005, take 0.005, gon 0.005, five 0.005, left 0.005, second 0.005, end 0.005, see 0.005, score 0.005, win 0.005, pitch 0.004, head 0.004, hand 0.004, pas 0.004, come 0.004, point 0.004, time 0.004, just 0.004, good 0.004, pull 0.004, dallas 0.004, year 0.004, mike 0.004

Topic #212: Space Travel
ship 0.017, control 0.011, space 0.009, cockpit 0.007, one 0.007, light 0.006, see 0.006, move 0.006, two 0.006, begin 0.005, turn 0.005, robot 0.005, get 0.005, back 0.005, corridor 0.004, bay 0.004, bridge 0.004, horizon 0.004, around 0.004, look 0.004, pilot 0.004, planet 0.004, right 0.004, going 0.004, just 0.004, panel 0.004, head 0.004, lewis 0.004, come 0.004, system 0.004, like 0.004, radio 0.004, base 0.004, away 0.004, giant 0.004, cloud 0.003, air 0.003, event 0.003, laser 0.003, power 0.003, know 0.003, toward 0.003, main 0.003, star 0.003, huge 0.003, door 0.003, suddenly 0.003, will 0.003, falcon 0.003, hit 0.003

Topic #100: Ships & Sailing
boat 0.037, water 0.035, deck 0.019, ship 0.017, island 0.014, sea 0.013, ocean 0.011, back 0.011, look 0.01, beach 0.009, hold 0.008, come 0.008, see 0.008, radio 0.007, light 0.007, chimera 0.007, dock 0.007, crew 0.007, cabin 0.006, continuous 0.006, one 0.006, surface 0.006, raft 0.006, foot 0.005, underwater 0.005, shore 0.005, swim 0.005, beat 0.005, just 0.004, move 0.004, way 0.004, room 0.004, along 0.004, bow 0.004, port 0.004, open 0.004, take 0.004, warrior 0.004, harbor 0.004, arctic 0.004, another 0.004, sub 0.004, end 0.004, hull 0.004, like 0.004, forward 0.004, line 0.004, toward 0.003, swimming 0.003, sailor 0.003
Highest Prevalence of Topic #126
Film Topic Prevalence
Replacements, The 0.63 
Program, The 0.19 
Moneyball 0.16 
Major League 0.16 
Blind Side, The 0.14 
Love and Basketball 0.14 
Bull Durham 0.14 
Sugar 0.12 
Forrest Gump 0.11 
Semi-Pro 0.10 
Two For The Money 0.10 
Damned United, The 0.09 
Sandlot Kids, The 0.09 
Field of Dreams 0.08 
Tin Cup 0.08 
Invictus 0.07 
eXistenZ 0.06 
Buffy the Vampire Slayer 0.05 
The Rage: Carrie 2 0.05 
Game 6 0.05 
Cincinnati Kid, The 0.05
Highest Prevalence of Topic #212
Film Topic Prevalence
Star Wars: The Empire Strikes Back 0.85 
Event Horizon 0.59 
Star Wars: A New Hope 0.47 
Dark Star 0.32 
Alien 0.25 
Lost in Space 0.21 
Jason X 0.21 
TRON 0.19 
Star Wars: The Phantom Menace 0.19 
Dune 0.16 
Independence Day 0.16 
Pandorum 0.16 
Wall-E 0.16 
Mission to Mars 0.15 
Airplane 2: The Sequel 0.14 
Leviathan 0.13 
Abyss, The 0.13 
Prometheus 0.12 
Pitch Black 0.11 
Aliens 0.11 
Sphere 0.11 
Oblivion 0.11 
Heavy Metal 0.10 
Thor 0.09 
Star Wars: Return of the Jedi 0.09 
Moon 0.09
Highest Prevalence of Topic #100
Film Topic Prevalence
Ghost Ship 0.70 
Life of Pi 0.17 
Hard Rain 0.13 
Jaws 2 0.13 
Jaws 0.12 
Master and Commander 0.11 
Titanic 0.11 
Deep Rising 0.11 
Big Blue, The 0.10 
Abyss, The 0.10 
Cast Away 0.10 
Pirates of the Caribbean 0.10 
King Kong 0.09 
Pearl Harbor 0.09 
Lake Placid 0.09 
Friday the 13th: Jason Takes Manhattan 0.09
Mud 0.08 
Commando 0.06 
G.I. Jane 0.06 
Jurassic Park III 0.06 
Sphere 0.05 
Apocalypse Now 0.05 
Blood and Wine 0.04 
Leviathan 0.04 
I Still Know What You Did Last Summer 0.04
Character Names!
• Infuriating problem! Character names spoil otherwise good topics
• Character names are some of the most frequent words in screenplays and therefore dominate topics
• But! They don’t have predictive value, since they are highly unlikely to appear in other screenplays in the same genre
• Need to eliminate them!
Topic #203 (of 256):
0.038*kirk + 0.032*decker + 0.019*bridge + 0.019*spock + 0.015*captain + 0.013*mccoy + 0.012*viewer + 0.012*enterprise + 0.011*ilia + 0.010*now + 0.009*console + 0.009*sir + 0.009*scott + 0.008*crew + 0.007*shuttle + 0.007*vulcan + 0.007*space + 0.007*cloud + 0.006*intercom + 0.006*starfleet + 0.006*ship + 0.006*station + 0.005*sulu + 0.005*warp + 0.005*klingon + 0.005*control + 0.005*toward + 0.005*energy + 0.005*chekov + 0.004*moment + 0.004*main + 0.004*another + 0.004*science + 0.004*vessel + 0.004*chamber + 0.004*power + 0.004*transporter + 0.003*pod + 0.003*uhura + 0.003*alien + 0.003*engineering + 0.003*voice + 0.003*deck + 0.003*continues + 0.003*move + 0.003*ahead + 0.003*camera + 0.003*see + 0.003*one + 0.003*computer
Strategies for Removing Character Names
1. Document-Level:
  • Identify using formatting information: names tend to appear in all caps in the center of a line
  • Remove all cases in all caps (‘STEVE’) or title case (‘Steve’)
2. Corpus-Level:
  • Names tend to be very frequent within a single document, but do not recur across documents
  • Remove all words that occur in ≤ 3 documents. This also helps to eliminate other noise (e.g., typos, “aaaargh”, etc.)
  • Problem:
    • Franchise films with many sequels (e.g., Star Wars)
    • Very common names (John, Sue, David, etc.)
Note: In retrospect, it would have been wiser to extract only verbs and nouns using a POS-tagger…
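The corpus-level filter amounts to a document-frequency threshold. A minimal sketch, assuming each screenplay is already a list of tokens; `remove_rare_words` is a hypothetical helper and the toy corpus is made up for illustration.

```python
from collections import Counter


def remove_rare_words(docs, max_docs=3):
    """Drop every word that occurs in max_docs or fewer documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return [[w for w in tokens if doc_freq[w] > max_docs] for tokens in docs]


# "ship" appears in four documents and survives; the names do not.
corpus = [["kirk", "ship", "kirk"], ["ship", "john"], ["ship", "spock"], ["ship", "kirk"]]
filtered = remove_rare_words(corpus, max_docs=3)
```

As the slide notes, a name that recurs across a franchise, or a very common name, clears the threshold and survives this filter.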
Results: LR vs. NB vs. SVM
Num_Topics = 256
Classifiers: Naïve Bayes, SVM RBF, Logistic Regression
In each row, values left of “|” are TRAINING / IN-SAMPLE (800 Scripts); values right of “|” are TESTING / OUT-OF-SAMPLE (156 Scripts).

Genre TP FP TN FN Accuracy Recall Precision Specificity F | TP FP TN FN Accuracy Recall Precision Specificity F
Action 186 31 545 38 0.91 0.91 0.83 0.95 0.84 | 22 11 82 7 0.85 0.85 0.76 0.88 0.71
Adventure 84 8 670 38 0.94 0.94 0.69 0.99 0.79 | 12 4 97 9 0.89 0.89 0.57 0.96 0.65
Comedy 210 36 504 50 0.89 0.89 0.81 0.93 0.83 | 23 14 69 16 0.75 0.75 0.59 0.83 0.61
Crime 104 24 626 46 0.91 0.91 0.69 0.96 0.75 | 6 9 90 17 0.79 0.79 0.26 0.91 0.32
Drama 389 85 276 50 0.83 0.83 0.89 0.76 0.85 | 49 27 30 16 0.65 0.65 0.75 0.53 0.70
Fantasy 47 3 719 31 0.96 0.96 0.60 1.00 0.73 | 3 5 97 17 0.82 0.82 0.15 0.95 0.21
Horror 79 12 677 32 0.95 0.95 0.71 0.98 0.78 | 8 2 100 12 0.89 0.89 0.40 0.98 0.53
Romance 59 10 642 89 0.88 0.88 0.40 0.98 0.54 | - 1 101 20 0.83 0.83 - 0.99 NaN
Sci-Fi 99 7 670 24 0.96 0.96 0.80 0.99 0.86 | 12 5 98 7 0.90 0.90 0.63 0.95 0.67
Thriller 237 57 458 48 0.87 0.87 0.83 0.89 0.82 | 26 15 70 11 0.79 0.79 0.70 0.82 0.67
Western 7 - 791 2 1.00 1.00 0.78 1.00 0.88 | 1 - 121 - 1.00 1.00 1.00 1.00 1.00
Grand Total 136.5 24.8 598.0 40.7 0.92 0.92 0.73 0.95 0.79 | 14.7 8.5 86.8 12.0 0.83 0.83 0.53 0.89 0.61

Genre TP FP TN FN Accuracy Recall Precision Specificity F | TP FP TN FN Accuracy Recall Precision Specificity F
Action 141 44 532 83 0.84 0.84 0.63 0.92 0.69 | 20 10 83 9 0.84 0.84 0.69 0.89 0.68
Adventure 43 32 646 79 0.86 0.86 0.35 0.95 0.44 | 2 - 101 19 0.84 0.84 0.10 1.00 0.17
Comedy 151 70 470 109 0.78 0.78 0.58 0.87 0.63 | 15 12 71 24 0.70 0.70 0.38 0.86 0.45
Crime 24 32 618 126 0.80 0.80 0.16 0.95 0.23 | 1 1 98 22 0.81 0.81 0.04 0.99 0.08
Drama 365 157 204 74 0.71 0.71 0.83 0.57 0.76 | 53 30 27 12 0.66 0.66 0.82 0.47 0.72
Fantasy 15 31 691 63 0.88 0.88 0.19 0.96 0.24 | - - 102 20 0.84 0.84 - 1.00 NaN
Horror 23 25 664 88 0.86 0.86 0.21 0.96 0.29 | - - 102 20 0.84 0.84 - 1.00 NaN
Romance 20 25 627 128 0.81 0.81 0.14 0.96 0.21 | - - 102 20 0.84 0.84 - 1.00 NaN
Sci-Fi 50 28 649 73 0.87 0.87 0.41 0.96 0.50 | 6 2 101 13 0.88 0.88 0.32 0.98 0.44
Thriller 181 83 432 104 0.77 0.77 0.64 0.84 0.66 | 19 10 75 18 0.77 0.77 0.51 0.88 0.58
Western 1 26 765 8 0.96 0.96 0.11 0.97 0.06 | - - 121 1 0.99 0.99 - 1.00 NaN
Grand Total 92.2 50.3 572.5 85.0 0.83 0.83 0.39 0.90 0.43 | 10.5 5.9 89.4 16.2 0.82 0.82 0.26 0.92 0.45

Genre TP FP TN FN Accuracy Recall Precision Specificity F | TP FP TN FN Accuracy Recall Precision Specificity F
Action 116 37 539 108 0.82 0.82 0.52 0.94 0.62 | 15 7 86 14 0.83 0.83 0.52 0.92 0.59
Adventure 2 1 677 120 0.85 0.85 0.02 1.00 0.03 | - - 101 21 0.83 0.83 - 1.00 NaN
Comedy 112 51 489 148 0.75 0.75 0.43 0.91 0.53 | 14 8 75 25 0.73 0.73 0.36 0.90 0.46
Crime 13 15 635 137 0.81 0.81 0.09 0.98 0.15 | - 1 98 23 0.80 0.80 - 0.99 NaN
Drama 365 161 200 74 0.71 0.71 0.83 0.55 0.76 | 51 30 27 14 0.64 0.64 0.78 0.47 0.70
Fantasy - - 722 78 0.90 0.90 - 1.00 NaN | - - 102 20 0.84 0.84 - 1.00 NaN
Horror 6 4 685 105 0.86 0.86 0.05 0.99 0.10 | - - 102 20 0.84 0.84 - 1.00 NaN
Romance 5 1 651 143 0.82 0.82 0.03 1.00 0.06 | - - 102 20 0.84 0.84 - 1.00 NaN
Sci-Fi 18 7 670 105 0.86 0.86 0.15 0.99 0.24 | - - 103 19 0.84 0.84 - 1.00 NaN
Thriller 148 67 448 137 0.75 0.75 0.52 0.87 0.59 | 19 9 76 18 0.78 0.78 0.51 0.89 0.58
Western - - 791 9 0.99 0.99 - 1.00 NaN | - - 121 1 0.99 0.99 - 1.00 NaN
Grand Total 71.4 31.3 591.5 105.8 0.83 0.83 0.24 0.93 0.34 | 9.0 5.0 90.3 17.7 0.81 0.81 0.20 0.93 0.58
Results: Number of Topics 
SVM_RBF (C=100, GAMMA=1.0) 
Row	Labels TP FP TN FN Accuracy Recall Precision Specificity F Row	Labels TP FP TN FN Accuracy Recall Precision Specificity F 
Action 						1	77 								3	3 						5	43 								4	7 								0	.90 								0	.90 								0	.79 								0	.94 								0	.82 Action 								2	4 								1	2 								8	1 											5	 								0	.86 								0	.86 								0	.83 								0	.87 								0	.74 
Adventure 70 15 663 52 0.92 0.92 0.57 0.98 0.68 | Adventure 13 6 95 8 0.89 0.89 0.62 0.94 0.65
Comedy 193 52 488 67 0.85 0.85 0.74 0.90 0.76 | Comedy 16 9 74 23 0.74 0.74 0.41 0.89 0.50
Crime 59 9 641 91 0.88 0.88 0.39 0.99 0.54 | Crime 8 10 89 15 0.80 0.80 0.35 0.90 0.39
Drama 380 89 272 59 0.82 0.82 0.87 0.75 0.84 | Drama 47 29 28 18 0.61 0.61 0.72 0.49 0.67
Fantasy 26 2 720 52 0.93 0.93 0.33 1.00 0.49 | Fantasy 3 - 102 17 0.86 0.86 0.15 1.00 0.26
Horror 59 13 676 52 0.92 0.92 0.53 0.98 0.64 | Horror 8 5 97 12 0.86 0.86 0.40 0.95 0.48
Romance 53 8 644 95 0.87 0.87 0.36 0.99 0.51 | Romance 2 3 99 18 0.83 0.83 0.10 0.97 0.16
Sci-Fi 83 18 659 40 0.93 0.93 0.67 0.97 0.74 | Sci-Fi 10 3 100 9 0.90 0.90 0.53 0.97 0.63
Thriller 218 70 445 67 0.83 0.83 0.76 0.86 0.76 | Thriller 30 18 67 7 0.80 0.80 0.81 0.79 0.71
Western 7 1 790 2 1.00 1.00 0.78 1.00 0.82 | Western 1 - 121 - 1.00 1.00 1.00 1.00 1.00
Grand Total 120.5 28.2 594.6 56.7 0.89 0.89 0.62 0.94 0.69 | Grand Total 14.7 8.6 86.6 12.0 0.83 0.83 0.54 0.89 0.56
Row Labels TP FP TN FN Accuracy Recall Precision Specificity F | Row Labels TP FP TN FN Accuracy Recall Precision Specificity F
Action 177 32 544 47 0.90 0.90 0.79 0.94 0.82 | Action 21 12 81 8 0.84 0.84 0.72 0.87 0.68
Adventure 75 16 662 47 0.92 0.92 0.61 0.98 0.70 | Adventure 9 2 99 12 0.89 0.89 0.43 0.98 0.56
Comedy 168 32 508 92 0.85 0.85 0.65 0.94 0.73 | Comedy 12 11 72 27 0.69 0.69 0.31 0.87 0.39
Crime 81 26 624 69 0.88 0.88 0.54 0.96 0.63 | Crime 1 1 98 22 0.81 0.81 0.04 0.99 0.08
Drama 382 104 257 57 0.80 0.80 0.87 0.71 0.83 | Drama 44 26 31 21 0.61 0.61 0.68 0.54 0.65
Fantasy 26 1 721 52 0.93 0.93 0.33 1.00 0.50 | Fantasy 1 - 102 19 0.84 0.84 0.05 1.00 0.10
Horror 64 8 681 47 0.93 0.93 0.58 0.99 0.70 | Horror 9 5 97 11 0.87 0.87 0.45 0.95 0.53
Romance 69 16 636 79 0.88 0.88 0.47 0.98 0.59 | Romance 1 1 101 19 0.84 0.84 0.05 0.99 0.09
Sci-Fi 83 6 671 40 0.94 0.94 0.67 0.99 0.78 | Sci-Fi 10 3 100 9 0.90 0.90 0.53 0.97 0.63
Thriller 223 51 464 62 0.86 0.86 0.78 0.90 0.80 | Thriller 24 10 75 13 0.81 0.81 0.65 0.88 0.68
Western 7 - 791 2 1.00 1.00 0.78 1.00 0.88 | Western 1 - 121 - 1.00 1.00 1.00 1.00 1.00
Grand Total 123.2 26.5 596.3 54.0 0.90 0.90 0.64 0.94 0.72 | Grand Total 12.1 6.5 88.8 14.6 0.83 0.83 0.45 0.91 0.49
32 Topics (first table above), 64 Topics (second table above)
In the tables above and below, the left block is TRAINING (IN-SAMPLE) and the right block is TESTING (OUT-OF-SAMPLE).
Row Labels TP FP TN FN Accuracy Recall Precision Specificity F | Row Labels TP FP TN FN Accuracy Recall Precision Specificity F
Action 184 31 545 40 0.91 0.91 0.82 0.95 0.84 | Action 21 7 86 8 0.88 0.88 0.72 0.92 0.74
Adventure 82 7 671 40 0.94 0.94 0.67 0.99 0.78 | Adventure 10 5 96 11 0.87 0.87 0.48 0.95 0.56
Comedy 188 34 506 72 0.87 0.87 0.72 0.94 0.78 | Comedy 17 7 76 22 0.76 0.76 0.44 0.92 0.54
Crime 88 28 622 62 0.89 0.89 0.59 0.96 0.66 | Crime 6 8 91 17 0.80 0.80 0.26 0.92 0.32
Drama 389 80 281 50 0.84 0.84 0.89 0.78 0.86 | Drama 50 32 25 15 0.61 0.61 0.77 0.44 0.68
Fantasy 42 - 722 36 0.96 0.96 0.54 1.00 0.70 | Fantasy 2 1 101 18 0.84 0.84 0.10 0.99 0.17
Horror 86 13 676 25 0.95 0.95 0.77 0.98 0.82 | Horror 8 2 100 12 0.89 0.89 0.40 0.98 0.53
Romance 68 14 638 80 0.88 0.88 0.46 0.98 0.59 | Romance 2 4 98 18 0.82 0.82 0.10 0.96 0.15
Sci-Fi 91 7 670 32 0.95 0.95 0.74 0.99 0.82 | Sci-Fi 13 1 102 6 0.94 0.94 0.68 0.99 0.79
Thriller 230 42 473 55 0.88 0.88 0.81 0.92 0.83 | Thriller 27 11 74 10 0.83 0.83 0.73 0.87 0.72
Western 7 - 791 2 1.00 1.00 0.78 1.00 0.88 | Western 1 - 121 - 1.00 1.00 1.00 1.00 1.00
Grand Total 132.3 23.3 599.5 44.9 0.91 0.91 0.71 0.95 0.78 | Grand Total 14.3 7.1 88.2 12.5 0.84 0.84 0.52 0.90 0.56
Row Labels TP FP TN FN Accuracy Recall Precision Specificity F | Row Labels TP FP TN FN Accuracy Recall Precision Specificity F
Action 186 31 545 38 0.91 0.91 0.83 0.95 0.84 | Action 22 11 82 7 0.85 0.85 0.76 0.88 0.71
Adventure 84 8 670 38 0.94 0.94 0.69 0.99 0.79 | Adventure 12 4 97 9 0.89 0.89 0.57 0.96 0.65
Comedy 210 36 504 50 0.89 0.89 0.81 0.93 0.83 | Comedy 23 14 69 16 0.75 0.75 0.59 0.83 0.61
Crime 104 24 626 46 0.91 0.91 0.69 0.96 0.75 | Crime 6 9 90 17 0.79 0.79 0.26 0.91 0.32
Drama 389 85 276 50 0.83 0.83 0.89 0.76 0.85 | Drama 49 27 30 16 0.65 0.65 0.75 0.53 0.70
Fantasy 47 3 719 31 0.96 0.96 0.60 1.00 0.73 | Fantasy 3 5 97 17 0.82 0.82 0.15 0.95 0.21
Horror 79 12 677 32 0.95 0.95 0.71 0.98 0.78 | Horror 8 2 100 12 0.89 0.89 0.40 0.98 0.53
Romance 59 10 642 89 0.88 0.88 0.40 0.98 0.54 | Romance - 1 101 20 0.83 0.83 - 0.99 n/a
Sci-Fi 99 7 670 24 0.96 0.96 0.80 0.99 0.86 | Sci-Fi 12 5 98 7 0.90 0.90 0.63 0.95 0.67
Thriller 237 57 458 48 0.87 0.87 0.83 0.89 0.82 | Thriller 26 15 70 11 0.79 0.79 0.70 0.82 0.67
Western 7 - 791 2 1.00 1.00 0.78 1.00 0.88 | Western 1 - 121 - 1.00 1.00 1.00 1.00 1.00
Grand Total 136.5 24.8 598.0 40.7 0.92 0.92 0.73 0.95 0.79 | Grand Total 14.7 8.5 86.8 12.0 0.83 0.83 0.53 0.89 0.61
Row Labels TP FP TN FN Accuracy Recall Precision Specificity F | Row Labels TP FP TN FN Accuracy Recall Precision Specificity F
Action 195 23 553 29 0.94 0.94 0.87 0.96 0.88 | Action 23 12 81 6 0.85 0.85 0.79 0.87 0.72
Adventure 86 7 671 36 0.95 0.95 0.70 0.99 0.80 | Adventure 12 4 97 9 0.89 0.89 0.57 0.96 0.65
Comedy 194 36 504 66 0.87 0.87 0.75 0.93 0.79 | Comedy 23 17 66 16 0.73 0.73 0.59 0.80 0.58
Crime 110 13 637 40 0.93 0.93 0.73 0.98 0.81 | Crime 6 3 96 17 0.84 0.84 0.26 0.97 0.38
Drama 398 71 290 41 0.86 0.86 0.91 0.80 0.88 | Drama 49 26 31 16 0.66 0.66 0.75 0.54 0.70
Fantasy 37 - 722 41 0.95 0.95 0.47 1.00 0.64 | Fantasy 2 1 101 18 0.84 0.84 0.10 0.99 0.17
Horror 89 13 676 22 0.96 0.96 0.80 0.98 0.84 | Horror 7 2 100 13 0.88 0.88 0.35 0.98 0.48
Romance 77 10 642 71 0.90 0.90 0.52 0.98 0.66 | Romance 1 4 98 19 0.81 0.81 0.05 0.96 0.08
Sci-Fi 92 12 665 31 0.95 0.95 0.75 0.98 0.81 | Sci-Fi 11 4 99 8 0.90 0.90 0.58 0.96 0.65
Thriller 238 39 476 47 0.89 0.89 0.84 0.92 0.85 | Thriller 27 10 75 10 0.84 0.84 0.73 0.88 0.73
Western 8 - 791 1 1.00 1.00 0.89 1.00 0.94 | Western 1 1 120 - 0.99 0.99 1.00 0.99 0.67
Grand Total 138.5 20.4 602.5 38.6 0.93 0.93 0.75 0.96 0.81 | Grand Total 14.7 7.6 87.6 12.0 0.84 0.84 0.53 0.90 0.53
128 Topics (first table above), 256 Topics (second table above), 512 Topics (third table above)
Results: Number of Topics 
SVM_RBF (C=100, GAMMA=1.0)
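
For reference, GAMMA in the setting above is the width parameter of the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), and C is the SVM's penalty on misclassified training points. The sketch below evaluates that kernel on topic-proportion vectors. The function name and the example vectors are illustrative assumptions, not taken from the project's code.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical topic proportions for three scripts over four topics.
script_a = [0.70, 0.10, 0.10, 0.10]
script_b = [0.60, 0.20, 0.10, 0.10]
script_c = [0.05, 0.05, 0.10, 0.80]

print(rbf_kernel(script_a, script_b))  # close to 1: similar topic mix
print(rbf_kernel(script_a, script_c))  # smaller: different topic mix
```

A larger gamma makes the kernel value fall off faster with distance, so each training script influences a smaller neighborhood of topic space.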
Results: Predictability by Genre 
SVM_RBF (C=100, GAMMA=1.0) TOPICS=128 TESTING 
Row Labels TP FP TN FN Accuracy Recall Precision Specificity F
Action 21 7 86 8 0.88 0.88 0.72 0.92 0.74
Adventure 10 5 96 11 0.87 0.87 0.48 0.95 0.56
Comedy 17 7 76 22 0.76 0.76 0.44 0.92 0.54
Crime 6 8 91 17 0.80 0.80 0.26 0.92 0.32
Drama 50 32 25 15 0.61 0.61 0.77 0.44 0.68
Fantasy 2 1 101 18 0.84 0.84 0.10 0.99 0.17
Horror 8 2 100 12 0.89 0.89 0.40 0.98 0.53
Romance 2 4 98 18 0.82 0.82 0.10 0.96 0.15
Sci-Fi 13 1 102 6 0.94 0.94 0.68 0.99 0.79
Thriller 27 11 74 10 0.83 0.83 0.73 0.87 0.72
Western 1 - 121 - 1.00 1.00 1.00 1.00 1.00
Grand Total 14.3 7.1 88.2 12.5 0.84 0.84 0.52 0.90 0.56
High Predictability (F > 0.7): 
• Sci-Fi 
• Action 
• Thriller 
• Western 
Low Predictability (F < 0.4): 
• Crime 
• Fantasy 
• Romance
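
The grouping above can be reproduced from the TP/FP/TN/FN counts in the 128-topic testing table. The sketch below uses the standard definitions of each metric; the function names are illustrative assumptions, and "-" entries are read as 0. Recomputed this way, Accuracy, Specificity and F agree with the table, while the value printed in the Precision column appears to equal TP / (TP + FN), which is the usual definition of recall, and the Recall column appears to repeat Accuracy.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else None       # a.k.a. sensitivity
    precision = tp / (tp + fp) if (tp + fp) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None  # undefined, e.g. when there are no true positives
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# (TP, FP, TN, FN) per genre, copied from the 128-topic testing table.
COUNTS = {
    "Action": (21, 7, 86, 8),
    "Adventure": (10, 5, 96, 11),
    "Comedy": (17, 7, 76, 22),
    "Crime": (6, 8, 91, 17),
    "Drama": (50, 32, 25, 15),
    "Fantasy": (2, 1, 101, 18),
    "Horror": (8, 2, 100, 12),
    "Romance": (2, 4, 98, 18),
    "Sci-Fi": (13, 1, 102, 6),
    "Thriller": (27, 11, 74, 10),
    "Western": (1, 0, 121, 0),
}


def group_by_predictability(counts, high=0.7, low=0.4):
    """Split genres into high (F > high) and low (F < low) predictability."""
    groups = {"high": [], "low": []}
    for genre, cells in counts.items():
        f1 = confusion_metrics(*cells)["f1"]
        if f1 is None:
            continue
        if f1 > high:
            groups["high"].append(genre)
        elif f1 < low:
            groups["low"].append(genre)
    return groups


groups = group_by_predictability(COUNTS)
print(groups["high"])
print(groups["low"])
```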
Some Visualization (Forrest Gump)
Some Visualization (Jurassic Park)
Some Visualization (Les Misérables)
Program Architecture 
Screenplay 
Screenplay Class 
Attributes: Title, Text, Genres, Word Freq, Topics 
Methods: Getter and Setter Functions 
TextCleaner Class 
ScriptDatabase Class 
ScriptScraper Class 
Website
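
As a rough illustration of the Screenplay class in this diagram, the sketch below holds the five listed attributes and exposes getter and setter functions. The field types, the method names, and the update_word_freq helper are assumptions made for the example; the slide does not specify them.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Screenplay:
    """Container mirroring the attributes on the slide (types are assumed)."""
    title: str
    text: str
    genres: list = field(default_factory=list)        # a script may carry several labels
    word_freq: Counter = field(default_factory=Counter)
    topics: list = field(default_factory=list)        # e.g. per-topic proportions

    def get_genres(self):
        """Return a copy so callers cannot mutate the stored list."""
        return list(self.genres)

    def set_genres(self, genres):
        self.genres = list(genres)

    def update_word_freq(self):
        """Hypothetical helper: recount lowercase, whitespace-separated tokens."""
        self.word_freq = Counter(self.text.lower().split())
        return self.word_freq


script = Screenplay(title="Example", text="INT. HOUSE - DAY The house is quiet")
script.set_genres(["Drama", "Thriller"])
script.update_word_freq()
print(script.get_genres(), script.word_freq["house"])
```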

Machine learning-final-presentation v2

  • 1. AUTOMATED DETECTION OF MOVIE SCRIPT GENRES Machine Learning Presentation – May 19, 2014 Graham Sack gas2117@columbia.edu Michael Jiang jzlpku2009@gmail.com Katherine Guo kgv2107@columbia.edu
  • 2. Contents  Introduction  Genre Association  TFIDF and Modified TFIDF  Topic Modeling
  • 3. Objective Given a movie script, predict what genre it is (e.g., Drama, Comedy, Sci- Fi, Horror, etc.)
  • 4. Corpus  Built a web-scraper to collect screenplays from internet sources  Collected 962 screenplays  Each screenplay has: Full text Multiple genre labels Action Horror Sci-Fi Thriller Romance Comedy
  • 5. Example: Screenplay Text Scene Headings: generally identifiable as text starting with “INT.” or “EXT.” and/or text that is formatted in all caps and is left justified (for example. “INT. MORGAN'S HOUSE - DAY.”) facilitate feature extraction. From headings it is possible to extract data indicating whether the scene is interior vs. exterior, where the scene is located, and the time when it occurs. Scene headings may also contain tags indicating whether the scene is a “FLASHBACK.” Scene Content: Identifiable as text located in between scene headings Action: Identifiable as non-caps text with narrow margins. Actions may also contain the names of non-speaking characters, which are usually indicated in all caps. Speaking Character: Identifiable as text all-caps text that is located by itself and center-justified. Character tags also contain extractable information about whether the character is speaking in voiceover Dialogue: Identifiable as regular-caps text that is located by itself and center-justified with wide margins (in between margin width of action and margin with of characters). Dialogue typically follows the name of the speaking character, enabling matching of speaker and dialogue. Interlocutor: Characters that speak back-to-back in scenes can be assumed to be interlocutors engaged in a dialogue Scene Length: Number of lines or quantity of page space devoted to scene. Can be used to estimate the running time of the scene (e.g., assume that 1 page = 1 minute of screen time)
  • 6. Example: Screenplay Text
      - Voice-over: Identifiable as “(V.O.)”.
      - Non-Speaking Characters: Primary identifier is proper names, professional designations, etc. appearing in all caps in action paragraphs.
      - Shots: Primary identifier is text that matches a limited library of common phrases used to indicate shots (e.g., “PAN,” “CLOSE UP,” etc.).
      - Scene Transitions: Primary identifier is all-caps text that is right-justified. Secondary identifier is text that matches a limited library of common phrases used to indicate transitions (e.g., “CUT,” “MATCH CUT,” “FADE OUT,” etc.).
      - Character Attributes: When a major character is introduced, screenplays frequently specify key attributes such as the character’s age, physical features, and basic personality. These attributes can be automatically identified and extracted as part of a character profile.
      - Key Objects: Primary identifier is nouns appearing in all caps in action paragraphs. These objects generally play a key role in the scene.
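Matching against a "limited library of common phrases" can be sketched as a small lookup. The phrase tuples below contain only the examples named on the slide, and the function name `tag_line` and its return labels are hypothetical; a real library would be larger, and justification (left, center, right) is ignored here.

```python
import re

# Illustrative phrase libraries: only the examples given on the slide.
TRANSITIONS = ("MATCH CUT", "CUT", "FADE OUT")
SHOTS = ("PAN", "CLOSE UP")


def _contains(text, phrases):
    """True if any phrase occurs in text as a whole word or word sequence."""
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases)


def tag_line(line):
    """Assign a coarse tag to one screenplay line."""
    text = line.strip()
    if "(V.O.)" in text:
        return "voice-over"
    # Shots and transitions are expected in all caps.
    if text.isupper() and _contains(text, TRANSITIONS):
        return "transition"
    if text.isupper() and _contains(text, SHOTS):
        return "shot"
    return "other"
```

The word-boundary match keeps a line such as "PANIC IN THE STREETS" from being tagged as a "PAN" shot.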
  • 7. Text Pre-Processing
      - Remove standard stop-words:
          - Prepositions: “of”, “in”, “to”, “at”, etc.
          - Pronouns: “he”, “she”, “it”, “they”, “me”, “my”, etc.
          - Helper verbs: “are”, “have”, “had”, etc.
      - Remove common film terms:
          - Camera / editing: “cut”, “close”, “shot”, “pan”, “fade”, “angle”
          - Scene headings: “INT.”, “EXT.”, “day”, “night”, “later”
          - Dialogue instructions: “cont’d”, “V.O.”, “O.S.”, “omit”
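The two removal passes can be sketched as a single token filter. The word sets below hold only the examples listed on the slide (the full lists used in the project are not given), and the tokenizer and the name `clean_tokens` are assumptions made for illustration.

```python
import re

# Only the examples from the slide; the project's full lists are not shown.
STOP_WORDS = {"of", "in", "to", "at", "he", "she", "it", "they", "me", "my",
              "are", "have", "had"}
FILM_TERMS = {"cut", "close", "shot", "pan", "fade", "angle",
              "int", "ext", "day", "night", "later",
              "cont'd", "v.o", "o.s", "omit"}

# Words, optionally joined by an internal apostrophe or period ("cont'd", "v.o").
TOKEN_RE = re.compile(r"[a-z]+(?:['.][a-z]+)*")


def clean_tokens(text):
    """Lowercase, tokenize, and drop stop-words and film terms."""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    tokens = TOKEN_RE.findall(lowered)
    return [t for t in tokens if t not in STOP_WORDS and t not in FILM_TERMS]


print(clean_tokens("INT. MORGAN'S HOUSE - DAY"))
```

Trailing periods are dropped by the tokenizer, which is why the abbreviations are stored as `v.o` and `o.s` in the set.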
  • 9. Genre Distribution
      Genre       Scripts
      Drama       504
      Musical     18
      Comedy      299
      Crime       173
      Sci-Fi      142
      Mystery     87
      Adventure   143
      Fantasy     98
      Romance     168
      Thriller    322
      Sport       2
      Action      253
      Family      36
      War         23
      Horror      131
      Animation   29
      Biography   3
      Music       4
      Western     10
      History     3
      Film-Noir   4
      Short       2
      - Total # of Scripts with Full Text: 962
      - Genre Labels are NOT mutually exclusive
  • 10. Genre Association Rules
      - Most of the screenplays have multiple genre labels. This allows us to analyze the association between genres:
          - Which genre labels tend to be related to one another?
          - What rules could we generate from them?
      - Can adapt metrics from Association Rules:
          - Support
          - Confidence
          - Interest
      Rule                   Confidence      Support
      Drama --> Comedy       0.230158730159  0.125813449024
      Comedy --> Drama       0.387959866221  0.125813449024
      Drama --> Crime        0.214285714286  0.117136659436
      Crime --> Drama        0.624277456647  0.117136659436
      Drama --> Romance      0.236111111111  0.129067245119
      Romance --> Drama      0.708333333333  0.129067245119
      Drama --> Thriller     0.313492063492  0.17136659436
      Thriller --> Drama     0.490683229814  0.17136659436
      Crime --> Thriller     0.589595375723  0.110629067245
      Thriller --> Crime     0.316770186335  0.110629067245
      Thriller --> Action    0.400621118012  0.139913232104
      Action --> Thriller    0.509881422925  0.139913232104
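Support and confidence for a rule such as Romance --> Drama can be computed directly from the per-script label sets. This is a minimal sketch using the standard association-rule definitions; the five-script `SCRIPTS` list is invented toy data, not the corpus, and the function names are illustrative.

```python
# Toy data (hypothetical): one set of genre labels per screenplay.
SCRIPTS = [
    {"Drama", "Romance"},
    {"Drama"},
    {"Comedy", "Romance"},
    {"Drama", "Romance"},
    {"Drama"},
]


def support(scripts, a, b):
    """Fraction of all scripts labeled with both genre a and genre b."""
    both = sum(1 for genres in scripts if a in genres and b in genres)
    return both / len(scripts)


def confidence(scripts, a, b):
    """For the rule a --> b: fraction of scripts labeled a that are also labeled b."""
    with_a = sum(1 for genres in scripts if a in genres)
    both = sum(1 for genres in scripts if a in genres and b in genres)
    return both / with_a if with_a else 0.0


print(support(SCRIPTS, "Drama", "Romance"))
print(confidence(SCRIPTS, "Romance", "Drama"))
print(confidence(SCRIPTS, "Drama", "Romance"))
```

Support is symmetric in the two genres while confidence is not, which is why each pair in the table shares one support value but has two confidence values.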
  • 11. Genre Association Rules
      $$\mathrm{Interest}(G_1, G_2) = \frac{\Pr(G_1, G_2)}{\Pr(G_1)\,\Pr(G_2)}$$
      - Interest > 1: positively dependent
      - Interest = 1: independent
      - Interest < 1: negatively dependent
      Interest matrix. Columns are numbered in the same genre order as the rows (1 = Drama, ..., 22 = Short).
                    1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18    19    20    21    22
       1 Drama      0     0.71  0.71  1.14  0.49  0.82  0.54  0.58  1.3   0.9   0.91  0.64  0.46  1.51  0.43  0.32  1.83  1.83  0.91  1.83  1.8   0.91
       2 Musical    0.71  0     1.54  0.3   0.36  0     1.07  2.09  1.22  0     0     0     9.96  0     1.17  10.6  0     0     0     0     0     0
       3 Comedy     0.71  1.54  0     0.84  0.52  0.32  0.73  1.1   1.54  0.33  1.54  0.46  1.97  0.13  0.54  2.23  0     0     0.31  0     0     0
       4 Crime      1.14  0.3   0.84  0     0.26  1.35  0.22  0.38  0.51  1.69  0     1.24  0     0     0.37  0     0     0     0.53  0     1.3   0
       5 Sci-Fi     0.49  0.36  0.52  0.26  0     0.97  2.27  1.33  0.39  1.37  0     2.16  0.54  0     1.69  0.9   0     0     0.65  0     0     0
       6 Mystery    0.82  0     0.32  1.35  0.97  0     0.67  0.65  0.5   2.11  0     0.46  0     0     2.18  0     0     0     1.06  0     0     0
       7 Adventure  0.54  1.07  0.73  0.22  2.27  0.67  0     2.57  0.5   0.74  0     2.22  2.69  1.4   0.49  4.22  0     0     1.93  0     0     0
       8 Fantasy    0.58  2.09  1.1   0.38  1.33  0.65  2.57  0     1.12  0.5   0     1.6   3.92  0     1.58  3.89  0     0     0     0     0     0
       9 Romance    1.3   1.22  1.54  0.51  0.39  0.5   0.5   1.12  0     0.39  0     0.33  1.07  1.19  0.29  0.57  0     0     0.55  1.83  0     0
      10 Thriller   0.9   0     0.33  1.69  1.37  2.11  0.74  0.5   0.39  0     1.43  1.46  0.08  0.5   1.95  0     0     0     0.57  0     0.7   0
      11 Sport      0.91  0     1.54  0     0     0     0     0     0     1.43  0     1.82  12.81 0     0     0     0     0     0     0     0     0
      12 Action     0.64  0     0.46  1.24  2.16  0.46  2.22  1.6   0.33  1.46  1.82  0     0.71  1.58  1     0.63  0     0     1.46  0     0     0
      13 Family     0.46  9.96  1.97  0     0.54  0     2.69  3.92  1.07  0.08  12.8  0.71  0     0     0.2   15.01 0     0     0     0     0     0
      14 War        1.51  0     0.13  0     0     0     1.4   0     1.19  0.5   0     1.58  0     0     0     0     13.4  10    0     0     0     0
      15 Horror     0.43  1.17  0.54  0.37  1.69  2.18  0.49  1.58  0.29  1.95  0     1     0.2   0     0     0.49  0     0     0     0     0     0
      16 Animation  0.32  10.6  2.23  0     0.9   0     4.22  3.89  0.57  0     0     0.63  15.01 0     0.49  0     0     0     0     0     0     0
      17 Biography  1.83  0     0     0     0     0     0     0     0     0     0     0     0     13.36 0     0     0     154   0     0     0     0
      18 Music      1.83  0     0     0     0     0     0     0     0     0     0     0     0     10.02 0     0     154   0     0     0     0     0
      19 Western    0.91  0     0.31  0.53  0.65  1.06  1.93  0     0.55  0.57  0     1.46  0     0     0     0     0     0     0     0     0     0
      20 History    1.83  0     0     0     0     0     0     0     1.83  0     0     0     0     0     0     0     0     0     0     0     0     0
      21 Film-Noir  1.83  0     0     1.33  0     0     0     0     0     0.72  0     0     0     0     0     0     0     0     0     0     0     0
      22 Short      0.91  0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
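The interest measure is the ratio of the observed co-occurrence rate to the rate expected if the two genres were independent. A minimal sketch follows, again on invented toy data; `SCRIPTS` and the function name `interest` are illustrative, and returning 0.0 for a genre that never occurs is a choice made here, not something stated in the slides.

```python
# Toy data (hypothetical): one set of genre labels per screenplay.
SCRIPTS = [
    {"Drama", "Romance"},
    {"Drama"},
    {"Comedy", "Romance"},
    {"Drama", "Romance"},
    {"Drama"},
]


def interest(scripts, g1, g2):
    """Pr(g1, g2) / (Pr(g1) * Pr(g2)); 0.0 if either genre never occurs."""
    n = len(scripts)
    p1 = sum(1 for genres in scripts if g1 in genres) / n
    p2 = sum(1 for genres in scripts if g2 in genres) / n
    p12 = sum(1 for genres in scripts if g1 in genres and g2 in genres) / n
    if p1 == 0 or p2 == 0:
        return 0.0
    return p12 / (p1 * p2)


# Drama and Romance co-occur slightly less often than independence would predict.
print(round(interest(SCRIPTS, "Drama", "Romance"), 2))
```

With very rare genres the denominator becomes tiny, so a single co-occurrence can produce a very large value, as in the Biography and Music cell of the matrix.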
  • 13. Genre Association Rules Example
      - Big genres like Drama and Comedy are more “mixable” with (adaptable to) other genres, which means their interest values are mostly clustered around 1
      - However, niche genres like Family show extremes of interest:
          - Lots of very low interest (near 0): ‘Family’ is incompatible with some genres (like Mystery, Horror, and War)
          - Lots of very high interest (>>1): ‘Family’ is highly compatible with others (like Comedy, Animation, and …)
      Interest values by column: 1 Drama, 2 Musical, 3 Comedy, 4 Crime, 5 Sci-Fi, 6 Mystery, 7 Adventure, 8 Fantasy, 9 Romance, 10 Thriller, 11 Sport, 12 Action, 13 Family, 14 War, 15 Horror, 16 Animation, 17 Biography, 18 Music, 19 Western, 20 History, 21 Film-Noir, 22 Short
                  1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18    19    20    21    22
      Drama       0     0.71  0.71  1.14  0.49  0.82  0.54  0.58  1.3   0.9   0.91  0.64  0.46  1.51  0.43  0.32  1.83  1.83  0.91  1.83  1.8   0.91
      Comedy      0.71  1.54  0     0.84  0.52  0.32  0.73  1.1   1.54  0.33  1.54  0.46  1.97  0.13  0.54  2.23  0     0     0.31  0     0     0
      Family      0.46  9.96  1.97  0     0.54  0     2.69  3.92  1.07  0.08  12.8  0.71  0     0     0.2   15.01 0     0     0     0     0     0
  • 15. TF-IDF
      - In total there are 95,341 unlemmatized word types (features), which is too many to process
      - Basic idea: extract 10 keywords for each movie and combine them into a keyword list (library), which is later used as the feature list
      - For example, the keywords for Braveheart are: {broadsword, william, barn, knights, king, …}
      - In total, after TFIDF, only 4,498 word types remain
      - For new incoming movies, the counts of the keywords then form the feature vector
      - Bag-of-words assumption
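The keyword-library idea can be sketched in a few lines. This assumes the textbook weighting tf × log(N / df); the slides do not state which TF-IDF variant was used, and the function names and the two tiny documents are invented for illustration.

```python
import math
from collections import Counter


def top_keywords(docs, k=10):
    """Return the k highest-scoring TF-IDF words for each document.

    docs maps a title to its token list. Scoring assumes tf * log(N / df).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each word once per document
    keywords = {}
    for title, tokens in docs.items():
        tf = Counter(tokens)
        scores = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[title] = ranked[:k]
    return keywords


def build_library(keywords):
    """Union of all per-document keywords, in a fixed (sorted) order."""
    return sorted({w for words in keywords.values() for w in words})


def featurize(tokens, library):
    """Bag-of-words count vector over the keyword library."""
    counts = Counter(tokens)
    return [counts[w] for w in library]


docs = {
    "a": ["king", "king", "sword", "man"],
    "b": ["ship", "ship", "laser", "man"],
}
library = build_library(top_keywords(docs, k=2))
print(library)
print(featurize(["king", "ship", "ship", "dog"], library))
```

Words outside the library (here "dog") are simply ignored when a new script is featurized, which is what keeps the feature vector at a fixed, reduced length.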
  • 16. Naïve Bayes Classifier, TFIDF Features
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Drama        78     48     42     17     0.65      0.82    0.62       0.47         0.71
      Musical      1      2      179    3      0.97      0.25    0.33       0.99         0.29
      Comedy       46     39     78     22     0.67      0.68    0.54       0.67         0.6
      Crime        17     50     111    7      0.69      0.71    0.25       0.69         0.37
      Sci-Fi       17     19     144    5      0.87      0.77    0.47       0.88         0.59
      Mystery      10     25     143    7      0.83      0.59    0.29       0.85         0.38
      Adventure    19     21     138    7      0.85      0.73    0.48       0.87         0.58
      Fantasy      12     21     144    8      0.84      0.6     0.36       0.87         0.45
      Romance      28     51     98     8      0.68      0.78    0.35       0.66         0.49
      Thriller     41     27     96     21     0.74      0.66    0.6        0.78         0.63
      Sport        -      -      184    1      0.99      -       NaN        1            NaN
      Action       38     19     121    7      0.86      0.84    0.67       0.86         0.75
      Family       4      3      176    2      0.97      0.67    0.57       0.98         0.62
      War          2      8      174    1      0.95      0.67    0.2        0.96         0.31
      Horror       22     24     128    11     0.81      0.67    0.48       0.84         0.56
      Animation    6      4      175    -      0.98      1       0.6        0.98         0.75
      Biography    -      -      185    -      1         NaN     NaN        1            NaN
      Music        -      -      185    -      1         NaN     NaN        1            NaN
      Western      1      2      182    -      0.99      1       0.33       0.99         0.5
      History      -      -      185    -      1         NaN     NaN        1            NaN
      Film-Noir    -      -      183    2      0.99      -       NaN        1            NaN
      Short        -      -      184    1      0.99      -       NaN        1            NaN
      Grand Total  21.4   22.7   147.0  7.6    0.88      0.72    0.45       0.88         0.54
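The derived columns follow from the four confusion counts by the standard definitions. The sketch below recomputes them; the function name `metrics` is illustrative, and undefined ratios are returned as NaN to mirror the NaN cells in the table.

```python
def _ratio(num, den):
    """num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def metrics(tp, fp, tn, fn):
    """Standard per-class metrics from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "specificity": _ratio(tn, tn + fp),
        "f": _ratio(2 * precision * recall, precision + recall),
    }


# Drama row of the table: TP=78, FP=48, TN=42, FN=17.
drama = metrics(78, 48, 42, 17)
print({name: round(value, 2) for name, value in drama.items()})
```

For rare genres the accuracy is dominated by true negatives, so recall, precision, and F are the more informative columns.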
  • 17. Boosting Classifier, TFIDF Features
      - Tuning parameter K = number of trees in forest
      [Chart: one line per genre (Drama, Musical, Comedy, Crime, Sci-Fi, Mystery, Adventure, Fantasy, Romance, Thriller, Sport, Action, Family, War, Horror, Animation, Biography, Music, Western); horizontal axis K = 20, 100, 1000, 10000; vertical axis ticks from 0.6 to 1.05]
  • 18. Boosting Classifier, TFIDF Features
      Genre        20      100     1000    10000   NB
      Drama        0.66    0.66    0.67    0.68    0.65
      Musical      0.98    0.98    0.98    0.98    0.97
      Comedy       0.73    0.75    0.77    0.79    0.67
      Crime        0.80    0.83    0.83    0.83    0.69
      Sci-Fi       0.89    0.88    0.90    0.90    0.87
      Mystery      0.88    0.90    0.91    0.90    0.83
      Adventure    0.85    0.87    0.88    0.88    0.85
      Fantasy      0.86    0.88    0.89    0.89    0.84
      Romance      0.81    0.82    0.84    0.84    0.68
      Thriller     0.72    0.72    0.74    0.75    0.74
      Sport        1.00    1.00    1.00    1.00    0.99
      Action       0.79    0.81    0.83    0.84    0.86
      Family       0.96    0.96    0.96    0.96    0.97
      War          0.97    0.97    0.97    0.97    0.95
      Horror       0.87    0.90    0.91    0.91    0.81
      Animation    0.97    0.96    0.97    0.97    0.98
      Biography    1.00    1.00    1.00    1.00    1.00
      Music        0.99    0.99    0.99    0.99    1.00
      Western      0.99    0.99    0.99    0.99    0.99
      History      1.00    1.00    1.00    1.00    1.00
      Film-Noir    0.99    0.99    0.99    0.99    0.99
      Short        1.00    1.00    1.00    1.00    0.99
  • 19. Modified TFIDF
      - In normal TFIDF, a word is penalized by the number of documents in which it appears
      - However, we don’t want to penalize a word for appearing in documents of the same genre; instead, we penalize only its appearances in out-of-genre documents
      - Modify it as:
  • 20. Example of Modified TFIDF
      - 6 docs:
          - doc1 = ['man', 'star', 'ship', 'laser', 'star', 'star', 'star', 'star', 'star', 'star', 'star', 'star'] (Sci-Fi)
          - doc2 = ['man', 'steve', 'ship', 'water', 'water', 'water', 'water', 'water', 'water', 'water'] (Sci-Fi)
          - doc3 = ['man', 'john', 'ship', 'diamond', 'diamond', 'diamond', 'diamond', 'diamond', 'diamond'] (Sci-Fi)
          - doc4 = ['man', 'john', 'ship', 'diamond', 'diamond', 'diamond', 'diamond', 'diamond', 'diamond'] (Sci-Fi)
          - doc5 = ['man', 'john', 'ship', 'diamond', 'diamond', 'diamond', 'diamond', 'diamond', 'diamond'] (Sci-Fi)
          - doc6 = ['man', 'man', 'man', 'man', 'man', 'man', 'man', 'man', 'man', 'bed', 'chair', 'lamp'] (Family)
      - When we extract keywords for doc1:
          - Normal TFIDF:
              - Score(star) = 9/1 = 9
              - Score(ship) = 1/5 = 0.2
          - Modified TFIDF:
              - Score(star) = log2(9+1) * 1/(0+1) = 3.3
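The worked example can be reproduced in code. The scoring rules below are inferred from the numbers on this slide only: normal score = tf / df, and modified score = log2(tf + 1) × 1 / (out-of-genre df + 1). Treat both as assumptions, since the deck's actual formula is not recoverable from the transcript; the function names are illustrative.

```python
import math

# The six toy documents from the slide, each with its genre label.
DOCS = [
    (["man", "star", "ship", "laser"] + ["star"] * 8, "Sci-Fi"),
    (["man", "steve", "ship"] + ["water"] * 7, "Sci-Fi"),
    (["man", "john", "ship"] + ["diamond"] * 6, "Sci-Fi"),
    (["man", "john", "ship"] + ["diamond"] * 6, "Sci-Fi"),
    (["man", "john", "ship"] + ["diamond"] * 6, "Sci-Fi"),
    (["man"] * 9 + ["bed", "chair", "lamp"], "Family"),
]


def normal_score(word, doc_index):
    """tf / df, as in the slide's 'Normal TFIDF' lines."""
    tokens, _ = DOCS[doc_index]
    tf = tokens.count(word)
    df = sum(1 for other, _ in DOCS if word in other)
    return tf / df


def modified_score(word, doc_index):
    """log2(tf + 1) / (out-of-genre df + 1): inferred from the slide's example."""
    tokens, genre = DOCS[doc_index]
    tf = tokens.count(word)
    out_df = sum(1 for other, g in DOCS if g != genre and word in other)
    return math.log2(tf + 1) / (out_df + 1)


print(normal_score("star", 0), normal_score("ship", 0))
print(round(modified_score("star", 0), 1))
```

Under this reading, "ship" is no longer penalized for appearing in all five Sci-Fi documents, while "man" still is, because it also appears in the Family document.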
  • 22. Topic Modeling
      - Idea: Use topics for dimension reduction of raw word frequencies
          - Instead of tens of thousands of word frequencies, cluster words into a few hundred topics
          - Topic prevalence becomes the feature set for learning algorithms
      - Two phases of Machine Learning:
          1. Unsupervised learning to cluster words into topics
          2. Supervised learning of models relating topic distributions to genre labels
  • 23. Topic Modeling with LDA. Source: David Blei, “Probabilistic Topic Models” (Communications of the ACM, 2012)
  • 24. Implementation Details
      - Extensive Text Pre-Processing:
          - Remove stop-words and film terms
          - Remove character names (more on this…)
          - Convert files to word-count vectors
      - Used Gensim implementation of Latent Dirichlet Allocation to extract topics
          - Varied num_topics = 32, 64, 128, 256, 512
          - Which is best? Looked at…
              - Change in model performance (accuracy, F-measure) as the number of topics increased
              - Manually inspected topics to see if they were crystallizing into intelligible themes
              - 128 or 256 seem best and roughly equivalent
      - Several different supervised learning algorithms:
          - Logistic regression
          - Naïve Bayes
          - SVM RBF
  • 25. Example: Interpretable Topics (Num_Topics = 256). Each entry is weight followed by word, top 50 words per topic.
      - Topic #126 (Sports): 0.039 game, 0.027 ball, 0.02 player, 0.017 field, 0.016 team, 0.012 play, 0.011 coach, 0.01 football, 0.01 guy, 0.009 one, 0.009 locker, 0.009 hit, 0.009 run, 0.009 look, 0.009 big, 0.008 back, 0.008 line, 0.008 got, 0.007 first, 0.007 get, 0.006 right, 0.006 three, 0.006 like, 0.006 walk, 0.006 stadium, 0.005 mcginty, 0.005 throw, 0.005 two, 0.005 take, 0.005 gon, 0.005 five, 0.005 left, 0.005 second, 0.005 end, 0.005 see, 0.005 score, 0.005 win, 0.004 pitch, 0.004 head, 0.004 hand, 0.004 pas, 0.004 come, 0.004 point, 0.004 time, 0.004 just, 0.004 good, 0.004 pull, 0.004 dallas, 0.004 year, 0.004 mike
      - Topic #212 (Space Travel): 0.017 ship, 0.011 control, 0.009 space, 0.007 cockpit, 0.007 one, 0.006 light, 0.006 see, 0.006 move, 0.006 two, 0.005 begin, 0.005 turn, 0.005 robot, 0.005 get, 0.005 back, 0.004 corridor, 0.004 bay, 0.004 bridge, 0.004 horizon, 0.004 around, 0.004 look, 0.004 pilot, 0.004 planet, 0.004 right, 0.004 going, 0.004 just, 0.004 panel, 0.004 head, 0.004 lewis, 0.004 come, 0.004 system, 0.004 like, 0.004 radio, 0.004 base, 0.004 away, 0.004 giant, 0.003 cloud, 0.003 air, 0.003 event, 0.003 laser, 0.003 power, 0.003 know, 0.003 toward, 0.003 main, 0.003 star, 0.003 huge, 0.003 door, 0.003 suddenly, 0.003 will, 0.003 falcon, 0.003 hit
      - Topic #100 (Ships & Sailing): 0.037 boat, 0.035 water, 0.019 deck, 0.017 ship, 0.014 island, 0.013 sea, 0.011 ocean, 0.011 back, 0.01 look, 0.009 beach, 0.008 hold, 0.008 come, 0.008 see, 0.007 radio, 0.007 light, 0.007 chimera, 0.007 dock, 0.007 crew, 0.006 cabin, 0.006 continuous, 0.006 one, 0.006 surface, 0.006 raft, 0.005 foot, 0.005 underwater, 0.005 shore, 0.005 swim, 0.005 beat, 0.004 just, 0.004 move, 0.004 way, 0.004 room, 0.004 along, 0.004 bow, 0.004 port, 0.004 open, 0.004 take, 0.004 warrior, 0.004 harbor, 0.004 arctic, 0.004 another, 0.004 sub, 0.004 end, 0.004 hull, 0.004 like, 0.004 forward, 0.004 line, 0.003 toward, 0.003 swimming, 0.003 sailor
  • 26. Highest Prevalence of Topic #126 (film, topic prevalence): Replacements, The (0.63); Program, The (0.19); Moneyball (0.16); Major League (0.16); Blind Side, The (0.14); Love and Basketball (0.14); Bull Durham (0.14); Sugar (0.12); Forrest Gump (0.11); Semi-Pro (0.10); Two For The Money (0.10); Damned United, The (0.09); Sandlot Kids, The (0.09); Field of Dreams (0.08); Tin Cup (0.08); Invictus (0.07); eXistenZ (0.06); Buffy the Vampire Slayer (0.05); The Rage: Carrie 2 (0.05); Game 6 (0.05); Cincinnati Kid, The (0.05)
  • 27. Highest Prevalence of Topic #212 (film, topic prevalence): Star Wars: The Empire Strikes Back (0.85); Event Horizon (0.59); Star Wars: A New Hope (0.47); Dark Star (0.32); Alien (0.25); Lost in Space (0.21); Jason X (0.21); TRON (0.19); Star Wars: The Phantom Menace (0.19); Dune (0.16); Independence Day (0.16); Pandorum (0.16); Wall-E (0.16); Mission to Mars (0.15); Airplane 2: The Sequel (0.14); Leviathan (0.13); Abyss, The (0.13); Prometheus (0.12); Pitch Black (0.11); Aliens (0.11); Sphere (0.11); Oblivion (0.11); Heavy Metal (0.10); Thor (0.09); Star Wars: Return of the Jedi (0.09); Moon (0.09)
  • 28. Highest Prevalence of Topic #100 (film, topic prevalence): Ghost Ship (0.70); Life of Pi (0.17); Hard Rain (0.13); Jaws 2 (0.13); Jaws (0.12); Master and Commander (0.11); Titanic (0.11); Deep Rising (0.11); Big Blue, The (0.10); Abyss, The (0.10); Cast Away (0.10); Pirates of the Caribbean (0.10); King Kong (0.09); Pearl Harbor (0.09); Lake Placid (0.09); Friday the 13th: Jason Takes Manhattan (0.09); Mud (0.08); Commando (0.06); G.I. Jane (0.06); Jurassic Park III (0.06); Sphere (0.05); Apocalypse Now (0.05); Blood and Wine (0.04); Leviathan (0.04); I Still Know What You Did Last Summer (0.04)
  • 29. Character Names!
      - Infuriating problem! Character names spoil otherwise good topics
      - Topic #203 (of 256): 0.038*kirk + 0.032*decker + 0.019*bridge + 0.019*spock + 0.015*captain + 0.013*mccoy + 0.012*viewer + 0.012*enterprise + 0.011*ilia + 0.010*now + 0.009*console + 0.009*sir + 0.009*scott + 0.008*crew + 0.007*shuttle + 0.007*vulcan + 0.007*space + 0.007*cloud + 0.006*intercom + 0.006*starfleet + 0.006*ship + 0.006*station + 0.005*sulu + 0.005*warp + 0.005*klingon + 0.005*control + 0.005*toward + 0.005*energy + 0.005*chekov + 0.004*moment + 0.004*main + 0.004*another + 0.004*science + 0.004*vessel + 0.004*chamber + 0.004*power + 0.004*transporter + 0.003*pod + 0.003*uhura + 0.003*alien + 0.003*engineering + 0.003*voice + 0.003*deck + 0.003*continues + 0.003*move + 0.003*ahead + 0.003*camera + 0.003*see + 0.003*one + 0.003*computer
      - Character names are some of the most frequent words in screenplays and therefore dominate topics
      - But! They don’t have predictive value, since they are highly unlikely to appear in other screenplays in the same genre
      - Need to eliminate them!
  • 30. Strategies for Removing Character Names
      1. Document-Level:
          - Identify using formatting information: names tend to appear in all caps in the center of a line
          - Remove all instances in all caps (‘STEVE’) or title case (‘Steve’)
      2. Corpus-Level:
          - Names tend to be very frequent within a single document, but do not recur across documents
          - Remove all words that occur in ≤ 3 documents. This also helps to eliminate other noise (e.g., typos, “aaaargh”, etc.)
          - Problem: names that do recur across documents
              - Franchise films with many sequels (e.g., Star Wars)
              - Very common names (John, Sue, David, etc.)
      Note: In retrospect, it would have been wiser to extract only verbs and nouns using a POS-tagger…
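Both strategies reduce to simple token filters. The sketch below is illustrative only: the function names are hypothetical, the document-level filter uses capitalization alone (it ignores centering), and the threshold of 3 is the one quoted on the slide.

```python
from collections import Counter


def drop_capitalized(tokens):
    """Document-level: drop tokens in all caps ('STEVE') or title case ('Steve').

    Note: this also drops ordinary sentence-initial words, a known cost.
    """
    return [t for t in tokens if not (t.isupper() or t.istitle())]


def drop_rare_words(docs, max_docs=3):
    """Corpus-level: drop every word that occurs in max_docs or fewer documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency, not raw count
    return [[w for w in tokens if df[w] > max_docs] for tokens in docs]


docs = [
    ["kirk", "ship", "kirk"],
    ["ripley", "ship"],
    ["han", "ship"],
    ["bowman", "ship"],
]
print(drop_rare_words(docs))
print(drop_capitalized(["STEVE", "Steve", "opens", "door"]))
```

The corpus-level filter removes a name only when it is confined to a few scripts, so names shared across a franchise's sequels or common first names would pass through it.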
  • 32. Results: LR vs. NB vs. SVM (Num_Topics = 256)
      SVM RBF: TRAINING / IN-SAMPLE (800 Scripts)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       186    31     545    38     0.91      0.91    0.83       0.95         0.84
      Adventure    84     8      670    38     0.94      0.94    0.69       0.99         0.79
      Comedy       210    36     504    50     0.89      0.89    0.81       0.93         0.83
      Crime        104    24     626    46     0.91      0.91    0.69       0.96         0.75
      Drama        389    85     276    50     0.83      0.83    0.89       0.76         0.85
      Fantasy      47     3      719    31     0.96      0.96    0.60       1.00         0.73
      Horror       79     12     677    32     0.95      0.95    0.71       0.98         0.78
      Romance      59     10     642    89     0.88      0.88    0.40       0.98         0.54
      Sci-Fi       99     7      670    24     0.96      0.96    0.80       0.99         0.86
      Thriller     237    57     458    48     0.87      0.87    0.83       0.89         0.82
      Western      7      -      791    2      1.00      1.00    0.78       1.00         0.88
      Grand Total  136.5  24.8   598.0  40.7   0.92      0.92    0.73       0.95         0.79
      SVM RBF: TESTING / OUT-OF-SAMPLE (156 Scripts)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       22     11     82     7      0.85      0.85    0.76       0.88         0.71
      Adventure    12     4      97     9      0.89      0.89    0.57       0.96         0.65
      Comedy       23     14     69     16     0.75      0.75    0.59       0.83         0.61
      Crime        6      9      90     17     0.79      0.79    0.26       0.91         0.32
      Drama        49     27     30     16     0.65      0.65    0.75       0.53         0.70
      Fantasy      3      5      97     17     0.82      0.82    0.15       0.95         0.21
      Horror       8      2      100    12     0.89      0.89    0.40       0.98         0.53
      Romance      -      1      101    20     0.83      0.83    -          0.99         #DIV/0!
      Sci-Fi       12     5      98     7      0.90      0.90    0.63       0.95         0.67
      Thriller     26     15     70     11     0.79      0.79    0.70       0.82         0.67
      Western      1      -      121    -      1.00      1.00    1.00       1.00         1.00
      Grand Total  14.7   8.5    86.8   12.0   0.83      0.83    0.53       0.89         0.61
      NAÏVE BAYES: TRAINING / IN-SAMPLE (800 Scripts)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       141    44     532    83     0.84      0.84    0.63       0.92         0.69
      Adventure    43     32     646    79     0.86      0.86    0.35       0.95         0.44
      Comedy       151    70     470    109    0.78      0.78    0.58       0.87         0.63
      Crime        24     32     618    126    0.80      0.80    0.16       0.95         0.23
      Drama        365    157    204    74     0.71      0.71    0.83       0.57         0.76
      Fantasy      15     31     691    63     0.88      0.88    0.19       0.96         0.24
      Horror       23     25     664    88     0.86      0.86    0.21       0.96         0.29
      Romance      20     25     627    128    0.81      0.81    0.14       0.96         0.21
      Sci-Fi       50     28     649    73     0.87      0.87    0.41       0.96         0.50
      Thriller     181    83     432    104    0.77      0.77    0.64       0.84         0.66
      Western      1      26     765    8      0.96      0.96    0.11       0.97         0.06
      Grand Total  92.2   50.3   572.5  85.0   0.83      0.83    0.39       0.90         0.43
      NAÏVE BAYES: TESTING / OUT-OF-SAMPLE (156 Scripts)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       20     10     83     9      0.84      0.84    0.69       0.89         0.68
      Adventure    2      -      101    19     0.84      0.84    0.10       1.00         0.17
      Comedy       15     12     71     24     0.70      0.70    0.38       0.86         0.45
      Crime        1      1      98     22     0.81      0.81    0.04       0.99         0.08
      Drama        53     30     27     12     0.66      0.66    0.82       0.47         0.72
      Fantasy      -      -      102    20     0.84      0.84    -          1.00         #DIV/0!
      Horror       -      -      102    20     0.84      0.84    -          1.00         #DIV/0!
      Romance      -      -      102    20     0.84      0.84    -          1.00         #DIV/0!
      Sci-Fi       6      2      101    13     0.88      0.88    0.32       0.98         0.44
      Thriller     19     10     75     18     0.77      0.77    0.51       0.88         0.58
      Western      -      -      121    1      0.99      0.99    -          1.00         #DIV/0!
      Grand Total  10.5   5.9    89.4   16.2   0.82      0.82    0.26       0.92         0.45
      LOGISTIC REG: TRAINING / IN-SAMPLE (800 Scripts)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       116    37     539    108    0.82      0.82    0.52       0.94         0.62
      Adventure    2      1      677    120    0.85      0.85    0.02       1.00         0.03
      Comedy       112    51     489    148    0.75      0.75    0.43       0.91         0.53
      Crime        13     15     635    137    0.81      0.81    0.09       0.98         0.15
      Drama        365    161    200    74     0.71      0.71    0.83       0.55         0.76
      Fantasy      -      -      722    78     0.90      0.90    -          1.00         #DIV/0!
      Horror       6      4      685    105    0.86      0.86    0.05       0.99         0.10
      Romance      5      1      651    143    0.82      0.82    0.03       1.00         0.06
      Sci-Fi       18     7      670    105    0.86      0.86    0.15       0.99         0.24
      Thriller     148    67     448    137    0.75      0.75    0.52       0.87         0.59
      Western      -      -      791    9      0.99      0.99    -          1.00         #DIV/0!
      Grand Total  71.4   31.3   591.5  105.8  0.83      0.83    0.24       0.93         0.34
      LOGISTIC REG: TESTING / OUT-OF-SAMPLE (156 Scripts)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       15     7      86     14     0.83      0.83    0.52       0.92         0.59
      Adventure    -      -      101    21     0.83      0.83    -          1.00         #DIV/0!
      Comedy       14     8      75     25     0.73      0.73    0.36       0.90         0.46
      Crime        -      1      98     23     0.80      0.80    -          0.99         #DIV/0!
      Drama        51     30     27     14     0.64      0.64    0.78       0.47         0.70
      Fantasy      -      -      102    20     0.84      0.84    -          1.00         #DIV/0!
      Horror       -      -      102    20     0.84      0.84    -          1.00         #DIV/0!
      Romance      -      -      102    20     0.84      0.84    -          1.00         #DIV/0!
      Sci-Fi       -      -      103    19     0.84      0.84    -          1.00         #DIV/0!
      Thriller     19     9      76     18     0.78      0.78    0.51       0.89         0.58
      Western      -      -      121    1      0.99      0.99    -          1.00         #DIV/0!
      Grand Total  9.0    5.0    90.3   17.7   0.81      0.81    0.20       0.93         0.58
  • 33. Results: Number of Topics, SVM_RBF (C=100, GAMMA=1.0)
      32 Topics: TRAINING (IN-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       177    33     543    47     0.90      0.90    0.79       0.94         0.82
      Adventure    70     15     663    52     0.92      0.92    0.57       0.98         0.68
      Comedy       193    52     488    67     0.85      0.85    0.74       0.90         0.76
      Crime        59     9      641    91     0.88      0.88    0.39       0.99         0.54
      Drama        380    89     272    59     0.82      0.82    0.87       0.75         0.84
      Fantasy      26     2      720    52     0.93      0.93    0.33       1.00         0.49
      Horror       59     13     676    52     0.92      0.92    0.53       0.98         0.64
      Romance      53     8      644    95     0.87      0.87    0.36       0.99         0.51
      Sci-Fi       83     18     659    40     0.93      0.93    0.67       0.97         0.74
      Thriller     218    70     445    67     0.83      0.83    0.76       0.86         0.76
      Western      7      1      790    2      1.00      1.00    0.78       1.00         0.82
      Grand Total  120.5  28.2   594.6  56.7   0.89      0.89    0.62       0.94         0.69
      32 Topics: TESTING (OUT-OF-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       24     12     81     5      0.86      0.86    0.83       0.87         0.74
      Adventure    13     6      95     8      0.89      0.89    0.62       0.94         0.65
      Comedy       16     9      74     23     0.74      0.74    0.41       0.89         0.50
      Crime        8      10     89     15     0.80      0.80    0.35       0.90         0.39
      Drama        47     29     28     18     0.61      0.61    0.72       0.49         0.67
      Fantasy      3      -      102    17     0.86      0.86    0.15       1.00         0.26
      Horror       8      5      97     12     0.86      0.86    0.40       0.95         0.48
      Romance      2      3      99     18     0.83      0.83    0.10       0.97         0.16
      Sci-Fi       10     3      100    9      0.90      0.90    0.53       0.97         0.63
      Thriller     30     18     67     7      0.80      0.80    0.81       0.79         0.71
      Western      1      -      121    -      1.00      1.00    1.00       1.00         1.00
      Grand Total  14.7   8.6    86.6   12.0   0.83      0.83    0.54       0.89         0.56
      64 Topics: TRAINING (IN-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       177    32     544    47     0.90      0.90    0.79       0.94         0.82
      Adventure    75     16     662    47     0.92      0.92    0.61       0.98         0.70
      Comedy       168    32     508    92     0.85      0.85    0.65       0.94         0.73
      Crime        81     26     624    69     0.88      0.88    0.54       0.96         0.63
      Drama        382    104    257    57     0.80      0.80    0.87       0.71         0.83
      Fantasy      26     1      721    52     0.93      0.93    0.33       1.00         0.50
      Horror       64     8      681    47     0.93      0.93    0.58       0.99         0.70
      Romance      69     16     636    79     0.88      0.88    0.47       0.98         0.59
      Sci-Fi       83     6      671    40     0.94      0.94    0.67       0.99         0.78
      Thriller     223    51     464    62     0.86      0.86    0.78       0.90         0.80
      Western      7      -      791    2      1.00      1.00    0.78       1.00         0.88
      Grand Total  123.2  26.5   596.3  54.0   0.90      0.90    0.64       0.94         0.72
      64 Topics: TESTING (OUT-OF-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       21     12     81     8      0.84      0.84    0.72       0.87         0.68
      Adventure    9      2      99     12     0.89      0.89    0.43       0.98         0.56
      Comedy       12     11     72     27     0.69      0.69    0.31       0.87         0.39
      Crime        1      1      98     22     0.81      0.81    0.04       0.99         0.08
      Drama        44     26     31     21     0.61      0.61    0.68       0.54         0.65
      Fantasy      1      -      102    19     0.84      0.84    0.05       1.00         0.10
      Horror       9      5      97     11     0.87      0.87    0.45       0.95         0.53
      Romance      1      1      101    19     0.84      0.84    0.05       0.99         0.09
      Sci-Fi       10     3      100    9      0.90      0.90    0.53       0.97         0.63
      Thriller     24     10     75     13     0.81      0.81    0.65       0.88         0.68
      Western      1      -      121    -      1.00      1.00    1.00       1.00         1.00
      Grand Total  12.1   6.5    88.8   14.6   0.83      0.83    0.45       0.91         0.49
  • 34. Results: Number of Topics, SVM_RBF (C=100, GAMMA=1.0)
      128 Topics: TRAINING (IN-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       184    31     545    40     0.91      0.91    0.82       0.95         0.84
      Adventure    82     7      671    40     0.94      0.94    0.67       0.99         0.78
      Comedy       188    34     506    72     0.87      0.87    0.72       0.94         0.78
      Crime        88     28     622    62     0.89      0.89    0.59       0.96         0.66
      Drama        389    80     281    50     0.84      0.84    0.89       0.78         0.86
      Fantasy      42     -      722    36     0.96      0.96    0.54       1.00         0.70
      Horror       86     13     676    25     0.95      0.95    0.77       0.98         0.82
      Romance      68     14     638    80     0.88      0.88    0.46       0.98         0.59
      Sci-Fi       91     7      670    32     0.95      0.95    0.74       0.99         0.82
      Thriller     230    42     473    55     0.88      0.88    0.81       0.92         0.83
      Western      7      -      791    2      1.00      1.00    0.78       1.00         0.88
      Grand Total  132.3  23.3   599.5  44.9   0.91      0.91    0.71       0.95         0.78
      128 Topics: TESTING (OUT-OF-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       21     7      86     8      0.88      0.88    0.72       0.92         0.74
      Adventure    10     5      96     11     0.87      0.87    0.48       0.95         0.56
      Comedy       17     7      76     22     0.76      0.76    0.44       0.92         0.54
      Crime        6      8      91     17     0.80      0.80    0.26       0.92         0.32
      Drama        50     32     25     15     0.61      0.61    0.77       0.44         0.68
      Fantasy      2      1      101    18     0.84      0.84    0.10       0.99         0.17
      Horror       8      2      100    12     0.89      0.89    0.40       0.98         0.53
      Romance      2      4      98     18     0.82      0.82    0.10       0.96         0.15
      Sci-Fi       13     1      102    6      0.94      0.94    0.68       0.99         0.79
      Thriller     27     11     74     10     0.83      0.83    0.73       0.87         0.72
      Western      1      -      121    -      1.00      1.00    1.00       1.00         1.00
      Grand Total  14.3   7.1    88.2   12.5   0.84      0.84    0.52       0.90         0.56
      256 Topics: TRAINING (IN-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       186    31     545    38     0.91      0.91    0.83       0.95         0.84
      Adventure    84     8      670    38     0.94      0.94    0.69       0.99         0.79
      Comedy       210    36     504    50     0.89      0.89    0.81       0.93         0.83
      Crime        104    24     626    46     0.91      0.91    0.69       0.96         0.75
      Drama        389    85     276    50     0.83      0.83    0.89       0.76         0.85
      Fantasy      47     3      719    31     0.96      0.96    0.60       1.00         0.73
      Horror       79     12     677    32     0.95      0.95    0.71       0.98         0.78
      Romance      59     10     642    89     0.88      0.88    0.40       0.98         0.54
      Sci-Fi       99     7      670    24     0.96      0.96    0.80       0.99         0.86
      Thriller     237    57     458    48     0.87      0.87    0.83       0.89         0.82
      Western      7      -      791    2      1.00      1.00    0.78       1.00         0.88
      Grand Total  136.5  24.8   598.0  40.7   0.92      0.92    0.73       0.95         0.79
      256 Topics: TESTING (OUT-OF-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       22     11     82     7      0.85      0.85    0.76       0.88         0.71
      Adventure    12     4      97     9      0.89      0.89    0.57       0.96         0.65
      Comedy       23     14     69     16     0.75      0.75    0.59       0.83         0.61
      Crime        6      9      90     17     0.79      0.79    0.26       0.91         0.32
      Drama        49     27     30     16     0.65      0.65    0.75       0.53         0.70
      Fantasy      3      5      97     17     0.82      0.82    0.15       0.95         0.21
      Horror       8      2      100    12     0.89      0.89    0.40       0.98         0.53
      Romance      -      1      101    20     0.83      0.83    -          0.99         #DIV/0!
      Sci-Fi       12     5      98     7      0.90      0.90    0.63       0.95         0.67
      Thriller     26     15     70     11     0.79      0.79    0.70       0.82         0.67
      Western      1      -      121    -      1.00      1.00    1.00       1.00         1.00
      Grand Total  14.7   8.5    86.8   12.0   0.83      0.83    0.53       0.89         0.61
      512 Topics: TRAINING (IN-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       195    23     553    29     0.94      0.94    0.87       0.96         0.88
      Adventure    86     7      671    36     0.95      0.95    0.70       0.99         0.80
      Comedy       194    36     504    66     0.87      0.87    0.75       0.93         0.79
      Crime        110    13     637    40     0.93      0.93    0.73       0.98         0.81
      Drama        398    71     290    41     0.86      0.86    0.91       0.80         0.88
      Fantasy      37     -      722    41     0.95      0.95    0.47       1.00         0.64
      Horror       89     13     676    22     0.96      0.96    0.80       0.98         0.84
      Romance      77     10     642    71     0.90      0.90    0.52       0.98         0.66
      Sci-Fi       92     12     665    31     0.95      0.95    0.75       0.98         0.81
      Thriller     238    39     476    47     0.89      0.89    0.84       0.92         0.85
      Western      8      -      791    1      1.00      1.00    0.89       1.00         0.94
      Grand Total  138.5  20.4   602.5  38.6   0.93      0.93    0.75       0.96         0.81
      512 Topics: TESTING (OUT-OF-SAMPLE)
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       23     12     81     6      0.85      0.85    0.79       0.87         0.72
      Adventure    12     4      97     9      0.89      0.89    0.57       0.96         0.65
      Comedy       23     17     66     16     0.73      0.73    0.59       0.80         0.58
      Crime        6      3      96     17     0.84      0.84    0.26       0.97         0.38
      Drama        49     26     31     16     0.66      0.66    0.75       0.54         0.70
      Fantasy      2      1      101    18     0.84      0.84    0.10       0.99         0.17
      Horror       7      2      100    13     0.88      0.88    0.35       0.98         0.48
      Romance      1      4      98     19     0.81      0.81    0.05       0.96         0.08
      Sci-Fi       11     4      99     8      0.90      0.90    0.58       0.96         0.65
      Thriller     27     10     75     10     0.84      0.84    0.73       0.88         0.73
      Western      1      1      120    -      0.99      0.99    1.00       0.99         0.67
      Grand Total  14.7   7.6    87.6   12.0   0.84      0.84    0.53       0.90         0.53
  • 35. Results: Predictability by Genre
      SVM_RBF (C=100, GAMMA=1.0), TOPICS=128, TESTING
      Genre        TP     FP     TN     FN     Accuracy  Recall  Precision  Specificity  F
      Action       21     7      86     8      0.88      0.88    0.72       0.92         0.74
      Adventure    10     5      96     11     0.87      0.87    0.48       0.95         0.56
      Comedy       17     7      76     22     0.76      0.76    0.44       0.92         0.54
      Crime        6      8      91     17     0.80      0.80    0.26       0.92         0.32
      Drama        50     32     25     15     0.61      0.61    0.77       0.44         0.68
      Fantasy      2      1      101    18     0.84      0.84    0.10       0.99         0.17
      Horror       8      2      100    12     0.89      0.89    0.40       0.98         0.53
      Romance      2      4      98     18     0.82      0.82    0.10       0.96         0.15
      Sci-Fi       13     1      102    6      0.94      0.94    0.68       0.99         0.79
      Thriller     27     11     74     10     0.83      0.83    0.73       0.87         0.72
      Western      1      -      121    -      1.00      1.00    1.00       1.00         1.00
      Grand Total  14.3   7.1    88.2   12.5   0.84      0.84    0.52       0.90         0.56
      - High Predictability (F > 0.7): Sci-Fi, Action, Thriller, Western
      - Low Predictability (F < 0.4): Crime, Fantasy, Romance
  • 39. Program Architecture (diagram components)
      - Website
      - ScriptScraper Class
      - ScriptDatabase Class
      - TextCleaner Class
      - Screenplay Class
          - Attributes: Title, Text, Genres, Word Freq, Topics
          - Methods: Getter and Setter Functions
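The class layout named on this slide could look roughly like the following. This is a hypothetical sketch, not the project's code: only the class and attribute names come from the slide, the method names (`clean`, `add`, `get`, `by_genre`) are invented, and `ScriptScraper` is omitted because the source website is not identified.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Screenplay:
    """Attributes listed on the slide: title, text, genres, word freq, topics."""
    title: str
    text: str
    genres: list = field(default_factory=list)
    word_freq: Counter = field(default_factory=Counter)
    topics: dict = field(default_factory=dict)  # topic id -> prevalence


class TextCleaner:
    """Placeholder cleaner: lowercases and splits text into word tokens."""

    def clean(self, text):
        return re.findall(r"[a-z']+", text.lower())


class ScriptDatabase:
    """In-memory store of screenplays keyed by title."""

    def __init__(self):
        self._scripts = {}

    def add(self, screenplay):
        self._scripts[screenplay.title] = screenplay

    def get(self, title):
        return self._scripts[title]

    def by_genre(self, genre):
        return [s for s in self._scripts.values() if genre in s.genres]


db = ScriptDatabase()
alien = Screenplay(title="Alien", text="The ship drifts.", genres=["Sci-Fi", "Horror"])
alien.word_freq = Counter(TextCleaner().clean(alien.text))
db.add(alien)
print([s.title for s in db.by_genre("Horror")])
```

A dataclass exposes its attributes directly, so the explicit getter and setter functions mentioned on the slide are not reproduced here.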