The document describes a study that aimed to automatically detect the genre of movie scripts using machine learning. The researchers collected 962 movie scripts, each labeled with one or more genres (drama, comedy, sci-fi, etc.), analyzed the associations between genre labels, and tested TF-IDF and topic-modeling approaches for representing scripts as feature vectors that could be classified by genre. Their TF-IDF approach extracted keywords from scripts and used keyword counts as features, with per-genre accuracy on the major genres ranging from about 65% (Drama) to 87% (Sci-Fi).
1. AUTOMATED DETECTION OF MOVIE SCRIPT GENRES
Machine Learning Presentation – May 19, 2014
Graham Sack (gas2117@columbia.edu)
Michael Jiang (jzlpku2009@gmail.com)
Katherine Guo (kgv2107@columbia.edu)
3. Objective
Given a movie script, predict what genre it is (e.g., Drama, Comedy, Sci-Fi, Horror, etc.)
4. Corpus
Built a web-scraper to collect screenplays from internet sources
Collected 962 screenplays
Each screenplay has:
  Full text
  Multiple genre labels (e.g., Action, Horror, Sci-Fi, Thriller, Romance, Comedy)
5. Example: Screenplay Text
Scene Headings: Generally identifiable as text starting with "INT." or "EXT." and/or text that is formatted in all caps and left-justified (for example, "INT. MORGAN'S HOUSE - DAY."). These facilitate feature extraction: from headings it is possible to extract whether the scene is interior vs. exterior, where the scene is located, and the time when it occurs. Scene headings may also contain tags indicating whether the scene is a "FLASHBACK."
Scene Content: Identifiable as text located in between scene headings.
Action: Identifiable as non-caps text with narrow margins. Action paragraphs may also contain the names of non-speaking characters, which are usually indicated in all caps.
Speaking Character: Identifiable as all-caps text that is located by itself and center-justified. Character tags also contain extractable information about whether the character is speaking in voiceover.
Dialogue: Identifiable as regular-caps text that is located by itself and center-justified with wide margins (between the margin width of action and the margin width of character tags). Dialogue typically follows the name of the speaking character, enabling matching of speaker and dialogue.
Interlocutor: Characters that speak back-to-back in scenes can be assumed to be interlocutors engaged in a dialogue.
Scene Length: Number of lines or quantity of page space devoted to a scene. Can be used to estimate the running time of the scene (e.g., assume that 1 page = 1 minute of screen time).
6. Example: Screenplay Text
Voice-over: Identifiable as "(V.O.)"
Non-Speaking Characters: Primary identifier is proper names, professional designations, etc. appearing in all caps in action paragraphs.
Shots: Primary identifier is text that matches a limited library of common phrases used to indicate shots (e.g., "PAN," "CLOSE UP," etc.)
Scene Transitions: Primary identifier is all-caps text that is right-justified. Secondary identifier is text that matches a limited library of common phrases used to indicate transitions (e.g., "CUT," "MATCH CUT," "FADE OUT," etc.)
Character Attributes: When a major character is introduced, screenplays frequently specify key attributes such as the character's age, physical features, and basic personality. These attributes can be automatically identified and extracted as part of a character profile.
Key Objects: Primary identifier is nouns appearing in all caps in action paragraphs. These objects generally play a key role in the scene.
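The formatting cues on these two slides translate directly into simple pattern matching. Below is a minimal sketch in Python; the regular expressions, the indentation threshold, and the `classify_line` helper are illustrative assumptions, not the authors' actual extraction code.

```python
import re

# Heuristic patterns built from the formatting cues above (illustrative).
SCENE_HEADING = re.compile(r"^(INT|EXT)\.?\s+(.+?)(?:\s*-\s*(DAY|NIGHT|LATER))?\.?\s*$")
TRANSITION = re.compile(r"^(CUT TO:|MATCH CUT:|FADE IN:|FADE OUT\.?|DISSOLVE TO:)$")
SPEAKER = re.compile(r"^\s{20,}([A-Z][A-Z.' ]+?)(\s*\((?:V\.O\.|O\.S\.)\))?\s*$")

def classify_line(line):
    """Tag one screenplay line as heading / transition / speaker / other."""
    m = SCENE_HEADING.match(line.strip())
    if m:
        int_ext, location, time_of_day = m.groups()
        return ("heading", {"interior": int_ext == "INT",
                            "location": location, "time": time_of_day})
    if TRANSITION.match(line.strip()):
        return ("transition", line.strip())
    m = SPEAKER.match(line)  # indentation matters here, so match the raw line
    if m:
        return ("speaker", {"name": m.group(1).strip(),
                            "voiceover": m.group(2) is not None})
    return ("other", line.strip())

print(classify_line("INT. MORGAN'S HOUSE - DAY."))
# -> ('heading', {'interior': True, 'location': "MORGAN'S HOUSE", 'time': 'DAY'})
```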
7. Text Pre-Processing
Remove standard stop-words:
Prepositions: “of” “in” “to” “at”, etc.
Pronouns: “he” “she” “it” “they” “me” “my”, etc.
Helper verbs: “are”, “have,” “had”, etc.
Remove common film terms:
Camera / editing: “cut” “close” “shot” “pan” “fade”
“angle”
Scene headings: “INT.” “EXT.” “day” “night” “later”
Dialogue instructions: “cont’d” “V.O.” “O.S.” “omit”
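A minimal sketch of this pre-processing pass, assuming the (partial) word lists shown on the slide rather than the authors' full lists:

```python
import re

STOP_WORDS = {"of", "in", "to", "at",                    # prepositions
              "he", "she", "it", "they", "me", "my",     # pronouns
              "are", "have", "had"}                      # helper verbs
FILM_TERMS = {"cut", "close", "shot", "pan", "fade", "angle",  # camera / editing
              "int", "ext", "day", "night", "later",           # scene headings
              "cont'd", "omit"}                                # dialogue instructions

def preprocess(text):
    """Lowercase, tokenize, and drop stop-words and film terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS | FILM_TERMS]

print(preprocess("INT. MORGAN'S HOUSE - DAY. He cut to the window."))
# -> ["morgan's", 'house', 'the', 'window']
```

Dotted tags like "V.O." and "O.S." would be removed before tokenization in practice; the sketch keeps the tokenizer deliberately simple.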
9. Genre Distribution
Drama 504
Thriller 322
Comedy 299
Action 253
Crime 173
Romance 168
Adventure 143
Sci-Fi 142
Horror 131
Fantasy 98
Mystery 87
Family 36
Animation 29
War 23
Musical 18
Western 10
Film-Noir 4
Music 4
Biography 3
History 3
Sport 2
Short 2
Total # of scripts with full text: 962
Genre labels are NOT mutually exclusive.
10. Genre Association Rules
Most of the screenplays have multiple genre labels. This allows us to analyze the associations between genres:
  Which genre labels tend to be related to one another?
  What rules could we generate from them?
We can adapt metrics from association-rule mining: support, confidence, and interest (a sketch of these metrics follows the table).

Rule                   Confidence   Support
Drama --> Comedy       0.230        0.126
Comedy --> Drama       0.388        0.126
Drama --> Crime        0.214        0.117
Crime --> Drama        0.624        0.117
Drama --> Romance      0.236        0.129
Romance --> Drama      0.708        0.129
Drama --> Thriller     0.313        0.171
Thriller --> Drama     0.491        0.171
Crime --> Thriller     0.590        0.111
Thriller --> Crime     0.317        0.111
Thriller --> Action    0.401        0.140
Action --> Thriller    0.510        0.140
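All three metrics can be computed directly from the multi-label corpus. A minimal sketch (the function name and data layout are assumptions; the deck does not show its code):

```python
from itertools import permutations

def genre_association_rules(label_sets):
    """Support, confidence, and interest for every ordered genre pair.

    label_sets: one set of genre labels per screenplay.
      support(A -> B)    = P(A and B)
      confidence(A -> B) = P(A and B) / P(A)
      interest(A -> B)   = confidence(A -> B) / P(B)    (a.k.a. lift)
    """
    n = len(label_sets)
    genres = set().union(*label_sets)
    count = {g: sum(g in s for s in label_sets) for g in genres}
    rules = {}
    for a, b in permutations(genres, 2):
        both = sum(a in s and b in s for s in label_sets)
        support = both / n
        confidence = both / count[a] if count[a] else 0.0
        interest = confidence * n / count[b] if count[b] else 0.0
        rules[(a, b)] = (support, confidence, interest)
    return rules

# Usage: sup, conf, lift = genre_association_rules(label_sets)[("Crime", "Drama")]
```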
13. Genre Association Rules: Example
Big genres like Drama and Comedy are more "mixable" / adaptive to other genres, which means their interest values mostly cluster around 1.
However, niche genres like Family show extremes of interest:
  Lots of very low interest (near 0): 'Family' is incompatible with some genres (like Mystery, Horror, and War).
  Lots of very high interest (>>1): 'Family' is highly compatible with others (like Comedy, Animation, and Musical).

Interest values, excerpted rows (columns in order: Drama, Musical, Comedy, Crime, Sci-Fi, Mystery, Adventure, Fantasy, Romance, Thriller, Sport, Action, Family, War, Horror, Animation, Biography, Music, Western, History, Film-Noir, Short):
Drama:  0, 0.71, 0.71, 1.14, 0.49, 0.82, 0.54, 0.58, 1.3, 0.9, 0.91, 0.64, 0.46, 1.51, 0.43, 0.32, 1.83, 1.83, 0.91, 1.83, 1.8, 0.91
Comedy: 0.71, 1.54, 0, 0.84, 0.52, 0.32, 0.73, 1.1, 1.54, 0.33, 1.54, 0.46, 1.97, 0.13, 0.54, 2.23, 0, 0, 0.31, 0, 0, 0
Family: 0.46, 9.96, 1.97, 0, 0.54, 0, 2.69, 3.92, 1.07, 0.08, 12.8, 0.71, 0, 0, 0.2, 15.01, 0, 0, 0, 0, 0, 0
15. TF-IDF
In total there are 95,341 unlemmatized word types (features), which is too many for processing.
Basic idea: extract 10 keywords for each movie and combine them into a keyword list (library), which is later used as the feature list.
For example, the keywords for Braveheart are: {broadsword, william, barn, knights, king, …}
In total, after TFIDF, only 4,498 word types.
For new incoming movies, the counts of these keywords then form the feature vector (see the sketch below).
Bag-of-words assumption.
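A minimal sketch of this keyword-library construction using scikit-learn (the deck does not say which TF-IDF implementation was used; function names are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

def build_keyword_library(texts, per_doc=10):
    """Pool the top-`per_doc` TF-IDF words of every script into one
    keyword library, which then serves as the feature list."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)                  # scripts x vocabulary, sparse
    vocab = np.array(vec.get_feature_names_out())
    library = set()
    for i in range(X.shape[0]):
        weights = X.getrow(i).toarray().ravel()
        library.update(vocab[weights.argsort()[-per_doc:]])  # 10 largest weights
    return sorted(library)

def keyword_counts(texts, library):
    """Bag-of-words counts over the keyword library for (new) scripts."""
    return CountVectorizer(vocabulary=library).fit_transform(texts)
```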
16. Naïve Bayes Classifier, TFIDF Features
Genre TP FP TN FN Accuracy Recall Precision Specificity F
Drama 78 48 42 17 0.65 0.82 0.62 0.47 0.71
Musical 1 2 179 3 0.97 0.25 0.33 0.99 0.29
Comedy 46 39 78 22 0.67 0.68 0.54 0.67 0.6
Crime 17 50 111 7 0.69 0.71 0.25 0.69 0.37
Sci-Fi 17 19 144 5 0.87 0.77 0.47 0.88 0.59
Mystery 10 25 143 7 0.83 0.59 0.29 0.85 0.38
Adventure 19 21 138 7 0.85 0.73 0.48 0.87 0.58
Fantasy 12 21 144 8 0.84 0.6 0.36 0.87 0.45
Romance 28 51 98 8 0.68 0.78 0.35 0.66 0.49
Thriller 41 27 96 21 0.74 0.66 0.6 0.78 0.63
Sport - - 184 1 0.99 - NaN 1 NaN
Action 38 19 121 7 0.86 0.84 0.67 0.86 0.75
Family 4 3 176 2 0.97 0.67 0.57 0.98 0.62
War 2 8 174 1 0.95 0.67 0.2 0.96 0.31
Horror 22 24 128 11 0.81 0.67 0.48 0.84 0.56
Animation 6 4 175 - 0.98 1 0.6 0.98 0.75
Biography - - 185 - 1 NaN NaN 1 NaN
Music - - 185 - 1 NaN NaN 1 NaN
Western 1 2 182 - 0.99 1 0.33 0.99 0.5
History - - 185 - 1 NaN NaN 1 NaN
Film-Noir - - 183 2 0.99 - NaN 1 NaN
Short - - 184 1 0.99 - NaN 1 NaN
Average (per genre) 21.4 22.7 147.0 7.6 0.88 0.72 0.45 0.88 0.54
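Because the genre labels are not mutually exclusive, the natural setup behind a table like this is one binary classifier per genre. A sketch of how such results could be produced with scikit-learn (an assumption: the deck does not name its implementation or its exact split, though the per-genre counts above sum to 185 test scripts):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def evaluate_nb(X, label_sets):
    """One binary Naive Bayes per genre over keyword-count features X."""
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(label_sets)                 # scripts x genres, 0/1
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
    clf = OneVsRestClassifier(MultinomialNB()).fit(X_tr, Y_tr)
    print(classification_report(Y_te, clf.predict(X_te),
                                target_names=list(mlb.classes_)))
```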
17. Boosting Classifier, TFIDF Features
Tuning parameter K = number of trees in the ensemble
[Figure: per-genre classification accuracy vs. K; x-axis: 20, 100, 1000, 10000 trees; y-axis: accuracy from 0.6 to 1.05; one line per genre (Drama, Musical, Comedy, Crime, Sci-Fi, Mystery, Adventure, Fantasy, Romance, Thriller, Sport, Action, Family, War, Horror, Animation, Biography, Music, Western).]
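The deck does not name its boosting implementation. A hedged sketch of the K-sweep behind a plot like this, using scikit-learn's gradient boosting as a stand-in:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def sweep_trees(X, y_genre, ks=(20, 100, 1000, 10000)):
    """Cross-validated accuracy for one genre's binary labels y_genre,
    sweeping the number of trees K over the plot's x-axis values."""
    for k in ks:
        clf = GradientBoostingClassifier(n_estimators=k)
        acc = cross_val_score(clf, X, y_genre, cv=5, scoring="accuracy").mean()
        print(f"K={k:>5}: accuracy={acc:.3f}")
```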
19. Modified TFIDF
In standard TFIDF, a word is penalized by the number of documents in which it appears.
However, we don't want to penalize a word for appearing in documents that belong to the same genre as that word; instead, we only penalize based on the out-of-genre documents.
Modify it as:
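The slide's formula did not survive extraction. One plausible formalization of the idea above, offered as an assumption rather than the authors' exact definition: for a term t, a genre g, and a document d labeled with g, count only out-of-genre documents in the IDF term.

```latex
% Standard TF-IDF:  tfidf(t, d) = tf(t, d) * log( N / df(t) )
% Genre-aware variant (assumed reconstruction): penalize t only by its
% occurrences in documents *outside* genre g.
\mathrm{tfidf}_g(t,d) \;=\; \mathrm{tf}(t,d)\cdot
  \log\!\frac{N_{\bar g}}{1 + \lvert\{\, d' \notin g : t \in d' \,\}\rvert},
\qquad N_{\bar g} = \#\{\text{documents not labeled } g\}.
```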
22. Topic Modeling
Idea: use topics for dimension reduction of raw word frequencies
Instead of tens of thousands of word frequencies, cluster words into a few hundred topics
Topic prevalences then become the features for the learning algorithms
Two phases of machine learning:
1. Unsupervised learning to cluster words into topics
2. Supervised learning of models relating topic distributions to genre labels
23. Topic Modeling with LDA
Source: David Blei, "Probabilistic Topic Models," Communications of the ACM, 2012
24. Implementation Details
Extensive text pre-processing:
  Remove stop-words and film terms
  Remove character names (more on this…)
  Convert files to word-count vectors
Used the Gensim implementation of Latent Dirichlet Allocation to extract topics
  Varied num_topics = 32, 64, 128, 256, 512
  Which is best? Looked at…
    Change in model performance (accuracy, F-measure) as the number of topics increased
    Manual inspection of topics to see if they were crystallizing into intelligible themes
  128 or 256 seem best and roughly equivalent
Tried several supervised learning algorithms (a pipeline sketch follows):
  Logistic regression
  Naïve Bayes
  SVM with RBF kernel
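A minimal sketch of phase 1 with Gensim, which the deck names explicitly (variable names and the dense conversion are illustrative assumptions):

```python
from gensim import corpora, models

def lda_topic_features(tokenized_docs, num_topics=256):
    """Fit Gensim LDA and return per-document topic prevalences,
    which become the feature vectors for the phase-2 classifiers."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary)
    features = []
    for bow in bow_corpus:
        dense = [0.0] * num_topics
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            dense[topic_id] = float(prob)
        features.append(dense)
    return lda, features
```

Phase 2 then fits the listed classifiers (logistic regression, Naïve Bayes, SVM with RBF kernel) on these topic-prevalence vectors.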
25. Example: Interpretable Topics
num_topics = 256. Three topics with their top-weighted words (interpretive labels from the slide):

Topic #126 (Sports):
game (0.039), ball (0.027), player (0.020), field (0.017), team (0.016), play (0.012), coach (0.011), football (0.010), guy (0.010), one (0.009), locker (0.009), hit (0.009), run (0.009), look (0.009), big (0.009), back (0.008), line (0.008), got (0.008), first (0.007), get (0.007), right (0.006), three (0.006), like (0.006), walk (0.006), stadium (0.006), mcginty (0.005), throw (0.005), two (0.005), take (0.005), gon (0.005), five (0.005), left (0.005), second (0.005), end (0.005), see (0.005), score (0.005), win (0.005), pitch (0.004), head (0.004), hand (0.004), pas (0.004), come (0.004), point (0.004), time (0.004), just (0.004), good (0.004), pull (0.004), dallas (0.004), year (0.004), mike (0.004)

Topic #212 (Space Travel):
ship (0.017), control (0.011), space (0.009), cockpit (0.007), one (0.007), light (0.006), see (0.006), move (0.006), two (0.006), begin (0.005), turn (0.005), robot (0.005), get (0.005), back (0.005), corridor (0.004), bay (0.004), bridge (0.004), horizon (0.004), around (0.004), look (0.004), pilot (0.004), planet (0.004), right (0.004), going (0.004), just (0.004), panel (0.004), head (0.004), lewis (0.004), come (0.004), system (0.004), like (0.004), radio (0.004), base (0.004), away (0.004), giant (0.004), cloud (0.003), air (0.003), event (0.003), laser (0.003), power (0.003), know (0.003), toward (0.003), main (0.003), star (0.003), huge (0.003), door (0.003), suddenly (0.003), will (0.003), falcon (0.003), hit (0.003)

Topic #100 (Ships & Sailing):
boat (0.037), water (0.035), deck (0.019), ship (0.017), island (0.014), sea (0.013), ocean (0.011), back (0.011), look (0.010), beach (0.009), hold (0.008), come (0.008), see (0.008), radio (0.007), light (0.007), chimera (0.007), dock (0.007), crew (0.007), cabin (0.006), continuous (0.006), one (0.006), surface (0.006), raft (0.006), foot (0.005), underwater (0.005), shore (0.005), swim (0.005), beat (0.005), just (0.004), move (0.004), way (0.004), room (0.004), along (0.004), bow (0.004), port (0.004), open (0.004), take (0.004), warrior (0.004), harbor (0.004), arctic (0.004), another (0.004), sub (0.004), end (0.004), hull (0.004), like (0.004), forward (0.004), line (0.004), toward (0.003), swimming (0.003), sailor (0.003)
26. Highest Prevalence of Topic #126
Film / Topic Prevalence
Replacements, The 0.63
Program, The 0.19
Moneyball 0.16
Major League 0.16
Blind Side, The 0.14
Love and Basketball 0.14
Bull Durham 0.14
Sugar 0.12
Forrest Gump 0.11
Semi-Pro 0.10
Two For The Money 0.10
Damned United, The 0.09
Sandlot Kids, The 0.09
Field of Dreams 0.08
Tin Cup 0.08
Invictus 0.07
eXistenZ 0.06
Buffy the Vampire Slayer 0.05
The Rage: Carrie 2 0.05
Game 6 0.05
Cincinnati Kid, The 0.05
27. Highest Prevalence of Topic #212
Film / Topic Prevalence
Star Wars: The Empire Strikes Back 0.85
Event Horizon 0.59
Star Wars: A New Hope 0.47
Dark Star 0.32
Alien 0.25
Lost in Space 0.21
Jason X 0.21
TRON 0.19
Star Wars: The Phantom Menace 0.19
Dune 0.16
Independence Day 0.16
Pandorum 0.16
Wall-E 0.16
Mission to Mars 0.15
Airplane 2: The Sequel 0.14
Leviathan 0.13
Abyss, The 0.13
Prometheus 0.12
Pitch Black 0.11
Aliens 0.11
Sphere 0.11
Oblivion 0.11
Heavy Metal 0.10
Thor 0.09
Star Wars: Return of the Jedi 0.09
Moon 0.09
28. Highest Prevalence of Topic #100
Film / Topic Prevalence
Ghost Ship 0.70
Life of Pi 0.17
Hard Rain 0.13
Jaws 2 0.13
Jaws 0.12
Master and Commander 0.11
Titanic 0.11
Deep Rising 0.11
Big Blue, The 0.10
Abyss, The 0.10
Cast Away 0.10
Pirates of the Caribbean 0.10
King Kong 0.09
Pearl Harbor 0.09
Lake Placid 0.09
Friday the 13th: Jason Takes Manhattan 0.09
Mud 0.08
Commando 0.06
G.I. Jane 0.06
Jurassic Park III 0.06
Sphere 0.05
Apocalypse Now 0.05
Blood and Wine 0.04
Leviathan 0.04
I Still Know What You Did Last Summer 0.04
29. Character Names!
Infuriating problem! Character names spoil otherwise good topics.
Character names are some of the most frequent words in screenplays and therefore dominate topics.
But! They have no predictive value, since they are highly unlikely to appear in other screenplays of the same genre.
Need to eliminate them!
Topic #203 (of 256):
0.038*kirk + 0.032*decker + 0.019*bridge + 0.019*spock + 0.015*captain + 0.013*mccoy + 0.012*viewer + 0.012*enterprise + 0.011*ilia + 0.010*now + 0.009*console + 0.009*sir + 0.009*scott + 0.008*crew + 0.007*shuttle + 0.007*vulcan + 0.007*space + 0.007*cloud + 0.006*intercom + 0.006*starfleet + 0.006*ship + 0.006*station + 0.005*sulu + 0.005*warp + 0.005*klingon + 0.005*control + 0.005*toward + 0.005*energy + 0.005*chekov + 0.004*moment + 0.004*main + 0.004*another + 0.004*science + 0.004*vessel + 0.004*chamber + 0.004*power + 0.004*transporter + 0.003*pod + 0.003*uhura + 0.003*alien + 0.003*engineering + 0.003*voice + 0.003*deck + 0.003*continues + 0.003*move + 0.003*ahead + 0.003*camera + 0.003*see + 0.003*one + 0.003*computer
30. Strategies for Removing Character Names
1. Document-Level:
  Identify using formatting information: names tend to appear in all caps in the center of a line
  Remove all tokens in all caps ("STEVE") or title case ("Steve")
2. Corpus-Level:
  Names tend to be very frequent within a single document, but do not recur across documents
  Remove all words that occur in ≤ 3 documents. This also helps to eliminate other noise (e.g., typos, "aaaargh", etc.)
Problems (see the sketch below):
  Franchise films with many sequels (e.g., Star Wars)
  Very common names (John, Sue, David, etc.)
Note: In retrospect, it would have been wiser to extract only verbs and nouns using a POS-tagger…
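Both strategies are a few lines each. A minimal sketch (function names are assumptions); as the slide notes, neither handles franchise sequels or very common names:

```python
import re
from collections import Counter

def strip_name_like_tokens(raw_text):
    """Document-level heuristic (strategy 1): drop ALL-CAPS ('STEVE') and
    Title-Case ('Steve') tokens. Deliberately crude, as the slide implies."""
    return " ".join(t for t in raw_text.split()
                    if not re.fullmatch(r"[A-Z][A-Z'.-]+|[A-Z][a-z]+", t))

def drop_rare_words(tokenized_docs, min_docs=4):
    """Corpus-level filter (strategy 2): remove words occurring in <= 3
    documents; also kills typos and one-off noise like 'aaaargh'."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    keep = {w for w, n in doc_freq.items() if n >= min_docs}
    return [[w for w in doc if w in keep] for doc in tokenized_docs]
```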
31. Example: Screenplay Text (verbatim repeat of slide 5)
39. Program Architecture
Screenplay Class
  Attributes: Title, Text, Genres, Word Freq, Topics
  Methods: getter and setter functions
TextCleaner Class
ScriptDatabase Class
ScriptScraper Class (collects screenplays from websites)
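A skeletal reconstruction of these classes; the attribute and class names follow the slide, while the constructors and method bodies are assumptions.

```python
class Screenplay:
    """Attributes listed on the slide: Title, Text, Genres, Word Freq, Topics."""
    def __init__(self, title, text, genres=None):
        self.title = title
        self.text = text
        self.genres = genres or []
        self.word_freq = {}   # populated after cleaning
        self.topics = {}      # populated after topic modeling

    # Getter and setter functions, as on the slide
    def get_genres(self):
        return list(self.genres)

    def set_topics(self, topic_dist):
        self.topics = dict(topic_dist)

class TextCleaner:
    """Pre-processes Screenplay.text and fills in word_freq."""
    def clean(self, screenplay):
        raise NotImplementedError

class ScriptDatabase:
    """Holds the collection of Screenplay objects."""
    def __init__(self):
        self.scripts = []
    def add(self, screenplay):
        self.scripts.append(screenplay)

class ScriptScraper:
    """Scrapes screenplays from websites into a ScriptDatabase."""
    def __init__(self, database):
        self.database = database
```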