5. Why Shakespeare?
Polonius: What do you read, my lord?
Hamlet: Words, words, words.
P: What is the matter, my lord?
H: Between who?
P: I mean, the matter that you read, my lord.
--II.2.184
9. Challenges
• Language, especially English, is messy
• Texts are usually unstructured
• Pronunciation is not standard
• Reading is pretty hard!
20. First steps with natural language processing (NLP)
What are Shakespeare’s most interesting rhymes?
21. Shakespeare’s Sonnets
• A sonnet is a 14-line poem
• A sonnet can follow many different rhyme schemes; Shakespeare was fairly unusual in sticking to one
• This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis
22. Sonnet 18
a  Shall I compare thee to a summer’s day?
b  Thou art more lovely and more temperate:
a  Rough winds do shake the darling buds of May,
b  And summer’s lease hath all too short a date;
c  Sometime too hot the eye of heaven shines,
d  And often is his gold complexion dimm'd;
c  And every fair from fair sometime declines,
d  By chance or nature’s changing course untrimm'd;
e  But thy eternal summer shall not fade,
f  Nor lose possession of that fair thou ow’st;
e  Nor shall death brag thou wander’st in his shade,
f  When in eternal lines to time thou grow’st:
g  So long as men can breathe or eyes can see,
g  So long lives this, and this gives life to thee.
http://www.poetryfoundation.org/poem/174354
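A minimal sketch of what hard-coding the rhyme scheme could look like, using Sonnet 18 as input. The names SCHEME, last_word and rhyme_pairs are hypothetical and are not taken from the original analysis; the sketch simply groups the last word of each line by its rhyme letter.

```python
import string

# Assumption: the scheme is written as one letter per line of the sonnet.
SCHEME = "ababcdcdefefgg"

SONNET_18 = """
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow'st;
Nor shall death brag thou wander'st in his shade,
When in eternal lines to time thou grow'st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
"""


def last_word(line):
    """Return the final word of a line, lowercased, without edge punctuation."""
    return line.split()[-1].strip(string.punctuation).lower()


def rhyme_pairs(sonnet, scheme=SCHEME):
    """Group line-ending words by rhyme letter, in order of first appearance."""
    lines = [line for line in sonnet.splitlines() if line.strip()]
    if len(lines) != len(scheme):
        raise ValueError("expected %d lines, got %d" % (len(scheme), len(lines)))
    groups = {}
    for letter, line in zip(scheme, lines):
        groups.setdefault(letter, []).append(last_word(line))
    return [tuple(words) for words in groups.values()]


pairs = rhyme_pairs(SONNET_18)
print(pairs[0])  # ('day', 'may')
print(pairs[-1])  # ('see', 'thee')
```

Because the scheme is fixed, no pronunciation data is needed to find the rhyming words; the position of a line in the sonnet is enough.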
23. Rhyme Distribution
• Most common rhymes
  • nltk.FreqDist (Frequency Distribution)
• Given a word, what is the frequency distribution of the words that rhyme with it?
  • nltk.ConditionalFreqDist (Conditional Frequency Distribution)
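To illustrate the two kinds of distribution, here is a small sketch that uses only the standard library as a stand-in, so it runs without NLTK installed: collections.Counter plays the role of a frequency distribution, and a defaultdict of Counters plays the role of a conditional one. The rhyme pairs are the ones from Sonnet 18; the variable names are assumptions. With NLTK, nltk.FreqDist and nltk.ConditionalFreqDist offer a similar counting interface.

```python
from collections import Counter, defaultdict

# Rhyme pairs from Sonnet 18; a real analysis would collect them from every sonnet.
pairs = [
    ("day", "may"),
    ("temperate", "date"),
    ("shines", "declines"),
    ("dimm'd", "untrimm'd"),
    ("fade", "shade"),
    ("ow'st", "grow'st"),
    ("see", "thee"),
]

# Frequency distribution: how often does each rhyme pair occur?
# Sorting each pair makes ("day", "may") and ("may", "day") count as the same rhyme.
pair_freq = Counter(tuple(sorted(pair)) for pair in pairs)

# Conditional frequency distribution: given a word (the condition),
# how often does each other word rhyme with it?
rhymes_with = defaultdict(Counter)
for first, second in pairs:
    rhymes_with[first][second] += 1
    rhymes_with[second][first] += 1

print(pair_freq.most_common(1))
print(rhymes_with["day"].most_common())  # [('may', 1)]
```

With a single sonnet every count is 1; the distributions only become interesting once pairs from all the sonnets are counted together.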
29. Our Classifier
Can we write code to tell whether a given speech is from a tragedy or a comedy?
30. Classifiers: overview
• Requires labeled text
  • (in this case, speeches labeled by genre)
  • [(<speech>, <genre>), ...]
• Requires “training”
• Predicts labels of text
32. Vectorizers (or Feature Extractors)
• A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.
• In theory, these features could be anything, e.g.:
  • How many capital letters does the text contain?
  • Does the text end with an exclamation point?
• In practice, a common model is “Bag of Words”.
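As a sketch of the two example features named on this slide, a hand-written feature extractor might look like the following. The function name extract_features and the dictionary keys are hypothetical, chosen only for illustration.

```python
def extract_features(text):
    """Turn a text into a dict of simple, hand-picked features."""
    return {
        # How many capital letters does the text contain?
        "capital_letters": sum(1 for ch in text if ch.isupper()),
        # Does the text end with an exclamation point (ignoring trailing whitespace)?
        "ends_with_exclamation": text.rstrip().endswith("!"),
    }


features = extract_features("Hello, Will!")
print(features)  # {'capital_letters': 2, 'ends_with_exclamation': True}
```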
33. Bag of Words
Bag of Words is a kind of feature extraction where:
• The set of features is the set of all words in the texts you’re analyzing
• A single text is represented by how many times each word appears in it
36. Bag of Words: Simple Example
Two texts:
• “Hello, Will!”
• “Hello, Globe!”
Bag: [“Hello”, “Will”, “Globe”]
                  “Hello”   “Will”   “Globe”
“Hello, Will”        1         1         0
“Hello, Globe”       1         0         1
37. Bag of Words: Simple Example
“Hello, Will” → “A text that contains one instance of the word ‘Hello’, contains one instance of the word ‘Will’, and does not contain the word ‘Globe’.”
(Less readable for us, more readable for computers!)
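The simple example can be reproduced with a few lines of code. This is a sketch under assumptions: the helper names tokenize, build_bag and vectorize are hypothetical, words are kept case-sensitive to match the slide, and punctuation is simply dropped.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into words, dropping punctuation but keeping case."""
    return re.findall(r"[A-Za-z']+", text)


def build_bag(texts):
    """Collect every distinct word across all texts, in order of first appearance."""
    bag = []
    for text in texts:
        for word in tokenize(text):
            if word not in bag:
                bag.append(word)
    return bag


def vectorize(text, bag):
    """Represent one text as the count of each word in the bag."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in bag]


texts = ["Hello, Will!", "Hello, Globe!"]
bag = build_bag(texts)
vectors = [vectorize(text, bag) for text in texts]

print(bag)      # ['Hello', 'Will', 'Globe']
print(vectors)  # [[1, 1, 0], [1, 0, 1]]
```

Note that the bag is built from all texts first; each text is then counted against that shared list, which is why both vectors have the same length.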
39. Why are these called “Vectorizers”?
text_1 = "words, words, words"
text_2 = "words, words, birds"
[Plot: text_1 and text_2 drawn as vectors; axes: # times “words” is used, # times “birds” is used]
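To make the picture concrete, the two texts can be turned into coordinates by counting each word. This is a sketch; the helper count_vector and the fixed two-word vocabulary are assumptions made for the illustration.

```python
import math
import re
from collections import Counter


def count_vector(text, vocabulary):
    """Return the count of each vocabulary word in the text, as a tuple."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return tuple(counts[word] for word in vocabulary)


# One axis per word: (# times "words" is used, # times "birds" is used)
vocabulary = ("words", "birds")

text_1 = "words, words, words"
text_2 = "words, words, birds"

vector_1 = count_vector(text_1, vocabulary)
vector_2 = count_vector(text_2, vocabulary)

print(vector_1)  # (3, 0)
print(vector_2)  # (2, 1)

# Once texts are vectors, ordinary geometry applies, e.g. Euclidean distance.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(vector_1, vector_2)))
```

Each text becomes a point (or arrow from the origin) in a space with one dimension per word, which is where the name "vectorizer" comes from.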
42. Classification: Steps
1) Split pre-labeled text into training and testing sets
2) Vectorize text (extract features)
3) Train classifier
4) Test classifier
Text → Features → Labels
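The four steps can be sketched end to end in plain Python. Everything here is an assumption for illustration: the source does not say which classifier was used, so a small multinomial Naive Bayes with add-one smoothing stands in, and the "speeches" are invented toy word lists, not real Shakespeare.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Step 2: Bag of Words, mapping a text to its word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def train(labeled_texts):
    """Step 3: collect word counts per label and the number of texts per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        word_counts[label].update(vectorize(text))
        label_counts[label] += 1
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Pick the label with the highest Naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_texts)
        total_words = sum(word_counts[label].values())
        for word, n in vectorize(text).items():
            # Add-one smoothing so unseen words do not zero out the score.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += n * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data in the [(<speech>, <genre>), ...] shape.
labeled = [
    ("death blood revenge grave", "tragedy"),
    ("blood murder death tomb", "tragedy"),
    ("love wedding jest fool", "comedy"),
    ("fool love dance jest", "comedy"),
    ("grave revenge murder", "tragedy"),
    ("wedding dance love", "comedy"),
]

# Step 1: split into training and testing sets.
training, testing = labeled[:4], labeled[4:]

model = train(training)

# Step 4: test the classifier on texts it has not seen.
correct = sum(predict(text, model) == label for text, label in testing)
accuracy = correct / len(testing)
print(accuracy)  # 1.0
```

A perfect score on six invented examples says nothing about real speeches; the point is only the shape of the pipeline: Text → Features → Labels.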
46. Classifier Testing
test_speech = test_speeches[0]
print(test_speech)
Farewell, Andronicus, my noble father,
The woefull'st man that ever liv'd in Rome.
Farewell, proud Rome, till Lucius come again;
He loves his pledges dearer than his life.
...
(From Titus Andronicus, III.1.288-300)
49. Critiques
• "Bag of Words" assumes a correlation between word use and label. This correlation is stronger in some cases than in others.
• Beware of highly disproportionate training data (for example, far more speeches from one genre than from the other).
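One way to see why disproportionate training data is a problem: a "classifier" that ignores the text and always predicts the most common label can still look accurate. The label counts below are hypothetical and chosen only to make the arithmetic obvious.

```python
from collections import Counter

# Hypothetical, deliberately lopsided label counts (not taken from the real corpus).
labels = ["tragedy"] * 90 + ["comedy"] * 10

# A baseline that always guesses the majority label.
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_label, baseline_accuracy)  # tragedy 0.9
```

Any reported accuracy is worth comparing against this kind of majority baseline before concluding that the classifier has learned something about word use.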