More Data Trumps Smarter Algorithms: Training Computational Models of Semantics on Large Corpora

Gabriel Recchia, Cognitive Science
Michael N. Jones, Psychology
Indiana University, Bloomington

car / automobile      0.975
lad / wizard          0.175
gem / jewel           0.875
glass / magician      0.025
journey / voyage      0.875
asylum / madhouse     0.9

Pointwise Mutual Information (PMI)
Compares the probability of observing word x
and word y together (the joint probability) with
the probabilities of observing x and y
independently (chance)

$$I(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

(Church & Hanks, 1990)
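
The PMI definition can be illustrated with a minimal Python sketch. The slides do not say how P(x), P(y), and P(x,y) are estimated, so this sketch assumes document-level counts: a word's probability is the fraction of documents containing it, and the joint probability is the fraction containing both words. The function names `train_pmi` and `pmi` and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def train_pmi(documents):
    """Count, per word and per word pair, how many documents contain them.

    Assumption: co-occurrence is counted at the document level.
    """
    word_docs = Counter()
    pair_docs = Counter()
    for doc in documents:
        types = sorted(set(doc.lower().split()))
        word_docs.update(types)
        # Sorted types guarantee each pair is stored as (a, b) with a < b.
        pair_docs.update(combinations(types, 2))
    return word_docs, pair_docs, len(documents)


def pmi(x, y, word_docs, pair_docs, n_docs):
    """I(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    if x == y:
        joint = word_docs[x]
    else:
        joint = pair_docs[tuple(sorted((x, y)))]
    if joint == 0:
        # The words never co-occur, so the log ratio is unbounded below.
        return float("-inf")
    p_xy = joint / n_docs
    p_x = word_docs[x] / n_docs
    p_y = word_docs[y] / n_docs
    return math.log2(p_xy / (p_x * p_y))


# Toy corpus, invented for illustration only.
docs = [
    "the car is an automobile",
    "the automobile is a car",
    "the wizard cast a spell",
    "a lad met the wizard",
]
word_docs, pair_docs, n_docs = train_pmi(docs)
print(pmi("car", "automobile", word_docs, pair_docs, n_docs))
print(pmi("the", "car", word_docs, pair_docs, n_docs))
```

Because the model is only a table of counts, new documents can be added by updating the counters, with no retraining step.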

Latent Semantic Analysis (LSA)
(Landauer & Dumais, 1997)
• Build a term-document matrix where element
(i,j) describes the frequency of term i in
document j
• Apply a log-entropy weighting scheme to
decrease the weight of high-frequency words
• Use singular value decomposition to find an
approximation to the term-document matrix with
lower rank k
• Optimize k for the task at hand
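
The weighting step can be sketched in Python. The slides do not spell out the formula, so this assumes the common log-entropy form: a local weight of log(1 + tf) multiplied by a global weight of 1 + Σ_j p_ij log p_ij / log n, where p_ij is the share of term i's occurrences that fall in document j and n is the number of documents. The function name `log_entropy_weight` and the toy matrix are hypothetical.

```python
import math


def log_entropy_weight(counts):
    """Apply log-entropy weighting to a term-document count matrix.

    counts[i][j] is the frequency of term i in document j.
    Assumes at least two documents, so that log(n_docs) is nonzero.
    """
    n_docs = len(counts[0])
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")
    weighted = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # A term that never occurs gets zero weight everywhere.
            weighted.append([0.0] * n_docs)
            continue
        # Sum of p * log(p) over the documents in which the term occurs.
        entropy_sum = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy_sum += p * math.log(p)
        # Global weight: 0 for a term spread evenly over all documents,
        # 1 for a term concentrated in a single document.
        global_weight = 1.0 + entropy_sum / math.log(n_docs)
        weighted.append([global_weight * math.log(1 + tf) for tf in row])
    return weighted


# Toy matrix: term 0 occurs once in every document, term 1 in only one.
toy_counts = [
    [1, 1, 1, 1],
    [4, 0, 0, 0],
]
for row in log_entropy_weight(toy_counts):
    print(row)
```

The rank-k approximation would then be computed from the weighted matrix with a singular value decomposition, for example via `numpy.linalg.svd`, keeping only the k largest singular values.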

Latent Semantic Analysis (LSA)
• Criticisms
  • Scalability
  • Incrementality
• Lessons from computational linguistics:
simple models that can be trained on more data
often outperform complex models that are
restricted to less data

• Forced-choice tests
  • TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)
  • ESL synonymy test (Turney, 2001) (ESL)
• Semantic similarity judgments
  • Rubenstein & Goodenough, 1965 (RG)
  • Miller & Charles, 1991 (MC)
  • Resnik, 1995 (R)
  • Finkelstein et al.’s “WordSimilarity 353,” 2002 (WS353)
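
Both kinds of task can be scored with a few lines of Python. This is a sketch under stated assumptions rather than the authors' procedure. A forced-choice item is assumed to be answered by picking the alternative most similar to the target, with the score being the proportion of items correct. Agreement with human similarity judgments is assumed here to be a Pearson correlation, since the slides do not name the statistic. All function names and the toy similarity scores are hypothetical.

```python
import math


def forced_choice_accuracy(items, similarity):
    """items: list of (target, alternatives, correct_answer) tuples."""
    correct = 0
    for target, alternatives, answer in items:
        # Choose the alternative the model rates as most similar to the target.
        guess = max(alternatives, key=lambda alt: similarity(target, alt))
        correct += guess == answer
    return correct / len(items)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def judgment_correlation(pairs, similarity):
    """pairs: list of (word1, word2, human_rating) tuples."""
    model_scores = [similarity(a, b) for a, b, _ in pairs]
    human_scores = [rating for _, _, rating in pairs]
    return pearson(model_scores, human_scores)


# Toy similarity scores, invented for illustration only.
toy_scores = {
    ("car", "automobile"): 0.9,
    ("car", "wizard"): 0.1,
    ("gem", "jewel"): 0.2,
    ("gem", "glass"): 0.6,
}


def toy_similarity(a, b):
    return toy_scores.get((a, b), 0.0)


items = [
    ("car", ["automobile", "wizard"], "automobile"),
    ("gem", ["jewel", "glass"], "jewel"),
]
print(forced_choice_accuracy(items, toy_similarity))
```

Any similarity function with the signature `similarity(word1, word2)` can be passed in, so the same harness could score PMI, LSA, or another measure.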

PMI vs. LSA (Budiu et al., 2007)
Task     PMI (TASA)   LSA (TASA)
ESL      .22          .44
TOEFL    .22          .60
RG       .61          .64
R        .61          .71
MC       .65          .75
WS353    .58          .60

PMI vs. LSA (Budiu et al., 2007)
Task     PMI (Stanford)   LSA (TASA)
ESL      .52              .44
TOEFL    .51              .60
RG       .75              .64
R        .83              .71
MC       .79              .75
WS353    .71              .60

• Budiu et al. (2007) concluded that PMI
performs better when given more data
• However, their comparison was confounded:
corpus size varied together with document size
and type of text (web documents vs. carefully
constructed sentences in textbooks)

PMI vs. LSA
Experiment 1
Task     PMI (Wiki subset)   LSA (Wiki subset)
ESL      .35                 .36
TOEFL    .41                 .44
RG       .47                 .62
R        .46                 .60
MC       .46                 .46
WS353    .54                 .57

PMI vs. LSA
Experiment 1
Task     PMI (full Wikipedia)   LSA (Wiki subset)
ESL      .62                    .36
TOEFL    .64                    .44
RG       .78                    .62
R        .86                    .60
MC       .76                    .46
WS353    .73                    .57

Experiment 2
PMI trained on lots of data outperforms
LSA trained on less. How does it
compare with other measures of
semantic relatedness?
To find out: compare with other publicly
available measures at the Rensselaer
Measures of Semantic Relatedness
website, cwl-projects.cogsci.rpi.edu/msr
(Veksler et al., 2008)

Experiment 2
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Spreading Activation (Anderson & Pirolli, 1984; Farahat,
Pirolli, & Markova, 2004)
• Normalized Search Similarity (Cilibrasi & Vitányi,
2007; Veksler et al., 2008)
• WordNet::Similarity vector measure
(Pedersen et al., 2004)

Experiment 2
Task     PMI (full Wikipedia)   WN    NSS.F   NSS.T   SA.N   SA.W   LSA.T
ESL      .62                    .70   .44     .56     .39    .51    .44
TOEFL    .64                    .87   .59     .50     .61    .59    .55
RG       .78                    .88   .62     .53     .49    .39    .69
R        .86                    .90   .56     .54     .49    .52    .74
MC       .76                    .77   .61     .56     .45    .45    .61
WS353    .73                    .46   .60     .59     .40    .38    .60
• WN: WordNet::Similarity vector measure
• NSS.F: Normalized Search Similarity, using Factiva business news corpus
• NSS.T: Normalized Search Similarity, using TASA corpus
• SA.N: Spreading Activation, using Google counts restricted to nytimes.com
• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org
• LSA.T: LSA, using TASA corpus

Discussion
Released a tool for calculating PMI
scores:
http://www.indiana.edu/~clcl/lmoss/

Discussion
• PMI does not take latent information
into account; unmodified, it is not
plausible
• However, its success when scaled to
data on the order of human experience
favors models based on simple
co-occurrences (e.g., models based on
vector addition)

Conclusions
• Simple, scalable, incremental models of
semantic similarity show promise
• This suggests complexity should be left in
the data rather than added to the model
• A publicly available tool allows
non-programmers to retrieve corpus-specific
semantic similarity scores
