This document compares the performance of Pointwise Mutual Information (PMI) and Latent Semantic Analysis (LSA) on several semantic similarity tasks when the models are trained on corpora of different sizes. It finds that PMI trained on a large Wikipedia corpus outperforms LSA trained on a smaller subset, showing that simpler models can perform better when given more data. It also compares PMI trained on Wikipedia against other publicly available measures of semantic relatedness and finds that PMI performs competitively. The document concludes that simple, scalable models that leverage large amounts of data show promise for modeling semantics.
1. More Data Trumps Smarter Algorithms: Training Computational Models of Semantics on Large Corpora
Gabriel Recchia, Cognitive Science
Michael N. Jones, Psychology
Indiana University, Bloomington
3. Pointwise Mutual Information (PMI) (Church & Hanks, 1990)
Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance):
I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
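As a concrete illustration, here is a minimal Python sketch of this formula, estimating the probabilities from document-level co-occurrence counts over a toy corpus. The corpus and the counting scheme are illustrative assumptions, not the pipeline used in these experiments.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: each "document" is a list of tokens (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the dog".split(),
    "the dog sat on the log".split(),
]

doc_freq = Counter()    # number of documents containing each word
pair_freq = Counter()   # number of documents containing each word pair
for doc in docs:
    types = set(doc)
    doc_freq.update(types)
    for x, y in combinations(sorted(types), 2):
        pair_freq[(x, y)] += 1

n_docs = len(docs)

def pmi(x, y):
    """I(x, y) = log2( P(x, y) / (P(x) P(y)) ), with P estimated from document counts."""
    key = tuple(sorted((x, y)))
    p_xy = pair_freq[key] / n_docs
    p_x = doc_freq[x] / n_docs
    p_y = doc_freq[y] / n_docs
    if p_xy == 0:
        return float("-inf")   # the pair never co-occurs in this corpus
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("cat", "dog"), 3))   # co-occur less often than chance in this toy corpus
print(pmi("cat", "log"))             # never co-occur: -inf
```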
4. Latent Semantic Analysis (LSA) (Landauer & Dumais, 1997)
• Build a term-document matrix where element (i, j) describes the frequency of term i in document j
• Apply a log-entropy weighting scheme to decrease the weight of high-frequency words
• Use singular value decomposition to find an approximation to the term-document matrix with lower rank k
• Optimize k for the task at hand
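The steps above map directly onto a few lines of linear algebra. Below is a minimal NumPy sketch of that pipeline, assuming a toy four-document corpus and k = 2; the actual experiments use much larger corpora and a task-tuned k.

```python
import numpy as np

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# 1. Term-document count matrix: element (i, j) = frequency of term i in document j.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# 2. Log-entropy weighting: terms spread evenly across documents get low weight.
p = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
with np.errstate(divide="ignore", invalid="ignore"):
    entropy_weight = 1 + np.sum(np.where(p > 0, p * np.log(p), 0), axis=1) / np.log(len(docs))
W = np.log(X + 1) * entropy_weight[:, None]

# 3. Rank-k SVD approximation; terms are then compared in the reduced space.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2                                   # 4. k would normally be tuned per task
term_vecs = U[:, :k] * s[:k]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(cos(term_vecs[index["computer"]], term_vecs[index["user"]]))
```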
5. Latent Semantic Analysis (LSA)
• Criticisms:
  • Scalability
  • Incrementality
• Lessons from computational linguistics: simple models that can be trained on more data often outperform complex models that are restricted to less
7. PMI vs. LSA (Budiu et al., 2007)
Task     PMI (TASA)   LSA (TASA)
ESL      .22          .44
TOEFL    .22          .60
RG       .61          .64
R        .61          .71
MC       .65          .75
WS353    .58          .60
8. PMI vs. LSA (Budiu et al., 2007)
Task     PMI (Stanford)   LSA (TASA)
ESL      .52              .44
TOEFL    .51              .60
RG       .75              .64
R        .83              .71
MC       .79              .75
WS353    .71              .60
9. • Budiu et al. (2007) concluded that PMI performs better when given more data
• However, their comparison had a confound: corpus size was confounded with document size and type of text (web documents vs. carefully constructed sentences in textbooks)
12. Experiment 2
PMI trained on a large corpus outperforms LSA trained on a smaller one. How does it compare with other measures of semantic relatedness?
To find out: compare against other publicly available measures from the Rensselaer Measures of Semantic Relatedness website, cwl-projects.cogsci.rpi.edu/msr (Veksler et al., 2008)
17. Discussion
• PMI does not take latent information into account; unmodified, it is not plausible
• However, its success when scaled to data on the order of human experience favors models based on simple co-occurrences (e.g., models based on vector addition)
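To make the "simple co-occurrences / vector addition" idea concrete, here is a minimal sketch in which each word's vector is simply the running sum of one-hot vectors for the words around it, compared by cosine. The toy corpus, the window size of 2, and the similarity measure are illustrative assumptions, not any specific published model.

```python
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 2
vecs = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                # Add the context word's one-hot vector to w's accumulated vector.
                vecs[idx[w], idx[sent[j]]] += 1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(round(cosine(vecs[idx["cat"]], vecs[idx["dog"]]), 3))   # similar contexts
print(round(cosine(vecs[idx["cat"]], vecs[idx["mat"]]), 3))   # less overlap
```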
18. Conclusions
• Simple, scalable, incremental models of semantic similarity show promise
• Suggests complexity should be left in the data rather than added to the model
• Publicly available tool allows non-programmers to retrieve corpus-specific semantic similarity scores