More Data Trumps Smarter Algorithms:
Training Computational Models of
Semantics on Large Corpora
Gabriel Recchia, Cognitive Science
Michael N. Jones, Psychology
Indiana University, Bloomington
car / automobile      0.975
lad / wizard          0.175
gem / jewel           0.875
glass / magician      0.025
journey / voyage      0.875
asylum / madhouse     0.9
Pointwise Mutual Information (PMI)
Compares the probability of observing word x
and word y together (the joint probability) with
the probabilities of observing x and y
independently (chance)
$$I(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$
(Church & Hanks, 1990)
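As a minimal sketch of the PMI formula, the code below estimates the probabilities from document counts. It assumes that "observing x and y together" means both words appear in the same document; the slides do not specify the co-occurrence unit, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def train_counts(documents):
    """Count how many documents contain each word and each unordered word pair."""
    word_docs = Counter()
    pair_docs = Counter()
    for doc in documents:
        vocab = sorted(set(doc.lower().split()))
        word_docs.update(vocab)
        # vocab is sorted, so each pair is stored as (smaller, larger)
        pair_docs.update(combinations(vocab, 2))
    return word_docs, pair_docs, len(documents)


def pmi(x, y, word_docs, pair_docs, n_docs):
    """I(x,y) = log2( P(x,y) / (P(x) P(y)) ), estimated from document counts."""
    joint = word_docs[x] if x == y else pair_docs[tuple(sorted((x, y)))]
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    p_xy = joint / n_docs
    p_x = word_docs[x] / n_docs
    p_y = word_docs[y] / n_docs
    return math.log2(p_xy / (p_x * p_y))


# Toy corpus (made up for illustration only)
docs = [
    "the car is an automobile",
    "a car needs fuel",
    "the wizard cast a spell",
    "an automobile or car has wheels",
]
word_docs, pair_docs, n_docs = train_counts(docs)
related = pmi("car", "automobile", word_docs, pair_docs, n_docs)
unrelated = pmi("car", "wizard", word_docs, pair_docs, n_docs)
```

Because training only increments counters, new documents can be added at any time without recomputing anything already stored.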
Latent Semantic Analysis (LSA)
• Build a term-document matrix where element
(i,j) describes the frequency of term i in
document j
• Apply log-entropy weighting scheme to
decrease the weight of high-frequency words
• Use singular value decomposition to find an
approximation to the term-document matrix with
lower rank k
• Optimize k for the task at hand
(Landauer & Dumais, 1997)
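The LSA steps above can be sketched with NumPy as follows. This is an illustrative outline, not the authors' implementation: it assumes the common log-entropy form (local weight log(1 + tf), global weight 1 + Σ_j p_ij log p_ij / log n), scales term vectors by the singular values, and uses a hypothetical toy matrix. It also assumes at least two documents.

```python
import numpy as np


def log_entropy_weight(counts):
    """Apply log-entropy weighting to a terms x documents count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]  # must be > 1, otherwise log(n_docs) is 0
    row_totals = counts.sum(axis=1, keepdims=True)
    # p[i, j]: share of term i's occurrences that fall in document j
    p = np.divide(counts, row_totals, out=np.zeros_like(counts),
                  where=row_totals > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    # Terms spread evenly over all documents get a global weight near 0
    global_weight = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_weight[:, None]


def lsa_term_vectors(counts, k):
    """Return rank-k term vectors from the SVD of the weighted matrix."""
    weighted = log_entropy_weight(counts)
    u, s, _vt = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy 4-term x 4-document count matrix (made up for illustration only)
counts = [
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [1, 1, 1, 1],
]
vectors = lsa_term_vectors(counts, k=2)
```

Unlike the count-based PMI estimate, adding a document here changes the matrix and requires recomputing the decomposition, which is one way to read the scalability and incrementality concerns raised about LSA.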
Latent Semantic Analysis (LSA)
• Criticisms
  • Scalability
  • Incrementality
• Lessons from computational linguistics:
simple models that can be trained on more data
often outperform complex models that are
restricted to less data
• Forced-choice tests
• TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)
• ESL synonymy test (Turney, 2001) (ESL)
• Semantic similarity judgments
• Rubenstein & Goodenough, 1965 (RG)
• Miller & Charles, 1991 (MC)
• Resnik, 1995 (R)
• Finkelstein et al.’s “WordSimilarity 353,” 2002 (WS353)
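A rough sketch of how the two kinds of task might be scored: forced-choice tests by the proportion of items where the model's most similar candidate is the correct answer, and similarity judgments by correlating model scores with human ratings. The slides do not state which correlation coefficient was used, so Pearson's r here is an assumption, and the toy items and scores are invented for illustration.

```python
import math


def forced_choice_accuracy(items, similarity):
    """items: list of (stem, candidates, correct_answer) tuples."""
    correct = 0
    for stem, candidates, answer in items:
        # The model's guess is the candidate most similar to the stem
        guess = max(candidates, key=lambda c: similarity(stem, c))
        correct += guess == answer
    return correct / len(items)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical model scores, for illustration only
toy_scores = {
    ("car", "automobile"): 0.9, ("car", "wizard"): 0.1, ("car", "glass"): 0.2,
    ("gem", "jewel"): 0.8, ("gem", "lad"): 0.3, ("gem", "voyage"): 0.85,
}
items = [
    ("car", ["automobile", "wizard", "glass"], "automobile"),
    ("gem", ["jewel", "lad", "voyage"], "jewel"),
]
accuracy = forced_choice_accuracy(items, lambda a, b: toy_scores[(a, b)])
```

Any similarity function with the signature `similarity(word_a, word_b)` can be plugged in, so the same harness would apply to PMI, LSA, or another measure.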
PMI vs. LSA (Budiu et al., 2007)
Task     PMI (TASA)   LSA (TASA)
ESL      .22          .44
TOEFL    .22          .60
RG       .61          .64
R        .61          .71
MC       .65          .75
WS353    .58          .60
PMI vs. LSA (Budiu et al., 2007)
Task     PMI (Stanford)   LSA (TASA)
ESL      .52              .44
TOEFL    .51              .60
RG       .75              .64
R        .83              .71
MC       .79              .75
WS353    .71              .60
• Budiu et al. (2007) concluded that PMI
performs better when given more data
• However, their comparison had a confound:
corpus size varied together with document size
and type of text (web documents vs. carefully
constructed sentences in textbooks)
PMI vs. LSA
Experiment 1
Task     PMI (Wiki subset)   LSA (Wiki subset)
ESL      .35                 .36
TOEFL    .41                 .44
RG       .47                 .62
R        .46                 .60
MC       .46                 .46
WS353    .54                 .57
PMI vs. LSA
Experiment 1
Task     PMI (full Wikipedia)   LSA (Wiki subset)
ESL      .62                    .36
TOEFL    .64                    .44
RG       .78                    .62
R        .86                    .60
MC       .76                    .46
WS353    .73                    .57
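To make the size of the effect concrete, the short calculation below takes the Experiment 1 scores exactly as reported in the two tables and computes, per task, how much PMI on full Wikipedia gains over LSA on the Wikipedia subset. The variable names are arbitrary; no values beyond those in the tables are used.

```python
# Scores copied from the Experiment 1 tables
pmi_subset = {"ESL": .35, "TOEFL": .41, "RG": .47, "R": .46, "MC": .46, "WS353": .54}
lsa_subset = {"ESL": .36, "TOEFL": .44, "RG": .62, "R": .60, "MC": .46, "WS353": .57}
pmi_full = {"ESL": .62, "TOEFL": .64, "RG": .78, "R": .86, "MC": .76, "WS353": .73}

# Gain of PMI (full Wikipedia) over LSA (Wiki subset), per task
gains = {task: round(pmi_full[task] - lsa_subset[task], 2) for task in pmi_full}
mean_gain = sum(gains.values()) / len(gains)

# On the same subset, PMI never exceeds LSA
pmi_beats_lsa_on_subset = [t for t in pmi_subset if pmi_subset[t] > lsa_subset[t]]

for task, gain in gains.items():
    print(f"{task:6s} +{gain:.2f}")
print(f"mean   +{mean_gain:.2f}")
```

On equal data LSA matches or beats PMI on every task, while PMI trained on the full corpus is ahead on all six.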
Experiment 2
PMI trained on lots of data outperforms
LSA trained on less. How does it
compare with other measures of
semantic relatedness?
To find out: compare with other publicly
available measures at the Rensselaer
Measures of Semantic Relatedness
website, cwl-projects.cogsci.rpi.edu/msr
(Veksler et al., 2008)
Experiment 2
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Spreading Activation (Anderson & Pirolli, 1984; Farahat,
Pirolli, & Markova, 2004)
• Normalized Search Similarity (Cilibrasi and Vitányi,
2007; Veksler et al., 2008)
• WordNet::Similarity vector measure
(Pedersen et al., 2004)
Experiment 2
Task     PMI (full Wikipedia)   WN    NSS.F   NSS.T   SA.N   SA.W   LSA.T
ESL      .62                    .70   .44     .56     .39    .51    .44
TOEFL    .64                    .87   .59     .50     .61    .59    .55
RG       .78                    .88   .62     .53     .49    .39    .69
R        .86                    .90   .56     .54     .49    .52    .74
MC       .76                    .77   .61     .56     .45    .45    .61
WS353    .73                    .46   .60     .59     .40    .38    .60
• WN: WordNet::Similarity vector measure
• NSS.F: Normalized Search Similarity, using Factiva business news corpus
• NSS.T: Normalized Search Similarity, using TASA corpus
• SA.N: Spreading Activation, using Google counts restricted to nytimes.com
• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org
• LSA.T: LSA, using TASA corpus
Discussion
Released a tool for calculating PMI
scores:
http://www.indiana.edu/~clcl/lmoss/
Discussion
• PMI does not take latent information
into account; unmodified, it is not
plausible
• However, its success when scaled to
data on the order of human experience
favors models based on simple
co-occurrences (e.g., models based on
vector addition)
Conclusions
• Simple, scalable, incremental models of
semantic similarity show promise
• This suggests that complexity should be left in
the data rather than added to the model
• A publicly available tool allows
non-programmers to retrieve corpus-specific
semantic similarity scores
