More Data Trumps Smarter Algorithms:
Training Computational Models of
Semantics on Large Corpora
Gabriel Recchia, Cognitive Science
Michael N. Jones, Psychology
Indiana University, Bloomington
car / automobile 0.975
lad / wizard 0.175
gem / jewel 0.875
glass / magician 0.025
journey / voyage 0.875
asylum / madhouse 0.9
Pointwise Mutual Information (PMI)
Compares the probability of observing word x
and word y together (the joint probability) with
the probabilities of observing x and y
independently (chance):
$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
(Church & Hanks, 1990)
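The PMI definition can be illustrated with a minimal sketch. It assumes probabilities are estimated from document-level co-occurrence: P(x) is the fraction of documents containing x, and P(x,y) is the fraction containing both. The slides do not state the estimation procedure, so this choice, the function name `pmi_table`, and the toy corpus are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return a function pmi(x, y) estimated from document-level co-occurrence.

    Assumption: a word "occurs" in a document if it appears at least once,
    and probabilities are document frequencies divided by the corpus size.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        vocab = sorted(set(doc.lower().split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    def pmi(x, y):
        joint = pair_df[tuple(sorted((x, y)))] / n_docs
        if joint == 0:
            return float("-inf")  # the pair never co-occurs in this corpus
        p_x = word_df[x] / n_docs
        p_y = word_df[y] / n_docs
        return math.log2(joint / (p_x * p_y))

    return pmi


# Toy corpus (hypothetical, for illustration only).
docs = [
    "the car is an automobile",
    "a car and an automobile",
    "the wizard met a lad",
    "a gem is a jewel",
]
pmi = pmi_table(docs)
print(pmi("car", "automobile"))  # 1.0: the pair co-occurs twice as often as chance
print(pmi("car", "wizard"))      # -inf: the pair never co-occurs
```

Because the model state is only a pair of counters, new documents can be folded in by updating the counts, with no global recomputation.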
Latent Semantic Analysis (LSA)
(Landauer & Dumais, 1997)
• Build a term-document matrix where element (i,j) describes the frequency of term i in document j
• Apply a log-entropy weighting scheme to decrease the weight of high-frequency words
• Use singular value decomposition to find an approximation to the term-document matrix with lower rank k
• Optimize k for the task at hand
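The LSA steps above can be sketched with NumPy. This is a rough illustration, not the authors' pipeline: it assumes the common log-entropy formulation (local weight log(tf + 1), global weight 1 + Σ_j p_ij log p_ij / log n), a corpus of more than one document, and a tiny hand-made count matrix. The function names and the choice k = 2 are hypothetical.

```python
import numpy as np


def log_entropy(counts):
    """Apply log-entropy weighting to a term-document count matrix (terms x docs)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]  # assumed to be greater than 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # p[i, j]: share of term i's occurrences that fall in document j
    p = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    # Terms spread evenly over documents get a global weight near 0
    global_weight = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log(counts + 1.0) * global_weight[:, None]


def lsa_term_vectors(counts, k):
    """Return k-dimensional term vectors from a truncated SVD of the weighted matrix."""
    weighted = log_entropy(counts)
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy counts (hypothetical). Rows: car, automobile, wizard, lad. Columns: 4 documents.
counts = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]
vectors = lsa_term_vectors(counts, k=2)
print(round(cosine(vectors[0], vectors[1]), 3))  # car vs. automobile
print(round(cosine(vectors[0], vectors[2]), 3))  # car vs. wizard
```

Unlike the counts behind PMI, the decomposition has to be recomputed (or approximately updated) when documents are added, which is why the full SVD becomes costly on large corpora.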
Latent Semantic Analysis (LSA)
• Criticisms
  • Scalability
  • Incrementality
• Lessons from computational linguistics: simple models that can be trained on more data often outperform complex models that are restricted to less
• Forced-choice tests
  • TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)
  • ESL synonymy test (Turney, 2001) (ESL)
• Semantic similarity judgments
  • Rubenstein & Goodenough, 1965 (RG)
  • Miller & Charles, 1991 (MC)
  • Resnik, 1995 (R)
  • Finkelstein et al.’s “WordSimilarity 353,” 2002 (WS353)
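A small sketch of how a similarity measure might be scored on the two kinds of task. It assumes forced-choice tests are scored as the proportion of items where the highest-scoring candidate is the keyed answer, and that similarity judgments are scored by correlating model scores with human ratings. Pearson correlation is used here as an assumption, since the slides do not name the statistic. All items, ratings, and scores below are invented toy data.

```python
import math


def forced_choice_accuracy(items, similarity):
    """items: list of (target, candidates, correct_answer) tuples."""
    correct = 0
    for target, candidates, answer in items:
        # Pick the candidate the model rates as most similar to the target
        guess = max(candidates, key=lambda c: similarity(target, c))
        correct += guess == answer
    return correct / len(items)


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def judgment_correlation(rated_pairs, similarity):
    """rated_pairs: list of (word_a, word_b, human_rating) tuples."""
    human = [rating for _, _, rating in rated_pairs]
    model = [similarity(a, b) for a, b, _ in rated_pairs]
    return pearson(human, model)


# Hypothetical model scores standing in for PMI or LSA output.
toy_scores = {
    ("car", "automobile"): 0.9, ("car", "wizard"): 0.1, ("car", "journey"): 0.4,
    ("gem", "jewel"): 0.8, ("gem", "glass"): 0.3, ("gem", "lad"): 0.05,
}


def similarity(a, b):
    return toy_scores.get((a, b), toy_scores.get((b, a), 0.0))


items = [
    ("car", ["wizard", "automobile", "journey"], "automobile"),
    ("gem", ["glass", "lad", "jewel"], "jewel"),
]
rated = [("car", "automobile", 4.0), ("gem", "jewel", 3.5), ("car", "wizard", 0.5)]

print(forced_choice_accuracy(items, similarity))  # 1.0 on this toy set
print(judgment_correlation(rated, similarity))
```

Passing the measure in as a plain function keeps the scoring code independent of whether the scores come from PMI, LSA, or another measure.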
PMI vs. LSA (Budiu et al., 2007)
Task PMI (TASA) LSA (TASA)
ESL .22 .44
TOEFL .22 .60
RG .61 .64
R .61 .71
MC .65 .75
WS353 .58 .60
PMI vs. LSA (Budiu et al., 2007)
Task PMI (Stanford) LSA (TASA)
ESL .52 .44
TOEFL .51 .60
RG .75 .64
R .83 .71
MC .79 .75
WS353 .71 .60
• Budiu et al. (2007) concluded that PMI performs better when given more data
• However, their comparison had a confound: corpus size varied together with document size and type of text (web documents vs. carefully constructed sentences in textbooks)
PMI vs. LSA
Experiment 1
Task PMI (Wiki subset) LSA (Wiki subset)
ESL .35 .36
TOEFL .41 .44
RG .47 .62
R .46 .60
MC .46 .46
WS353 .54 .57
PMI vs. LSA
Experiment 1
Task PMI (full Wikipedia) LSA (Wiki subset)
ESL .62 .36
TOEFL .64 .44
RG .78 .62
R .86 .60
MC .76 .46
WS353 .73 .57
Experiment 2
PMI trained on lots of data outperforms
LSA trained on less. How does it
compare with other measures of
semantic relatedness?
To find out: compare with other publicly
available measures at the Rensselaer
Measures of Semantic Relatedness
website, cwl-projects.cogsci.rpi.edu/msr
(Veksler et al., 2008)
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Spreading Activation (Anderson & Pirolli, 1984; Farahat, Pirolli, & Markova, 2004)
• Normalized Search Similarity (Cilibrasi and Vitányi, 2007; Veksler et al., 2008)
• WordNet::Similarity vector measure (Pedersen et al., 2004)
Experiment 2
Task PMI (full Wikipedia) WN NSS.F NSS.T SA.N SA.W LSA.T
ESL .62 .70 .44 .56 .39 .51 .44
TOEFL .64 .87 .59 .50 .61 .59 .55
RG .78 .88 .62 .53 .49 .39 .69
R .86 .90 .56 .54 .49 .52 .74
MC .76 .77 .61 .56 .45 .45 .61
WS353 .73 .46 .60 .59 .40 .38 .60
• WN: WordNet::Similarity vector measure
• NSS.F: Normalized Search Similarity, using Factiva business news corpus
• NSS.T: Normalized Search Similarity, using TASA corpus
• SA.N: Spreading Activation, using Google counts restricted to nytimes.com
• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org
• LSA.T: LSA, using TASA corpus
Discussion
Released a tool for calculating PMI
scores:
http://www.indiana.edu/~clcl/lmoss/
Discussion
• PMI does not take latent information into account; unmodified, it is not plausible
• However, its success when scaled to data on the order of human experience favors models based on simple co-occurrences (e.g., models based on vector addition)
Conclusions
• Simple, scalable, incremental models of semantic similarity show promise
• This suggests that complexity should be left in the data rather than added to the model
• A publicly available tool allows non-programmers to retrieve corpus-specific semantic similarity scores
