KNOWLEDGE EXTRACTION FROM WIKIPEDIA
Ofer Egozi
Doug Lenat
“Intelligence is 10 million rules…”
Cyc, 1984
(#$genls #$Tree-ThePlant #$Plant)
(#$implies
  (#$and
    (#$isa ?OBJ ?SUBSET)
    (#$genls ?SUBSET ?SUPERSET))
  (#$isa ?OBJ ?SUPERSET))
…an oak is a plant
Predicted to complete in 10 years.
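The two assertions above can be read as a tiny inference step: `genls` states that one collection is contained in another, and the `implies` rule lifts membership (`isa`) up that hierarchy. A minimal Python sketch of that reading follows. The `Oak` and `MyOak` terms and the `infer_isa` helper are hypothetical, added only to reproduce the "an oak is a plant" conclusion; this is not how Cyc itself is implemented.

```python
def infer_isa(isa_facts, genls_facts):
    """Apply (isa ?OBJ ?SUBSET) and (genls ?SUBSET ?SUPERSET) => (isa ?OBJ ?SUPERSET)
    repeatedly until no new isa facts can be derived."""
    known = set(isa_facts)
    changed = True
    while changed:
        changed = False
        for obj, subset in list(known):
            for sub, superset in genls_facts:
                if sub == subset and (obj, superset) not in known:
                    known.add((obj, superset))
                    changed = True
    return known

# The first genls pair comes from the slide; the second pair and the isa
# fact are hypothetical additions for illustration.
genls = {("Tree-ThePlant", "Plant"), ("Oak", "Tree-ThePlant")}
isa = {("MyOak", "Oak")}

derived = infer_isa(isa, genls)
print(("MyOak", "Plant") in derived)
```

Every derived fact still depends on someone having typed in the `genls` and `isa` assertions by hand, which is the acquisition cost the following slides turn to.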
Cyc Today
Can make impressive inferences, such as:
• You have to be awake to eat
• You cannot remember events that have not happened yet
• If you cut a lump of peanut butter in half, each half is also a lump of peanut butter; if you cut a table in half, neither half is a table
• When people die, they stay dead
But after 30 years and 700 man-years, only 2M+ rules…
What went wrong?
Knowledge Acquisition
Machine Translation
Rule-Based Machine Translation (1970s):
• Dictionary for both languages
• Rules representing language structure
• Parsing sentences to find structure
• Mapping between structures
Built by human experts, accumulating rules over time.
Rules end up conflicting and ambiguous
ילד אוכל תפוח
Object-verb-subject
Boy eats apple
Subject-verb-object
Machine Translation
Statistical Translation (1990s):
• Massive bilingual corpora
• Corpus alignment
• Calculate the probability that a word in the first language matches a word in the second language
• Use n-grams to build models that take context into account
Franz Och
Built by data scientists, no linguists needed
Improves as more data gets added
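One classic way to estimate such word-matching probabilities from aligned sentence pairs is IBM Model 1, trained with expectation-maximization. The slide does not name a specific model, so the sketch below is an assumption about what "calculate probability" could look like, run on a hypothetical three-sentence German-English toy corpus. It omits the NULL word and the n-gram language model.

```python
from collections import defaultdict

def train_word_translation(pairs, iterations=10):
    """Estimate t[(e, f)] = P(target word e | source word f) with IBM Model 1 EM."""
    target_vocab = {e for _, e_sent in pairs for e in e_sent}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform guess
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word
        for f_sent, e_sent in pairs:
            for e in e_sent:
                # E-step: split one unit of credit for e among the source words.
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

def best_translation(t, f):
    """Return the target word with the highest probability for source word f."""
    candidates = {e: p for (e, src), p in t.items() if src == f}
    return max(candidates, key=candidates.get)

# Hypothetical toy corpus: (source sentence, target sentence) pairs.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
model = train_word_translation(corpus)
print(best_translation(model, "haus"))
```

No dictionary or grammar rule is supplied: the pairing of "haus" with "house" emerges only because "das" is better explained by "the" in the second sentence pair, which is why more data keeps improving the model.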
Encyclopedia?
Asymptotic goal: Enter “the world’s most general knowledge,” down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia...
…There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles... A conservative estimate for the data enterers’ rate is one paragraph per day; this would make their total effort about 150 man-years.
Doug Lenat, 1985
Wikipedia
Un+Structured Data
YAGO
• “Yet Another Great Ontology”, 2007, MPI
• 10M entities, 120M facts
• http://en.wikipedia.org/wiki/Albert_Einstein
• (AlbertEinstein, bornInYear, 1879)
• (AlbertEinstein, hasWonPrize, NobelPrize)
• (AlbertEinstein, isA, Physicist)
• Uses the curated WordNet ontology and extends it with Wikipedia entities
• E.g. Albert Einstein is a Person
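The facts above are subject-predicate-object triples, and the "Albert Einstein is a Person" conclusion comes from walking up a class hierarchy. A minimal Python sketch of that idea follows. The `subclass_of` chain (Physicist, Scientist, Person) is a hypothetical stand-in for the WordNet hierarchy, and `classes_of` is an illustrative helper, not part of YAGO.

```python
# Triples taken from the slide.
facts = {
    ("AlbertEinstein", "bornInYear", "1879"),
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("AlbertEinstein", "isA", "Physicist"),
}

# Hypothetical slice of a WordNet-style hierarchy: class -> parent class.
subclass_of = {"Physicist": "Scientist", "Scientist": "Person"}

def classes_of(entity, facts, subclass_of):
    """Return the entity's asserted classes plus every ancestor class."""
    found = []
    for subject, predicate, obj in facts:
        if subject == entity and predicate == "isA":
            cls = obj
            # Climb the hierarchy until the root or an already-seen class.
            while cls is not None and cls not in found:
                found.append(cls)
                cls = subclass_of.get(cls)
    return found

print(classes_of("AlbertEinstein", facts, subclass_of))
```

Only the `isA Physicist` fact has to be extracted from the article; the more general classes follow from the curated hierarchy.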
YAGO
• Knowledge acquisition:
  • Work started in 2006
  • 2007: 1M entities, 5M facts
  • 2012: 10M entities, 120M facts
  • Now adding places
• Data export
• Query over SPARQL
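At its core, a SPARQL query matches triple patterns containing variables against the stored facts. The Python sketch below imitates that matching on a handful of in-memory triples. It is a simplified illustration, not a SPARQL engine; the `MarieCurie` and `SomePhysicist` rows and the bare predicate names are assumptions made for the example, not actual YAGO identifiers.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on mismatch."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # a variable
            if extended.setdefault(term, value) != value:
                return None       # variable already bound to something else
        elif term != value:       # a constant that does not match
            return None
    return extended

def query(patterns, triples):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

triples = [
    ("AlbertEinstein", "isA", "Physicist"),
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("MarieCurie", "isA", "Physicist"),           # hypothetical extra rows
    ("MarieCurie", "hasWonPrize", "NobelPrize"),
    ("SomePhysicist", "isA", "Physicist"),
]

# Roughly: SELECT ?x WHERE { ?x isA Physicist . ?x hasWonPrize NobelPrize }
winners = sorted(b["?x"] for b in query(
    [("?x", "isA", "Physicist"), ("?x", "hasWonPrize", "NobelPrize")],
    triples,
))
print(winners)
```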
DBpedia
• Created an ontology from scratch
• Crowdsourced the rule definition and mining
• More coverage, but less coherent model and structure
• 2.3M entities, 400M facts
• Uses YAGO ontology as part of resources
• Data export, and SPARQL queries
ESA
• Explicit Semantic Analysis
• Prof. Shaul Markovitch, Dr. Evgeniy Gabrilovich and yours truly
• The name is a pun on Latent Semantic Analysis (LSA); a quick context recap follows…
Latent Semantic Analysis
• Technique to find “hidden” semantic relations between groups of terms in documents
ESA
• Wikipedia articles are clear, coherent and universal semantic concepts
Panthera
Article words are associated with the concept (TF.IDF):
  Cat [0.92]
  Leopard [0.84]
  Roar [0.77]
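The word-to-concept association can be sketched as a TF-IDF weight computed per article and stored in an inverted index from words to concepts. The Python below is a minimal illustration under stated assumptions: the three "articles" are hypothetical one-line stand-ins for real Wikipedia text, and the raw weights are not normalized to the 0-1 scale shown on the slide.

```python
import math
from collections import Counter

# Hypothetical miniature "articles": concept title -> text.
articles = {
    "Panthera": "panthera is a genus of big cat the leopard and the lion roar",
    "Cat": "the cat is a small domestic animal a cat can purr",
    "Jane Fonda": "jane fonda is an american actress",
}

def build_concept_index(articles):
    """Map each word to {concept: TF-IDF weight} over the given articles."""
    tokenized = {title: text.split() for title, text in articles.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many articles does each word occur?
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    index = {}
    for title, words in tokenized.items():
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            if tf * idf > 0:  # words found in every article carry no signal
                index.setdefault(word, {})[title] = tf * idf
    return index

index = build_concept_index(articles)
print(sorted(index["cat"], key=index["cat"].get, reverse=True))
```

Looking up a word in `index` yields its concept vector, which is the representation the next slide describes.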
ESA
Cat:
  Panthera [0.92]
  Cat [0.95]
  Jane Fonda [0.07]
The semantics of a word is the vector of its associations with Wikipedia concepts
ESA
button:
  Dick Button [0.84]
  Button [0.93]
  Game Controller [0.32]
  Mouse (computing) [0.81]
mouse:
  Mouse (computing) [0.84]
  Mouse (rodent) [0.91]
  John Steinbeck [0.17]
  Mickey Mouse [0.81]
mouse button:
  Drag-and-drop [0.91]
  Mouse (computing) [0.95]
  Mouse (rodent) [0.56]
  Game Controller [0.64]
The semantics of a text fragment is the average vector (centroid) of the semantics of its words
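The centroid step can be sketched directly from the per-word scores on the slide. The Python below averages the "button" and "mouse" vectors; the `text_vector` helper is hypothetical. A plain average of these truncated four-concept vectors does not reproduce the slide's "mouse button" scores exactly, which appear to be illustrative, but it shows the same effect: the shared computing sense rises to the top.

```python
# Per-word concept vectors, using the scores shown on the slide.
word_vectors = {
    "button": {
        "Dick Button": 0.84,
        "Button": 0.93,
        "Game Controller": 0.32,
        "Mouse (computing)": 0.81,
    },
    "mouse": {
        "Mouse (computing)": 0.84,
        "Mouse (rodent)": 0.91,
        "John Steinbeck": 0.17,
        "Mickey Mouse": 0.81,
    },
}

def text_vector(text, word_vectors):
    """Average (centroid) of the concept vectors of the known words in `text`."""
    words = [w for w in text.lower().split() if w in word_vectors]
    centroid = {}
    for word in words:
        for concept, weight in word_vectors[word].items():
            centroid[concept] = centroid.get(concept, 0.0) + weight / len(words)
    return centroid

vec = text_vector("mouse button", word_vectors)
top = max(vec, key=vec.get)
print(top, round(vec[top], 3))
```

Each word alone is ambiguous, yet the only concept both words support strongly is "Mouse (computing)", so averaging acts as a simple form of disambiguation.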
Uses of ESA
• Text Categorization
• Semantic Relatedness
• Information Retrieval
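For semantic relatedness, two concept vectors can be compared with cosine similarity. The sketch below applies it to the sparse vectors shown on the earlier slides; the `cosine` helper is illustrative, and the resulting numbers describe only these truncated example vectors, not full ESA output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[concept] for concept, weight in u.items() if concept in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Concept vectors with the scores shown on the earlier slides.
mouse = {"Mouse (computing)": 0.84, "Mouse (rodent)": 0.91,
         "John Steinbeck": 0.17, "Mickey Mouse": 0.81}
button = {"Dick Button": 0.84, "Button": 0.93,
          "Game Controller": 0.32, "Mouse (computing)": 0.81}
cat = {"Panthera": 0.92, "Cat": 0.95, "Jane Fonda": 0.07}

print(cosine(mouse, button) > cosine(mouse, cat))
```

Because "mouse" and "button" share a concept while "mouse" and "cat" share none in these truncated vectors, the first pair scores higher. With full-length vectors the comparison would be less stark.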
More semantic projects
• Word-sense disambiguation
• Multi-lingual dictionary from language links
• Cross-lingual search (Cross-Lingual-ESA)
• WikiData
Questions?
References
• Cyc:
  • Lenat et al., CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks, AI Magazine Vol. 6 No. 4, 1985
  • Cycorp: http://www.cyc.com/
• YAGO:
  • Suchanek et al., YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007
  • YAGO at the Max-Planck-Institut: http://www.mpi-inf.mpg.de/yago-naga/yago/
• ESA:
  • E. Gabrilovich and S. Markovitch, Enhancing Text Categorization with Encyclopedic Knowledge, AAAI 2006
  • E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, IJCAI 2007
  • Egozi et al., Concept-Based Information Retrieval using Explicit Semantic Analysis, TOIS, 2011
• Others:
  • Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, Proceedings of NAACL HLT, 2007
  • Erdmann et al., An Approach for Extracting Bilingual Terminology from Wikipedia, LNCS Vol. 4947, 2008
  • Potthast et al., A Wikipedia-Based Multilingual Retrieval Model, Advances in Information Retrieval, 2008

Extracting Meaning from Wikipedia


Editor's Notes

  1. Lenat actually explained that Cyc would relieve the bottleneck by shifting it from the entry process itself to the decision of what data to enter. Compared to entering specific rules, entering facts and generalized rules is certainly better, but it is still manual.
  2. Fast forward 20 years…
  3. Fast forward 20 years…
  4. There were quite a few efforts to use this wealth of information; I’ll speak about one that was quite impressive in its breadth and comparable to Cyc.