Learning Multilingual Semantics
from Big Data on the Web
Gerard de Melo
Assistant Professor, Tsinghua University
http://gerard.demelo.org
Life at Tsinghua
Big Data on the Web
Matej Kren: Idiom. Prague Municipal Library https://www.flickr.com/photos/ill-padrino/6437837857/
From Big Data to
Multilingual Semantics?
Image: Brett Ryder
Manual Knowledge Organization
Image: http://commons.wikimedia.org/wiki/File:Mundaneum_Tir%C3%A4ng_Karteikaarten.jpg
Universal Bibliographic Repertory
(Répertoire Bibliographique Universel, RBU)
by Paul Otlet and Henri La Fontaine in 1895
index cards with answers to queries
Image: Mundaneum
Alex Wright: This was a sort of
“analog search engine”
Zipfian Distribution
https://commons.wikimedia.org/wiki/File:Moby_Dick_Words.gif
Big Data on the Web
+
Goal: Large Yet
Reasonably Clean Knowledge
Theological Hall, Strahov Monastery Library, Prague
Outline
Large-Scale Knowledge Graphs
Semantics in Action
Models for the Future
Lexical Knowledge
Portuguese-Chinese Dictionary by Ruggieri et al. (1580s)
The first European-Chinese dictionary
https://commons.wikimedia.org/wiki/File:Ricci-Ruggieri-Portuguese-Chinese-dictionary-page-1.png
Provides translations, antonyms, etc.
Wiktionary
e.g. “salary” < Lat. “salarius” < Lat. “sal” (salt)
Etymological Wordnet
LREC 2014
Old English
Example
Lexical Ambiguities
Hipsters in London
Images:
https://www.flickr.com/photos/poisonbabyfood/4274634681
https://www.facebook.com/alexander.balabanov.82
Reunion
Images:
https://commons.wikimedia.org/wiki/File:Reunions_Class_of_82_2007.jpg
https://commons.wikimedia.org/wiki/File:Riviere_Langevin_Trou_Noir_P1440224-35.jpg
and many more...
Lexical Knowledge Bases
Multilingual Lexical Knowledge
UWN (de Melo & Weikum 2009)
UWN: Universal Wordnet
Before:
manual work over two decades but not many large wordnets
Our Approach:
● Exploit translation resources on the Web
● Learn regression model with sophisticated graph-based features
over 1,000,000 words in over 100 languages
CIKM 2009
ICGL 2008 Best Paper Award
UWN: Taxonomy
UWN: Getting Started
Simple API for JVM Languages
import java.io.File
// Scala: list the meanings of the French ("fra") word "souris"
val uwn = new UWN(new File("plugins/"))
for (m <- uwn.getMeanings("souris", "fra"))
  println(m)
Or Just Download the TSV File
Adding Other Sources
Language-specific,
Domain-specific,
Arbitrary Databases
https://commons.wikimedia.org/wiki/File:Encyclopedia_Britannica_in_the_library_of_The_Kings_School,_Goa.jpg
Rob Matthews: printed small sample of Wikipedia
Actually, a printed Wikipedia corresponds to
2000 Britannica volumes
Source:
http://www.labnol.org/internet/wikipedia-printed-book/9136/
Merging Structured Data
ACL 2010
AAAI 2013
Use identity links to connect what is equivalent
Trentino
Trentino-Alto Adige
One bad link is enough to make a connected component inconsistent
Source: Peter Mika
Entity Integration:
Challenges
Distinctness Assertions
D_i = ({en:Province of Trento, en:Trentino},
       {en:Trentino-South Tyrol, en:Trentino-Alto Adige/Südtirol})
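The interaction between identity links and a distinctness assertion can be made concrete with a small check. The following is only a minimal sketch, assuming identity links are given as undirected string pairs and an assertion as a sequence of disjoint node sets; the object name, the helper names and the union-find representation are illustrative choices, not the authors' implementation. It reports pairs that the assertion declares distinct but that the links place in one connected component.

```scala
object DistinctnessCheck {
  // Undirected identity link between two node labels (assumed representation).
  type Edge = (String, String)

  // A distinctness assertion: disjoint node sets whose members must not be
  // merged across sets.
  type Assertion = Seq[Set[String]]

  /** Maps every linked node to a representative of its connected component. */
  def components(edges: Seq[Edge]): Map[String, String] = {
    val parent = scala.collection.mutable.Map.empty[String, String]
    def find(x: String): String = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else { val r = find(p); parent(x) = r; r } // path compression
    }
    edges.foreach { case (a, b) =>
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(ra) = rb
    }
    parent.keys.toList.map(n => n -> find(n)).toMap
  }

  /** Pairs declared distinct by `assertion` that the links nevertheless connect. */
  def violations(edges: Seq[Edge], assertion: Assertion): Seq[(String, String)] = {
    val comp = components(edges)
    def rep(n: String): String = comp.getOrElse(n, n)
    for {
      i <- assertion.indices
      j <- assertion.indices if i < j
      a <- assertion(i).toSeq
      b <- assertion(j).toSeq
      if rep(a) == rep(b)
    } yield (a, b)
  }
}
```

Calling `DistinctnessCheck.violations` with a chain of links that joins en:Trentino to en:Trentino-South Tyrol through some intermediate node returns that pair, which illustrates how a single bad link makes the whole component inconsistent.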
How to reconcile equivalence and distinctness evidence?
a) ignore some equivalence information
(delete certain edges)
b) ignore some distinctness information
(remove node from distinctness assertion)
Min. cost solution:
NP-hard
APX-hard
Linear Program Relaxation
Finally, use region growing algorithm in the spirit of Leighton & Rao 1988
Approximation Guarantee: $4\ln(nq+1)$
for $n$ distinctness assertions, $q = \max_{i,j} |D_{i,j}|$,
but independent of $|D_i|$!
Nice:
This generalizes the Hungarian Algorithm
to various advanced types of non-standard matchings
(cf. de Melo. AAAI 2013)
Separated Concepts
(Multilingual Wikipedia)
Application:
Lexvo.org
Semantic Web Journal 2014
Interdisciplinary Work, e.g. in Digital Humanities
Taxonomic Organization
a user wants a list of “Art Schools in Europe”
Multilingual Taxonomies
a Swedish user wants a list of “Konstskolor i Europa”
Taxonomic Integration:
MENTA Approach
De Melo & Weikum (2010).
CIKM Best Interdisciplinary Paper Award
Predict Individual
Taxonomic Links:
Article → Category
Category → WordNet
Image: https://de.wikipedia.org/wiki/Datei:Bersntol_palae.jpg
Fersental
(Bersntol, Valle dei Mòcheni)
Image: https://de.wikipedia.org/wiki/Datei:Language_distribution_Trentino_2011.png
Use Identity Constraint Algorithm
to form equivalence classes
Markov Chain Random Walk with Restarts
to Rank Parents
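As a rough illustration of ranking candidate parents by a random walk with restarts, the sketch below runs plain power iteration on a small weighted graph. The graph encoding, the restart probability of 0.15, the iteration count and the handling of nodes without outgoing edges are all assumptions made for the example; the deck does not specify MENTA's actual parameters or edge weighting.

```scala
object RandomWalkWithRestarts {
  /** Power iteration for a random walk with restarts.
    * `edges(u)` lists (neighbour, weight) pairs; at every step the walker
    * jumps back to `start` with probability `restart`. */
  def scores(
      edges: Map[String, Seq[(String, Double)]],
      start: String,
      restart: Double = 0.15,
      iterations: Int = 50): Map[String, Double] = {
    val nodes = (edges.keySet ++ edges.values.flatten.map(_._1) + start).toSeq
    var p: Map[String, Double] =
      nodes.map(n => n -> (if (n == start) 1.0 else 0.0)).toMap
    for (_ <- 1 to iterations) {
      val next = scala.collection.mutable.Map.empty[String, Double].withDefaultValue(0.0)
      p.foreach { case (u, mass) =>
        val out = edges.getOrElse(u, Seq.empty)
        val total = out.map(_._2).sum
        if (total > 0)
          out.foreach { case (v, w) => next(v) += (1 - restart) * mass * w / total }
        else
          next(start) += (1 - restart) * mass // no outgoing edges: return to start
      }
      next(start) += restart // restart mass (the distribution always sums to 1)
      p = nodes.map(n => n -> next(n)).toMap
    }
    p
  }

  /** Orders candidate parents by descending visiting probability. */
  def rank(candidates: Seq[String], s: Map[String, Double]): Seq[String] =
    candidates.sortBy(c => -s.getOrElse(c, 0.0))
}
```

Starting the walk at an entity and ranking its candidate parents by the resulting scores favours parents that are reachable through many or heavily weighted paths.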
New Algorithm:
Structured Output Prediction
Bansal et al.
ACL 2014. Best Paper Runner-Up
Belief Propagation exploiting Kirchhoff’s Matrix Tree Theorem
for efficient handling of tree factor
Chu-Liu-Edmonds directed spanning tree algorithm for decoding
UWN/MENTA
CIKM 2010
Best Paper Award
Biggest (ontological) taxonomy
multilingual extension of WordNet for
word senses and taxonomical information over 200 languages
OutlineOutline
Large-Scale Knowledge Graphs
Semantics in Action
Models for the Future
Language Education
UWN
http://www.lexvo.org/uwn/
Application: Sense-Disambiguated Example Sentences
Application: Monolingual Language Users
Thesauri
See also: de Melo & Weikum (2008).
Borin, Allwood, de Melo. LREC 2014.
Application: Machine Translation
OpenWN-PT:
Used by Google Translate
Machine Learning
[Diagram: Examples (Correct / Incorrect) → Learning → Classifier Model → Prediction: “Probably Incorrect!”]
Better Machine Learning
Better Classifier! + Better Labels for Test Data
MT?
UWN Senses in MT?
Issue: Senses should be less fine-grained
No Word Left Behind
Web page: http://www.buzzfeed.com/paulf24/24-signs-youre-in-a-pretty-rad-relationship-b5ra#.txpBGq4p4
Similar: Part-Of-Speech Tagging
● British fans gathered at the stadium to...
ADJECTIVE
● Didgeridoo fans gathered at the park to...
???
“Didgeridoo” is similar to:
“horn” (NOUN)
“drums” (NOUN)
“accordion” (NOUN)
● ...Astrálach is ea an didiridiú
???
Gaelic “didiridiú” translates to
“didgeridoo” (NOUN) in English
Sentence Level
What about
Document-Level Tasks?
Public Domain Image from https://pixabay.com/en/book-text-read-paper-education-451067/
Document Level
“10th street new york jaguar show”

Word features:
“new” 1.0
“york” 1.0
“jaguar” 1.0
“automobile” 0.0
“car” 0.0
“10th” 1.0
“street” 1.0
“show” 1.0
...
Similar:
“10th New show in York”
“New Jaguar show”
“Show New Street in York”

Concept features:
New_York 1.0
Jaguar (car) 0.0
Jaguar (animal) 1.0
Automobile/Car 0.0
10th Street 1.0
Performance 1.0
...
Similar:
“10th street nyc jaguar show”

Expansion (de Melo & Siersdorfer 2007):
Animal 0.5
Vehicle 0.0
Similar:
“10th street nyc animal show”
“Exposición de jaguares Nueva York”
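The step from word features to concept features with expansion can be sketched as follows. This is only a toy illustration: the lexicon `termToConcept` and the one-entry taxonomy `parentOf` are hypothetical stand-ins for a resource such as UWN, substring matching replaces real sense disambiguation, and only the parent weight of 0.5 is taken from the slide.

```scala
object ConceptExpansion {
  // Hypothetical toy lexicon: surface term (any language) -> concept identifier.
  val termToConcept: Map[String, String] = Map(
    "jaguar" -> "Jaguar (animal)",
    "jaguares" -> "Jaguar (animal)",
    "animal" -> "Animal",
    "nyc" -> "New_York",
    "new york" -> "New_York",
    "nueva york" -> "New_York")

  // Hypothetical taxonomy: concept -> parent concept.
  val parentOf: Map[String, String] = Map("Jaguar (animal)" -> "Animal")

  /** Concept features: matched concepts get weight 1.0, their parents get
    * `parentWeight` (0.5 on the slide). Matching is crude substring search. */
  def features(text: String, parentWeight: Double = 0.5): Map[String, Double] = {
    val lower = text.toLowerCase
    val direct = termToConcept.collect {
      case (term, concept) if lower.contains(term) => concept
    }.toSet
    val parents = direct.flatMap(parentOf.get) -- direct
    direct.map(_ -> 1.0).toMap ++ parents.map(_ -> parentWeight).toMap
  }

  /** Cosine similarity between two sparse feature vectors. */
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = a.keySet.intersect(b.keySet).toSeq.map(k => a(k) * b(k)).sum
    def norm(m: Map[String, Double]): Double = math.sqrt(m.values.map(v => v * v).sum)
    if (dot == 0.0) 0.0 else dot / (norm(a) * norm(b))
  }
}
```

With this toy lexicon, “10th street nyc jaguar show” and “Exposición de jaguares Nueva York” receive the same concept vector even though they share no words, and the expanded Animal feature raises the similarity to “10th street nyc animal show”.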
Multilingual Tasks:
Cross-Lingual Text Classification
Given: training documents with class labels
Goal: predict class labels for test documents in
some other language
Result: better than plain machine translation.
See de Melo & Siersdorfer 2007.
Sentence-Level Semantics
Capture the “who-did-what-to-whom”
Microsoft bought the patent from Nokia.
Nokia sold the patent to Microsoft.
The patent was acquired by Microsoft [from Nokia].
The patent was sold [by Nokia] to Microsoft.
Underlying frame:
Commercial transfer
Buyer: Microsoft
Seller: Nokia
Product: The patent
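To illustrate how the four surface forms collapse into one frame instance, here is a deliberately naive sketch. The case class, its field names and the four hand-written regular expressions are illustrative assumptions (real frame-semantic parsing does not work this way), and the optional bracketed phrases from the slide are assumed to be present.

```scala
object CommercialTransferFrames {
  final case class CommercialTransfer(buyer: String, seller: String, product: String)

  // One hand-written surface pattern per construction; purely illustrative.
  private val ActiveBuy   = """(\w+) bought (.+) from (\w+)\.""".r
  private val ActiveSell  = """(\w+) sold (.+) to (\w+)\.""".r
  private val PassiveBuy  = """(.+) was acquired by (\w+) from (\w+)\.""".r
  private val PassiveSell = """(.+) was sold by (\w+) to (\w+)\.""".r

  /** Maps a sentence to a frame instance, or None if no pattern matches.
    * The product is lower-cased so that "The patent" and "the patent" agree. */
  def parse(sentence: String): Option[CommercialTransfer] = sentence match {
    case PassiveBuy(p, b, s)  => Some(CommercialTransfer(b, s, p.toLowerCase))
    case PassiveSell(p, s, b) => Some(CommercialTransfer(b, s, p.toLowerCase))
    case ActiveBuy(b, p, s)   => Some(CommercialTransfer(b, s, p.toLowerCase))
    case ActiveSell(s, p, b)  => Some(CommercialTransfer(b, s, p.toLowerCase))
    case _                    => None
  }
}
```

All four example sentences yield the same `CommercialTransfer("Microsoft", "Nokia", "the patent")` value, which is the point of a frame-based representation: the roles stay fixed while the syntax varies.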
FrameBase.org
Bringing knowledge into a standard form
based on natural language (FrameNet)
ESWC 2015
Best Student Paper Nominee
Relation Integration
X isAuthorOf Y
Y writtenBy X
X wrote Y
Y writtenInYear Z
Challenge: Modelling Differences
YAGO: isMarriedTo predicate
Freebase: Marriage Entity
Search Interfaces
“Which companies were created during the
last century in Silicon Valley?”
YAGO2:
WWW 2011
Best Demo Award
Answering Questions
IBM's Jeopardy!-winning Watson system
What Goes into Word Vectors?
Jiaqiang Chen and Gerard de Melo 2015
The Roman Empire was remarkably
multicultural, with “a rather astonishing
cohesive capacity” to create a sense
of shared identity while encompassing
diverse peoples within its political
system over a long span of time.
[Context words annotated on the slide as: syntactic, semantic!, syntactic?, ?]
Word2Vec Solution:
Subsampling
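Subsampling reduces the influence of very frequent (mostly syntactic) context words by randomly dropping them. The sketch below follows the discard probability given in Mikolov et al. (2013), 1 - sqrt(t / f(w)); the function names, the fixed random seed and the treatment of unseen words are assumptions for illustration, and the released word2vec tool uses a slightly different variant of the formula.

```scala
object Subsampling {
  /** Discard probability from Mikolov et al. (2013):
    * P(discard w) = 1 - sqrt(t / f(w)), clipped at 0, where f(w) is the
    * relative corpus frequency of w and t a threshold (1e-5 in the paper). */
  def discardProbability(relFreq: Double, t: Double = 1e-5): Double =
    math.max(0.0, 1.0 - math.sqrt(t / relFreq))

  /** Randomly drops frequent tokens before context windows are formed.
    * Words missing from `relFreq` are always kept (an assumption). */
  def subsample(
      tokens: Seq[String],
      relFreq: Map[String, Double],
      t: Double = 1e-5,
      rng: scala.util.Random = new scala.util.Random(0)): Seq[String] =
    tokens.filter { w =>
      rng.nextDouble() >= discardProbability(relFreq.getOrElse(w, t), t)
    }
}
```

A word covering 0.1% of the corpus is dropped about 90% of the time under the default threshold, while a word with relative frequency at or below t is never dropped.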
Word2Vec Approach
Image: Alexandre Duret-Lutz
https://www.flickr.com/photos/gadl/110845690/
Take everything we can get
Our Proposal:
Extract the Most Valuable Parts
Image: Theological Hall, Strahov Monastery Library, Prague
…Greek and Roman mythology...
semantic!
look for semantically salient contexts in text!
Two Worlds
Distributional Semantics:
Use all available text
(Symbolic) Information Extraction:
Look for valuable connections
Proposed Research Program:
Joint Training
Better Word Embeddings
Best Paper Award at NAACL 2015 Vector Space Modeling Workshop
Use parallel threads
Preliminary Experiments:
Joint Training
Recently lots of related work, e.g.
Faruqui et al., Hill & Korhonen,
Wang et al., Johansson & Nieto Piña
Use negative sampling
Preliminary Experiments:
Information Extraction
Variant 1: Definition Extraction
Source: GCIDE
Definitions
befuddle: to becloud and confuse as with liquor
befuddled: dazed by alcoholic drink
befuddled: confused and vague used especially of thinking
beg: to ask earnestly for, to entreat or supplicate for, to beseech
Synonyms
effectual: effectual efficacious effective
effectuality: effectiveness effectivity effectualness
efficacious: effectual
efficaciousness: efficacy
Variant 2: List Extraction
● Look for repeated occurrences of commas
● Short units of roughly equal length
● noun phrases, adjectives
● Also: Hearst patterns, e.g.
“cities such as New York, London, ...”
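The list heuristics above can be approximated with a few lines of string processing. This is a rough sketch under stated assumptions: the thresholds (`minItems`, `maxWords`), the splitting regular expression and the single “such as” pattern are illustrative guesses, only a maximum item length is checked rather than "roughly equal length", and no part-of-speech filtering for noun phrases or adjectives is attempted.

```scala
object ListExtraction {
  // Splits on commas (optionally followed by "and"/"or") and on bare "and"/"or".
  private val Separator = """,\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+"""

  private def split(text: String): Seq[String] =
    text.split(Separator).map(_.trim).filter(_.nonEmpty).toSeq

  /** Treats a sentence as a list if it splits into at least `minItems`
    * units of at most `maxWords` words each (thresholds are assumptions). */
  def commaList(sentence: String, minItems: Int = 3, maxWords: Int = 3): Option[Seq[String]] = {
    val items = split(sentence.stripSuffix("."))
    val short = items.forall(_.split("\\s+").length <= maxWords)
    if (items.length >= minItems && short) Some(items) else None
  }

  // Hearst pattern "<class> such as <item>, <item>, ...".
  private val SuchAs = """(\w+) such as (.+)""".r

  /** Returns the class word and the listed items for a "such as" pattern. */
  def hearst(sentence: String): Option[(String, Seq[String])] =
    SuchAs.findFirstMatchIn(sentence.stripSuffix(".")).flatMap { m =>
      val items = split(m.group(2))
      if (items.nonEmpty) Some(m.group(1) -> items) else None
    }
}
```

Items extracted this way can then be treated as mutually related words, which is the kind of semantically salient context the list-extraction variant aims for.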
Extracted Lists
player captain manager director vice-chairman
group race culture religion organisation person person
Italian Mexican Chinese Creole French
Self-Portraits Portraits iris Still-Lives with Sunflowers view from the Asylum Works after Millet Vineyards
ballscrews leadscrews worm gear screwjacks linear actuator
Cleveland Essex Lincolnshire Northamptonshire Nottinghamshire Thames Valley South Wales
ant.py dimdriver.py dimdriverdatafile.py dimdriverdatasetdef.py dimexception.py dimmaker.py dimoperators.py dimparser.py dimrex.py dimension.py
Preliminary Experiments:
Setup
Wikipedia 2010
normalize to lower case and remove special characters
Contains 1,205,009,010 words
Select words appearing at least 50 times
Vocabulary size 220,521
Vector dim. 300
Balance components simply by controlling
starting learning rates:
0.05 for CBOW, varying
rates for extracted information
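The preprocessing steps in the setup (lower-casing, removing special characters, keeping words with at least 50 occurrences) could look roughly like this. It is a minimal sketch: what counts as a "special character" is an assumption (here, anything that is not a letter, digit or whitespace), and the function names are illustrative.

```scala
object VocabularySetup {
  /** Lower-cases the text, replaces everything except letters, digits and
    * whitespace with a space, and splits on whitespace. */
  def normalize(text: String): Seq[String] =
    text.toLowerCase
      .replaceAll("""[^\p{L}\p{N}\s]""", " ")
      .split("""\s+""")
      .filter(_.nonEmpty)
      .toSeq

  /** Keeps the words that occur at least `minCount` times (50 on the slide). */
  def vocabulary(tokens: Seq[String], minCount: Int = 50): Set[String] =
    tokens.groupBy(identity).collect {
      case (word, occurrences) if occurrences.size >= minCount => word
    }.toSet
}
```

For a corpus of Wikipedia's size the counting would be streamed rather than done with an in-memory `groupBy`, but the filtering logic stays the same.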
Preliminary Experiments:Preliminary Experiments:
Results on WS353Results on WS353
Preliminary Experiments:Preliminary Experiments:
Results on WS353Results on WS353
Positive effect from
0.001 until around 0.04
Positive effect from
0.001 until around 0.04
Jiaqiang Chen and Gerard de Melo 2015
Best Paper Award
at NAACL 2015
Vector Space
Modeling Workshop
Best Paper Award
at NAACL 2015
Vector Space
Modeling Workshop
Preliminary Experiments:Preliminary Experiments:
ExampleExample
Preliminary Experiments:Preliminary Experiments:
ExampleExample
Jiaqiang Chen and Gerard de Melo 2015
Outline
Large-Scale Knowledge Graphs
Semantics in Action
Models for the Future
History Repeating?
SMT: Phrase-Based SMT, Hierarchical Phrases, WSD, MEANT etc.
NMT: Extended NMT?
Well-Known Issues
Source: The New Yorker
Future: Learning Common-Sense
Learning Common-Sense
WebChild
AAAI 2014
WSDM 2014
AAAI 2011
Lexical Intensity Orderings
[Figure: the adjectives hot, warm, fiery and scorching arranged on a scale from weak to strong]
TACL 2013
Knowlywood: Human Activities
CIKM 2015
Extension to Relationships
[Figure: points labeled petronia, sparrow, parched, arid, dry, bird]
http://www.wikihow.com/Read-a-Book-to-a-Baby-or-Infant#/Image:Read-a-Book-to-a-Baby-or-Infant-Step-5.jpg
Should account for relationships (incl. affordances, causality, etc.)
Extension to Relationships
Assume that she is learning just from text
Information Extraction from Text
1. Gather large numbers of patterns
2. Use Web-scale data (Google N-Grams, derived from 10^12 words of text)
Hearst-style bootstrapping with large numbers of seeds
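The two steps above can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the system described in the talk: n-grams are given as (tokens, count) pairs, a pattern is just the token sequence found between two seed words, and there is no pattern scoring or filtering. The function names and the sample n-grams are hypothetical.

```python
from collections import Counter

def induce_patterns(ngrams, seeds):
    """Collect the tokens between x and y for every seed pair (x, y) seen in an n-gram."""
    patterns = Counter()
    for tokens, count in ngrams:
        for i, x in enumerate(tokens):
            for j in range(i + 2, len(tokens)):
                if (x, tokens[j]) in seeds:
                    patterns[tuple(tokens[i + 1:j])] += count
    return patterns

def apply_patterns(ngrams, patterns):
    """Score candidate pairs (x, y) by the total count of n-grams 'x <pattern> y'."""
    candidates = Counter()
    for tokens, count in ngrams:
        for infix in patterns:
            n = len(infix)
            for i in range(len(tokens) - n - 1):
                if tuple(tokens[i + 1:i + 1 + n]) == infix:
                    candidates[(tokens[i], tokens[i + n + 1])] += count
    return candidates

# Toy n-grams with made-up counts, for illustration only.
NGRAMS = [
    (("sonar", "is", "part", "of", "submarine"), 120),
    (("wheel", "is", "part", "of", "car"), 300),
    (("wheel", "and", "car"), 50),
]
SEEDS = {("sonar", "submarine")}

patterns = induce_patterns(NGRAMS, SEEDS)
candidates = apply_patterns(NGRAMS, patterns)
```

Bootstrapping would repeat the loop, feeding high-scoring candidates back in as seeds; with many seeds per relation, unreliable patterns can be filtered by how many distinct seeds support them.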
Extension to Relationships
Commonsense word relationships extracted from Google 1T n-grams
24 relations bootstrapped via ConceptNet
→ 1,158,141 triples
earring hasProperty gorgeous
concept definedAs theory
sonar partOf submarine
predator desires food
Jiaqiang Chen, Niket Tandon, Gerard de Melo. WI 2015
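Triples of this form are easy to index for lookup in either direction. The sketch below uses only the four example triples shown on the slide; the index layout and the name `build_index` are illustrative assumptions, not the format of the released data.

```python
from collections import defaultdict

# The four example triples shown on the slide.
TRIPLES = [
    ("earring", "hasProperty", "gorgeous"),
    ("concept", "definedAs", "theory"),
    ("sonar", "partOf", "submarine"),
    ("predator", "desires", "food"),
]

def build_index(triples):
    """Index triples by (subject, relation) and by (relation, object)."""
    by_subject = defaultdict(set)
    by_object = defaultdict(set)
    for subject, relation, obj in triples:
        by_subject[(subject, relation)].add(obj)
        by_object[(relation, obj)].add(subject)
    return by_subject, by_object

by_subject, by_object = build_index(TRIPLES)
print(sorted(by_subject[("predator", "desires")]))  # ['food']
print(sorted(by_object[("partOf", "submarine")]))   # ['sonar']
```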
Extension to Relationships
What causes cancer?
Can cats fly?
Summary
Large-Scale Knowledge Graphs
► Universal WordNet/MENTA: large multilingual taxonomy
► Etymological WordNet
Semantics in Action, e.g.
► Lexvo.org
► Question Answering with YAGO
Future Perspectives
► Vector Representations
► Common-Sense for NLU
More Information:
www.demelo.org
gdm@demelo.org
Gerard de Melo

Learning Multilingual Semantics from Big Data on the Web

  • 1. Learning Multilingual Semantics from Big Data on the Web Gerard de Melo Assistant Professor, Tsinghua University http://gerard.demelo.org Learning Multilingual Semantics from Big Data on the Web Gerard de Melo Assistant Professor, Tsinghua University http://gerard.demelo.org
  • 4. Big Data on the WebBig Data on the WebBig Data on the WebBig Data on the Web Matej Kren: Idiom. Prague Municipal Library https://www.flickr.com/photos/ill-padrino/6437837857/
  • 5. From Big Data toFrom Big Data to Multilingual Semantics?Multilingual Semantics? From Big Data toFrom Big Data to Multilingual Semantics?Multilingual Semantics? Image: Brett Ryder
  • 6. Manual Knowledge OrganizationManual Knowledge Organization Image: http://commons.wikimedia.org/wiki/File:Mundaneum_Tir%C3%A4ng_Karteikaarten.jpg Universal Bibliographic Repertory (Repertoire Bibliographique Universel, RBU) by Paul Otlet and Henri La Fontaine in 1895 index cards with answers to queries Universal Bibliographic Repertory (Repertoire Bibliographique Universel, RBU) by Paul Otlet and Henri La Fontaine in 1895 index cards with answers to queries
  • 7. Manual Knowledge OrganizationManual Knowledge Organization Image: Mundaneum Universal Bibliographic Repertory (Repertoire Bibliographique Universel, RBU) by Paul Otlet and Henri La Fontaine in 1895 index cards with answers to queries Universal Bibliographic Repertory (Repertoire Bibliographique Universel, RBU) by Paul Otlet and Henri La Fontaine in 1895 index cards with answers to queries Alex Wright: This was a sort of “analog search engine” Alex Wright: This was a sort of “analog search engine”
  • 8. Zipfian DistributionZipfian DistributionZipfian DistributionZipfian Distribution https://commons.wikimedia.org/wiki/File:Moby_Dick_Words.gif
  • 9. Big Data on the WebBig Data on the WebBig Data on the WebBig Data on the Web +
  • 10. Goal: Large YetGoal: Large Yet Reasonably Clean KnowledgeReasonably Clean Knowledge Goal: Large YetGoal: Large Yet Reasonably Clean KnowledgeReasonably Clean Knowledge Theological Hall, Strahov Monastery Library, Prague
  • 11. OutlineOutline Large-Scale Knowledge Graphs Semantics in Action Models for the Future
  • 12. OutlineOutline Large-Scale Knowledge Graphs Semantics in Action Models for the Future
  • 13. Lexical Knowledge Portuguese-Chinese Dictionary by Ruggieri et al. (1580s) The first European-Chinese dictionary https://commons.wikimedia.org/wiki/File:Ricci-Ruggieri-Portuguese-Chinese-dictionary-page-1.png
  • 14. Provides translations, antonyms, etc. WiktionaryWiktionary
  • 17. e.g. “salary” < Lat. “salarius” < Lat. “sal” (salt) Etymological WordnetEtymological Wordnet
  • 18. LREC 2014LREC 2014 Etymological WordnetEtymological Wordnet
  • 19. LREC 2014LREC 2014 Etymological WordnetEtymological Wordnet
  • 20. Etymological WordnetEtymological Wordnet Old English Example Old English Example
  • 26. Multilingual Lexical Knowledge UWN (de Melo & Weikum 2009)
  • 27. UWN: Universal Wordnet Before: manual work over two decades but not many large wordnets Before: manual work over two decades but not many large wordnets Our Approach: ● Exploit translation resources on the Web ● Learn regression model with sophisticated graph-based features Our Approach: ● Exploit translation resources on the Web ● Learn regression model with sophisticated graph-based features Gerard de Melo
  • 29. UWN: Universal Wordnet over 1,000,000 words in over 100 languages CIKM 2009CIKM 2009 ICGL 2008ICGL 2008 Best Paper AwardBest Paper Award ICGL 2008ICGL 2008 Best Paper AwardBest Paper Award Gerard de Melo
  • 31. UWN: Getting StartedUWN: Getting Started Simple API for JVM Languages val uwn = new UWN(new File("plugins/")) for (m <- uwn.getMeanings("souris", "fra")) println(m) Or Just Download the TSV File Simple API for JVM Languages val uwn = new UWN(new File("plugins/")) for (m <- uwn.getMeanings("souris", "fra")) println(m) Or Just Download the TSV File
  • 32. Adding Other Sources Gerard de Melo Language-specific,Language-specific, Domain-specific,Domain-specific, Arbitrary DatabasesArbitrary Databases Language-specific,Language-specific, Domain-specific,Domain-specific, Arbitrary DatabasesArbitrary Databases
  • 33. Adding Other SourcesAdding Other SourcesAdding Other SourcesAdding Other Sources https://commons.wikimedia.org/wiki/File:Encyclopedia_Britannica_in_the_library_of_The_Kings_School,_Goa.jpg
  • 34. Adding Other SourcesAdding Other SourcesAdding Other SourcesAdding Other Sources Rob Matthews: printed small sample of Wikipedia Actually, a printed Wikipedia corresponds to 2000 Britannica volumes Source: http://www.labnol.org/internet/wikipedia-printed-book/9136/ Actually, a printed Wikipedia corresponds to 2000 Britannica volumes Source: http://www.labnol.org/internet/wikipedia-printed-book/9136/
  • 35. ACL 2010 AAAI 2013 ACL 2010 AAAI 2013 Use Identity Links to connect What is equivalent Merging Structured DataMerging Structured DataMerging Structured DataMerging Structured Data
  • 36. Merging Structured DataMerging Structured DataMerging Structured DataMerging Structured Data ACL 2010 AAAI 2013 ACL 2010 AAAI 2013
  • 37. Merging Structured DataMerging Structured Data Trentino Trentino- Alto Adige
  • 38. Merging Structured DataMerging Structured DataMerging Structured DataMerging Structured Data One bad link isOne bad link is enough to make aenough to make a connected componentconnected component inconsistentinconsistent One bad link isOne bad link is enough to make aenough to make a connected componentconnected component inconsistentinconsistent ACL 2010 AAAI 2013 ACL 2010 AAAI 2013
  • 39. Source: Peter Mika Entity Integration: Challenges Entity Integration: Challenges
  • 40. Merging Structured DataMerging Structured Data Distinctness Assertions Di = ({en: Province of Trento, en:Trentino}, {en:Trentino-South Tyrol, en:Trentino-Alto Adige/Südtirol}) Distinctness Assertions Di = ({en: Province of Trento, en:Trentino}, {en:Trentino-South Tyrol, en:Trentino-Alto Adige/Südtirol}) ACL 2010 AAAI 2013 ACL 2010 AAAI 2013
  • 41. How to reconcileHow to reconcile equivalenceequivalence andand distinctnessdistinctness evidence?evidence? How to reconcileHow to reconcile equivalenceequivalence andand distinctnessdistinctness evidence?evidence? a) ignore somea) ignore some equivalence informationequivalence information (delete certain edges)(delete certain edges) a) ignore somea) ignore some equivalence informationequivalence information (delete certain edges)(delete certain edges) b) ignore someb) ignore some distinctness informationdistinctness information (remove node from(remove node from distinctness assertion)distinctness assertion) b) ignore someb) ignore some distinctness informationdistinctness information (remove node from(remove node from distinctness assertion)distinctness assertion) Merging Structured DataMerging Structured DataMerging Structured DataMerging Structured Data ACL 2010 AAAI 2013 ACL 2010 AAAI 2013
  • 42. Min. cost solution:Min. cost solution: NP-hardNP-hard APX-hardAPX-hard Min. cost solution:Min. cost solution: NP-hardNP-hard APX-hardAPX-hard Merging Structured DataMerging Structured DataMerging Structured DataMerging Structured Data ACL 2010 AAAI 2013 ACL 2010 AAAI 2013
  • 43. Finally, use region growingFinally, use region growing algorithm in the spiritalgorithm in the spirit of Leighton & Rao 1988of Leighton & Rao 1988 Finally, use region growingFinally, use region growing algorithm in the spiritalgorithm in the spirit of Leighton & Rao 1988of Leighton & Rao 1988 Linear Program RelaxationLinear Program RelaxationLinear Program RelaxationLinear Program Relaxation Approximation Guarantee:Approximation Guarantee: 4ln(nq+1)4ln(nq+1) for n distinctness assertions,for n distinctness assertions, q=max |Dq=max |Di,ji,j || but independent of |Dbut independent of |Dii | !| ! Approximation Guarantee:Approximation Guarantee: 4ln(nq+1)4ln(nq+1) for n distinctness assertions,for n distinctness assertions, q=max |Dq=max |Di,ji,j || but independent of |Dbut independent of |Dii | !| ! Merging Structured DataMerging Structured DataMerging Structured DataMerging Structured Data
  • 44. Linear Program RelaxationLinear Program RelaxationLinear Program RelaxationLinear Program Relaxation Nice:Nice: This generalizes theThis generalizes the Hungarian AlgorithmHungarian Algorithm to various advancedto various advanced types of non-standardtypes of non-standard matchingsmatchings (cf. de Melo. AAAI 2013)(cf. de Melo. AAAI 2013) Nice:Nice: This generalizes theThis generalizes the Hungarian AlgorithmHungarian Algorithm to various advancedto various advanced types of non-standardtypes of non-standard matchingsmatchings (cf. de Melo. AAAI 2013)(cf. de Melo. AAAI 2013) Merging Structured DataMerging Structured DataMerging Structured DataMerging Structured Data
  • 45. Separated ConceptsSeparated Concepts (Multilingual Wikipedia)(Multilingual Wikipedia) Separated ConceptsSeparated Concepts (Multilingual Wikipedia)(Multilingual Wikipedia)
  • 46. Application: Lexvo.org Semantic WebSemantic Web Journal 2014Journal 2014 Semantic WebSemantic Web Journal 2014Journal 2014
  • 47. Lexvo.orgLexvo.org Semantic WebSemantic Web Journal 2014Journal 2014 Semantic WebSemantic Web Journal 2014Journal 2014
  • 48. Semantic WebSemantic Web Journal 2014Journal 2014 Semantic WebSemantic Web Journal 2014Journal 2014 InterdisciplinaryInterdisciplinary Work, e.g. inWork, e.g. in Digital HumanitiesDigital Humanities InterdisciplinaryInterdisciplinary Work, e.g. inWork, e.g. in Digital HumanitiesDigital Humanities Lexvo.orgLexvo.org
  • 49. Taxonomic Organization a user wants a list of „Art Schools in Europe“
  • 50. Multilingual Taxonomies a Swedish user wants a list of „Konstskolor i Europa“
  • 51. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  • 52. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  • 53. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  • 54. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  • 55. Taxonomic Integration: MENTA. Predict individual taxonomic links: Article → Category, Category → WordNet. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  • 56. Taxonomic Integration: MENTA. Predict individual taxonomic links: Article → Category, Category → WordNet
  • 57. Taxonomic Integration: MENTA
  • 58. Taxonomic Integration: MENTA. Fersental (Bersntol, Valle dei Mòcheni). Image: https://de.wikipedia.org/wiki/Datei:Bersntol_palae.jpg
  • 59. Taxonomic Integration: MENTA
  • 60. Taxonomic Integration: MENTA. Fersental (Bersntol, Valle dei Mòcheni). Image: https://de.wikipedia.org/wiki/Datei:Language_distribution_Trentino_2011.png
  • 61. Taxonomic Integration: MENTA
  • 62. Taxonomic Integration: MENTA. Use Identity Constraint Algorithm to form equivalence classes. Markov Chain Random Walk with Restarts to rank parents
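The parent-ranking step named on slide 62 can be pictured with a generic random walk with restarts. This is a minimal sketch, not the MENTA implementation: the function name, the restart probability of 0.15, the edge weights, and the toy graph around “Fersental” are all assumptions made for illustration.

```python
def random_walk_with_restart(edges, start, restart=0.15, iterations=100):
    """Score nodes by a walk that jumps back to `start` with probability `restart`."""
    nodes = sorted({n for edge in edges for n in edge[:2]} | {start})
    out = {n: [] for n in nodes}
    for src, dst, weight in edges:
        out[src].append((dst, weight))

    scores = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            total = sum(w for _, w in out[n])
            if total == 0:
                # Dangling node: send all of its probability mass back to the start.
                nxt[start] += scores[n]
                continue
            for dst, w in out[n]:
                nxt[dst] += (1 - restart) * scores[n] * w / total
            nxt[start] += restart * scores[n]
        scores = nxt
    return scores


# Hypothetical taxonomic link candidates with made-up confidence weights.
edges = [
    ("Fersental", "Valleys of Trentino", 0.9),
    ("Fersental", "Geography of Italy", 0.3),
    ("Valleys of Trentino", "Valley", 1.0),
    ("Geography of Italy", "Valley", 1.0),
]
scores = random_walk_with_restart(edges, "Fersental")
candidates = ["Valleys of Trentino", "Geography of Italy"]
ranked = sorted(candidates, key=scores.get, reverse=True)
```

Candidate parents that are reachable through stronger links accumulate more probability mass and therefore rank higher.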
  • 63. Taxonomic Integration: MENTA
  • 64. Taxonomic Integration: MENTA
  • 65. New Algorithm: Structured Output Prediction. Belief Propagation exploiting Kirchhoff’s Matrix Tree Theorem for efficient handling of the tree factor. Chu-Liu-Edmonds directed spanning tree algorithm for decoding. Bansal et al., ACL 2014. Best Paper Runner-Up
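To make the Matrix Tree Theorem on slide 65 concrete: in its directed form, the total weight of all spanning trees rooted at a given node equals a determinant of the graph Laplacian with the root's row and column removed. The sketch below shows only that identity on a toy graph; it is not the belief propagation procedure of Bansal et al., and the function names and edge weights are assumptions.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant via Gaussian elimination over fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def tree_partition_sum(n, weights, root=0):
    """Sum over all spanning trees rooted at `root` of the product of edge weights.

    `weights` maps (parent, child) pairs to non-negative weights.
    """
    lap = [[0] * n for _ in range(n)]
    for (parent, child), w in weights.items():
        lap[child][child] += w   # total incoming weight on the diagonal
        lap[parent][child] -= w  # negative edge weight off the diagonal
    keep = [i for i in range(n) if i != root]
    minor = [[lap[i][j] for j in keep] for i in keep]
    return determinant(minor)


# Toy graph: three trees are rooted at node 0, with weights 2*3, 2*5 and 3*7.
toy_weights = {(0, 1): 2, (0, 2): 3, (1, 2): 5, (2, 1): 7}
total = tree_partition_sum(3, toy_weights)
```

Here `total` equals 6 + 10 + 21 = 37, which matches enumerating the three trees by hand, while the determinant avoids that enumeration on larger graphs.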
  • 66. UWN/MENTA. Biggest (ontological) taxonomy. CIKM 2010 Best Paper Award
  • 67. UWN/MENTA: multilingual extension of WordNet for word senses and taxonomical information, over 200 languages. Gerard de Melo
  • 68. Outline: Large-Scale Knowledge Graphs; Semantics in Action; Models for the Future
  • 69. Language Education
  • 74. Application: Sense-Disambiguated Example Sentences
  • 75. Application: Sense-Disambiguated Example Sentences
  • 76. Application: Sense-Disambiguated Example Sentences
  • 79. Thesauri. See also: de Melo & Weikum (2008).
  • 81. Application: Machine Translation. OpenWN-PT: Used by Google Translate
  • 85. Machine Learning. [Diagram labels: Examples; Learning; Classifier; Model; Prediction; Correct; Incorrect; Probably Incorrect!]
  • 86. Better Machine Learning. [Diagram labels: Examples; Learning; Classifier; Model; Prediction; Correct; Incorrect; Probably Incorrect!] Better Classifier! + Better Labels for Test Data
  • 87. MT?
  • 88. MT?
  • 89. MT?
  • 90. UWN Senses in MT? Issue: Senses should be less fine-grained
  • 91. No Word Left Behind Web page: http://www.buzzfeed.com/paulf24/24-signs-youre-in-a-pretty-rad-relationship-b5ra#.txpBGq4p4
  • 92. No Word Left Behind Web page: http://www.buzzfeed.com/paulf24/24-signs-youre-in-a-pretty-rad-relationship-b5ra#.txpBGq4p4
  • 93. No Word Left Behind Web page: http://www.buzzfeed.com/paulf24/24-signs-youre-in-a-pretty-rad-relationship-b5ra#.txpBGq4p4
  • 94. No Word Left Behind Web page: http://www.buzzfeed.com/paulf24/24-signs-youre-in-a-pretty-rad-relationship-b5ra#.txpBGq4p4
  • 95. Similar: Part-Of-Speech Tagging. “British fans gathered at the stadium to...”: British is an ADJECTIVE. “Didgeridoo fans gathered at the park to...”: Didgeridoo is ??? “Didgeridoo” is similar to: “horn” (NOUN), “drums” (NOUN), “accordion” (NOUN)
  • 96. Similar: Part-Of-Speech Tagging. “British fans gathered at the stadium to...”: British is an ADJECTIVE. “...Astrálach is ea an didiridiú”: didiridiú is ??? Gaelic “didiridiú” translates to “didgeridoo” (NOUN) in English
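The idea on slides 95 and 96 is to guess the tag of an unseen word from words it is known to be related to. A minimal sketch follows, assuming a hypothetical neighbour list with made-up similarity scores and a tiny tag lexicon; a real tagger would combine this signal with sentence context.

```python
from collections import defaultdict


def guess_tag(neighbours, lexicon):
    """Pick the tag with the largest similarity-weighted vote among known neighbours."""
    votes = defaultdict(float)
    for word, similarity in neighbours:
        tag = lexicon.get(word)
        if tag is not None:
            votes[tag] += similarity
    return max(votes, key=votes.get) if votes else None


# Hypothetical lexicon and neighbours (similarity values are invented).
lexicon = {"horn": "NOUN", "drums": "NOUN", "accordion": "NOUN", "british": "ADJECTIVE"}
neighbours = [("horn", 0.7), ("drums", 0.6), ("accordion", 0.6)]
tag = guess_tag(neighbours, lexicon)
```

The same function covers the cross-lingual case on slide 96 if the neighbour list holds translations instead of distributionally similar words.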
  • 100. What about Document-Level Tasks? Public Domain Image from https://pixabay.com/en/book-text-read-paper-education-451067/
  • 101. Document Level. Word features for “10th street new york jaguar show”: “new” 1.0, “york” 1.0, “jaguar” 1.0, “automobile” 0.0, “car” 0.0, “10th” 1.0, “street” 1.0, “show” 1.0, ... Similar under word features: “10th New show in York”, “New Jaguar show”, “Show New Street in York”. Concept features for “10th street new york jaguar show”: New_York 1.0, Jaguar (car) 0.0, Jaguar (animal) 1.0, Automobile/Car 0.0, 10th Street 1.0, Performance 1.0, ... Similar under concept features: “10th street nyc jaguar show”
  • 102. Document Level. Word features for “10th street new york jaguar show”: “new” 1.0, “york” 1.0, “jaguar” 1.0, “automobile” 0.0, “car” 0.0, “10th” 1.0, “street” 1.0, “show” 1.0, ... Similar under word features: “10th New show in York”, “New Jaguar show”, “Show New Street in York”. Concept features for “10th street new york jaguar show”: New_York 1.0, Jaguar (car) 0.0, Jaguar (animal) 1.0, Automobile/Car 0.0, 10th Street 1.0, Performance 1.0, ... Expansion (de Melo & Siersdorfer 2007): Animal 0.5, Vehicle 0.0. Similar under concept features: “10th street nyc jaguar show”, “10th street nyc animal show”, “Exposición de jaguares Nueva York”
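The expansion step on slide 102 can be pictured as passing part of each concept's weight to its more general parents. This is a minimal sketch under assumptions: the `hypernyms` table is hypothetical, and the decay of 0.5 is chosen only to reproduce the “Animal 0.5” value on the slide; the weighting actually used in de Melo & Siersdorfer 2007 may differ.

```python
def expand(concepts, hypernyms, decay=0.5):
    """Add parent concepts with a discounted share of each child's weight."""
    features = dict(concepts)
    for concept, weight in concepts.items():
        for parent in hypernyms.get(concept, []):
            features[parent] = max(features.get(parent, 0.0), decay * weight)
    return features


# Concept features for "10th street new york jaguar show" (from the slide).
concepts = {
    "New_York": 1.0,
    "Jaguar (animal)": 1.0,
    "10th Street": 1.0,
    "Performance": 1.0,
}
# Hypothetical taxonomy fragment.
hypernyms = {"Jaguar (animal)": ["Animal"], "Jaguar (car)": ["Vehicle"]}
features = expand(concepts, hypernyms)
```

Because “Animal” now carries weight, a document such as “10th street nyc animal show” shares a feature with the original text even though the two have no animal word in common.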
  • 103. Multilingual Tasks: Cross-Lingual Text Classification. Given: training documents with class labels. Goal: guess class labels for test documents in some other language. Result: better than plain machine translation. See de Melo & Siersdorfer 2007.
  • 104. Sentence-Level Semantics. Capture the “who-did-what-to-whom”. Microsoft bought the patent from Nokia. Nokia sold the patent to Microsoft. The patent was acquired by Microsoft [from Nokia]. The patent was sold [by Nokia] to Microsoft. Underlying frame: Commercial transfer. Buyer: Microsoft. Seller: Nokia. Product: The patent
  • 105. FrameBase.org. Bringing knowledge into a standard form based on natural language (FrameNet). ESWC 2015 Best Student Paper Nominee
  • 106. Relation Integration. X isAuthorOf Y; Y writtenBy X; X wrote Y; Y writtenInYear Z. ESWC 2015 Best Student Paper Nominee
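Slide 106 lists several surface predicates that express the same underlying relation with different argument orders. A minimal sketch of normalizing them into one frame-like record is given below; the rule table, the frame label `Text_creation`, and the role names are illustrative assumptions, not FrameBase's actual schema or integration rules.

```python
# Hypothetical mapping: predicate -> (role of the subject, role of the object).
RULES = {
    "isAuthorOf": ("Author", "Text"),
    "wrote": ("Author", "Text"),
    "writtenBy": ("Text", "Author"),
}


def to_frame(subject, predicate, obj):
    """Rewrite a (subject, predicate, object) triple as a role-labelled record."""
    subject_role, object_role = RULES[predicate]
    return {"frame": "Text_creation", subject_role: subject, object_role: obj}


a = to_frame("X", "isAuthorOf", "Y")
b = to_frame("Y", "writtenBy", "X")
c = to_frame("X", "wrote", "Y")
```

All three triples end up as the same record, so knowledge bases that chose different predicates can be queried uniformly.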
  • 107. Relation Integration. YAGO: isMarriedTo predicate. Freebase: Marriage Entity. Challenge: Modelling Differences
  • 108. Search Interfaces. “Which companies were created during the last century in Silicon Valley?” YAGO2: WWW 2011 Best Demo Award. Gerard de Melo
  • 109. Answering Questions IBM's Jeopardy!-winning Watson system Gerard de Melo
  • 110. Answering Questions IBM's Jeopardy!-winning Watson system Gerard de Melo
  • 111. What Goes into Word Vectors? Jiaqiang Chen and Gerard de Melo 2015
  • 112. What Goes into Word Vectors? The Roman Empire was remarkably multicultural, with “a rather astonishing cohesive capacity” to create a sense of shared identity while encompassing diverse peoples within its political system over a long span of time. Jiaqiang Chen and Gerard de Melo 2015
  • 113. What Goes into Word Vectors? The Roman Empire was remarkably multicultural, with “a rather astonishing cohesive capacity” to create a sense of shared identity while encompassing diverse peoples within its political system over a long span of time. Context annotations: syntactic. Jiaqiang Chen and Gerard de Melo 2015
  • 114. What Goes into Word Vectors? The Roman Empire was remarkably multicultural, with “a rather astonishing cohesive capacity” to create a sense of shared identity while encompassing diverse peoples within its political system over a long span of time. Context annotations: syntactic; semantic! Jiaqiang Chen and Gerard de Melo 2015
  • 115. What Goes into Word Vectors? The Roman Empire was remarkably multicultural, with “a rather astonishing cohesive capacity” to create a sense of shared identity while encompassing diverse peoples within its political system over a long span of time. Context annotations: syntactic; semantic!; syntactic? Jiaqiang Chen and Gerard de Melo 2015
  • 116. What Goes into Word Vectors? The Roman Empire was remarkably multicultural, with “a rather astonishing cohesive capacity” to create a sense of shared identity while encompassing diverse peoples within its political system over a long span of time. Context annotations: syntactic; semantic!; syntactic?; ? Jiaqiang Chen and Gerard de Melo 2015
  • 117. What Goes into Word Vectors? The Roman Empire was remarkably multicultural, with “a rather astonishing cohesive capacity” to create a sense of shared identity while encompassing diverse peoples within its political system over a long span of time. Context annotations: syntactic; semantic!; syntactic?; ? Word2Vec Solution: Subsampling. Jiaqiang Chen and Gerard de Melo 2015
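For reference, the subsampling heuristic described in the word2vec paper (Mikolov et al., 2013) discards each occurrence of a word w with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold. The sketch below assumes t = 1e-5 and uses made-up frequencies; released word2vec implementations use a slightly different variant of the formula.

```python
import math


def discard_probability(frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word with the given relative frequency."""
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


# Hypothetical relative frequencies: a function word versus a rare content word.
p_the = discard_probability(0.05)
p_rare = discard_probability(1e-6)
```

Very frequent words such as “the” are dropped most of the time, while rare words are always kept, which reduces the share of purely syntactic contexts without inspecting what the contexts mean.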
  • 118. Word2Vec Approach: Take everything we can get. Image: Alexandre Duret-Lutz https://www.flickr.com/photos/gadl/110845690/
  • 119. Our Proposal: Extract the Most Valuable Parts. Image: Theological Hall, Strahov Monastery Library, Prague
  • 120. Our Proposal: Extract the Most Valuable Parts. “…Greek and Roman mythology...”: semantic! Look for semantically salient contexts in text! Jiaqiang Chen and Gerard de Melo 2015
  • 121. Two Worlds. Distributional Semantics: Use all available text. (Symbolic) Information Extraction: Look for valuable connections. Jiaqiang Chen and Gerard de Melo 2015
  • 122. Proposed Research Program: Joint Training. Better Word Embeddings. Jiaqiang Chen and Gerard de Melo 2015. Best Paper Award at NAACL 2015 Vector Space Modeling Workshop
  • 123. Proposed Research Program: Joint Training. Better Word Embeddings. Use parallel threads. Jiaqiang Chen and Gerard de Melo 2015
  • 124. Preliminary Experiments: Joint Training. Recently lots of related work, e.g. Faruqui et al., Hill & Korhonen, Wang et al., Johansson & Nieto Piña. Jiaqiang Chen and Gerard de Melo 2015
  • 125. Preliminary Experiments: Joint Training. Jiaqiang Chen and Gerard de Melo 2015
  • 126. Preliminary Experiments: Joint Training. Use negative sampling. Jiaqiang Chen and Gerard de Melo 2015
  • 127. Preliminary Experiments: Information Extraction. Variant 1: Definition Extraction. Jiaqiang Chen and Gerard de Melo 2015
  • 128. Preliminary Experiments: Information Extraction. Variant 1: Definition Extraction. Definitions (source: GCIDE): befuddle: to becloud and confuse as with liquor; befuddled: dazed by alcoholic drink; befuddled: confused and vague, used especially of thinking; beg: to ask earnestly for, to entreat or supplicate for, to beseech. Jiaqiang Chen and Gerard de Melo 2015
  • 129. Preliminary Experiments: Information Extraction. Variant 1: Definition Extraction. Synonyms (source: GCIDE): effectual: effectual, efficacious, effective; effectuality: effectiveness, effectivity, effectualness; efficacious: effectual; efficaciousness: efficacy. Jiaqiang Chen and Gerard de Melo 2015
  • 130. Preliminary Experiments: Information Extraction. Variant 2: List Extraction. Jiaqiang Chen and Gerard de Melo 2015
  • 131. Preliminary Experiments: Information Extraction. Variant 2: List Extraction. ● Look for repeated occurrences of commas ● Short units of roughly equal length ● Noun phrases, adjectives. Jiaqiang Chen and Gerard de Melo 2015
  • 132. Preliminary Experiments: Information Extraction. Variant 2: List Extraction. ● Look for repeated occurrences of commas ● Short units of roughly equal length ● Noun phrases, adjectives ● Also: Hearst patterns, e.g. “cities such as New York, London, ...” Jiaqiang Chen and Gerard de Melo 2015
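The list-extraction cues on slides 131 and 132 can be approximated with a few lines of string handling. This is a rough sketch, not the extractor used in the experiments: the function names, the three-item minimum, the three-word cap standing in for “short units of roughly equal length”, and the single “such as” pattern are all assumptions.

```python
import re


def comma_list(text, min_items=3, max_words=3):
    """Return the comma-separated units if they look like a list, else None."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) >= min_items and all(0 < len(p.split()) <= max_words for p in parts):
        return parts
    return None


def hearst_such_as(sentence):
    """Match one Hearst pattern: '<hypernym> such as A, B, and C'."""
    match = re.search(r"(\w+) such as ([^.;]+)", sentence)
    if not match:
        return None
    items = [i.strip() for i in re.split(r",|\band\b|\bor\b", match.group(2)) if i.strip()]
    return match.group(1), items


roles = comma_list("player, captain, manager, director")
cities = hearst_such_as("He visited cities such as New York, London, and Paris.")
```

Both functions return groups of words that tend to belong to the same semantic class, which is the kind of signal the slides propose to feed into embedding training.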
  • 133. Preliminary Experiments: Information Extraction. Variant 2: List Extraction. Extracted Lists: player captain manager director vice-chairman group race culture religion organisation person person Italian Mexican Chinese Creole French Self-Portraits Portraits iris Still-Lives with Sunflowers view from the Asylum Works after Millet Vineyards ballscrews leadscrews worm gear screwjacks linear actuator Cleveland Essex Lincolnshire Northamptonshire Nottinghamshire Thames Valley South Wales ant.py dimdriver.py dimdriverdatafile.py dimdriverdatasetdef.py dimexception.py dimmaker.py dimoperators.py dimparser.py dimrex.py dimension.py. Jiaqiang Chen and Gerard de Melo 2015
  • 134. Preliminary Experiments: Setup. Wikipedia 2010; normalized to lower case, special characters removed; contains 1,205,009,010 words; words appearing at least 50 times are selected; vocabulary size 220,521. Jiaqiang Chen and Gerard de Melo 2015
  • 135. Preliminary Experiments: Setup. Wikipedia 2010; normalized to lower case, special characters removed; contains 1,205,009,010 words; words appearing at least 50 times are selected; vocabulary size 220,521. Vector dim. 300. Balance components simply by controlling starting learning rates: 0.05 for CBOW, varying rates for extracted information. Jiaqiang Chen and Gerard de Melo 2015
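The preprocessing described in the setup (lower-casing, removing special characters, keeping words above a frequency cutoff) might look roughly like the sketch below. The exact notion of “special characters” is not given on the slide, so the regular expression here is an assumption, as is the function name.

```python
import re
from collections import Counter


def build_vocabulary(lines, min_count=50):
    """Lower-case, strip non-alphanumeric characters, and keep frequent words."""
    counts = Counter()
    for line in lines:
        cleaned = re.sub(r"[^a-z0-9\s]", " ", line.lower())
        counts.update(cleaned.split())
    return {word for word, count in counts.items() if count >= min_count}


# Tiny illustration with a lowered cutoff; the slide uses a cutoff of 50.
vocabulary = build_vocabulary(["The Roman Empire!", "the empire"], min_count=2)
```

With the slide's cutoff of 50 on the 2010 Wikipedia dump, the reported vocabulary size is 220,521 words.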
  • 136. Preliminary Experiments: Results on WS353. Positive effect from 0.001 until around 0.04. Jiaqiang Chen and Gerard de Melo 2015
  • 137. Preliminary Experiments: Example. Jiaqiang Chen and Gerard de Melo 2015
  • 138. Preliminary Experiments: Example. Jiaqiang Chen and Gerard de Melo 2015
  • 139. Outline: Large-Scale Knowledge Graphs; Semantics in Action; Models for the Future
  • 140. History Repeating? SMT: Phrase-Based SMT, Hierarchical Phrases, WSD, MEANT etc. NMT: Extended NMT?
  • 142. Future: Learning Common-Sense. Source: The New Yorker
  • 143. Learning Common-Sense. WebChild: AAAI 2014, WSDM 2014, AAAI 2011
  • 144. Lexical Intensity Orderings. Words: hot, warm, fiery, scorching, ordered with “<” from weak to strong. TACL 2013
  • 145. Knowlywood: Human Activities. CIKM 2015
  • 146. Extension to Relationships. Image: http://www.wikihow.com/Read-a-Book-to-a-Baby-or-Infant#/Image:Read-a-Book-to-a-Baby-or-Infant-Step-5.jpg
  • 147. Extension to Relationships. [Plot of word points: petronia, sparrow, bird, parched, arid, dry] Image: http://www.wikihow.com/Read-a-Book-to-a-Baby-or-Infant#/Image:Read-a-Book-to-a-Baby-or-Infant-Step-5.jpg
  • 148. Extension to Relationships. [Plot of word points: petronia, sparrow, bird, parched, arid, dry] Should account for relationships (incl. affordances, causality, etc.). Image: http://www.wikihow.com/Read-a-Book-to-a-Baby-or-Infant#/Image:Read-a-Book-to-a-Baby-or-Infant-Step-5.jpg
  • 149. Extension to Relationships. Assume that she is learning just from text
  • 150. Information Extraction from Text. 1. Gather large amounts of patterns. 2. Use Web-scale data (Google N-Grams, derived from 10^12 words of text). Hearst-style bootstrapping with large numbers of seeds. Gerard de Melo
  • 151. Extension to Relationships. Commonsense word relationships extracted from Google 1T n-grams. 24 relations bootstrapped via ConceptNet → 1,158,141 triples. Jiaqiang Chen, Niket Tandon, Gerard de Melo. WI 2015
  • 152. Extension to Relationships. Examples: earring hasProperty gorgeous; concept definedAs theory; sonar partOf submarine; predator desires food. Commonsense word relationships extracted from Google 1T n-grams. 24 relations bootstrapped via ConceptNet → 1,158,141 triples. Jiaqiang Chen, Niket Tandon, Gerard de Melo. WI 2015
  • 153. Extension to Relationships. Jiaqiang Chen, Niket Tandon, Gerard de Melo. WI 2015
  • 154. Extension to Relationships. What causes cancer?
  • 155. Extension to Relationships. Can cats fly?
  • 156. Summary. Large-Scale Knowledge Graphs ► Universal WordNet/MENTA: large multilingual taxonomy ► Etymological WordNet. Semantics in Action, e.g. ► Lexvo.org ► Question Answering with YAGO. Future Perspectives ► Vector Representations ► Common-Sense for NLU. More Information: www.demelo.org gdm@demelo.org Gerard de Melo