1. Machine reading for
the Semantic Web
Aldo Gangemi and Valentina Presutti
STLab, ISTC-CNR, Italy
2. Semantic Technology Laboratory
Institute of Cognitive Sciences and Technologies
Consiglio Nazionale delle Ricerche
Italy
Misael Mongiovì
Daria Spampinato
Aldo Gangemi
Head of the Lab
Valentina Presutti
Andrea Nuzzolese
Sergio Consoli
Luigi Asprino
Giorgia Lodi
Diego Reforgiato
Martina Sangiovanni
Paolo Ciancarini
3. COMPETENCES
Ontology Design
Linked Open Data design and publishing
Knowledge Extraction
Machine reading
Opinion Mining and Sentiment analysis
Software architectures for knowledge-intensive applications
Large-scale data integration
6. The Web as an integration hub
[Diagram labels: WWW; Information Resources; Data; Non-Information Resources; Agents; Web 1.0; Web 2.0; Web 3.0, aka Semantic Web]
“John Coltrane”
this is the guy
this is a picture
John Coltrane, also known as Trane, was an American jazz saxophonist and composer. Working in the bebop and hard bop idioms early in his career, Coltrane helped pioneer the use of modes in jazz and was later at the forefront of free jazz.
8. Limits and motivation
• Most of the Web of Data is derived from structured data (typically databases) or semi-structured data (e.g. Wikipedia infoboxes)
• Web content is mostly natural language text (web sites, news, forums, reviews, etc.)
• Such content is highly valuable for the Semantic Web (question answering, opinion mining, knowledge summarization, etc.)
10. John Coltrane, also known as Trane, was an
American jazz saxophonist and composer.
Working in the bebop and hard bop idioms
early in his career, Coltrane helped pioneer
the use of modes in jazz and was later at the
forefront of free jazz.
?
11. Open Knowledge Extraction
To extract as much relevant knowledge as possible from web textual content and publish it in the form of Semantic Web triples
unsupervised, open domain, grounded
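As a rough illustration of that goal, the sketch below hand-writes a few triples that an extractor might produce from the John Coltrane sentence and serializes them as N-Triples. The DBpedia resource URI exists, but the example.org namespace, the property name `alsoKnownAs`, and the choice of triples are assumptions made for illustration; they are not the output of any tool described in this talk.

```python
# Only the DBpedia namespace refers to an existing dataset.
DBR = "http://dbpedia.org/resource/"
EX = "http://example.org/oke/"  # ASSUMPTION: placeholder namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hand-written triples for: "John Coltrane, also known as Trane, was an
# American jazz saxophonist and composer."
triples = [
    (DBR + "John_Coltrane", RDF_TYPE, EX + "JazzSaxophonist"),
    (DBR + "John_Coltrane", RDF_TYPE, EX + "Composer"),
    (DBR + "John_Coltrane", EX + "alsoKnownAs", "Trane"),
]

def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # Objects starting with "http" are treated as IRIs, others as literals.
        if o.startswith("http"):
            obj = "<" + o + ">"
        else:
            obj = '"' + o.replace('"', '\\"') + '"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```

A real pipeline would also have to ground each term in an existing vocabulary rather than mint names by hand, which is the "grounded" requirement named on the slide.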
12. John Coltrane, also known as Trane, was an
American jazz saxophonist and composer.
Working in the bebop and hard bop idioms
early in his career, Coltrane helped pioneer
the use of modes in jazz and was later at the
forefront of free jazz.
knowledge extraction, entity and data linking, data enrichment
14. Approaches to linked data
and ontology learning
• Most attention goes to enriching the Web of Data by learning standard relations, e.g. membership (rdf:type), class taxonomy (rdfs:subClassOf), entity linking (owl:sameAs)
• What about general factual relations?
• e.g. roles in events, part, participation, causality,
location, friendship, etc.
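To make the distinction concrete, here is a minimal sketch that partitions a handful of triples into the standard relations named above and everything else. All `ex:` terms and the example triples are hypothetical and hand-written for illustration.

```python
# Predicates the slide lists as "standard" relations (prefixed names).
STANDARD = {"rdf:type", "rdfs:subClassOf", "owl:sameAs"}

# Hand-written example triples; the ex: terms are hypothetical.
triples = [
    ("ex:JohnColtrane", "rdf:type", "ex:Saxophonist"),
    ("ex:Saxophonist", "rdfs:subClassOf", "ex:Musician"),
    ("ex:JohnColtrane", "owl:sameAs", "dbpedia:John_Coltrane"),
    ("ex:BlackHand", "ex:agentOf", "ex:assassination_1"),
    ("ex:assassination_1", "ex:hasLocation", "dbpedia:Sarajevo"),
]

def split_by_relation_kind(triples):
    """Partition triples into standard (typing, taxonomy, linking) and factual."""
    standard = [t for t in triples if t[1] in STANDARD]
    factual = [t for t in triples if t[1] not in STANDARD]
    return standard, factual

standard, factual = split_by_relation_kind(triples)
print(len(standard), "standard;", len(factual), "factual")
```

The factual group (roles in events, location, and so on) is the part that standard ontology learning tends to leave out.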
15. • The Black Hand might not have decided to
barbarously assassinate Franz Ferdinand after
he arrived in Sarajevo on June 28th, 1914
Annotations: events, negation, modality, participants, more participants, quality, coreference, event relation, date
need for “deep” machine reading
16. Open Information Extraction
pc5: NLPapps mac$ java -Xmx512m -jar reverb-latest.jar <<<"The Black Hand might
not have decided to barbarously assassinate Franz Ferdinand after he arrived in
Sarajevo on June 28th, 1914."
Initializing ReVerb extractor...Done.
Initializing confidence function...Done.
Initializing NLP tools...Done.
Starting extraction.
stdin 1 he arrived in Sarajevo 13 14 14 16 16
10.2200632195721161 The Black Hand might not have decided to barbarously
assassinate Franz Ferdinand after he arrived in Sarajevo on June 28th , 1914 .
DT NNP NNP MD RB VB VBN TO RB VB NNP NNP IN PRP VBD IN NNP IN NNP JJ , CD .
B-NP I-NP I-NP B-VP I-VP I-VP I-VP I-VP I-VP I-VP B-NP I-NP B-SBAR B-NP B-VP
B-PP B-NP B-PP B-NP I-NP I-NP I-NP O he arrive in sarajevo
Done with extraction.
Summary: 1 extractions, 1 sentences, 0 files, 1 seconds
18. LOD and ODP design
Aligned to WordNet,
VerbNet, FrameNet,
DOLCE+DnS,
DBpedia, schema.org
http://wit.istc.cnr.it/stlab-tools/fred
“The Semantic Web will extremely love FRED’s reading”
RESTful, Python lib
Earmark, NIF
RDF, OWL
Apache Stanbol
DRT- and Frame-based
High EE and RE accuracy
FRED integrates
NER, SenseTagging, WSD, Tax. Ind.,
Relation/Event/Role Extraction
19. Machine reading to RDF
“The Black Hand might not have decided to barbarously assassinate Franz Ferdinand after he arrived in Sarajevo on June 28th, 1914”
Annotations: type induction, negation, modality, taxonomy induction, semantic roles, NER, indirect type induction, events, qualities, tense representation, WSD/alignment, event relations
+ configurable namespaces and Earmark/NIF text spans with semiotic relations to graph entities (denotes, hasInterpretant)
24. Landscape analysis of KE tools
• Results hint at FRED performing best on term, relation, and event extraction, taxonomy induction, and frame detection
• terminology extraction F1 = .87
• taxonomy induction F1 = .83
• relation extraction F1 = .76
• frame detection F1 = .93
• event detection F1 = .82
26. Evaluation of frame detection against FrameNet corpus
• Precision is equivalent (p = .75) to that of the state-of-the-art tool (Semafor); recall is lower (r = .58 against .75), but Semafor was trained on the corpus itself
• FRED is one order of magnitude faster
• FRED’s frame occurrences are formally
represented
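The slide gives precision and recall but no combined score. Assuming the usual balanced F-measure (the harmonic mean of the two), the quoted figures can be combined as in the sketch below; the resulting F1 values are derived here and are not stated in the talk.

```python
def f1(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall figures quoted on the slide.
fred = f1(0.75, 0.58)
semafor = f1(0.75, 0.75)

print(f"FRED F1 = {fred:.2f}, Semafor F1 = {semafor:.2f}")
```

Under that assumption the gap is roughly .65 against .75, driven entirely by recall.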
27. Evaluation of FRED-based Tìpalo typing tool
• Tìpalo is a tool that automatically creates type taxonomies for entities, based on the natural language definitions given in their corresponding Wikipedia pages
• Evaluation on a corpus of Wikipedia resources:
• F1 = .92 for entity typing
• F1 = .75 if state-of-the-art WSD is considered
28. Evaluation of FRED-based Sentilo sentiment analysis tool
• Sentilo identifies opinion holders, detects topics, and scores opinions
• Evaluation on a corpus of user-based hotel reviews:
• F1 = .95 for holder detection
• F1 = .66 for topic detection
• F1 = .80 for subtopic detection
• .81 is the correlation with the open-rating 5-star scores given for reviews
29. Research challenges and applications
• Open Knowledge Extraction
• Semantic Web Machine Reading
• Semantic Sentiment Analysis
• Complex Relation Extraction and Representation
• Abstractive Summarization
• Robotic natural language understanding
• Integration of action schemas, linked data, and OKE frames
• Social practices and norms
• Irony
• Modality
30. Semantic Web Machine Reading
Miles Davis was an American jazz musician.
Ongoing research:
temporal series of graphs,
motif-based evaluation
32. [Figure labels: DBpedia resource; extracted type; disambiguated sense; inferred superclass; linked resource; disambiguated WordNet sense; linking to WN supersenses; linking to DOLCE]
Rich taxonomical data!
A chaise longue is an upholstered sofa in the shape of a chair that is long enough to support the legs.
A chaise longue (English /ˌʃeɪz ˈlɔːŋ/; French pronunciation: [ʃɛzlɔ̃ŋɡ(ə)], "long chair") is an upholstered sofa in the shape of a chair that is long enough to support the legs.
http://en.wikipedia.org/wiki/Chaise_longue
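The "extracted type" and "inferred superclass" labels can be illustrated with a deliberately naive sketch. It uses a single regular expression over definitions of the form "A X is a Y ..." and takes the last word of the type phrase as the superclass. This is a toy approximation written for this example, not the method used by FRED or Tìpalo, which rely on deep parsing, word sense disambiguation, and alignment to existing ontologies.

```python
import re

# Toy pattern for definitions of the form "A <term> is a/an <type> ...".
DEFINITION = re.compile(
    r"^(?:An?|The)\s+(?P<term>.+?)\s+is\s+an?\s+(?P<type>.+?)"
    r"(?:\s+(?:in|that|which|of)\b|[.,]|$)",
    re.IGNORECASE,
)

def camel(phrase):
    """Turn 'upholstered sofa' into 'UpholsteredSofa'."""
    return "".join(word.capitalize() for word in phrase.split())

def type_from_definition(sentence):
    """Return (subject, predicate, object) triples guessed from a definition."""
    match = DEFINITION.match(sentence)
    if match is None:
        return []
    entity = camel(match.group("term"))
    extracted = camel(match.group("type"))
    triples = [(entity, "rdf:type", extracted)]
    # Naive head-noun heuristic: the last word names the superclass.
    head = camel(match.group("type").split()[-1])
    if head != extracted:
        triples.append((extracted, "rdfs:subClassOf", head))
    return triples

sentence = ("A chaise longue is an upholstered sofa in the shape of a chair "
            "that is long enough to support the legs.")
for triple in type_from_definition(sentence):
    print(triple)
```

The pattern fails on the second, longer Wikipedia sentence because of the parenthetical, which hints at why surface patterns alone are not enough.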
33. Graph series and reconciliation
Elwood Buchanan slapped Miles Davis' knuckles every time Miles was using heavy vibrato.
Miles Davis was an American jazz musician.
Open issues: generalized reconciliation with relevance
34. Semantic Sentiment Analysis
Miles Davis hated Betty Mabry because
of her radical promiscuity.
Open issues: better
sentic resources,
contextual scoring, etc.
35. Complex Relation Extraction and Representation
Abstractive Summarization
Open issues: coverage outside DBpedia, better
naming, skolemized entities, alignment with
existing properties, clustering of generated
properties
George E. Krug graduated from Lafayette College in Easton, Pennsylvania, in the
class of 1884. He went on to study architecture in Philadelphia, at the Fine Arts
Institute of the University of Pennsylvania.
36. Event Extraction
• Event identity: FRED focuses on events expressed by verbs, propositions, terms, and named entities (possibly resolved), as well as on event graphs
• Event classification: FRED uses Linked Data-oriented induction of types for identified events, reusing e.g. VerbNet, WordNet, DBpedia, schema.org, and DOLCE+DnS as reference ontologies
• Event unity: FRED applies semantic role labeling to verbs and propositions (“situations”) in order to detect event boundaries, and frame detection for resolving roles against a shared event ontology (VerbNet, FrameNet, ...)
• Event modifiers: FRED extracts logical negation, basic modalities, and adverbial qualities, applied to verbs and propositions, which can also be used as event judgment indicators
• Event relations: FRED relates events via the role structure of verbs and propositions, and extracts tense and entailment relations between them
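The five facets of event extraction listed on this slide can be pictured as fields of a simple record. The sketch below is a hand-built reading of the Black Hand sentence; the class, the role names, and the identifiers are all assumptions for illustration and do not reproduce FRED's actual RDF vocabulary. Resolving "he" to Franz Ferdinand is likewise assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Event:
    """Toy container for the event facets listed above (not FRED's model)."""
    identifier: str                       # event identity
    event_type: str                       # event classification
    roles: Dict[str, str] = field(default_factory=dict)    # event unity
    negated: bool = False                 # modifier: logical negation
    modality: Optional[str] = None        # modifier: basic modality
    qualities: List[str] = field(default_factory=list)     # adverbial qualities
    related: Dict[str, str] = field(default_factory=dict)  # event relations

# Hand-built reading of the example sentence; all labels are illustrative.
arrive = Event("arrive_1", "Arrive",
               roles={"theme": "Franz_Ferdinand", "destination": "Sarajevo",
                      "time": "1914-06-28"})
assassinate = Event("assassinate_1", "Assassinate",
                    roles={"agent": "Black_Hand", "patient": "Franz_Ferdinand"},
                    qualities=["barbarous"])
decide = Event("decide_1", "Decide",
               roles={"agent": "Black_Hand", "theme": "assassinate_1"},
               negated=True, modality="possible",
               related={"after": "arrive_1"})

for event in (arrive, assassinate, decide):
    print(event.identifier, event.roles, event.negated, event.modality)
```

Note that the negation and modality attach to the deciding event, not to the assassination, which is the kind of scoping a shallow extractor tends to lose.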
37. e.g. from:
“The Black Hand might not have decided to barbarously assassinate Franz Ferdinand after he arrived in Sarajevo on June 28th, 1914”
Annotations: type induction, negation, modality, taxonomy induction, semantic roles, NER, indirect type induction, events, qualities, tense representation, WSD/alignment, event relations
40. OKE named graphs
• Why open-world? Because it’s the web, with incomplete
knowledge
• Integration between NLP and SW
• “The Black Hand might not have decided to barbarously assassinate
Franz Ferdinand after he arrived in Sarajevo on June 28th, 1914”
• d:Sarajevo
• d:Sarajevo :locatedIn d:FormerAustrianEmpire , d:Bosnia
• d:Bosnia :partOf d:Yugoslavia
41. OKE named graphs: ng1914
42. OKE named graphs: ng1929
43. OKE named graphs: ng1995
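The graph labels ng1914, ng1929, and ng1995 suggest that the same entities carry different facts in different contexts. A minimal sketch of that idea follows, storing triples per named graph in plain Python. How the slide's triples are distributed over the three graphs is an assumption made here for illustration; the talk does not spell it out.

```python
from collections import defaultdict

# Quad store sketch: graph name -> set of (subject, predicate, object).
graphs = defaultdict(set)

def add(graph, s, p, o):
    """Assert a triple inside a named graph."""
    graphs[graph].add((s, p, o))

# ASSUMPTION: this split of the slide's triples across graphs is illustrative.
add("ng1914", "d:Sarajevo", ":locatedIn", "d:FormerAustrianEmpire")
add("ng1929", "d:Sarajevo", ":locatedIn", "d:Bosnia")
add("ng1929", "d:Bosnia", ":partOf", "d:Yugoslavia")
add("ng1995", "d:Sarajevo", ":locatedIn", "d:Bosnia")

def objects(graph, s, p):
    """Look up the objects of (s, p) within one named graph only."""
    return sorted(o for (s2, p2, o) in graphs[graph] if (s2, p2) == (s, p))

for name in ("ng1914", "ng1929", "ng1995"):
    print(name, objects(name, "d:Sarajevo", ":locatedIn"))
```

Because each query is scoped to one graph, a fact missing from a graph is simply unknown there, in line with the open-world reading given on the slide.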
44. • More with events, relations, tense representation, sentiment …
• “The Black Hand might not have decided to barbarously assassinate
Franz Ferdinand after he arrived in Sarajevo on June 28th, 1914”
• …
• …
45. Robotic machine reading challenges …
“my birthday is on
Saturday”
“ok, then on
March 10th you’ll be
one year older”
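The robot's reply requires grounding "Saturday" in a calendar date. A minimal sketch of that single step is shown below; the reference date is an assumption chosen so that the coming Saturday falls on March 10th, since the talk does not say when the dialogue takes place.

```python
from datetime import date, timedelta

SATURDAY = 5  # date.weekday(): Monday is 0, Sunday is 6

def next_weekday(reference, weekday):
    """Return the first date on or after `reference` that falls on `weekday`."""
    days_ahead = (weekday - reference.weekday()) % 7
    return reference + timedelta(days=days_ahead)

# ASSUMPTION: the conversation takes place on Wednesday 7 March 2018.
today = date(2018, 3, 7)
birthday = next_weekday(today, SATURDAY)
print(birthday.isoformat())
```

Getting the date right is the easy part; the literal answer still misses what a person would do with the information, which is the point of the next slide.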
46. Robotic machine reading challenges …
“my birthday is on Saturday”
“would you like to have a party?”
Open issues: integration of action schemas with linked data and OKE frames, social practices and norms, irony, modality, …
47. Related publications
• Valentina Presutti, Francesco Draicchio and Aldo Gangemi. Knowledge Extraction based on Discourse
Representation Theory and Linguistic Frames. A. ten Teije and J. Völker (eds.): Proceedings of the
Conference on Knowledge Engineering and Knowledge Management (EKAW2012), LNCS, Springer, 2012
• Aldo Gangemi, Andrea Giovanni Nuzzolese, Valentina Presutti, Francesco Draicchio, Alberto Musetti and
Paolo Ciancarini. Automatic Typing of DBpedia Entities. Proceedings of ISWC2012, the Tenth International
Semantic Web Conference, LNCS, Springer, 2012
• Aldo Gangemi. A Comparison of Knowledge Extraction Tools for the Semantic Web. Proceedings of
ESWC2013, LNCS, Springer, 2013
• Aldo Gangemi, Francesco Draicchio, Valentina Presutti, Andrea Giovanni Nuzzolese, Diego Reforgiato. A
Machine Reader for the Semantic Web. Proceedings of ISWC2013, Springer, 2013
• Aldo Gangemi, Valentina Presutti, Diego Reforgiato Recupero. Frame-based detection of opinion holders and topics: a model and a tool. IEEE Computational Intelligence Magazine, 9(1), 2014
• Valentina Presutti, Sergio Consoli, Andrea Giovanni Nuzzolese, Diego Reforgiato Recupero, Aldo Gangemi,
Ines Bannour, Haïfa Zargayouna. Uncovering the semantics of Wikipedia Pagelinks. Proceedings of the
Conference on Knowledge Engineering and Knowledge Management (EKAW2014), Springer, Berlin, 2014
• Diego Reforgiato Recupero, Valentina Presutti, Sergio Consoli, Aldo Gangemi, Andrea Giovanni Nuzzolese. Sentilo: Frame-Based Sentiment Analysis. Cognitive Computation, http://dx.doi.org/10.1007/s12559-014-9302-z, 2014