This presentation was given at the 8th International Natural Language Generation Conference (INLG 2014), Philadelphia, Pennsylvania, and is related to the publication of the same title.
With the rise of the Semantic Web, more and more data become available encoded using the Semantic Web standard RDF. This representation is geared towards machines: designed to be easily processable by machines, it is difficult for non-experts to understand. Transforming RDF data into human-comprehensible text would help non-experts to assess this information. In this paper we present a language-independent method for extracting RDF verbalization templates from a parallel corpus of text and data. Our method is based on distant-supervised simultaneous multi-relation learning and frequent maximal subgraph pattern mining. We demonstrate the feasibility of this method on a parallel corpus of Wikipedia articles and DBpedia data for English and German.
A preprint of the publication is available at http://km.aifb.kit.edu/sites/bridge-patterns/Ell_Harth_INLG2014_preprint.pdf
A language-independent method for the extraction of RDF verbalization templates
1. KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
1 Institute of Applied Informatics and Formal Description Methods (AIFB), Karlsruhe, Germany
www.kit.edu
A language-independent method for the extraction of
RDF verbalization templates
Basil Ell,1 Andreas Harth1
8th International Natural Language Generation Conference
20 June 2014, Philadelphia, PA, USA
Motivation
More and more data openly available as RDF
Basil Ell - A language-independent method for the extraction of RDF verbalization templates
Linked Open Data initiative
Search engine (keywords, questions, etc.): text via NLG
Encyclopedia or Google Knowledge Graph: textual description of a thing via NLG
Example RDF data - Triples
Subject Predicate Object
dbr:Curtain_(Novel) dbo:author dbr:Agatha_Christie
dbr:Curtain_(Novel) rdf:type dbo:Book
dbr:Curtain_(Novel) rdfs:label "Curtain (novel)"@en
dbr:Curtain_(Novel) dbp:releaseDate "September 1975"@en
dbr:Agatha_Christie rdf:type dbo:Writer
dbr:Agatha_Christie rdfs:label "Agatha Christie"@en
dbo:Book rdfs:label "book"@en
Example RDF data - Graph
Overview
Motivation
RDF Verbalization Templates
Automatic Template Extraction
Evaluation
Related Work
Summary
RDF VERBALIZATION TEMPLATES
RDF Verbalization Template (1/2)
Graph pattern (GP)
Sentence pattern (SP)
RDF Verbalization Template (2/2)
GP represented as SPARQL query:
SELECT ?book_label ?book_type_label ?author_label ?book_rD
WHERE {
  ?book dbo:author ?author .
  ?book dbp:releaseDate ?book_rD .
  ?book rdf:type ?book_type .
  ?book_type rdfs:label ?book_type_label .
  ?book rdfs:label ?book_label .
  ?author rdfs:label ?author_label .
  ?author rdf:type dbo:Writer .
}
Query results:
book_label = "Curtain (novel)"
book_type_label = "book"
author_label = "Agatha Christie"
book_rD = "September 1975"
Verbalization result:
Curtain is a book by Agatha Christie published in September 1975.
RDF data:
Subject Predicate Object
dbr:Curtain_(Novel) dbo:author dbr:Agatha_Christie
dbr:Curtain_(Novel) rdf:type dbo:Book
dbr:Curtain_(Novel) rdfs:label "Curtain (novel)"@en
dbr:Curtain_(Novel) dbp:releaseDate "September 1975"@en
dbr:Agatha_Christie rdf:type dbo:Writer
dbr:Agatha_Christie rdfs:label "Agatha Christie"@en
dbo:Book rdfs:label "book"@en
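As a rough illustration of how a template might be applied, the following sketch evaluates the graph pattern from this slide against the example triples and fills a sentence pattern with the resulting bindings. The tiny matcher `match`, the helper `verbalize` and the exact wording of `SENTENCE_PATTERN` are assumptions made for this example; they stand in for a real SPARQL engine and are not the authors' implementation. Language tags on literals are omitted for brevity.

```python
# Minimal sketch (assumption): a naive triple-pattern matcher instead of a SPARQL engine.
TRIPLES = {
    ("dbr:Curtain_(Novel)", "dbo:author", "dbr:Agatha_Christie"),
    ("dbr:Curtain_(Novel)", "rdf:type", "dbo:Book"),
    ("dbr:Curtain_(Novel)", "rdfs:label", "Curtain (novel)"),
    ("dbr:Curtain_(Novel)", "dbp:releaseDate", "September 1975"),
    ("dbr:Agatha_Christie", "rdf:type", "dbo:Writer"),
    ("dbr:Agatha_Christie", "rdfs:label", "Agatha Christie"),
    ("dbo:Book", "rdfs:label", "book"),
}

# Graph pattern (GP): the WHERE clause of the query on the slide.
GRAPH_PATTERN = [
    ("?book", "dbo:author", "?author"),
    ("?book", "dbp:releaseDate", "?book_rD"),
    ("?book", "rdf:type", "?book_type"),
    ("?book_type", "rdfs:label", "?book_type_label"),
    ("?book", "rdfs:label", "?book_label"),
    ("?author", "rdfs:label", "?author_label"),
    ("?author", "rdf:type", "dbo:Writer"),
]

# Sentence pattern (SP): hypothetical wording, with one slot per selected variable.
SENTENCE_PATTERN = (
    "{book_label} is a {book_type_label} by {author_label} published in {book_rD}."
)


def match(pattern, triples, binding=None):
    """Yield every variable binding under which all triple patterns occur in `triples`."""
    binding = binding or {}
    if not pattern:
        yield binding
        return
    head, rest = pattern[0], pattern[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(head, triple):
            if term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


def verbalize(pattern, sentence_pattern, triples):
    """Fill the sentence pattern once per binding of the graph pattern."""
    sentences = []
    for binding in match(pattern, triples):
        values = {name[1:]: value for name, value in binding.items()}
        sentences.append(sentence_pattern.format(**values))
    return sentences


print(verbalize(GRAPH_PATTERN, SENTENCE_PATTERN, TRIPLES))
```

Because this sketch applies no modifier to the labels, it prints "Curtain (novel) is a book ..." rather than the "Curtain is a book ..." shown on the slide.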
AUTOMATIC TEMPLATE EXTRACTION
Template Extraction (1/6) - Overview
Parallel text-data corpus → RDF verbalization templates
1. Sentence Collection
2. Text-Data Alignment
3. Abstraction
4. Grouping
5. Pattern Mining
6. Template Creation
Experiment:
Text from Wikipedia
Data from DBpedia
10 Virtual Machines
8 vCPUs
8GB RAM
40GB Disk
Extraction ran for 2 weeks
Template Extraction (2/6) - Features
Distant-supervised
No hand-labeled training data required
Simultaneous multi-relation learning
Simultaneously learning all relations in a sentence
Frequent maximal subgraph pattern mining
Identify commonalities among RDF graph patterns
Language independent
Does not rely on syntactic parsing
Example Template (1/2)
Example Template (2/2)
Template Extraction (3/6) - Alignment
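[Figure: a sentence aligned with RDF data. Labels of entities and literals are matched against strings in the sentence, using modifiers m1 to m5. Legend: entity; literal; identified entity (i); identified literal (i); modifier (m1); matched string.]
Language-independent approach: no syntactic parsing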
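A minimal sketch of what such an alignment could look like, assuming plain substring matching of labels against the sentence: each label is tried unchanged and then through a modifier, and no parser is involved. The modifier `strip_parenthetical`, the `MODIFIERS` table and the function `align` are hypothetical names introduced for this example; the modifiers actually used by the authors are not listed on the slides.

```python
import re


def strip_parenthetical(label):
    """Hypothetical modifier: drop a trailing parenthetical, e.g. 'Curtain (novel)' -> 'Curtain'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", label)


# Modifiers are tried in this order; "identity" leaves the label unchanged.
MODIFIERS = {
    "identity": lambda label: label,
    "strip_parenthetical": strip_parenthetical,
}


def align(sentence, labels):
    """Return sorted (start, end, resource, modifier_name) tuples for labels found in the sentence.

    `labels` maps an entity or literal identifier to its label string.
    Only string matching is used, so nothing here depends on the language.
    """
    matches = []
    for resource, label in labels.items():
        for name, modifier in MODIFIERS.items():
            surface = modifier(label)
            start = sentence.find(surface)
            if surface and start != -1:
                matches.append((start, start + len(surface), resource, name))
                break  # keep the first modifier that yields a match
    return sorted(matches)


sentence = "Curtain is a book by Agatha Christie published in September 1975."
labels = {
    "dbr:Curtain_(Novel)": "Curtain (novel)",
    "dbr:Agatha_Christie": "Agatha Christie",
    '"September 1975"@en': "September 1975",
}
result = align(sentence, labels)
```

Only entities and literals are matched here; relations between them are not looked for in the text, in line with the speaker notes.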
Template Extraction (4/6) – Abstraction
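[Figure: Abstraction 1 and Abstraction 2 of an aligned sentence, each shown with a sentence pattern (pattern 1, pattern 2) and a hypothesis graph pattern (hypothesis graph pattern 1, hypothesis graph pattern 2).]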
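The sentence side of the abstraction step can be sketched as replacing each matched string with a variable, which yields sentence patterns of the form shown on the grouping slide. The function `abstract`, its input format and the left-to-right numbering of variables are assumptions for illustration; building the hypothesis graph pattern is not covered by this sketch.

```python
def abstract(sentence, matches):
    """Replace each matched span with a variable {V1}, {V2}, ... from left to right.

    `matches` is a list of (start, end, resource) tuples with non-overlapping spans.
    Returns the sentence pattern and a mapping from variable name to resource.
    """
    parts, variables, cursor = [], {}, 0
    for index, (start, end, resource) in enumerate(sorted(matches), start=1):
        name = f"V{index}"
        parts.append(sentence[cursor:start])  # text before the matched span
        parts.append("{" + name + "}")        # the span itself becomes a variable
        variables[name] = resource
        cursor = end
    parts.append(sentence[cursor:])           # remaining text after the last span
    return "".join(parts), variables


sentence = "Curtain is a book by Agatha Christie published in September 1975."
author_start = sentence.find("Agatha Christie")
matches = [
    (author_start, author_start + len("Agatha Christie"), "dbr:Agatha_Christie"),
    (0, len("Curtain"), "dbr:Curtain_(Novel)"),
]
pattern, variables = abstract(sentence, matches)
```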
Template Extraction (5/6) - Grouping
Group graph patterns with equivalent sentence patterns:
'"{V1}" is a short story by {V2}.':
  abstraction-64451-1
  abstraction-88393-1
  abstraction-4732-1
  abstraction-50480-1
'"{V1}" is a single by American {V9} {V4} {V8}.':
  abstraction-22205-1
  abstraction-22205-3
  abstraction-72533-1
  abstraction-127891-2
'{V1} (born {V2}) is a German footballer.':
  abstraction-86372-1
  abstraction-86415-1
  abstraction-135340-5
  abstraction-140464-2
Each group: hypothesis graph patterns with equivalent sentence pattern
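A grouping of this kind can be sketched with a dictionary keyed by sentence pattern. The function name `group_by_sentence_pattern` and the input format (pairs of abstraction identifier and sentence pattern) are assumptions, and "equivalent" is simplified here to exact string equality.

```python
from collections import defaultdict


def group_by_sentence_pattern(abstractions):
    """Group abstraction identifiers by their sentence pattern.

    `abstractions` is an iterable of (abstraction_id, sentence_pattern) pairs.
    Equivalence of sentence patterns is approximated by string equality.
    """
    groups = defaultdict(list)
    for abstraction_id, sentence_pattern in abstractions:
        groups[sentence_pattern].append(abstraction_id)
    return dict(groups)


abstractions = [
    ("abstraction-64451-1", '"{V1}" is a short story by {V2}.'),
    ("abstraction-86372-1", "{V1} (born {V2}) is a German footballer."),
    ("abstraction-88393-1", '"{V1}" is a short story by {V2}.'),
    ("abstraction-86415-1", "{V1} (born {V2}) is a German footballer."),
]
groups = group_by_sentence_pattern(abstractions)
```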
Template Extraction (6/6) - fmSpan
fmSpan - Frequent maximal subgraph pattern mining
Input:
Set of graph patterns
Minimal coverage value: c
Output: Set of graph patterns
Each output graph pattern
Is a subgraph of at least c input graph patterns (→ frequent)
Cannot be extended while maintaining coverage (→ maximal)
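The two conditions, frequent and maximal, can be illustrated with a deliberately simplified sketch. It is not fmSpan: each graph pattern is treated as a set of triple patterns whose variable names are already aligned, so the subgraph test becomes a subset test, and neither connectivity nor variable renaming is handled. The function name `frequent_maximal` is hypothetical, and the brute-force enumeration is only practical for small inputs.

```python
from itertools import combinations


def frequent_maximal(patterns, c):
    """Return the maximal sets of triple patterns contained in at least c input patterns.

    Simplification (assumption): graph patterns are sets of triple patterns with
    shared variable names, so "is a subgraph of" is plain set inclusion.
    """
    patterns = [frozenset(p) for p in patterns]
    candidates = set()
    # The intersection of any c inputs is contained in at least c inputs (frequent).
    for combo in combinations(patterns, c):
        common = frozenset.intersection(*combo)
        if common:
            candidates.add(common)
    # Keep only candidates that no other frequent candidate properly extends (maximal).
    return [s for s in candidates if not any(s < other for other in candidates)]


AUTHOR = ("?x", "dbo:author", "?y")
TYPE = ("?x", "rdf:type", "dbo:Book")
DATE = ("?x", "dbp:releaseDate", "?d")
WRITER = ("?y", "rdf:type", "dbo:Writer")

patterns = [
    {AUTHOR, TYPE, DATE},
    {AUTHOR, TYPE},
    {AUTHOR, TYPE, WRITER},
    {AUTHOR, DATE},
]
result = frequent_maximal(patterns, c=2)
```

With c = 2, the pattern containing only the author triple is frequent but not maximal, since it can be extended by either the type triple or the release date triple while still being covered by two inputs.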
EVALUATION
Evaluation (1/4) - Experiment
Parallel text-data corpus:
88,708,622 triples
4,004,478 English documents
716,049 German documents
3,811,992 English sentences
794,040 German sentences
3,434,108 abstracted English sentences
530,766 abstracted German sentences
(with at least two identified entities)
Language  #groups≥5  #templates  #all groups
en        4,569      3,816       686,687
de        2,130      1,250       269,551
Evaluation (2/4) - Coverage
How often can a template be applied?
About 300 templates can each be used to verbalize between 10,000 and 100,000 subgraphs.
[Bar chart: number of templates (#en, #de; y-axis 0 to 350) per coverage range: 1–10, 10–100, 100–1000, 1000–10,000, 10,000–100,000, 100,000–1,000,000, 1,000,000–10,000,000, 10,000,000–100,000,000.]
Evaluation (3/4)
User study: 10 English templates, 10 German templates, 6 evaluators, 200 verbalizations
Accuracy (1): Is everything that is expressed in the graph pattern also expressed in the sentence pattern?
Measured for each triple pattern within the GP:
(1) The triple pattern is explicitly expressed
(2) The triple pattern is implied
(3) The triple pattern is not expressed
(4) Unsure
Accuracy (2): Is everything that is expressed in the sentence pattern also expressed in the graph pattern?
(1) Everything is expressed
(2) Most things are expressed
(3) Some things are expressed
(4) Nothing is expressed
[Bar charts: counts per answer option (1) to (4) for en and de; y-axis 0 to 200 for Accuracy (1), 0 to 20 for Accuracy (2).]
Evaluation (4/4)
Syntactical Correctness: How syntactically correct are verbalizations?
(1) Completely syntactically correct
(2) Almost syntactically correct
(3) Some syntactical errors
(4) Strongly syntactically incorrect
Understandability: How understandable are verbalizations?
(1) The meaning is clear
(2) The meaning is clear, but there are some problems in word usage and/or style
(3) The basic thrust is clear, but the evaluator is not sure of some detailed parts because of word usage problems
(4) Contains many word usage problems, and the evaluator can only guess at the meaning
(5) Cannot be understood at all
[Bar charts: counts per answer option for en and de; y-axis 0 to 250 for Syntactical Correctness (options 1 to 4), 0 to 300 for Understandability (options 1 to 5).]
RELATED WORK
Related Work (1/4)
(Welty et al., 2010)
Focus on IE
Input sentences are parsed
Regard relations between proper nouns only
Does not consider a graph of relations
Related Work (2/4)
(Duma and Klein, 2013)
Focus on NLG
Related Work (3/4)
(Gerber and Ngomo, 2011)
Focus on IE
< ’s acquisition of > pattern for property subsidiary
“Google’s acquisition of Youtube comes as online
video is really starting to hit its stride.”
relation expressed by string between entities
Related Work (4/4)
Distant supervision
(Craven and Kumlien, 1999), (Bunescu and Mooney, 2007), (Carlson et al., 2009), (Mintz et al., 2009), (Welty et al., 2010), (Hoffmann et al., 2011), (Surdeanu et al., 2012)
Simultaneous multi-relation learning
(Carlson et al., 2009)
SUMMARY
Summary
Introduced RDF verbalization templates
Introduced template extraction approach
Distant-supervised
Language independent
Simultaneous multi-relation learning
Frequent maximal subgraph pattern mining
Evaluation
Large parallel text-data corpus for en and de
Good syntactical correctness & understandability
Accuracy needs to be improved in future work
Thank you for your attention!
The authors acknowledge the support of the European Commission's Seventh Framework Programme
FP7-ICT-2011-7 (XLike, Grant 288342).
http://km.aifb.kit.edu/sites/bridge-patterns/INLG2014/
References (1/2)
Razvan Bunescu and Raymond Mooney. 2007. Learning to extract relations from the web using minimal supervision. In Annual Meeting of the Association for Computational Linguistics, volume 45, pages 576–583.
Andrew Carlson, Justin Betteridge, Estevam R Hruschka Jr, and Tom M Mitchell. 2009. Coupling semi-supervised learning of categories and relations. In Proceedings of the NAACL HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pages 1–9. Association for Computational Linguistics.
Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Thomas Lengauer, Reinhard Schneider, Peer Bork, Douglas L. Brutlag, Janice I. Glasgow, Hans-Werner Mewes, and Ralf Zimmer, editors, ISMB, pages 77–86. AAAI.
Daniel Duma and Ewan Klein. 2013. Generating Natural Language from Linked Data: Unsupervised template extraction, pages 83–94. Association for Computational Linguistics, Potsdam, Germany.
Daniel Gerber and A-C Ngonga Ngomo. 2011. Bootstrapping the linked data web. In 1st Workshop on Web Scale Knowledge Extraction @ International Semantic Web Conference, volume 2011.
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550. Association for Computational Linguistics.
References (2/2)
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP 09, pages 1003–1011.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.
Chris Welty, James Fan, David Gondek, and Andrew Schlaikjer. 2010. Large scale relation detection. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 24–33. Association for Computational Linguistics.
Editor's Notes
Mention: Paper rather technical (algorithms, formalizations). Presentation is not that technical – tries to convey the main ideas of the approach.
Mention: Website with additional material (data samples, evaluation material) related to the publication: http://km.aifb.kit.edu/sites/bridge-patterns/INLG2014/
Mention: graph more complex than sentence -> e.g., that a person is alive (category living people) is implied by present tense instead of past tense. Also: redundancies in the vocabulary.
Explain identification and modifiers.
Only entities are matched, not relations. They'll be identified when comparing graph patterns.
What about overlapping matches? Why do I create individual hypothesis graph patterns?
Mention:
* Experts in English, German, SPARQL
* How were templates selected? -> randomly, different complexities, material online