Gathering Alternative Surface Forms for DBpedia Entities
Volha Bryl
University of Mannheim, Germany → Springer Nature
Christian Bizer, Heiko Paulheim
University of Mannheim, Germany
NLP & DBpedia @ ISWC 2015, Bethlehem, USA, October 11, 2015
Why you need Surface Forms
• The surface form (SF) of an entity is a collection of strings it can be
referred to as: synonyms, alternative names, etc.
• Used to support many NLP tasks: co-reference resolution, entity
linking, disambiguation
“Billionaire Elon Musk has spelled out how he plans to
create temporary suns over Mars in order to heat the
Red Planet. Dismissing earlier comments that he
intended to nuke the planet’s surface, he says he wants
to create aerial explosions to heat it up.”
To link the three entities, your machine should know that “Red Planet” is
an alternative name for Mars, and that Mars can be referred to simply by its
“type”: planet
Surface Forms from Wiki(DB)pedia
• Some of Wikipedia’s (hence, DBpedia’s) crowd-sourced content looks
quite like surface forms
• Page titles
• Redirects
• These account for alternative names, word forms (e.g. plurals), closely related
words, abbreviations, alternative spellings, likely misspellings, and subtopics
• Disambiguation pages
• There are 10+ Bethlehems in the US, according to
https://en.wikipedia.org/wiki/Bethlehem_(disambiguation)
• Anchor texts of links between wiki pages
Named after the Roman god of war, it is often referred to as the “Red
Planet”...
Source: Named after the [[Mars (mythology)|Roman god of war]], it is
often referred to as the “Red Planet”
• …additionally, we use anchor texts of links from external pages to Wikipedia
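As an illustration, here is a minimal Python sketch of pulling (target page, surface form) pairs out of raw wiki markup, as in the Mars example above. The regex and function name are ours, not the DBpedia extractor's:

```python
import re

# Matches [[Target|anchor text]] as well as plain [[Target]] wiki links.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_surface_forms(wikitext):
    """Yield (target_page, surface_form) pairs from raw wiki markup."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit anchor text, the page title is the surface form.
        anchor = (match.group(2) or target).strip()
        yield target, anchor

text = ('Named after the [[Mars (mythology)|Roman god of war]], '
        'it is often referred to as the "Red Planet"')
print(list(extract_surface_forms(text)))
# [('Mars (mythology)', 'Roman god of war')]
```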
Surface Forms from Wiki(DB)pedia
• Not a new idea
• BabelNet, DBpedia Spotlight, … [see our paper for more links]
• Problem: Quality
• …it is not only that quality is a problem; it is also that it has never been
assessed or addressed
• Reason 1: the good quality of Wikipedia content is taken for granted
• Reason 2: the hope is that NLP algorithms won’t be influenced by noise
[Screenshot: Mars in BabelNet]
• Why is quality a problem?
• By adding a redirect or an anchor text of an internal Wikipedia link, a Wikipedia
editor might mean not only “same as” or “also known as”, but also “related to”,
“contains”, etc.
• Both variants serve the purpose of pointing to the correct wiki page
Solution: Focus on Quality
• Step 1: Extract
• We extract SFs from Wikipedia labels, redirects, disambiguations, and anchor
texts of internal wiki-links
• Step 2: Evaluate
• We create a gold standard to evaluate SF quality
• Step 3: Filter
• We implement three filters to improve SF quality
• Bonus: More SFs
• We extract SFs from anchor texts of Wiki links found in the Common Crawl
2014 corpus
• All datasets are available at
http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/
SFs Dataset Statistics
• LRD = Labels, Redirects, Disambiguations
• Extracted from DBpedia dumps
• WAT = Wikipedia Anchor Texts
• Extracted by a new DBpedia extractor (based on PageLinksExtractor)
Gold Standard
• Manual annotation, 1 annotator, 2 subsets
• Popular subset: manually selected 34 popular entities of different types
• Denmark, Berlin, Apple Inc., Animal Farm, Michael Jackson, Star Wars, Diego
Maradona, Mars, etc.
• ~82 SFs per entity, linked from other Wiki pages 813,736 times
• Random subset: randomly selected 81 entities each having at least 5 SFs
• Andy_Zaltzman, Bell AH-1 SuperCobra, Biarritz, Castellum, Firefox (film), Kipchak
languages, ParisTech, Psychokinesis, etc.
• ~13 SFs per entity, linked from other Wiki pages 14,760 times
Available at http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/gold/
Gold Standard
• Types of annotations
• correct (“the eternal city” for Rome)
• contained (“Google Japan” for Google), contains (“Turkey” for Istanbul)
• type of (“the city” for Rome)
• partial (“Diego” for Diego Maradona)
• related (“Google Blog” for Google)
• wrong (“during World War I” for United States)
Evaluation: How many correct SFs?
• SFs extracted from labels, redirects, disambiguations
• correct, popular subset: 66.8%
• correct, random subset: 86.6%
• SFs extracted from Wikipedia anchor texts
• correct, popular subset: 38.5%
• correct, random subset: 70.7%
• Combined dataset
• correct, popular subset: 45.7%
• correct, random subset: 75%
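For illustration, here is a minimal sketch of how such percentages can be computed from the gold standard; the (entity, surface form, label) record layout is an assumption, while the annotation labels come from the previous slide:

```python
from collections import Counter

def correct_share(annotations):
    """Fraction of surface forms annotated as 'correct'.

    `annotations` is a list of (entity, surface_form, label) triples,
    with labels from {correct, contained, contains, type of, partial,
    related, wrong}; this record layout is assumed for illustration.
    """
    counts = Counter(label for _, _, label in annotations)
    return counts["correct"] / sum(counts.values())

sample = [
    ("Rome", "the eternal city", "correct"),
    ("Google", "Google Japan", "contained"),
    ("Istanbul", "Turkey", "contains"),
    ("United States", "during World War I", "wrong"),
]
print(f"{correct_share(sample):.1%}")  # 25.0%
```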
(1) Filtering: String Patterns
• Data analysis → wrong SFs follow recognizable string patterns (a filter
sketch follows after this list)
• URLs: contain .com or .net (“Berlin-china.net” for Berlin)
• of-phrases, with exceptions for city of, state of, and the like (“Issues of
Toronto” for Toronto)
• in-phrases (“Historical sites in Berlin” for Berlin)
• and-phrases (“Tom Cruise and Katie Holmes” for Tom Cruise)
• list-of (“List of Toronto MPs and MPPs” for Toronto)
• Increase in precision
• popular subset: 1.33%
• popular subset, LRD only: 3.75%
• random subset: less than 1%
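A sketch of these string-pattern filters in Python; the exact rules and exception lists used in the paper may differ:

```python
# of-phrases like "City of London" are legitimate surface forms;
# the exception list here is illustrative, not exhaustive.
OF_EXCEPTIONS = ("city of", "state of")

def is_bad_surface_form(sf):
    """Heuristic filter flagging surface forms that match wrong-SF patterns."""
    s = sf.lower()
    if ".com" in s or ".net" in s:      # URL-like: "Berlin-china.net"
        return True
    if s.startswith("list of "):        # "List of Toronto MPs and MPPs"
        return True
    if " of " in s and not any(s.startswith(e + " ") for e in OF_EXCEPTIONS):
        return True                     # "Issues of Toronto"
    if " in " in s:                     # "Historical sites in Berlin"
        return True
    if " and " in s:                    # "Tom Cruise and Katie Holmes"
        return True
    return False

for sf in ("Berlin-china.net", "Historical sites in Berlin", "City of London"):
    print(sf, "->", "drop" if is_bad_surface_form(sf) else "keep")
```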
(2) Filtering: Wikidata
• Observation: some SFs are entities on their own in other languages
• E.g. “Neckarau”, a city district of Mannheim, redirects to Mannheim in the
English Wikipedia, but has its own page in the German Wikipedia
• Implementation: use the DBpedia-Wikidata dumps released in May 2015
• Check whether an SF exactly matches or is close (in Levenshtein distance) to
a label of a Wikidata entity that has no English Wikipedia page but does have
pages in other languages
• Increase in precision
• 0.5% compared to pattern-based filtering
• 1.5% for SFs extracted from LRD only
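A minimal sketch of this check with a self-contained edit-distance implementation; the distance threshold and the foreign-label lookup are illustrative assumptions:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def matches_foreign_entity(sf, foreign_labels, max_dist=1):
    """True if `sf` exactly matches or is within `max_dist` edits of a label
    of a Wikidata entity that has no English Wikipedia page; `max_dist` is
    an assumed value, the slides only say "close (Levenshtein distance)"."""
    return any(levenshtein(sf.lower(), label.lower()) <= max_dist
               for label in foreign_labels)

# "Neckarau" is its own entity in the German Wikipedia, so it would be
# filtered out as a surface form of Mannheim.
print(matches_foreign_entity("Neckarau", ["Neckarau", "Käfertal"]))  # True
```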
(3) Filtering: Frequency Scores
• For SFs extracted from anchor texts, link frequencies are available → TF-IDF scores
• Determining the threshold: values from 1.0 to 8.0 with a step of 0.2 were evaluated
• Two thresholds selected, with the highest F1 values: 1.8 and 2.6
• Threshold 0 (no filtering) used as the baseline
• Increase in precision
• 20% for popular subset, 10% for random subset
* Filtering was done on the dataset to which the pattern- and Wikidata-based filters had already been applied
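One plausible reading of this scoring in Python (the slides do not spell out the formula): TF is taken as the pair's link frequency, and IDF penalizes surface forms that point at many different entities; the paper's exact formula may differ:

```python
import math
from collections import defaultdict

def tfidf_scores(link_counts):
    """Score (surface_form, entity) pairs; `link_counts` maps
    (sf, entity) -> how often that anchor text links to that page."""
    entities_per_sf = defaultdict(set)
    for sf, entity in link_counts:
        entities_per_sf[sf].add(entity)
    n_entities = len({entity for _, entity in link_counts})
    return {(sf, entity): count * math.log(n_entities / len(entities_per_sf[sf]))
            for (sf, entity), count in link_counts.items()}

def apply_threshold(scores, threshold=1.8):  # 1.8 and 2.6 gave the best F1
    return {pair for pair, score in scores.items() if score >= threshold}

counts = {("red planet", "Mars"): 120, ("planet", "Mars"): 3,
          ("planet", "Venus"): 4, ("planet", "Jupiter"): 5}
# "planet" links to every entity, so its IDF (and score) is 0 and it is dropped.
print(apply_threshold(tfidf_scores(counts)))  # {('red planet', 'Mars')}
```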
SFs from Common Crawl
• Common Crawl (CC) is the largest publicly available web corpus
• Extraction done on Winter 2014 CC Corpus, in the context of the Web
Data Commons project
• http://webdatacommons.org/ extracts various types of structured data from CC
and provides it for public download
• Data required a lot of cleaning
• 3M SFs added to our LRD&WAT corpus
• No annotated gold standard: left for future work
• Available at
http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/lrd-cc/
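A minimal standard-library sketch of harvesting anchor texts of links that point to English Wikipedia; the real Web Data Commons extraction ran over the full crawl and needed far more cleaning than shown here:

```python
from html.parser import HTMLParser

class WikiAnchorExtractor(HTMLParser):
    """Collect (target page, anchor text) pairs for links to en.wikipedia.org."""
    PREFIX = "https://en.wikipedia.org/wiki/"

    def __init__(self):
        super().__init__()
        self.pairs, self._target, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.PREFIX):
                self._target, self._text = href[len(self.PREFIX):], []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            self.pairs.append((self._target, "".join(self._text).strip()))
            self._target = None

parser = WikiAnchorExtractor()
parser.feed('See <a href="https://en.wikipedia.org/wiki/Mars">the Red Planet</a>.')
print(parser.pairs)  # [('Mars', 'the Red Planet')]
```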
Conclusion and Future Work
• Main message
• the quality of Wikipedia-based surface forms is often overlooked!
• Contributions
• Gold standard SFs, made available
• 3 filtering strategies: precision improved by > 20% for popular Wikipedia
entities and by > 10% for random entities
• Extracted SFs from Common Crawl corpus
• All data publicly available
• Future work directions
• Task-based evaluation of the resource, further work on the gold standard
