JRC-Names - EC - Diplohack Datamarket

JRC-Names: A freely available, highly multilingual named entity resource. This presentation was part of the Diplohack Brussels Data Market on 29-30 April 2016.

  1. JRC-Names: A freely available, highly multilingual named entity resource
       RANLP’2011, Hissar, Bulgaria, 12 September 2011
       Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva & Erik van der Goot
       Technical details and publications: http://langtech.jrc.ec.europa.eu/
       Applications: http://emm.newsbrief.eu/overview.html
  2. Agenda
       • What is JRC-Names; what can it be used for
       • Related work: other named entity (NE) resources
       • How JRC-Names was produced
           • Recognition of named entities in news reports in 20 languages
           • Introduction to EMM
           • Automatic mapping of name variants to the same entity
           • Enrichment with Wikipedia variants
           • Partial manual moderation
       • Statistics on JRC-Names
       • Programming details / usage of the tool
       • Solutions to capture morphological variants
       • Further multilingual linguistic resources
  3. What is JRC-Names?
       • JRC-Names consists of:
           • Lists of names and their many spelling variants:
               • ~205,000 person and organisation names, plus
               • ~204,000 name spelling variants
               • In 27 scripts and many more languages
           • Software to recognise these names in multilingual text, returning the offset and a unique name identifier
       • Download from http://langtech.jrc.ec.europa.eu/
  4. Possible uses of JRC-Names
       • Standardise name spellings in databases, text collections and the internet for improved retrieval (Stern & Sagot 2010)
       • Improve machine translation: names must be treated differently from other words (Babych & Hartley 2003; Steinberger & Pouliquen 2009)
       • Use as input to learn automatic transliteration rules (e.g. Pouliquen 2009)
       • Use the output of JRC-Names as seeds to learn NER rules (e.g. Buchholz & van den Bosch 2000)
       • Social networks are less biased by national viewpoints if they are based on information extracted from multilingual texts
       • NER results are useful for other text mining tasks (opinion mining; co-reference resolution; summarisation; topic detection and tracking; cross-lingual linking of related documents; …)
  5. Related work: other multilingual (ml) NE resources
       • Wentland et al. (2008): built a ml NE repository based on Wikipedia links and case information; 2.5 million English names, 250K German, 3K Swahili, …
       • Toral et al. (2008): built Named Entity WordNet by searching for NEs in WordNet and Wikipedia; 310K entities, including 278K persons
       • Stern & Sagot (2010): exploited French Wikipedia and GeoNames to produce a French resource; 263K person names + 883K variants
       • Maurel (2009): produced Prolexbase, mostly manually; 75K entities of all types
       → Most resources are based on Wikipedia:
           • Strong at providing cross-lingual and cross-script variants
           • Offer only few other spelling variants
           • No morphological inflections
       • JRC-Names contains mostly spelling variants from real-life text, enriched with Wikipedia: up to 413 variants for the same NE.
  6. Name variants found and used in 6 hours (!) of EMM news analysis (26.08.2011, PM)
  8. NER on news gathered by the Europe Media Monitor (EMM)
       • ~3,600 sources (world-wide, with a focus on Europe):
           • ~3,225 news sources (web portals)
           • ~360 specialist medical sites
           • ~20 commercial newswires
           • Specialist pay-for sources (LexisMed)
       • 24/7, updated every 10 minutes
       • ~100,000 articles / day in ~50 languages
       • Named Entity Recognition (NER) is performed on 20 languages.
       • Articles are fed into the various publicly accessible EMM applications.
  9. Multilingual NER in EMM: a brief overview
       • Lookup of the most frequent known names and their variants in all languages
           • Database currently contains about 1.18 million names + 225,000 variants (status July 2011)
           • Including morphological (and other) variants by pre-generating inflection forms (Slovene example): Tony(a|o|u|om|em|m|ju|jem|ja)?\s+Blair(a|o|u|om|em|m|ju|jem|ja)?
       • Guessing new names using empirically derived lexical patterns in 20 languages:
           • President, Minister, Head of State, Sir, American
           • “death of”, “[0-9]+-year-old”, …
           • Known first names + uppercase words
       • Identification of a current average of 1,000 unknown names per day.
       • Only names found repeatedly will become known names (error reduction).
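The two recognition steps on this slide can be illustrated with a minimal Python sketch. It is not the EMM implementation: the suffix list and trigger expressions are copied from the slide, while the function names `inflection_pattern` and `guess_names`, the capitalised-word heuristic and the Slovene test sentence are assumptions made for illustration.

```python
import re

# Slovene suffix list copied from the slide.
SLOVENE_SUFFIXES = ["a", "o", "u", "om", "em", "m", "ju", "jem", "ja"]

# Trigger words and expressions quoted on the slide (English only).
TRIGGERS = ["President", "Minister", "Head of State", "Sir", "American",
            "death of", r"[0-9]+-year-old"]


def inflection_pattern(name, suffixes=SLOVENE_SUFFIXES):
    """Compile a regex that allows an optional suffix on every name part."""
    suffix_group = "(?:" + "|".join(suffixes) + ")?"
    parts = [re.escape(part) + suffix_group for part in name.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")


# A trigger followed by one or more capitalised words (hypothetical heuristic).
# [A-Z] only covers unaccented Latin capitals, a simplification for this sketch.
GUESS_PATTERN = re.compile(
    r"\b(?:" + "|".join(TRIGGERS) + r")\s+"
    r"([A-Z][\w'-]+(?:\s+[A-Z][\w'-]+)*)"
)


def guess_names(text):
    """Return candidate names that directly follow a trigger expression."""
    return [m.group(1) for m in GUESS_PATTERN.finditer(text)]


tony = inflection_pattern("Tony Blair")
print(tony.search("obisk Tonyja Blaira v Ljubljani").group(0))
print(guess_names("death of former Prime Minister Rafik Hariri, blamed by many"))
```

In a pipeline like the one described, guessed candidates would presumably be counted across articles and promoted to known names only after repeated sightings, as the last bullet of the slide states.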
  10. Multilingual name recognition using lexical patterns
       • es: asesinato del ex primer ministro Rafic al-Hariri, que la oposición atribuyó
       • fr: l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplom
       • nl: na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna een
       • de: libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige B
       • sl: danjega libanonskega premiera Rafika Haririja. Libanonska opozicija si
       • et: möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommipl
       • en: death of former Prime Minister Rafik Hariri, blamed by many opposition
       • ar: اغتيال رئيس الوزراء السابق رفيق الحريري بأياد يهودية وما حدث سابقا
       • ru: Бывший премьер-министр Ливана Рафик Харири, который
  11. Merging name variants for the same entity
       • For all newly found name forms, detect whether they are a variant of an existing NE:
           • Transliteration;
           • Normalisation, using ~30 hand-written rules and removing vowels;
           • Calculate similarity (threshold: 94%).
       • Below threshold → new entity
       • [Figure: labels “20% + 80%” and “Condition:”]
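The merging steps listed on this slide can be sketched as follows, assuming the names have already been transliterated into Latin script. Only the 94% threshold comes from the slide. The three rewrite rules are hypothetical stand-ins for the ~30 hand-written rules, `difflib.SequenceMatcher` is a stand-in similarity measure, and using the vowel-free form as a precondition is an assumption; the 20% + 80% weighting shown on the slide is not reproduced here.

```python
import unicodedata
from difflib import SequenceMatcher

THRESHOLD = 0.94  # similarity threshold stated on the slide

# Hypothetical stand-ins for the ~30 hand-written normalisation rules.
RULES = [("ph", "f"), ("ck", "k"), ("c", "k")]
VOWELS = set("aeiou")


def normalise(name):
    """Strip diacritics, lowercase, then apply the rewrite rules in order."""
    decomposed = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()
    for old, new in RULES:
        text = text.replace(old, new)
    return text


def consonant_skeleton(name):
    """Normalised form with vowels, spaces and punctuation removed."""
    return "".join(ch for ch in normalise(name) if ch.isalpha() and ch not in VOWELS)


def similarity(a, b):
    """Stand-in similarity in [0, 1] computed on the normalised forms."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def is_variant(new_name, known_name):
    """Assumed decision rule: same skeleton and similarity at or above 94%."""
    if consonant_skeleton(new_name) != consonant_skeleton(known_name):
        return False
    return similarity(new_name, known_name) >= THRESHOLD


print(is_variant("Hamid Karzaï", "Hamid Karzai"))
print(is_variant("Hamid Karsai", "Hamid Karzai"))
```

With these illustrative rules, "Hamid Karsai" stays below the threshold, so in this sketch it would start a new entity unless it were merged by hand.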
  12. Enriching the EMM data with Wikipedia name variants
       • For frequent or highly visible names, manually launch a Wikipedia mining process.
       • Check for each variant of a name whether there is a Wikipedia entry.
       • New name variants, in all scripts, will be recognised in new EMM articles.
       • Example (http://en.wikipedia.org/wiki/Hamid_Karzai): Хамид Карзай; Hamid Karzai; Hamid Karzaï; Hamid Karsai; حامد كرزاي; हामिद करजई; 哈米德·卡尔扎伊
  13. Manual moderation of EMM name database
       • The process is fully automatic, but it can be useful to make changes manually.
       • Manual process only for frequent or important names (e.g. Nobel Prize winners):
           • Name changes (e.g. Cardinal Josef Ratzinger → Pope Benedict XVI);
           • Correct NER mistakes (e.g. Genius Report, Opfer von Diskriminierung);
           • Add new stop name parts (e.g. Monday, Report);
           • Merge name variants with similarity below the threshold;
           • Change the display name of an entity;
           • Correct the entity type (PER, ORG, T, U, …);
           • Launch Wikipedia mining process;
           • …
       • Caveat: the name database contains errors!
  15. Statistics on JRC-Names (1)
       • JRC-Names includes names from the EMM database if any of the following hold:
           • Found in 5 or more news clusters;
           • Manually verified;
           • Retrieved from Wikipedia.
       • Number of entries (status July 2011):
           • 205,000 distinct names;
           • 204,000 additional variants;
           • ~3.2% names of organisations / events
       • Number of variants:
           • 413 variants for Muammar Gaddafi (entity 262)
           • 256 variants for Mikhail Saakashvili (entity 472)
           • 246 variants for Mahmoud Ahmadinejad (entity 101358)
       • Grows by ~230 new entities and ~430 new variants per week.

       Variant forms    No. of entities
       1                63.76%
       2                22.52%
       3                5.31%
       10 or more       3,760 entities
       50 or more       242 entities
       100 or more      37 entities
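The inclusion rule at the top of this slide is a simple disjunction, sketched below. The three criteria and the entity id 262 come from the slide; the record type `EmmName`, its field names and the cluster counts in the usage lines are hypothetical.

```python
from dataclasses import dataclass

MIN_CLUSTERS = 5  # "found in 5 or more news clusters" (from the slide)


@dataclass
class EmmName:
    """Hypothetical record for one entity in the EMM name database."""
    entity_id: int
    display_name: str
    cluster_count: int = 0
    manually_verified: bool = False
    from_wikipedia: bool = False


def belongs_in_jrc_names(entry):
    """True if any one of the three inclusion criteria holds."""
    return (entry.cluster_count >= MIN_CLUSTERS
            or entry.manually_verified
            or entry.from_wikipedia)


# Cluster counts below are made up for illustration.
print(belongs_in_jrc_names(EmmName(262, "Muammar Gaddafi", cluster_count=12)))
print(belongs_in_jrc_names(EmmName(1, "Rarely Seen Name", cluster_count=2)))
```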
  16. Statistics on JRC-Names (2)
       • Number of scripts: 27; number of languages: ???
       • News mentions names from around the world.
       • Frequency does not reflect origin:
           • European Union (10101) is the most frequent entity in German, and the second most frequent in English.
           • It does not matter where a name like Silvio Berlusconi comes from.
  18. Details about the JRC-Names software
       • Java-implemented demonstrator
       • Finite state automaton
       • Reads the NE resource file entities.gzip (frequently updated)
       • Searches for known names (and their variants) in UTF-8-encoded text files. Returns:
           • Numerical name identifier
           • Main name for that entity
           • Name string found in the text
           • Position (offset and string length)
       • For any given name string, returns all variants.
       • Software and NE resource file can be downloaded from http://langtech.jrc.ec.europa.eu/, section on ‘Resources’
       • Free usage, according to the accompanying end-user licence.
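The lookup behaviour described on this slide can be approximated with a small Python sketch. This is not the Java demonstrator and it does not read entities.gzip, whose file format is not described here: a character trie with longest-match scanning stands in for the finite state automaton, the class and method names are hypothetical, and the variant spellings in the usage lines are illustrative. The entity ids 262 and 10101 are the ones quoted on the statistics slides.

```python
from typing import NamedTuple

END = None  # trie key marking "a complete name ends here"


class Match(NamedTuple):
    entity_id: int   # numerical name identifier
    main_name: str   # main name for that entity
    found: str       # name string found in the text
    offset: int      # position in characters (not bytes)
    length: int


class NameMatcher:
    def __init__(self, entities):
        # entities maps entity_id -> [main_name, variant, ...]
        self.entities = entities
        self.by_name = {}
        self.root = {}
        for entity_id, names in entities.items():
            for name in names:
                self.by_name[name] = entity_id
                node = self.root
                for ch in name:
                    node = node.setdefault(ch, {})
                node[END] = entity_id

    def find(self, text):
        """Scan left to right, keeping the longest known name at each position."""
        matches, i = [], 0
        while i < len(text):
            best = None
            if i == 0 or not text[i - 1].isalnum():  # start on a word boundary
                node, j = self.root, i
                while j < len(text) and text[j] in node:
                    node = node[text[j]]
                    j += 1
                    at_boundary = j == len(text) or not text[j].isalnum()
                    if END in node and at_boundary:
                        best = (node[END], j)
            if best is None:
                i += 1
                continue
            entity_id, end = best
            matches.append(Match(entity_id, self.entities[entity_id][0],
                                 text[i:end], i, end - i))
            i = end
        return matches

    def variants_of(self, name):
        """Return all known spellings of the entity that `name` belongs to."""
        entity_id = self.by_name.get(name)
        return list(self.entities[entity_id]) if entity_id is not None else []


matcher = NameMatcher({
    262: ["Muammar Gaddafi", "Moammar Gadhafi"],
    10101: ["European Union", "Europäische Union"],
})
for hit in matcher.find("Die Europäische Union und Moammar Gadhafi"):
    print(hit)
```

Offsets in this sketch count Python characters; a tool working on UTF-8 files may report byte offsets instead, which differ once non-ASCII characters appear.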
  20. Treatment of morphological inflections
       • The recognition of morphological inflections used in the EMM processing chain is not currently part of JRC-Names.
       • We are working on a solution to include morphological processing in a future release of JRC-Names.
       • Further variants will also be included more consistently:
           • Hyphenation (e.g. Yves Saint-Laurent vs. Yves Saint Laurent)
           • Names with and without name ‘infixes’ (e.g. Khan al Khalil vs. Khan Khalil)
           • Abbreviations (e.g. Saint vs. St.)
           • …
       • Current solution: add the approximately 45,000 full forms of inflected names, as found in EMM processing results since January 2011, to the resource file entities.gzip.
       • This helps to recognise the most frequent inflection forms of the frequent names.
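The three variant types named on this slide lend themselves to pre-generation, in the same spirit as adding full forms to the resource file. The sketch below is one possible way to do that, not the planned JRC-Names solution: the examples are taken from the slide, while the function name `spelling_variants` and the infix list beyond "al" are assumptions.

```python
import re

ABBREVIATIONS = {"Saint": "St."}  # example from the slide
INFIXES = {"al", "el", "bin"}     # hypothetical list; the slide only shows "al"


def spelling_variants(name):
    """Return the name plus hyphenation, abbreviation and infix variants."""
    # Hyphenation: "Saint-Laurent" vs. "Saint Laurent".
    variants = {name, name.replace("-", " ")}
    # Abbreviations: replace the full word by its short form.
    for full, short in ABBREVIATIONS.items():
        pattern = re.compile(r"\b" + re.escape(full) + r"\b")
        for form in list(variants):
            variants.add(pattern.sub(lambda _m: short, form))
    # Infixes: drop known particles that are neither first nor last word.
    for form in list(variants):
        parts = form.split()
        kept = [p for k, p in enumerate(parts)
                if not (0 < k < len(parts) - 1 and p.lower() in INFIXES)]
        variants.add(" ".join(kept))
    return variants


print(sorted(spelling_variants("Yves Saint-Laurent")))
print(sorted(spelling_variants("Khan al Khalil")))
```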
  22. Further JRC/EC-provided multilingual linguistic resources
       • JRC-Acquis (2006): 1 billion word parallel corpus in 22 languages
       • DGT-TM (2007): translation memory in 22 languages; up to 2 million segments
           → DGT-TM-2011 (forthcoming): 23 languages; 4 million segments? Yearly updates
       • JEX (JRC Eurovoc Indexer) (forthcoming): software to automatically label texts according to the thousands of categories of the Eurovoc thesaurus; 23 languages.
       • Further smaller resources:
           • Multilingual summary evaluation data (2010): 4 clusters for each of 7 languages
           • Sentiment-annotated collection of quotations (2010): English (German forthcoming)
           • Multilingual Named Entity-annotated parallel corpus (forthcoming)
       • Available at http://langtech.jrc.ec.europa.eu/, section on ‘Resources’
