Entities, Time and Events in BiographyNet and NewsReader


Published in: Technology, Education

  1. Entities, Time and Events in BiographyNet & NewsReader. Antske Fokkens, VU University. Monday, November 11, 13
  2. Acknowledgement (people): The work in this presentation was carried out by or with Agata Cybulska, Marieke van Erp and Piek Vossen; Niels Ockeloen, Serge ter Braake, Willem Robert van Hage, Jesper Hoeksema, Sara Tonelli, Rachele Sprugnoli, Luciano Serafini, Aitor Soroa, German Rigau and others.
  3. Overview: mini introduction to BiographyNet; mini introduction to NewsReader; representing entities and events.
  4. BiographyNet: An interdisciplinary project involving history, computer science and computational linguistics. Goal: inspire new historical research by identifying relations between people and events in biographical dictionaries.
  5. NLP in BiographyNet: The Biography Portal of the Netherlands: 125,000 biographies from 23 sources describing 76,000 people; text and metadata. Role of NLP: identify information in text; study differences in style and focus.
  6. BiographyNet use cases: Analysis of groups of individuals (e.g. who were governors-general of the Dutch Indies?). More complex questions, e.g. the relation between influential people in the Dutch colonies and the current Dutch elite. Perspectives: how are people and events judged in different sources?
  7. BiographyNet data: Biographical text in Dutch. Heterogeneous corpus: 23 sources, texts from the 17th century to now. Metadata about basic facts: high quality (few errors), but completeness varies.
  8. BiographyNet text mining: First step: fill gaps in the metadata (basic supervised machine learning system). Next steps: create timelines for individuals; identify relations between people; identify events and relations between them.
  9. BiographyNet methodology: The output of NLP tools is used by other researchers. They should have insight into the performance of the tools and the approaches that are used. Provenance information plays a vital role.
  10. NewsReader: Automatically process massive streams of daily news from thousands of sources in 4 different languages. Project partners: VU University Amsterdam, LexisNexis and Synerscope (the Netherlands); Basque University (Spain); ScraperWiki (UK); Fondazione Bruno Kessler (Italy).
  11. NewsReader: What happened, where, when, and who was involved? Which temporal and causal relations hold between events, and what does that tell us about the people involved? Place the accumulated result in a knowledge store that can handle dynamic growth of information: a history recorder.
  12. NewsReader Big Data: Focus: the financial crisis, e.g. what is the impact of the financial crisis on the car industry? Big Data: LexisNexis estimates that there are 1-2 million news articles per day, and that their archive has 10 million English news articles about the car industry from the last 10 years.
  13. NewsReader narratives: What are the stories that are being told by all this data? Challenges: duplicates, overlap and repetitions (how to distinguish old from new?); single results tell only parts of the story; results can be inconsistent; news is opinionated and colored.
  14. NewsReader overall approach: Resolve all mentions of events, their participants, locations and time in texts and other resources. Determine coreference and other relations between them. Combine all information from coreferring event mentions around a hypothetical event instance (independent of the text). Combine instances into storylines.
  15. NLP pipeline (architecture diagram; labels only). Input: LexisNexis documents; input data storage (storage of original input data). NLP modules: tokenizer + sentence splitter, POS tagger, parser, NER, NED (client and server), WSD (client and server), time expressions, coreference resolution, SRL, event detection, event coreference, opinion detection, factuality, inference; some of these are marked as processes that can be carried out in any order at this stage. Knowledge Store (KS): layers for resource, mention, entity, and statement + context; HBase + Hadoop; triple store holding RDF triples + named graphs; distributed and replicated for scalability and fault tolerance; frontend with API implementation over the layers; management scripts (start/stop, backup/restore, configuration, statistics gathering). Consumers: visualisation (Synerscope) and story understanding (event relations). Other labels: "runs in virtual machine", "partial replication", EHU, VUA, FBK.
  16. Both projects: Accumulate information about the same entities and events from various sources. Must deal with different perspectives and with contradictory and partial information.
  17. Grounded Annotation Framework (GAF): Sources report on events and entities: event mentions and entity mentions. URIs represent instances of these entities and events in reality. GAF links instances to mentions. Information from mentions in other sources is merged with known information around the instance.
  18. A GAF example (figure; labels only). Timeline "changes in the world", 2004 to 2009: SEM events (temblor, tsunami), USS Jimmy Carter, energy weapon, future tsunami, tsunami alert system. Timeline "publication of sources", 2004 to 2013: annotations (NAF, TAF) linking sources to instances; source types: sensor data, direct event report, delayed event report, future event report. Example source texts: "The catastrophe four years ago devastated Indian Ocean community and killed more than 230,000 people, over 170,000 of them in Aceh at northern tip of Sumatra Island of Indonesia." and "..., the vessel is the party responsible for the 2004 Indian Ocean tsunami that killed 230,000 people. Apparently, the submarine was able to trigger seismic activity via some kind of directed energy weapon."
  19. Linguistic information in GAF: The NLP Annotation Format (NAF) and Knowledge Annotation Format (KAF): stand-off layered annotation (LAF compatible), separating mentions from instances. NLP Interchange Format (NIF): RDF and URIs, inline annotation. Compatible with PROV-DM.
  20. Events in GAF: extended Simple Event Model (SEM): RDF representations of event instances with participant, location and time; can represent contradictory information.
  21. GAF from NAF + SEM: Can accumulate information from different sources. Can represent repeated information as a single relation (with links to all sources that provided this information). Can represent contradictory information. Is compatible with PROV-DM.
  22. Acknowledgements: Supported by the European Union’s 7th Framework Programme via the NewsReader project (ICT-316404). Supported by the BiographyNet project (nr. 660.011.308), funded by the Netherlands eScience Center (http://
  23. References. GAF: Fokkens, Antske, Marieke van Erp, Piek Vossen, Sara Tonelli, Willem Robert van Hage, Luciano Serafini, Rachele Sprugnoli and Jesper Hoeksema. 2013. GAF: A Grounded Annotation Framework for Events. Proceedings of the First Workshop on EVENTS: Definition, Detection, Coreference and Representation. Atlanta, USA. Van Erp, Marieke, Antske Fokkens, Piek Vossen, Sara Tonelli, Willem Robert van Hage, Luciano Serafini, Rachele Sprugnoli and Jesper Hoeksema. 2013. Denoting Data in the Grounded Annotation Framework. ISWC 2013 Posters and Demos. Sydney, Australia, 21-25 October 2013.
  24. References. SEM: Van Hage, Willem Robert, Véronique Malaisé, Roxane Segers, Laura Hollink and Guus Schreiber. 2011. Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web 9, no. 2: 128-136. Cross-document coreference: Cybulska, Agata, and Piek Vossen. 2013. Semantic Relations between Events and their Time, Locations and Participants for Event Coreference Resolution. Proceedings of RANLP 2013.
  25. References. Named Entity Recognition: Van Erp, Marieke, Giuseppe Rizzo and Raphaël Troncy. 2013. Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning. #MSM2013 Concept Extraction Challenge. Rio de Janeiro, Brazil, May 2013. Provenance: Ockeloen, Niels, Antske Fokkens, Serge ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber and Susan Legêne. 2013. BiographyNet: Managing Provenance at multiple levels and from different perspectives. Proceedings of the Workshop on Linked Science 2013 (LISC2013).
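
The GAF slides (17, 20 and 21) describe three behaviours: instances are linked to their mentions, repeated information becomes a single relation that points to every source that provided it, and contradictory information is kept rather than overwritten. The following is a minimal sketch of those behaviours in plain Python, not the project's implementation. The class `GroundedStore`, its methods, the `ex:` identifiers, the source names and the time values are all hypothetical and chosen for illustration only; the predicate strings are used as plain labels, and no RDF library is involved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A text span in one source that refers to an instance (hypothetical type)."""
    source: str  # identifier of the source document
    start: int   # character offset where the mention starts
    end: int     # character offset where the mention ends


class GroundedStore:
    """Toy store: instance URIs, their mentions, and sourced statements."""

    def __init__(self):
        # instance URI -> set of mentions that denote it
        self.mentions = defaultdict(set)
        # (subject, predicate, object) -> set of sources asserting it
        self.statements = defaultdict(set)

    def add_mention(self, instance, mention):
        """Link an instance to one more mention of it."""
        self.mentions[instance].add(mention)

    def add_statement(self, subject, predicate, obj, source):
        """Record a relation; a repeat only adds a source to the same relation."""
        self.statements[(subject, predicate, obj)].add(source)

    def values(self, subject, predicate):
        """Return every asserted object with its sources, contradictions included."""
        return {
            o: sorted(sources)
            for (s, p, o), sources in self.statements.items()
            if s == subject and p == predicate
        }


# Illustrative data only: identifiers, offsets and values are invented.
store = GroundedStore()
store.add_mention("ex:tsunami", Mention("source-a", 4, 15))
store.add_mention("ex:tsunami", Mention("source-b", 40, 67))
store.add_statement("ex:tsunami", "sem:hasTime", "2004", "source-a")
store.add_statement("ex:tsunami", "sem:hasTime", "2004", "source-b")
store.add_statement("ex:tsunami", "sem:hasTime", "2005", "source-c")

print(store.values("ex:tsunami", "sem:hasTime"))
```

In this sketch the two agreeing sources end up on one relation, while the conflicting value from the third source is stored next to it, so a consumer can see who claimed what and decide how to weigh the sources.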