News Search Using Discourse Analytics


Enhancing access to information within digital heritage archives, e.g. the New York Times corpus, by identifying discourse phenomena and searching and filtering events according to multiple facets.

Published in: Technology, Education

  1. News Search Using Discourse Analytics. Claudiu Mihăilă, The National Centre for Text Mining, The University of Manchester
  2. Data: Growing. Exponential growth of data; information overload
  3. Data: Pouring. Exponential growth of data; information overload; data deluge
  4. Data: Processing. Exponential growth of data; information overload; data deluge. Can we process a deluge of data in a useful manner?
  5. Searching. Give a query as input; obtain a set of relevant articles. Keyword v. semantics: – Synonyms – Hyponyms – Spelling variants – Inflections – Relations between query terms
  6. Searching: Keywords. Crimes in the town of Sandwich
  7. Searching: Keywords. Crimes in the town of Sandwich – Crime Sandwich by Click Bang Productions on SoundCloud – Sandwich Crime - Topix – Crime on rye: Four accused of stealing $10 sandwich from car – Crime Scene Sandwich Bags – Crime rate in Sandwich, Illinois (IL): murders, rapes, robberies – Ham Sandwich Nation: Due Process When Everything is a Crime
  8. Searching: Semantics. Crimes in the town of Sandwich
  9. Searching: Semantics. Crimes in the town of Sandwich – Kent Police issue warning after fake £20 notes reported in Sandwich – Trio jailed for total of 30 years after crime spree in Sandwich – Murder at Sandwich - Kent
  10. Semantic search engine: Features. Specification of semantic types of search terms: town:Sandwich. Normalisation of semantic entities: Sandwich, Kent = Sandwich, UK. Relations between search terms to describe events: location:Sandwich. Restrictions on discourse context of retrieved events
  11. Structured events: The event
  12. Discourse interpretation: The story. Karl Munro may have killed Sunita in Weatherfield in 2013. According to Karl Munro, Craig Tinker set Sunita on fire in Weatherfield in 2013. Karl Munro said he will kill Sunita. Karl Munro didn’t fail to kill Sunita in Weatherfield in 2013. Stella Price condemned all of Karl’s wrongdoings.
  13. ACE corpus, 2005 version. 599 news-domain documents: – News articles – Transcripts of broadcast news – Transcripts of broadcast conversation – Conversational telephone speech – Weblogs – Discussion fora. Discourse-related attributes: – Polarity – Tense – Specificity – Modality – Source type – Subjectivity
  14. Discourse context of events: Scheme
  15. New York Times corpus: Digital archive. 20 years' worth of news articles – 1.8M. Includes annotations of – Metadata – Named entities – Normalisation. Facilitates diachronic studies – Language evolution – Social change – Development of events
  16. ISHER: Semantically enabled searching. Web-based; user-friendly interface; intuitive query-building mechanism; refining/filtering according to facets
  17. ISHER: Automatic Event Recognition - EventMine. Miwa, Thompson, Ananiadou (2012). Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics, 28(13), 1759-1765
  18. ISHER: Web-based interface – “Coronation Street”
  19. ISHER: Semantic clustering. Lingo – 3rd party NaCTeM clustering
  20. ISHER: Semantic clustering. Cluster summarisation
  21. ISHER: Metadata in the NYT corpus
  22. ISHER: Entities
  23. ISHER: Events
  24. ISHER: Events. Prime Minister Tony Blair’s election last month
  25. Final remarks: Other domains. The same technique can be adapted to other domains. Previously developed: – EUPMC: medical journal articles – ASCOT: clinical trials
  26. Final remarks. Summary: – Enhanced access to information within digital heritage archives (NYT) – Identified discourse phenomena to search for and filter events – Created ISHER, a semantic search engine to access the NYT corpus. Future work: – Apply to new domains and institutional repositories – Customise towards social unrest – Other languages in danger of digital extinction (Meta-Net) – Diachronic studies
  27. Thank you!
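
The typed search terms on slide 10 (town:Sandwich, location:Sandwich) and the discourse restrictions illustrated by the story on slide 12 can be sketched roughly as follows. This is only an illustration under stated assumptions, not ISHER's actual implementation or API: the Event fields, the ALIASES table, and the parse_term and search helpers are all hypothetical names, and the attribute values (such as "speculated") are simplified labels chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A structured event plus a few discourse attributes (hypothetical schema)."""
    trigger: str
    participants: tuple
    location: str
    polarity: str   # "positive" or "negative"
    modality: str   # "asserted" or "speculated"
    source: str     # "author", or the person the claim is attributed to


# Hypothetical normalisation table: lower-cased surface form -> canonical entity.
ALIASES = {
    "sandwich, kent": "Sandwich, UK",
    "sandwich, uk": "Sandwich, UK",
}


def normalise(name):
    """Map a surface form to its canonical entity; unknown names pass through."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def parse_term(term):
    """Split a 'type:value' search term into (semantic type, normalised value)."""
    semantic_type, _, value = term.partition(":")
    return semantic_type.strip().lower(), normalise(value)


def search(events, term, **facets):
    """Return events whose typed field matches the term and every given facet.

    The semantic type is assumed to name an Event field (e.g. 'location').
    """
    field, value = parse_term(term)
    return [
        e for e in events
        if normalise(getattr(e, field)) == value
        and all(getattr(e, key) == wanted for key, wanted in facets.items())
    ]


# Two sentences from the slide 12 story, encoded by hand for illustration.
EVENTS = [
    # "Karl Munro may have killed Sunita in Weatherfield in 2013."
    Event("killed", ("Karl Munro", "Sunita"), "Weatherfield",
          "positive", "speculated", "author"),
    # "According to Karl Munro, Craig Tinker set Sunita on fire in Weatherfield in 2013."
    Event("set on fire", ("Craig Tinker", "Sunita"), "Weatherfield",
          "positive", "asserted", "Karl Munro"),
]

speculated = search(EVENTS, "location:Weatherfield", modality="speculated")
attributed = search(EVENTS, "location:Weatherfield", source="Karl Munro")
```

Here the location term selects both events, and each discourse facet then narrows the result to one of them, which is the kind of filtering the deck describes as restricting the discourse context of retrieved events.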