News Search Using Discourse Analytics

Claudiu Mihăilă
The National Centre for Text Mining
The University of Manchester
Enhancing access to information within digital heritage archives, e.g. the New York Times corpus, by identifying discourse phenomena and searching and filtering events according to multiple facets.

Published in: Technology, Education
Transcript of "News Search Using Discourse Analytics"

  1. News Search Using Discourse Analytics. Claudiu Mihăilă, The National Centre for Text Mining, The University of Manchester
  2. Data: Growing. Exponential growth of data; information overload
  3. Data: Pouring. Exponential growth of data; information overload; data deluge
  4. Data: Processing. Exponential growth of data; information overload; data deluge. Can we process a deluge of data in a useful manner?
  5. Searching. Give a query as input; obtain a set of relevant articles. Keyword v. Semantics: synonyms; hyponyms; spelling variants; inflections; relations between query terms
  6. Searching: Keywords. Query: Crimes in the town of Sandwich
  7. Searching: Keywords. Query: Crimes in the town of Sandwich. Results: – Crime Sandwich by Click Bang Productions on SoundCloud – Sandwich Crime - Topix – Crime on rye: Four accused of stealing $10 sandwich from car – Crime Scene Sandwich Bags – Crime rate in Sandwich, Illinois (IL): murders, rapes, robberies – Ham Sandwich Nation: Due Process When Everything is a Crime
  8. Searching: Semantics. Query: Crimes in the town of Sandwich
  9. Searching: Semantics. Query: Crimes in the town of Sandwich. Results: – Kent Police issue warning after fake £20 notes reported in Sandwich – Trio jailed for total of 30 years after crime spree in Sandwich – Murder at Sandwich - Kent
  10. Semantic search engine: Features. Specification of semantic types of search terms (town:Sandwich); normalisation of semantic entities (Sandwich, Kent = Sandwich, UK); relations between search terms to describe events (location:Sandwich); restrictions on discourse context of retrieved events
  11. Structured events: The event
  12. Discourse interpretation: The story. Karl Munro may have killed Sunita in Weatherfield in 2013. According to Karl Munro, Craig Tinker set Sunita on fire in Weatherfield in 2013. Karl Munro said he will kill Sunita. Karl Munro didn’t fail to kill Sunita in Weatherfield in 2013. Stella Price condemned all of Karl’s wrongdoings.
  13. ACE corpus: 2005 version. 599 news-domain documents: – News articles – Transcripts of broadcast news – Transcripts of broadcast conversation – Conversational telephone speech – Weblogs – Discussion fora. Discourse-related attributes: polarity; tense; specificity; modality; source type; subjectivity
  14. Discourse context of events: Scheme
  15. New York Times corpus: Digital archive. 20 years' worth of news articles (1.8M). Includes annotations of: – Metadata – Named entities – Normalisation. Facilitates diachronic studies: – Language evolution – Social change – Development of events
  16. ISHER: Semantically enabled searching. Web-based; user-friendly interface; intuitive query-building mechanism; refining/filtering according to facets
  17. ISHER: Automatic Event Recognition - EventMine. Miwa, Thompson, Ananiadou. (2012). Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics, 28(13), 1759-1765
  18. ISHER: Web-based interface – “Coronation Street”
  19. ISHER: Semantic clustering. Lingo (3rd party); NaCTeM clustering
  20. ISHER: Semantic clustering. Cluster summarisation
  21. ISHER: Metadata in the NYT corpus
  22. ISHER: Entities
  23. ISHER: Events
  24. ISHER: Events. Prime Minister Tony Blair’s election last month
  25. Final remarks: Other domains. Same technique can be adapted to other domains. Previously developed: – EUPMC – medical journal articles – ASCOT – clinical trials
  26. Final remarks. Summary: – Enhanced access to information within digital heritage archives (NYT) – Identified discourse phenomena to search for and filter events – Created ISHER, semantic search engine to access the NYT corpus. Future work: – Apply to new domains and institutional repositories – Customise towards social unrest – Other languages in danger of digital extinction – Meta-Net – Diachronic studies
  27. Thank you!
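
The features listed on slide 10 (typed search terms, entity normalisation, restrictions on discourse context) can be illustrated with a small sketch. This is not ISHER's implementation or query syntax: the `Event` record, the `CANONICAL` alias table, the `;`-separated `key:value` query format and the attribute values assigned to the slide 12 sentences are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields

# Hypothetical alias table: different surface forms of one entity map to a
# single canonical name (slide 10: "Sandwich, Kent = Sandwich, UK").
CANONICAL = {
    "sandwich, kent": "Sandwich, UK",
    "sandwich, uk": "Sandwich, UK",
}


@dataclass(frozen=True)
class Event:
    """A structured event plus its discourse context (illustrative fields)."""
    trigger: str
    agent: str
    location: str
    polarity: str   # "positive" or "negative"
    modality: str   # "asserted" or "speculated"
    tense: str      # "past", "present" or "future"
    source: str     # "author" or the name of the reported speaker


FIELDS = {f.name for f in fields(Event)}


def normalise(value: str) -> str:
    """Map a surface form to its canonical name, if one is known."""
    value = value.strip()
    return CANONICAL.get(value.lower(), value)


def parse_query(query: str) -> dict:
    """Parse 'key:value; key:value' into a dict of typed constraints."""
    constraints = {}
    for term in query.split(";"):
        term = term.strip()
        if not term:
            continue
        key, sep, value = term.partition(":")
        key = key.strip()
        if not sep or key not in FIELDS:
            raise ValueError(f"unsupported query term: {term!r}")
        constraints[key] = value.strip()
    return constraints


def search(events, query: str):
    """Return the events whose attributes satisfy every constraint."""
    wanted = parse_query(query)
    return [
        ev for ev in events
        if all(normalise(getattr(ev, k)) == normalise(v) for k, v in wanted.items())
    ]


# Hand-assigned readings of three sentences from slide 12 (assumed values).
EVENTS = [
    Event("kill", "Karl Munro", "Weatherfield", "positive", "speculated", "past", "author"),
    Event("set on fire", "Craig Tinker", "Weatherfield", "positive", "asserted", "past", "Karl Munro"),
    Event("kill", "Karl Munro", "", "positive", "asserted", "future", "Karl Munro"),
]

if __name__ == "__main__":
    for ev in search(EVENTS, "location:Weatherfield; modality:speculated"):
        print(ev)
```

Keeping the discourse attributes as ordinary fields next to the event arguments means a facet such as `modality:speculated` is handled the same way as a participant constraint such as `location:Weatherfield`.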