Mark A Greenwood, Jonathon Hare,
David R Newman, Wim Peters
SemanticMedia@TheBritishLibrary
Monday 23rd September 2013

Semanticnews 230913-final

Slides presented about the SemanticNews project at the SemanticMedia@TheBritishLibrary event on September 23rd 2013.

Published in: Technology, Education

  1. Mark A Greenwood, Jonathon Hare, David R Newman, Wim Peters. SemanticMedia@TheBritishLibrary, Monday 23rd September 2013
  2. The Project Vision • Semantic News is a 6-month project: • June to November 2013 • Two 50% FTEs (1 Southampton, 1 Sheffield) • An interactive ‘second screen’ to provide contextual information on Question Time questions • Use multiple data sources • Perform named entity recognition • Exploit Linked Open Datasets • Towards an almost real-time system
  3. Where is the Data? (1) • Question Time in 2010 • 34 episodes, 163 questions • BBC Subtitles • XML encoded • Broadcast as the subtitles stream
  4. Where is the Data? (2) • BBC Programmes Data • XML encoded • Information about the programme (panellists, topics, broadcast dates, etc.) • Tweets • Taken from the Twitter ‘Garden Hose’ (10% stream)
  5. Pre-parsing Subtitles Data • Raw XML subtitles • Remove duplicate words • Parse into CSV • time offset • sentence • Break into questions • BBC Programmes data provides question time offsets • Compare with subtitles time offsets and split
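The subtitle steps on slide 5 can be sketched in Python. This is a minimal illustration, not the project's code: it assumes "duplicate words" means a word repeated immediately after itself, that offsets are in seconds, and that the question start offsets are already sorted. The function names and the sample values in the usage note are hypothetical.

```python
import bisect
import csv
import io


def collapse_repeats(text):
    """Drop a word when it repeats the word immediately before it.

    Assumption: duplicates in the subtitle stream are consecutive repeats.
    """
    kept = []
    for word in text.split():
        if not kept or word.lower() != kept[-1].lower():
            kept.append(word)
    return " ".join(kept)


def to_csv(rows):
    """Serialise (time_offset, sentence) rows as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["time_offset", "sentence"])
    writer.writerows(rows)
    return buffer.getvalue()


def split_by_question(rows, question_offsets):
    """Group sentences under the question that was live at their offset.

    question_offsets must be sorted ascending. Sentences spoken before the
    first question starts are grouped under index -1.
    """
    groups = {}
    for offset, sentence in rows:
        index = bisect.bisect_right(question_offsets, offset) - 1
        groups.setdefault(index, []).append(sentence)
    return groups
```

For example, with made-up rows `[(5.0, "Welcome"), (62.5, "First question"), (900.0, "Second question")]` and question offsets `[60.0, 850.0]`, the first row falls before any question (index -1), the second under question 0, and the third under question 1.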
  6. Pre-parsing Twitter Data • Twitter ‘Garden Hose’ for 2010 Dataset • Used Apache Hadoop and filtered on: • @bbcqt, @bbcquestiontime • #bbcqt, #bbcquestiontime, #questiontime • “Question Time”, “David Dimbleby” • Collated JSON results and imported into OpenRefine • Removed irrelevant fields • Filtered out tweets that did not contain “bbc” • Exported as CSV
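The two filtering passes on slide 6 can be approximated in a single Python process. This is a stand-in sketch, not the Hadoop job or the OpenRefine steps the project used. It assumes one JSON tweet per line, case-insensitive substring matching, and that `id`, `created_at` and `text` are the fields worth keeping; all three choices are assumptions.

```python
import json

# Terms from slide 6, lowercased for case-insensitive matching (an assumption).
FIRST_PASS_TERMS = (
    "@bbcqt", "@bbcquestiontime",
    "#bbcqt", "#bbcquestiontime", "#questiontime",
    "question time", "david dimbleby",
)


def matches_first_pass(text):
    """True if the tweet text contains any handle, hashtag or phrase."""
    lowered = text.lower()
    return any(term in lowered for term in FIRST_PASS_TERMS)


def mentions_bbc(text):
    """Second pass: keep only tweets whose text contains "bbc"."""
    return "bbc" in text.lower()


def filter_tweets(json_lines):
    """Yield trimmed records for tweets that pass both filters."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "")
        if matches_first_pass(text) and mentions_bbc(text):
            # Dropping every other field mirrors "removed irrelevant fields".
            yield {
                "id": tweet.get("id"),
                "created_at": tweet.get("created_at"),
                "text": text,
            }
```

Under these assumptions, a tweet that only says "Question Time" passes the first filter but is dropped by the second, because it never mentions "bbc".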
  7. Information Extraction with GATE ● General Architecture for Text Engineering (GATE) ● Developed by the University of Sheffield since 2000 ● Used by many researchers, scientists and organisations all over the world ● Includes various components for language processing ● Parsers, machine learning tools, stemmers, IR tools, IE components for various languages... ● Also provides tools for visualising and manipulating text, annotations, ontologies, parse trees, etc., and for evaluation
  8. Linguistic pre-processing ● Techniques ● Tokenization ● Sentence Splitting ● Language Identification ● POS tagging ● Morphological analysis ● Adapted for use with social media like Twitter
  9. Named Entity Recognition ● Approaches ● Gazetteer lookup ● JAPE grammars ● Co-reference ● Types ● Location: countries, regions, cities etc. ● Organisation: names of companies, government organisations, committees, agencies, universities, etc. ● Person: names of people ● Date: absolute dates like ‘October 2012’ or ‘2007’, as well as relative dates, such as ‘last year’. ● Measurements: e.g. “8,596 km”, “one fifth”, percentages and probabilities
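To make the gazetteer lookup and date types on slide 9 concrete, here is a small Python sketch. It does not use GATE, its gazetteers, or JAPE; the three-entry word list and the date pattern are hypothetical stand-ins that only illustrate the idea of list lookup plus a pattern rule.

```python
import re

# Hypothetical toy gazetteer; real gazetteer lists are far larger.
GAZETTEER = {
    "ken clarke": "Person",
    "labour": "Organisation",
    "sheffield": "Location",
}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Absolute dates only: an optional month name followed by a four-digit year.
ABSOLUTE_DATE = re.compile(r"\b(?:(?:" + MONTHS + r")\s+)?(?:19|20)\d{2}\b")


def gazetteer_lookup(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, surface, type) tuples for gazetteer hits."""
    annotations = []
    for entry, entity_type in gazetteer.items():
        pattern = r"\b" + re.escape(entry) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                (match.start(), match.end(), match.group(0), entity_type)
            )
    return sorted(annotations)


def find_dates(text):
    """Return absolute date strings such as 'October 2012' or '2007'."""
    return ABSOLUTE_DATE.findall(text)
```

Relative dates such as ‘last year’, measurements, and co-reference are deliberately left out of this sketch; each would need its own rules.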
  10. Enrichment: LODIE ● Under constant development in various projects ● Associates the most probable LOD URI with named entities ● Disambiguation against DBpedia ● Various techniques to enhance recall
  11. Enrichment: LODIE “Ken Clarke: The Labour plotters hide behind the knife and stab with the cloak! Brilliant!!” “Hain just lost Labour votes by supporting the £25k benefits of an extremist.”
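The idea of picking the most probable URI for a mention can be illustrated with a toy scorer. This is not LODIE's algorithm, which the slides do not describe. The sketch simply counts how many context words a sentence shares with each candidate. The candidate table, its context words, and the DBpedia-style URIs are hand-written for illustration and were not retrieved from DBpedia.

```python
import re

# Hypothetical candidate table: mention -> possible URIs with context words.
CANDIDATES = {
    "labour": [
        {
            "uri": "http://dbpedia.org/resource/Labour_Party_(UK)",
            "context": {"party", "votes", "mp", "election", "plotters"},
        },
        {
            "uri": "http://dbpedia.org/resource/Labour_economics",
            "context": {"wages", "market", "employment", "workers"},
        },
    ],
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def most_probable_uri(mention, sentence, candidates=CANDIDATES):
    """Return the candidate URI sharing the most context words, or None.

    Ties go to the candidate listed first; unknown mentions return None.
    """
    options = candidates.get(mention.lower(), [])
    if not options:
        return None
    words = tokenize(sentence)
    best = max(options, key=lambda option: len(option["context"] & words))
    return best["uri"]
```

With the first tweet from slide 11 as the sentence, the mention "Labour" resolves to the party candidate because "plotters" appears in that candidate's context set. A real system would draw candidates and evidence from the knowledge base rather than a hand-written table.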
  12. Representing Extracted Information
  13. Conceptualising a Question http://www.youtube.com/watch?v=O3l9Mi-KylI
  14. Show Me The Data! • Use (Linked) Open Data Datasets • Crime Data • Election Data (constituencies, majorities, etc.) • MP voting records • School league tables • NHS performance league tables • Economic Figures (GDP, Inflation, Unemployment) • Compare and contrast
  15. Let’s have some questions from our audience.
