Connecting Concepts : joining up the BBC

    Connecting Concepts : joining up the BBC - Presentation Transcript

    1. Connecting Concepts : joining up the BBC Rob Lee : robl@rattlecentral.com http://www.rattlecentral.com Contextual semantic disambiguation Joining across the BBC
    2. Going to tell a bit of a story: where we ended up was different to where we thought we’d end up. Never thought we’d end up building a semantic categorisation system.
    3. + Innovation Labs How did it begin ? BBC Innovation labs
    4. BBC: ivory tower of content - above the web, not part of it. BBC News in particular - premier property, horizontal navigation isn’t bad, but links out to the wider web are missing, very ‘functional’ or unsurprising. Journalists make poor librarians; how can we breathe a little more life into this content? Where’s the ‘wilfing’?
    5. Wikipedia demonstrates wilfing really well, it has good internal and outbound links
    6. Wilfing - What Was I Looking For? We wanted to look at ways we could bring this wilfing experience to the BBC. Options: train 100 journalists to go back through the content archives, OR automate it. To automate it, the first thing we need is a data source - let’s use Wikipedia as it’s good for wilfing.
    7. Muddy Boots: a research project, tramping new trails through the pristine pastures of the BBC content. Built on some fairly simple precepts, using freely available technology.
    8. Let’s take a look at a Wikipedia page: how does it support wilfing? Good internal page links support horizontal nav.
    9. It’s not just internal links that are useful - there is typically an ‘interesting’ set of external links. Some are functional, but others are picked by users and thus likely to be interesting.
    10. How do we relate the story to our commons content? We need to relate Wikipedia data to our content, but most archives have poor classification/descriptive data - journalists make poor librarians. How can we improve the classification of content in our archives? How can we find out what it’s about?
    11. Various automated techniques are available, including Semantic Analysis and Term Extraction. We cheated - used YTE (Yahoo Term Extraction). Now we can say what characterises a story and start to relate it to Wikipedia.
    12. Too much information - we’ve established a set of relationships, but how do we pick out something useful AND relevant/interesting from all these articles? How do we rank information, and what information do we want to extract/use? We concentrated on external links, as the BBC’s external links at the time were very functional (chosen by journalists without much time?) and not very interesting - they didn’t really support wilfing.
    13. Use another 3rd party service to rank external URLs. We tried using Google rank and Technorati buzz, but they didn’t necessarily rank the interesting links highly (e.g. try searching for google.com on Google and see how many hits there are). del.icio.us is different though: users provide both context and ranking (via tagging and the process of bookmarking a URL). So for each external link from all the Wikipedia pages, we can see how many of the original extracted tags there are compared to all the tags for a URL - high matches = high rank. This tells us how relevant the URL is to the story in question, and the fact it’s been bookmarked means it’s interesting to someone.
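The tag-matching idea on this slide can be sketched roughly as follows. This is only an illustration under assumptions, not the Muddy Boots code: `rank_links`, the example URLs and the tag lists are all hypothetical, and the bookmark tags are passed in as plain data rather than fetched from del.icio.us.

```python
def rank_links(story_tags, link_tags):
    """Rank URLs by the share of their bookmark tags that match the story's terms.

    story_tags: terms extracted from the story (hypothetical input).
    link_tags:  {url: [tags users gave that URL]} (hypothetical input).
    """
    story = {t.lower() for t in story_tags}
    scores = {}
    for url, tags in link_tags.items():
        if not tags:
            # Never bookmarked: no evidence that anyone found it interesting.
            continue
        matches = sum(1 for t in tags if t.lower() in story)
        scores[url] = matches / len(tags)
    # High proportion of matching tags = high rank.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up example data, for illustration only.
story_tags = ["apple", "iphone", "steve jobs"]
link_tags = {
    "http://example.com/iphone-review": ["apple", "iphone", "review", "gadgets"],
    "http://example.com/fruit": ["apple", "recipes", "cooking", "fruit", "food"],
    "http://example.com/unbookmarked": [],
}
ranked = rank_links(story_tags, link_tags)
```

Dividing by the total number of tags on a URL penalises links that mention a story term only in passing among many unrelated tags.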
    14. Recent story - the journalist has added good related links this time. Interestingly, the Muddy Boots system has ‘recommended’ many of the same links independently - nice verification of the method.
    15. Problems: real-time performance - currently takes about two minutes, due to API access and data-set sizes (we have a local mirror of the Wikipedia db), but del.icio.us queries are expensive; story classification is sometimes incorrect; doesn’t always work - sometimes the coverage isn’t there in del.icio.us (not such a biggie); disambiguation can be a big problem, causing false positives - the main issue; language and geographical relevance.
    16. Apple or Apple? A more semantic approach can help: if we can say this is a person or a company or a place rather than just a piece of text, then we can be more certain about what it is and thus be less ambiguous. It would be good if we had a single point of reference for this ‘thing’.
    17. Simplify the problem: unambiguously identify the “main actors” in a news story, then add semantic markup for them. Answers the “who” in who, what, where, why ... http://www.flickr.com/photos/donnagrayson/195244498/sizes/l/ Back to basics: only try to do one thing well - and solve the problem of ambiguity.
    18. In order to solve the disambiguation problem, we could use Wikipedia’s disambiguation pages. But Wikipedia is difficult for machines to work with, hence we use DBpedia. It’s also possible to link other data sources such as geonames.org or MusicBrainz, so we can find out more information than was previously possible using Wikipedia alone. We’re building a semantic categorisation system.
    19. Page for ‘Apple’ on DBpedia -> lots of data
    20. For a given term we can find a) if it’s ambiguous and b) what the term could possibly mean. It’s possible to use this information to help determine which meaning we are actually referring to in the original text.
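One simple way to picture this step is to compare the story text against a short description of each candidate meaning and pick the best match. The sketch below is an assumption-laden illustration, not the method used in the talk: `CANDIDATES` is a hypothetical in-memory stand-in for what a DBpedia lookup might return, and the descriptions are made up.

```python
# Hypothetical stand-in for candidate meanings a DBpedia lookup might return.
CANDIDATES = {
    "apple": {
        "Apple_Inc.": "american technology company that makes the iphone ipod and mac computers",
        "Apple": "fruit of the apple tree grown in orchards and eaten fresh or cooked",
    }
}


def is_ambiguous(term):
    """a) A term is ambiguous if it has more than one candidate meaning."""
    return len(CANDIDATES.get(term.lower(), {})) > 1


def disambiguate(term, context):
    """b) Pick the candidate whose description shares most words with the context.

    Plain word overlap is a deliberately crude score; a real system would at
    least drop stop words and weight rarer terms more heavily.
    """
    candidates = CANDIDATES.get(term.lower(), {})
    if not candidates:
        return None
    context_words = set(context.lower().split())

    def overlap(item):
        _, description = item
        return len(context_words & set(description.split()))

    best, _ = max(candidates.items(), key=overlap)
    return best


story = "Apple unveiled a new iphone and updated its mac computers"
meaning = disambiguate("Apple", story)
```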
    21. Entity extraction: Yahoo Term Extraction, TagThe.net, voting system. Leveraging existing web services to perform entity extraction is useful, especially when employing a voting system. We also used a local named entity extraction service; this will be more useful in the future as we have some direction over its evolution.
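A voting system over several extractors might look something like the sketch below. The slides do not say how votes were counted, so the one-vote-per-extractor rule and the `min_votes` threshold are assumptions, and the three result lists are invented rather than real service output.

```python
from collections import Counter


def vote(extractor_outputs, min_votes=2):
    """Keep entities that at least `min_votes` extractors agree on."""
    votes = Counter()
    for entities in extractor_outputs:
        # Each extractor gets one vote per entity, however often it repeats it.
        votes.update({e.strip().lower() for e in entities})
    return sorted(e for e, n in votes.items() if n >= min_votes)


# Invented outputs standing in for three extraction services.
yahoo_terms = ["Apple", "Steve Jobs", "keynote"]
tagthe_terms = ["apple", "Steve Jobs", "San Francisco"]
local_ner = ["Apple", "San Francisco", "iPhone"]

agreed = vote([yahoo_terms, tagthe_terms, local_ner])
```

Requiring agreement between services trades some recall for fewer spurious entities, which matters when the next step is automated markup.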
    22. We can also say that this is a company (or, if it was Steve Jobs for example, a person). How about we use this information to mark up our original content with hCard microformats? Suddenly we’ve created a way for semantic aggregators to know our story is really about Apple the company - we can drive more targeted, better quality search traffic to our site - new routes into the content.
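As a rough illustration of the markup step, the sketch below wraps the first mention of each classified entity in minimal hCard markup. The function names and the example sentence are hypothetical; the class names follow the hCard convention of `vcard` as the root, `fn` for the name, and `fn org` on the same element for an organisation.

```python
import html
import re


def hcard(name, kind):
    """Return minimal hCard markup for a person or a company."""
    # In hCard, putting "fn" and "org" on one element marks an organisation.
    classes = "fn org" if kind == "company" else "fn"
    return '<span class="vcard"><span class="%s">%s</span></span>' % (
        classes,
        html.escape(name),
    )


def mark_up(text, entities):
    """Wrap the first mention of each entity in hCard markup.

    entities: {name: "company" or "person"} (hypothetical classifier output).
    Simplification: names are matched as whole words in the escaped text, so
    names ending in punctuation or overlapping each other are not handled.
    """
    out = html.escape(text)
    for name, kind in entities.items():
        pattern = r"\b%s\b" % re.escape(html.escape(name))
        out = re.sub(pattern, lambda m: hcard(name, kind), out, count=1)
    return out


marked = mark_up(
    "Steve Jobs said Apple will ship the phone in June.",
    {"Apple": "company", "Steve Jobs": "person"},
)
```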
    23. One possible workflow: extract (& classify) entities -> find in DBpedia / Wikipedia -> classify entities via DBpedia -> extract required attributes -> parse content & mark up. Entity extraction - many methods available; entity classification via DBpedia is very extensible; finally microformat markup, good for machines and the semantic web as a whole.
    24. contxt: Chris Sizemore’s approach to using Wikipedia as a controlled vocabulary - can produce interesting ‘human’ categorisation results.
    25. Sample of MuddyBoots output, classification of a BBC news article. Demonstrates ‘main actor’ discovery and automated microformatting and inclusion of extra content from DBpedia in a ‘featured actors’ sidebar. The inclusion of microformats means machines can now query this page in a more granular fashion
    26. Added bonus of creating semantic links and using ‘web scale identifiers’. BBC Music beta aggregates around MusicBrainz identifiers and DBpedia knows about MusicBrainz, therefore we can provide news feeds for any artist on BBC Music beta using this relationship. Demonstrates how a common controlled vocab can help us both join up the BBC and link out to other web databases.
    27. Where next? Precision and recall testing with the BBC. Trialling the technology in a production environment. Muddy.it service: for those of you who are interested in the technology, go have a look at http://www.muddy.it. We’re testing it with beta partners at the moment and you can go register your interest.
    28. In summary: using DBpedia as a controlled vocabulary has a number of benefits: no maintenance required - it’s performed by the community; use of web scale identifiers allows you to link your content to other web scale databases, e.g. MusicBrainz; there’s lots of information within DBpedia that allows you to move beyond NLP and into contextual semantic disambiguation.
    29. Photos: http://www.flickr.com/photos/evapro/305689596/ http://www.flickr.com/photos/jvk/141284308 http://www.flickr.com/photos/aquatic/500443103/ http://www.flickr.com/photos/donnagrayson/195244498/ http://flickr.com/photos/garrulus/82714475/ http://www.flickr.com/photos/bekathwia/2120050762/

    monkeyhelper, 8 months ago

    A presentation given at ISOKUK Semantic Web Categor...

    © All Rights Reserved