The Semantic Web and the News:
Exploitation and Adoption

Ken Ellis
Chief Scientist

The Semantic Web And The News

An overview of the ways in which news organizations leverage Semantic Web resources such as DBPedia, Freebase, and Wikipedia, and of the adoption of Semantic Web standards.

Published in: Technology, News & Politics

Transcript of "The Semantic Web And The News"

  1. The Semantic Web and the News: Exploitation and Adoption
     Ken Ellis, Chief Scientist
  2. Agenda
     Intro to Daylife
     Exploiting the Semantic Web
       Named Entities
       Toolsets, issues
     Adopting / Enabling
       Others
       Daylife
  3. Daylife
     A Platform for News Innovation:
     A scalable solution for publishers of all sizes to generate more content and more inventory, with no additional personnel costs
  4. Daylife: What We Do
     Aggregate Content
       Licensed photos (Getty, AP, Reuters)
       Articles (scraped, real-time)
     Create Metadata
       Topics (people, organizations, concepts)
       Topic taxonomy, descriptions
       Quotes with attribution
       Photo identification
       Relatedness
       Authorship, sentiment analysis, etc.
     Deliver to Clients
       Web Sites / Modules / Data
       Flexibility: API w/ 500 distinct queries
       Novel search/ranking algorithms
       Free API
  5. [Wiki|DB]Pedia and Named Entities
     We also want to collect content around a named entity
     …and associate it with external data (Wikipedia, Freebase)
  6. [Wiki|DB]Pedia and Named Entities
     … for a lot of NEs (55k newsworthy ones last month)
     [Chart: Articles Per Month against NE Rank, both on logarithmic axes]
  7. [Wiki|DB]Pedia and Named Entities
     Without getting swamped
  8. Daylife and the Semantic Web
     Wikipedia
       website
       API
       Wikimedia dumps
     DBPedia
     Freebase
     Partners
       IPTC, NewsML
     Clients
       Proprietary metadata
  9. Resources for News Organizations
     Wikipedia
       website
       API
       Wikimedia dumps
     DBPedia
     Freebase
     Partners
       IPTC, NewsML
     Clients
       Proprietary metadata
     Named Entities (what Wikipedia and DBPedia are used for):
       vetting
       disambiguation
       aliases
       prominence
  10. [Wiki|DB]Pedia and Named Entities
      But:
        “… Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup Series competition, when Mike Wallace runs in Thursday's Gatorade Duel …”
      Which Mike Wallace?
        Mike_Wallace_(journalist)
        Mike_Wallace_(NASCAR)
      Two disambiguation approaches
        Given an article and an extracted name, what Wikipedia entry does it map to?
        Given a Wikipedia entry, what articles match?
  11. [Wiki|DB]Pedia and Named Entities
      Articles First:
      Wikimedia dumps and DBPedia
        Filter for people, organizations, other NE
        Construct weighted graph from links
        Proxy for prominence (# edits, pageviews, dumps only)
        Redirects & disambiguation pages
        “Hillary Clinton” redirects to Hillary_Rodham_Clinton: a human decided the reference is unambiguous; Usama/Osama
      Identify names, possibly matching graph nodes
      Select set of nodes that minimizes total distance
      Perhaps factor in node prominence
  12. [Wiki|DB]Pedia and Named Entities
      [Diagram: example link graph with nodes Mike Wallace (journalist), Mike Wallace (NASCAR), Kevin Buckler, NASCAR, Gatorade, Chicago Sun-Times, Chicago Bulls]
      I made this up!
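
The "select set of nodes that minimizes total distance" step from the articles-first slides could be sketched roughly as below. This is an illustration only, not Daylife's implementation: the edges, the unit edge weights, and every function name are assumptions invented for the example, and node prominence is left out.

```python
from collections import deque
from itertools import combinations, product

# Hypothetical undirected link graph using the node names from the diagram slide.
# The edges are invented for illustration only.
EDGES = [
    ("Mike_Wallace_(journalist)", "Chicago_Sun-Times"),
    ("Chicago_Sun-Times", "Chicago_Bulls"),
    ("Mike_Wallace_(NASCAR)", "NASCAR"),
    ("Kevin_Buckler", "NASCAR"),
    ("Gatorade", "NASCAR"),
    ("Gatorade", "Chicago_Bulls"),
]

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def hop_distance(graph, start, goal, unreachable=99):
    """Breadth-first hop count between two nodes; `unreachable` if no path exists."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return unreachable

def disambiguate(graph, candidates):
    """Pick one node per extracted name so the summed pairwise distance is minimal."""
    names = list(candidates)
    best, best_cost = None, None
    for choice in product(*(candidates[n] for n in names)):
        cost = sum(hop_distance(graph, a, b) for a, b in combinations(choice, 2))
        if best_cost is None or cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best, best_cost

graph = build_graph(EDGES)
candidates = {
    "Mike Wallace": ["Mike_Wallace_(journalist)", "Mike_Wallace_(NASCAR)"],
    "Kevin Buckler": ["Kevin_Buckler"],
    "NASCAR": ["NASCAR"],
    "Gatorade": ["Gatorade"],
}
assignment, total = disambiguate(graph, candidates)
print(assignment["Mike Wallace"], total)
```

The exhaustive search over all candidate combinations is fine for a handful of names per article but grows exponentially; a real system would presumably prune or approximate.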
  13. [Wiki|DB]Pedia and Named Entities
      Another possibility: compare text of Wikipedia entry to the article
      But:
        Wikipedia entries are largely historical; only a small fraction relates to current events
        Journalists, in providing context for lesser-known individuals, often mention a few other named entities
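
A text comparison of this kind might look like the bag-of-words cosine similarity below. It is a sketch under stated assumptions: the tokenizer is deliberately crude, and the two "entry" strings are invented stand-ins, not actual Wikipedia text.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a deliberately crude tokenizer."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

article = ("Kevin Buckler is looking to debut in NASCAR Sprint Cup Series "
           "competition when Mike Wallace runs")
# Invented stand-ins for Wikipedia entry text, for illustration only.
entries = {
    "Mike_Wallace_(journalist)": "Mike Wallace is an American journalist and television correspondent",
    "Mike_Wallace_(NASCAR)": "Mike Wallace is an American NASCAR driver who has raced in the Sprint Cup Series",
}
doc = bag_of_words(article)
scores = {title: cosine_similarity(doc, bag_of_words(text)) for title, text in entries.items()}
best_match = max(scores, key=scores.get)
print(best_match)
```

The toy works only because the stand-in text happens to share vocabulary with the article, which is exactly what the slide warns real, largely historical entries often fail to do.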
  14. [Wiki|DB]Pedia and Named Entities
      NE First approach:
        Classifier for race car drivers, Wikipedia to identify names
        Filter based on prominence
        See EVRI taxonomical paths
          http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with-taxonomical-paths
  15. [Wiki|DB]Pedia and Named Entities
      NE First:
        Tractable for a human (limited number of classifiers)
        Better for low-recall high-precision
      Article First:
        Low editorial oversight
        Best-guess
      Neither is a complete solution
        Not for locations
  16. [Wiki|DB]Pedia and Named Entities: General Nits
      Sticky Graffiti
        Wikipedia can be updated in real time if you don’t like it
        Some derived data sets can’t. Makes it our problem!
        On-demand updates from Wikipedia API / HTML
  17. [Wiki|DB]Pedia and Named Entities: General Nits
      Career Changes
        Mike Wallace (journalist) becomes a NASCAR driver
        Joe Wurzelbacher becomes a political pundit
        Not a complete solution, but we knew that.
  18. [Wiki|DB]Pedia and Named Entities: General Nits
      Staleness
        Infrequent Wikimedia dumps
        GWB is still president?
          DBPedia: bad
          Wikimedia dumps: bad
          Freebase: good
          Wikipedia HTML/API: good
        [Screenshot caption: DBPedia, 3/5/09]
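
One way to act on the staleness and "on-demand updates" points is to ask the MediaWiki API when a page was last edited and compare that with the date of the derived copy. The sketch below assumes the current English Wikipedia endpoint and uses a canned response with invented values rather than a live request; the function names and the snapshot date are hypothetical.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def latest_revision_url(title):
    """Build a MediaWiki query URL asking for the timestamp of the newest revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def latest_revision_time(response_text):
    """Extract the newest revision timestamp from a query response body."""
    pages = json.loads(response_text)["query"]["pages"]
    page = next(iter(pages.values()))
    stamp = page["revisions"][0]["timestamp"]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def is_stale(snapshot_time, response_text):
    """True if the live page has been edited since our derived copy was taken."""
    return latest_revision_time(response_text) > snapshot_time

# Canned response in the shape the API returns; the values are invented.
sample = ('{"query": {"pages": {"12345": {"pageid": 12345, "title": "George W. Bush", '
          '"revisions": [{"timestamp": "2009-03-06T12:00:00Z"}]}}}}')
snapshot = datetime(2009, 1, 1, tzinfo=timezone.utc)  # hypothetical dump date
print(latest_revision_url("George W. Bush"))
print(is_stale(snapshot, sample))
```

Only pages flagged as stale would then need to be re-fetched, which keeps the on-demand traffic small.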
  19. [Wiki|DB]Pedia and Named Entities
      Obscure Information
        Clint Eastwood:
          Is prominent, is a politician
          Not a prominent politician
  20. [Wiki|DB]Pedia and Named Entities
      URI Stability
        If this were 1981, unambiguous “George Bush”:
          <rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
            <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush">
              <dc:title>George Bush</dc:title>
              <dc:publisher>Wikipedia</dc:publisher>
            </rdf:Description>
          </rdf:RDF>
        The NYTimes did this, and still does (API):
          “George Bush” tag
          George H. W. Bush
        A lucky problem to have!
  21. Resources
      Wikipedia
        website
        API
        Wikimedia dumps
      DBPedia
      Freebase
      Partners
        IPTC, NewsML
      Clients
        Proprietary metadata
      Named Entities (what Freebase adds):
        GUIDs!
        tagging
        associations (members of teams)
        other data
  22. Freebase
      GUIDs are stable
      Query by Wikipedia URI
        http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}}
      Easy-to-find redirects
      GWB isn’t president
      Professions vs. Types
        Easier for topic tagging
      Clint Eastwood still a politician
        but: easier to tell he’s a minor one
        multiple types/professions, not much political data
      No good proxy for significance
        cross-reference
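
The query URL on the Freebase slide can be rebuilt from a Wikipedia title along these lines. The endpoint and the `$0028` / `$0029` escapes come from the slide; the general escaping rule used here (anything other than letters, digits, underscore, and hyphen becomes `$` plus four hex digits) is an assumption, as are the function names. The service itself has since been retired, so this is illustrative only.

```python
import json
import re
from urllib.parse import quote, unquote

MQLREAD = "http://www.freebase.com/api/service/mqlread"  # historical endpoint from the slide

def freebase_key(wikipedia_title):
    """Escape characters outside [A-Za-z0-9_-] as $ plus four uppercase hex digits (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9_-]", lambda m: "$%04X" % ord(m.group()), wikipedia_title)

def mqlread_url(wikipedia_title):
    """Build the mqlread URL for the topic keyed by an English Wikipedia title."""
    envelope = {"query": {"*": None, "id": "/wikipedia/en/" + freebase_key(wikipedia_title)}}
    return MQLREAD + "?query=" + quote(json.dumps(envelope, separators=(",", ":")))

url = mqlread_url("Mike_Wallace_(journalist)")
print(freebase_key("Mike_Wallace_(journalist)"))
print(unquote(url))
```

Percent-encoding the JSON envelope keeps the braces and quotes from being mangled in transit; decoding the result gives back the URL shown on the slide.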
  23. Resources
      Wikipedia
        website
        API
        Wikimedia dumps
      DBPedia
      Freebase
      Partners
        IPTC, NewsML
      Clients
        Proprietary metadata
      Inter-agency standards (what partners provide):
        Newswire services
        IPTC: photo information
        NewsML: article information, topics
  24. Interagency Metadata
      Data:
        authorship
        location
        caption
        sometimes people, category
      NEs hand-typed, often quickly
      RSS almost as good
      Stripped
      Matching problem, but STILL USEFUL
  25. Resources
      Wikipedia
        website
        API
        Wikimedia dumps
      DBPedia
      Freebase
      Partners
        IPTC, NewsML
      Clients
        Proprietary metadata
      Q: “Can you use our metadata?”
      A: “Sometimes”
        Again, matching problem, but good for client-specific topics, still useful
  26. Others Using the Semantic Web
      Having an API
        not the Semantic Web, but at least machine-friendly
        eventually common, even for publishers
      Publishing URIs for Wikipedia, Freebase, IMDB, etc.
        common among non-publishers
        parasitic (not bad!)
      Querying using the same URIs
        not so common
        mutualistic
  27. Others Using the Semantic Web
      EVRI
        API
        Topics (mostly, all?) from Wikipedia
        Probably taxonomic pathways, facets, derived from Wikipedia
        Disambiguation based on above
        Published Wikipedia URLs
        Can’t query by Wikipedia, other URIs
  28. Others Using the Semantic Web
      Zemanta
        Lots of Linked Data
        API provides text markup
        Developing (with others) a simplified RDFa-based semantic tagging standard
  29. Others Using the Semantic Web
      Calais (Thomson Reuters)
        API extracts NEs, other information
        Provides Linked Data URIs to others (one-way)
        Provides their own endpoints
        Not an aggregator
        Eventual support for querying
        Very clean!
  30. Others Using the Semantic Web
      The New York Times
        Leading charge with publisher API
        Their own tagging, great quality
      Some major newspapers following suit
      Other APIs: NewsGator, Inform, Outside.in
      Slow Moves to Digital Access
        Full-text RSS rare
        API rare
        Semantic Web standards rare
      Wouldn’t it be great if:
        You could ask for content about Mike_Wallace_(American_football)
        They pointed you to other rich data sources
  31. Wikipedia URI Lookup
      A quick service to support lookup for Wikipedia URIs
        http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)
      or
        http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
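
A client for the lookup service might build its request URLs as follows. The endpoint and the `uri` parameter are taken from the slide; the helper name is hypothetical, and whether the service accepted percent-encoded values (the slide shows them unencoded) is an assumption. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

LOOKUP_ENDPOINT = "http://labs.daylife.com/wikipedia_topic_getInfo.php"  # from the slide

def lookup_url(wikipedia_uri_or_title):
    """Build a lookup URL; the value is percent-encoded so it survives as one parameter."""
    return LOOKUP_ENDPOINT + "?" + urlencode({"uri": wikipedia_uri_or_title})

full = lookup_url("http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)")
short = lookup_url("Barack_Obama")
print(short)
# Decoding the query string recovers the original Wikipedia URI.
print(parse_qs(urlsplit(full).query)["uri"][0])
```

Encoding matters mostly for the full-URI form, where the nested `://` and parentheses could otherwise be misread by intermediaries.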
  32. Thank you
      Web Site: http://www.daylife.com
      Daylife API: http://developer.daylife.com
      Labs: http://labs.daylife.com
      Email: ken@daylife.com
