The Semantic Web And The News


An overview of how news organizations leverage Semantic Web resources such as DBpedia, Freebase, and Wikipedia, and of the adoption of Semantic Web standards.



Transcript

  • 1. The Semantic Web and the News: Exploitation and Adoption. Ken Ellis, Chief Scientist
  • 2. Agenda
      • Intro to Daylife
      • Exploiting the Semantic Web
          • Named Entities
          • Toolsets, issues
      • Adopting / Enabling
          • Others
          • Daylife
  • 3. Daylife: A Platform for News Innovation. A scalable solution for publishers of all sizes to generate more content and more inventory, with no additional personnel costs
  • 4. Daylife: What We Do
      • Aggregate Content
          • Licensed photos (Getty, AP, Reuters)
          • Articles (scraped, real-time)
      • Create Metadata
          • Topics (people, organizations, concepts)
          • Topic taxonomy, descriptions
          • Quotes with attribution
          • Photo identification
          • Relatedness
          • Authorship, sentiment analysis, etc.
      • Deliver to Clients
          • Web Sites / Modules / Data
          • Flexibility: API w/ 500 distinct queries
          • Novel search/ranking algorithms
          • Free API
  • 5. [Wiki|DB]Pedia and Named Entities: We also want to collect content around a named entity … and associate it with external data (Wikipedia, Freebase)
  • 6. [Wiki|DB]Pedia and Named Entities: … for a lot of NEs (55k newsworthy ones last month). [Chart: Articles Per Month (1 to 1,000,000) against NE Rank (1 to 100,000), both axes logarithmic]
  • 7. [Wiki|DB]Pedia and Named Entities: … without getting swamped
  • 8. Daylife and the Semantic Web
      • Wikipedia
          • website
          • API
          • Wikimedia dumps
      • DBPedia
      • Freebase
      • Partners
          • IPTC, NewsML
      • Clients
          • Proprietary metadata
  • 9. Resources for News Organizations
      • Named Entities: vetting, disambiguation, aliases, prominence
      • Resource list: Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata)
  • 10. [Wiki|DB]Pedia and Named Entities
      • But: “… Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup Series competition, when Mike Wallace runs in Thursday's Gatorade Duel …”
      • Which Mike Wallace?
          • Mike_Wallace_(journalist)
          • Mike_Wallace_(NASCAR)
      • Two disambiguation approaches
          • Given an article and an extracted name, what Wikipedia entry does it map to?
          • Given a Wikipedia entry, what articles match?
  • 11. [Wiki|DB]Pedia and Named Entities: Articles First
      • Wikimedia dumps and DBPedia
          • Filter for people, organizations, other NEs
          • Construct weighted graph from links
          • Proxy for prominence (# edits, pageviews, dumps only)
          • Redirects & disambiguation pages
          • “Hillary Clinton” redirects to Hillary_Rodham_Clinton: a human decided the reference is unambiguous; Usama/Osama
      • Identify names, possibly matching graph nodes
      • Select the set of nodes that minimizes total distance
      • Perhaps factor in node prominence
  • 12. [Wiki|DB]Pedia and Named Entities: [Diagram: link graph with nodes Mike Wallace (journalist), Mike Wallace (NASCAR), Kevin Buckler, NASCAR, Gatorade, Chicago Sun-Times, Chicago Bulls] I made this up!
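
The "select the set of nodes that minimizes total distance" step from the Articles First slide can be sketched as follows. This is a minimal illustration, not Daylife's implementation: the edge list is invented (as on the slide), the graph is unweighted, and the names `EDGES`, `hop_distance`, and `disambiguate` are hypothetical. It tries every combination of candidate entries and keeps the one whose nodes sit closest together in the link graph.

```python
from collections import deque
from itertools import combinations, product

# Made-up link graph for illustration only; real edges would come from
# Wikipedia/DBPedia links and would carry weights.
EDGES = [
    ("Mike_Wallace_(journalist)", "Chicago_Sun-Times"),
    ("Chicago_Sun-Times", "Chicago_Bulls"),
    ("Chicago_Bulls", "Gatorade"),
    ("Mike_Wallace_(NASCAR)", "NASCAR"),
    ("Kevin_Buckler", "NASCAR"),
    ("Gatorade", "NASCAR"),
]


def build_graph(edges):
    """Build an undirected adjacency map from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_distance(graph, start, goal):
    """Breadth-first search; returns the hop count or infinity if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def disambiguate(graph, candidates):
    """Pick one node per extracted name so that the summed pairwise
    distance between the chosen nodes is smallest (exhaustive search)."""
    names = list(candidates)
    best, best_cost = None, float("inf")
    for choice in product(*(candidates[name] for name in names)):
        cost = sum(hop_distance(graph, a, b) for a, b in combinations(choice, 2))
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best


GRAPH = build_graph(EDGES)
CANDIDATES = {
    "Mike Wallace": ["Mike_Wallace_(journalist)", "Mike_Wallace_(NASCAR)"],
    "Kevin Buckler": ["Kevin_Buckler"],
    "NASCAR": ["NASCAR"],
    "Gatorade": ["Gatorade"],
}
print(disambiguate(GRAPH, CANDIDATES)["Mike Wallace"])
```

Exhaustive search is only workable for the handful of names in one article; node prominence, which the slide suggests factoring in, could be added as a per-node term in `cost`.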
  • 13. [Wiki|DB]Pedia and Named Entities
      • Another possibility: compare the text of the Wikipedia entry to the article
      • But:
          • Wikipedia entries are largely historical; only a small fraction relates to current events
          • Journalists, in providing context for lesser-known individuals, often mention a few other named entities
  • 14. [Wiki|DB]Pedia and Named Entities: NE First approach
      • Classifier for race car drivers, Wikipedia to identify names
      • Filter based on prominence
      • See EVRI taxonomical paths: http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with-taxonomical-paths
  • 15. [Wiki|DB]Pedia and Named Entities
      • NE First:
          • Tractable for a human (limited number of classifiers)
          • Better for low-recall high-precision
      • Article First:
          • Low editorial oversight
          • Best-guess
      • Neither is a complete solution
          • Not for locations
  • 16. [Wiki|DB]Pedia and Named Entities: General Nits: Sticky Graffiti
      • Wikipedia can be updated in real time if you don’t like it
      • Some derived data sets can’t. Makes it our problem!
      • On-demand updates from Wikipedia API / HTML
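
One way to do an on-demand freshness check is to ask the MediaWiki Action API for a page's latest revision timestamp and compare it with the time a local copy was cached. The sketch below is an assumption about how such a check could look, not the approach used in the talk; the helper names `last_revision_url`, `latest_timestamp`, and `is_stale` are hypothetical, and the HTTP fetch itself is left out so the example stays offline.

```python
from datetime import datetime
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def last_revision_url(title):
    """Build a MediaWiki query URL for the latest revision timestamp of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": title,
        "redirects": 1,  # follow redirects such as alternate spellings
        "format": "json",
    }
    return API + "?" + urlencode(params)


def latest_timestamp(response):
    """Extract the revision timestamp from an already-decoded JSON response."""
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))
    stamp = page["revisions"][0]["timestamp"]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def is_stale(cached_at, response):
    """True when the page changed after our cached copy was taken."""
    return latest_timestamp(response) > cached_at


print(last_revision_url("Mike_Wallace_(journalist)"))
```

A caller would fetch the URL, decode the JSON body, and refresh the derived data only when `is_stale` returns true.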
  • 17. [Wiki|DB]Pedia and Named Entities: General Nits: Career Changes
      • Mike Wallace (journalist) becomes a NASCAR driver
      • Joe Wurzelbacher becomes a political pundit
      • Not a complete solution, but we knew that.
  • 18. [Wiki|DB]Pedia and Named Entities: General Nits: Staleness
      • Infrequent Wikimedia dumps
      • GWB is still president?
          • DBPedia: bad
          • Wikimedia dumps: bad
          • Freebase: good
          • Wikipedia HTML/API: good
      • [Screenshot: DBPedia, 3/5/09]
  • 19. [Wiki|DB]Pedia and Named Entities: Obscure Information
      • Clint Eastwood:
          • Is prominent, is a politician
          • Not a prominent politician
  • 20. [Wiki|DB]Pedia and Named Entities: URI Stability
      • If this were 1981, unambiguous “George Bush”:
          <rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
            <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush">
              <dc:title>George Bush</dc:title>
              <dc:publisher>Wikipedia</dc:publisher>
            </rdf:Description>
          </rdf:RDF>
      • The NYTimes did this, and still does (API):
          • “George Bush” tag
          • George H. W. Bush
      • A lucky problem to have!
  • 21. Resources
      • Named Entities: GUIDs!, tagging, associations (members of teams), other data
      • Resource list: Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata)
  • 22. Freebase
      • GUIDs are stable
      • Query by Wikipedia URI: http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}}
      • Easy-to-find redirects
      • GWB isn’t president
      • Professions vs. Types
          • Easier for topic tagging
      • Clint Eastwood still a politician
          • but: easier to tell he’s a minor one
          • multiple types/professions, not much political data
      • No good proxy for significance
          • cross-reference
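
The query-by-Wikipedia-URI request on the Freebase slide can be built programmatically. The sketch below assumes, based only on the slide's example (`(` becomes `$0028`, `)` becomes `$0029`), that characters outside letters, digits, underscore, and hyphen are escaped as `$` plus four uppercase hex digits; `freebase_key` and `mqlread_url` are hypothetical helper names. The Freebase API has since been retired, so this illustrates the request shape rather than a live call.

```python
import json
import re
from urllib.parse import quote

MQLREAD = "http://www.freebase.com/api/service/mqlread"


def freebase_key(title):
    """Escape a Wikipedia page title the way the slide's example does.

    Assumption: every character outside [A-Za-z0-9_-] becomes '$' followed
    by its code point as four uppercase hex digits.
    """
    return re.sub(
        r"[^A-Za-z0-9_-]",
        lambda match: "$%04X" % ord(match.group()),
        title,
    )


def mql_envelope(title):
    """Return the JSON query envelope shown on the slide for a given title."""
    query = {"query": {"*": None, "id": "/wikipedia/en/" + freebase_key(title)}}
    return json.dumps(query, separators=(",", ":"))


def mqlread_url(title):
    """Percent-encode the envelope into the mqlread 'query' parameter."""
    return MQLREAD + "?query=" + quote(mql_envelope(title))


print(mql_envelope("Mike_Wallace_(journalist)"))
```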
  • 23. Resources
      • Inter-agency standards
          • Newswire services
          • IPTC: photo information
          • NewsML: article information, topics
      • Resource list: Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata)
  • 24. Interagency Metadata
      • Data:
          • authorship
          • location
          • caption
          • sometimes people, category
      • NEs hand-typed, often quickly
      • RSS almost as good
      • Stripped
      • Matching problem, but STILL USEFUL
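
The matching problem for hand-typed names can be approached by normalizing the name, checking an alias table (for example, one built from Wikipedia redirects), and falling back to fuzzy matching for typos. The sketch below is one possible approach under those assumptions, not the method used in the talk; the alias table, the `0.85` cutoff, and the function names are illustrative.

```python
import difflib
import re

# Hypothetical alias table; in practice this might be derived from redirects.
ALIASES = {
    "hillary clinton": "Hillary_Rodham_Clinton",
    "hillary rodham clinton": "Hillary_Rodham_Clinton",
    "osama bin laden": "Osama_bin_Laden",
    "usama bin laden": "Osama_bin_Laden",
}


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def match_topic(raw_name, cutoff=0.85):
    """Map a hand-typed name to a topic, or None when nothing is close enough."""
    key = normalize(raw_name)
    if key in ALIASES:
        return ALIASES[key]
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


print(match_topic("Hilary Clinton"))
print(match_topic("  USAMA bin Laden. "))
print(match_topic("Kevin Buckler"))
```

A high cutoff favors precision: an unmatched name is simply left untagged rather than attached to the wrong topic.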
  • 25. Resources
      • Q: “Can you use our metadata?” A: “Sometimes”
      • Again, a matching problem, but good for client-specific topics; still useful
      • Resource list: Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata)
  • 26. Others Using the Semantic Web
      • Having an API
          • not the Semantic Web, but at least machine-friendly
          • eventually common, even for publishers
      • Publishing URIs for Wikipedia, Freebase, IMDB, etc.
          • common among non-publishers
          • parasitic (not bad!)
      • Querying using the same URIs
          • not so common
          • mutualistic
  • 27. Others Using the Semantic Web: EVRI
      • API
      • Topics (mostly, all?) from Wikipedia
      • Probably taxonomic pathways, facets, derived from Wikipedia
      • Disambiguation based on above
      • Published Wikipedia URLs
      • Can’t query by Wikipedia, other URIs
  • 28. Others Using the Semantic Web: Zemanta
      • Lots of Linked Data
      • API provides text markup
      • Developing (with others) a simplified RDFa-based semantic tagging standard
  • 29. Others Using the Semantic Web: Calais (Thomson Reuters)
      • API extracts NEs, other information
      • Provides Linked Data URIs to others (one-way)
      • Provides their own endpoints
      • Not an aggregator
      • Eventual support for querying
      • Very clean!
  • 30. Others Using the Semantic Web
      • The New York Times
          • Leading the charge with a publisher API
          • Their own tagging, great quality
      • Some major newspapers following suit
      • Other APIs: NewsGator, Inform, Outside.in
      • Slow Moves to Digital Access
          • Full-text RSS rare
          • API rare
          • Semantic Web standards rare
      • Wouldn’t it be great if:
          • You could ask for content about Mike_Wallace_(American_football)
          • They pointed you to other rich data sources
  • 31. Wikipedia URI Lookup: A quick service to support lookup for Wikipedia URIs
      • http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)
      • or http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
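
A request to the lookup service could be assembled as below. Only the endpoint and the `uri` parameter come from the slide; the helper name `lookup_url` is hypothetical, the response format is not described in the talk, and the service may no longer be available. Note that `urlencode` percent-encodes a full Wikipedia URL, whereas the slide shows it unencoded.

```python
from urllib.parse import urlencode

LOOKUP = "http://labs.daylife.com/wikipedia_topic_getInfo.php"


def lookup_url(topic):
    """Build a lookup URL from either a page title or a full Wikipedia URI."""
    return LOOKUP + "?" + urlencode({"uri": topic})


print(lookup_url("Barack_Obama"))
print(lookup_url("http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)"))
```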
  • 32. Thank you
      • Web Site: http://www.daylife.com
      • Daylife API: http://developer.daylife.com
      • Labs: http://labs.daylife.com
      • Email: ken@daylife.com