The Semantic Web And The News
An overview of the ways in which news organizations leverage semantic web resources such as dbpedia, freebase, wikipedia; adoption of Semantic web standards.



Transcript

  • 1. The Semantic Web and the News: Exploitation and Adoption. Ken Ellis, Chief Scientist
  • 2. Agenda: Intro to Daylife; Exploiting the Semantic Web (named entities; toolsets, issues); Adopting / Enabling (others; Daylife)
  • 3. Daylife: A Platform for News Innovation. A scalable solution for publishers of all sizes to generate more content and more inventory, with no additional personnel costs
  • 4. Daylife: What We Do. Aggregate Content: licensed photos (Getty, AP, Reuters); articles (scraped, real-time). Create Metadata: topics (people, organizations, concepts); topic taxonomy, descriptions; quotes with attribution; photo identification; relatedness; authorship, sentiment analysis, etc. Deliver to Clients: web sites / modules / data; flexibility: API w/ 500 distinct queries; novel search/ranking algorithms; free API
  • 5. [Wiki|DB]Pedia and Named Entities: We also want to collect content around a named entity … and associate it with external data (Wikipedia, Freebase)
  • 6. [Wiki|DB]Pedia and Named Entities: … for a lot of NEs (55k newsworthy ones last month). [Chart: Articles Per Month (1 to 1,000,000) against NE Rank (1 to 100,000)]
  • 7. [Wiki|DB]Pedia and Named Entities: Without getting swamped
  • 8. Daylife and the Semantic Web. Wikipedia: website, API, Wikimedia dumps. DBPedia. Freebase. Partners: IPTC, NewsML. Clients: proprietary metadata
  • 9. Resources for News Organizations. Resources: Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata). Named Entities: vetting; disambiguation; aliases; prominence
  • 10. [Wiki|DB]Pedia and Named Entities. But: “… Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup Series competition, when Mike Wallace runs in Thursday's Gatorade Duel …” Which Mike Wallace? Mike_Wallace_(journalist) or Mike_Wallace_(NASCAR)? Two disambiguation approaches: (1) Given an article and an extracted name, what Wikipedia entry does it map to? (2) Given a Wikipedia entry, what articles match?
  • 11. [Wiki|DB]Pedia and Named Entities. Article First. Wikimedia dumps and DBPedia: filter for people, organizations, other NEs; construct weighted graph from links; proxy for prominence (# edits, pageviews; dumps only); redirects & disambiguation pages (“Hillary Clinton” redirects to Hillary_Rodham_Clinton: a human decided the reference is unambiguous; Usama/Osama). Identify names, possibly matching graph nodes. Select the set of nodes that minimizes total distance. Perhaps factor in node prominence
  • 12. [Wiki|DB]Pedia and Named Entities. [Diagram: link graph with nodes Mike Wallace (journalist), Mike Wallace (NASCAR), Chicago Sun-Times, Chicago Bulls, Kevin Buckler, NASCAR, Gatorade] I made this up!
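
The article-first steps on slides 11 and 12 can be sketched roughly as below. This is only an illustration, not Daylife's implementation: the link graph, its edge weights, and the candidate table are invented toy data in the spirit of slide 12, and a brute-force search over candidate combinations stands in for whatever optimization a production system would use. Node prominence is not factored in.

```python
import heapq
from itertools import product

# Hypothetical undirected link graph; weights are made-up distances (lower = closer).
GRAPH = {
    "Mike_Wallace_(journalist)": {"Chicago_Sun-Times": 1.0},
    "Mike_Wallace_(NASCAR)": {"NASCAR": 1.0, "Kevin_Buckler": 1.0},
    "Kevin_Buckler": {"Mike_Wallace_(NASCAR)": 1.0, "NASCAR": 1.0},
    "NASCAR": {"Mike_Wallace_(NASCAR)": 1.0, "Kevin_Buckler": 1.0, "Gatorade": 2.0},
    "Gatorade": {"NASCAR": 2.0, "Chicago_Bulls": 2.0},
    "Chicago_Bulls": {"Gatorade": 2.0, "Chicago_Sun-Times": 1.0},
    "Chicago_Sun-Times": {"Mike_Wallace_(journalist)": 1.0, "Chicago_Bulls": 1.0},
}

# Hypothetical alias table: extracted name -> candidate graph nodes.
CANDIDATES = {
    "Mike Wallace": ["Mike_Wallace_(journalist)", "Mike_Wallace_(NASCAR)"],
    "Kevin Buckler": ["Kevin_Buckler"],
    "NASCAR": ["NASCAR"],
    "Gatorade": ["Gatorade"],
}


def shortest_distances(source):
    """Dijkstra's algorithm: shortest distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in GRAPH.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def disambiguate(names):
    """Pick one node per name so that the summed pairwise distance is minimal."""
    best, best_cost = None, float("inf")
    for combo in product(*(CANDIDATES[name] for name in names)):
        cost = 0.0
        for i, node in enumerate(combo):
            dist = shortest_distances(node)
            cost += sum(dist.get(other, float("inf")) for other in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(names, best)) if best is not None else {}


result = disambiguate(["Mike Wallace", "Kevin Buckler", "NASCAR", "Gatorade"])
print(result["Mike Wallace"])
```

With this toy graph the racing-related nodes sit close together, so the NASCAR reading of "Mike Wallace" gives the smaller total distance.
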
  • 13. [Wiki|DB]Pedia and Named Entities. Another possibility: compare the text of the Wikipedia entry to the article. But: Wikipedia entries are largely historical, with only a small fraction related to current events. Journalists, in providing context for lesser-known individuals, often mention a few other named entities
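
The text-comparison idea on slide 13 could look something like the sketch below, which scores each candidate entry by bag-of-words cosine similarity with the article. The slide does not name a similarity measure, so cosine similarity is an assumption, and the two entry summaries are hypothetical one-liners written for this example, not actual Wikipedia text.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_entry(article, entries):
    """Return the entry title whose text is most similar to the article."""
    article_vector = bag_of_words(article)
    return max(entries, key=lambda title: cosine(article_vector, bag_of_words(entries[title])))


ARTICLE = ("Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup "
           "Series competition, when Mike Wallace runs in Thursday's Gatorade Duel")

# Hypothetical entry summaries (invented for illustration).
ENTRIES = {
    "Mike_Wallace_(journalist)":
        "Mike Wallace is an American journalist and television correspondent for 60 Minutes",
    "Mike_Wallace_(NASCAR)":
        "Mike Wallace is an American stock car racing driver who competes in NASCAR series",
}

print(best_entry(ARTICLE, ENTRIES))
```

The caveats on the slide still apply: real entries are mostly historical, so the overlap with a current news article is often much thinner than in this toy case.
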
  • 14. [Wiki|DB]Pedia and Named Entities. NE First approach: classifier for race car drivers; Wikipedia to identify names; filter based on prominence. See EVRI taxonomical paths: http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with-taxonomical-paths
  • 15. [Wiki|DB]Pedia and Named Entities. NE First: tractable for a human (limited number of classifiers); better for low-recall, high-precision. Article First: low editorial oversight; best-guess. Neither is a complete solution. Not for locations
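
A rough sketch of the NE-first direction on slide 14 follows. Everything in it is assumed for illustration: the `Entity` record, the prominence scores, the threshold values, and the hand-picked context terms, which stand in for a trained "race car driver" classifier.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uri: str                  # Wikipedia title used as the identifier
    aliases: tuple            # names taken from Wikipedia (title, redirects)
    context_terms: frozenset  # keyword stand-in for a per-category classifier
    prominence: float         # hypothetical normalized score (edits, pageviews)


def mentions(entity, text, min_context_hits=2):
    """True if the text names the entity and looks like it belongs to its category."""
    lowered = text.lower()
    if not any(alias.lower() in lowered for alias in entity.aliases):
        return False
    words = set(re.findall(r"[a-z0-9]+", lowered))
    return len(entity.context_terms & words) >= min_context_hits


def articles_for(entities, articles, min_prominence=0.5):
    """Keep only prominent entities, then collect the articles that match each one."""
    tracked = [e for e in entities if e.prominence >= min_prominence]
    return {e.uri: [a for a in articles if mentions(e, a)] for e in tracked}


RACING_TERMS = frozenset({"nascar", "race", "driver", "cup", "sprint"})
DRIVER = Entity("Mike_Wallace_(NASCAR)", ("Mike Wallace",), RACING_TERMS, 0.8)
OBSCURE = Entity("Some_Obscure_Driver", ("Some Obscure Driver",), RACING_TERMS, 0.1)

ARTICLES = [
    "Kevin Buckler is looking to debut in NASCAR Sprint Cup Series competition, "
    "when Mike Wallace runs in Thursday's Gatorade Duel",
    "Mike Wallace interviewed the senator for 60 Minutes",
]

matched = articles_for([DRIVER, OBSCURE], ARTICLES)
print(len(matched["Mike_Wallace_(NASCAR)"]))
```

Requiring both a name match and category evidence favors precision over recall, which fits the trade-off the talk attributes to this approach.
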
  • 16. [Wiki|DB]Pedia and Named Entities. General Nits: Sticky Graffiti. Wikipedia can be updated in real time if you don’t like it. Some derived data sets can’t. Makes it our problem! On-demand updates from Wikipedia API / HTML
  • 17. [Wiki|DB]Pedia and Named Entities. General Nits: Career Changes. Mike Wallace (journalist) becomes a NASCAR driver. Joe Wurzelbacher becomes a political pundit. Not a complete solution, but we knew that.
  • 18. [Wiki|DB]Pedia and Named Entities. General Nits: Staleness. Infrequent Wikimedia dumps. GWB is still president? DBPedia: bad; Wikimedia dumps: bad; Freebase: good; Wikipedia HTML/API: good. [Image caption: DBPedia, 3/5/09]
  • 19. [Wiki|DB]Pedia and Named Entities. Obscure Information. Clint Eastwood: is prominent, is a politician; not a prominent politician
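
One way to approach the on-demand updates mentioned on slide 16 is to ask the MediaWiki API for a page's latest revision timestamp and compare it with the time a local copy was cached. The sketch below only builds the request URL and parses a response; it makes no network call, the sample response is hand-written with made-up values, and the caching policy is an assumption rather than anything described in the talk.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def latest_revision_url(title):
    """URL asking the MediaWiki API for the timestamp of a page's latest revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def latest_timestamp(response):
    """Extract the latest revision time from a parsed JSON response for one page."""
    page = next(iter(response["query"]["pages"].values()))
    stamp = page["revisions"][0]["timestamp"]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_stale(cached_at, response):
    """True if the page was edited after our cached copy was taken."""
    return latest_timestamp(response) > cached_at


# Hand-written sample in the shape of a JSON response; the values are invented.
SAMPLE = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Mike Wallace (journalist)",
                "revisions": [{"timestamp": "2009-03-05T12:00:00Z"}],
            }
        }
    }
}

print(latest_revision_url("Mike_Wallace_(journalist)"))
print(is_stale(datetime(2009, 1, 1, tzinfo=timezone.utc), SAMPLE))
```
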
  • 20. [Wiki|DB]Pedia and Named Entities. URI Stability. If this were 1981, unambiguous “George Bush”: <rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/"> <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush"> <dc:title>George Bush</dc:title> <dc:publisher>Wikipedia</dc:publisher> </rdf:Description> </rdf:RDF> The NYTimes did this, and still does (API): the “George Bush” tag refers to George H. W. Bush. A lucky problem to have!
  • 21. Resources. Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata). Named Entities: GUIDs!; tagging; associations (members of teams); other data
  • 22. Freebase. GUIDs are stable. Query by Wikipedia URI: http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}} Easy-to-find redirects. GWB isn’t president. Professions vs. Types: easier for topic tagging. Clint Eastwood is still a politician, but it is easier to tell he’s a minor one (multiple types/professions, not much political data). No good proxy for significance: cross-reference
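
The query-by-Wikipedia-URI pattern on slide 22 can be sketched as a small URL builder. The endpoint and query envelope are copied from the slide; the key-escaping rule (any character other than letters, digits, underscore, or hyphen becomes `$` plus four hex digits) is an assumption inferred from the slide's `$0028`/`$0029` example. The code only constructs the URL and sends nothing.

```python
import json
import re
from urllib.parse import quote

MQLREAD = "http://www.freebase.com/api/service/mqlread"  # endpoint as shown on the slide


def freebase_key(title):
    """Escape a Wikipedia title as a Freebase key (assumed rule, see lead-in)."""
    return re.sub(r"[^A-Za-z0-9_-]", lambda m: "$%04X" % ord(m.group()), title)


def mqlread_url(wikipedia_title):
    """Build the mqlread URL that looks a topic up by its English Wikipedia title."""
    envelope = {"query": {"*": None, "id": "/wikipedia/en/" + freebase_key(wikipedia_title)}}
    return MQLREAD + "?query=" + quote(json.dumps(envelope, separators=(",", ":")))


print(freebase_key("Mike_Wallace_(journalist)"))
print(mqlread_url("Mike_Wallace_(journalist)"))
```
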
  • 23. Resources. Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata). Inter-agency standards: newswire services; IPTC: photo information; NewsML: article information, topics
  • 24. Interagency Metadata. Data: authorship; location; caption; sometimes people, category. NEs hand-typed, often quickly. RSS almost as good. Stripped. Matching problem, but STILL USEFUL
  • 25. Resources. Wikipedia (website, API, Wikimedia dumps); DBPedia; Freebase; Partners (IPTC, NewsML); Clients (proprietary metadata). Q: “Can you use our metadata?” A: “Sometimes.” Again, matching problem, but good for client-specific topics, still useful
  • 26. Others Using the Semantic Web. Having an API: not the Semantic Web, but at least machine-friendly; eventually common, even for publishers. Publishing URIs for Wikipedia, Freebase, IMDB, etc.: common among non-publishers; parasitic (not bad!). Querying using the same URIs: not so common; mutualistic
  • 27. Others Using the Semantic Web. EVRI: API; topics (mostly, all?) from Wikipedia; probably taxonomic pathways, facets, derived from Wikipedia; disambiguation based on the above; published Wikipedia URLs; can’t query by Wikipedia or other URIs
  • 28. Others Using the Semantic Web. Zemanta: lots of Linked Data; API provides text markup; developing (with others) a simplified RDFa-based semantic tagging standard
  • 29. Others Using the Semantic Web. Calais (Thomson Reuters): API extracts NEs, other information; provides Linked Data URIs to others (one-way); provides their own endpoints; not an aggregator; eventual support for querying; very clean!
  • 30. Others Using the Semantic Web. The New York Times: leading the charge with a publisher API; their own tagging, great quality. Some major newspapers following suit. Other APIs: NewsGator, Inform, Outside.in. Slow Moves to Digital Access: full-text RSS rare; API rare; Semantic Web standards rare. Wouldn’t it be great if: you could ask for content about Mike_Wallace_(American_football); they pointed you to other rich data sources
  • 31. Wikipedia URI Lookup. A quick service to support lookup for Wikipedia URIs: http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist) or http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
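
A client for the lookup service on slide 31 might build its request URLs as sketched below. The endpoint and the `uri` parameter come from the slide, which shows that either a full Wikipedia URI or a bare title is accepted. Percent-encoding the parameter value is an assumption (the slide shows it unencoded), and since the response format is not described in the talk, the sketch stops at URL construction.

```python
from urllib.parse import unquote, urlencode, urlsplit

LOOKUP = "http://labs.daylife.com/wikipedia_topic_getInfo.php"  # endpoint from the slide
WIKI_PREFIX = "/wiki/"


def wikipedia_title(uri_or_title):
    """Reduce a full Wikipedia URI to its page title; pass bare titles through."""
    path = urlsplit(uri_or_title).path
    if path.startswith(WIKI_PREFIX):
        return unquote(path[len(WIKI_PREFIX):])
    return uri_or_title


def lookup_url(uri_or_title):
    """Build a lookup request for either a full Wikipedia URI or a bare title."""
    return LOOKUP + "?" + urlencode({"uri": uri_or_title})


print(lookup_url("Barack_Obama"))
print(wikipedia_title("http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)"))
```
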
  • 32. Thank you. Web Site: http://www.daylife.com; Daylife API: http://developer.daylife.com; Labs: http://labs.daylife.com; Email: ken@daylife.com