Text Analytic Summit 2010


With over 12 million entities and 350 million relationships, Freebase is an excellent resource for performing text analysis. One way to look at document "understanding" is to think about how the entities in the document are connected on a knowledge graph. This is similar to the "reconciliation" process that is used to grow Freebase itself.
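
As a rough illustration of that idea, the sketch below chooses between two candidate entities for an ambiguous name by counting how many of the document's other entities each candidate touches in a graph. The toy graph is loosely modeled on the "Harrison Ford" slides later in the deck; the identifiers and the `disambiguate` helper are hypothetical stand-ins, not Freebase data and not the Freebase reconciliation algorithm.

```python
from typing import Dict, Iterable, Set

# Toy knowledge graph: entity id -> ids of directly connected entities.
# All ids and edges are illustrative assumptions, not real Freebase data.
GRAPH: Dict[str, Set[str]] = {
    "harrison_ford_silent": {"vanity_fair", "maytime"},
    "harrison_ford_modern": {"blade_runner", "the_fugitive"},
}

# Ambiguous surface name -> candidate entity ids (also hypothetical).
CANDIDATES: Dict[str, Set[str]] = {
    "Harrison Ford": {"harrison_ford_silent", "harrison_ford_modern"},
}


def disambiguate(name: str, context: Iterable[str]) -> str:
    """Return the candidate for `name` with the most graph edges into `context`."""
    context_ids = set(context)
    # sorted() makes tie-breaking deterministic (alphabetical by id).
    return max(
        sorted(CANDIDATES[name]),
        key=lambda cand: len(GRAPH.get(cand, set()) & context_ids),
    )


print(disambiguate("Harrison Ford", ["blade_runner", "the_fugitive"]))
print(disambiguate("Harrison Ford", ["vanity_fair", "maytime"]))
```

Counting shared neighbors is the simplest possible connectivity score; a real system would presumably weight edge types and combine this with name and attribute matching.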

The web is currently full of semantic hints, whether they are explicit (like those promoted by the Semantic Web) or implicit (like the use of blog widgets). Using these hints, text analytic methods can get a toehold on the web corpus at large.
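
One way to harvest such a hint is sketched below: it pulls the `itemid` and `itemtype` attributes from any HTML element that carries `itemscope`, using only Python's standard `html.parser`. The sample markup is a trimmed copy of the topic-block widget shown in the slides; the `MicrodataItemParser` class and `extract_items` function are hypothetical names introduced for this example, and the sketch ignores nested `itemprop` values.

```python
from html.parser import HTMLParser
from typing import Dict, List


class MicrodataItemParser(HTMLParser):
    """Collect itemid/itemtype pairs from elements that carry `itemscope`."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value); value is None for bare attributes.
        attr_map = dict(attrs)
        if "itemscope" in attr_map:
            self.items.append({
                "itemid": attr_map.get("itemid") or "",
                "itemtype": attr_map.get("itemtype") or "",
            })


def extract_items(html: str) -> List[Dict[str, str]]:
    """Return one dict per microdata item found in `html`."""
    parser = MicrodataItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Trimmed version of the widget markup from the slides.
SAMPLE = (
    '<div class="fb-widget" itemscope="" '
    'itemid="http://www.freebase.com/id/en/taylor_swift" '
    'itemtype="http://www.freebase.com/id/music/artist">...</div>'
)

for item in extract_items(SAMPLE):
    print(item["itemid"], item["itemtype"])
```

Because the `itemid` is a strong identifier, a page carrying this widget can be tied to an entity without any name matching at all.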

Published in: Technology

  1. 1. It's not what you said, it's how you said it. Jamie Taylor, Ph.D. Text Analytic Summit Boston 2010
  2. 2. What do y'all mean "Semantics" The Web! Now with Better Flavor!
  3. 3. Tim Berners-Lee, James Hendler and Ora Lassila    May 2001
  4. 4. The Semantic Web? The Cake taken from http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/layerCake-4.png
  5. 5. Linked Open Data
  6. 6. The Real Web http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  7. 7. Wish it were real
  8. 8. Might be real
  9. 9. Is real, but don't believe it
  10. 10. Is currently useful
  11. 11. Entities
  12. 12. Identifiers Side Step Polysemy Bono, A.K.A. Paul David Hewson http://rdf.freebase.com/ns/en.paul_david_hewson
  13. 13. Vocabulary Manufactures http://rdf.freebase.com/ns/automotive.make.model_s
  14. 14. A socially managed semantic database
  15. 15. Freebase has Many Types of Things
  16. 16. Many Strong Identifiers http://rdf.freebase.com/ns/en.berlin_wall http://www.ellerdale.com/topics/view/0080-6ba0 http://www.bbc.co.uk/music/artists/7f347782-eb14-40c3-98e2-17b6e1bfe56c http://musicbrainz.org/artist/7f347782-eb14-40c3-98e2-17b6e1bfe56c http://rdf.freebase.com/ns/authority.musicbrainz.7f347782-eb14-40c3-98e2-17b6e1bfe56c
  17. 17. 12 Million Entities 350 Million Relations
  18. 18. Users contribute data Users extend the data model
  19. 19. schema = vocabulary
  20. 20. 1500 types with 500+ instances!! A range of vocabularies....
  21. 21. Growing Freebase
  22. 22. Reconciliation +=
  23. 23. Reconciliation Relational Learning Record Matching Collective Entity Resolution Equivalence Mining Record Linking Identity Matching
  24. 24. Reconciliation "Excuse Me" "Excuse Me" "Harrison Ford" "Harrison Ford" "Vanity Fair" "Maytime"
  25. 25. Reconciliation "Fugitive" "Excuse Me" "Harrison Ford" "Harrison Ford" "Vanity Fair" "Blade Runner"
  26. 26. A Graph of Entities
  27. 27. Vocabulary [graph figure; edge labels: contains, located, performed-at, released-by, created, plays-in, nationality, education]
  28. 28. Reconciliation as "understanding" [graph figure; edge labels: contains, located, performed-at, released-by, created, plays-in, nationality, education]
  29. 29. { "/type/object/name":"Blade Runner", "/type/object/type":"/film/film", "/film/film/starring/actor":["Harrison Ford", "Rutger Hauer"], "/film/film/director":"Ridley Scott", "/film/film/release_date_s":"1981" }
      [{ "id":"/guid/9202a8c04000641f8000000000009e89", "name":["Blade Runner", "Bladerunner"], "score":1.4320519, "match":true, "type":["/common/topic", "/film/film", "/media_common/adapted_work", "/award/award_winning_work", ...... ]},
      { "id":"/guid/9202a8c04000641f80000000002643d0", "name":["Blade"], "score":0.48852453, "match":false, "type":["/common/topic", "/film/film", "/award/award_winning_work", "/award/award_nominated_work", ....... ]},
      { "id":"/guid/9202a8c04000641f800000000e5daaae", "name":["Blade"], "score":0.46398318, "match":false, .....
      http://data.labs.freebase.com/recon/
  30. 30. Data Everywhere
  31. 31. Wikipedia Features
  32. 32. Wikipedia Features X X Error Prone -- Usually <99%
  33. 33. (Machine) Learning Semantics [pipeline diagram; step labels: extract features, get types, intersect the two sources, calculate feature counts per type, join feature counts with topics, generate type scores for topics; data labels: WEX, 5M articles, 37M features, 5M type assertions, 2.8M Wikipedia topics, 2.4M features, 1400 types, 1.6G scores]
  34. 34. /people/person distribution [chart; series: untyped topics, person topics, other topics, all topics] Data courtesy Viral Shah
  35. 35. RABJ: Humans in the loop
  36. 36. Thresholding Results 99% threshold at 16.75
  37. 37. /people/person assertions threshold 53K /people/person assertions
  38. 38. Training Wheels? Semantics are Everywhere
  39. 39. A Strong Tag for Food Inc. http://movi.es/BVl43
  40. 40. Widgets: Content Tags
  41. 41. Explicit Semantics
  42. 42. Rich Snippets <div class="post-item restaurant-gen-info hreview-aggregate"> <div class="item vcard"> <h1 class="fn org">Taylor's Refresher</h1> <div class="address"> <div class="ratings"> <ul class="star-rating-2 rating" title="4.0 star rating across 3 ratings"> <li class="current-rating average" style="width:80%;">4.0 star rating</li> <li class="star">&nbsp;</li> <li class="star">&nbsp;</li><li class="star">&nbsp;</li> <li class="star">&nbsp;</li> <li class="star">&nbsp;</li> </ul> <div class="rating-stats"> <span class="rating"> <span class="average">4.0</span> </span> rating over <span class="count">1</span> review </div>
  43. 43. RDFa microformats HTML5 MicroData Open Graph Protocol
  44. 44. Explicit Semantics in Surprising Places
  45. 45. Blog Tags::Entities
  46. 46. Metaweb Topic Block
  47. 47. Widget Microdata <div class="fb-widget" id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420" style="border:0; outline:0; padding:0; margin:0; position:relative;" itemscope="" itemid="http://www.freebase.com/id/en/taylor_swift" itemtype="http://www.freebase.com/id/music/artist"> ..... </div>
  48. 48. Thickening the Graph
  49. 49. "Vocabulary" Pattern taw shooter marksman marble marksman http://wordnet.freebaseapps.com photo: http://sarabbit.openphoto.net
  50. 50. Review (neighborhood) Pattern Eric Schlosser E. Coli Michael Pollan Robert Kenner
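
Slides 35 to 37 describe putting humans in the loop and then thresholding type scores at a 99% level (16.75 in the talk). One way such a cutoff could be chosen is sketched below: given a human-labeled sample of (score, correct) pairs, find the lowest score at which everything at or above it is still at least 99% correct. The sample data and the `pick_threshold` helper are hypothetical; the talk does not say how its threshold was actually computed.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    labeled: List[Tuple[float, bool]], target_precision: float = 0.99
) -> Optional[float]:
    """Lowest score t such that samples with score >= t meet the target precision.

    Returns None if no threshold reaches the target.
    """
    # Walk from the highest score down, tracking running precision.
    ordered = sorted(labeled, key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for seen, (score, is_correct) in enumerate(ordered, start=1):
        correct += int(is_correct)
        # Only evaluate once every sample sharing this score has been counted.
        is_last_of_tie = seen == len(ordered) or ordered[seen][0] != score
        if is_last_of_tie and correct / seen >= target_precision:
            best = score
    return best


# Hypothetical human judgments: (type score, was the assertion correct?).
SAMPLE = [
    (20.0, True), (18.0, True), (17.0, True),
    (16.0, False), (15.0, True), (10.0, False),
]

print(pick_threshold(SAMPLE))        # strict 99% target
print(pick_threshold(SAMPLE, 0.75))  # looser target admits lower scores
```

With a sample this small the estimate is very noisy; in practice the labeled set would need to be large enough for a 99% precision figure to be meaningful.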