Text Analytic Summit 2010

With over 12 million entities and 350 million relationships, Freebase is an excellent resource for performing text analysis. One way to look at document "understanding" is to think about how the entities in the document are connected on a knowledge graph. This is similar to the "reconciliation" process that is used to grow Freebase itself.
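
The idea of resolving entities by how they connect on a knowledge graph can be illustrated with a minimal sketch. The graph, the entity ids, and the scoring rule below are all invented for illustration; they are not real Freebase data and not the reconciliation algorithm Freebase uses. The sketch simply picks, for each ambiguous name, the candidate that shares the most graph edges with the other entities mentioned in the same document.

```python
from itertools import product

# Toy knowledge graph: hypothetical entity ids mapped to their neighbours.
# These ids and edges are invented for illustration, not real Freebase data.
GRAPH = {
    "film/blade_runner": {"person/harrison_ford_b", "person/ridley_scott"},
    "film/maytime": {"person/harrison_ford_a"},
    "person/harrison_ford_a": {"film/maytime"},
    "person/harrison_ford_b": {"film/blade_runner"},
    "person/ridley_scott": {"film/blade_runner"},
}

# Candidate entities for each surface string found in a document.
CANDIDATES = {
    "Harrison Ford": ["person/harrison_ford_a", "person/harrison_ford_b"],
    "Blade Runner": ["film/blade_runner"],
    "Ridley Scott": ["person/ridley_scott"],
}


def connectivity(assignment):
    """Count graph edges between the entities chosen for the mentions."""
    chosen = list(assignment.values())
    return sum(
        1
        for i, a in enumerate(chosen)
        for b in chosen[i + 1:]
        if b in GRAPH.get(a, set())
    )


def resolve(candidates):
    """Return the mention-to-entity assignment with the most connecting edges."""
    mentions = list(candidates)
    best = max(
        product(*(candidates[m] for m in mentions)),
        key=lambda combo: connectivity(dict(zip(mentions, combo))),
    )
    return dict(zip(mentions, best))


resolved = resolve(CANDIDATES)
print(resolved["Harrison Ford"])
```

Exhaustive search over all combinations is only workable for a handful of mentions; it is used here to keep the sketch short.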

The web is currently full of semantic hints, whether explicit (like those promoted by the Semantic Web) or implicit (like the use of blog widgets). Using these hints, text analytic methods can get a toehold on the web corpus at large.
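
As one hedged illustration of harvesting an explicit hint, the sketch below uses Python's standard `html.parser` module to collect the `itemid` and `itemtype` attributes from any element that carries the microdata `itemscope` attribute. The sample markup and its `example.org` URLs are placeholders invented for this example, and the class name `ItemScopeParser` is hypothetical.

```python
from html.parser import HTMLParser


class ItemScopeParser(HTMLParser):
    """Collect (itemid, itemtype) pairs from elements carrying itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        attr_map = dict(attrs)
        if "itemscope" in attr_map:
            self.items.append((attr_map.get("itemid"), attr_map.get("itemtype")))


# Placeholder markup; the URLs are invented for illustration.
SAMPLE = (
    '<div class="widget" itemscope="" '
    'itemid="http://example.org/id/some_entity" '
    'itemtype="http://example.org/id/some_type">...</div>'
)

parser = ItemScopeParser()
parser.feed(SAMPLE)
print(parser.items)
```

Each pair gives a strong identifier for the entity a page fragment is about, plus its declared type, without any natural-language processing.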


Usage Rights

CC Attribution License


Text Analytic Summit 2010 Presentation Transcript

  • 1. It's not what you said, it's how you said it. Jamie Taylor, Ph.D. Text Analytic Summit Boston 2010
  • 2. What do y'all mean "Semantics" The Web! Now with Better Flavor!
  • 3. Tim Berners-Lee, James Hendler and Ora Lassila    May 2001
  • 4. The Semantic Web? The Cake taken from http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/layerCake-4.png
  • 5. Linked Open Data
  • 6. The Real Web http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  • 7. Wish it were real
  • 8. Might be real
  • 9. Is real, but don't believe it
  • 10. Is currently useful
  • 11. Entities
  • 12. Identifiers Side Step Polysemy Bono, A.K.A. Paul David Hewson http://rdf.freebase.com/ns/en.paul_david_hewson
  • 13. Vocabulary Manufactures http://rdf.freebase.com/ns/automotive.make.model_s
  • 14. A socially managed semantic database
  • 15. Freebase has Many Types of Things
  • 16. Many Strong Identifiers http://rdf.freebase.com/ns/en.berlin_wall http://www.ellerdale.com/topics/view/0080-6ba0 http://www.bbc.co.uk/music/artists/7f347782-eb14-40c3-98e2-17b6e1bfe56c http://musicbrainz.org/artist/7f347782-eb14-40c3-98e2-17b6e1bfe56c http://rdf.freebase.com/ns/authority.musicbrainz.7f347782-eb14-40c3-98e2-17b6e1bfe56c
  • 17. 12 Million Entities 350 Million Relations
  • 18. Users contribute data Users extend the data model
  • 19. schema = vocabulary
  • 20. 1500 types with 500+ instances!! A range of vocabularies....
  • 21. Growing Freebase
  • 22. Reconciliation +=
  • 23. Reconciliation Relational Learning Record Matching Collective Entity Resolution Equivalence Mining Record Linking Identity Matching
  • 24. Reconciliation "Excuse Me" "Excuse Me" "Harrison Ford" "Harrison Ford" "Vanity Fair" "Maytime"
  • 25. Reconciliation "Fugitive" "Excuse Me" "Harrison Ford" "Harrison Ford" "Vanity Fair" "Blade Runner"
  • 26. A Graph of Entities
  • 27. Vocabulary contains located performed-at released-by created plays-in plays-in nationality education education located
  • 28. Reconciliation as "understanding" contains located performed-at released-by created plays-in plays-in nationality education education located
  • 29. {
        "/type/object/name":"Blade Runner",
        "/type/object/type":"/film/film",
        "/film/film/starring/actor":["Harrison Ford", "Rutger Hauer"],
        "/film/film/director":"Ridley Scott",
        "/film/film/release_date_s":"1981"
      }
      [{
        "id":"/guid/9202a8c04000641f8000000000009e89",
        "name":["Blade Runner", "Bladerunner"],
        "score":1.4320519,
        "match":true,
        "type":["/common/topic", "/film/film", "/media_common/adapted_work", "/award/award_winning_work", ...... ]
      }, {
        "id":"/guid/9202a8c04000641f80000000002643d0",
        "name":["Blade"],
        "score":0.48852453,
        "match":false,
        "type":["/common/topic", "/film/film", "/award/award_winning_work", "/award/award_nominated_work", ....... ]
      }, {
        "id":"/guid/9202a8c04000641f800000000e5daaae",
        "name":["Blade"],
        "score":0.46398318,
        "match":false,
        .....
      http://data.labs.freebase.com/recon/
  • 30. Data Everywhere
  • 31. Wikipedia Features
  • 32. Wikipedia Features X X Error Prone -- Usually <99%
  • 33. (Machine) Learning Semantics [pipeline diagram; labels: WEX; 5M articles; extract features; 37M features; get types; 5M type assertions; 2.8M Wikipedia topics; intersect the two sources; calculate feature counts per type; join feature counts with topics; generate type scores for topics; 2.4M features; 1.6G scores; 1400 types]
  • 34. /people/person distribution [chart; series: untyped topics, person topics, other topics, all topics] Data courtesy Viral Shah
  • 35. RABJ: Humans in the loop
  • 36. Thresholding Results 99% threshold at 16.75
  • 37. /people/person assertions threshold 53K /people/person assertions
  • 38. Training Wheels? Semantics are Everywhere
  • 39. A Strong Tag for Food Inc. http://movi.es/BVl43
  • 40. Widgets: Content Tags
  • 41. Explicit Semantics
  • 42. Rich Snippets <div class="post-item restaurant-gen-info hreview-aggregate"> <div class="item vcard"> <h1 class="fn org">Taylor's Refresher</h1> <div class="address"> <div class="ratings"> <ul class="star-rating-2 rating" title="4.0 star rating across 3 ratings"> <li class="current-rating average" style="width:80%;">4.0 star rating</li> <li class="star">&nbsp;</li> <li class="star">&nbsp;</li><li class="star">&nbsp;</li> <li class="star">&nbsp;</li> <li class="star">&nbsp;</li> </ul> <div class="rating-stats"> <span class="rating"> <span class="average">4.0</span> </span> rating over <span class="count">1</span> review </div>
  • 43. RDFa microformats HTML5 MicroData Open Graph Protocol
  • 44. Explicit Semantics in Surprising Places
  • 45. Blog Tags::Entities
  • 46. Metaweb Topic Block
  • 47. Widget Microdata <div class="fb-widget" id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420" style="border:0; outline:0; padding:0; margin:0; position:relative;" itemscope="" itemid="http://www.freebase.com/id/en/taylor_swift" itemtype="http://www.freebase.com/id/music/artist"> ..... </div>
  • 48. Thickening the Graph
  • 49. "Vocabulary" Pattern taw shooter marksman marble marksman http://wordnet.freebaseapps.com photo: http://sarabbit.openphoto.net
  • 50. Review (neighborhood) Pattern Eric Schlosser E. Coli Michael Pollan Robert Kenner
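
The reconciliation response on slide 29 is a ranked list of candidates, each with a `score` and a `match` flag. A minimal sketch of consuming such a list follows. The candidate data is trimmed from the slide (the elided `type` lists are left out), while the helper name `pick_match` and the rule of trusting only flagged candidates are assumptions made for illustration, not a documented part of the reconciliation service.

```python
import json

# Candidates trimmed from the slide 29 response; "type" lists are omitted.
RESPONSE = json.loads("""
[
  {"id": "/guid/9202a8c04000641f8000000000009e89",
   "name": ["Blade Runner", "Bladerunner"],
   "score": 1.4320519, "match": true},
  {"id": "/guid/9202a8c04000641f80000000002643d0",
   "name": ["Blade"],
   "score": 0.48852453, "match": false}
]
""")


def pick_match(candidates):
    """Return the id of the best candidate flagged as a match, or None.

    Hypothetical helper: it accepts only candidates whose "match" flag
    is true and breaks ties by the highest score.
    """
    matches = [c for c in candidates if c.get("match")]
    if not matches:
        return None
    return max(matches, key=lambda c: c["score"])["id"]


print(pick_match(RESPONSE))
```

Returning `None` when nothing is flagged leaves room for a human review step, in the spirit of the "Humans in the loop" slide.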