Text Analytic Summit 2010

With over 12 million entities and 350 million relationships, Freebase is an excellent resource for performing text analysis. One way to look at document "understanding" is to think about how the entities in the document are connected on a knowledge graph. This is similar to the "reconciliation" process that is used to grow Freebase itself.
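
A minimal sketch of this idea, assuming a tiny hand-made graph rather than real Freebase data: when a name in a document is ambiguous, prefer the candidate entity that shares the most neighbors with the other names found in the same document. The `/toy/...` identifiers, the links between them, and the function name `reconcile` are all hypothetical; only the surface names are borrowed from the reconciliation slides.

```python
# Toy graph: candidate entity id -> names of entities it is linked to.
# Ids and links are illustrative assumptions, not real Freebase data.
TOY_GRAPH = {
    "/toy/harrison_ford_a": {"Blade Runner", "Fugitive"},
    "/toy/harrison_ford_b": {"Excuse Me", "Vanity Fair", "Maytime"},
}

# Surface name -> candidate ids that share that name (polysemy).
CANDIDATES = {
    "Harrison Ford": ["/toy/harrison_ford_a", "/toy/harrison_ford_b"],
}


def reconcile(name, context):
    """Return (best_id, overlap) for the candidate most connected to context.

    `context` is the collection of other entity names seen in the document.
    Returns (None, 0) when the name has no known candidates.
    """
    context = set(context)
    scored = [
        (len(TOY_GRAPH[candidate] & context), candidate)
        for candidate in CANDIDATES.get(name, [])
    ]
    if not scored:
        return None, 0
    overlap, best = max(scored)
    return best, overlap
```

Calling `reconcile("Harrison Ford", ["Blade Runner", "Ridley Scott"])` selects the first candidate because one of its neighbors appears in the context. A real system would weigh edge types and many more signals than raw neighbor overlap.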

The web is currently full of semantic hints, whether explicit (like those promoted by the Semantic Web) or implicit (like the use of blog widgets). Using these hints, text analytic methods can get a toehold on the web corpus at large.
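
As one illustration of harvesting such hints, the sketch below pulls entity identifiers out of HTML5 microdata attributes using only Python's standard `html.parser` module. The sample markup is a shortened form of the widget shown in the slide transcript (the `id` and `style` attributes are left out); the class name `ItemScopeFinder` and the function `find_entities` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class ItemScopeFinder(HTMLParser):
    """Collect (itemid, itemtype) pairs from elements carrying itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A bare `itemscope` attribute is reported with a value of None,
        # so test for the key rather than for a truthy value.
        if "itemscope" in attr_map:
            self.items.append(
                (attr_map.get("itemid"), attr_map.get("itemtype"))
            )


def find_entities(html):
    """Return every (itemid, itemtype) pair found in the given HTML text."""
    finder = ItemScopeFinder()
    finder.feed(html)
    finder.close()
    return finder.items


# Shortened version of the widget markup from the slides.
SAMPLE = (
    '<div class="fb-widget" itemscope="" '
    'itemid="http://www.freebase.com/id/en/taylor_swift" '
    'itemtype="http://www.freebase.com/id/music/artist">...</div>'
)
```

Here `find_entities(SAMPLE)` yields one pair: the Taylor Swift identifier and the music artist type. Only top-level `itemscope` attributes are read; nested `itemprop` values are ignored to keep the sketch short.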

Published in: Technology

Transcript of "Text Analytic Summit 2010"

  1. It's not what you said, it's how you said it. Jamie Taylor, Ph.D. Text Analytic Summit Boston 2010
  2. What do y'all mean "Semantics" The Web! Now with Better Flavor!
  3. Tim Berners-Lee, James Hendler and Ora Lassila May 2001
  4. The Semantic Web? The Cake taken from http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/layerCake-4.png
  5. Linked Open Data
  6. The Real Web http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  7. Wish it were real
  8. Might be real
  9. Is real, but don't believe it
  10. Is currently useful
  11. Entities
  12. Identifiers Side Step Polysemy Bono, A.K.A. Paul David Hewson http://rdf.freebase.com/ns/en.paul_david_hewson
  13. Vocabulary Manufactures http://rdf.freebase.com/ns/automotive.make.model_s
  14. A socially managed semantic database
  15. Freebase has Many Types of Things
  16. Many Strong Identifiers http://rdf.freebase.com/ns/en.berlin_wall http://www.ellerdale.com/topics/view/0080-6ba0 http://www.bbc.co.uk/music/artists/7f347782-eb14-40c3-98e2-17b6e1bfe56c http://musicbrainz.org/artist/7f347782-eb14-40c3-98e2-17b6e1bfe56c http://rdf.freebase.com/ns/authority.musicbrainz.7f347782-eb14-40c3-98e2-17b6e1bfe56c
  17. 12 Million Entities 350 Million Relations
  18. Users contribute data Users extend the data model
  19. schema = vocabulary
  20. 1500 types with 500+ instances!! A range of vocabularies....
  21. Growing Freebase
  22. Reconciliation +=
  23. Reconciliation: Relational Learning, Record Matching, Collective Entity Resolution, Equivalence Mining, Record Linking, Identity Matching
  24. Reconciliation "Excuse Me" "Excuse Me" "Harrison Ford" "Harrison Ford" "Vanity Fair" "Maytime"
  25. Reconciliation "Fugitive" "Excuse Me" "Harrison Ford" "Harrison Ford" "Vanity Fair" "Blade Runner"
  26. A Graph of Entities
  27. Vocabulary: contains, located, performed-at, released-by, created, plays-in, plays-in, nationality, education, education, located
  28. Reconciliation as "understanding": contains, located, performed-at, released-by, created, plays-in, plays-in, nationality, education, education, located
  29. {
        "/type/object/name":"Blade Runner",
        "/type/object/type":"/film/film",
        "/film/film/starring/actor":["Harrison Ford", "Rutger Hauer"],
        "/film/film/director":"Ridley Scott",
        "/film/film/release_date_s":"1981"
      }
      [{
        "id":"/guid/9202a8c04000641f8000000000009e89",
        "name":["Blade Runner", "Bladerunner"],
        "score":1.4320519,
        "match":true,
        "type":["/common/topic", "/film/film", "/media_common/adapted_work", "/award/award_winning_work", ...... ]
      }, {
        "id":"/guid/9202a8c04000641f80000000002643d0",
        "name":["Blade"],
        "score":0.48852453,
        "match":false,
        "type":["/common/topic", "/film/film", "/award/award_winning_work", "/award/award_nominated_work", ....... ]
      }, {
        "id":"/guid/9202a8c04000641f800000000e5daaae",
        "name":["Blade"],
        "score":0.46398318,
        "match":false,
        .....
      http://data.labs.freebase.com/recon/
  30. Data Everywhere
  31. Wikipedia Features
  32. Wikipedia Features X X Error Prone -- Usually <99%
  33. (Machine) Learning Semantics [pipeline diagram; labels: get types; 5M type assertions; 2.8M Wikipedia topics; intersect the two sources; calculate feature counts per type; join feature counts with topics; generate type scores for topics; 2.4M features; 1.6G scores; 1400 types; extract features; 37M features; 5M articles; WEX]
  34. /people/person distribution [chart; series: untyped topics, person topics, other topics, all topics] Data courtesy Viral Shah
  35. RABJ: Humans in the loop
  36. Thresholding Results 99% threshold at 16.75
  37. /people/person assertions threshold 53K /people/person assertions
  38. Training Wheels? Semantics are Everywhere
  39. A Strong Tag for Food Inc. http://movi.es/BVl43
  40. Widgets: Content Tags
  41. Explicit Semantics
  42. Rich Snippets
      <div class="post-item restaurant-gen-info hreview-aggregate">
       <div class="item vcard">
        <h1 class="fn org">Taylor's Refresher</h1>
        <div class="address">
        <div class="ratings">
         <ul class="star-rating-2 rating" title="4.0 star rating across 3 ratings">
          <li class="current-rating average" style="width:80%;">4.0 star rating</li>
          <li class="star">&nbsp;</li>
          <li class="star">&nbsp;</li>
          <li class="star">&nbsp;</li>
          <li class="star">&nbsp;</li>
          <li class="star">&nbsp;</li>
         </ul>
         <div class="rating-stats">
          <span class="rating"> <span class="average">4.0</span> </span>
          rating over <span class="count">1</span> review
         </div>
  43. RDFa, microformats, HTML5 MicroData, Open Graph Protocol
  44. Explicit Semantics in Surprising Places
  45. Blog Tags::Entities
  46. Metaweb Topic Block
  47. Widget Microdata
      <div class="fb-widget"
           id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420"
           style="border:0; outline:0; padding:0; margin:0; position:relative;"
           itemscope=""
           itemid="http://www.freebase.com/id/en/taylor_swift"
           itemtype="http://www.freebase.com/id/music/artist">
       .....
      </div>
  48. Thickening the Graph
  49. "Vocabulary" Pattern taw shooter marksman marble marksman http://wordnet.freebaseapps.com photo: http://sarabbit.openphoto.net
  50. Review (neighborhood) Pattern: Eric Schlosser, E. Coli, Michael Pollan, Robert Kenner
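
The reconciliation response on slide 29 can be post-processed with a few lines of Python. This is a sketch under stated assumptions: the ids, names, scores, and `match` flags are copied from the slide, the `type` lists are left out because the slide elides them, and `match` is assumed to be the service's own verdict on each candidate. The helper names `best_match` and `score_margin` are hypothetical and not part of any Freebase API.

```python
import json

# Candidates as shown on slide 29, without the elided "type" lists.
SLIDE_RESPONSE = json.loads("""
[
  {"id": "/guid/9202a8c04000641f8000000000009e89",
   "name": ["Blade Runner", "Bladerunner"],
   "score": 1.4320519, "match": true},
  {"id": "/guid/9202a8c04000641f80000000002643d0",
   "name": ["Blade"],
   "score": 0.48852453, "match": false},
  {"id": "/guid/9202a8c04000641f800000000e5daaae",
   "name": ["Blade"],
   "score": 0.46398318, "match": false}
]
""")


def best_match(candidates):
    """Return the highest-scoring candidate flagged as a match, or None."""
    matched = [c for c in candidates if c.get("match")]
    return max(matched, key=lambda c: c["score"]) if matched else None


def score_margin(candidates):
    """Gap between the two highest scores; a small gap suggests ambiguity."""
    scores = sorted((c["score"] for c in candidates), reverse=True)
    return scores[0] - scores[1] if len(scores) > 1 else float("inf")
```

For the slide's data, `best_match(SLIDE_RESPONSE)` returns the "Blade Runner" candidate, and the margin between the top two scores is large, which is consistent with the service flagging only that candidate as a match.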