Random indexing spaces for bridging the
           Human and Data Webs


    Jose Quesada, Ralph Brandao-Vidal, Lael schoo...
Introduction
Most of the existing knowledge on the Web is in
plain, unstructured text
The problem we aim to solve in this ...
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam
vulputate ipsum ac erat cursus et adipiscing diam pulvinar. I...
What's 'human web'




Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
What's 'data web'




Ontotext's linked data semantic repository (LDSR)




Resources vs Literals
  Resource
The first explicit definition of resource is found in RFC 2396 and states that
A resource...
What's in an identifier?
●   Uniform Resource Identifier (URI)
Scheme ":" ["//" authority "/"] [path] [ "?" query ]
[ "#" ...
Why turning literals into resources
                is useful
●   Increased integration of the human and data
    Webs
●  ...
●   We will use statistical semantics to generate a
    vector for any literal


●   This vector can be used to uniquely i...
Attaching new resources to the
      center of the graph




Statistical semantics




Statistical semantics
Exploits statistical patterns
of human word usage to
figure out word meaning
●   LSA (Landauer)     ...
Example of text data: Titles of Some Technical
                       Memos

●
    c1: Human machine interface for ABC com...
Matrix of words by contexts




Singular value
                                                                                Decomposition of the
      ...
Before                   After
r (human - user) =                    -.38                     .94
r (human - minors) =    ...
Similarity Measures
                                                                N
●
    Dot Product                   ...
Parallel spaces
●   Dbpedia                                        ●   Wikipedia
    ●   Structured                       ...
Dbpedia-wikipedia corpus
●   Currently 4M concepts. We used the most
    central 1M
    ●   Has to have > 100 words after ...
How to use statistical semantics to
 convert literals into resources


        Any literal can have a vector

Computing nea...
Random indexing
●   Same dimension-reduction without SVD
●   For each context, assign a random vector
    (nonzero seed va...
Training




Generating the Meaningful, Unique
         Identifier (MUID)
●   Each literal gets a 1000-dimensional vector.
    This vec...
Example results. Taking any page and getting the
            closest dbpedia concepts
results for the search 'http://www.g...
Example results




Problems
●   A nearest-neighbor search on the current space takes 2
    minutes. Fortunately, it's easily parallelizable


●   Vec...
Advantages
●   We can now use any text as subject. We can say that an essay is a
    review, or that a particular paragrap...
Future work
●   Merge meaningful ID generation and
    compression into a single step


●   Improve nearest neighbors time...
What's in an identifier?
         Uniform Resource Identifier (URI)
 Scheme ":" ["//" authority "/"] [path] [ "?" query ]
...
Random indexing spaces for bridging the Human and Data Webs
                    Jose Quesada, quesada@gmail.com

       Ma...
Irmles2010 Random indexing spaces to bridge the human and data webs

  1. Random indexing spaces for bridging the Human and Data Webs
     Jose Quesada, Ralph Brandao-Vidal, Lael Schooler
     Max Planck Institute, Adaptive Behavior and Cognition, Berlin
  2. Introduction
     Most of the existing knowledge on the Web is in plain, unstructured text.
     The problem we aim to solve in this paper is simply converting literals into resources.
  3. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam vulputate ipsum ac erat cursus et adipiscing diam pulvinar. In at ultricies odio. Donec sodales enim euismod nulla pulvinar et elementum velit congue. Cras ac quam ante, non facilisis massa.
     mpib:c97169cadaadbba92afbc2895b9eb9f
     unique, meaningful ID (MUID)
  4. What's 'human web'
  5. What's 'data web'
  6. Ontotext's linked data semantic repository (LDSR)
  7. Resources vs Literals
     Resource: The first explicit definition of resource is found in RFC 2396, which states: A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.
     Literals: Literals are values that do not have a unique identifier. They are usually a string that contains some human-readable text, for example names, dates and other types of values about a subject. In the previous example, the string ‘Fido’ is a literal. They optionally have a language (e.g., English, Japanese) or a type (e.g., integer, Boolean, string), but this is about all that can be said about literals. They cannot have properties like resources. Unlike resources, literals cannot link to the rest of the graph. They are second-class citizens on the Semantic Web. In terms of graphs, literals are one-way streets: since they cannot be the subject of a triple, there can be no outgoing links to other nodes.
  8. What's in an identifier?
     ● Uniform Resource Identifier (URI)
       Scheme ":" ["//" authority "/"] [path] [ "?" query ] [ "#" fragment]
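As a rough illustration of the URI components named on this slide, the sketch below splits a URI with Python's standard `urllib.parse.urlsplit`. The URI itself is hypothetical: it reuses the `mpib` namespace host from the slides, but the path, query string and fragment placement are invented so that every component is non-empty.

```python
from urllib.parse import urlsplit

# Hypothetical URI (invented for illustration, not taken from the talk).
uri = "http://mpi-ldsr.ontotext.com/mpib?lang=en#c97169cadaadbba92afbc2895b9eb9f"

parts = urlsplit(uri)
print("scheme:   ", parts.scheme)
print("authority:", parts.netloc)
print("path:     ", parts.path)
print("query:    ", parts.query)
print("fragment: ", parts.fragment)
```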
  9. Why turning literals into resources is useful
     ● Increased integration of the human and data Webs
     ● Dangling nodes prevent us from applying some machine learning techniques:
       Number of URI: 126,875,974
       Number of Literals: 227,758,535
       Total number of entities: 354,635,159
  10. ● We will use statistical semantics to generate a vector for any literal
      ● This vector can be used to uniquely identify a literal; it makes it operationally equivalent to a resource
  11. Attaching new resources to the center of the graph
  12. Statistical semantics
  13. Statistical semantics
      Exploits statistical patterns of human word usage to figure out word meaning
      Models:
      ● LSA (Landauer)
      ● Topic models (Griffiths)
      ● BEAGLE (Jones)
      ● HAL (Burgess)
      ● Random indexing (Sahlgren)
      ● SP (Dennis)
      Properties:
      ● Completely unsupervised
      ● Scale better than, say, neural networks
      ● Most require linear algebra operations on large sparse matrices
      ● Computationally expensive
  14. Example of text data: Titles of Some Technical Memos
      ● c1: Human machine interface for ABC computer applications
      ● c2: A survey of user opinion of computer system response time
      ● c3: The EPS user interface management system
      ● c4: System and human system engineering testing of EPS
      ● c5: Relation of user perceived response time to error measurement
      ● m1: The generation of random, binary, ordered trees
      ● m2: The intersection graph of paths in trees
      ● m3: Graph minors IV: Widths of trees and well-quasi-ordering
      ● m4: Graph minors: A survey
  15. Matrix of words by contexts
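A words-by-contexts matrix for the nine memo titles can be built along the following lines. This is only a sketch: the stoplist and the rule of keeping words that occur in at least two titles are assumptions, since the slides show the resulting matrix only as an image.

```python
import re
from collections import Counter

titles = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

# Assumed stoplist; the one used in the talk is not given.
STOPLIST = {"a", "and", "for", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stoplist words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPLIST]

tokens = {cid: tokenize(text) for cid, text in titles.items()}

# Assumed selection rule: keep words that occur in at least two titles.
doc_freq = Counter(w for toks in tokens.values() for w in set(toks))
words = sorted(w for w, df in doc_freq.items() if df >= 2)
contexts = list(titles)

# One row per word, one column per context, cells hold raw counts.
matrix = [[tokens[c].count(w) for c in contexts] for w in words]

for w, row in zip(words, matrix):
    print(f"{w:10s}", row)
```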
  16. Singular value decomposition of the words by contexts matrix (figure: rows are words (states), columns are contexts)
  17-22. Singular value decomposition of the words by contexts matrix
  23.                          Before    After
      r (human - user)   =      -.38      .94
      r (human - minors) =      -.28     -.83
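The before/after comparison can be approximated with a truncated SVD. The sketch below assumes NumPy, hard-codes a 12-word by 9-context count matrix for the memo titles (the word selection is an assumption), keeps two dimensions (also an assumption, the slides do not state the number), and uses Pearson correlation between word rows.

```python
import numpy as np

words = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Rows follow `words`; columns are the contexts c1..c5, m1..m4.
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Full decomposition A = U diag(s) Vt, then keep the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of retained dimensions
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def row_corr(M, w1, w2):
    """Pearson correlation between the rows of M for two words."""
    i, j = words.index(w1), words.index(w2)
    return float(np.corrcoef(M[i], M[j])[0, 1])

for pair in [("human", "user"), ("human", "minors")]:
    before = row_corr(A, *pair)
    after = row_corr(A_k, *pair)
    print(pair, "before:", round(before, 2), "after:", round(after, 2))
```

The point of the comparison is that "human" and "user" never share a title, yet the reduced space places them close together because they occur in similar contexts.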
  24. Similarity Measures
      ● Dot product: $x \cdot y = \sum_{i=1}^{N} x_i y_i$
      ● Cosine: $\cos(\theta_{xy}) = \frac{x \cdot y}{\|x\| \, \|y\|}$
      ● Euclidean: $\mathrm{euclid}(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$
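The three measures on this slide translate directly into a few lines of plain Python. This is a minimal sketch with invented example vectors; the Euclidean distance is written with the usual square root.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y (both must be nonzero)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def euclid(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Invented example: y points in the same direction as x but is twice as long.
x = [1.0, 2.0, 2.0]
y = [2.0, 4.0, 4.0]
print(dot(x, y), cosine(x, y), euclid(x, y))
```

In this example the cosine treats the two vectors as identical in direction while the Euclidean distance still separates them, which is one reason cosine is a common choice when vector length mostly reflects text length.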
  25. Parallel spaces
      ● Dbpedia
        ● Structured
        ● Well-connected to the rest of the semantic web
      ● Wikipedia
        ● Plain text
        ● Representative of human knowledge and interest
        ● Pageviews reflect how present a concept is in the average human mind
      ● One-to-one mappings
  26. Dbpedia-wikipedia corpus
      ● Currently 4M concepts. We used the most central 1M
      ● Has to have > 100 words after stoplist
      ● More than 5 incoming and outgoing links
  27. How to use statistical semantics to convert literals into resources
      Any literal can have a vector
      Computing nearest neighbors will find similar resources
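Finding the resources closest to a literal's vector can be sketched as a brute-force cosine search. The resource names and the 3-dimensional vectors below are invented toy values, and `nearest_neighbors` is a hypothetical helper, not a function from the authors' system.

```python
import math

def cosine(x, y):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

def nearest_neighbors(query, resources, k=3):
    """Return the names of the k resources most similar to `query`, best first."""
    scored = [(cosine(query, vec), name) for name, vec in resources.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy vectors with invented values; real vectors would have many more dimensions.
resources = {
    "Google_Apps": [0.9, 0.1, 0.0],
    "Graph_theory": [0.0, 0.2, 0.9],
    "IGoogle": [0.8, 0.3, 0.1],
}
literal_vector = [1.0, 0.2, 0.0]

print(nearest_neighbors(literal_vector, resources, k=2))
```

Each query is compared against every resource, so the cost grows linearly with the number of resources; the comparisons are independent of each other and can be split across workers.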
  28. Random indexing
      ● Same dimension reduction without SVD
      ● For each context, assign a random vector (the number of nonzero seed values is a free parameter)
      ● A word will be the average of all context vectors it appears in
      ● A new doc vector (e.g., a query) is the average of the vectors for the words it contains
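The three steps on this slide can be sketched in plain Python as follows. Several details are assumptions: the seed vectors are sparse with entries of +1 or -1, `NONZERO = 10` is an arbitrary value for the free parameter, and each context counts once per word regardless of how often the word repeats in it.

```python
import random

DIM = 1000    # dimensionality of the space
NONZERO = 10  # number of nonzero seed values (free parameter, value assumed)

def index_vector(rng):
    """Sparse random vector: NONZERO entries set to +1 or -1, the rest 0."""
    vec = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec

def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train(contexts, seed=0):
    """contexts maps a context name to its list of words; returns word vectors."""
    rng = random.Random(seed)
    context_vecs = {name: index_vector(rng) for name in contexts}
    seen = {}
    for name, ws in contexts.items():
        for w in set(ws):  # each context contributes once per word
            seen.setdefault(w, []).append(context_vecs[name])
    return {w: average(vs) for w, vs in seen.items()}

def doc_vector(text_words, word_vecs):
    """Vector for new text: average of the vectors of its known words."""
    known = [word_vecs[w] for w in text_words if w in word_vecs]
    return average(known) if known else [0.0] * DIM

# Toy training data based on three of the memo titles.
contexts = {
    "c1": ["human", "interface", "computer"],
    "c3": ["eps", "user", "interface", "system"],
    "m2": ["graph", "trees"],
}
word_vecs = train(contexts)
query = doc_vector(["human", "interface"], word_vecs)
print(len(query), sum(1 for v in query if v != 0.0))
```

No matrix decomposition is involved: training is a single pass over the contexts, and new contexts can be added later without recomputing existing vectors.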
  29. Training
  30. Generating the Meaningful, Unique Identifier (MUID)
      ● Each literal gets a 1000-dimensional vector. This vector 'captures the meaning' of the text
      ● Too long to be passed around in RDF. MD5 hashing compacts it
      @prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
      mpib:c97169cadaadbba92afbc2895b9eb9f
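Compacting a vector into a short identifier with MD5 might look like the sketch below. The serialization (each component packed as a big-endian 64-bit float) and the helper name `muid` are assumptions; the slides do not say how the vector is turned into bytes before hashing.

```python
import hashlib
import struct

def muid(vector, prefix="mpib"):
    """Hash a vector of floats into a compact, prefixed identifier."""
    # Assumed serialization: every component as a big-endian 64-bit float.
    payload = struct.pack(f">{len(vector)}d", *vector)
    return f"{prefix}:{hashlib.md5(payload).hexdigest()}"

# Invented 1000-dimensional example vector with a single nonzero entry.
vec = [0.0] * 1000
vec[3] = 1.0
print(muid(vec))
```

Identical vectors always map to the same identifier. Note that `hexdigest()` always returns 32 hexadecimal characters, so the exact encoding behind the identifiers shown on the slides may differ from this sketch.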
  31. Example results. Taking any page and getting the closest dbpedia concepts
      Results for the search 'http://www.google.de':
      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
      @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
      @prefix dbpedia: <http://en.wikipedia.org/wiki#> .
      mpib:c97169cadaadbba92afbc2895b9eb9f skos:related dbpedia:http://en.wikipedia.org/wiki/Google_Alerts
      mpib:8482e762cceb5d7636529cccf1c825 skos:related dbpedia:http://en.wikipedia.org/wiki/Google_Apps
      mpib:278c93125941f38c18dfe67591c94a5 skos:related dbpedia:http://en.wikipedia.org/wiki/Googlepedia
      mpib:2885141b46cd2fdc3c447bcfa18b73 skos:related dbpedia:http://en.wikipedia.org/wiki/IGoogle
      mpib:2959b4e35ca423f34a47b8fce196cf skos:related dbpedia:http://en.wikipedia.org/wiki/List_of_Google_products
  32. Example results
  33. Problems
      ● A nearest-neighbor search on the current space takes 2 minutes. Fortunately, it's easily parallelizable
      ● Vectors depend on the corpora. Two Wikipedia versions from different years may render slightly different vectors
      ● Selecting the most relevant concepts on Wikipedia is an extra source of free parameters
  34. Advantages
      ● We can now use any text as subject. We can say that an essay is a review, or that a particular paragraph is insightful
      ● Works at different granularity levels, from single word to entire books
      ● We could use this to disambiguate text
      ● It may reduce graph search time by connecting dangling nodes to central parts of the graph. Whether this is a good idea is an open question
  35. Future work
      ● Merge meaningful ID generation and compression into a single step
      ● Improve nearest neighbors time
      ● Apply it in a realistic use case scenario
  36. What's in an identifier?
      Uniform Resource Identifier (URI)
      Scheme ":" ["//" authority "/"] [path] [ "?" query ] [ "#" fragment]
      Meaningful, unique identifier (MUID)
      @prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
      mpib:c97169cadaadbba92afbc2895b9eb9f
  37. Random indexing spaces for bridging the Human and Data Webs
      Jose Quesada, quesada@gmail.com
      Max Planck Institute, Adaptive Behavior and Cognition, Berlin
