Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Irmles2010 Random indexing spaces to bridge the human and data webs
1. Random indexing spaces for bridging the
Human and Data Webs
Jose Quesada, Ralph Brandao-Vidal, Lael schooler
Max Planck Institute, Adaptive Behavior and Cognition, Berlin
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
2. Introduction
Most of the existing knowledge on the Web is in
plain, unstructured text
The problem we aim to solve in this paper is
simply converting literals into resources
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
3. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam
vulputate ipsum ac erat cursus et adipiscing diam pulvinar. In
at ultricies odio. Donec sodales enim euismod nulla pulvinar et
elementum velit congue. Cras ac quam ante, non facilisis
massa.
mpib:c97169cadaadbba92afbc2895b9eb9f
unique, meaningful ID (MUID)
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
5. What's 'data web'
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
6. Ontotext's linked data semantic repository (LDSR)
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
7. Resources vs Literals
Resource
The first explicit definition of resource is found in RFC 2396 and states that
A resource can be anything that has identity. Familiar examples
include an electronic document, an image, a service (e.g., "today's weather
report for Los Angeles"), and a collection of other resources. Not all
resources are network "retrievable"; e.g., human beings, corporations, and
bound books in a library can also be considered resources
Literals
Literals are values that do not have a unique identifier. They
are usually a string that contains some human-readable text,
for example names, dates and other types of values about a subject. In the
previous example, the string ‘Fido’ is a literal. They optionally have a
language (e.g., English, Japanese) or a type (e.g., integer, Boolean, string),
but this is about all that can be said about literals. They cannot have
properties like resources. Unlike resources, literals cannot link to the rest of
the graph. They are second-class citizens on the Semantic Web. In terms
of graphs, literals are one-way streets: since they
cannot be the
subject of a triple, there can be no outgoing links to other
nodes.
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
8. What's in an identifier?
● Uniform Resource Identifier (URI)
Scheme ":" ["//" authority "/"] [path] [ "?" query ]
[ "#" fragment]
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
9. Why turning literals into resources
is useful
● Increased integration of the human and data
Webs
● Dangling nodes prevent us from applying some
machine learning techniques:
Number of URI: 126,875,974
Number of Literals: 227,758,535
Total number of entities: 354,635,159
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
10. ● We will use statistical semantics to generate a
vector for any literal
● This vector can be used to uniquely identify a
literal; it makes it operationally equivalent to a
resource
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
11. Attaching new resources to the
center of the graph
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
13. Statistical semantics
Exploits statistical patterns
of human word usage to
figure out word meaning
● LSA (Landauer) ● Completely unsupervised
Scale better than say neural networks
Topics Models (Griffiths)
●
●
● Most require lineal algebra operations
● BEAGLE (Jones) on large sparse matrices
● HAL (Burgess) ● Computationally expensive
● Random indexing (Sahlgren)
● SP (Dennis)
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
14. Example of text data: Titles of Some Technical
Memos
●
c1: Human machine interface for ABC computer applications
●
c2: A survey of user opinion of computer system response time
●
c3: The EPS user interface management system
●
c4: System and human system engineering testing of EPS
●
c5: Relation of user perceived response time to error measurement
●
m1: The generation of random, binary, ordered trees
●
m2: The intersection graph of paths in trees
●
m3: Graph minors IV: Widths of trees and well-quasi-ordering
●
m4: Graph minors: A survey
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
15. Matrix of words by contexts
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
16. Singular value
Decomposition of the
= words by contexts matrix
Contexts
Words (states)
=
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
17. Singular value
Decomposition of the
= words by contexts matrix
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
18. Singular value
Decomposition of the
= words by contexts matrix
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
19. Singular value
Decomposition of the
= words by contexts matrix
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
20. Singular value
Decomposition of the
= words by contexts matrix
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
21. Singular value
Decomposition of the
= words by contexts matrix
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
22. Singular value
Decomposition of the
= words by contexts matrix
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
23. Before After
r (human - user) = -.38 .94
r (human - minors) = -.28 -.83
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
24. Similarity Measures
N
●
Dot Product x. y = ∑ xi yi
i =1
x. y
• Cosine cos(θ xy ) =
x y
N
• Euclidean euclid ( x, y ) = ∑ ( xi − yi ) 2
i =1
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
25. Parallel spaces
● Dbpedia ● Wikipedia
● Structured ● Plain text
● Well-connected to the ● Representative of
rest of the semantic human knowledge and
web interest
● Pageviews reflect how
● One-to-one present a concept is in
mappings the average human
mind
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
26. Dbpedia-wikipedia corpus
● Currently 4M concepts. We used the most
central 1M
● Has to have > 100 words after stoplist
● More than 5 incoming and outgoing links
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
27. How to use statistical semantic to
convert literals into resources
Any literal can have a vector
Computing nearest neighbors will find similar
resources
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
28. Random indexing
● Same dimension-reduction without SVD
● For each context, assign a random vector
(nonzero seed values is a free parameter).
● A word will be the average of all context vectors
it appears in
● A new doc vector (e.g., a query) is the average
of the vectors for the words it contains
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
30. Generating the Meaningful, Unique
Identifier (MUID)
● Each literal gets a 1000-dimensional vector.
This vector 'captures the meaning' of the text
● Too long to be passed around in RDF. MD5
hashing compacts it
@prefix mpib <http://mpi-ldsr.ontotext.com/mpib#> .
mpib:c97169cadaadbba92afbc2895b9eb9f
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
31. Example results. Taking any page and getting the
closest dbpedia concepts
results for the search 'http://www.google.de' :
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix mpib <http://mpi-ldsr.ontotext.com/mpib#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dbpedia: <http://en.wikipedia.org/wiki#>
mpib:c97169cadaadbba92afbc2895b9eb9f skos:related
dbpebia:http://en.wikipedia.org/wiki/Google_Alerts
mpib:8482e762cceb5d7636529cccf1c825 skos:related
dbpebia:http://en.wikipedia.org/wiki/Google_Apps
mpib:278c93125941f38c18dfe67591c94a5 skos:related
dbpebia:http://en.wikipedia.org/wiki/Googlepedia
mpib:2885141b46cd2fdc3c447bcfa18b73 skos:related dbpebia:http://en.wikipedia.org/wiki/IGoogle
mpib:2959b4e35ca423f34a47b8fce196cf skos:related
dbpebia:http://en.wikipedia.org/wiki/List_of_Google_products
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
33. Problems
● Nearest neighbors on the current space takes 2
minutes. Fortunately, it's easily paralellizable
● Vectors depend on the corpora. Two wikipedia
version from different years may render slightly
different vectors
● Selecting the most relevant concepts on wikipedia is
an extra source of free parameters
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
34. Advantages
● We can now use any text as subject. We can say that an essay is a
review, or that a particular paragraph is insightful
● Works at different granularity levels, from single word to entire books
● We could use this to disambiguate text
● It may reduce graph search time by connecting dangling nodes to central
parts of the graph. Whether this is a good idea is an open question
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
35. Future work
● Merge meaningful ID generation and
compression into a single step
● Improve nearest neighbors time
● Apply it in a realistic use case scenario
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
36. What's in an identifier?
Uniform Resource Identifier (URI)
Scheme ":" ["//" authority "/"] [path] [ "?" query ]
[ "#" fragment]
Meaningful, unique identifier (MUID)
@prefix mpib <http://mpi-ldsr.ontotext.com/mpib#> .
mpib:c97169cadaadbba92afbc2895b9eb9f
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
37. Random indexing spaces for bridging the Human and Data Webs
Jose Quesada, quesada@gmail.com
Max Planck Institute, Adaptive Behavior and Cognition, Berlin
Jose Quesada: Random indexing spaces for bridging the Human and Data Webs