Random indexing spaces for bridging the
           Human and Data Webs


    Jose Quesada, Ralph Brandao-Vidal, Lael Schooler

Max Planck Institute, Adaptive Behavior and Cognition, Berlin




             Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
Introduction
Most of the existing knowledge on the Web is in
plain, unstructured text
The problem we aim to solve in this paper is
simply converting literals into resources




Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam
vulputate ipsum ac erat cursus et adipiscing diam pulvinar. In
at ultricies odio. Donec sodales enim euismod nulla pulvinar et
elementum velit congue. Cras ac quam ante, non facilisis
massa.




   mpib:c97169cadaadbba92afbc2895b9eb9f
   unique, meaningful ID (MUID)


What is the 'human web'?




What is the 'data web'?




Ontotext's linked data semantic repository (LDSR)




Resources vs Literals
  Resource
The first explicit definition of resource is found in RFC 2396, which states:
A resource can be anything that has identity. Familiar examples
include an electronic document, an image, a service (e.g., "today's weather
report for Los Angeles"), and a collection of other resources. Not all
resources are network "retrievable"; e.g., human beings, corporations, and
bound books in a library can also be considered resources.
  Literals
Literals are values that do not have a unique identifier. They
are usually a string that contains some human-readable text,
for example names, dates and other types of values about a subject. In the
previous example, the string ‘Fido’ is a literal. They optionally have a
language (e.g., English, Japanese) or a type (e.g., integer, Boolean, string),
but this is about all that can be said about literals. They cannot have
properties like resources. Unlike resources, literals cannot link to the rest of
the graph. They are second-class citizens on the Semantic Web. In terms
of graphs, literals are one-way streets: since they cannot be the
subject of a triple, there can be no outgoing links to other
nodes.
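The one-way-street property can be illustrated with a minimal sketch in plain Python. The `Literal` class, the helper functions, and the `http://example.org/dog1` URI are hypothetical names introduced here for illustration; they are not part of any RDF library or of the original slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """A literal value: text plus an optional language tag (hypothetical class)."""
    value: str
    lang: Optional[str] = None


def add_triple(graph, subject, predicate, obj):
    """Append a triple, refusing literals in the subject position."""
    if isinstance(subject, Literal):
        raise ValueError("a literal cannot be the subject of a triple")
    graph.append((subject, predicate, obj))


def outgoing(graph, node):
    """Return the triples whose subject is the given node."""
    return [triple for triple in graph if triple[0] == node]


graph = []
dog = "http://example.org/dog1"            # hypothetical resource URI
name = Literal("Fido", lang="en")          # the literal from the text
add_triple(graph, dog, "http://xmlns.com/foaf/0.1/name", name)

print(len(outgoing(graph, dog)))    # the resource has one outgoing link
print(len(outgoing(graph, name)))   # the literal has none: a dangling node
```

The literal can be reached from the resource, but nothing can be said about it in turn, which is the gap the rest of the talk addresses.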



What's in an identifier?
●   Uniform Resource Identifier (URI)
Scheme ":" ["//" authority "/"] [path] [ "?" query ]
[ "#" fragment]
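As a small illustration of these components, the sketch below splits a URI with Python's standard `urllib.parse.urlsplit`. The URI is the `mpib` namespace from this deck with one of its identifiers as the fragment; using it as the example is a choice made here, not something the slides show.

```python
from urllib.parse import urlsplit

# Expanded form of mpib:c97169cadaadbba92afbc2895b9eb9f
uri = "http://mpi-ldsr.ontotext.com/mpib#c97169cadaadbba92afbc2895b9eb9f"
parts = urlsplit(uri)

print(parts.scheme)    # http
print(parts.netloc)    # mpi-ldsr.ontotext.com  (the authority)
print(parts.path)      # /mpib
print(parts.query)     # empty string: this URI has no query
print(parts.fragment)  # c97169cadaadbba92afbc2895b9eb9f
```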




Why turning literals into resources
                is useful
●   Increased integration of the human and data
    Webs
●   Dangling nodes prevent us from applying some
    machine learning techniques:


    Number of URIs:                             126,875,974
    Number of Literals:                         227,758,535
    Total number of entities:                   354,635,159




●   We will use statistical semantics to generate a
    vector for any literal


●   This vector can be used to uniquely identify a
    literal; it makes it operationally equivalent to a
    resource




Attaching new resources to the
      center of the graph




Statistical semantics
Exploits statistical patterns
of human word usage to
figure out word meaning
●   LSA (Landauer)
●   Topic models (Griffiths)
●   BEAGLE (Jones)
●   HAL (Burgess)
●   Random indexing (Sahlgren)
●   SP (Dennis)

●   Completely unsupervised
●   Scale better than, say, neural networks
●   Most require linear algebra operations on large sparse matrices
●   Computationally expensive


Example of text data: Titles of Some Technical
                       Memos

●   c1: Human machine interface for ABC computer applications
●   c2: A survey of user opinion of computer system response time
●   c3: The EPS user interface management system
●   c4: System and human system engineering testing of EPS
●   c5: Relation of user perceived response time to error measurement

●   m1: The generation of random, binary, ordered trees
●   m2: The intersection graph of paths in trees
●   m3: Graph minors IV: Widths of trees and well-quasi-ordering
●   m4: Graph minors: A survey




Matrix of words by contexts
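The matrix can be rebuilt from the nine titles with a short sketch. Two choices here are assumptions, not stated on the slides: the small stoplist, and keeping only words that occur in at least two titles (the convention of the classic LSA version of this example).

```python
import re
from math import sqrt

titles = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
STOPLIST = {"a", "and", "for", "in", "of", "the", "to"}  # assumed stoplist


def tokenize(text):
    """Lowercase, split on non-letters, drop stoplist words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPLIST]


contexts = list(titles)  # column order: c1..c5, m1..m4
tokens = {c: tokenize(t) for c, t in titles.items()}

# Assumption: keep only words that occur in at least two titles.
vocab = sorted({w for toks in tokens.values() for w in toks
                if sum(w in other for other in tokens.values()) >= 2})

# One row per word, one column per context, cells are raw counts.
matrix = {w: [tokens[c].count(w) for c in contexts] for w in vocab}


def pearson(x, y):
    """Pearson correlation between two equal-length rows."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


for word in vocab:
    print(f"{word:10s}", matrix[word])
print(round(pearson(matrix["human"], matrix["user"]), 2))  # -0.38
```

On the raw counts, "human" and "user" never share a title, so their rows correlate negatively even though the words are related in meaning.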




Singular value decomposition of the words by contexts matrix
[Figure: words (states) by contexts matrix = its singular value decomposition]
                          Before    After
r (human - user)   =       -.38      .94
r (human - minors) =       -.28     -.83
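These before/after correlations can be explored with a rank-2 truncated SVD in NumPy. This is a sketch under assumptions: the matrix holds raw counts of the words that occur in at least two of the memo titles, and two dimensions are retained; the slides do not state either choice explicitly.

```python
import numpy as np

# Rows: words; columns: c1..c5, m1..m4 (raw counts from the memo titles).
words = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # assumed number of retained dimensions
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k reconstruction


def r(M, a, b):
    """Pearson correlation between the rows of M for words a and b."""
    return np.corrcoef(M[words.index(a)], M[words.index(b)])[0, 1]


for a, b in [("human", "user"), ("human", "minors")]:
    print(f"r({a}, {b}): before {r(X, a, b):+.2f}, after {r(X_k, a, b):+.2f}")
```

The printed values can be compared with the Before and After columns above; differences would indicate that the deck used a different weighting or number of dimensions.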

Similarity Measures
●   Dot product:  $x \cdot y = \sum_{i=1}^{N} x_i y_i$
●   Cosine:  $\cos(\theta_{xy}) = \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$
●   Euclidean:  $\mathrm{euclid}(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$
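The three measures translate directly into plain Python. This is a minimal sketch for dense vectors given as lists; the function names are chosen here for illustration.

```python
from math import sqrt


def dot(x, y):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between x and y (dot product over the norms)."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))


def euclid(x, y):
    """Euclidean distance between x and y."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x, y = [3.0, 4.0], [4.0, 3.0]
print(dot(x, y))     # 24.0
print(cosine(x, y))  # 0.96
print(euclid(x, y))  # square root of 2
```

Cosine ignores vector length, so it is a common choice when documents of very different sizes are compared.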




Parallel spaces
●   DBpedia
    ●   Structured
    ●   Well-connected to the rest of the semantic web
    ●   One-to-one mappings
●   Wikipedia
    ●   Plain text
    ●   Representative of human knowledge and interest
    ●   Pageviews reflect how present a concept is in the average human mind


DBpedia-Wikipedia corpus
●   Currently 4M concepts. We used the most
    central 1M
    ●   Must have more than 100 words after applying the stoplist
    ●   More than 5 incoming and outgoing links
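The two filters can be sketched as a simple predicate. Everything beyond the two thresholds is an assumption: the stoplist, the whitespace tokenizer, and reading "more than 5 incoming and outgoing links" as more than 5 of each. How centrality was ranked to reach 1M concepts is not stated and is not modeled here.

```python
STOPLIST = {"the", "of", "and", "a", "in", "to"}  # assumed stoplist


def keep_concept(text, n_incoming, n_outgoing, min_words=100, min_links=5):
    """Return True if a concept page passes both corpus filters."""
    content_words = [w for w in text.lower().split() if w not in STOPLIST]
    return (len(content_words) > min_words
            and n_incoming > min_links
            and n_outgoing > min_links)


long_text = " ".join(["semantics"] * 150)   # synthetic page body
print(keep_concept(long_text, n_incoming=12, n_outgoing=8))    # True
print(keep_concept(long_text, n_incoming=3, n_outgoing=8))     # False
print(keep_concept("too short", n_incoming=12, n_outgoing=8))  # False
```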




How to use statistical semantics to
 convert literals into resources


        Any literal can have a vector

Computing nearest neighbors will find similar
               resources
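A brute-force nearest-neighbor lookup by cosine similarity might look like the sketch below. The three-dimensional vectors are toy values invented for illustration, and `Graph_minor` is a hypothetical concept name; a real space would hold one high-dimensional vector per DBpedia concept.

```python
import heapq
from math import sqrt


def cosine(x, y):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return num / den


def nearest(query, space, k=3):
    """Return the names of the k concepts most similar to the query vector."""
    return heapq.nlargest(k, space, key=lambda name: cosine(query, space[name]))


# Toy concept space (made-up vectors).
space = {
    "Google_Apps": [0.9, 0.1, 0.0],
    "Graph_minor": [0.0, 0.2, 0.9],
    "IGoogle":     [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]   # vector computed for some literal

print(nearest(query, space, k=2))  # ['Google_Apps', 'IGoogle']
```

The scan is linear in the number of concepts, and each comparison is independent of the others, so the work splits naturally across processes.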




Random indexing
●   Same dimension reduction without SVD
●   For each context, assign a random vector
    (the number of nonzero seed values is a free parameter).
●   A word vector is the average of the vectors of all
    contexts the word appears in
●   A new doc vector (e.g., a query) is the average
    of the vectors for the words it contains
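The steps above can be sketched in plain Python. This is an illustrative reading, not the authors' implementation: the sparse +1/-1 index vectors, the default of 10 nonzero values, and counting each context once per word are assumptions; the three contexts are memo titles from earlier in the deck.

```python
import random


def index_vector(rng, dim, nonzero):
    """Sparse random vector: `nonzero` positions set to +1 or -1, rest 0."""
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def average(vectors, dim):
    """Element-wise mean of a list of vectors (zero vector if empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(contexts, dim=1000, nonzero=10, seed=0):
    """Give each context a random vector; average them per word."""
    rng = random.Random(seed)
    seen = {}
    for words in contexts.values():
        context_vector = index_vector(rng, dim, nonzero)
        for word in set(words):  # assumption: each context counts once
            seen.setdefault(word, []).append(context_vector)
    return {word: average(vecs, dim) for word, vecs in seen.items()}


def doc_vector(words, word_vectors, dim=1000):
    """Vector for new text: average of the vectors of its known words."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    return average(known, dim)


contexts = {
    "c1": "human machine interface for abc computer applications".split(),
    "c3": "the eps user interface management system".split(),
    "m2": "the intersection graph of paths in trees".split(),
}
word_vectors = train(contexts)
query = doc_vector("user interface".split(), word_vectors)
print(len(query))  # 1000
```

No matrix is ever factorized: adding a new context only touches the vectors of the words it contains.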


Training




Generating the Meaningful, Unique
         Identifier (MUID)
●   Each literal gets a 1000-dimensional vector.
    This vector 'captures the meaning' of the text
●   Too long to be passed around in RDF. MD5
    hashing compacts it



    @prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
    mpib:c97169cadaadbba92afbc2895b9eb9f
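A minimal sketch of the hashing step, using Python's standard `hashlib`. How the vector is serialized before hashing is not stated on the slide; packing the components as little-endian 64-bit floats is an assumption made here, and the `muid` function name is hypothetical.

```python
import hashlib
import struct


def muid(vector, prefix="mpib"):
    """Hash a vector into a compact identifier under the given prefix."""
    # Assumption: serialize components as little-endian 64-bit floats.
    payload = struct.pack(f"<{len(vector)}d", *vector)
    digest = hashlib.md5(payload).hexdigest()  # 32 hexadecimal characters
    return f"{prefix}:{digest}"


vector = [0.0] * 1000      # stand-in for a 1000-dimensional literal vector
vector[3], vector[42] = 0.5, -0.25

print(muid(vector))
```

The same vector always produces the same identifier, so two parties holding the same space can derive matching IDs independently.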




Example results. Taking any page and getting the
            closest dbpedia concepts
results for the search 'http://www.google.de':
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dbpedia: <http://en.wikipedia.org/wiki#> .

mpib:c97169cadaadbba92afbc2895b9eb9f skos:related <http://en.wikipedia.org/wiki/Google_Alerts> .
mpib:8482e762cceb5d7636529cccf1c825 skos:related <http://en.wikipedia.org/wiki/Google_Apps> .
mpib:278c93125941f38c18dfe67591c94a5 skos:related <http://en.wikipedia.org/wiki/Googlepedia> .
mpib:2885141b46cd2fdc3c447bcfa18b73 skos:related <http://en.wikipedia.org/wiki/IGoogle> .
mpib:2959b4e35ca423f34a47b8fce196cf skos:related <http://en.wikipedia.org/wiki/List_of_Google_products> .




Example results




Problems
●   A nearest-neighbor search on the current space takes 2
    minutes. Fortunately, it's easily parallelizable


●   Vectors depend on the corpus. Two Wikipedia
    versions from different years may yield slightly
    different vectors


●   Selecting the most relevant concepts on Wikipedia is
    an extra source of free parameters

Advantages
●   We can now use any text as subject. We can say that an essay is a
    review, or that a particular paragraph is insightful


●   Works at different granularity levels, from single word to entire books


●   We could use this to disambiguate text


●   It may reduce graph search time by connecting dangling nodes to central
    parts of the graph. Whether this is a good idea is an open question




Future work
●   Merge meaningful ID generation and
    compression into a single step


●   Improve nearest neighbors time


●   Apply it in a realistic use case scenario



What's in an identifier?
         Uniform Resource Identifier (URI)
 Scheme ":" ["//" authority "/"] [path] [ "?" query ]
                  [ "#" fragment]


      Meaningful, unique identifier (MUID)
@prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
       mpib:c97169cadaadbba92afbc2895b9eb9f


Random indexing spaces for bridging the Human and Data Webs
                    Jose Quesada, quesada@gmail.com

       Max Planck Institute, Adaptive Behavior and Cognition, Berlin




Irmles2010 Random indexing spaces to bridge the human and data webs

  • 1. Random indexing spaces for bridging the Human and Data Webs. Jose Quesada, Ralph Brandao-Vidal, Lael Schooler. Max Planck Institute, Adaptive Behavior and Cognition, Berlin. Jose Quesada: Random indexing spaces for bridging the Human and Data Webs
  • 2. Introduction. Most of the existing knowledge on the Web is in plain, unstructured text. The problem we aim to solve in this paper is simply converting literals into resources.
  • 3. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam vulputate ipsum ac erat cursus et adipiscing diam pulvinar. In at ultricies odio. Donec sodales enim euismod nulla pulvinar et elementum velit congue. Cras ac quam ante, non facilisis massa. mpib:c97169cadaadbba92afbc2895b9eb9f: unique, meaningful ID (MUID)
  • 4. What's the 'human web'?
  • 5. What's the 'data web'?
  • 6. Ontotext's linked data semantic repository (LDSR)
  • 7. Resources vs Literals
      ● Resource: The first explicit definition of resource is found in RFC 2396, which states: A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.
      ● Literals: Literals are values that do not have a unique identifier. They are usually a string that contains some human-readable text, for example names, dates and other types of values about a subject. In the previous example, the string ‘Fido’ is a literal. They optionally have a language (e.g., English, Japanese) or a type (e.g., integer, Boolean, string), but this is about all that can be said about literals. They cannot have properties like resources. Unlike resources, literals cannot link to the rest of the graph. They are second-class citizens on the Semantic Web. In terms of graphs, literals are one-way streets: since they cannot be the subject of a triple, there can be no outgoing links to other nodes.
  • 8. What's in an identifier? ● Uniform Resource Identifier (URI): Scheme ":" ["//" authority "/"] [path] [ "?" query ] [ "#" fragment]
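As a small illustration of the generic URI syntax on this slide, Python's standard `urllib.parse.urlsplit` breaks an identifier into those components. The URI below joins the `mpib` namespace and the sample ID that appear elsewhere in these slides; treating them as one full URI is an assumption made only for this example.

```python
from urllib.parse import urlsplit

# Assumed example URI: the mpib namespace from the slides with the sample ID as fragment.
uri = "http://mpi-ldsr.ontotext.com/mpib#c97169cadaadbba92afbc2895b9eb9f"

parts = urlsplit(uri)

# Map the parser's field names onto the component names used on the slide.
components = {
    "scheme": parts.scheme,
    "authority": parts.netloc,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} = {value!r}")
```

The query component is empty here, which matches the optional brackets around it in the syntax.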
  • 9. Why turning literals into resources is useful
      ● Increased integration of the human and data Webs
      ● Dangling nodes prevent us from applying some machine learning techniques. Number of URIs: 126,875,974. Number of literals: 227,758,535. Total number of entities: 354,635,159.
  • 10. We will use statistical semantics to generate a vector for any literal
      ● This vector can be used to uniquely identify a literal; it makes it operationally equivalent to a resource
  • 11. Attaching new resources to the center of the graph
  • 12. Statistical semantics
  • 13. Statistical semantics: exploits statistical patterns of human word usage to figure out word meaning
      Models:
      ● LSA (Landauer)
      ● Topic Models (Griffiths)
      ● BEAGLE (Jones)
      ● HAL (Burgess)
      ● Random indexing (Sahlgren)
      ● SP (Dennis)
      Properties:
      ● Completely unsupervised
      ● Scale better than, say, neural networks
      ● Most require linear algebra operations on large sparse matrices
      ● Computationally expensive
  • 14. Example of text data: Titles of Some Technical Memos
      ● c1: Human machine interface for ABC computer applications
      ● c2: A survey of user opinion of computer system response time
      ● c3: The EPS user interface management system
      ● c4: System and human system engineering testing of EPS
      ● c5: Relation of user perceived response time to error measurement
      ● m1: The generation of random, binary, ordered trees
      ● m2: The intersection graph of paths in trees
      ● m3: Graph minors IV: Widths of trees and well-quasi-ordering
      ● m4: Graph minors: A survey
  • 15. Matrix of words by contexts
  • 16-22. Singular value decomposition of the words-by-contexts matrix (figure axes: Contexts; Words (states))
  • 23. Correlations between word vectors, before and after the SVD dimension reduction
      ● r(human, user): before = -.38, after = .94
      ● r(human, minors): before = -.28, after = -.83
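The before/after comparison can be sketched with NumPy on the nine memo titles. This is a minimal sketch under assumptions the slides do not state: the matrix keeps only words that occur in at least two titles, entries are raw counts with no weighting, and the reduced space keeps k = 2 dimensions. The helper `row_corr` is a hypothetical name.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Rows follow `terms`; columns are the titles c1..c5, m1..m4 (raw counts).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)


def row_corr(matrix, word_a, word_b):
    """Pearson correlation between the rows of two words (hypothetical helper)."""
    i, j = terms.index(word_a), terms.index(word_b)
    return float(np.corrcoef(matrix[i], matrix[j])[0, 1])


# Truncated SVD: keep the k largest singular values and rebuild the matrix.
k = 2  # assumed number of retained dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

pairs = [("human", "user"), ("human", "minors")]
before = {pair: row_corr(X, *pair) for pair in pairs}
after = {pair: row_corr(X_k, *pair) for pair in pairs}

for pair in pairs:
    print(pair, f"before={before[pair]:.2f}", f"after={after[pair]:.2f}")
```

In the raw counts "human" and "user" never share a title, so their correlation is negative; in the reduced space they become strongly positive because both co-occur with the same other words.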
  • 24. Similarity Measures
      ● Dot product: $x \cdot y = \sum_{i=1}^{N} x_i y_i$
      ● Cosine: $\cos(\theta_{xy}) = \frac{x \cdot y}{\|x\| \, \|y\|}$
      ● Euclidean: $\mathrm{euclid}(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$
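The three measures translate directly into a few lines of standard-library Python. The function names below are illustrative, not taken from the slides.

```python
import math
from typing import Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Dot product: sum of element-wise products."""
    return sum(xi * yi for xi, yi in zip(x, y))


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between x and y; undefined for zero vectors."""
    norm = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    if norm == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot(x, y) / norm


def euclid(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


x = [1.0, 2.0, 2.0]
y = [2.0, 4.0, 4.0]  # same direction as x, twice the length

print(dot(x, y), cosine(x, y), euclid(x, y))
```

Because `y` is a scaled copy of `x`, the cosine treats the two as identical in direction while the Euclidean distance still separates them, which is one reason cosine is a common choice for comparing text vectors of different lengths.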
  • 25. Parallel spaces
      ● DBpedia: structured; well-connected to the rest of the semantic web
      ● Wikipedia: plain text; representative of human knowledge and interest; pageviews reflect how present a concept is in the average human mind
      ● One-to-one mappings between the two
  • 26. DBpedia-Wikipedia corpus
      ● Currently 4M concepts. We used the most central 1M
      ● A concept has to have > 100 words after applying the stoplist
      ● More than 5 incoming and outgoing links
  • 27. How to use statistical semantics to convert literals into resources
      ● Any literal can have a vector
      ● Computing nearest neighbors will find similar resources
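A brute-force nearest-neighbor lookup might look like the sketch below. The three-dimensional vectors and the `ex:` resource names are made up for illustration; the real space described in these slides is far larger, and cosine similarity is an assumed choice of measure.

```python
import heapq
import math


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def nearest_resources(literal_vec, resource_vecs, k=2):
    """Return the k (similarity, name) pairs most similar to the literal's vector."""
    scored = ((cosine(literal_vec, vec), name) for name, vec in resource_vecs.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy space: resource names and coordinates are placeholders.
resource_vecs = {
    "ex:search_engine": [0.9, 0.1, 0.0],
    "ex:graph_theory": [0.0, 0.2, 0.9],
    "ex:web_portal": [0.8, 0.3, 0.1],
}
literal_vec = [1.0, 0.2, 0.0]  # vector assumed to come from some literal's text

top = nearest_resources(literal_vec, resource_vecs, k=2)
for score, name in top:
    print(f"{name}  {score:.3f}")
```

Each comparison is independent of the others, so a scan like this can be split across workers by partitioning `resource_vecs`.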
  • 28. Random indexing
      ● Same dimension reduction without SVD
      ● For each context, assign a random vector (the number of nonzero seed values is a free parameter)
      ● A word vector is the average of the vectors of all contexts the word appears in
      ● A new doc vector (e.g., a query) is the average of the vectors for the words it contains
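The steps on this slide can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the sparse +1/-1 index vectors, the values `dim=1000` and `nonzero=10`, the simple tokenizer, the absence of a stoplist, and counting each context once per word are all assumptions, and the function names are hypothetical.

```python
import random
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic tokens (no stoplist in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def index_vector(dim, nonzero, rng):
    """Sparse random vector: `nonzero` positions set to +1 or -1, the rest 0."""
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def mean(vectors, dim):
    """Element-wise average; a zero vector if there is nothing to average."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(contexts, dim=1000, nonzero=10, seed=0):
    """Word vector = average of the index vectors of the contexts it appears in."""
    rng = random.Random(seed)
    seen = defaultdict(list)
    for text in contexts:
        ctx_vec = index_vector(dim, nonzero, rng)
        for word in set(tokenize(text)):  # each context counted once per word
            seen[word].append(ctx_vec)
    return {word: mean(vecs, dim) for word, vecs in seen.items()}


def doc_vector(text, word_vecs, dim=1000):
    """New document (e.g. a query) = average of the vectors of its known words."""
    known = [word_vecs[w] for w in tokenize(text) if w in word_vecs]
    return mean(known, dim)


titles = [
    "Human machine interface for ABC computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random, binary, ordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV: Widths of trees and well-quasi-ordering",
    "Graph minors: A survey",
]

word_vecs = train(titles)
query_vec = doc_vector("human computer interaction", word_vecs)
print(len(word_vecs), "words,", len(query_vec), "dimensions per vector")
```

No matrix factorization is involved: adding a new context only requires drawing one more random vector and updating the averages of the words it contains.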
  • 29. Training
  • 30. Generating the Meaningful, Unique Identifier (MUID)
      ● Each literal gets a 1000-dimensional vector. This vector 'captures the meaning' of the text
      ● Too long to be passed around in RDF. MD5 hashing compacts it
      @prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
      mpib:c97169cadaadbba92afbc2895b9eb9f
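A possible way to compact a vector into such an identifier is shown below. The slides only say that MD5 hashing is used; how the vector is serialized before hashing is not stated, so the big-endian 64-bit float encoding and the function name `muid` are assumptions.

```python
import hashlib
import struct

MPIB_NS = "http://mpi-ldsr.ontotext.com/mpib#"  # namespace shown on the slide


def muid(vector):
    """Hash a vector to a hex digest (assumed serialization: big-endian doubles)."""
    payload = struct.pack(f">{len(vector)}d", *vector)
    return hashlib.md5(payload).hexdigest()


# Toy 1000-dimensional vector standing in for a literal's semantic vector.
vec = [0.0] * 1000
vec[3] = 1.0
vec[42] = -1.0

ident = muid(vec)
print(f"mpib:{ident}")
print(f"<{MPIB_NS}{ident}>")
```

The digest is a compact key for the vector: identical vectors always map to the same identifier, while any similarity comparison still has to be done on the vectors themselves.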
  • 31. Example results. Taking any page and getting the closest DBpedia concepts. Results for the search 'http://www.google.de':
      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
      @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
      @prefix dbpedia: <http://en.wikipedia.org/wiki#> .
      mpib:c97169cadaadbba92afbc2895b9eb9f skos:related <http://en.wikipedia.org/wiki/Google_Alerts> .
      mpib:8482e762cceb5d7636529cccf1c825 skos:related <http://en.wikipedia.org/wiki/Google_Apps> .
      mpib:278c93125941f38c18dfe67591c94a5 skos:related <http://en.wikipedia.org/wiki/Googlepedia> .
      mpib:2885141b46cd2fdc3c447bcfa18b73 skos:related <http://en.wikipedia.org/wiki/IGoogle> .
      mpib:2959b4e35ca423f34a47b8fce196cf skos:related <http://en.wikipedia.org/wiki/List_of_Google_products> .
  • 32. Example results
  • 33. Problems
      ● A nearest-neighbor search on the current space takes 2 minutes. Fortunately, it's easily parallelizable
      ● Vectors depend on the corpus. Two Wikipedia versions from different years may yield slightly different vectors
      ● Selecting the most relevant concepts on Wikipedia is an extra source of free parameters
  • 34. Advantages
      ● We can now use any text as a subject. We can say that an essay is a review, or that a particular paragraph is insightful
      ● Works at different granularity levels, from single words to entire books
      ● We could use this to disambiguate text
      ● It may reduce graph search time by connecting dangling nodes to central parts of the graph. Whether this is a good idea is an open question
  • 35. Future work
      ● Merge meaningful ID generation and compression into a single step
      ● Improve nearest-neighbor search time
      ● Apply it in a realistic use case
  • 36. What's in an identifier?
      ● Uniform Resource Identifier (URI): Scheme ":" ["//" authority "/"] [path] [ "?" query ] [ "#" fragment]
      ● Meaningful, unique identifier (MUID):
      @prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .
      mpib:c97169cadaadbba92afbc2895b9eb9f
  • 37. Random indexing spaces for bridging the Human and Data Webs. Jose Quesada, quesada@gmail.com. Max Planck Institute, Adaptive Behavior and Cognition, Berlin