Digital Enterprise Research Institute                                                   www.deri.ie




          Explicit vs. Latent Concept Models for Cross-
                 Language Information Retrieval

                                                  Nitish Aggarwal
                                                 DERI, NUI Galway
                                           firstname.lastname@deri.org




 Tuesday, 26th June, 2012
 Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
 DERI, Reading Group
                                                                  Enabling Networked Knowledge
Based On:




            Title:
                   “Explicit vs. Latent Concept Models for Cross-Language
                    Information Retrieval”

            Authors:
                   Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg,
                    Steffen Staab

            Published:
                    International Joint Conference on Artificial Intelligence, 2009




Overview




            Introduction
                   Cross lingual information retrieval (CLIR)
            Concept Model
                   Explicit Semantics
                   Latent Semantics
            Evaluation
            Conclusion




Introduction: CLIR




            Cross Lingual Information Retrieval
                   Many documents, web sites
                    are written in different languages


                   Retrieve all information without
                    a language barrier


                   Query and documents are in different
                     languages




Introduction: CLIR




            CLIR based on Machine Translation
                   Translation of queries or documents
                   Reduces the problem to monolingual retrieval
                       – Issues:
                               – MT is not available for all language pairs
                               – Increases vocabulary mismatch




Introduction: CLIR




      Interlingua or concept-based
            Use a language-independent representation
                – Map all queries and documents in different languages to a concept space
                – Define a concept space and a relevance function




                                        Language independent
                                           representation



Concept Model

            Document in concept space
                   Di = {t1, t2, t3, …, tn}
                   ti in concept space
                       – Association with every concept
                   Composite semantics of all tokens
                       – Σti, Πti


            Types of concept model
                   Explicit
                   Latent/implicit


            Figure: token ti positioned in a concept space with axes C1, C2 and C3


Concept Model: Explicit

            Intuition: define concepts from external resources
                   Definition of concepts
                       – Wikipedia articles, tagged web pages
                   Cover a broad range of vocabulary and language
            Example
                   Wikipedia based Explicit semantic analysis (ESA)




Concept Model: ESA

            Explicit concept space
                   Di = {t1, t2, t3, …, tn}
                   ti = w1a1 + w2a2 + … + wnan
                   Composite semantics of all tokens
                       – Σti


            Figure: query and docs plotted in a concept space with axes University, Student and Education
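
The explicit concept space above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy "articles" in `CONCEPTS` stand in for Wikipedia articles, the smoothed tf-idf weighting is one plausible choice for the weights wj, and all function names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy "concept" articles standing in for Wikipedia articles.
CONCEPTS = {
    "University": "university campus professor lecture degree student",
    "Student": "student exam homework lecture school",
    "Education": "education school teaching learning degree",
}

def tokenize(text):
    return text.lower().split()

def build_index(concepts):
    """Inverted index: term -> {concept: tf-idf weight} (assumed weighting)."""
    n = len(concepts)
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    index = {}
    for name, counts in tf.items():
        for term, freq in counts.items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed idf
            index.setdefault(term, {})[name] = freq * idf
    return index

def esa_vector(text, index):
    """Composite semantics of all tokens: the sum of the token concept vectors."""
    vec = Counter()
    for term in tokenize(text):
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return dict(vec)

def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

INDEX = build_index(CONCEPTS)
query_vec = esa_vector("professor lecture", INDEX)
doc_vec = esa_vector("campus degree", INDEX)
similarity = cosine(query_vec, doc_vec)
```

The query and the document share no words, yet their similarity is above zero because both map onto the University concept.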


Cross lingual - ESA

            Extension of ESA
                   Use Wikipedia cross language links
                   Linked articles define same concepts in different languages

            Figure: one inverted index per language (EN, DE, ES); each word (Word1 … Wordn) maps to a
             weighted concept vector w1*URI1 + w2*URI2 + … + wn*URIn over the shared concept URIs.
             Term@en and Term@de vectors are compared by vector cosine to obtain semantic relatedness.
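
A minimal sketch of the cross-lingual extension, assuming two tiny aligned article sets in place of Wikipedia and raw term frequency in place of a real weighting scheme. The URIs, texts and function names are hypothetical. Because both languages index the same concept URIs, the resulting vectors are directly comparable.

```python
import math
from collections import Counter

# Hypothetical aligned articles: each concept URI has one text per language,
# standing in for Wikipedia articles joined by cross-language links.
ARTICLES = {
    "en": {"URI1": "university student lecture", "URI2": "dog cat animal"},
    "de": {"URI1": "hochschule student vorlesung", "URI2": "hund katze tier"},
}

def build_index(articles):
    """Inverted index for one language: term -> {concept URI: weight}."""
    index = {}
    for uri, text in articles.items():
        for term, freq in Counter(text.lower().split()).items():
            index.setdefault(term, {})[uri] = float(freq)  # raw tf, a simplification
    return index

INDEXES = {lang: build_index(arts) for lang, arts in ARTICLES.items()}

def concept_vector(text, lang):
    """Map text in the given language onto the shared concept URIs."""
    vec = Counter()
    for term in text.lower().split():
        for uri, weight in INDEXES[lang].get(term, {}).items():
            vec[uri] += weight
    return dict(vec)

def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

en_query = concept_vector("university lecture", "en")
de_related = concept_vector("vorlesung hochschule", "de")
de_unrelated = concept_vector("hund katze", "de")
```

Here the English query matches the German text about lectures and not the one about pets, although no translation step is involved.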




Concept Model: Latent

            Intuition: semantic space of latent concepts
                   Definition of latent concepts
                       – A cluster of similar things defines a latent concept


                               Latent Concept 1 (Food)              Latent Concept 2 (Animals)
                                    30% broccoli                         20% chinchillas
                                    15% bananas                          20% kittens
                                    10% breakfast                        20% cute
                                    10% munching                         15% hamster


                                 "Look at this cute hamster munching on a piece of broccoli"
                                    (40% Latent Concept 1 and 60% Latent Concept 2)




Concept Model: Latent




            Figure: latent concepts LC1, LC2 and LC3 are derived from a training corpus;
             query and docs are then represented in the space spanned by LC1, LC2 and LC3




Latent Semantic Analysis (LSA)

            Definition
                   Dimensionality reduction to find latent concepts
            Approach
                   Build term-document matrix M
                   Perform singular value decomposition (SVD) on M: M = U Σ V^T


                   Approximate M by keeping the top N singular values
                       – N singular values reflect N different latent concepts
                       – U defines term-concept correlation
                       – V defines document-concept correlation
            Cross Lingual-LSA
                   Use parallel corpus
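
The LSA steps above can be sketched with NumPy. The toy term-document matrix, the choice N = 2 and the `fold_in` helper (the usual projection of a new term vector into the latent space) are assumptions made for illustration only. For the cross-lingual variant, one common setup is to build each column of M from a document together with its translation, so that terms of both languages share the same latent concepts.

```python
import numpy as np

# Hypothetical toy term-document matrix M (rows: terms, columns: documents).
terms = ["broccoli", "banana", "kitten", "hamster"]
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 2.0],
    [0.0, 0.0, 2.0, 4.0],
])

N = 2  # number of latent concepts to keep

# M = U @ diag(s) @ Vt, with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_N, s_N, Vt_N = U[:, :N], s[:N], Vt[:N, :]

# Rank-N approximation of M from the top N singular values.
M_N = U_N @ np.diag(s_N) @ Vt_N

def fold_in(term_vector):
    """Project a new document or query (term counts) into the latent space."""
    return (term_vector @ U_N) / s_N

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = fold_in(np.array([1.0, 0.0, 0.0, 0.0]))     # mentions only "broccoli"
doc_food = fold_in(np.array([0.0, 1.0, 0.0, 0.0]))  # mentions only "banana"
doc_pets = fold_in(np.array([0.0, 0.0, 1.0, 0.0]))  # mentions only "kitten"
```

In this toy space the "broccoli" query lands on the same latent concept as the "banana" document, even though the two share no term.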


Latent Dirichlet Allocation (LDA)


            Definition
                   Generative model
                       – Each document is a mixture of latent concepts (topics)
                       – Topics generate the words of a document; the parameters are learned from the corpus


            Approach
                   Topic distribution is assumed to have a Dirichlet prior
                   Fit corpus-level and document-level properties using a variational
                    Expectation Maximization (EM) procedure


            Cross-lingual-LDA
                   Use parallel corpus
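
The generative story behind LDA can be illustrated with the standard library alone. This sketch only samples a document from fixed, hypothetical topic-word distributions; it does not perform the variational EM fitting mentioned above, and the names `TOPICS`, `sample_dirichlet` and `generate_document` are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topic-word distributions, loosely following a food/animals split.
TOPICS = {
    "food": {"broccoli": 0.5, "bananas": 0.3, "breakfast": 0.2},
    "animals": {"kittens": 0.4, "hamster": 0.4, "cute": 0.2},
}

def generate_document(length, alpha, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    names = list(TOPICS)
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=theta)[0]  # topic for this position
        vocab = list(TOPICS[topic])
        weights = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])  # word from that topic
    return theta, words

rng = random.Random(0)
theta, words = generate_document(length=8, alpha=[0.5, 0.5], rng=rng)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture `theta`.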



Evaluation




            Parallel corpora
                   All documents are translated into many languages


            Relevance assessment
                   Use documents in one language as query to retrieve documents
                    of other language
                   Translated document = relevant document
                       – No manual relevance assessment is needed


            Measures used
                   Mean reciprocal rank (MRR)
                   Average score over all language pairs
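
As a small worked example of the measure, the sketch below computes MRR for the mate-retrieval setting described above, where each query document has exactly one relevant document (its translation). The document identifiers and rankings are invented for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, relevant):
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(rankings[q], relevant[q]) for q in rankings]
    return sum(scores) / len(scores)

# Hypothetical English queries with ranked German results.
rankings = {
    "en_1": ["de_1", "de_7", "de_3"],  # translation found at rank 1
    "en_2": ["de_5", "de_2", "de_9"],  # translation found at rank 2
    "en_3": ["de_4", "de_8", "de_6"],  # translation not retrieved
}
relevant = {"en_1": "de_1", "en_2": "de_2", "en_3": "de_3"}

mrr = mean_reciprocal_rank(rankings, relevant)  # (1 + 1/2 + 0) / 3 = 0.5
```

The final score reported per model would then be the average of such MRR values over all language pairs.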

Evaluation: Datasets


            Multilingual corpora
                   Multext Corpus
                       – 3,066 Q/A pairs from the Official Journal of the European Community
                   JRC-Acquis Corpus
                       – 21,000 legislative documents of the European Union
                       – We randomly selected 3,000 documents as queries



            Set up
                   English, German and French documents were used
                   Split dataset for latent topic extraction
                       – 60% learning, 40% testing




Evaluation: Datasets




            Wikipedia
                   Snapshot
                       – 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
                       – Collection of 166,484 articles



                   CL-ESA: Use cross-language links for concepts in different
                    languages


                   LSA/LDA: Wikipedia as parallel corpus
                       – Use it as training corpus for latent concept extraction




Evaluation: Parameter




            Cross-lingual ESA
                   Problem
                       – Too many concepts
                   Solution
                       – Only use the m highest values


            LSA/LDA
                   Problem
                       – Computational costs increase with number of topics
                   Solution
                       – Use fixed number of latent topics
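
The truncation used for cross-lingual ESA can be sketched as follows, assuming a concept vector is stored as a sparse mapping from concept URI to weight. The function name and sample values are hypothetical.

```python
import heapq

def keep_top_m(concept_vector, m):
    """Keep only the m highest-weighted concepts of a sparse concept vector."""
    top = heapq.nlargest(m, concept_vector.items(), key=lambda item: item[1])
    return dict(top)

# Hypothetical concept vector: concept URI -> association weight.
vector = {"URI1": 0.9, "URI2": 0.05, "URI3": 0.4, "URI4": 0.01}
pruned = keep_top_m(vector, m=2)
```

Dropping the low-weight tail keeps vectors short, which reduces the cost of each cosine comparison.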




Evaluation: Results



            Multext Dataset




Evaluation: Results



            JRC-Acquis Dataset




Conclusion



            Parameter tuning
                   ESA performs well for m = 10,000
                   A maximum of 500 topics was tested for LSA
                       – Not maximal performance, but seems to converge


            Results
                   LSA performs better than LDA
                   Comparable results of CL-ESA and LSA
                       – Explicit vs. implicit
                   The explicit model performs better than the latent models




                                                             Enabling Networked Knowledge

More Related Content

What's hot

Simulation based Performance Analysis of Histogram Shifting Method on Various...
Simulation based Performance Analysis of Histogram Shifting Method on Various...Simulation based Performance Analysis of Histogram Shifting Method on Various...
Simulation based Performance Analysis of Histogram Shifting Method on Various...ijtsrd
 
Rethinking Microblogging: Open Distributed Semantic
Rethinking Microblogging: Open Distributed SemanticRethinking Microblogging: Open Distributed Semantic
Rethinking Microblogging: Open Distributed SemanticAlexandre Passant
 
A study of image fingerprinting by using visual cryptography
A study of image fingerprinting by using visual cryptographyA study of image fingerprinting by using visual cryptography
A study of image fingerprinting by using visual cryptographyAlexander Decker
 
Mist2012 panel discussion-ruo ando
Mist2012 panel discussion-ruo andoMist2012 panel discussion-ruo ando
Mist2012 panel discussion-ruo andoRuo Ando
 
Transitioning web application frameworks towards the Semantic Web (master the...
Transitioning web application frameworks towards the Semantic Web (master the...Transitioning web application frameworks towards the Semantic Web (master the...
Transitioning web application frameworks towards the Semantic Web (master the...Benjamin Heitmann
 
Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...
Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...
Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...CSCJournals
 
Knowledge-based generation of educational web pages
Knowledge-based generation of educational web pagesKnowledge-based generation of educational web pages
Knowledge-based generation of educational web pagesStefan Trausan-Matu
 
Enrichment of News Show Videos with Multimodal Semi-Automatic Analysis
Enrichment of News Show Videos with Multimodal Semi-Automatic AnalysisEnrichment of News Show Videos with Multimodal Semi-Automatic Analysis
Enrichment of News Show Videos with Multimodal Semi-Automatic AnalysisLinkedTV
 
The learner voice: students' use and experience of technologies
The learner voice: students' use and experience of technologiesThe learner voice: students' use and experience of technologies
The learner voice: students' use and experience of technologiesgrainne
 
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...Benjamin Heitmann
 
Lessons and requirements from a decade of deployed Semantic Web apps
Lessons and requirements from a decade of deployed Semantic Web appsLessons and requirements from a decade of deployed Semantic Web apps
Lessons and requirements from a decade of deployed Semantic Web appsBenjamin Heitmann
 
Federating Distributed Social Data to Build an Interlinked Online Information...
Federating Distributed Social Data to Build an Interlinked Online Information...Federating Distributed Social Data to Build an Interlinked Online Information...
Federating Distributed Social Data to Build an Interlinked Online Information...Alexandre Passant
 
Kbms knowledge
Kbms knowledgeKbms knowledge
Kbms knowledgeokeee
 
TEL Developments & Trends
TEL Developments & TrendsTEL Developments & Trends
TEL Developments & Trendstimku
 
The Future of Technology and Information
The Future of Technology and InformationThe Future of Technology and Information
The Future of Technology and InformationNick Finck
 
Issues of Information Semantics and Granularity in Cross-Media Publishing
Issues of Information Semantics and Granularity in Cross-Media PublishingIssues of Information Semantics and Granularity in Cross-Media Publishing
Issues of Information Semantics and Granularity in Cross-Media PublishingBeat Signer
 
Introduction to the IKS 7.0 Technology Stack
Introduction to the IKS 7.0 Technology StackIntroduction to the IKS 7.0 Technology Stack
Introduction to the IKS 7.0 Technology StackFabian Christ
 

What's hot (20)

Simulation based Performance Analysis of Histogram Shifting Method on Various...
Simulation based Performance Analysis of Histogram Shifting Method on Various...Simulation based Performance Analysis of Histogram Shifting Method on Various...
Simulation based Performance Analysis of Histogram Shifting Method on Various...
 
Rethinking Microblogging: Open Distributed Semantic
Rethinking Microblogging: Open Distributed SemanticRethinking Microblogging: Open Distributed Semantic
Rethinking Microblogging: Open Distributed Semantic
 
[IJET-V1I6P12] Authors: Manisha Bhagat, Komal Chavan ,Shriniwas Deshmukh
[IJET-V1I6P12] Authors: Manisha Bhagat, Komal Chavan ,Shriniwas Deshmukh[IJET-V1I6P12] Authors: Manisha Bhagat, Komal Chavan ,Shriniwas Deshmukh
[IJET-V1I6P12] Authors: Manisha Bhagat, Komal Chavan ,Shriniwas Deshmukh
 
A study of image fingerprinting by using visual cryptography
A study of image fingerprinting by using visual cryptographyA study of image fingerprinting by using visual cryptography
A study of image fingerprinting by using visual cryptography
 
Mist2012 panel discussion-ruo ando
Mist2012 panel discussion-ruo andoMist2012 panel discussion-ruo ando
Mist2012 panel discussion-ruo ando
 
Transitioning web application frameworks towards the Semantic Web (master the...
Transitioning web application frameworks towards the Semantic Web (master the...Transitioning web application frameworks towards the Semantic Web (master the...
Transitioning web application frameworks towards the Semantic Web (master the...
 
Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...
Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...
Quality - Security Uncompromised and Plausible Watermarking for Patent Infrin...
 
Knowledge-based generation of educational web pages
Knowledge-based generation of educational web pagesKnowledge-based generation of educational web pages
Knowledge-based generation of educational web pages
 
Enrichment of News Show Videos with Multimodal Semi-Automatic Analysis
Enrichment of News Show Videos with Multimodal Semi-Automatic AnalysisEnrichment of News Show Videos with Multimodal Semi-Automatic Analysis
Enrichment of News Show Videos with Multimodal Semi-Automatic Analysis
 
The learner voice: students' use and experience of technologies
The learner voice: students' use and experience of technologiesThe learner voice: students' use and experience of technologies
The learner voice: students' use and experience of technologies
 
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
 
Lessons and requirements from a decade of deployed Semantic Web apps
Lessons and requirements from a decade of deployed Semantic Web appsLessons and requirements from a decade of deployed Semantic Web apps
Lessons and requirements from a decade of deployed Semantic Web apps
 
Federating Distributed Social Data to Build an Interlinked Online Information...
Federating Distributed Social Data to Build an Interlinked Online Information...Federating Distributed Social Data to Build an Interlinked Online Information...
Federating Distributed Social Data to Build an Interlinked Online Information...
 
Kbms knowledge
Kbms knowledgeKbms knowledge
Kbms knowledge
 
TEL Developments & Trends
TEL Developments & TrendsTEL Developments & Trends
TEL Developments & Trends
 
185 189
185 189185 189
185 189
 
The Future of Technology and Information
The Future of Technology and InformationThe Future of Technology and Information
The Future of Technology and Information
 
Issues of Information Semantics and Granularity in Cross-Media Publishing
Issues of Information Semantics and Granularity in Cross-Media PublishingIssues of Information Semantics and Granularity in Cross-Media Publishing
Issues of Information Semantics and Granularity in Cross-Media Publishing
 
Introduction to the IKS 7.0 Technology Stack
Introduction to the IKS 7.0 Technology StackIntroduction to the IKS 7.0 Technology Stack
Introduction to the IKS 7.0 Technology Stack
 
1709 1715
1709 17151709 1715
1709 1715
 

Similar to Cross-Language Info Retrieval Models

Linked Open Data
Linked Open DataLinked Open Data
Linked Open DataDerilinx
 
Making sense out of disagreement, University of Limerick Interaction Design C...
Making sense out of disagreement, University of Limerick Interaction Design C...Making sense out of disagreement, University of Limerick Interaction Design C...
Making sense out of disagreement, University of Limerick Interaction Design C...jodischneider
 
Towards Social semantic journalism
Towards Social semantic journalismTowards Social semantic journalism
Towards Social semantic journalismBahareh Heravi
 
ICOM: A Framework for Integrated Collaborative Work Environments
ICOM: A Framework for Integrated Collaborative Work EnvironmentsICOM: A Framework for Integrated Collaborative Work Environments
ICOM: A Framework for Integrated Collaborative Work EnvironmentsLaura Dragan
 
System of Systems Information Interoperability using a Linked Dataspace
System of Systems Information Interoperability using a Linked DataspaceSystem of Systems Information Interoperability using a Linked Dataspace
System of Systems Information Interoperability using a Linked DataspaceEdward Curry
 
Swap2010 agave
Swap2010 agaveSwap2010 agave
Swap2010 agavejuanaya
 
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomes
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and OutcomesWikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomes
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomesjodischneider
 
Annotating Microblog Posts with Sensor Data for Emergency Reporting Applications
Annotating Microblog Posts with Sensor Data for Emergency Reporting ApplicationsAnnotating Microblog Posts with Sensor Data for Emergency Reporting Applications
Annotating Microblog Posts with Sensor Data for Emergency Reporting ApplicationsDavid Crowley
 
Envisioning a discussion dashboard for collective intelligence of web convers...
Envisioning a discussion dashboard for collective intelligence of web convers...Envisioning a discussion dashboard for collective intelligence of web convers...
Envisioning a discussion dashboard for collective intelligence of web convers...jodischneider
 
Hello Open World - Semtech 2009
Hello Open World - Semtech 2009Hello Open World - Semtech 2009
Hello Open World - Semtech 2009Alexandre Passant
 
Stefan Decker Keynote at CSHALS
Stefan Decker Keynote at CSHALSStefan Decker Keynote at CSHALS
Stefan Decker Keynote at CSHALSStefan Decker
 
Semantic Enterprise 2.0 - Enabling Semantic Web technologies in Enterprise 2...
Semantic Enterprise 2.0 - Enabling Semantic Web technologies in Enterprise 2...Semantic Enterprise 2.0 - Enabling Semantic Web technologies in Enterprise 2...
Semantic Enterprise 2.0 - Enabling Semantic Web technologies in Enterprise 2...Alexandre Passant
 
A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs fr...
A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs fr...A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs fr...
A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs fr...Andre Freitas
 
A distributional structured semantic space for querying rdf graph data
A distributional structured semantic space for querying rdf graph dataA distributional structured semantic space for querying rdf graph data
A distributional structured semantic space for querying rdf graph dataAndre Freitas
 
Manfred Linking the Real World
Manfred Linking the Real WorldManfred Linking the Real World
Manfred Linking the Real Worldsssw2012
 
VoID: Metadata for RDF Datasets
VoID: Metadata for RDF DatasetsVoID: Metadata for RDF Datasets
VoID: Metadata for RDF DatasetsRichard Cyganiak
 
Building Optimisation using Scenario Modeling and Linked Data
Building Optimisation using Scenario Modeling and Linked DataBuilding Optimisation using Scenario Modeling and Linked Data
Building Optimisation using Scenario Modeling and Linked DataEdward Curry
 
EDF2013: Keynote Stefan Decker: Big Data In Ireland - Linked Data and beyond
EDF2013: Keynote Stefan Decker: Big Data In Ireland - Linked Data and beyondEDF2013: Keynote Stefan Decker: Big Data In Ireland - Linked Data and beyond
EDF2013: Keynote Stefan Decker: Big Data In Ireland - Linked Data and beyondEuropean Data Forum
 
Self-service Linked Government Data
Self-service Linked Government DataSelf-service Linked Government Data
Self-service Linked Government DataFadi Maali
 

Similar to Cross-Language Info Retrieval Models (20)

Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Making sense out of disagreement, University of Limerick Interaction Design C...
Making sense out of disagreement, University of Limerick Interaction Design C...Making sense out of disagreement, University of Limerick Interaction Design C...
Making sense out of disagreement, University of Limerick Interaction Design C...
 
Towards Social semantic journalism
Towards Social semantic journalismTowards Social semantic journalism
Towards Social semantic journalism
 
ICOM: A Framework for Integrated Collaborative Work Environments
ICOM: A Framework for Integrated Collaborative Work EnvironmentsICOM: A Framework for Integrated Collaborative Work Environments
ICOM: A Framework for Integrated Collaborative Work Environments
 
System of Systems Information Interoperability using a Linked Dataspace
System of Systems Information Interoperability using a Linked DataspaceSystem of Systems Information Interoperability using a Linked Dataspace
System of Systems Information Interoperability using a Linked Dataspace
 
Swap2010 agave
Swap2010 agaveSwap2010 agave
Swap2010 agave
 
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomes
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and OutcomesWikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomes
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomes
 
Annotating Microblog Posts with Sensor Data for Emergency Reporting Applications
Annotating Microblog Posts with Sensor Data for Emergency Reporting ApplicationsAnnotating Microblog Posts with Sensor Data for Emergency Reporting Applications
Annotating Microblog Posts with Sensor Data for Emergency Reporting Applications
 
Envisioning a discussion dashboard for collective intelligence of web convers...
Envisioning a discussion dashboard for collective intelligence of web convers...Envisioning a discussion dashboard for collective intelligence of web convers...
Envisioning a discussion dashboard for collective intelligence of web convers...
 
Lgd 2
Lgd 2Lgd 2
Lgd 2
 
Hello Open World - Semtech 2009
Hello Open World - Semtech 2009Hello Open World - Semtech 2009
Cross-Language Info Retrieval Models

  • 1. Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
      Nitish Aggarwal, DERI, NUI Galway (firstname.lastname@deri.org)
      DERI Reading Group, Tuesday, 26th June, 2012
      Digital Enterprise Research Institute (www.deri.ie), Enabling Networked Knowledge
      Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
  • 2. Based On
      - Title: “Explicit vs. Latent Concept Models for Cross-Language Information Retrieval”
      - Authors: Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, Steffen Staab
      - Published: International Joint Conference on Artificial Intelligence, 2009
  • 3. Overview
      - Introduction
          - Cross-lingual information retrieval (CLIR)
      - Concept Model
          - Explicit Semantics
          - Latent Semantics
      - Evaluation
      - Conclusion
  • 4. Introduction: CLIR
      - Cross-Lingual Information Retrieval
          - Many documents and web sites are written in different languages
          - Retrieve all information without a language barrier
          - Query and documents are in different languages
  • 5. Introduction: CLIR
      - CLIR based on Machine Translation (MT)
          - Translate either the queries or the documents
          - Reduces the problem to monolingual retrieval
          - Issues:
              - MT is not available for all language pairs
              - Translation increases vocabulary mismatch
  • 6. Introduction: CLIR
      - Interlingua or concept based
          - Use a language-independent representation
              - Map all queries and documents, whatever their language, into a concept space
              - Define a concept space and a relevance function
      [Figure: language-independent representation]
  • 7. Concept Model
      - Document in concept space
          - Di = {t1, t2, t3 … tn}
          - ti in the space: association with every concept
      - Composite semantics of all tokens
          - Σti, Πti
      - Types of concept model
          - Explicit
          - Latent/implicit
      [Figure: token ti plotted in a concept space with axes C1, C2, C3]
  • 8. Concept Model: Explicit
      - Intuition: define concepts from external resources
      - Definition of concepts
          - Wikipedia articles, tagged web pages
      - Cover a broad range of vocabulary and language
      - Example
          - Wikipedia-based Explicit Semantic Analysis (ESA)
  • 9. Concept Model: ESA
      - Explicit concept space
          - Di = {t1, t2, t3 … tn}
          - ti = {w1a1 + w2a2 … + wnan}
      - Composite semantics of all tokens
          - Σti
      [Figure: query and docs placed in a concept space with axes University, Student, Education]
  • 10. Cross-lingual ESA
      - Extension of ESA
          - Use Wikipedia cross-language links
          - Linked articles define the same concepts in different languages
      - Per-language inverted index (EN, DE, ES), one concept vector per word:
          Word1:  w1*URI1 + w2*URI2 … wn*URIn
          Wordn:  w1*URI1 + w2*URI2 … wn*URIn
      - Semantic relatedness: cosine of the concept vectors
          Term@en:  w11*URI1 + w12*URI2 … w1n*URIn
          Term@de:  w11*URI1 + w12*URI2 … w1n*URIn
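
The CL-ESA mapping on slides 9 and 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy inverted indexes, the concept URIs and all weights are invented, and the helper names `concept_vector` and `cosine` are hypothetical rather than taken from the paper.

```python
import math
from collections import defaultdict

# Hypothetical toy inverted indexes: term -> {concept URI: weight}.
# In CL-ESA the concept URIs are shared across languages through
# Wikipedia cross-language links. All weights here are made up.
INDEX_EN = {
    "university": {"URI_University": 0.9, "URI_Education": 0.5},
    "student": {"URI_Student": 0.8, "URI_Education": 0.6},
}
INDEX_DE = {
    "universität": {"URI_University": 0.9, "URI_Education": 0.4},
    "student": {"URI_Student": 0.7, "URI_Education": 0.5},
    "banane": {"URI_Banana": 1.0},
}


def concept_vector(tokens, index):
    """Sum the concept vectors of all tokens (the Σti composition)."""
    vec = defaultdict(float)
    for tok in tokens:
        for uri, weight in index.get(tok, {}).items():
            vec[uri] += weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


query_en = concept_vector(["university", "student"], INDEX_EN)
doc_de_related = concept_vector(["universität", "student"], INDEX_DE)
doc_de_unrelated = concept_vector(["banane"], INDEX_DE)

print(cosine(query_en, doc_de_related))    # high: shared concepts
print(cosine(query_en, doc_de_unrelated))  # 0.0: no shared concept
```

Because both languages index into the same concept URIs, the English query and the German document become comparable without any translation step.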
  • 11. Concept Model: Latent
      - Intuition: semantic space of latent concepts
      - Definition of latent concepts
          - A cluster of similar things defines a latent concept
      - Latent Concept 1 (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
      - Latent Concept 2 (animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
      - Example: "Look at this cute hamster munching on a piece of broccoli" (40% Latent Concept 1 and 60% Latent Concept 2)
  • 12. Concept Model: Latent
      [Figure: latent concepts LC1, LC2, LC3 are derived from a training corpus; docs and query are placed in the space spanned by LC1, LC2, LC3]
  • 13. Latent Semantic Analysis (LSA)
      - Definition
          - Dimensionality reduction to find latent concepts
      - Approach
          - Build the term-document matrix M
          - Perform singular value decomposition (SVD) on M
          - Approximate M by keeping the top N singular values
              - The N singular values reflect N different latent concepts
              - U defines the term-concept correlation
              - V defines the document-concept correlation
      - Cross-lingual LSA
          - Use a parallel corpus
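
The LSA steps on slide 13 can be sketched with NumPy's `numpy.linalg.svd`. The term-document counts below are invented for illustration, and the function name `lsa` is a hypothetical helper, not something from the paper.

```python
import numpy as np

# Toy term-document matrix M (rows = terms, columns = documents).
# The counts are invented for illustration only.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])


def lsa(matrix, n_concepts):
    """Truncated SVD: keep only the top n_concepts singular values."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    U_k = U[:, :n_concepts]    # term-concept correlation
    s_k = s[:n_concepts]       # strength of each latent concept
    Vt_k = Vt[:n_concepts, :]  # document-concept correlation (transposed)
    return U_k, s_k, Vt_k


U_k, s_k, Vt_k = lsa(M, n_concepts=2)

# Rank-2 approximation of M built from the two latent concepts.
M_approx = U_k @ np.diag(s_k) @ Vt_k

# Project a new document (a term-count vector) into the concept space.
new_doc = np.array([1.0, 1.0, 0.0, 0.0])
new_doc_concepts = U_k.T @ new_doc

print(s_k)
print(new_doc_concepts)
```

For the cross-lingual variant, one common construction (an assumption here, the slides only say "use a parallel corpus") is to concatenate each document with its translation before building M, so that terms from both languages load onto the same latent concepts.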
  • 14. Latent Dirichlet Allocation (LDA)
      - Definition
          - Generative model
              - Each document is a mixture of latent concepts (topics)
              - Topics generate the words of a document; the parameters are learned from the corpus
      - Approach
          - The topic distribution is assumed to have a Dirichlet prior
          - Fit corpus-level and document-level properties using a variational Expectation Maximization (EM) procedure
      - Cross-lingual LDA
          - Use a parallel corpus
  • 15. Evaluation
      - Parallel corpora
          - All documents are translated into many languages
      - Relevance assessment
          - Use documents in one language as queries to retrieve documents in another language
          - Translated document = relevant document
              - No manual relevance assessment is needed
      - Measures used
          - Mean reciprocal rank (MRR)
          - Average score over all language pairs
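
Mean reciprocal rank, the measure named on slide 15, is the average over all queries of 1/rank of the single relevant document (here, the translation of the query document). A minimal sketch follows; the document ids and rankings are invented, and counting a missing document as 0 is an assumption of this sketch.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the relevant document over all queries.

    rankings: query id -> ranked list of retrieved document ids
    relevant: query id -> id of the single relevant document
    A query whose relevant document is not retrieved contributes 0.
    """
    total = 0.0
    for query_id, ranked in rankings.items():
        target = relevant[query_id]
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)


# Hypothetical English queries ranked against German documents.
rankings = {
    "q1": ["d1_de", "d7_de", "d3_de"],  # translation found at rank 1
    "q2": ["d5_de", "d2_de", "d9_de"],  # translation found at rank 2
    "q3": ["d4_de", "d8_de", "d6_de"],  # translation not retrieved
}
relevant = {"q1": "d1_de", "q2": "d2_de", "q3": "d3_de"}

mrr = mean_reciprocal_rank(rankings, relevant)
print(mrr)  # (1 + 1/2 + 0) / 3 = 0.5
```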
  • 16. Evaluation: Datasets
      - Multilingual corpora
          - Multext corpus
              - 3066 Q/A pairs from the Official Journal of the European Community
          - JRC-Acquis corpus
              - 21,000 legislative documents of the European Union
              - We randomly selected 3,000 documents as queries
      - Set-up
          - English, German and French documents were used
          - Dataset split for latent topic extraction
              - 60% learning, 40% testing
  • 17. Evaluation: Datasets
      - Wikipedia
          - Snapshots
              - 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
              - Collection of 166,484 articles
          - CL-ESA: use cross-language links for concepts in different languages
          - LSA/LDA: Wikipedia as parallel corpus
              - Use it as the training corpus for latent concept extraction
  • 18. Evaluation: Parameter
      - Cross-lingual ESA
          - Problem: too many concepts
          - Solution: only use the highest m values
      - LSI/LDA
          - Problem: computational costs increase with the number of topics
          - Solution: use a fixed number of latent topics
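
The "only use the highest m values" step for CL-ESA can be read as truncating each concept vector to its m largest weights. A small sketch of that reading is shown below; the function name `prune_top_m` and the example weights are hypothetical.

```python
import heapq


def prune_top_m(concept_vector, m):
    """Keep only the m concepts with the largest weights."""
    top = heapq.nlargest(m, concept_vector.items(), key=lambda kv: kv[1])
    return dict(top)


# Hypothetical ESA vector with invented weights.
vector = {"URI_University": 0.9, "URI_Education": 0.5, "URI_Banana": 0.1}
pruned = prune_top_m(vector, m=2)
print(pruned)
```

Pruning keeps the vectors sparse, which bounds the cost of the cosine computation at the price of discarding weakly associated concepts.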
  • 19. Evaluation: Results
      [Figure: results on the Multext dataset]
  • 20. Evaluation: Results
      [Figure: results on the JRC-Acquis dataset]
  • 21. Conclusion
      - Parameter tuning
          - ESA performs well for m = 10,000
          - A maximum of 500 topics was tested for LSI
              - Not maximal performance, but it seems to converge
      - Results
          - LSA performs better than LDA
          - CL-ESA and LSA give comparable results
      - Explicit vs. implicit
          - The explicit model performs better than the latent models