Explicit vs. Latent Concept Models for Cross-Language Information Retrieval

Transcript

  • 1. Digital Enterprise Research Institute www.deri.ie
    Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
    Nitish Aggarwal, DERI, NUI Galway (firstname.lastname@deri.org)
    Tuesday, 26th June, 2012. DERI Reading Group.
    Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
    Enabling Networked Knowledge
  • 2. Based On
     Title: "Explicit vs. Latent Concept Models for Cross-Language Information Retrieval"
     Authors: Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, Steffen Staab
     Published: International Joint Conference on Artificial Intelligence (IJCAI), 2009
  • 3. Overview
     Introduction
     Cross-lingual information retrieval (CLIR)
     Concept model
     Explicit semantics
     Latent semantics
     Evaluation
     Conclusion
  • 4. Introduction: CLIR
     Cross-Lingual Information Retrieval
    – Many documents and web sites are written in different languages
    – Goal: retrieve all information without a language barrier
    – Query and documents are in different languages
  • 5. Introduction: CLIR
     CLIR based on machine translation (MT)
    – Translate the queries or the documents
    – Reduces the problem to monolingual retrieval
     Issues
    – MT is not available for all language pairs
    – Translation increases vocabulary mismatch
  • 6. Introduction: CLIR
     Interlingua- or concept-based CLIR
    – Use a language-independent representation
    – Map all queries and documents, whatever their language, into a concept space
    – Define a concept space and a relevance function over it
  • 7. Concept Model
     Document in concept space
    – Di = {t1, t2, t3, …, tn}
    – Each token ti lives in the concept space (C1, C2, C3, …), with an association weight for every concept
     Composite semantics of all tokens
    – Σ ti or Π ti
     Types of concept model
    – Explicit
    – Latent / implicit
  • 8. Concept Model: Explicit
     Intuition: define concepts from external resources
    – Definition of concepts: Wikipedia articles, tagged web pages
    – Covers a broad range of vocabulary and languages
     Example: Wikipedia-based Explicit Semantic Analysis (ESA)
  • 9. Concept Model: ESA
     Explicit concept space
    – Di = {t1, t2, t3, …, tn}
    – ti = w1·a1 + w2·a2 + … + wn·an, where each aj is a Wikipedia-article concept (e.g. University, Student, Education)
     Composite semantics of all tokens: Σ ti
    – Query and documents are compared in this shared concept space
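The ESA scheme above can be sketched in a few lines of Python. The term-concept weights below are invented toy values, not real ESA scores; only the mechanics follow the slide: sum the concept vectors of all tokens (Σ ti) and compare query and document by cosine in the concept space.

```python
from math import sqrt

# Hypothetical term-concept weights: each term maps to Wikipedia-article
# concepts with an association score (values invented for illustration).
TERM_CONCEPTS = {
    "student":   {"University": 0.8, "Education": 0.6},
    "professor": {"University": 0.9, "Education": 0.4},
    "tuition":   {"University": 0.5, "Education": 0.7},
}

def esa_vector(tokens):
    """Composite semantics: sum the concept vectors of all tokens (Sigma t_i)."""
    vec = {}
    for tok in tokens:
        for concept, w in TERM_CONCEPTS.get(tok, {}).items():
            vec[concept] = vec.get(concept, 0.0) + w
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = esa_vector(["student"])
doc = esa_vector(["professor", "tuition"])
print(round(cosine(query, doc), 3))
```

Even though "student" never occurs in the document, the two concept vectors are highly similar because both activate the same explicit concepts.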
  • 10. Cross-Lingual ESA (CL-ESA)
     Extension of ESA
    – Use Wikipedia cross-language links
    – Linked articles define the same concept in different languages
     One inverted index per language (EN, DE, ES), all pointing into the same concept space
    – word → w1·URI1 + w2·URI2 + … + wn·URIn
    – e.g. Term@en → w11·URI1 + w12·URI2 + … + w1n·URIn, Term@de → w11·URI1 + w12·URI2 + … + w1n·URIn
     Semantic relatedness: cosine between concept vectors
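The cross-lingual step can be illustrated the same way. Here terms from different languages index into a shared set of concept URIs (standing in for Wikipedia articles aligned by cross-language links); the index entries and weights are invented for illustration. An English query and a German document then become directly comparable.

```python
from math import sqrt

# Hypothetical per-language inverted index mapping (language, term) to
# language-independent concept URIs; weights invented for illustration.
INDEX = {
    ("en", "university"):  {"URI:University": 0.9,  "URI:Education": 0.3},
    ("de", "universität"): {"URI:University": 0.85, "URI:Education": 0.35},
    ("de", "studium"):     {"URI:Education": 0.8},
}

def clesa_vector(lang, tokens):
    """Map tokens of a given language into the shared concept space."""
    vec = {}
    for tok in tokens:
        for uri, w in INDEX.get((lang, tok), {}).items():
            vec[uri] = vec.get(uri, 0.0) + w
    return vec

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q_en = clesa_vector("en", ["university"])
d_de = clesa_vector("de", ["universität", "studium"])
print(round(cosine(q_en, d_de), 3))
```

No translation happens at query time; the cross-language links were resolved once, when the per-language indexes were built.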
  • 11. Concept Model: Latent
     Intuition: semantic space of latent concepts
    – A cluster of similar things defines a latent concept
     Example
    – Latent Concept 1 (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
    – Latent Concept 2 (animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
    – "Look at this cute hamster munching on a piece of broccoli" → 40% Latent Concept 1 and 60% Latent Concept 2
  • 12. Concept Model: Latent
    – (Diagram) Latent concepts (LC1, LC2, LC3) are derived from a training corpus; query and documents are then mapped into this latent concept space
  • 13. Latent Semantic Analysis (LSA)
     Definition
    – Dimensionality reduction to find latent concepts
     Approach
    – Build the term-document matrix M
    – Perform singular value decomposition (SVD) on M
    – Approximate M by keeping the top N singular values
    – The N singular values reflect N different latent concepts
    – U gives the term-concept correlation, V the document-concept correlation
     Cross-lingual LSA: use a parallel corpus
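A minimal sketch of the SVD step, using power iteration on MᵀM instead of a full SVD routine so it stays dependency-free. The toy term-document counts are invented: two documents share "university"-style terms, a third is about food, and the top right singular vector (a column of V, the document-concept correlation) loads on the first two documents only.

```python
from math import sqrt

# Toy term-document matrix M (rows = terms, columns = documents); counts invented.
M = [
    [2, 1, 0],  # "university"
    [2, 1, 0],  # "student"
    [0, 0, 3],  # "broccoli"
]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def normalize(x):
    n = sqrt(sum(v * v for v in x))
    return [v / n for v in x]

def top_right_singular_vector(A, iters=300):
    """Power iteration on A^T A converges to the top right singular vector of A,
    i.e. the document loadings on the strongest latent concept."""
    At = transpose(A)
    x = normalize([1.0] * len(A[0]))
    for _ in range(iters):
        x = normalize(matvec(At, matvec(A, x)))
    return x

v = top_right_singular_vector(M)
print([round(c, 3) for c in v])  # documents 1 and 2 load on the concept, document 3 does not
```

Keeping the top N such vectors is exactly the rank-N truncation the slide describes; a real system would use a library SVD rather than power iteration.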
  • 14. Latent Dirichlet Allocation (LDA)
     Definition
    – Generative model: each document is a mixture of latent topics, and topics generate words; the parameters are learned from the corpus
     Approach
    – The topic distribution of a document is assumed to have a Dirichlet prior
    – Fit corpus- and document-level parameters with a variational Expectation Maximization (EM) procedure
     Cross-lingual LDA: use a parallel corpus
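The generative story can be made concrete with a tiny simulation. The two topics reuse the food/animals example from slide 11 (probabilities invented and renormalized for illustration); inference (variational EM) is not shown, only the forward direction: draw a topic mixture θ from a Dirichlet prior, then for each word pick a topic from θ and a word from that topic.

```python
import random

random.seed(0)

# Hypothetical topic-word distributions (cf. the food/animals example; invented values).
TOPICS = {
    "food":    {"broccoli": 0.5, "bananas": 0.3, "breakfast": 0.2},
    "animals": {"kittens": 0.4, "hamster": 0.4, "cute": 0.2},
}
ALPHA = 0.5  # symmetric Dirichlet concentration over topic mixtures

def sample_dirichlet(alpha, k):
    """Draw theta ~ Dir(alpha, ..., alpha) via normalized Gamma draws."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(dist):
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point rounding

def generate_document(n_words):
    names = list(TOPICS)
    theta = sample_dirichlet(ALPHA, len(names))   # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = random.choices(names, weights=theta)[0]   # pick a topic
        words.append(sample_categorical(TOPICS[z]))   # pick a word from it
    return theta, words

theta, doc = generate_document(8)
print(theta, doc)
```

LDA inference runs this story in reverse: given only the words, it recovers the topic-word distributions and each document's θ, which then serve as the latent concept representation for retrieval.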
  • 15. Evaluation
     Parallel corpora
    – All documents are translated into many languages
     Relevance assessment
    – Use documents in one language as queries to retrieve documents in another language
    – The translated document is the relevant document, so no manual relevance assessment is needed
     Measures used
    – Mean reciprocal rank (MRR)
    – Scores averaged over all language pairs
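Since each query has exactly one relevant document (its translation), the MRR computation is straightforward: find the rank of the translation in each result list and average the reciprocal ranks. The run data below is invented for illustration.

```python
def mean_reciprocal_rank(rankings):
    """rankings maps a query id to its ranked list of retrieved doc ids;
    the relevant document shares the query's id (it is the translation)."""
    total = 0.0
    for qid, ranked in rankings.items():
        rank = ranked.index(qid) + 1  # 1-based rank of the translation
        total += 1.0 / rank
    return total / len(rankings)

# Hypothetical retrieval output for three cross-language queries:
runs = {
    "d1": ["d1", "d7", "d3"],  # translation at rank 1
    "d2": ["d5", "d2", "d9"],  # rank 2
    "d3": ["d4", "d8", "d3"],  # rank 3
}
print(mean_reciprocal_rank(runs))  # (1 + 1/2 + 1/3) / 3
```

In the paper's setup this score is then averaged again over all language pairs (e.g. EN→DE, DE→FR, ...).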
  • 16. Evaluation: Datasets
     Multilingual corpora
    – Multext corpus: 3,066 Q/A pairs from the Official Journal of the European Community
    – JRC-Acquis corpus: 21,000 legislative documents of the European Union; 3,000 randomly selected documents used as queries
     Setup
    – English, German and French documents were used
    – Dataset split for latent topic extraction: 60% learning, 40% testing
  • 17. Evaluation: Datasets
     Wikipedia
    – Snapshots: 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
    – Collection of 166,484 articles
     CL-ESA: use cross-language links to align concepts across languages
     LSA/LDA: use Wikipedia as a parallel corpus, i.e. as training data for latent concept extraction
  • 18. Evaluation: Parameters
     Cross-lingual ESA
    – Problem: too many concepts
    – Solution: keep only the m highest-weighted concepts
     LSI/LDA
    – Problem: computational cost increases with the number of topics
    – Solution: use a fixed number of latent topics
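The ESA pruning step (keep only the m highest-weighted concepts) is a one-liner over a sparse vector; the concept names and weights below are invented for illustration.

```python
import heapq

def truncate(vec, m):
    """Keep only the m highest-weighted concepts of a sparse ESA vector."""
    return dict(heapq.nlargest(m, vec.items(), key=lambda kv: kv[1]))

vec = {"C1": 0.9, "C2": 0.1, "C3": 0.5, "C4": 0.7}
print(truncate(vec, 2))  # {'C1': 0.9, 'C4': 0.7}
```

With m around 10,000 (the value the conclusion reports as working well), this keeps vectors small enough for efficient cosine comparisons while discarding only low-weight concepts.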
  • 19. Evaluation: Results
     Multext dataset (results chart not reproduced in the transcript)
  • 20. Evaluation: Results
     JRC-Acquis dataset (results chart not reproduced in the transcript)
  • 21. Conclusion
     Parameter tuning
    – ESA performs well for m = 10,000
    – A maximum of 500 topics was tested for LSI; performance is not yet maximal but appears to converge
     Results
    – LSA performs better than LDA
    – CL-ESA and LSA give comparable results
     Explicit vs. implicit
    – The explicit model performs better than the latent model