Latent Semantic Analysis
Auro Tripathy
ipserv@yahoo.com
Outline

   Introduction
   Singular Value Decomposition
   Dimensionality Reduction
   LSA in Information Retrieval
Latent Semantic Analysis




Introduction
Mathematical treatment capable of inferring meaning
   Measures of word-word, word-passage,
    & passage-passage relations that
    correlate well with human
    understanding of semantic similarity
   Similarity estimates are NOT based on
    contiguity frequencies, co-occurrence
    counts, or usage correlations
   Mathematical way capable of inferring
    deeper relationships; hence “latent”
Akin to a well-read nun dispensing
sex-advice

   Analysis of text alone
   Its knowledge does NOT come from
    perceived information about the physical
    world, NOT from instinct, NOT from
    feelings, NOT from emotions
   Does NOT take into account word order, phrases, syntactic relationships, or logic
   It takes in large amounts of text and looks
    for mutual interdependencies in the text
Words and Passages
   LSA represents the meaning of a word as the
    average of the meaning of all the passages in
    which it appears…
   …and the meaning of the passage as an
    average of the meaning of the words it
    contains


What is LSA?
   LSA is a mathematical technique for
    extracting and inferring relations of
    expected contextual usage of words in
    documents
What LSA is not
   Not a natural language processing
    program
   Not an artificial intelligence program
   Does NOT use dictionaries or databases
   Does NOT use syntactic parsers
   Does not use morphologies
Takes as input: words and text paragraphs
Example
   Titles of N=9 technical memoranda
       Five on human-computer interaction
       Four on mathematical graph theory
       Disjoint topics




                Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Sample Word-by-Document Matrix
    Word selection criterion: the word occurs in at least two of the titles
    Cell values show how much was said about a topic




                     Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
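
A minimal sketch of building such a word-by-document count matrix in Python; the documents below are illustrative stand-ins, not the nine memorandum titles from the source, but the "occurs in at least two titles" selection rule is the same.

import numpy as np
from collections import Counter

# Illustrative documents; not the nine memorandum titles from the source.
docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]
doc_tokens = [d.split() for d in docs]

# Keep only words that occur in at least two of the documents,
# mirroring the selection criterion on the slide.
doc_freq = Counter(w for tokens in doc_tokens for w in set(tokens))
vocab = sorted(w for w, c in doc_freq.items() if c >= 2)

# Rows are terms, columns are documents; entries are raw counts.
M = np.array([[tokens.count(w) for tokens in doc_tokens] for w in vocab])
print(vocab)   # e.g. ['computer', 'of', 'the', 'trees']
print(M)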
Semantic Similarity
 using the Spearman rank correlation coefficient
    The correlation between human and user is
     negative, -0.38
    The correlation between human and minor is
     also negative, -0.29
    This is expected: the words never appear in the same passage, so there are no co-occurrences
http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
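
A minimal sketch of this Spearman computation with SciPy; the two count vectors are illustrative rows of a word-by-document matrix over nine titles, not guaranteed to be the exact rows from the example.

from scipy.stats import spearmanr

# Occurrence counts of two words across nine titles (illustrative).
human = [1, 0, 0, 1, 0, 0, 0, 0, 0]
user  = [0, 1, 1, 0, 1, 0, 0, 0, 0]

rho, p_value = spearmanr(human, user)
print(f"Spearman rho = {rho:.2f}")   # negative: the two words never co-occur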
Singular Value Decomposition
The Term Space
[Figure: the term-by-document matrix with each row a term vector across the documents]
                    Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Document Space
[Figure: the term-by-document matrix with each column a document vector across the terms]
                    Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Semantic Space
one space for terms and documents

    Represent terms AND documents in one
     space
    Makes it possible to calculate similarities
        Between documents
        Between terms
        Between terms and documents
The Decomposition

M = T · S · DT

where M is the t×d term-by-document matrix, T is t×r, S is the r×r diagonal matrix of singular values, and DT is r×d.

     Splits the term-document matrix into three matrices
     New space, the SVD space
           because new axes were found by SVD along which the terms
            and documents can be grouped
New Term Vector, New Document
Vector, & Singular Values

      T contains in its rows the term vectors
       scaled to a new basis
      DT contains the new vectors of the
       documents
      S contains the singular values
          σ1, σ2, …, σn
          where σ1 ≥ σ2 ≥ … ≥ σn ≥ 0
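
A minimal sketch of the decomposition with NumPy, using a small illustrative matrix in place of the real term-by-document matrix:

import numpy as np

# Small illustrative term-by-document matrix M (t x d).
M = np.array([[1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

# Split M into the three factors: M = T . S . DT
T, s, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(s)                       # singular values on the diagonal, σ1 ≥ σ2 ≥ … ≥ 0

assert np.allclose(T @ S @ Dt, M)    # the three factors multiply back to M
print(np.round(s, 3))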
Dimensionality Reduction




To reveal the latent semantic structure
Reduce to k Dimensions

M ≈ Tk · Sk · DkT

where Tk is t×k, Sk is the k×k matrix of the k largest singular values, and DkT is k×d.
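
Continuing the sketch above (T, S, Dt are assumed to come from the full SVD), truncation is just a matter of keeping the first k rows and columns of each factor:

k = 2
T_k  = T[:, :k]          # t x k  term vectors in the reduced space
S_k  = S[:k, :k]         # k x k  the k largest singular values
Dt_k = Dt[:k, :]         # k x d  document vectors in the reduced space
print(T_k.shape, S_k.shape, Dt_k.shape)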
Example
Term Vector Reduced to two Dimensions
[Figure: the term (T), singular value (S), and document (D) matrices of the example, each reduced to two dimensions]
              Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Reconstruction of the original matrix
based on the reduced dimensions



[Figure: the reconstructed (NEW) matrix shown alongside the Original term-by-document matrix]
  Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
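
Continuing the same sketch (T_k, S_k, Dt_k from the truncation step above), the reconstruction is the product of the truncated factors, i.e. the best rank-k approximation of the original matrix; the correlations below are recomputed on this new matrix.

M_k = T_k @ S_k @ Dt_k               # rank-k reconstruction of M
print(np.round(M_k, 2))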
Recomputed Semantic Similarity
        using the Spearman rank correlation coefficient
NEW (reconstructed matrix):
  Spearman ρ (human.user) = +0.94
  Spearman ρ (human.minor) = -0.83

Original matrix:
  Spearman ρ (human.user) = -0.38
  Spearman ρ (human.minor) = -0.29


The human-user correlation went up and the human-minor correlation went down.
Correlation between a title and all
     other titles – Raw Data



•Correlation between the human-computer interaction titles was low
•The average correlation was 0.2; half of the Spearman correlations were 0

•Correlation among the four graph-theory titles (mx / my) was mixed
•The average Spearman correlation was 0.44

•Correlation between the human-computer interaction titles and the
graph-theory titles was about -0.3, even though the two groups share no semantic overlap
                       Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Correlation in the reduced
     dimension (k=2) space



•The average correlation among the human-computer interaction titles jumped from 0.2 to 0.92

•Correlation among the graph-theory titles (mx/my) was high: 1.0

•Correlation between the human-computer interaction titles and the
graph-theory titles was strongly negative
                       Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
LSA in Information Retrieval
How to treat a query
   Build the term-by-document matrix
   Perform SVD and reduce to 50-400 dimensions
   A query is a “pseudo-document”
       A weighted average of the vectors of the words it contains
   Use a similarity metric (such as cosine) between the query vector and the
    document vectors (see the sketch after this list)
   Rank the results
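
A minimal, self-contained sketch of this retrieval procedure; the documents, the query, and k=2 are illustrative choices, and the fold-in step q · T_k · S_k^-1 is the usual way of treating a query as a pseudo-document.

import numpy as np

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]
doc_tokens = [d.split() for d in docs]
vocab = sorted({w for tokens in doc_tokens for w in tokens})

# Term-by-document count matrix (t x d).
M = np.array([[tokens.count(w) for tokens in doc_tokens] for w in vocab], dtype=float)

# Truncated SVD: keep the k largest singular values.
k = 2
T, s, Dt = np.linalg.svd(M, full_matrices=False)
T_k, S_k, Dt_k = T[:, :k], np.diag(s[:k]), Dt[:k, :]

# Fold the query in as a pseudo-document: q_hat = q · T_k · S_k^-1
query = "human computer interaction"
q = np.array([query.split().count(w) for w in vocab], dtype=float)
q_hat = q @ T_k @ np.linalg.inv(S_k)

# Rank documents by cosine similarity between q_hat and each document vector.
doc_vecs = Dt_k.T                                     # one k-dimensional vector per document
scores = doc_vecs @ q_hat / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(scores)[::-1])                       # document indices, best match first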
The Query Vector




   Does better than literal matching between terms in the query and the documents
   Superior when the query and the documents use different words

                Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
References
• Latent Semantic Indexing and
  Information Retrieval, Johanna Geiß
• An Introduction to Latent Semantic
  Analysis, Landauer, Foltz, Laham
