Latent Semantic Analysis
Auro Tripathy
ipserv@yahoo.com
Outline

   Introduction
   Singular Value Decomposition
   Dimensionality Reduction
   LSA in Information Retrieval
Latent Semantic Analysis




Introduction
Mathematical treatment capable of inferring meaning
   Measures of word-word, word-passage,
    & passage-passage relations that
    correlate well with human
    understanding of semantic similarity
   Similarity estimates are NOT based on
    contiguity frequencies, co-occurrence
    counts, or usage correlations
   Mathematical way capable of inferring
    deeper relationships; hence “latent”
Akin to a well-read nun dispensing
sex-advice

   Analysis of text alone
   Its knowledge does NOT come from
    perceived information about the physical
    world, NOT from instinct, NOT from
    feelings, NOT from emotions
   Does NOT take into account word order, phrases, syntactic relationships, or logic
   It takes in large amounts of text and looks
    for mutual interdependencies in the text
Words and Passages
   LSA represents the meaning of a word as the
    average of the meaning of all the passages in
    which it appears…
   …and the meaning of the passage as an
    average of the meaning of the words it
    contains


What is LSA?
   LSA is a mathematical technique for
    extracting and inferring relations of
    expected contextual usage of words in
    documents
What LSA is not
   Not a natural language processing
    program
   Not an artificial intelligence program
   Does NOT use dictionaries or databases
   Does NOT use syntactic parsers
   Does not use morphologies
Takes as input: words and text paragraphs
Example
   Titles of N=9 technical memoranda
       Five on human-computer interaction
       Four on mathematical graph theory
       Disjoint topics




                Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Sample Word-by-Document Matrix
    Word selection criterion: the word occurs in at least two of the titles
    Cell values show how much was said about a topic




                     Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
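
A minimal sketch of building such a word-by-document count matrix in Python; the documents below are illustrative stand-ins, not the nine memorandum titles from the source, but the "occurs in at least two titles" selection rule is the same.

import numpy as np
from collections import Counter

# Illustrative documents; not the nine memorandum titles from the source.
docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]
doc_tokens = [d.split() for d in docs]

# Keep only words that occur in at least two of the documents,
# mirroring the selection criterion on the slide.
doc_freq = Counter(w for tokens in doc_tokens for w in set(tokens))
vocab = sorted(w for w, c in doc_freq.items() if c >= 2)

# Rows are terms, columns are documents; entries are raw counts.
M = np.array([[tokens.count(w) for tokens in doc_tokens] for w in vocab])
print(vocab)   # e.g. ['computer', 'of', 'the', 'trees']
print(M)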
Semantic Similarity
 using the Spearman rank correlation coefficient
    The correlation between human and user is
     negative, -0.38
    The correlation between human and minor is
     also negative, -0.29
    This is expected: the words never appear in the same passage, so there are no co-occurrences
http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
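
A minimal sketch of this Spearman computation with SciPy; the two count vectors are illustrative rows of a word-by-document matrix over nine titles, not guaranteed to be the exact rows from the example.

from scipy.stats import spearmanr

# Occurrence counts of two words across nine titles (illustrative).
human = [1, 0, 0, 1, 0, 0, 0, 0, 0]
user  = [0, 1, 1, 0, 1, 0, 0, 0, 0]

rho, p_value = spearmanr(human, user)
print(f"Spearman rho = {rho:.2f}")   # negative: the two words never co-occur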
Singular Value Decomposition
The Term Space
[Figure: the term-by-document matrix with each row a term vector across the documents]
                    Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Document Space
[Figure: the term-by-document matrix with each column a document vector across the terms]
                    Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Semantic Space
one space for terms and documents

    Represent terms AND documents in one
     space
    Makes it possible to calculate similarities
        Between documents
        Between terms
        Between terms and documents
The Decomposition

M = T · S · DT

where M is the t×d term-by-document matrix, T is t×r, S is the r×r diagonal matrix of singular values, and DT is r×d.

     Splits the term-document matrix into three matrices
     New space, the SVD space
           because new axes were found by SVD along which the terms
            and documents can be grouped
New Term Vector, New Document
Vector, & Singular Values

      T contains in its rows the term vectors
       scaled to a new basis
      DT contains the new vectors of the
       documents
      S contains the singular values
          σ1, σ2, …, σn
          where σ1 ≥ σ2 ≥ … ≥ σn ≥ 0
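
A minimal sketch of the decomposition with NumPy, using a small illustrative matrix in place of the real term-by-document matrix:

import numpy as np

# Small illustrative term-by-document matrix M (t x d).
M = np.array([[1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

# Split M into the three factors: M = T . S . DT
T, s, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(s)                       # singular values on the diagonal, σ1 ≥ σ2 ≥ … ≥ 0

assert np.allclose(T @ S @ Dt, M)    # the three factors multiply back to M
print(np.round(s, 3))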
Dimensionality Reduction




To reveal the latent semantic structure
Reduce to k Dimensions

M ≈ Tk · Sk · DkT

where Tk is t×k, Sk is the k×k matrix of the k largest singular values, and DkT is k×d.
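
Continuing the sketch above (T, S, Dt are assumed to come from the full SVD), truncation is just a matter of keeping the first k rows and columns of each factor:

k = 2
T_k  = T[:, :k]          # t x k  term vectors in the reduced space
S_k  = S[:k, :k]         # k x k  the k largest singular values
Dt_k = Dt[:k, :]         # k x d  document vectors in the reduced space
print(T_k.shape, S_k.shape, Dt_k.shape)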
Example
Term Vector Reduced to two Dimensions
[Figure: the term (T), singular value (S), and document (D) matrices of the example, each reduced to two dimensions]
              Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Reconstruction of the original matrix
based on the reduced dimensions



[Figure: the reconstructed (NEW) matrix shown alongside the Original term-by-document matrix]
  Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
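
Continuing the same sketch (T_k, S_k, Dt_k from the truncation step above), the reconstruction is the product of the truncated factors, i.e. the best rank-k approximation of the original matrix; the correlations below are recomputed on this new matrix.

M_k = T_k @ S_k @ Dt_k               # rank-k reconstruction of M
print(np.round(M_k, 2))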
Recomputed Semantic Similarity
        using the Spearman rank correlation coefficient
NEW (reconstructed matrix):
  Spearman ρ (human.user) = +0.94
  Spearman ρ (human.minor) = -0.83

Original matrix:
  Spearman ρ (human.user) = -0.38
  Spearman ρ (human.minor) = -0.29


The human-user correlation went up and the human-minor correlation went down.
Correlation between a title and all
     other titles – Raw Data



•Correlation between the human-computer interaction titles was low
•The average correlation was 0.2; half of the Spearman correlations were 0

•Correlation among the four graph-theory titles (mx / my) was mixed
•The average Spearman correlation was 0.44

•Correlation between the human-computer interaction titles and the
graph-theory titles was about -0.3, even though the two groups share no semantic overlap
                       Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Correlation in the reduced
     dimension (k=2) space



•The average correlation among the human-computer interaction titles jumped from 0.2 to 0.92

•Correlation among the graph-theory titles (mx/my) was high: 1.0

•Correlation between the human-computer interaction titles and the
graph-theory titles was strongly negative
                       Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
LSA in Information Retrieval
How to treat a query
   Build the term-by-document matrix
   Perform SVD and reduce to 50-400 dimensions
   A query is a “pseudo-document”
       A weighted average of the vectors of the words it contains
   Use a similarity metric (such as cosine) between the query vector and the
    document vectors (see the sketch after this list)
   Rank the results
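
A minimal, self-contained sketch of this retrieval procedure; the documents, the query, and k=2 are illustrative choices, and the fold-in step q · T_k · S_k^-1 is the usual way of treating a query as a pseudo-document.

import numpy as np

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]
doc_tokens = [d.split() for d in docs]
vocab = sorted({w for tokens in doc_tokens for w in tokens})

# Term-by-document count matrix (t x d).
M = np.array([[tokens.count(w) for tokens in doc_tokens] for w in vocab], dtype=float)

# Truncated SVD: keep the k largest singular values.
k = 2
T, s, Dt = np.linalg.svd(M, full_matrices=False)
T_k, S_k, Dt_k = T[:, :k], np.diag(s[:k]), Dt[:k, :]

# Fold the query in as a pseudo-document: q_hat = q · T_k · S_k^-1
query = "human computer interaction"
q = np.array([query.split().count(w) for w in vocab], dtype=float)
q_hat = q @ T_k @ np.linalg.inv(S_k)

# Rank documents by cosine similarity between q_hat and each document vector.
doc_vecs = Dt_k.T                                     # one k-dimensional vector per document
scores = doc_vecs @ q_hat / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(scores)[::-1])                       # document indices, best match first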
The Query Vector




   Does better than literal matching between terms in the query and the documents
   Superior when the query and the documents use different words

                Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
References
• Latent Semantic Indexing and
  Information Retrieval, Johanna Geiß
• An Introduction to Latent Semantic
  Analysis, Landauer, Foltz, Laham
