SlideShare a Scribd company logo
1 of 25
INFORMATION RETRIEVAL
TECHNIQUES
BY
DR. ADNAN ABID
Lecture # 31
Dimensionality reduction
1
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the underline
sources
1. “Introduction to information retrieval” by Prabhakar Raghavan,
Christopher D. Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C.
Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro
Bozzon, Marco Brambilla
Outline
• Dimensionality reduction
• Random projection onto k<<m axes
• Computing the random projection
• Latent semantic indexing (LSI)
• The matrix
• Dimension reduction
3
Dimensionality reduction
• What if we could take our vectors and “pack” them into fewer
dimensions (say 10000100) while preserving distances?
• (Well, almost.)
• Speeds up cosine computations.
• Two methods:
• Random projection.
• “Latent semantic indexing”.
Random projection onto k<<m axes.
• Choose a random direction x1 in the vector space.
• For i = 2 to k,
• Choose a random direction xi that is orthogonal to x1, x2, … xi-1
.
• Project each doc vector into the subspace x1, x2, … xk.
E.g., from 3 to 2 dimensions
D2
D1
x1
t 3
x2
t 2
t 1
x1
x2
D2
D1
x1 is a random direction in (t1,t2,t3) space.
x2 is chosen randomly but orthogonal to x1.
Guarantee
• With high probability, relative distances are (approximately)
preserved by projection.
• Pointer to precise theorem in Resources.
Computing the random projection
• Projecting n vectors from m dimensions down to k
dimensions:
• Start with m  n matrix of terms  docs, A.
• Find random k  m orthogonal projection matrix R.
• Compute matrix product W = R  A.
• jth column of W is the vector corresponding to doc j, but
now in k << m dimensions.
Cost of computation
• This takes a total of kmn multiplications.
• Expensive - see Resources for ways to do essentially the same thing,
quicker.
• Exercise: by projecting from 10000 dimensions down to 100, are we
really going to make each cosine computation faster?
• Size of the vectors would decrease
• Will result into smaller postings
Latent semantic indexing (LSI)
• Another technique for dimension reduction
• Random projection was data-independent
• LSI on the other hand is data-dependent
• Eliminate redundant axes
• Pull together “related” axes
• car and automobile
Notions from linear algebra
• Matrix, vector
• Matrix transpose and product
• Rank
• Eigenvalues and eigenvectors.
The matrix
12
• The matrix
has rank 2: the first two rows are linearly independent, so the rank is at
least 2, but all three rows are linearly dependent (the first is equal to
the sum of the second and third) so the rank must be less than 3.
• The matrix
has rank 1: there are nonzero colums. So the rank is positive. but any
pair of columns is linearly dependent.
The matrix
Similarly, the transpose
of A has rank 1. Indeed, since the column vectors of A are the row
vectors of the transpose of A. the statement that the column rank of a
matrix equals its row rank is equivalent to the statement that the rank
of a matrix is equal to the rank of its transpose, i.e., rk(A) = rk(A).
13
Singular-Value Decomposition
• Recall m  n matrix of terms  docs, A.
• A has rank r  m,n.
• Define term-term correlation matrix T=AAt
• At denotes the matrix transpose of A.
• T is a square, symmetric m  m matrix.
• Doc-doc correlation matrix D=AtA.
• D is a square, symmetric n  n matrix.
Why?
Eigenvectors
• Denote by P the m  r matrix of eigenvectors of T.
• Denote by R the n  r matrix of eigenvectors of D.
• It turns out A can be expressed (decomposed) as A = PQRt
• Q is a diagonal matrix with the eigenvalues of AAt in sorted order.
Eigen Vectors
• The transformation matrix preserves the direction of
• vectors parallel to (in blue) and [ _1
1] (in violet). The
• points that lie on the line through the origin, parallel to an eigenvector, remain
on the line after the transformation. The vectors in red are not eigenvectors,
therefore their direction is altered by the transformation. See also: An extended
version. showing all four quadrants.
Eigen Vectors and Eigen Values
• Av = λv
Visualization
=
A P Q Rt
mn mr rr rn
Dimension reduction
• For some s << r, zero out all but the s biggest eigenvalues in Q.
• Denote by Qs this new version of Q.
• Typically s in the hundreds while r could be in the (tens of) thousands.
• Let As = P Qs Rt
• Turns out As is a pretty good approximation to A.
Visualization
=
As P Qs Rt
0
The columns of As represent the docs, but in s<<m dimensions.
0
0
Guarantee
• Relative distances are (approximately) preserved by projection:
• Of all m  n rank s matrices, As is the best approximation to A.
• Pointer to precise theorem in Resources.
Doc-doc similarities
•As As
t is a matrix of doc-doc similarities:
•the (j,k) entry is a measure of the similarity of doc j
to doc k.
Semi-precise intuition
• We accomplish more than dimension reduction here:
• Docs with lots of overlapping terms stay together
• Terms from these docs also get pulled together.
• Thus car and automobile get pulled together because both co-occur
in docs with tires, radiator, cylinder, etc.
Query processing
• View a query as a (short) doc:
• call it row 0 of As.
• Now the entries in row 0 of As As
t give the similarities of the query
with each doc.
• Entry (0,j) is the score of doc j on the query.
• Exercise: fill in the details of scoring/ranking.
• LSI is expensive in terms of computation…
• Randomly choosing a subset of documents for dimensional reduction
can give a significant boost in performance.
Resources
• Random projection theorem:
http://citeseer.nj.nec.com/dasgupta99elementary.html
• Faster random projection:
http://citeseer.nj.nec.com/frieze98fast.html
• Latent semantic indexing:
http://citeseer.nj.nec.com/deerwester90indexing.html
• Books: MG 4.6, MIR 2.7.2.

More Related Content

Similar to Lecture-31.pptx

Linear Algebra for beginner and advance students
Linear Algebra for beginner and advance studentsLinear Algebra for beginner and advance students
Linear Algebra for beginner and advance studentsShamshadAli58
 
DimensionalityReduction.pptx
DimensionalityReduction.pptxDimensionalityReduction.pptx
DimensionalityReduction.pptx36rajneekant
 
Kulum alin-11 jan2014
Kulum alin-11 jan2014Kulum alin-11 jan2014
Kulum alin-11 jan2014rolly purnomo
 
Brief review on matrix Algebra for mathematical economics
Brief review on matrix Algebra for mathematical economicsBrief review on matrix Algebra for mathematical economics
Brief review on matrix Algebra for mathematical economicsfelekephiliphos3
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithmsMazin Alwaaly
 
Applied Mathematics 3 Matrices.pdf
Applied Mathematics 3 Matrices.pdfApplied Mathematics 3 Matrices.pdf
Applied Mathematics 3 Matrices.pdfPrasadBaravkar1
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-Ihktripathy
 
Chip Package Co-simulation
Chip Package Co-simulationChip Package Co-simulation
Chip Package Co-simulationMing-Chi Chen
 
20070823
2007082320070823
20070823neostar
 
Blind Source Separation using Dictionary Learning
Blind Source Separation using Dictionary LearningBlind Source Separation using Dictionary Learning
Blind Source Separation using Dictionary LearningDavide Nardone
 
Matrix Algebra : Mathematics for Business
Matrix Algebra : Mathematics for BusinessMatrix Algebra : Mathematics for Business
Matrix Algebra : Mathematics for BusinessKhan Tanjeel Ahmed
 
Vectorise all the things
Vectorise all the thingsVectorise all the things
Vectorise all the thingsJodieBurchell1
 
Linear_Algebra.pptx
Linear_Algebra.pptxLinear_Algebra.pptx
Linear_Algebra.pptxSuhasL11
 

Similar to Lecture-31.pptx (20)

LinearAlgebra.ppt
LinearAlgebra.pptLinearAlgebra.ppt
LinearAlgebra.ppt
 
Linear Algebra for beginner and advance students
Linear Algebra for beginner and advance studentsLinear Algebra for beginner and advance students
Linear Algebra for beginner and advance students
 
DimensionalityReduction.pptx
DimensionalityReduction.pptxDimensionalityReduction.pptx
DimensionalityReduction.pptx
 
Kulum alin-11 jan2014
Kulum alin-11 jan2014Kulum alin-11 jan2014
Kulum alin-11 jan2014
 
Linear Regression.pptx
Linear Regression.pptxLinear Regression.pptx
Linear Regression.pptx
 
Manu maths ppt
Manu maths pptManu maths ppt
Manu maths ppt
 
Brief review on matrix Algebra for mathematical economics
Brief review on matrix Algebra for mathematical economicsBrief review on matrix Algebra for mathematical economics
Brief review on matrix Algebra for mathematical economics
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithms
 
Applied Mathematics 3 Matrices.pdf
Applied Mathematics 3 Matrices.pdfApplied Mathematics 3 Matrices.pdf
Applied Mathematics 3 Matrices.pdf
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
Chip Package Co-simulation
Chip Package Co-simulationChip Package Co-simulation
Chip Package Co-simulation
 
TINET_FRnOG_2008_public
TINET_FRnOG_2008_publicTINET_FRnOG_2008_public
TINET_FRnOG_2008_public
 
20070823
2007082320070823
20070823
 
Blind Source Separation using Dictionary Learning
Blind Source Separation using Dictionary LearningBlind Source Separation using Dictionary Learning
Blind Source Separation using Dictionary Learning
 
Matrix Algebra : Mathematics for Business
Matrix Algebra : Mathematics for BusinessMatrix Algebra : Mathematics for Business
Matrix Algebra : Mathematics for Business
 
Oed chapter 1
Oed chapter 1Oed chapter 1
Oed chapter 1
 
Encoding survey
Encoding surveyEncoding survey
Encoding survey
 
Vectorise all the things
Vectorise all the thingsVectorise all the things
Vectorise all the things
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
Linear_Algebra.pptx
Linear_Algebra.pptxLinear_Algebra.pptx
Linear_Algebra.pptx
 

More from AliZaib71

Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptxAliZaib71
 
Computer_Graphics_circle_drawing_techniq.ppt
Computer_Graphics_circle_drawing_techniq.pptComputer_Graphics_circle_drawing_techniq.ppt
Computer_Graphics_circle_drawing_techniq.pptAliZaib71
 
CS911-Lecture-21_43709.pptx
CS911-Lecture-21_43709.pptxCS911-Lecture-21_43709.pptx
CS911-Lecture-21_43709.pptxAliZaib71
 
Wireless Networks - CS718 Power Point Slides Lecture 02.ppt
Wireless Networks - CS718 Power Point Slides Lecture 02.pptWireless Networks - CS718 Power Point Slides Lecture 02.ppt
Wireless Networks - CS718 Power Point Slides Lecture 02.pptAliZaib71
 
Web-01-HTTP.pptx
Web-01-HTTP.pptxWeb-01-HTTP.pptx
Web-01-HTTP.pptxAliZaib71
 

More from AliZaib71 (6)

Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 
Computer_Graphics_circle_drawing_techniq.ppt
Computer_Graphics_circle_drawing_techniq.pptComputer_Graphics_circle_drawing_techniq.ppt
Computer_Graphics_circle_drawing_techniq.ppt
 
SE UML.ppt
SE UML.pptSE UML.ppt
SE UML.ppt
 
CS911-Lecture-21_43709.pptx
CS911-Lecture-21_43709.pptxCS911-Lecture-21_43709.pptx
CS911-Lecture-21_43709.pptx
 
Wireless Networks - CS718 Power Point Slides Lecture 02.ppt
Wireless Networks - CS718 Power Point Slides Lecture 02.pptWireless Networks - CS718 Power Point Slides Lecture 02.ppt
Wireless Networks - CS718 Power Point Slides Lecture 02.ppt
 
Web-01-HTTP.pptx
Web-01-HTTP.pptxWeb-01-HTTP.pptx
Web-01-HTTP.pptx
 

Recently uploaded

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Recently uploaded (20)

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

Lecture-31.pptx

  • 1. INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 31 Dimensionality reduction 1
  • 2. ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the underline sources 1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze 2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell 3. “Modern information retrieval” by Baeza-Yates Ricardo, 4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
  • 3. Outline • Dimensionality reduction • Random projection onto k<<m axes • Computing the random projection • Latent semantic indexing (LSI) • The matrix • Dimension reduction 3
  • 4. Dimensionality reduction • What if we could take our vectors and “pack” them into fewer dimensions (say 10000100) while preserving distances? • (Well, almost.) • Speeds up cosine computations. • Two methods: • Random projection. • “Latent semantic indexing”.
  • 5. Random projection onto k<<m axes. • Choose a random direction x1 in the vector space. • For i = 2 to k, • Choose a random direction xi that is orthogonal to x1, x2, … xi-1 . • Project each doc vector into the subspace x1, x2, … xk.
  • 6. E.g., from 3 to 2 dimensions D2 D1 x1 t 3 x2 t 2 t 1 x1 x2 D2 D1 x1 is a random direction in (t1,t2,t3) space. x2 is chosen randomly but orthogonal to x1.
  • 7. Guarantee • With high probability, relative distances are (approximately) preserved by projection. • Pointer to precise theorem in Resources.
  • 8. Computing the random projection • Projecting n vectors from m dimensions down to k dimensions: • Start with m  n matrix of terms  docs, A. • Find random k  m orthogonal projection matrix R. • Compute matrix product W = R  A. • jth column of W is the vector corresponding to doc j, but now in k << m dimensions.
  • 9. Cost of computation • This takes a total of kmn multiplications. • Expensive - see Resources for ways to do essentially the same thing, quicker. • Exercise: by projecting from 10000 dimensions down to 100, are we really going to make each cosine computation faster? • Size of the vectors would decrease • Will result into smaller postings
  • 10. Latent semantic indexing (LSI) • Another technique for dimension reduction • Random projection was data-independent • LSI on the other hand is data-dependent • Eliminate redundant axes • Pull together “related” axes • car and automobile
  • 11. Notions from linear algebra • Matrix, vector • Matrix transpose and product • Rank • Eigenvalues and eigenvectors.
  • 12. The matrix 12 • The matrix has rank 2: the first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third) so the rank must be less than 3. • The matrix has rank 1: there are nonzero colums. So the rank is positive. but any pair of columns is linearly dependent.
  • 13. The matrix Similarly, the transpose of A has rank 1. Indeed, since the column vectors of A are the row vectors of the transpose of A. the statement that the column rank of a matrix equals its row rank is equivalent to the statement that the rank of a matrix is equal to the rank of its transpose, i.e., rk(A) = rk(A). 13
  • 14. Singular-Value Decomposition • Recall m  n matrix of terms  docs, A. • A has rank r  m,n. • Define term-term correlation matrix T=AAt • At denotes the matrix transpose of A. • T is a square, symmetric m  m matrix. • Doc-doc correlation matrix D=AtA. • D is a square, symmetric n  n matrix. Why?
  • 15. Eigenvectors • Denote by P the m  r matrix of eigenvectors of T. • Denote by R the n  r matrix of eigenvectors of D. • It turns out A can be expressed (decomposed) as A = PQRt • Q is a diagonal matrix with the eigenvalues of AAt in sorted order.
  • 16. Eigen Vectors • The transformation matrix preserves the direction of • vectors parallel to (in blue) and [ _1 1] (in violet). The • points that lie on the line through the origin, parallel to an eigenvector, remain on the line after the transformation. The vectors in red are not eigenvectors, therefore their direction is altered by the transformation. See also: An extended version. showing all four quadrants.
  • 17. Eigen Vectors and Eigen Values • Av = λv
  • 18. Visualization = A P Q Rt mn mr rr rn
  • 19. Dimension reduction • For some s << r, zero out all but the s biggest eigenvalues in Q. • Denote by Qs this new version of Q. • Typically s in the hundreds while r could be in the (tens of) thousands. • Let As = P Qs Rt • Turns out As is a pretty good approximation to A.
  • 20. Visualization = As P Qs Rt 0 The columns of As represent the docs, but in s<<m dimensions. 0 0
  • 21. Guarantee • Relative distances are (approximately) preserved by projection: • Of all m  n rank s matrices, As is the best approximation to A. • Pointer to precise theorem in Resources.
  • 22. Doc-doc similarities •As As t is a matrix of doc-doc similarities: •the (j,k) entry is a measure of the similarity of doc j to doc k.
  • 23. Semi-precise intuition • We accomplish more than dimension reduction here: • Docs with lots of overlapping terms stay together • Terms from these docs also get pulled together. • Thus car and automobile get pulled together because both co-occur in docs with tires, radiator, cylinder, etc.
  • 24. Query processing • View a query as a (short) doc: • call it row 0 of As. • Now the entries in row 0 of As As t give the similarities of the query with each doc. • Entry (0,j) is the score of doc j on the query. • Exercise: fill in the details of scoring/ranking. • LSI is expensive in terms of computation… • Randomly choosing a subset of documents for dimensional reduction can give a significant boost in performance.
  • 25. Resources • Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html • Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html • Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html • Books: MG 4.6, MIR 2.7.2.

Editor's Notes

  1. 00:02:40  00:03:05
  2. 00:03:20  00:03:30 00:03:40  00:04:05 00:04:10  00:05:20 00:06:30  00:06:50
  3. 00:07:04  00:08:10 00:08:50  00:09:15
  4. 00:09:43  00:10:05 00:10:25  00:10:35
  5. 00:10:45  00:11:20 00:11:30  00:13:25
  6. 00:13:45  00:14:15 00:14:25  00:14:45
  7. 00:15:56  00:16:10 00:16:20  00:16:45(random) 00:18:30  00:18:45 (lsi) 00:19:04  00:19:45 (eliminate) 00:20:50  00:21:05 (pulll)
  8. 00:21:35  00:21:48 (layering)
  9. 00:22:30  00:23:00 00:23:15  00:24:33
  10. 00:26:15  00:26:29
  11. 00:27:10  00:28:00 (recall) 00:28:25  00:29:00 (define) 00:29:05  00:30:10 (doc-doc) The column rank of a matrix A is the maximum number of linearly independent column vectors of A. The row rank of A is the maximum number of linearly independent row vectors of A. Equivalently, the column rank of A is the dimension of the column space of A, while the row rank of A is the dimension of the row space of A. A result of fundamental importance in linear algebra is that the column rank and the row rank are always equal. (Two proofs of this result are given below.) This number (i.e., the number of linearly independent rows or columns) is simply called the rank of A. The matrix has rank 2: the first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third) so the rank must be less than 3. The matrix has rank 1: there are nonzero columns, so the rank is positive, but any pair of columns is linearly dependent. Similarly, the transpose of A has rank 1. Indeed, since the column vectors of A are the row vectors of the transpose of A, the statement that the column rank of a matrix equals its row rank is equivalent to the statement that the rank of a matrix is equal to the rank of its transpose, i.e., rk(A) = rk(AT).
  12. 00:31:05  00:31:40 (donate by p) 00:31:48  00:32:00 (donate by r) 00:32:12  00:33:15 (it turns)
  13. 00:33:00  00:34:55 00:35:05  00:35:35
  14. 00:36:25  00:36:50 00:37:15  00:39:00
  15. 00:40:05  00:40:20 00:41:10  00:42:37
  16. 00:43:45  00:44:25
  17. 00:47:40  00:48:25
  18. 00:49:50  00:50:25
  19. 00:51:20  00:51:40
  20. 00:52:10  00:52:50 00:53:20  00:53:40
  21. 00:54:15  00:55:00