Converting High Dimensional Problems to Low Dimensional Ones
General Paradigm: Reduce and Conquer
• Large Problem → Small Problem
  – Break array into two parts
  – Consider odd and even elements
  – Sample edges in a graph to obtain a smaller graph
  – Represent a graph by a collection of trees
  – Take number modulo small prime
  – Multiply matrix by a random vector
  – Project high dimensional point sets into fewer dimensions
The Problem
• Given n points in D dimensional space
• Project them into d << D dimensions
  – So that the (Euclidean) distance between every pair of points is (almost) preserved
• How does d compare to n?
Application
• Hierarchical Clustering
• Say ten thousand samples, each over a few million SNPs
• Few million dimensions → a few hundred/thousand? And fast?
First Attempt
• Can we make d=n-1?
  – X axis through 2 of the points
  – Y axis so 3rd point is in the XY plane
  – Z axis so 4th point is in the XYZ 3d space
  – And so on
First Attempt
• Time taken
  – Each new axis has to be made orthogonal to all previous axes
  – O(n^2 D)
  – Too slow
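The axis-by-axis construction above can be sketched with Gram-Schmidt orthogonalization. This is only an illustrative sketch, not code from the slides: the names `dot` and `embed_exactly` and the tolerance `tol` are assumptions introduced here. Each point is stripped of its components along all earlier axes, which costs O(n D) per point and O(n^2 D) overall.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embed_exactly(points, tol=1e-9):
    """Return coordinates of `points` in an orthonormal basis of their span.

    Gram-Schmidt is run on the differences p_i - p_0, so at most n-1 axes
    are created and pairwise distances are preserved up to rounding.
    (Hypothetical helper; `tol` is an assumed cutoff for a zero residual.)
    """
    origin = points[0]
    basis = []   # orthonormal axes found so far
    coords = []  # new coordinates of every point
    for p in points:
        residual = [a - b for a, b in zip(p, origin)]
        c = []
        # Remove the component along every previous axis: O(n D) per point.
        for axis in basis:
            proj = dot(residual, axis)
            c.append(proj)
            residual = [x - proj * y for x, y in zip(residual, axis)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:
            # The point sticks out of the current subspace: add a new axis.
            basis.append([x / norm for x in residual])
            c.append(norm)
        coords.append(c)
    width = len(basis)
    # Pad with zeros so that all points have the same number of coordinates.
    return [c + [0.0] * (width - len(c)) for c in coords]
```

For n points in general position this returns n-1 coordinates per point, which is exact but needs the quadratic amount of work noted above.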
Second Attempt: Use Random Projections
• Take d random vectors r1..rd
• For every point p, take the d dimensional point
  – [ p.r1 p.r2 .. p.rd ] * scaling-factor
• Do these d-dim points preserve inter-point distances approximately? How large should d be?
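Random Projections: Further Simplification
• Take any vector p in D dimensions
• Suppose we show
  – [ p.r1 p.r2 .. p.rd ] * scaling-factor has length ~ |p|
  – Failure prob < 1/n^3
• The projection is linear, so the projection of a difference vector p-q is the difference of the projections of p and q
• There are fewer than n^2 difference vectors (one per pair of points), so the prob that even one of their lengths is not preserved is < n^2/n^3 = 1/n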
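Written out, the step from a single vector to all pairs is a union bound. The following is a sketch of that arithmetic, assuming only the per-vector failure probability of 1/n^3 stated above:

```latex
\Pr[\text{some pair fails}]
  \;\le\; \sum_{i<j} \Pr[\text{pair } (i,j) \text{ fails}]
  \;<\; \binom{n}{2} \cdot \frac{1}{n^{3}}
  \;<\; \frac{n^{2}}{n^{3}}
  \;=\; \frac{1}{n}
```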
Random Projections: What is a random vector?
• No directional bias
Normal Distributions
• Pr of being between x and x+dx
  – For N(0,1), ~ e^(-x^2/2) dx
Generating Random Vectors without Directional Bias
• Take D numbers (X1...XD), each N(0,1), independently
• Distribution of each number X
  – Pr of being between a..a+da ~ e^(-a^2/2) da
• Pr of X1 in a1..a1+da1, X2 in a2..a2+da2, ..., XD in aD..aD+daD
  – ~ e^(-a1^2/2) e^(-a2^2/2) … e^(-aD^2/2) da1 da2 … daD
  – = e^(-(a1^2 + a2^2 + … + aD^2)/2) da1 da2 … daD
  – = e^(-l^2/2) da1 da2 … daD, where l is the length of [a1 a2 … aD]
• So no dependence on direction, only on length l!
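A small numeric check of this claim, as a sketch: the function names `normal_density` and `joint_density` are illustrative, not from the slides. Two vectors that point in different directions but have the same length 5 get the same joint density.

```python
import math

def normal_density(x):
    """Density of N(0,1) at x."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def joint_density(a):
    """Density of independent N(0,1) coordinates at the point a (a product)."""
    density = 1.0
    for x in a:
        density *= normal_density(x)
    return density

u = [3.0, 4.0, 0.0]   # length 5
v = [0.0, 0.0, 5.0]   # length 5, different direction
same_density = math.isclose(joint_density(u), joint_density(v))
```

Here `same_density` comes out true because both products reduce to the same function of the squared length, 25.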
The Algorithm
• Take d random vectors r1..rd
  – Each ri = [Xi1 Xi2 … XiD] where the X’s are chosen from N(0,1) independently
• For every point p, take the d dimensional point
  – [ p.r1 p.r2 .. p.rd ] * sqrt(1/d)
• Time: n*d*D
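The algorithm above can be sketched in a few lines of standard-library Python. The function name `random_projection`, the `rng` parameter, and the sizes in the usage lines are assumptions made for illustration; a practical version would likely use a matrix library instead of nested lists.

```python
import math
import random

def random_projection(points, d, rng=random):
    """Project each D-dimensional point onto d Gaussian random vectors."""
    D = len(points[0])
    # d random vectors r_1..r_d, each with D independent N(0,1) entries.
    r = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(d)]
    scale = math.sqrt(1.0 / d)
    # Output coordinate i of a point p is the dot product p.r_i times sqrt(1/d).
    return [
        [scale * sum(x * y for x, y in zip(p, ri)) for ri in r]
        for p in points
    ]

# Usage: two points in D = 50 dimensions, projected with d = 400 so that the
# effect is easy to see (the sizes are arbitrary choices for this sketch).
rng = random.Random(0)
p = [rng.uniform(-1, 1) for _ in range(50)]
q = [rng.uniform(-1, 1) for _ in range(50)]
low_p, low_q = random_projection([p, q], d=400, rng=rng)
ratio = math.dist(low_p, low_q) / math.dist(p, q)  # expected to be close to 1
```

The same d random vectors are reused for every point, which is what makes the map linear and gives the n*d*D running time.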
Simplifying Further
• Take any vector p in D dimensions
• We need to show that
  – [ p.r1 p.r2 .. p.rd ] * sqrt(1/d) has length ~ |p|
  – Failure prob < 1/n^3
• We can assume p to be [1 0 0 0 0 0 …]
  – because random vectors have no directional bias, and scaling p scales both lengths by the same factor
  – Then [ p.r1 p.r2 .. p.rd ] * sqrt(1/d) = [X11 X21 … Xd1] * sqrt(1/d)
Analysis
• We need to show that
  – [X1 X2 … Xd] * sqrt(1/d) has length ~ 1
  – Failure prob < 1/n^3
• Or (X1^2+…+Xd^2)/d ~ 1, failure prob < 1/n^3
• Or (X1^2+…+Xd^2) ~ d, failure prob < 1/n^3
• Note Xi^2 has mean 1 and s.d. sqrt(2) (Xi itself has mean 0 and s.d. 1)
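The mean and s.d. of the square of an N(0,1) variable X follow from its moments. A short derivation, using the standard fact that the fourth moment of N(0,1) is 3:

```latex
E[X^{2}] = \operatorname{Var}(X) + (E[X])^{2} = 1 + 0 = 1
\\
\operatorname{Var}(X^{2}) = E[X^{4}] - \left(E[X^{2}]\right)^{2} = 3 - 1 = 2
\\
\operatorname{s.d.}(X^{2}) = \sqrt{2}
```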
Law of Large Numbers
• Y1..Yd independent, each with any (decent) distribution with mean 1 and s.d. sqrt(2)
• Then, by the Central Limit Theorem, Y1+…+Yd tends to a Normal distribution with mean d and s.d. sqrt(2d) (for large d)
• Pr (Y1+…+Yd not in (1-∆)d .. (1+∆)d) <
  – e^(-(∆d)^2 / (2 * 2d)) = e^(-∆^2 d/4)
• Choose d = 12 ln n/∆^2; the bound is then e^(-3 ln n) = 1/n^3, as needed
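To get a feel for the numbers, the choice of d can be evaluated directly. This is a sketch with illustrative names (`target_dimension`, `failure_bound`); the values n = 10,000 and ∆ = 0.5 are assumptions chosen to echo the clustering application, not figures from the slides.

```python
import math

def target_dimension(n, delta):
    """Smallest integer d with d >= 12 ln(n) / delta^2."""
    return math.ceil(12 * math.log(n) / delta ** 2)

def failure_bound(d, delta):
    """The tail bound e^(-delta^2 d / 4) for a single vector."""
    return math.exp(-delta ** 2 * d / 4)

n, delta = 10_000, 0.5
d = target_dimension(n, delta)        # 443
bound = failure_bound(d, delta)       # at most 1/n^3
```

Note that d depends on n only through ln n and does not depend on D at all, while a tighter tolerance is expensive: with ∆ = 0.1 the same formula gives d = 11053.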
Conclusion
• n points in D dimensions
  – can be projected to 12 ln n/∆^2 dimensions
  – all distances stretch only by a factor of (1 ± ∆)
  – with prob > 1-1/n