# Applied machine learning for search engine relevance 3

Applied Machine Learning for Search Relevance and Recommender Systems

Published in: Technology

### Applied machine learning for search engine relevance 3

1. 1. Applied Machine Learning for Search Engine Relevance. Charles H Martin, PhD
2. 2. Relevance as a Linear Regression. Model*: $r = X^\dagger w + e$, where x is a (tf-idf) bag-of-words vector (x = 1 query), r is the relevance score (e.g. 1/-1), and w is the weight vector. Form X from data (e.g. a group of queries). Solution via the Moore-Penrose pseudoinverse: $w = (X^\dagger X)^{-1} X^\dagger r$. Solve as a numerical minimization (e.g. iterative methods like SOR, CG, etc.): $\min_w \|X^\dagger w - r\|_2^2$, where $\|w\|_2$ is the 2-norm of w. *Actually we will model and predict pairwise relations and not exact rank... stay tuned.
3. 3. Relevance as a Linear Regression: Tikhonov Regularization. $w = (X^\dagger X)^{-1} X^\dagger r$. Problem: the inverse may not exist (numerical instabilities, poles). Solution: add a constant $a$ to the diagonal of $X^\dagger X$ before inverting: $w = (X^\dagger X + aI)^{-1} X^\dagger r$, where $a$ is a single, adjustable smoothing parameter. Equivalent minimization problem: $\min_w \|X^\dagger w - r\|_2^2 + a\|w\|_2^2$. More generally: form (something like) $X^\dagger X + G^\dagger G + aI$, which is a self-adjoint, bounded operator => $\min_w \|X^\dagger w - r\|_2^2 + a\|Gw\|_2^2$, i.e. $G$ is chosen to avoid over-fitting.
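
The regularized closed form on this slide can be sketched numerically. The snippet below is a minimal illustration, not the author's implementation: it assumes a design matrix `X` with one tf-idf row per document (so the model reads `r ≈ X w`), and the function name `ridge_weights` and the toy data are hypothetical.

```python
import numpy as np

def ridge_weights(X, r, a):
    """Solve (X^T X + a I) w = X^T r for the weight vector w.

    X: (n_docs, n_features) matrix, one tf-idf row per document (assumed layout).
    r: (n_docs,) relevance labels, e.g. +1 / -1.
    a: non-negative smoothing parameter added to the diagonal.
    """
    n_features = X.shape[1]
    gram = X.T @ X + a * np.eye(n_features)
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(gram, X.T @ r)

# Toy data: the third feature duplicates the first, so X^T X is singular
# and the unregularized inverse does not exist.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [2.0, 0.0, 2.0]])
r = np.array([1.0, -1.0, 1.0, 1.0])

w = ridge_weights(X, r, a=0.1)
scores = X @ w  # predicted relevance scores for the training documents
```

With `a = 0` the same call would fail on this data because the Gram matrix is rank deficient; any positive `a` makes the system solvable, which is the point of the smoothing parameter.
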
4. 4. The Representer Theorem Revisited: Kernels and Green's Functions. Problem: estimate a function $f(x)$ from training data $(x_i)$. Solution: solve a general minimization problem, $\min \mathrm{Loss}[(f(x_i), y_i)] + a\|Gf\|^2$, whose solution has the form $f(x) = \sum_i a_i R(x, x_i)$, with $R$ := kernel. Equivalent to: $\min \mathrm{Loss}[(f(x_i), y_i)] + a^T K a$ with $K_{ij} = R(x_i, x_j)$, given a linear regularization operator ($G: H \to L_2(x)$), where $K$ is an integral operator, $(Kf)(y) = \int R(x,y) f(x)\,dx$, so $K$ is the Green's function for $(GG)^\dagger$, or $G = (K^{1/2})^\dagger$. In Dirac notation: $R(x,y) = \langle y|(GG)^\dagger|x\rangle$. More generally, $f(x) = \sum_i a_i R(x,x_i) + \sum_u b_u u(x)$, where the $u$ span the null space of $G$. Reference: Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf 2006).
5. 5. Personalized Relevance Algorithms: eSelf Personality Subspace. Compute personality traits during a user visit to a web site; q values = stored learned "personality traits". Provide relevance rankings (for pages or ads) which include personality traits. Figure: p = pages and ads, q = personality traits. Example: a user reading a music site (rock-n-roll, hard rock) has learned traits (likes cars 0.4) (sports cars 0.3), with the sports cars trait updated 0.0 => 0.3; present the user with a used sports car ad.
6. 6. Personalized Relevance Algorithms: eSelf Personality Subspace. Model: $L\,[p, q] = [h, u]$, where $L$ is a square matrix. h: history (observed outputs). p: output nodes (observables), e.g. web pages, classified ads, … q: hidden nodes (not observed), i.e. individualized personality traits. u: user segmentation.
7. 7. Personalized Search: Effective Regression Problem. Block form: $\begin{pmatrix} PLP & PLQ \\ QLP & QLQ \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = \begin{pmatrix} h \\ u \end{pmatrix}$. With $u = 0$: $PLP\,p + PLQ\,q = h$ and $QLP\,p + QLQ\,q = 0$. Formal solution: $L_{eff}\,p = h$ with $L_{eff} = PLP - PLQ\,(QLQ)^{-1}\,QLP$, so $p = (L_{eff}[q,u])^{-1} h$. On each time step (t): $[p, q](t) = (L_{eff}[q(t-1)])^{-1} \cdot [h, u](t)$. => Adapts on each visit, finding relevant pages p(t) based on the links L and the learned personality traits q(t-1). Regularization of PLP is achieved with a "Green's Function / Resolvent Operator", i.e. $G^\dagger G \simeq PLQ\,(QLQ)^{-1}\,QLP$. Equivalent to a Gaussian Process on a Graph, and/or Bayesian Linear Regression.
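
The elimination of the hidden block can be checked numerically. Solving the second block row for q gives $q = -(QLQ)^{-1} QLP\,p$ (with u = 0), and substituting into the first row leaves an effective operator $PLP - PLQ\,(QLQ)^{-1}\,QLP$ acting on p alone. The sketch below uses a random symmetric positive definite matrix as a stand-in for L; the block sizes and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 3, 4                       # sizes of the observed (P) and hidden (Q) blocks
A = rng.normal(size=(n_p + n_q, n_p + n_q))
L = A @ A.T + np.eye(n_p + n_q)       # symmetric positive definite stand-in for L

PLP, PLQ = L[:n_p, :n_p], L[:n_p, n_p:]
QLP, QLQ = L[n_p:, :n_p], L[n_p:, n_p:]
h = rng.normal(size=n_p)              # observed history on the P block

# Effective operator on the P block after eliminating q (with u = 0).
L_eff = PLP - PLQ @ np.linalg.solve(QLQ, QLP)
p_eff = np.linalg.solve(L_eff, h)

# Reference: solve the full block system L [p, q] = [h, 0] directly.
full = np.linalg.solve(L, np.concatenate([h, np.zeros(n_q)]))
p_full, q_full = full[:n_p], full[n_p:]
```

Both routes give the same p, but the effective problem only has the dimension of the observed block.
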
8. 8. Related Dimensional Noise Reductions: Rank (k) Approximations of a Matrix. Latent Semantic Analysis (LSA), i.e. (truncated) Singular Value Decomposition (SVD): diagonalize the density operator $D = A^\dagger A$ and retain a subset of (k) eigenvalues/vectors. Equivalent relations for SVD: optimal rank (k) approximation $X$ s.t. $\min \|D - X\|_2^2$; decomposition $A = U \Sigma V^\dagger$, $A^\dagger A = V (\Sigma^\dagger \Sigma) V^\dagger$; block form $\begin{pmatrix} PDP & PDQ \\ QDP & QDQ \end{pmatrix}$. Can generalize to various noise models, e.g. VLSA*, PLSA**. VLSA* provides a rank (k) approximation for any query q: $\min E[\|q^T (D - X)\|_2^2]$. PLSA** provides a rank (k) approximation of classes (z): $\min D_{KL}[P - P(\mathrm{data})]$, with $P = U \Sigma V^\dagger$, $P(d,w) = P(d|z)\,P(z)\,P(w|z)$, and $D_{KL}$ = Kullback–Leibler divergence. *Variable Latent Semantic Indexing (Yahoo! Research Labs), http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf **Probabilistic Latent Semantic Indexing (Recommind, Inc), http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
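
A rank (k) approximation via truncated SVD can be sketched in a few lines. This is an illustration only: the helper name `truncated_svd` and the small matrix `A` are hypothetical, and plain NumPy is assumed rather than any particular LSA library.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the k leading singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical small term-document style matrix.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
k = 2

U_k, s_k, Vt_k = truncated_svd(A, k)
A_k = U_k @ np.diag(s_k) @ Vt_k            # rank-k approximation of A

s_all = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, ord=2)       # spectral-norm error of the truncation

# Eigenvalues of the density operator D = A^T A are the squared singular values.
D_evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
```

The truncation error in the spectral norm equals the first discarded singular value, which is the sense in which the rank (k) approximation is optimal.
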
9. 9. Personalized Relevance: Mobile Services and Advertising. France Telecom: Last Inch Relevance Engine. Figure labels: time, location; suggest: play game, send msg, play song, …
10. 10. KA for Comm Services • Based on an Empirical Bayesian score and a suggestion mapping table, a decision is made to suggest one or more possible comm services. • Based on Business Intelligence (BI) data mining and/or pattern recognition algorithms (e.g. supervised or unsupervised learning), we compute statistical scores indicating who are the most likely people to call or to send an SMS, MMS, or e-mail. Figure: q events = personal context (Sun. mornings); p events = comm services (Call [who], SMS [who], MMS [who]); map to a contextual comm service; suggestions for the user (Call, SMS, MMS, E-mail). Learned traits: on Sunday morning, most likely to call Mom. Example scores: Mom (5), Bob (3), Phone company (1).
11. 11. Comm/Call Patterns: POD calls to different #'s. Figure axes: LOC, Day of Week. p( |POD) > p( |SUN); p( |POD) < p( |SUN); p( , P|POD) > 0; p( , ) = 0
12. 12. Bayesian Score Estimation. To estimate p(call|POD): Frequency: p(call|POD) = # of times the user called someone at that POD. Bayesian: $p(\mathrm{call}|\mathrm{POD}) = \dfrac{p(\mathrm{POD}|\mathrm{call})\,p(\mathrm{call})}{\sum_q p(\mathrm{POD}|q)\,p(q)}$, where q = call, sms, mms, or email.
13. 13. i.e. Bayesian Choice Estimator • We seek to know the probability of a "call" (choice) at a given POD. • We "borrow information" from other PODs, assuming this is less biased, to improve our statistical estimate. Example: 5 days, 3 PODs, 3 choices. Frequency estimator: f( | 1) = 2/5. Bayesian choice estimator: p( | 1) = $\dfrac{(2/5)(3/15)}{(2/5)(3/15) + (2/5)(3/15) + (1/5)(11/15)} = 6/23 \approx 1/4$. Note: the Bayesian estimate is significantly lower because we now expect we might see a [icon] at POD 1.
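
The arithmetic on this slide can be reproduced with exact fractions. The three products are copied from the slide; reading each pair as (likelihood, prior) is an assumption made here for labeling only.

```python
from fractions import Fraction as F

# Pairs of factors for the three choices at POD 1, copied from the slide.
# Labeling them (likelihood, prior) is an assumption.
terms = [(F(2, 5), F(3, 15)),
         (F(2, 5), F(3, 15)),
         (F(1, 5), F(11, 15))]

frequency_estimate = F(2, 5)

numerator = terms[0][0] * terms[0][1]
evidence = sum(lik * prior for lik, prior in terms)   # normalizing sum over choices
bayes_estimate = numerator / evidence

print(bayes_estimate)  # 6/23
```

The Bayesian estimate (6/23, roughly 0.26) is lower than the raw frequency (2/5) because the normalizing sum includes the third choice, which dominates the pooled data.
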
14. 14. Incorporating Feedback • It is not enough to simply recognize call patterns in the Event Facts; it is also necessary to incorporate feedback into our suggestion scores. • p( c | user, pod, loc, facts, feedback ) = ? A: Simply factorize: p( c | user, pod, facts, feedback ) = p( c | user, pod, facts ) p( c | user, pod, feedback ). Evaluate the probabilities independently, perhaps using different Bayesian models. Figure: Event Facts -> Suggestions; feedback ratings: random, irrelevant, poor, good.
15. 15. Personalized Relevance: Empirical Bayesian Models. Closed form models: correct a sample estimate (mean $m_y$, variance $\sigma^2$) with a weighted average of the sample and the complete dataset: $\hat{m} = B\,m_y + (1-B)\,m$, where $B$ is the shrinkage factor and $m$ is the mean of the complete dataset. The sample can be, e.g., an individual or a user segment. Can rank order mobile services (play game, send msg, play song) based on the estimated likelihood $(m, \sigma)$.
16. 16. Personalized Relevance: Empirical Bayesian Models. What is Empirical Bayes modeling? Specify likelihood $L(y|\theta)$ and prior $\pi(\theta)$ distributions, then estimate the posterior $\pi(\theta|y) = \dfrac{L(y|\theta)\,\pi(\theta)}{\int L(y|\theta)\,\pi(\theta)\,d\theta}$, where the denominator is the marginal. Combines Bayesianism and frequentism: approximates the marginal (or posterior) using a point estimate (MLE), Monte Carlo, etc.; estimates the marginal using empirical data; uses empirical data to infer the prior, then plugs it into the likelihood to make predictions. Note: special case of Effective Operator Regression: P space ~ Q space; PLQ = I; u → 0; Q-space defines the prior information.
17. 17. Empirical Bayesian Methods: Poisson Gamma Model. Likelihood $L(y|\lambda)$ = Poisson distribution: $\lambda^y e^{-\lambda} / y!$. Conjugate prior $\pi(\lambda; a, b)$ = Gamma distribution: $\lambda^{a-1} e^{-\lambda/b} / (b^a\,\Gamma(a))$, $\lambda > 0$. Posterior $\pi(\lambda|y) \propto L(y|\lambda)\,\pi(\lambda) \propto \lambda^{y+a-1} e^{-\lambda(1 + 1/b)}$, also a Gamma distribution $(a', b')$ with $a' = y + a$ and $b' = (1 + 1/b)^{-1}$. Take the MLE estimate of the marginal = mean (m) of the posterior (ab). Obtain a, b from the mean ($m = ab$) and variance ($ab^2$) of the complete data. The final point estimate $E(y) = a'b'$ for a sample is a weighted average of the sample mean $y = m_y$ and the prior mean $m$: $E(y) = (m_y + a)(1 + 1/b)^{-1} = \dfrac{b}{1+b}\,m_y + \dfrac{1}{1+b}\,m$.
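
The shrinkage identity on this slide can be checked with a short calculation. The sketch below assumes the scale form of the Gamma prior (mean ab, variance ab²) and fits a, b by matching the mean and variance of the complete data; the counts and function names are hypothetical.

```python
import numpy as np

def gamma_prior_from_data(counts):
    """Match moments: mean = a*b and variance = a*b**2 (scale form)."""
    m = np.mean(counts)
    v = np.var(counts)
    b = v / m
    a = m / b
    return a, b

def posterior_mean(y, a, b):
    """Posterior Gamma(a', b') mean with a' = y + a and b' = (1 + 1/b)^-1."""
    a_post = y + a
    b_post = 1.0 / (1.0 + 1.0 / b)
    return a_post * b_post

all_counts = np.array([0, 1, 1, 2, 3, 5, 8, 4])   # hypothetical complete data
a, b = gamma_prior_from_data(all_counts)
m = a * b                                          # prior mean

y = 6                                              # one sample's observed count
shrunk = posterior_mean(y, a, b)
weighted = (b / (1 + b)) * y + (1 / (1 + b)) * m   # weighted-average form
```

For these counts the prior has a = 1.5 and b = 2, so an observed count of 6 is shrunk toward the prior mean of 3, giving a point estimate of 5.
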
18. 18. Linear Personality Matrix. Linear (or non-linear) matrix transformation from suggestions to actions: $M s = a$ (figure labels: events, suggestions, action). Notice: the personality matrix may or may not mix suggestions across events, and can include semantic information. Over time, we can estimate $M_{a,s} = \mathrm{prob}(a|s)$, e.g. for calls (s1 = call, s2 = sms, s3 = mms, s4 = email): for a given time and location, count how many times we suggested a call but the user chose an email instead. We can then solve for prob(s) using a computational linear solver: $s = M^{-1} a$. Obviously we would like M to be diagonal… or as good as possible! Can we devise an algorithm that will learn to give "optimal" suggestions?
19. 19. Matrices for Pattern Recognition (Statistical Factor Analysis). We can apply computational linear algebra to remove noise and find patterns in data. Called Factor Analysis by statisticians, Singular Value Decomposition (SVD) by engineers. Implemented in Oracle Data Mining (ODM) as Non-Negative Matrix Factorization. Steps: 1. Enumerate all choices (Call on Mon @ pod 1, Call on Mon @ pod 2, Call on Mon @ pod 3, …, SMS on Tue @ pod 1, …). 2. Count the # of times a choice is made each week (Week 1, 2, 3, 4, 5, …). 3. Form the weekly choice density matrix $A^T A$. 4. Weekly patterns are collapsed into the density matrix $A^T A$; they can be detected using spectral analysis (i.e. principal eigenvalues). Figure labels: All weekly patterns; Pure Noise. Similar to Latent (Multinomial) Dirichlet Allocation (LDA), but much simpler to implement. Suitable when the number (#) of choices is not too large and the patterns are weekly.
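
The four steps above can be sketched on synthetic data. Everything below is hypothetical: the number of weeks, the "weekly habit" vector, and the Poisson noise level are made up for illustration, and a plain eigen-decomposition of $A^T A$ stands in for whatever factorization a production system would use.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, n_choices = 52, 6
pattern = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 1.0])   # hypothetical weekly habit
noise = rng.poisson(0.2, size=(n_weeks, n_choices))  # sporadic extra choices
A = np.tile(pattern, (n_weeks, 1)) + noise           # rows = weeks, cols = choices

D = A.T @ A                                           # weekly choice density matrix
evals, evecs = np.linalg.eigh(D)                      # eigenvalues in ascending order
top_vec = evecs[:, -1]
if top_vec.sum() < 0:                                 # fix the arbitrary sign
    top_vec = -top_vec

# How well the principal eigenvector recovers the underlying pattern.
cosine = top_vec @ pattern / np.linalg.norm(pattern)
```

The recurring pattern shows up as one dominant eigenvalue, while the remaining eigenvalues stay at the noise level, which is the separation the slide's figure refers to.
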
20. 20. Search Engine Relevance: Listing on. Which 5 items to list at the bottom of the page?
21. 21. Statistical Machine Learning: Support Vector Machines (SVM). From Regression to Classification: Maximum Margin Solutions. Classification := find the line that separates the points with the maximum margin ($2/\|w\|_2$): $\min \tfrac{1}{2}\|w\|_2^2$ subject to constraints: all "above" the line, $w \cdot x_i - b \ge +1 - \xi_i$; all "below" the line, $w \cdot x_i - b \le -1 + \xi_i$; i.e. perhaps within some slack ($\min \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i$). Simple minimization (regression) becomes a convex optimization (classification).
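
A soft-margin linear classifier of this form can be sketched with plain subgradient descent on the hinge-loss objective. This is a swapped-in, minimal solver for illustration, not SVMLight or LibSVM; the learning rate, epoch count, and toy points are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i - b))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        viol = margins < 1.0                      # points that currently need slack
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data: labels +1 ("above") and -1 ("below").
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, b = train_linear_svm(X, y)
pred = np.sign(X @ w - b)
```

Points with margin at least 1 contribute nothing to the gradient, so only the points near the separating line (the support vectors) end up shaping w.
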
22. 22. SVM Light: Multivariate Rank Constraints. Multivariate classification: $\min \tfrac{1}{2}\|w\|_2^2 + C\xi$ s.t. for all $y'$: $w^T \Psi(x,y) - w^T \Psi(x,y') \ge \Delta(y,y') - \xi$. Let $\Psi(x,y') = \sum y' x$ be a linear function. $\mathrm{sgn}(w^T x)$ maps docs to relevance scores (+1/-1), e.g. x = (x1, x2, …, xn), $w^T x$ = (-0.1, +1.2, …, -0.7), y = (-1, +1, …, -1). Learn weights (w) s.t. $\max_{y'} w^T \Psi(x,y')$ is correct for the training set (within a single slack constraint $\xi$). $\Delta(y,y')$ is a multivariate loss function (e.g. 1 - Average Precision(y,y')). $\Psi(x,y')$ is a linear discriminant function (e.g. a sum of ordered pairs $\sum_i \sum_j y_{ij} (x_i - x_j)$).
23. 23. SvmLight Ranking SVMs. SVMperf: ROC Area, F1 Score, Precision/Recall. SVMmap: Mean Average Precision (warning: buggy!). SVMrank: ordinal regression. Standard classification on pairwise differences: $\min \tfrac{1}{2}\|w\|_2^2 + C \sum_{i,j,k} \xi_{i,j,k}$ s.t. for all queries $q_k$ (later, may not be query specific in SVMstruct) and doc pairs $d_i, d_j$: $w^T \Psi(q_k,d_i) - w^T \Psi(q_k,d_j) \ge 1 - \xi_{i,j,k}$. $\Delta_{ROCArea} = 1 - \mathrm{ROCArea}$, counted via the # of swapped pairs. Enforces a directed ordering. Example: doc IDs 1 2 3 4 5 6 7 8 with relevance 1 0 0 0 0 1 1 0; ranking 8 7 6 5 4 3 2 1 gives MAP 0.56 and ROC Area 0.47; ranking 1 2 3 4 5 6 7 8 gives MAP 0.51 and ROC Area 0.53. Reference: A Support Vector Method for Optimizing Average Precision (Joachims et al. 2007).
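
The ROC Area figures in the example can be reproduced by counting swapped pairs. The sketch below assumes the rank rows in the table act as scores (a larger value means the doc is ranked higher); that reading, and the function name `roc_area`, are assumptions rather than something the slide states.

```python
def roc_area(scores, relevance):
    """1 - (swapped pairs) / (relevant x non-relevant pairs)."""
    pos = [s for s, rel in zip(scores, relevance) if rel == 1]
    neg = [s for s, rel in zip(scores, relevance) if rel == 0]
    # A pair is swapped when a non-relevant doc outscores a relevant one.
    swapped = sum(1 for p in pos for n in neg if p < n)
    return 1.0 - swapped / (len(pos) * len(neg))

relevance = [1, 0, 0, 0, 0, 1, 1, 0]      # docs 1..8, from the slide
ranking_1 = [8, 7, 6, 5, 4, 3, 2, 1]
ranking_2 = [1, 2, 3, 4, 5, 6, 7, 8]

area_1 = roc_area(ranking_1, relevance)
area_2 = roc_area(ranking_2, relevance)
```

Under this reading the two rankings give 7/15 and 8/15, which round to the 0.47 and 0.53 shown on the slide.
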
24. 24. Large Scale, Linear SVMs • Solving the Primal – Conjugate Gradient – Joachims: cutting plane algorithm – Nyogi • Handling Large Numbers of Constraints • Cutting Plane Algorithm • Open Source Implementations: – LibSVM – SVMLight
25. 25. Search Engine Relevance: Listing on. A ranking SVM consistently improves Shopping.com <click rank> by 12%
26. 26. Various Sparse Matrix Problems. Google PageRank algorithm, $M a = a$: rank a series of web pages by simulating user browsing patterns (a) based on a probabilistic model (M) of page links. Pattern Recognition, Inference, $L p = h$: estimate unknown probabilities (p) based on historical observations (h) and a probability model (L) of links between hidden nodes. Quantum Chemistry, $H \Psi = E \Psi$: compute the color of dyes and pigments given empirical information on related molecules and/or by solving massive eigenvalue problems.
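
The fixed-point problem $M a = a$ can be sketched with power iteration. The damping factor of 0.85 and the uniform handling of pages without outgoing links are common conventions assumed here, not details given on the slide; the three-page link graph is hypothetical.

```python
import numpy as np

def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for M a = a; links[i][j] = 1 if page j links to page i."""
    links = np.asarray(links, dtype=float)
    n = links.shape[0]
    col_sums = links.sum(axis=0)
    safe = np.where(col_sums > 0, col_sums, 1.0)
    # Column-stochastic transitions; pages with no out-links jump uniformly.
    T = np.where(col_sums > 0, links / safe, 1.0 / n)
    M = damping * T + (1.0 - damping) / n * np.ones((n, n))
    a = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        a_next = M @ a
        if np.abs(a_next - a).sum() < tol:
            return a_next, M
        a = a_next
    return a, M

# Hypothetical graph: page 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [[0, 0, 1],
         [1, 0, 0],
         [1, 1, 0]]
a, M = pagerank(links)
```

Because M is column-stochastic, the iteration preserves the total probability, and the converged vector a is the browsing distribution used to rank the pages.
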
27. 27. Quantum Chemistry: the electronic structure eigenproblem. Solve a massive eigenvalue problem ($10^9$-$10^{12}$): $H\,\Psi(\sigma, \pi, \ldots) = E\,\Psi(\sigma, \pi, \ldots)$, where H is the energy matrix, $\Psi$ is the quantum state eigenvector, and $\sigma, \pi, \ldots$ are the electrons. Methods can have general applicability: Davidson method for dominant eigenvalues/eigenvectors. Motivation for Personalization Technology: from understanding the conceptual foundations of semi-empirical models (noiseless dimensional reduction).
28. 28. Relations between Quantum Mechanics and Probabilistic Language Models • Quantum states $\Psi$ resemble the states (strings, words, phrases) in probabilistic language models (HMMs, SCFGs), except that $\Psi$ is a sum* of strings of electrons: $\Psi(\sigma, \pi)$ = 0. | 1 2 1 2 | + 0.2 | 2 3 1 2 | + … • The energy matrix H is known exactly, but large. Models of H can be inferred from empirical data to simplify computations. • Energies ~= Log [Probabilities], un-normalized. *Not just a single string!
29. 29. Dimensional Reduction in Quantum Chemistry: where do semi-empirical Hamiltonians come from? Ab initio (from first principles): solve the entire $H\,\Psi(\sigma, \pi) = E\,\Psi(\sigma, \pi)$ … approximately. OR Semi-empirical: assume the $(\sigma, \pi)$ electrons are statistically independent: $\Psi(\sigma, \pi) = p(\pi)\,q(\sigma)$. Treat $\pi$-electrons explicitly, ignore $\sigma$ (hidden): $PHP\,p(\pi) = E\,p(\pi)$, a much smaller problem. Parameterize the PHP matrix => Heff with empirical data using a small set of molecules, then apply to others (dyes, pigments).
30. 30. Effective Hamiltonians: Semi-Empirical Pi-Electron Methods. $H_{eff}\,p(\pi) = E\,p(\pi)$. Block form: $\begin{pmatrix} PHP & PHQ \\ QHP & QHQ \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = E \begin{pmatrix} p \\ q \end{pmatrix}$, i.e. $PHP\,p + PHQ\,q = E\,p$ and $QHP\,p + QHQ\,q = E\,q$, with q implicit/hidden. => $H_{eff} = PHP + PHQ\,(E - QHQ)^{-1}\,QHP$. The final Heff can be solved iteratively (as with eSelf Leff), or perturbatively in various forms. The solution is formally exact => Dimensional Reduction / "Renormalization".
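
That the partitioned problem is formally exact can be checked numerically. Solving the second block row for q gives $q = (E - QHQ)^{-1} QHP\,p$, and substituting into the first row yields $[PHP + PHQ\,(E - QHQ)^{-1}\,QHP]\,p = E\,p$. The sketch below uses a small random symmetric matrix as a stand-in for H; the block sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_p, n_q = 2, 4                          # explicit (P) and hidden (Q) dimensions
A = rng.normal(size=(n_p + n_q, n_p + n_q))
H = (A + A.T) / 2.0                      # random symmetric stand-in for H

PHP, PHQ = H[:n_p, :n_p], H[:n_p, n_p:]
QHP, QHQ = H[n_p:, :n_p], H[n_p:, n_p:]

E_all, vecs = np.linalg.eigh(H)
E, psi = E_all[0], vecs[:, 0]            # lowest eigenpair of the full problem
p = psi[:n_p]                            # its projection onto the P block

# Energy-dependent effective Hamiltonian acting on the P block only.
H_eff = PHP + PHQ @ np.linalg.solve(E * np.eye(n_q) - QHQ, QHP)
residual = np.linalg.norm(H_eff @ p - E * p)
```

The residual vanishes to rounding error: the small P-space problem reproduces an exact eigenvalue of the full matrix, at the price of an operator that depends on E and so has to be solved iteratively or perturbatively.
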
31. 31. Graphical Methods. Decompose Heff into effective interactions between $\pi$ electrons: Vij = (diagram) + (diagram) + … (expand $(E - QHQ)^{-1}$ in an infinite series, remove the E dependence). Represent diagrammatically: ~300 diagrams to evaluate. Precompile using symbolic manipulation: ~35 MG executable; 8-10 hours to compile; run time: 3-4 hours/parameter.
32. 32. Effective Hamiltonians: Numerical Calculations. Compute ab initio empirical parameters, example VCC (eV): $\pi$-only 16; effective 11.5; empirical 11-12. Can test all basic assumptions of semi-empirical theory "from first principles". Also provides highly accurate eigenvalue spectra. Augment commercial packages (e.g. Fujitsu MOPAC) to model spectroscopy of photoactive proteins.