
Applied Machine Learning for Search Engine Relevance 3

Applied Machine Learning for Search Relevance and Recommender Systems

  1. Applied Machine Learning for Search Engine Relevance. Charles H. Martin, PhD
  2. Relevance as a Linear Regression
     r = Xw + e
     x: (tf-idf) bag-of-words vector; each x is one query's model*; form X from the data (i.e. a group of queries)
     r: relevance score (i.e. +1/-1)
     w: weight vector
     Moore-Penrose pseudoinverse solution: w = (X†X)^-1 X†r
     Solve as a numerical minimization, min ||Xw - r||², using iterative methods (SOR, CG, etc.); ||w||² is the 2-norm of w.
     *Actually we will model and predict pairwise relations, not exact rank ...stay tuned.
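A minimal numpy sketch of the regression setup above; the random matrices stand in for a real tf-idf matrix X and relevance labels r, and all names are illustrative:

```python
import numpy as np

# Hypothetical toy data: rows of X are tf-idf bag-of-words vectors for
# documents gathered from a group of queries, r holds their +1/-1 labels.
rng = np.random.default_rng(0)
X = rng.random((100, 20))                 # 100 documents, 20 vocabulary terms
r = np.sign(rng.standard_normal(100))     # +1 / -1 relevance scores

# Pseudoinverse / normal-equations solution w = (X^T X)^+ X^T r, computed with
# a least-squares solver (an iterative method such as CG would also work).
w, *_ = np.linalg.lstsq(X, r, rcond=None)

scores = X @ w                            # predicted relevance scores
```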
  3. Relevance as a Linear Regression: Tikhonov Regularization
     w = (X†X)^-1 X†r
     Problem: the inverse may not exist (numerical instabilities, poles).
     Solution: add a constant a to the diagonal of (X†X):
     w = (X†X + aI)^-1 X†r
     a: a single, adjustable smoothing parameter
     Equivalent minimization problem: min ||Xw - r||² + a||w||²
     More generally: form (something like) X†X + G†G + aI, which is a self-adjoint, bounded operator
     => min ||Xw - r||² + a||Gw||², i.e. G chosen to avoid over-fitting
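A sketch of the closed-form Tikhonov solution above (the function name is ours; a is the single smoothing parameter):

```python
import numpy as np

def ridge_weights(X, r, a):
    """Tikhonov-regularized weights w = (X^T X + a I)^-1 X^T r."""
    d = X.shape[1]
    # Solve the regularized normal equations rather than forming the inverse.
    return np.linalg.solve(X.T @ X + a * np.eye(d), X.T @ r)
```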
  4. The Representer Theorem Revisited: Kernels and Green's Functions
     Problem: estimate a function f(x) from training data (xi, yi).
     Solution: solve a general minimization problem, min Loss[(f(xi), yi)] + a||Gf||², given a linear regularization operator G: H -> L2(x).
     (Machine Learning Methods for Estimating Operator Equations, Steinke & Scholkopf 2006)
     Equivalent to: min Loss[(f(xi), yi)] + a αᵀKα, with Kij = R(xi, xj), R := the kernel, and
     f(x) = Σ ai R(x, xi) + Σ bu u(xi), where the functions u span the null space of G.
     Here K is an integral operator, (Kf)(y) = ∫ R(x,y) f(x) dx, so K is the Green's function for G†G (G†G = K^-1, e.g. G = K^-1/2);
     in Dirac notation: R(x,y) = <y|(G†G)^-1|x>
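A hedged kernel-ridge sketch of the representer-theorem form f(x) = Σ ai R(x, xi), using an RBF kernel as a stand-in for whichever Green's function / kernel R is actually chosen (function and parameter names are ours):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Example kernel R(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, a=0.1, gamma=1.0):
    """Coefficients alpha of f(x) = sum_i alpha_i R(x, x_i): (K + a I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + a * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """Evaluate f at new points via the kernel expansion."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```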
  5. Personalized Relevance Algorithms: eSelf Personality Subspace
     [Diagram: a user reading a music site (rock-n-roll, hard-rock pages) is connected through hidden personality-trait nodes q (learned traits: likes cars 0.4, sports cars 0.0 => 0.3) to pages and ads p, so a used-sports-car ad is presented to the user.]
     Compute personality traits during the user's visit to the web site; the q values = stored, learned "personality traits".
     Provide relevance rankings (for pages or ads) which include the personality traits.
  6. Personalized Relevance Algorithms: eSelf Personality Subspace model
     L [p, q] = [h, u], where L is a square matrix
     h: history (observed outputs)
     p: output nodes (observables): web pages, classified ads, ...
     q: hidden nodes (not observed): individualized personality traits
     u: user segmentation
  7. Personalized Search: Effective Regression Problem
     [p, q](t) = (Leff[q(t-1)])^-1 · [h, u](t) on each time step (t)
     Block system:
       PLP p + PLQ q = h
       QLP p + QLQ q = 0
     Formal solution: Leff = PLP - PLQ (QLQ)^-1 QLP, so Leff p = h and p = (Leff[q, u])^-1 h
     => Adapts on each visit, finding relevant pages p(t) based on the links L and the learned personality traits q(t-1).
     Regularization of PLP is achieved with the "Green's Function / Resolvent Operator", i.e. G†G ~= PLQ (QLQ)^-1 QLP.
     Equivalent to a Gaussian Process on a Graph and/or Bayesian Linear Regression.
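A small sketch of the block elimination above: fold the hidden (Q) block into an effective operator on the observable (P) block and solve. Matrix names and the index-list interface are assumptions, not the eSelf implementation:

```python
import numpy as np

def effective_solve(L, h, p_idx, q_idx):
    """Eliminate the hidden block and solve L_eff p = h_P, where
    L_eff = L_PP - L_PQ L_QQ^{-1} L_QP (the Schur complement of L_QQ),
    following the block equations above with h_Q = 0."""
    L_PP = L[np.ix_(p_idx, p_idx)]
    L_PQ = L[np.ix_(p_idx, q_idx)]
    L_QP = L[np.ix_(q_idx, p_idx)]
    L_QQ = L[np.ix_(q_idx, q_idx)]
    L_eff = L_PP - L_PQ @ np.linalg.solve(L_QQ, L_QP)
    p = np.linalg.solve(L_eff, h[np.asarray(p_idx)])
    return p, L_eff
```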
  8. Related Dimensional Noise Reductions: Rank-(k) Approximations of a Matrix
     Latent Semantic Analysis (LSA) / (truncated) Singular Value Decomposition (SVD):
     Diagonalize the density operator D = A†A and retain a subset of (k) eigenvalues/vectors (block form: PDP, PDQ, QDP, QDQ).
     Decomposition: A = UΣV†, so A†A = V (Σ†Σ) V†.
     Equivalent relation for the SVD: the optimal rank-(k) approximation X s.t. min ||D - X||₂².
     Can generalize to various noise models, i.e. VLSI*, PLSA**.
     *Variable Latent Semantic Indexing (Yahoo! Research Labs), http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf
     VLSI* provides a rank-(k) approximation tailored to any query distribution: min E[ ||qᵀ(D - X)||₂² ].
     **Probabilistic Latent Semantic Indexing (Recommind, Inc), http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
     PLSA** provides a rank-(k) approximation over latent classes (z): min DKL[P(data) || P], with P(d,w) = Σz P(d|z) P(z) P(w|z); DKL = Kullback-Leibler divergence.
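A one-function sketch of the truncated-SVD (LSA) rank-k approximation described above; the function name is ours:

```python
import numpy as np

def lsa_rank_k(A, k):
    """Rank-k LSA approximation of a term-document matrix A via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```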
  9. Personalized Relevance: Mobile Services and Advertising
     France Telecom: Last Inch Relevance Engine
     [Diagram: given time and location, suggest an action: play a game, send a message, play a song, ...]
  10. KA for Comm Services
     • Based on Business Intelligence (BI) data mining and/or pattern-recognition algorithms (i.e. supervised or unsupervised learning), we compute statistical scores indicating who are the most likely people to call, or to send an SMS, MMS, or e-mail.
     • Based on the Empirical Bayesian score and a suggestion mapping table, a decision is made to suggest one or more possible comm services.
     [Diagram: events and personal context (e.g. Sunday mornings) map, through learned traits, to a contextual comm-service suggestion for the user (Call [who], SMS [who], MMS [who]); learned trait: on Sunday morning, most likely to call Mom (5), then Bob (3), then the phone company (1).]
  11. Comm/Call Patterns
     [Diagram: calls to different #'s broken out by location (LOC), day of week, and period of day (POD); for some contacts p(contact | POD) > p(contact | Sunday) while for others p(contact | POD) < p(contact | Sunday), and some (contact, context) combinations never occur, p(·,·) = 0.]
  12. Bayesian Score Estimation
     To estimate p(call | POD):
     Frequency estimate: p(call | POD) = # of times the user called someone at that POD / # of events at that POD
     Bayesian estimate: p(call | POD) = p(POD | call) p(call) / Σq p(POD | q) p(q), where q = call, sms, mms, or email
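A small sketch of the Bayesian score above; the nested-dict layout and all names are assumptions about how the event counts might be stored:

```python
def p_call_given_pod(counts, pod):
    """Bayesian score p(call | pod), where counts[choice][pod] = # of times
    `choice` (call / sms / mms / email) was observed at that period of day."""
    total = sum(sum(per_pod.values()) for per_pod in counts.values())
    if total == 0:
        return 0.0

    def p_pod_given(choice):                       # p(POD | choice)
        n = sum(counts[choice].values())
        return counts[choice].get(pod, 0) / n if n else 0.0

    def p_choice(choice):                          # p(choice)
        return sum(counts[choice].values()) / total

    num = p_pod_given("call") * p_choice("call")
    den = sum(p_pod_given(q) * p_choice(q) for q in counts)
    return num / den if den else 0.0
```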
  13. i.e. a Bayesian Choice Estimator
     • We seek the probability of a "call" (choice) at a given POD.
     • We "borrow information" from the other PODs, assuming this is less biased, to improve our statistical estimate.
     Example: 5 days, 3 PODs, 3 choices.
     Frequency estimator: f(choice | POD 1) = 2/5.
     Bayesian choice estimator: p(choice | POD 1) = (2/5)(3/15) / [ (2/5)(3/15) + (2/5)(3/15) + (1/5)(11/15) ] = 6/23 ~ 1/4.
     Note: the Bayesian estimate is significantly lower because we now expect we might see one of the other choices at POD 1.
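The arithmetic of the worked example, spelled out; the choice labels "A", "B", "C" are placeholders for the icons on the original slide:

```python
# Frequency estimate at POD 1 is 2/5 = 0.40; the Bayesian choice estimator
# borrows the complete-data priors 3/15, 3/15, 11/15 from all PODs.
p_pod1_given_choice = {"A": 2/5, "B": 2/5, "C": 1/5}      # p(POD 1 | choice)
p_choice            = {"A": 3/15, "B": 3/15, "C": 11/15}  # p(choice)

num = p_pod1_given_choice["A"] * p_choice["A"]
den = sum(p_pod1_given_choice[c] * p_choice[c] for c in p_choice)
print(num / den)     # 6/23 ≈ 0.26, noticeably below the 0.40 frequency estimate
```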
  14. Incorporating Feedback
     • It is not enough to simply recognize call patterns in the Event Facts; it is also necessary to incorporate feedback (random / irrelevant / poor / good) into our suggestion scores.
     • p( c | user, pod, loc, facts, feedback ) = ?
     A: Simply factorize: p( c | user, pod, facts, feedback ) = p( c | user, pod, facts ) p( c | user, pod, feedback ), evaluating the probabilities independently, perhaps using different Bayesian models.
  15. Personalized Relevance: Empirical Bayesian Models
     Closed-form models: correct a sample estimate (mean m, variance σ²) with a weighted average of the sample and the complete data set:
     m = B m_individual-sample + (1 - B) m_user-segment, where B is the shrinkage factor.
     Can rank-order mobile services (play game, send msg, play song, ...) based on the estimated likelihood (m, σ²).
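A tiny sketch of the shrinkage formula; the convention that B weights the individual sample (B -> 1 trusts the user's own data), like the names and numbers, is an assumption:

```python
def shrunk_mean(sample_mean, segment_mean, B):
    """Shrinkage estimate m = B * m_sample + (1 - B) * m_segment."""
    return B * sample_mean + (1 - B) * segment_mean

# e.g. a user who played a game in 8 of 10 sessions, in a segment averaging 0.3:
print(shrunk_mean(0.8, 0.3, B=0.6))   # 0.6*0.8 + 0.4*0.3 = 0.60
```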
  16. Personalized Relevance: Empirical Bayesian Models
     What is Empirical Bayes modeling? Specify a likelihood L(y|θ) and a prior π(θ), and estimate the posterior
     π(θ|y) = L(y|θ) π(θ) / ∫ L(y|θ) π(θ) dθ   (the denominator is the marginal).
     Combines Bayesianism and frequentism: approximates the marginal (or posterior) using a point estimate (MLE), Monte Carlo, etc.; estimates the marginal using empirical data; uses the empirical data to infer the prior, then plugs it into the likelihood to make predictions.
     Note: this is a special case of Effective Operator Regression: P space ~ Q space; PLQ = I; u ≡ 0; the Q-space defines the prior information.
  17. Empirical Bayesian Methods: Poisson-Gamma Model
     Likelihood L(y|λ) = Poisson distribution: λ^y e^-λ / y!
     Conjugate prior π(λ; a, b) = Gamma distribution: λ^(a-1) e^(-λ/b) / (Γ(a) b^a), λ > 0
     Posterior ∝ L(y|λ) π(λ) ∝ λ^(y+a-1) e^(-λ(1 + 1/b)), also a Gamma distribution, with a' = y + a and b' = (1 + 1/b)^-1.
     Take the point estimate (MLE of the marginal) = the mean of the posterior, a'b'.
     Obtain a, b from the mean (m = ab) and variance (ab²) of the complete data.
     The final point estimate E(λ) = a'b' for a sample is a weighted average of the sample mean m_y = y and the prior mean m:
     E(λ) = (y + a)(1 + 1/b)^-1 = (b/(1+b)) m_y + (1/(1+b)) m
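A sketch of the Poisson-Gamma point estimate above, with the prior fit to the complete data by the method of moments; the function and argument names are ours:

```python
def poisson_gamma_estimate(y, m, v):
    """Empirical-Bayes point estimate of a Poisson rate.

    y : observed count for one sample (e.g. one user at one period of day)
    m : mean of the complete data set      (m = a*b for the Gamma prior)
    v : variance of the complete data set  (v = a*b**2)
    """
    b = v / m                       # Gamma scale, by method of moments
    a = m / b                       # Gamma shape
    a_post = y + a                  # a' = y + a
    b_post = 1.0 / (1.0 + 1.0 / b)  # b' = (1 + 1/b)^-1
    return a_post * b_post          # = (b/(1+b)) * y + (1/(1+b)) * m
```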
  18. Linear Personality Matrix
     A linear (or non-linear) matrix transformation maps suggestions to actions: M s = a.
     Over time we can estimate M_{a,s} = prob( a | s ), i.e. for a given time and location, count how many times we suggested a call but the user chose an e-mail instead (s1 = call, s2 = sms, s3 = mms, s4 = email).
     Notice: the personality matrix may or may not mix suggestions across events, and can include semantic information.
     We can then solve for prob(s) using a computational linear solver: s = M^-1 a.
     Obviously we would like M to be diagonal, or as close to diagonal as possible!
     Can we devise an algorithm that will learn to give "optimal" suggestions?
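A sketch of estimating M from suggestion/action counts and inverting it; every number here is made up for illustration:

```python
import numpy as np

# Hypothetical counts[a, s] = # of times the user took action a after we made
# suggestion s (0 = call, 1 = sms, 2 = mms, 3 = email).
counts = np.array([[30.,  5.,  2.,  3.],
                   [ 4., 20.,  1.,  5.],
                   [ 1.,  2., 10.,  2.],
                   [ 5.,  3.,  2., 25.]])
M = counts / counts.sum(axis=0)        # column-normalize: M[a, s] ~ prob(a | s)

a = np.array([0.5, 0.2, 0.1, 0.2])     # observed distribution over actions
s = np.linalg.solve(M, a)              # implied suggestion weights, s = M^-1 a
```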
  19. Matrices for Pattern Recognition (Statistical Factor Analysis)
     We can apply computational linear algebra to remove noise and find patterns in data: called Factor Analysis by statisticians, Singular Value Decomposition (SVD) by engineers, and implemented in Oracle Data Mining (ODM) as Non-Negative Matrix Factorization.
     1. Enumerate all choices (call on Mon @ pod 1, call on Mon @ pod 2, call on Mon @ pod 3, ..., sms on Tue @ pod 1, ...).
     2. Count the # of times each choice is made in each week (weeks 1, 2, 3, 4, 5, ...).
     3. Form the weekly choice density matrix AᵀA.
     4. The weekly patterns are collapsed into the density matrix AᵀA and can be detected using spectral analysis (i.e. the principal eigenvalues), separating all weekly patterns from pure noise. (See the sketch below.)
     Similar to Latent Dirichlet Allocation (LDA), but much simpler to implement. Suitable when the number (#) of choices is not too large and the patterns are weekly.
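A sketch of steps 1-4 with synthetic counts standing in for real usage data; the shapes and names are assumptions:

```python
import numpy as np

# Hypothetical weekly choice-count matrix A: one row per week, one column per
# enumerated choice ("call on Mon @ pod 1", "sms on Tue @ pod 1", ...).
rng = np.random.default_rng(1)
A = rng.poisson(2.0, size=(52, 30)).astype(float)

D = A.T @ A                              # weekly choice density matrix
evals, evecs = np.linalg.eigh(D)         # spectral analysis of A^T A
top = np.argsort(evals)[::-1][:3]
patterns = evecs[:, top]                 # principal weekly patterns; the
                                         # remaining directions are ~ noise
```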
  20. Search Engine Relevance: Listing on Shopping.com
     Which 5 items to list at the bottom of the page?
  21. Statistical Machine Learning: Support Vector Machines (SVM)
     From regression to classification: maximum-margin solutions.
     Classification := find the line that separates the points with the maximum margin 2/||w||₂.
     min ½||w||₂² subject to the constraint specifications that all points lie "above" or "below" the line:
     "above": w·xi - b >= +1 - ξi
     "below": w·xi - b <= -1 + ξi
     The simple minimization (regression) becomes a convex optimization (classification), perhaps within some slack (i.e. min ½||w||₂² + C Σ ξi).
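A hedged sketch using a standard library soft-margin linear SVM; the toy data is made up, and note that LinearSVC's default loss is the squared hinge, a close cousin of the formulation above rather than an exact match:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data standing in for tf-idf features X and +1/-1 relevance labels y.
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = np.where(X[:, 0] > 0.5, 1, -1)

# Soft-margin linear SVM: roughly min 1/2 ||w||^2 + C * sum(slack).
clf = LinearSVC(C=1.0).fit(X, y)
w = clf.coef_.ravel()
print("margin width ~", 2.0 / np.linalg.norm(w))   # 2 / ||w||
```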
  22. SVM Light: Multivariate Rank Constraints
     Multivariate classification: min ½||w||₂² + Cξ  s.t. for all y': wᵀΨ(x,y) - wᵀΨ(x,y') >= Δ(y,y') - ξ
     Let Ψ(x,y') = Σ y'x be a linear function of the docs' feature vectors (x1, x2, ..., xn) and their relevance labels y (+1/-1); sgn(wᵀx) maps docs to relevance scores (+1/-1).
     Learn weights (w) s.t. the y' maximizing wᵀΨ(x,y') is correct for the training set (within a single slack constraint ξ).
     Δ(y,y') is a multivariate loss function (i.e. 1 - AveragePrecision(y,y')).
     Ψ(x,y') is a linear discriminant function (i.e. a sum over ordered pairs, Σi Σj yij (xi - xj)).
  23. SVMlight Ranking SVMs
     SVMperf: ROC Area, F1 Score, Precision/Recall
     SVMmap: Mean Average Precision (warning: buggy!)
     SVMrank: Ordinal Regression, i.e. standard classification on pairwise differences:
     min ½||w||₂² + C Σ ξijk  s.t. for all queries qk and doc pairs (di, dj): wᵀΨ(qk,di) - wᵀΨ(qk,dj) >= 1 - ξijk
     (later, the constraints may not be query specific in SVMstruct)
     ΔROCArea = 1 - # of swapped pairs; enforces a directed ordering.
     [Figure: two example rankings of 8 documents with relevance labels (1 0 0 0 0 1 1 0); MAP = 0.56 vs 0.47 while ROC Area = 0.51 vs 0.53, so the two metrics can prefer different rankings.]
     A Support Vector Method for Optimizing Average Precision (Joachims et al., 2007)
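An SVMrank-style sketch of "classification on pairwise differences": train a linear classifier on (xi - xj) for same-query pairs with rel_i > rel_j. This is a simplification of what SVMrank actually does (one slack per constraint, a generic library classifier), and all names are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_rank_svm(X, rel, query_ids, C=1.0):
    """Learn w so that w . (x_i - x_j) > 0 whenever doc i should rank above
    doc j within the same query; rank new docs for a query by w @ x."""
    diffs, signs = [], []
    for q in np.unique(query_ids):
        idx = np.where(query_ids == q)[0]
        for i in idx:
            for j in idx:
                if rel[i] > rel[j]:
                    diffs.append(X[i] - X[j]); signs.append(1)
                    diffs.append(X[j] - X[i]); signs.append(-1)
    clf = LinearSVC(C=C).fit(np.array(diffs), np.array(signs))
    return clf.coef_.ravel()
```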
  24. Large Scale, Linear SVMs
     • Solving the primal: conjugate gradient; Joachims' cutting-plane algorithm; Nyogi
     • Handling large numbers of constraints: the cutting-plane algorithm
     • Open-source implementations: LibSVM, SVMLight
  25. Search Engine Relevance: Listing on Shopping.com
     A ranking SVM consistently improves the Shopping.com <click rank> by 12%.
  26. Various Sparse Matrix Problems
     Google PageRank algorithm, M a = a: rank a series of web pages by simulating user browsing patterns (a) based on a probabilistic model (M) of page links.
     Pattern recognition / inference, L p = h: estimate unknown probabilities (p) based on historical observations (h) and a probability model (L) of the links between hidden nodes.
     Quantum chemistry, H ψ = E ψ: compute the color of dyes and pigments given empirical information on related molecules, and/or solve massive eigenvalue problems.
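A minimal power-iteration sketch of the PageRank-style problem M a = a; the damping value and names are conventional assumptions, and M is assumed column-stochastic:

```python
import numpy as np

def pagerank(M, damping=0.85, tol=1e-9):
    """Power iteration for the dominant eigenvector of the damped link matrix."""
    n = M.shape[0]
    G = damping * M + (1.0 - damping) / n    # damped ("Google"-style) matrix
    a = np.full(n, 1.0 / n)
    while True:
        a_next = G @ a
        a_next /= a_next.sum()
        if np.abs(a_next - a).sum() < tol:
            return a_next
        a = a_next
```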
  27. Quantum Chemistry: the Electronic Structure Eigenproblem
     Solve a massive eigenvalue problem (dimension 10^9 - 10^12): H Ψ(σ, π, ...) = E Ψ(σ, π, ...)
     H: energy matrix; Ψ: quantum-state eigenvector; σ, π, ...: electrons
     The methods can have general applicability, i.e. the Davidson method for dominant eigenvalues/eigenvectors.
     The motivation for the personalization technology came from understanding the conceptual foundations of semi-empirical models (noiseless dimensional reduction).
  28. Relations between Quantum Mechanics and Probabilistic Language Models
     • Quantum states Ψ resemble the states (strings, words, phrases) in probabilistic language models (HMMs, SCFGs), except that Ψ is a sum* of strings of electrons: Ψ(σ, π) = 0. |σ1 σ2 π1 π2| + 0.2 |σ2 σ3 π1 π2| + ...
     • The energy matrix H is known exactly, but is large. Models of H can be inferred from empirical data to simplify the computations.
     • Energies ~= log[probabilities], un-normalized.
     *Not just a single string!
  29. Dimensional Reduction in Quantum Chemistry: Where Do Semi-Empirical Hamiltonians Come From?
     Ab initio (from first principles): solve the entire H Ψ(σ, π) = E Ψ(σ, π) ...approximately.
     OR
     Semi-empirical: assume the σ and π electrons are statistically independent, Ψ(σ, π) = p(π) q(σ); treat the π-electrons explicitly and ignore the σ (hidden):
     PHP p(π) = E p(π), a much smaller problem.
     Parameterize the PHP matrix => Heff with empirical data using a small set of molecules, then apply it to others (dyes, pigments).
  30. Effective Hamiltonians: Semi-Empirical Pi-Electron Methods
     Heff[E] p(π) = E p(π)
     Block form:
       PHP p + PHQ q = E p
       QHP p + QHQ q = E q     => q implicit / hidden
     Heff[E] = PHP + PHQ (E - QHQ)^-1 QHP
     The solution is formally exact => dimensional reduction / "renormalization".
     The final Heff can be solved iteratively (as with the eSelf Leff), or perturbatively in various forms.
  31. Graphical Methods
     Decompose Heff into effective interactions Vij between π electrons: expand (E - QHQ)^-1 in an infinite series and remove the E dependence.
     Represent the terms diagrammatically: ~300 diagrams to evaluate.
     Precompile using symbolic manipulation: ~35 MB executable; 8-10 hours to compile; run time: 3-4 hours per parameter.
  32. Effective Hamiltonians: Numerical Calculations
     Example, the VCC parameter (eV): π-only 16, effective 11.5, empirical 11-12.
     Computing empirical parameters ab initio lets us test all the basic assumptions of semi-empirical theory "from first principles".
     Also provides highly accurate eigenvalue spectra.
     Augment commercial packages (i.e. Fujitsu MOPAC) to model the spectroscopy of photoactive proteins.
