Graphical model software for machine learning Kevin Murphy, University of British Columbia, December 2005
Outline Discriminative models for iid data Beyond iid data: conditional random fields Beyond supervised learning: generative models Beyond optimization: Bayesian models
Supervised learning as Bayesian inference [Figure: graphical model with training pairs (X_1,Y_1), …, (X_N,Y_N), also drawn as a plate over (X_n,Y_n) for n = 1..N, and a test pair (X_*, Y_*); panels labelled Training and Testing.]
Supervised learning as optimization [Figure: the same training/testing graphical model, with the parameters fit by optimization rather than integrated out.]
Example: logistic regression Let y_n ∈ {1,…,C} be given by a softmax. Maximize the conditional log likelihood. "Max margin" solution.
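For reference, a standard way to write the softmax model and the conditional log likelihood being maximized (the per-class weight vectors w_c are notation introduced here, not taken from the slide):

```latex
% Softmax (multinomial logistic) model; w_c is the weight vector for class c.
p(y_n = c \mid x_n, w) = \frac{\exp(w_c^\top x_n)}{\sum_{c'=1}^{C} \exp(w_{c'}^\top x_n)}
% Training maximizes the conditional log likelihood over the iid training pairs.
\ell(w) = \sum_{n=1}^{N} \log p(y_n \mid x_n, w)
```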
Outline Discriminative models for iid data Beyond iid data: conditional random fields Beyond supervised learning: generative models Beyond optimization: Bayesian models
1D chain CRFs for sequence labeling [Figure: chain Y_n1 - Y_n2 - … - Y_nm, all connected to the observed input X_n.] A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, y_n ∈ {1,…,C}^m. The model has local evidence terms φ_i and edge potentials ψ_ij.
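A common way to write the chain CRF the slide describes, with φ_i the local evidence and ψ_{i,i+1} the edge potential (the exact parameterization on the slide may differ):

```latex
% Chain CRF over a length-m label sequence y, conditioned on the whole input x.
p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{m} \phi_i(y_i, x) \prod_{i=1}^{m-1} \psi_{i,i+1}(y_i, y_{i+1}, x)
% Z(x) sums this product over all K^m label sequences (computable in O(m K^2) by dynamic programming).
```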
2D Lattice CRFs for pixel labeling A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψ_ij are image dependent.
2D Lattice MRFs for pixel labeling A Markov Random Field (MRF) is an undirected graphical model. Here we model the correlation between pixel labels using potential functions ψ_ij(y_i, y_j). We also have a per-pixel generative model of observations P(x_i | y_i) (the local evidence); normalization requires a partition function.
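A typical way to write the model the slide labels in terms of local evidence, potential functions, and the partition function:

```latex
% Pairwise MRF for pixel labelling: label prior from pairwise potentials, per-pixel local evidence.
p(y, x) = \frac{1}{Z} \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j) \prod_i p(x_i \mid y_i)
% Z (the partition function) normalizes the label prior; E is the set of lattice edges.
```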
Tree-structured CRFs Used in parts-based object detection. Y_i is the location of part i (eyeL, eyeR, nose, mouth) in the image. Fischler & Elschlager, "The representation and matching of pictorial structures", PAMI'73. Felzenszwalb & Huttenlocher, "Pictorial Structures for Object Recognition", IJCV'05.
General CRFs In general, the graph may have arbitrary structure, eg for collective web page classification: nodes = URLs, edges = hyperlinks. The potentials are in general defined on cliques, not just edges.
Factor graphs Square nodes = factors (potentials) Round nodes = random variables Graph structure = bipartite
Potential functions For the local evidence, we can use a discriminative classifier (trained iid). For the edge compatibilities, we can use a maxent/loglinear form with pre-defined features.
Restricted potential functions For some applications (esp in vision), we often use a Potts model, which encourages neighbouring nodes to take the same label. We can generalize this to ordered labels (eg a discretization of continuous states), where the penalty depends on the label difference.
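A sketch of the standard forms (the smoothing strength β and truncation level τ are notation introduced here; the slide's exact parameterization may differ):

```latex
% Potts model: reward equal neighbouring labels; beta controls the smoothing strength.
\psi_{ij}(y_i, y_j) = \exp\bigl(\beta \, \delta(y_i = y_j)\bigr)
% Generalization for ordered labels: penalty grows with the (possibly truncated) label difference.
\psi_{ij}(y_i, y_j) = \exp\bigl(-\beta \, \min(|y_i - y_j|, \tau)\bigr)
```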
Learning CRFs With a loglinear parameterization (parameters tied across cliques), the gradient of the conditional log likelihood takes the form: gradient = observed features minus expected features.
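A sketch of the standard equations behind this slide, for a loglinear CRF with clique features f_c and tied parameters θ (notation introduced here):

```latex
% Conditional log likelihood of a loglinear CRF with tied parameters theta and clique features f_c.
\ell(\theta) = \sum_{n=1}^{N} \Bigl[ \sum_{c} \theta^\top f_c(y_n^c, x_n) - \log Z(x_n, \theta) \Bigr]
% Gradient: empirical feature counts minus feature counts expected under the model.
\nabla_\theta \ell = \sum_{n=1}^{N} \sum_{c} \Bigl[ f_c(y_n^c, x_n) - \mathbb{E}_{p(y^c \mid x_n, \theta)}\, f_c(y^c, x_n) \Bigr]
```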
Learning CRFs Given the gradient ∇ℓ, one can find the global optimum (the conditional log likelihood is convex) using first or second order optimization methods, such as conjugate gradient, limited memory BFGS, or stochastic meta descent (SMD)? The bottleneck is computing the expected features needed for the gradient.
Exact inference For 1D chains, one can compute P(y_{i,i+1} | x) exactly in O(N K^2) time using belief propagation (BP = forwards-backwards algorithm). For restricted potentials (eg potentials that depend only on the label difference y_i - y_j, as in the generalized Potts model), one can do this in O(N K) time using FFT-like tricks. This can be generalized to trees.
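As a rough sketch (not the speaker's code), forwards-backwards on a chain with K states costs O(m K^2); the pairwise marginals P(y_{i,i+1} | x) needed for learning come from the same forward and backward messages. The function and array names below are assumptions:

```python
# Minimal forwards-backwards sketch on a length-m chain with K states.
# phi is an (m, K) array of local evidence; psi is a (K, K) pairwise potential.
import numpy as np

def chain_marginals(phi, psi):
    m, K = phi.shape
    alpha = np.zeros((m, K))   # forward messages (rescaled to avoid underflow)
    beta = np.ones((m, K))     # backward messages
    alpha[0] = phi[0] / phi[0].sum()
    for i in range(1, m):
        a = phi[i] * (alpha[i - 1] @ psi)   # O(K^2) per step
        alpha[i] = a / a.sum()
    for i in range(m - 2, -1, -1):
        b = psi @ (phi[i + 1] * beta[i + 1])
        beta[i] = b / b.sum()
    marg = alpha * beta                     # node marginals, up to per-node constants
    return marg / marg.sum(axis=1, keepdims=True)

# Example: 5-position chain, 3 labels, Potts-style pairwise potential.
phi = np.random.rand(5, 3) + 0.1
psi = np.exp(2.0 * np.eye(3))
print(chain_marginals(phi, psi))
```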
Sum-product vs max-product We use sum-product to compute marginal probabilities needed for learning We use max-product to find the most probable assignment (Viterbi decoding) We can also compute max-marginals
Complexity of exact inference In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).
Approximate sum-product (pairwise potentials; N = num nodes, K = num states, I = num iterations)
Algorithm: potential, time
BP (exact iff tree): general, O(N K^2 I)
BP+FFT (exact iff tree): restricted, O(N K I)
Generalized BP: general, O(N K^{2c} I), c = cluster size
Gibbs: general, O(N K I)
Swendsen-Wang: general, O(N K I)
Mean field: general, O(N K I)
Approximate max-product (pairwise potentials; N = num nodes, K = num states, I = num iterations)
Algorithm: potential, time
BP (exact iff tree): general, O(N K^2 I)
BP+DT (exact iff tree): restricted, O(N K I)
Generalized BP: general, O(N K^{2c} I), c = cluster size
Graph-cuts (exact iff K=2): restricted, O(N^2 K I) [?]
ICM (iterated conditional modes): general, O(N K I)
SLS (stochastic local search): general, O(N K I)
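To make the simplest row of the table concrete, here is a minimal ICM sketch for a Potts model on a 4-connected grid (the function and array names are assumptions, not the code surveyed on the following slides); I full sweeps cost O(N K I):

```python
# Minimal ICM sketch for MAP labelling on a 4-connected grid with a Potts smoothness term.
# unary is an (H, W, K) array of negative log local evidence; beta is the Potts penalty.
import numpy as np

def icm(unary, beta, n_iters=10):
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)              # initialize at the unary-only solution
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                cost = unary[i, j].astype(float)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # Potts penalty for disagreeing with each neighbour's current label
                        cost += beta * (np.arange(K) != labels[ni, nj])
                labels[i, j] = cost.argmin()   # greedy coordinate update
    return labels

# Example: denoise a random binary image with a mild smoothness prior.
noisy = (np.random.rand(20, 20) < 0.5).astype(int)
unary = np.stack([1.0 * (noisy == 1), 1.0 * (noisy == 0)], axis=2)   # cost of labels 0 and 1
print(icm(unary, beta=0.7))
```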
Learning intractable CRFs We can use approximate inference and hope the gradient is "good enough". If we use max-product, we are doing "Viterbi training" (cf the perceptron rule). Or we can use other techniques, such as pseudo-likelihood, which does not need inference.
Pseudo-likelihood
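For reference, the standard pseudo-likelihood objective replaces the conditional likelihood with a product of per-node full conditionals, so no global partition function over y is needed:

```latex
% Pseudo-likelihood: product of each node's conditional given its neighbours (Markov blanket).
PL(\theta) = \prod_{n=1}^{N} \prod_{i} p\bigl(y_{n,i} \mid y_{n,\mathcal{N}(i)}, x_n, \theta\bigr)
% Each factor normalizes over the K values of a single y_{n,i}, so no global Z(x_n) is required.
```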
Software for inference and learning in 1D CRFs Various packages:
Mallet (McCallum et al) – Java
Crf.sourceforge.net (Sarawagi, Cohen) – Java
My code – matlab (just a toy, not integrated with BNT)
Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning).
Nothing standard; emphasis on NLP apps.
Software for inference in general CRFs/MRFs
Max-product: C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al, "A comparative study of energy minimization methods for MRFs", Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference)
Sum-product: various other ad hoc pieces
My matlab BP code (MRF2)
Rivasseau's C++ code for BP, Gibbs, tree-sampling (factor graphs)
Meltzer's C++ code for BP, GBP, Gibbs, MF (Lattice2)
Software for learning general MRFs/CRFs Hardly any!
Parise's matlab code (approx gradient, pseudo-likelihood, CD, etc)
My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)
Structure of ideal toolbox [Diagram: a generator/GUI/file produces trainData and testData; a learnEngine trains a model; an infEngine answers queries with a probDist; a decisionEngine decides, returning a decision or N-best list; plus utilities to visualize, summarize, and measure performance.]
Structure of BNT [Diagram: the ideal-toolbox structure instantiated by BNT: data as cell arrays, queries as NodeIds, models as graphs+CPDs, learn engines EM and StructuralEM, inference engines BP, Jtree, VarElim and MCMC returning probDists as arrays, Gaussians, or samples (N-best with N=1, ie MAP), and a LIMID decision engine (Jtree, VarElim) returning a policy; contributions by LeRay and Shan.]
Outline Discriminative models for iid data Beyond iid data: conditional random fields Beyond supervised learning: generative models Beyond optimization: Bayesian models
Unsupervised learning: why? Labeling data is time-consuming. Often not clear what label to use. Complex objects often not describable with a single discrete label. Humans learn without labels. Want to discover novel patterns/ structure.
Unsupervised learning: what? Clusters (eg GMM) Low dim manifolds (eg PCA) Graph structure (eg biology, social networks) “ Features” (eg maxent models of language and texture) “ Objects” (eg sprite models in vision)
Unsupervised learning of objects from video Frey and Jojic; Williams and Titsias; et al
Unsupervised learning: issues Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (a measure of data compression). Local minima (non-convex objective). Uses inference as a subroutine (can be slow, but no worse than discriminative learning).
Unsupervised learning: how? Construct a generative model (eg a Bayes net). Perform inference. May have to use approximations such as maximum likelihood and BP. Cannot use max likelihood for model selection…
A comparison of BN software www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html
Popular BN software BNT (matlab) Intel’s PNL (C++) Hugin (commercial) Netica (commercial) GMTk (free .exe from Jeff Bilmes)
Outline Discriminative models for iid data Beyond iid data: conditional random fields Beyond supervised learning: generative models Beyond optimization: Bayesian models
Bayesian inference: why? It is optimal. It can easily incorporate prior knowledge (esp. useful for small n, large p problems). It properly reports confidence in output (useful for combining estimates, and for risk-averse applications). It separates models from algorithms.
Bayesian inference: how? Since we want to integrate, we cannot use max-product. Since the unknown parameters are continuous, we cannot use sum-product. But we can use EP (expectation propagation), which is similar to BP. We can also use variational inference. Or MCMC (eg Gibbs sampling).
General purpose Bayesian software BUGS (Gibbs sampling) VIBES (variational message passing) Minka and Winn’s toolbox (infer.net)
Structure of ideal Bayesian toolbox [Diagram: the same structure as on the "Structure of ideal toolbox" slide (generator/GUI/file, trainData/testData, learnEngine, infEngine, decisionEngine, visualize/summarize utilities), without the N-best list output.]