    Software toolkits for machine learning and graphical models: Presentation Transcript

    • Graphical model software for machine learning. Kevin Murphy, University of British Columbia, December 2005
    • Outline
      • Discriminative models for iid data
      • Beyond iid data: conditional random fields
      • Beyond supervised learning: generative models
      • Beyond optimization: Bayesian models
    • Supervised learning as Bayesian inference [plate diagram: training pairs (X_n, Y_n) for n = 1..N, a test pair (X_*, Y_*), and shared parameters θ]
    • Supervised learning as optimization [same plate diagram, training vs. testing, with the parameters θ fit by optimization]
    • Example: logistic regression
      • Let y_n ∈ {1,…,C} be given by a softmax
      • Maximize conditional log likelihood (sketch below)
      • “Max margin” solution
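A minimal sketch (not from the slides) of maximizing the conditional log likelihood for softmax logistic regression by gradient ascent; the toy data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    # numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def fit_softmax_regression(X, y, num_classes, lr=0.1, iters=500):
    """Maximize (1/N) * sum_n log p(y_n | x_n, W), with p = softmax(W x_n)."""
    N, D = X.shape
    W = np.zeros((num_classes, D))
    Y = np.eye(num_classes)[y]               # one-hot targets, N x C
    for _ in range(iters):
        P = softmax(X @ W.T)                  # predicted class probabilities, N x C
        W += lr * (Y - P).T @ X / N           # gradient ascent step on the log likelihood
    return W

# toy usage on synthetic data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
print(fit_softmax_regression(X, y, num_classes=2))
```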
    • Outline
      • Discriminative models for iid data
      • Beyond iid data: conditional random fields
      • Beyond supervised learning: generative models
      • Beyond optimization: Bayesian models
    • 1D chain CRFs for sequence labeling: a 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, y_n ∈ {1,…,C}^m. [figure: chain Y_n1, Y_n2, …, Y_nm conditioned on X_n, with local evidence ψ_i and edge potentials ψ_ij]
    • 2D lattice CRFs for pixel labeling: a conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψ_ij are image dependent.
    • 2D lattice MRFs for pixel labeling: a Markov random field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using ψ_ij(y_i, y_j). We also have a per-pixel generative model of observations P(x_i | y_i). [figure: lattice with local evidence, potential functions, and the partition function labelled]
    • Tree-structured CRFs
      • Used in parts-based object detection
      • Y_i is the location of part i in the image
      [figure: face model with parts eyeL, eyeR, nose, mouth] Fischler & Elschlager, “The representation and matching of pictorial structures”, PAMI ’73; Felzenszwalb & Huttenlocher, “Pictorial Structures for Object Recognition”, IJCV ’05
    • General CRFs
      • In general, the graph may have arbitrary structure
      • eg for collective web page classification, nodes=urls, edges=hyperlinks
      • The potentials are in general defined on cliques, not just edges
    • Factor graphs: square nodes = factors (potentials), round nodes = random variables; the graph structure is bipartite
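As a concrete illustration of the bipartite structure, here is a minimal sketch of a factor graph held as plain dictionaries; the variable and factor names are made up for the example.

```python
# a tiny factor graph: variable nodes and factor nodes, joined only across the two sets
factor_graph = {
    "variables": {"y1": 2, "y2": 2, "y3": 2},                 # name -> number of states
    "factors": {
        "f1": {"scope": ["y1"], "table": [0.7, 0.3]},         # local evidence on y1
        "f2": {"scope": ["y1", "y2"],
               "table": [[0.9, 0.1], [0.1, 0.9]]},            # edge potential y1-y2
        "f3": {"scope": ["y2", "y3"],
               "table": [[0.9, 0.1], [0.1, 0.9]]},            # edge potential y2-y3
    },
}

# bipartite structure: every edge joins a factor to a variable in its scope
edges = [(f, v) for f, spec in factor_graph["factors"].items() for v in spec["scope"]]
print(edges)
```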
    • Potential functions
      • For the local evidence, we can use a discriminative classifier (trained iid)
      • For the edge compatibilities, we can use a maxent / log-linear form, using pre-defined features (sketch below)
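A minimal sketch of a maxent / log-linear edge compatibility built from pre-defined features; the two feature functions and the weights are illustrative assumptions, not the features used on the slides.

```python
import numpy as np

def edge_potential(yi, yj, sim, weights):
    """Log-linear edge compatibility psi_ij(y_i, y_j) = exp(w . f(y_i, y_j, x)).
    'sim' stands in for some observed, image- or document-dependent similarity."""
    f = np.array([
        1.0 if yi == yj else 0.0,     # label-agreement feature
        sim if yi == yj else 0.0,     # agreement weighted by the observed similarity
    ])
    return np.exp(weights @ f)

print(edge_potential(1, 1, sim=0.8, weights=np.array([0.5, 1.5])))
```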
    • Restricted potential functions
      • For some applications (esp in vision), we often use a Potts model of the form [formula shown as an image; a standard form is given below]
      • We can generalize this for ordered labels (eg discretization of continuous states)
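The Potts formula itself appears only as an image on the slide; a standard way to write it, stated here as an assumption, is
\[
\psi_{ij}(y_i, y_j) = \exp\bigl(-\beta\,\mathbb{1}[y_i \neq y_j]\bigr),
\]
and a common generalization for ordered labels is a truncated distance penalty such as \(\psi_{ij}(y_i, y_j) = \exp\bigl(-\beta \min(|y_i - y_j|, \ell)\bigr)\).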
    • Learning CRFs
      • If the log likelihood is [formula shown as an image]
      • then the gradient is [formula shown as an image; see the sketch below]: gradient = features – expected features, summed over cliques, with tied params
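The likelihood and gradient formulas were images on the slide; the standard expressions for a log-linear CRF with tied parameters w and clique features f_c, given here as an assumption, are
\[
\ell(\mathbf{w}) = \sum_n \Bigl[ \sum_c \mathbf{w}^\top \mathbf{f}_c(\mathbf{y}_n, \mathbf{x}_n) - \log Z(\mathbf{x}_n, \mathbf{w}) \Bigr],
\qquad
\nabla_{\mathbf{w}} \ell = \sum_n \Bigl[ \sum_c \mathbf{f}_c(\mathbf{y}_n, \mathbf{x}_n) - \mathbb{E}_{p(\mathbf{y} \mid \mathbf{x}_n, \mathbf{w})} \sum_c \mathbf{f}_c(\mathbf{y}, \mathbf{x}_n) \Bigr],
\]
i.e. observed features minus expected features, matching the slide's summary.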
    • Learning CRFs
      • Given the gradient, one can find the global optimum using first- or second-order optimization methods, such as
        • Conjugate gradient
        • Limited memory BFGS
        • Stochastic meta descent (SMD)?
      • The bottleneck is computing the expected features needed for the gradient
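A minimal sketch of handing an objective and its gradient to an off-the-shelf limited-memory BFGS routine (scipy's L-BFGS-B); the toy binary logistic regression objective is an illustrative stand-in for a CRF negative log likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)

def neg_log_lik_and_grad(w):
    """Negative conditional log likelihood and gradient (stand-in for a CRF objective)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return nll, X.T @ (p - y)

res = minimize(neg_log_lik_and_grad, np.zeros(4), jac=True, method="L-BFGS-B")
print(res.x)
```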
    • Exact inference
      • For 1D chains, one can compute P(y_i, y_{i+1} | x) exactly in O(N K^2) time using belief propagation (BP = forwards-backwards algorithm; sketch below)
      • For restricted potentials (eg ψ_ij a function only of the label difference), one can do this in O(N K) time using FFT-like tricks
      • This can be generalized to trees.
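A minimal sketch of the forwards-backwards (sum-product) recursion on a chain with tied pairwise potentials, returning node marginals in O(N K^2) time; the toy potentials are illustrative assumptions.

```python
import numpy as np

def forward_backward(local_ev, edge_pot):
    """local_ev: T x K local evidence; edge_pot: K x K pairwise potential (tied).
    Returns the T x K node marginals P(y_t | x)."""
    T, K = local_ev.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = local_ev[0] / local_ev[0].sum()
    for t in range(1, T):                                    # forwards pass
        alpha[t] = local_ev[t] * (alpha[t - 1] @ edge_pot)
        alpha[t] /= alpha[t].sum()                           # normalize for stability
    for t in range(T - 2, -1, -1):                           # backwards pass
        beta[t] = edge_pot @ (local_ev[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    marg = alpha * beta
    return marg / marg.sum(axis=1, keepdims=True)

# toy usage: 5 positions, 3 labels, potentials favouring label agreement
ev = np.random.default_rng(0).random((5, 3)) + 0.1
pot = np.ones((3, 3)) + 2.0 * np.eye(3)
print(forward_backward(ev, pot))
```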
    • Sum-product vs max-product
      • We use sum-product to compute marginal probabilities needed for learning
      • We use max-product to find the most probable assignment (Viterbi decoding)
      • We can also compute max-marginals
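For contrast with sum-product, a minimal sketch of max-product (Viterbi) decoding on the same kind of chain; the toy inputs are again illustrative.

```python
import numpy as np

def viterbi(local_ev, edge_pot):
    """Return the most probable label sequence for a chain with T x K local
    evidence and a tied K x K pairwise potential, working in log space."""
    T, K = local_ev.shape
    score = np.log(local_ev[0])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + np.log(edge_pot)     # K x K candidate scores
        back[t] = cand.argmax(axis=0)                # best predecessor for each label
        score = cand.max(axis=0) + np.log(local_ev[t])
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

ev = np.random.default_rng(0).random((5, 3)) + 0.1
pot = np.ones((3, 3)) + 2.0 * np.eye(3)
print(viterbi(ev, pot))
```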
    • Complexity of exact inference: in general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).
    • Approximate sum-product algorithms (pairwise potentials; N = num nodes, K = num states, I = num iterations)
      • BP (exact iff tree): general potentials, O(N K^2 I)
      • BP + FFT (exact iff tree): restricted potentials, O(N K I)
      • Generalized BP: general potentials, O(N K^(2c) I), c = cluster size
      • Gibbs: general potentials, O(N K I)
      • Swendsen-Wang: general potentials, O(N K I)
      • Mean field: general potentials, O(N K I)
    • Approximate max-product algorithms (pairwise potentials; N = num nodes, K = num states, I = num iterations)
      • BP (exact iff tree): general potentials, O(N K^2 I)
      • BP + DT (exact iff tree): restricted potentials, O(N K I)
      • Generalized BP: general potentials, O(N K^(2c) I), c = cluster size
      • Graph-cuts (exact iff K = 2): restricted potentials, O(N^2 K I) [?]
      • ICM (iterated conditional modes): general potentials, O(N K I)
      • SLS (stochastic local search): general potentials, O(N K I)
    • Learning intractable CRFs
      • We can use approximate inference and hope the gradient is “good enough”.
        • If we use max-product, we are doing “Viterbi training” (cf perceptron rule)
      • Or we can use other techniques, such as pseudo likelihood, which does not need inference.
    • Pseudo-likelihood
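The pseudo-likelihood formula was an image on the slide; the standard objective, given here as an assumption, replaces the joint conditional with a product of per-node conditionals given the neighbours' observed labels,
\[
\ell_{\mathrm{PL}}(\mathbf{w}) = \sum_n \sum_i \log p\bigl(y_{ni} \mid \mathbf{y}_{n,\mathcal{N}(i)}, \mathbf{x}_n, \mathbf{w}\bigr),
\]
so only local normalization over single nodes is required and no global inference is needed.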
    • Software for inference and learning in 1D CRFs
      • Various packages
        • Mallet (McCallum et al) – Java
        • Crf.sourceforge.net (Sarawagi, Cohen) – Java
        • My code – matlab (just a toy, not integrated with BNT)
        • Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning).
      • Nothing standard, emphasis on NLP apps
    • Software for inference in general CRFs/ MRFs
      • Max-product : C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al
        • “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
      • Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference)
      • Sum-product: various other ad hoc pieces
        • My matlab BP code (MRF2)
        • Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs)
        • Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)
    • Software for learning general MRFs/CRFs
      • Hardly any!
        • Parise’s matlab code (approx gradient, pseudo likelihood, CD, etc)
        • My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)
    • Structure of ideal toolbox [diagram: trainData, testData, generator/GUI/file, learnEngine, model, infEngine, queries, probDist, decisionEngine, decide, decision, N-best list, plus utilities for visualize, summarize, performance]
    • Structure of BNT [the same diagram instantiated for BNT: models are graphs + CPDs (cell arrays, node ids); learnEngines include EM and StructuralEM; infEngines include Jtree, BP, VarElim, MCMC; outputs are arrays, Gaussians, or samples, N = 1 (MAP); decisions via LIMID, Jtree, VarElim and policies; LeRay, Shan]
    • Outline
      • Discriminative models for iid data
      • Beyond iid data: conditional random fields
      • Beyond supervised learning: generative models
      • Beyond optimization: Bayesian models
    • Unsupervised learning: why?
      • Labeling data is time-consuming.
      • Often not clear what label to use.
      • Complex objects often not describable with a single discrete label.
      • Humans learn without labels.
      • Want to discover novel patterns/ structure.
    • Unsupervised learning: what?
      • Clusters (eg GMM)
      • Low dim manifolds (eg PCA)
      • Graph structure (eg biology, social networks)
      • “Features” (eg maxent models of language and texture)
      • “Objects” (eg sprite models in vision)
    • Unsupervised learning of objects from video (Frey and Jojic; Williams and Titsias; et al)
    • Unsupervised learning: issues
      • Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression).
      • Local minima (non convex objective).
      • Uses inference as a subroutine (can be slow, but no worse than discriminative learning)
    • Unsupervised learning: how?
      • Construct a generative model (eg a Bayes net).
      • Perform inference.
      • May have to use approximations such as maximum likelihood and BP.
      • Cannot use max likelihood for model selection…
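As a small, self-contained example of fitting a generative model by (approximate) maximum likelihood, here is a sketch of EM for a 1D Gaussian mixture; the data, number of components, and iteration count are illustrative assumptions.

```python
import numpy as np

def em_gmm_1d(x, K=2, iters=50, seed=0):
    """EM for a 1D Gaussian mixture model (a very simple generative model)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)       # initial means from the data
    var = np.full(K, x.var())                       # initial variances
    pi = np.full(K, 1.0 / K)                        # mixing weights
    for _ in range(iters):
        # E step: responsibilities r[n, k] = p(z_n = k | x_n)
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from responsibility-weighted data
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var

# toy usage: two well-separated clusters
x = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 1, 200)])
print(em_gmm_1d(x))
```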
    • A comparison of BN software: www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html
    • Popular BN software
      • BNT (matlab)
      • Intel’s PNL (C++)
      • Hugin (commercial)
      • Netica (commercial)
      • GMTk (free .exe from Jeff Bilmes)
    • Outline
      • Discriminative models for iid data
      • Beyond iid data: conditional random fields
      • Beyond supervised learning: generative models
      • Beyond optimization: Bayesian models
    • Bayesian inference: why?
      • It is optimal.
      • It can easily incorporate prior knowledge (esp. useful for small n, large p problems).
      • It properly reports confidence in output (useful for combining estimates, and for risk-averse applications).
      • It separates models from algorithms.
    • Bayesian inference: how?
      • Since we want to integrate, we cannot use max-product.
      • Since the unknown parameters are continuous, we cannot use sum-product.
      • But we can use EP (expectation propagation), which is similar to BP.
      • We can also use variational inference.
      • Or MCMC (eg Gibbs sampling).
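A minimal sketch of MCMC for Bayesian parameter inference: Gibbs sampling for the mean and precision of Gaussian data under conjugate Normal and Gamma priors. The model, priors, and hyperparameters are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gibbs_gaussian(x, iters=2000, mu0=0.0, tau0=1.0, a0=1.0, b0=1.0, seed=0):
    """Alternately sample mu | tau, data and tau | mu, data (both conjugate)."""
    rng = np.random.default_rng(seed)
    n, xsum = len(x), x.sum()
    mu, tau = x.mean(), 1.0 / x.var()                 # start from data-based values
    samples = []
    for _ in range(iters):
        # mu | tau, x  is Normal
        prec = tau0 + n * tau
        mean = (tau0 * mu0 + tau * xsum) / prec
        mu = rng.normal(mean, 1.0 / np.sqrt(prec))
        # tau | mu, x  is Gamma (numpy's gamma takes shape and scale = 1/rate)
        a = a0 + 0.5 * n
        b = b0 + 0.5 * ((x - mu) ** 2).sum()
        tau = rng.gamma(a, 1.0 / b)
        samples.append((mu, tau))
    return np.array(samples)

# toy usage: posterior means of (mu, tau) after discarding burn-in
x = np.random.default_rng(1).normal(5.0, 2.0, size=100)
print(gibbs_gaussian(x)[500:].mean(axis=0))
```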
    • General purpose Bayesian software
      • BUGS (Gibbs sampling)
      • VIBES (variational message passing)
      • Minka and Winn’s toolbox (infer.net)
    • Structure of ideal Bayesian toolbox [diagram as before: trainData, testData, generator/GUI/file, learnEngine, model, infEngine, queries, probDist, decisionEngine, decide, decision, plus utilities for visualize, summarize, performance]