This document provides an overview of Frank Nielsen's talk on pattern learning and recognition using information geometry and statistical manifolds. The talk focuses on departing from vector space representations and dealing with (dis)similarities that do not have Euclidean or metric properties. This poses new theoretical and computational challenges for pattern recognition. The talk describes using exponential family mixture models defined on dually flat statistical manifolds induced by convex functions. On these manifolds, dual coordinate systems and dual affine geodesics allow for computing-friendly representations of divergences and similarities between probabilistic patterns. The techniques aim to achieve statistical invariance and enable algorithmic approaches to problems like Gaussian mixture modeling, shape retrieval, and diffusion tensor imaging analysis.
Slides: A glance at information-geometric signal processing - Frank Nielsen
This document discusses information geometry and its applications in statistical signal processing. It introduces several key concepts:
1) Statistical signal processing models data with probability distributions like Gaussians and histograms. Information geometry provides a geometric framework for intuitive reasoning about these statistical models.
2) Exponential family mixture models generalize Gaussian and Rayleigh mixtures and are algorithmically useful in dually flat spaces.
3) Distances between statistical models, like Kullback-Leibler divergence and Bregman divergences, can be interpreted geometrically in terms of convex conjugates and Legendre transformations.
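To make the divergence-geometry connection in point 3 concrete, here is a minimal Python/NumPy sketch (illustrative names, not from any particular library) of a Bregman divergence built from a convex generator; choosing the negative Shannon entropy as generator recovers the generalized Kullback-Leibler divergence.

    import numpy as np

    def bregman_divergence(F, grad_F, x, y):
        """B_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
        return F(x) - F(y) - np.dot(grad_F(y), x - y)

    # Negative Shannon entropy generator on positive vectors:
    F = lambda p: np.sum(p * np.log(p) - p)
    grad_F = lambda p: np.log(p)

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.3, 0.4, 0.3])
    # Equals the generalized KL divergence sum(p*log(p/q) - p + q),
    # which reduces to the usual KL for normalized histograms.
    print(bregman_divergence(F, grad_F, p, q))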
1. The document discusses the author's research in three areas: graph-based clustering methods, approximate Bayesian computation (ABC), and Bayesian computation using empirical likelihood.
2. For graph-based clustering, the author presents asymptotic results for spectral clustering as the number of data points and bandwidth approach infinity.
3. For ABC, the author discusses sequential ABC algorithms and challenges of model choice and high-dimensional summary statistics. Machine learning methods are proposed to analyze simulated ABC data.
4. For empirical likelihood, the author proposes using it for Bayesian computation when the likelihood is intractable and simulation is infeasible, as it provides correct confidence intervals unlike composite likelihoods.
This document summarizes different approaches for structure learning in graph neural networks. It discusses three main classes of methods: 1) metric-based learning which learns a similarity matrix between nodes, 2) probabilistic models which learn the parameters of a distribution over graphs, and 3) direct optimization which directly optimizes the graph adjacency matrix. The document provides examples of methods within each class and notes challenges such as the simplicity of probabilistic models and computational difficulties of direct optimization.
From RNN to neural networks for cyclic undirected graphs - tuxette
This document discusses different neural network methods for processing graph-structured data. It begins by describing recurrent neural networks (RNNs) and their limitations for graphs, such as an inability to handle undirected or cyclic graphs. It then summarizes two alternative approaches: one that uses contraction maps to allow recurrent updates on arbitrary graphs, and one that employs a constructive architecture with frozen neurons to avoid issues with cycles. Both methods aim to make predictions at the node or graph level on relational data like molecules or web pages.
This document summarizes an approach for incremental learning-to-learn with statistical guarantees. It proposes an online algorithm for linear feature learning across multiple tasks, modeled as a distribution over distributions. The algorithm performs projected stochastic gradient descent on the empirical risk to learn a low-rank feature mapping. A statistical analysis shows the algorithm converges to the optimal feature mapping, with a bound that improves on independent task learning when the covariance across tasks is small. The approach is related to multitask learning with trace norm regularization.
Inference for stochastic differential equations via approximate Bayesian comp... - Umberto Picchini
Despite the title, the methods are appropriate for more general dynamical models (including state-space models). Presentation given at Nordstat 2012, Umeå. Relevant research paper at http://arxiv.org/abs/1204.5459 and software code at https://sourceforge.net/projects/abc-sde/
This document summarizes generative models like VAEs and GANs. It begins with an introduction to information theory, defining key concepts like entropy and maximum likelihood estimation. It then explains generative models as estimating the joint distribution P(X,Y), in contrast to discriminative models, which estimate P(Y|X). VAEs are discussed as maximizing the evidence lower bound (ELBO) to estimate the latent-variable distribution P(Z|X), allowing generation of new X values. GANs are also covered, defining their minimax game between a generator G and discriminator D, with G learning to generate samples resembling the empirical data distribution P_emp.
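As a reminder of the quantity being maximized, the ELBO can be written as follows (standard VAE notation, not transcribed from the slides themselves):

    \log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr] \;-\; \mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)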
This document provides an outline and introduction to the topics of pattern recognition and machine learning. It begins with an overview of key concepts like probability theory, decision theory, and the curse of dimensionality. It then covers specific techniques like polynomial curve fitting, the Gaussian distribution, and Bayesian curve fitting. The document also includes an appendix on properties of matrices such as determinants, matrix derivatives, and the eigenvector equation.
This document describes the Space Alternating Data Augmentation (SADA) algorithm, an efficient Markov chain Monte Carlo method for sampling from posterior distributions. SADA extends the Data Augmentation algorithm by introducing multiple sets of missing data, with each set corresponding to a subset of model parameters. These are sampled in a "space alternating" manner to improve convergence. The document applies SADA to finite mixtures of Gaussians, introducing different types of missing data to update parameter subsets. Simulation results show SADA provides better mixing and convergence than standard Data Augmentation.
A discussion on sampling graphs to approximate network classification functions - LARCA UPC
The problem of network classification consists of assigning a finite set of labels to the nodes of a graph; the underlying assumption is that nodes with the same label tend to be connected via strong paths in the graph. This is similar to the assumptions made by graph-based semi-supervised learning algorithms, which build an artificial graph from vectorial data. Such semi-supervised algorithms are based on label-propagation principles, and their accuracy relies heavily on the structure (presence of edges) in the graph.
In this talk I will discuss ideas on how to sample the network graph, thus sparsifying its structure in order to apply semi-supervised algorithms and compute the classification function on the network efficiently. I will show very preliminary experiments indicating that the sampling technique has an important effect on the final results, and discuss open theoretical and practical questions that remain to be solved.
This document discusses macrocanonical models for texture synthesis. It begins by introducing the goal of texture synthesis and providing a brief history. It then describes the parametric question of combining randomness and structure in images. Specifically, it discusses maximizing entropy under geometric constraints. The document goes on to discuss links to statistical physics, defining microcanonical and macrocanonical models. It focuses on studying the macrocanonical model, describing how to find optimal parameters through gradient descent and how to sample from the model using Langevin dynamics. The document provides examples of texture synthesis and compares results to other methods.
A short and naive introduction to using network in prediction models - tuxette
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
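A minimal sketch of the second approach (regularizing a linear model directly with the Laplacian), assuming the model's coefficients correspond to nodes of a known, unweighted network; illustrative Python/NumPy, not the talk's actual code.

    import numpy as np

    def laplacian_ridge(X, y, A, lam):
        """Solve min_b ||y - X b||^2 + lam * b' L b, where L = D - A is
        the graph Laplacian of the feature network (0/1 adjacency A)."""
        L = np.diag(A.sum(axis=1)) - A
        return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

    # The penalty b' L b equals the sum over edges (i, j) of (b_i - b_j)^2,
    # so connected features (e.g. genes linked in a known gene network)
    # are encouraged to contribute similarly to the prediction.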
My data are incomplete and noisy: Information-reduction statistical methods f... - Umberto Picchini
We review parameter inference for stochastic modelling in complex scenarios, such as poor parameter initialization and near-chaotic dynamics. We show how state-of-the-art methods for state-space models can fail while, in some situations, reducing the data to summary statistics (information reduction) enables robust estimation. Wood's synthetic likelihood method is reviewed, and the lecture closes with an example of approximate Bayesian computation methodology.
Accompanying code is available at https://github.com/umbertopicchini/pomp-ricker and https://github.com/umbertopicchini/abc_g-and-k
Readership lecture given at Lund University on 7 June 2016. The lecture is of a popular-science nature, hence mathematical detail is kept to a minimum; however, numerous links and references are offered for further reading.
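A minimal sketch of the synthetic likelihood idea reviewed above (fit a Gaussian to simulated summary statistics, then evaluate the observed summaries under it); Python/NumPy with SciPy, with the simulator and summary function left as model-specific placeholders.

    import numpy as np
    from scipy.stats import multivariate_normal

    def synthetic_loglik(theta, s_obs, simulate, summarize, n_sim=500, rng=None):
        """Wood-style synthetic log-likelihood of observed summaries s_obs.
        `simulate(theta, rng)` and `summarize(data)` are placeholders."""
        rng = rng or np.random.default_rng()
        S = np.array([summarize(simulate(theta, rng)) for _ in range(n_sim)])
        mu = S.mean(axis=0)                     # Gaussian fit to summaries
        Sigma = np.cov(S, rowvar=False)
        return multivariate_normal(mu, Sigma, allow_singular=True).logpdf(s_obs)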
Approximate Bayesian computation for the Ising/Potts model - Matt Moores
This document provides an introduction to Approximate Bayesian Computation (ABC). ABC is a likelihood-free method for approximating posterior distributions when the likelihood function is intractable or expensive to evaluate. The document outlines the basic ABC rejection sampling algorithm and discusses extensions like using summary statistics, ABC-MCMC, and ABC sequential Monte Carlo. It also applies ABC to parameter inference for a hidden Potts model used in Bayesian image segmentation.
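The basic rejection sampler the document outlines fits in a few lines; a hedged Python sketch with a generic prior sampler, simulator, summary function, and tolerance (all placeholders).

    import numpy as np

    def abc_rejection(s_obs, prior_sample, simulate, summarize, eps, n_samples, rng=None):
        """Keep parameter draws whose simulated summaries land within
        tolerance eps of the observed summaries (Euclidean distance)."""
        rng = rng or np.random.default_rng()
        accepted = []
        while len(accepted) < n_samples:
            theta = prior_sample(rng)
            s_sim = summarize(simulate(theta, rng))
            if np.linalg.norm(np.asarray(s_sim) - np.asarray(s_obs)) <= eps:
                accepted.append(theta)
        return np.array(accepted)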
This document summarizes results on analyzing stochastic gradient descent (SGD) algorithms for minimizing convex functions. It shows that a continuous-time version of SGD (SGD-c) can strongly approximate the discrete-time version (SGD-d) under certain conditions. It also establishes that SGD attains the minimax-optimal convergence rate O(t^{-1/2}) when the step-size exponent is α = 1/2, using an "averaging from the past" procedure that closes the gap between previous lower- and upper-bound results.
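A toy illustration of the "averaging from the past" idea (Polyak-Ruppert averaging of SGD iterates with step size proportional to t^{-1/2}), here on a simple quadratic; illustrative Python, not the paper's code.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, avg = np.zeros(2), np.zeros(2)
    target = np.array([1.0, -2.0])    # minimizer of f(x) = ||x - target||^2 / 2

    for t in range(1, 10_001):
        grad = (theta - target) + rng.normal(scale=0.5, size=2)  # noisy gradient
        theta -= t ** -0.5 * grad                                # alpha = 1/2 step size
        avg += (theta - avg) / t                                 # running average of iterates

    print(avg)   # close to target; the averaged iterate attains the O(t^{-1/2}) rate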
1. The document presents Plug-and-Play priors for Bayesian imaging using Langevin-based sampling methods.
2. It introduces the Bayesian framework for image restoration and discusses challenges in modeling the prior.
3. A Plug-and-Play approach is proposed that uses an implicit prior defined by a denoising network in conjunction with Langevin sampling, termed PnP-ULA. Experiments demonstrate its effectiveness on image deblurring and inpainting tasks.
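A schematic of a single PnP-ULA iteration, in which the intractable prior score is replaced by a denoiser residual: a simplified sketch (the paper's exact scaling constants and projection step are omitted), with the likelihood gradient and denoiser D left abstract.

    import numpy as np

    def pnp_ula_step(x, grad_loglik, denoiser, delta, eps, rng):
        """One (simplified) PnP-ULA step: Langevin dynamics where the prior
        score is approximated by the denoiser residual (D(x) - x) / eps."""
        noise = rng.normal(size=x.shape)
        return (x
                + delta * grad_loglik(x)                 # data-fidelity gradient
                + (delta / eps) * (denoiser(x) - x)      # implicit-prior score
                + np.sqrt(2 * delta) * noise)            # Langevin noise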
Quantitative Propagation of Chaos for SGD in Wide Neural Networks - Valentin De Bortoli
The document discusses quantitative analysis of stochastic gradient descent (SGD) for training wide neural networks. It presents two different regimes - a deterministic regime where the limiting dynamics is described by an ordinary differential equation, and a stochastic regime where the limiting dynamics is a stochastic differential equation. Experiments on MNIST classification show that the stochastic regime with larger step sizes exhibits better regularization properties. The analysis provides insights into the behavior of neural network training as the number of neurons becomes large.
This document describes a statistical framework for interactive image category search based on mental matching. The framework allows a user to search an unstructured image database for a target category that exists only as a "mental picture" by providing feedback on displayed image sets. At each iteration, the system selects images to maximize the information gained from the user's response. The goal is to minimize the number of iterations needed to display an image from the target category. Experiments showed the approach was effective on databases containing tens of thousands of images across several semantic categories.
Pre-computation for ABC in image analysis - Matt Moores
MCMSki IV (the 5th IMS-ISBA joint meeting)
January 2014
Chamonix Mont-Blanc, France
The associated journal article has now been uploaded to arXiv: http://arxiv.org/abs/1403.4359
Some sampling techniques for big data analysis - Jae-kwang Kim
This document describes different sampling techniques for big data analysis, including reservoir sampling and its variants. It provides an example to illustrate simple random sampling and calculates the expected value and variance of sampling errors. It then discusses probability sampling and its advantages over non-probability sampling. The document also introduces survey sampling and challenges in the era of big data, as well as how sampling techniques can still be useful for handling big data. It outlines reservoir sampling and two methods to improve it: balanced reservoir sampling and stratified reservoir sampling. A simulation study is described to compare the performance of these reservoir sampling methods.
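The classical reservoir sampling scheme (Algorithm R) that the variants above build on fits in a few lines; a Python sketch over an arbitrary data stream.

    import random

    def reservoir_sample(stream, k, rng=random):
        """Maintain a uniform random sample of size k from a stream
        of unknown length, in a single pass."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randint(0, i)       # inclusive on both ends
                if j < k:
                    reservoir[j] = item     # replace with probability k/(i+1)
        return reservoir

    print(reservoir_sample(range(10_000), k=5))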
This document summarizes the Bayesian Additive Regression Trees (BART) model and the Monotone BART (MBART) extension. BART approximates an unknown function using an ensemble of regression trees with a regularization prior. It connects to ideas in Bayesian nonparametrics, dynamic random basis elements, and gradient boosting. The document outlines the BART MCMC algorithm and how it can provide automatic uncertainty quantification and variable selection. It then introduces MBART, which constrains trees to be monotonic, and describes an MCMC algorithm for fitting MBART models. Examples illustrate BART and MBART fits to simulated monotonic and non-monotonic functions.
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
Tailored Bregman Ball Trees for Effective Nearest Neighbors - Frank Nielsen
This document presents an improved Bregman ball tree (BB-tree++) for efficient nearest neighbor search using Bregman divergences. The BB-tree++ speeds up construction using Bregman 2-means++ initialization and adapts the branching factor. It also handles symmetrized Bregman divergences and prioritizes closer nodes. Experiments on image retrieval with SIFT descriptors show the BB-tree++ outperforms the original BB-tree and random sampling, providing faster approximate nearest neighbor search.
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
80 Percent Of Homeowners Behind On Mortgage Ineligible For Loan Modification ... - observantoutfit37
The Treasury Department report found that 80% of homeowners behind on their mortgages are ineligible for the government's mortgage modification program under Making Home Affordable. Only 900,000 homeowners, fewer than 20% of those delinquent on their mortgages, qualify for the program. Critics argue the program is ineffective, as it is designed to help only a small portion of homeowners and fell short of its original goals. Common reasons for ineligibility include the type of loan, the property not being a primary residence, monthly payments already deemed affordable, or the homeowner having abandoned the property.
Teachers in Colombia went on strike to demand salary increases and better access to health care. The government agreed to negotiate if the teachers returned to classes after nearly a week on strike. In the end, the government and the teachers' federation agreed on 12% salary increases for teachers and changes to how they are evaluated, ending a 15-day strike that affected nearly nine million students.
This document deals with efficacy, efficiency and effectiveness in business processes. It introduces the importance of optimizing resources to reduce costs, obtain greater profits and meet business objectives. It explains that management is a science developed over time and enriched by business experience, taught and applied in organizations.
This document presents a table of contents for a book on learning Java and object-oriented programming. The index includes introductory sections on the Java virtual machine, the Java development kit and basic commands, as well as sections on initial coding, static methods, arrays, objects and other topics. The document provides a general guide to the content covered in the book.
The document discusses how the iMentorCorps program can help improve military recruiting efforts and reduce costs. It notes that recruiting and training recruits costs $25,000-$50,000 each, but one-third of recruits fail to complete their term, costing the military billions per year. iMentorCorps aims to match recruits with recruiters who share their interests, provide test preparation to help recruits qualify, and structure a two-month dialogue between each recruit and recruiter to better determine fit and commitment. This is expected to reduce early attrition, save billions per year, and create a more stable military force.
This document describes a project to build a tachometer using an Arduino. The project uses an infrared emitter, an infrared receiver, an Arduino and other components. It provides detailed instructions on how to connect the components and program the Arduino to measure RPM.
MS Outlook user manual, by Erick Israel Reyna González, 6° F - Erick Reyna
This document provides a general introduction to Microsoft Outlook, including its main features such as managing email, contacts, tasks, appointments and meetings. It explains how to link an email account, manage incoming and outgoing mail, and create items such as contacts, appointments, tasks and meetings. It also describes Outlook's graphical interface and the advantages of using it to organize personal and professional information.
(Models and theories) Theory: from novice to expert - diamiarieldoris
Patricia Benner's theory describes the stages a nurse passes through, from novice to expert, based on clinical experience. It identifies five stages: 1) novice, 2) advanced beginner, 3) competent, 4) proficient and 5) expert. The theory holds that clinical knowledge grows with experience and that each nurse develops unique skills. It provides a guide for improving nursing practice.
1) The document appears to be a chapter from a story about the Iron family and their daily lives and challenges.
2) It describes various interactions between family members, such as the children bickering, getting in trouble at school, and growing older.
3) The narrator provides occasional commentary on the family and moves the story along by having the family members age up or go to school over the course of the chapter.
The document describes the formation and importance of coal, focusing on coal mining at Cerrejón, the world's largest open-pit coal mine, located in La Guajira, Colombia. Cerrejón's coal is of high quality due to Colombia's unique geological position. Although mining generates profits, it also has significant negative impacts on the environment and local communities through forest loss, water pollution, and the
The European Union has agreed on an oil embargo against Russia in response to the invasion of Ukraine. The embargo will ban seaborne imports of Russian oil into the EU and end pipeline deliveries within six months. The measure is part of a sixth EU sanctions package intended to increase economic pressure on Moscow and deprive the Kremlin of funds to finance its war.
Computational Information Geometry: A quick review (ICMS) - Frank Nielsen
From the workshop
Computational information geometry for image and signal processing
Sep 21, 2015 - Sep 25, 2015
ICMS, 15 South College Street, Edinburgh
http://www.icms.org.uk/workshop.php?id=343
Patch Matching with Polynomial Exponential Families and Projective Divergences - Frank Nielsen
This document presents a method called Polynomial Exponential Family Patch Matching (PEF-PM) to solve the patch-matching problem. PEF-PM models patch colors using polynomial exponential families (PEFs), which can model arbitrary smooth positive densities. It estimates PEFs using a score matching estimator and accelerates batch estimation using summed area tables. Patch similarity is measured using a statistical projective divergence, the symmetrized γ-divergence. Experiments show PEF-PM handles noise and symmetries robustly and outperforms baseline methods.
Information-theoretic clustering with applications - Frank Nielsen
Abstract: Clustering is a fundamental and key primitive to discover structural groups of homogeneous data in data sets, called clusters. The most famous clustering technique is the celebrated k-means clustering that seeks to minimize the sum of intra-cluster variances. k-Means is NP-hard as soon as the dimension and the number of clusters are both greater than 1. In the first part of the talk, we first present a generic dynamic programming method to compute the optimal clustering of n scalar elements into k pairwise disjoint intervals. This case includes 1D Euclidean k-means but also other kinds of clustering algorithms like the k-medoids, the k-medians, the k-centers, etc.
We extend the method to incorporate cluster size constraints and show how to choose the appropriate number of clusters using model selection. We then illustrate and refine the method on two case studies: 1D Bregman clustering and univariate statistical mixture learning maximizing the complete likelihood. In the second part of the talk, we introduce a generalization of k-means to cluster sets of histograms that has become an important ingredient of modern information processing due to the success of the bag-of-word modelling paradigm.
Clustering histograms can be performed using the celebrated k-means centroid-based algorithm. We consider the Jeffreys divergence that symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We prove that the Jeffreys centroid can be expressed analytically using the Lambert W function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms and conclude with some remarks on the k-means histogram clustering.
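A compact illustration of the generic dynamic program from the first part of the talk, instantiated with the k-means (sum of squared deviations) interval cost; a Python sketch using prefix sums, O(k n^2) time, not the authors' implementation.

    import numpy as np

    def optimal_1d_kmeans(x, k):
        """Exact 1D k-means via dynamic programming over intervals.
        Returns the optimal total within-cluster sum of squares."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        s1 = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums
        s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])

        def cost(i, j):  # SSE of the cluster x[i:j] (half-open interval)
            m = j - i
            s = s1[j] - s1[i]
            return (s2[j] - s2[i]) - s * s / m

        INF = float("inf")
        dp = np.full((k + 1, n + 1), INF)
        dp[0, 0] = 0.0
        for c in range(1, k + 1):          # number of clusters used so far
            for j in range(c, n + 1):      # first j points clustered
                dp[c, j] = min(dp[c - 1, i] + cost(i, j) for i in range(c - 1, j))
        return dp[k, n]

Swapping in another interval cost (median absolute deviation, Bregman information, negative complete likelihood) yields the k-medians, Bregman, or mixture-learning variants mentioned above.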
References:
- Optimal interval clustering: Application to Bregman clustering and statistical mixture learning. IEEE ISIT 2014 (recent-results poster). http://arxiv.org/abs/1403.2485
- Jeffreys Centroids: A Closed-Form Expression for Positive Histograms and a Guaranteed Tight Approximation for Frequency Histograms. IEEE Signal Process. Lett. 20(7): 657-660 (2013). http://arxiv.org/abs/1303.7286
http://www.i.kyoto-u.ac.jp/informatics-seminar/
This document discusses various methods for estimating normalizing constants that arise when evaluating integrals numerically. It begins by noting there are many computational methods for approximating normalizing constants across different communities. It then lists the topics that will be covered in the upcoming workshop, including discussions on estimating constants using Monte Carlo methods and Bayesian versus frequentist approaches. The document provides examples of estimating normalizing constants using Monte Carlo integration, reverse logistic regression, and Xiao-Li Meng's maximum likelihood estimation approach. It concludes by discussing some of the challenges in bringing a statistical framework to constant estimation problems.
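A toy version of the simplest estimator discussed: approximate the normalizing constant Z = ∫ f̃(x) dx of an unnormalized density by importance sampling from a tractable proposal q; illustrative Python with SciPy.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    f_tilde = lambda x: np.exp(-0.5 * x ** 2)   # unnormalized N(0,1); true Z = sqrt(2*pi)
    q = norm(loc=0.0, scale=2.0)                # proposal covering the target

    xs = q.rvs(size=100_000, random_state=rng)
    Z_hat = np.mean(f_tilde(xs) / q.pdf(xs))    # E_q[f_tilde / q] = Z
    print(Z_hat, np.sqrt(2 * np.pi))            # the two numbers should be close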
Slides: Hypothesis testing, information divergence and computational geometry - Frank Nielsen
Bayesian multiple hypothesis testing can be viewed from the perspective of computational geometry. The probability of error can be upper bounded by divergences such as the total variation and Chernoff distance. When the hypotheses are distributions from an exponential family, the optimal MAP Bayesian rule is a nearest neighbor classifier on an additive Bregman Voronoi diagram. For binary hypotheses, the best error exponent is the Chernoff information, which is a Bregman divergence on the exponential family manifold. This viewpoint generalizes to multiple hypotheses, where the best error exponent comes from the closest Bregman pair of distributions.
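For reference, the Chernoff information between two densities, which the summary identifies as the best achievable error exponent, is (standard definition, not transcribed from the slides):

    C(p, q) \;=\; \max_{\alpha \in (0,1)} \Bigl(-\log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, \mathrm{d}\mu(x)\Bigr)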
This document summarizes research on solving stochastic partial differential equations (SPDEs) using an adaptive multi-element polynomial chaos method (MEPCM) with discrete measures. Key points include:
1) MEPCM uses polynomial chaos expansions and numerical integration to solve SPDEs with parametric uncertainty.
2) Orthogonal polynomials are generated for discrete measures using various methods like Vandermonde, Stieltjes, and Lanczos.
3) Numerical integration is tested on discrete measures using Genz functions in 1D and sparse grids in higher dimensions.
4) The method is demonstrated on the KdV equation with random initial conditions. Future work includes applying these techniques to SPDEs driven
The document discusses distributed online convex optimization algorithms for coordinating multiple agents. It presents a coordination algorithm where each agent performs proportional-integral feedback to minimize local objectives while sharing information with neighbors over noisy communication channels. The algorithm is proven to achieve exponential convergence of second moments to the optimal solution and an ultimate bound on the error that depends on the noise level. Simulation results on a medical diagnosis example are also presented to illustrate the algorithm's behavior.
This document provides an introduction to Bayesian analysis and probabilistic modeling. It begins with an overview of Bayes' theorem and common probability distributions used in Bayesian modeling like the Bernoulli, binomial, beta, Dirichlet, and multinomial distributions. It then discusses how these distributions can be used in Bayesian modeling for problems like estimating probabilities based on observed data. Specifically, it explains how conjugate prior distributions allow the posterior distribution to be of the same family as the prior. The document concludes by discussing how neural networks can quantify classification uncertainty by outputting evidence for different classes modeled with a Dirichlet distribution.
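A one-screen illustration of the conjugacy the document describes, using the Beta-Bernoulli pair: a Beta(a, b) prior updated with coin-flip data stays in the Beta family; illustrative Python with SciPy.

    import numpy as np
    from scipy.stats import beta

    a, b = 2.0, 2.0                       # Beta prior pseudo-counts
    data = np.array([1, 0, 1, 1, 0, 1])   # Bernoulli observations

    heads, tails = data.sum(), len(data) - data.sum()
    posterior = beta(a + heads, b + tails)   # conjugate update: still a Beta

    print(posterior.mean())                  # posterior estimate of P(heads)
    print(posterior.interval(0.95))          # 95% credible interval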
This document summarizes a presentation on matrix computations in machine learning. It discusses how matrix computations are used in traditional machine learning algorithms like regression, PCA, and spectral graph partitioning. It also covers specialized machine learning algorithms that involve numerical optimization and matrix computations, such as Lasso, non-negative matrix factorization, sparse PCA, matrix completion, and co-clustering. Finally, it discusses how machine learning can be applied to improve graph clustering algorithms.
Multiple estimators for Monte Carlo approximations - Christian Robert
This document discusses multiple estimators that can be used to approximate integrals using Monte Carlo simulations. It begins by introducing concepts like multiple importance sampling, Rao-Blackwellisation, and delayed acceptance that allow combining multiple estimators to improve accuracy. It then discusses approaches like mixtures as proposals, global adaptation, and nonparametric maximum likelihood estimation (NPMLE) that frame Monte Carlo estimation as a statistical estimation problem. The document notes various advantages of the statistical formulation, like the ability to directly estimate simulation error from the Fisher information. Overall, the document presents an overview of different techniques for combining Monte Carlo simulations to obtain more accurate integral approximations.
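A brief sketch of multiple importance sampling with the balance heuristic, one of the combination devices mentioned; Python/NumPy, with the integrand and proposals as stand-ins.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    f = lambda x: np.exp(-x ** 2) * (x > 0)          # integrand
    proposals = [norm(0, 1), norm(1, 0.5)]           # two proposal densities
    n_each = 50_000

    estimate = 0.0
    for q in proposals:
        x = q.rvs(size=n_each, random_state=rng)
        # Balance heuristic: weight each draw by the mixture of all
        # proposals, which keeps the combined estimator unbiased.
        mix = sum(p.pdf(x) for p in proposals) / len(proposals)
        estimate += np.mean(f(x) / mix) / len(proposals)

    print(estimate)   # approximates the integral of f (here sqrt(pi)/2)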
This talk introduces the theory behind Bayesian deep learning, a topic of recent interest, along with its recent applications. It briefly explains the theory of Bayesian inference and then presents the theory and applications of Yarin Gal's Monte Carlo Dropout.
This document presents a computational framework based on transported meshfree methods. It discusses using Monte Carlo integration with kernels to estimate integration errors. Two types of kernels are introduced: lattice-based kernels suited for Lebesgue measures, and transported kernels where a transport map is applied. An example shows optimal discrepancy errors can be achieved for Monte Carlo integration with a Matern kernel in various dimensions. The framework is applied to machine learning problems, showing how kernels can be used for interpolation and extrapolation of observations with error bounds.
Bayesian inference for mixed-effects models driven by SDEs and other stochast... - Umberto Picchini
An important, and well studied, class of stochastic models is given by stochastic differential equations (SDEs). In this talk, we consider Bayesian inference based on measurements from several individuals, to provide inference at the "population level" using mixed-effects modelling. We consider the case where dynamics are expressed via SDEs or other stochastic (Markovian) models. Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that account for (i) the intrinsic random variability in the latent states dynamics, as well as (ii) the variability between individuals, and also (iii) account for measurement error. This flexibility gives rise to methodological and computational difficulties.
Fully Bayesian inference for nonlinear SDEMEMs is complicated by the typical intractability of the observed-data likelihood, which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameters of interest. The algorithm is made computationally efficient through careful use of blocking strategies, particle filters (sequential Monte Carlo) and correlated pseudo-marginal approaches. The resulting methodology is flexible and general, and is able to deal with a large class of nonlinear SDEMEMs [1]. In more recent work [2], we also explored ways to make inference even more scalable to an increasing number of individuals, while also dealing with state-space models driven by stochastic dynamic models other than SDEs, e.g. Markov jump processes and the nonlinear solvers typically used in systems biology.
[1] S. Wiqvist, A. Golightly, AT McLean, U. Picchini (2020). Efficient inference for stochastic differential mixed-effects models using correlated particle pseudo-marginal algorithms, CSDA, https://doi.org/10.1016/j.csda.2020.107151
[2] S. Persson, N. Welkenhuysen, S. Shashkova, S. Wiqvist, P. Reith, G. W. Schmidt, U. Picchini, M. Cvijovic (2021). PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models, bioRxiv doi:10.1101/2021.07.01.450748.
The document summarizes key concepts in image formation, including how light interacts with objects and lenses to form images, and how different imaging systems like the human eye and digital cameras work. It discusses factors that affect image quality such as point spread functions and noise. Methods for analyzing the effects of noise propagation and algorithms on image quality are presented, such as error propagation techniques and Monte Carlo simulations.
Data fusion is the process of combining data from different sources to enhance the utility of the combined product. In remote sensing, input data sources are typically massive, noisy, and have different spatial supports and sampling characteristics. We take an inferential approach to this data fusion problem: we seek to infer a true but not directly observed spatial (or spatio-temporal) field from heterogeneous inputs. We use a statistical model to make these inferences, but like all models it is at least somewhat uncertain. In this talk, we will discuss our experiences with the impacts of these uncertainties and some potential ways of addressing them.
Monte Carlo methods rely on repeated random sampling to compute results. They generate random samples from a population according to a probability distribution and use them to obtain numerical results. The founders of the Monte Carlo method were J. von Neumann and S. Ulam during the Manhattan Project in the 1940s. Monte Carlo methods can be used to solve multidimensional integrals and have better convergence than classical numerical integration methods for dimensions greater than 4. The variance of Monte Carlo estimates decreases as 1/N, where N is the number of samples, resulting in slow convergence. Variance reduction techniques can improve the convergence rate.
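A toy demonstration of that variance behavior: estimate π by Monte Carlo and watch the error shrink roughly as 1/sqrt(N); illustrative Python.

    import numpy as np

    rng = np.random.default_rng(3)
    for n in (10 ** 3, 10 ** 5, 10 ** 7):
        pts = rng.random((n, 2))
        # Fraction of points inside the quarter disk estimates pi / 4.
        pi_hat = 4 * np.mean((pts ** 2).sum(axis=1) <= 1.0)
        print(n, pi_hat, abs(pi_hat - np.pi))   # error drops ~ 1/sqrt(n)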
Multi Model Ensemble (MME) predictions are a popular ad-hoc technique for improving predictions of high-dimensional, multi-scale dynamical systems. The heuristic idea behind MME framework is simple: given a collection of models, one considers predictions obtained through the convex superposition of the individual probabilistic forecasts in the hope of mitigating model error. However, it is not obvious if this is a viable strategy and which models should be included in the MME forecast in order to achieve the best predictive performance. I will present an information-theoretic approach to this problem which allows for deriving a sufficient condition for improving dynamical predictions within the MME framework; moreover, this formulation gives rise to systematic and practical guidelines for optimising data assimilation techniques which are based on multi-model ensembles. Time permitting, the role and validity of “fluctuation-dissipation” arguments for improving imperfect predictions of externally perturbed non-autonomous systems - with possible applications to climate change considerations - will also be addressed.
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B... - NTNU
The introduction of expert knowledge when learning Bayesian networks from data is known to be an excellent way to boost the performance of automatic learning methods, especially when data are scarce. Previous approaches to this problem based on Bayesian statistics introduce the expert knowledge by modifying the prior probability distributions. In this study, we propose a new methodology based on Monte Carlo simulation which starts with non-informative priors and requires knowledge from the expert a posteriori, when the simulation ends. We also explore a new importance sampling method for Monte Carlo simulation and the definition of new non-informative priors for the structure of the network. All these approaches are experimentally validated on five standard Bayesian networks.
Read more:
http://link.springer.com/chapter/10.1007%2F978-3-642-14049-5_70
Maximum likelihood estimation of regularisation parameters in inverse problem... - Valentin De Bortoli
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
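A heavily simplified sketch of the SOUL alternation described above: a few unadjusted Langevin steps targeting the posterior for fixed θ, followed by a projected stochastic gradient step on θ. This sketch assumes a positively homogeneous regularizer g, in which case the gradient of the prior's log normalizing constant has a closed form d/(α θ); all function handles and constants are illustrative placeholders, not the paper's code.

    import numpy as np

    def soul(x0, theta0, grad_f, grad_g, g, d, alpha_hom,
             n_iter=1000, m_ula=10, delta=1e-4, step=1e-3,
             theta_min=1e-3, theta_max=1e3, rng=None):
        """Alternate ULA sampling of x ~ p(x | y, theta) with projected
        stochastic gradient ascent on the regularization parameter theta."""
        rng = rng or np.random.default_rng()
        x, theta = x0.copy(), theta0
        for _ in range(n_iter):
            for _ in range(m_ula):   # ULA targeting exp(-f(x) - theta * g(x))
                noise = rng.normal(size=x.shape)
                x = x - delta * (grad_f(x) + theta * grad_g(x)) \
                      + np.sqrt(2 * delta) * noise
            # For alpha-homogeneous g on R^d (assumption):
            # d/dtheta log p(y | theta) ~= d / (alpha_hom * theta) - E[g(x) | y, theta]
            grad_theta = d / (alpha_hom * theta) - g(x)
            theta = np.clip(theta + step * grad_theta, theta_min, theta_max)
        return theta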
Similar to Pattern learning and recognition on statistical manifolds: An information-geometric review
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI with Open AI's advanced natural language processing capabilities as a test automation solution.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Monitoring Java Application Security with JDK Tools and JFR Events
Pattern learning and recognition on statistical manifolds: An information-geometric review
1. Pattern learning and recognition on statistical manifolds: An information-geometric review
Frank Nielsen
Frank.Nielsen@acm.org
www.informationgeometry.org
Sony Computer Science Laboratories, Inc.
July 2013, SIMBAD, York, UK.
2. Praise for Computational Information Geometry
◮ What is Information? = Essence of data (datum = “thing”) (make it tangible → e.g., parameters of generative models)
◮ Can we do Intrinsic computing? (unbiased by any particular “data representation” → same results after recoding data)
◮ Geometry −→ Science of invariance (mother of Science, compass & ruler, Descartes analytic = coordinate/Cartesian, imaginaries, ...).
The open-ended poetic mathematics...
3. Rationale for Computational Information Geometry
◮ Information is ... never void! → lower bounds:
  ◮ Cramér-Rao lower bound and Fisher information (estimation)
  ◮ Bayes error and Chernoff information (classification)
  ◮ Coding and Shannon entropy (communication)
  ◮ Program and Kolmogorov complexity (compression). (Unfortunately not computable!)
◮ Geometry:
  ◮ Language (point, line, ball, dimension, orthogonal, projection, geodesic, immersion, etc.)
  ◮ Power of characterization (e.g., intersection of two pseudo-segments not admitting a closed-form expression)
◮ Computing: Information computing. Seeking mathematical convenience and mathematical tricks (e.g., kernel). Do you know the “space of functions”?! (Infinity and 1-to-1 mapping, language vs continuum)
4. This talk: Information-geometric Pattern Recognition
◮ Focus on statistical pattern recognition ↔ geometric computing.
◮ Consider probability distribution families (parameter spaces) and statistical manifolds. Beyond traditional Riemannian and Hilbert sphere representations (p → √p).
◮ Describe dually flat spaces induced by convex functions:
  ◮ Legendre transformation → dual and mixed coordinate systems
  ◮ Dual similarities/divergences
  ◮ Computing-friendly dual affine geodesics
5. SIMBAD 2013
Information-geometric Pattern Recognition
By departing from vector-space representations one is confronted with the challenging problem of dealing with (dis)similarities that do not necessarily possess the Euclidean behavior or not even obey the requirements of a metric. The lack of the Euclidean and/or metric properties undermines the very foundations of traditional pattern recognition theories and algorithms, and poses totally new theoretical/computational questions and challenges.
Thank you to Prof. Edwin Hancock (U. York, UK) and Prof. Marcello Pelillo (U. Venice, IT).
6. Statistical Pattern Recognition
Models data with distributions (generative) or stochastic processes:
◮ parametric (Gaussians, histograms) [model size ∼ D],
◮ semi-parametric (mixtures) [model size ∼ kD],
◮ non-parametric (kernel density estimators [model size ∼ n], Dirichlet/Gaussian processes [model size ∼ D log n]).
Data = Pattern (→ information) + noise (independent)
Statistical machine learning hot topic: deep learning (restricted Boltzmann machines).
7. Example I: Information-geometric PR (I)
Pattern = Gaussian mixture models (universal class)
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: pi(x) ∼ N(µi, Σi), y = A(x) = Lx + t, yi ∼ N(Lµi + t, LΣiL⊤), D(X1 : X2) = D(Y1 : Y2) (L: any invertible linear map, t: a translation).
Shape Retrieval using Hierarchical Total Bregman Soft Clustering [11], IEEE PAMI, 2012.
8. Example II: Information-geometric PR
DTI: diffusion ellipsoids, tensor interpolation.
Pattern = zero-centered “Gaussians”
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: D(A⊤PA : A⊤QA) = D(P : Q), for A ∈ SL(d), the special linear group (volume/orientation preserving).
[Figure: 3D rat corpus callosum.]
Total Bregman Divergence and its Applications to DTI Analysis [34], IEEE TMI, 2011.
9. Statistical mixtures: Generative models of data sets
GMM = feature descriptor for information retrieval (IR) → classification [21], matching, etc.
Increase dimension using color image patches.
Low-frequency information encoded into a compact statistical model.
Generative model → statistical image by GMM sampling.
→ A mixture Σ_{i=1}^k wi N(µi, Σi) is interpreted as a weighted point set in a parameter space: {(wi, θi = (µi, Σi))}_{i=1}^k.
10. Statistical invariance
Riemannian structure (M, g) on {p(x; θ) | θ ∈ Θ ⊂ R^D}
◮ θ-Invariance under non-singular reparameterization: ρ(p(x; θ), p(x; θ′)) = ρ(p(x; λ(θ)), p(x; λ(θ′))). Normal parameterization (µ, σ) or (µ, σ²) yields the same distance.
◮ x-Invariance under different x-representation: sufficient statistics (Fisher, 1922): Pr(X = x | T(X) = t, θ) = Pr(X = x | T(X) = t). All information for θ is contained in T → lossless information data reduction (exponential families).
Markov kernel = statistical morphism (Chentsov 1972, [6, 7]). A particular Markov kernel is a deterministic mapping T : X → Y with y = T(x), p_y = p_x T⁻¹.
Invariance if and only if g ∝ Fisher information matrix.
11. f-divergences (1960’s)
A statistical non-metric distance between two probability measures:
If(p : q) = ∫ f(p(x)/q(x)) q(x) dx
f: continuous convex function with f(1) = 0, f′(1) = 0, f″(1) = 1.
→ asymmetric (not a metric, except TV), modulo affine term.
→ can always be symmetrized using s = f + f∗, with f∗(x) = x f(1/x).
f-divergences include many well-known statistical measures: Kullback-Leibler, α-divergences, Hellinger, chi-squared, total variation (TV), etc.
f-divergences are the only statistical divergences that preserve equivalence wrt. the sufficient statistic mapping:
If(p : q) ≥ If(pM : qM)
with equality if and only if M = T (monotonicity property).
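As a concrete complement (not part of the original deck), a minimal numeric sketch: with the normalized generator f(u) = u log u − u + 1 (which satisfies f(1) = 0, f′(1) = 0, f″(1) = 1), the f-divergence recovers the Kullback-Leibler divergence on normalized discrete distributions.

```python
# Sketch: the f-divergence with the normalized KL generator agrees with the
# direct KL sum on normalized discrete distributions (they differ only by an
# affine term in f, which vanishes when sum(p) = sum(q) = 1).
import numpy as np

def f_divergence(p, q, f):
    """I_f(p : q) = sum_x q(x) f(p(x)/q(x)) for discrete distributions."""
    return float(np.sum(q * f(p / q)))

f_kl = lambda u: u * np.log(u) - u + 1.0   # f(1)=0, f'(1)=0, f''(1)=1

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

kl_direct = float(np.sum(p * np.log(p / q)))
print(f_divergence(p, q, f_kl), kl_direct)   # both ~0.0946
```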
12. Information geometry: Dually flat spaces, α-geometry
Statistical invariance is also obtained using (M, g, ∇, ∇∗), where ∇ and ∇∗ are dual affine connections.
The Riemannian structure (M, g) is the particular case ∇ = ∇∗ = ∇⁰, the Levi-Civita connection: (M, g) = (M, g, ∇⁰, ∇⁰).
Dually flat spaces (α = ±1) are algorithmically friendly:
◮ Statistical mixtures of exponential families
◮ Learning & simplifying mixtures (k-MLE)
◮ Bregman Voronoi diagrams & dually ⊥ triangulations
Goal: algorithmics of Gaussians/histograms wrt. the Kullback-Leibler divergence.
13. Exponential Family Mixture Models (EFMMs)
Generalize Gaussian & Rayleigh MMs to many usual distributions.
m(x) = Σ_{i=1}^k wi pF(x; λi), with ∀i wi > 0 and Σ_{i=1}^k wi = 1,
pF(x; λ) = e^{⟨t(x), θ⟩ − F(θ) + k(x)}
F: log-Laplace transform (partition, cumulant function):
F(θ) = log ∫_{x∈X} e^{⟨t(x), θ⟩ + k(x)} dx,
Θ = {θ | ∫_{x∈X} e^{⟨t(x), θ⟩ + k(x)} dx < ∞}: the natural parameter space.
◮ d: dimension of the support X.
◮ D: order of the family (= dim Θ). Statistic: t(x) : R^d → R^D.
14. Statistical mixtures: Rayleigh MMs [33, 22]
IntraVascular UltraSound (IVUS) imaging.
Rayleigh distribution (Weibull with k = 2): p(x; λ) = (x/λ²) e^{−x²/(2λ²)}, x ∈ R⁺
d = 1 (univariate), D = 1 (order 1)
θ = −1/(2λ²), Θ = (−∞, 0)
F(θ) = −log(−2θ), t(x) = x², k(x) = log x
Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): for segmentation and classification tasks.
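A quick sanity check of this decomposition (an illustrative sketch, not from the deck): evaluating the Rayleigh density directly and through its canonical form e^{⟨t(x), θ⟩ − F(θ) + k(x)} gives identical values.

```python
# Sketch: Rayleigh density via its exponential-family canonical decomposition
# with theta = -1/(2*lam^2), F(theta) = -log(-2*theta), t(x) = x^2, k(x) = log x.
import numpy as np

def rayleigh_pdf(x, lam):
    return (x / lam**2) * np.exp(-x**2 / (2 * lam**2))

def rayleigh_ef(x, lam):
    theta = -1.0 / (2 * lam**2)          # natural parameter
    F = -np.log(-2 * theta)              # log-normalizer F(theta)
    return np.exp(x**2 * theta - F + np.log(x))   # e^{<t(x),theta> - F + k(x)}

x, lam = 1.7, 0.8
print(rayleigh_pdf(x, lam), rayleigh_ef(x, lam))  # identical values
```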
15. Statistical mixtures: Gaussian MMs [8, 22, 9]
Gaussian mixture models (GMMs): model low frequency.
Color image interpreted as a 5D xyRGB point set.
Gaussian distribution: p(x; µ, Σ) = (1/((2π)^{d/2} √|Σ|)) e^{−(1/2) D_{Σ⁻¹}(x − µ, x − µ)}
Squared Mahalanobis distance: DQ(x, y) = (x − y)ᵀ Q (x − y)
x ∈ R^d (multivariate), D = d(d+3)/2 (order)
θ = (Σ⁻¹µ, (1/2)Σ⁻¹) = (θv, θM), Θ = R^d × S++^d
F(θ) = (1/4) θvᵀ θM⁻¹ θv − (1/2) log |θM| + (d/2) log π
t(x) = (x, −x xᵀ), k(x) = 0
16. Sampling from a Gaussian Mixture Model
To sample a variate x from a GMM:
◮ Choose a component l according to the weight distribution w1, ..., wk,
◮ Draw a variate x according to N(µl, Σl).
→ Sampling is a doubly stochastic process:
◮ throw a biased die with k faces to choose the component: l ∼ Multinomial(w1, ..., wk) (the multinomial is also an EF: normalized histogram),
◮ then draw at random a variate x from the l-th component: x ∼ Normal(µl, Σl), via x = µl + Cz with the Cholesky factorization Σl = CCᵀ and z = [z1 ... zd]ᵀ a standard normal random vector: zi = √(−2 log U1) cos(2πU2) (Box-Muller).
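The doubly stochastic scheme translates directly into a few lines of NumPy (an illustrative sketch; the test mixture is made up, and the library's Gaussian sampler stands in for hand-rolled Box-Muller):

```python
# Sketch: sample a GMM by (1) picking a component from the multinomial, then
# (2) pushing a standard normal vector through the Cholesky factor.
import numpy as np

def sample_gmm(weights, mus, sigmas, n, rng=np.random.default_rng(0)):
    """weights: (k,), mus: (k, d), sigmas: (k, d, d). Returns (n, d) samples."""
    k, d = mus.shape
    labels = rng.choice(k, size=n, p=weights)        # l ~ Multinomial(w_1..w_k)
    chols = [np.linalg.cholesky(S) for S in sigmas]  # Sigma_l = C C^T
    z = rng.standard_normal((n, d))                  # z ~ N(0, I)
    return np.array([mus[l] + chols[l] @ z[i] for i, l in enumerate(labels)])

w = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigma = np.array([[[1.0, 0.3], [0.3, 1.0]], [[0.5, 0.0], [0.0, 2.0]]])
x = sample_gmm(w, mu, Sigma, 1000)
print(x.mean(axis=0))   # close to 0.3*mu_1 + 0.7*mu_2 = (2.8, 2.8)
```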
17. Relative entropy for exponential families
◮ Distance between features (e.g., GMMs):
◮ Kullback-Leibler divergence (cross-entropy minus entropy):
KL(P : Q) = ∫ p(x) log (p(x)/q(x)) dx ≥ 0
          = ∫ p(x) log (1/q(x)) dx − ∫ p(x) log (1/p(x)) dx = H×(P : Q) − H(P), with H(P) = H×(P : P)
          = F(θQ) − F(θP) − ⟨θQ − θP, ∇F(θP)⟩
          = BF(θQ : θP)
◮ The Bregman divergence BF is defined for a strictly convex and differentiable function (up to some affine terms).
The proof that KL(P : Q) = BF(θQ : θP) follows from X ∼ EF(θ) =⇒ E[t(X)] = ∇F(θ).
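A numeric sketch of this identity for univariate Gaussians (illustrative; the parameterization follows slide 15 specialized to d = 1, with t(x) = (x, −x²)):

```python
# Sketch: for univariate Gaussians with theta = (mu/s2, 1/(2*s2)) and
# F(theta) = theta1^2/(4*theta2) - 0.5*log(theta2) + 0.5*log(pi),
# KL(P : Q) equals the Bregman divergence B_F(theta_Q : theta_P).
import numpy as np

def F(th):
    return th[0]**2 / (4 * th[1]) - 0.5 * np.log(th[1]) + 0.5 * np.log(np.pi)

def gradF(th):   # gradF(theta_P) = eta_P = E_P[t(X)] = (mu, -(mu^2 + s2))
    return np.array([th[0] / (2 * th[1]),
                     -th[0]**2 / (4 * th[1]**2) - 1 / (2 * th[1])])

def theta(mu, s2):
    return np.array([mu / s2, 1 / (2 * s2)])

def bregman(t1, t2):
    return F(t1) - F(t2) - np.dot(t1 - t2, gradF(t2))

mup, s2p, muq, s2q = 0.0, 1.0, 1.0, 2.0
kl_closed = 0.5 * (np.log(s2q / s2p) + (s2p + (mup - muq)**2) / s2q - 1)
kl_bregman = bregman(theta(muq, s2q), theta(mup, s2p))
print(kl_closed, kl_bregman)   # same value (~0.3466)
```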
18. Convex duality: Legendre transformation
◮ For a strictly convex and differentiable function F : X → R:
F∗(y) = sup_{x∈X} {⟨y, x⟩ − F(x)} = sup_{x∈X} lF(y; x)
◮ Maximum obtained for y = ∇F(x): ∇x lF(y; x) = y − ∇F(x) = 0 ⇒ y = ∇F(x)
◮ Maximum unique from the convexity of F (∇²F ≻ 0): ∇x² lF(y; x) = −∇²F(x) ≺ 0
◮ Convex conjugates: (F, X) ⇔ (F∗, Y), with Y = {∇F(x) | x ∈ X}
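A small numeric sketch (not from the deck): approximating F∗ by a grid supremum for F(x) = x log x − x, whose conjugate is known in closed form, F∗(y) = e^y.

```python
# Sketch: Legendre conjugate by brute-force grid supremum,
# checked against the closed form F*(y) = exp(y) for F(x) = x*log(x) - x.
import numpy as np

F = lambda x: x * np.log(x) - x
xs = np.linspace(1e-6, 60.0, 600_001)    # grid over the domain X = (0, inf)

def F_star(y):
    return np.max(y * xs - F(xs))        # sup_x { <y, x> - F(x) }

for y in [-1.0, 0.0, 1.0, 2.0]:
    print(y, F_star(y), np.exp(y))       # grid sup ~ closed form e^y
```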
19. Legendre duality: Geometric interpretation
Consider the epigraph of F as a convex object:
◮ convex hull (V-representation), versus
◮ half-space (H-representation).
[Figure: dual coordinate systems. The tangent line at P̂ = (xP, F(xP)) is HP: z = (x − xP)F′(xP) + F(xP), with slope yP = F′(xP); it intersects the vertical axis at (0, F(xP) − xP F′(xP)) = (0, −F∗(yP)).]
The Legendre transform is also called the “slope” transform.
20. Legendre duality & canonical divergence
◮ Convex conjugates have functionally inverse gradients: ∇F⁻¹ = ∇F∗. ∇F∗ may require numerical approximation (not always available in analytical closed form).
◮ Involution: (F∗)∗ = F, with ∇F∗ = (∇F)⁻¹.
◮ Convex conjugate F∗ expressed using (∇F)⁻¹:
F∗(y) = ⟨x, y⟩ − F(x), with x = ∇_y F∗(y)
      = ⟨(∇F)⁻¹(y), y⟩ − F((∇F)⁻¹(y))
◮ The Fenchel-Young inequality is at the heart of the canonical divergence:
F(x) + F∗(y) ≥ ⟨x, y⟩
AF(x : y) = AF∗(y : x) = F(x) + F∗(y) − ⟨x, y⟩ ≥ 0
21. Dual Bregman divergences & canonical divergence [26]
KL(P : Q) = EP[log (p(x)/q(x))] ≥ 0
          = BF(θQ : θP) = BF∗(ηP : ηQ)
          = F(θQ) + F∗(ηP) − ⟨θQ, ηP⟩
          = AF(θQ : ηP) = AF∗(ηP : θQ)
with θQ the natural parameterization and ηP = EP[t(X)] = ∇F(θP) the moment parameterization.
KL(P : Q) = ∫ p(x) log (1/q(x)) dx − ∫ p(x) log (1/p(x)) dx = H×(P : Q) − H(P), with H(P) = H×(P : P).
Shannon cross-entropy and entropy of an EF [26]:
H×(P : Q) = F(θQ) − ⟨θQ, ∇F(θP)⟩ − EP[k(x)]
H(P) = F(θP) − ⟨θP, ∇F(θP)⟩ − EP[k(x)]
H(P) = −F∗(ηP) − EP[k(x)]
22. Bregman divergence: Geometric interpretation (I)
Potential function F, graph plot F: (x, F(x)).
DF(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩
23. Bregman divergence: Geometric interpretation (III)
Bregman divergence and path integrals:
B(θ1 : θ2) = F(θ1) − F(θ2) − ⟨θ1 − θ2, ∇F(θ2)⟩   (1)
           = ∫_{θ2}^{θ1} ⟨∇F(t) − ∇F(θ2), dt⟩   (2)
           = ∫_{η1}^{η2} ⟨∇F∗(t) − ∇F∗(η1), dt⟩   (3)
           = B∗(η2 : η1)   (4)
with η = ∇F(θ).
24. Bregman divergence: Geometric interpretation (II)
Potential function f, graph plot F: (x, f(x)).
Bf(p||q) = f(p) − f(q) − (p − q) f′(q)
[Figure: Bf(·||q) is the vertical distance between the hyperplane Hq tangent to F at the lifted point q̂ and the translated hyperplane at p̂.]
25. Total Bregman divergence (tBD)
By analogy to least squares and total least squares:
total Bregman divergence (tBD) [11, 34, 12]
δf(x, y) = bf(x, y) / √(1 + ‖∇f(y)‖²)
Proved statistical robustness of tBD.
26. Bregman sided centroids [25, 21]
Bregman centroids = unique minimizers of average Bregman divergences (BF is convex in its right argument):
θ̄ = argmin_θ (1/n) Σ_{i=1}^n BF(θi : θ)
θ̄′ = argmin_θ (1/n) Σ_{i=1}^n BF(θ : θi)
θ̄ = (1/n) Σ_{i=1}^n θi: the center of mass, independent of F.
θ̄′ = (∇F)⁻¹((1/n) Σ_{i=1}^n ∇F(θi))
→ Generalized Kolmogorov-Nagumo f-means.
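For instance (an illustrative sketch with the generator F(x) = Σ x log x − x, i.e., Kullback-Leibler geometry on positive arrays; ∇F = log and (∇F)⁻¹ = exp), the two sided centroids become the arithmetic and geometric means, as in the table on the next slide:

```python
# Sketch: sided Bregman centroids for the generalized-KL generator.
import numpy as np

points = np.array([[0.2, 0.5, 0.3],
                   [0.1, 0.1, 0.8],
                   [0.6, 0.2, 0.2]])

right_centroid = points.mean(axis=0)                  # argmin_c mean B_F(p_i : c)
left_centroid = np.exp(np.log(points).mean(axis=0))   # argmin_c mean B_F(c : p_i)

print(right_centroid)   # arithmetic mean (center of mass, independent of F)
print(left_centroid)    # coordinate-wise geometric mean
```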
27. Bregman divergences BF and ∇F-means
Bijection: quasi-arithmetic means (∇F) ⇔ Bregman divergences BF.
A Bregman divergence BF (entropy/loss function F) corresponds via f = F′ and f⁻¹ = (F′)⁻¹ to an f-mean (generalized mean):

Bregman divergence BF                                    F(x)          f = F′    f⁻¹      f-mean
Squared Euclidean distance (half squared loss)           x²/2          x         x        Arithmetic mean (1/n) Σ_j xj
Kullback-Leibler divergence (ext. neg. Shannon entropy)  x log x − x   log x     exp x    Geometric mean (Π_j xj)^{1/n}
Itakura-Saito divergence (Burg entropy)                  −log x        −1/x      −1/x     Harmonic mean n / Σ_j (1/xj)

∇F is strictly increasing (like cumulative distribution functions).
28. Bregman sided centroids [25]
The two sided centroids C̄ and C̄′ are expressed using the two θ/η coordinate systems, giving 4 equations:
C̄ : (θ̄, η̄′)   and   C̄′ : (θ̄′, η̄)
C̄ : θ̄ = (1/n) Σ_{i=1}^n θi,   η̄′ = ∇F(θ̄)
C̄′ : η̄ = (1/n) Σ_{i=1}^n ηi,   θ̄′ = ∇F∗(η̄)
29. Bregman centroids and Jacobian fields
The centroid zeroes the left-sided Jacobian vector field:
∇θ Σ_t wt A(θ : ηt) = 0
Sum of tangent vectors of geodesics from the centroid to the points = 0.
Since ∇θ A(θ : η′) = η − η′, we get ∇θ (Σ_t wt A(θ : ηt)) = η − η̄, with Σ_t wt = 1 and η̄ = Σ_t wt ηt.
30. Bregman information [25]
Bregman information = minimum of the loss function:
IF(P) = (1/n) Σ_{i=1}^n BF(θi : θ̄)
      = (1/n) Σ_{i=1}^n F(θi) − F(θ̄) − ⟨(1/n) Σ_{i=1}^n (θi − θ̄), ∇F(θ̄)⟩   [the last term = 0]
      = (1/n) Σ_{i=1}^n F(θi) − F(θ̄)
      = JF(θ1, ..., θn)
Jensen diversity index (e.g., Jensen-Shannon for F(x) = x log x).
◮ For the squared Euclidean distance, Bregman information = cluster variance,
◮ For the Kullback-Leibler divergence, Bregman information is related to mutual information.
31. Bregman k-means clustering [4]
Bregman k-means: find k centers C = {C1, ..., Ck} that minimize the loss function:
LF(P : C) = Σ_{P∈P} BF(P : C), with BF(P : C) = min_{i∈{1,...,k}} BF(P : Ci)
→ generalizes Lloyd’s quadratic error in Vector Quantization (VQ).
LF(P : C) = IF(P) − IF(C)
IF(P) → total Bregman information
IF(C) → between-cluster Bregman information
LF(P : C) → within-cluster Bregman information
total Bregman information = within-cluster Bregman information + between-cluster Bregman information
32. Bregman k-means clustering [4]
IF(P) = LF(P : C) + IF(C)
Bregman clustering amounts to finding the partition C∗ that minimizes the information loss:
L∗F = LF(P : C∗) = min_C (IF(P) − IF(C))
Bregman k-means:
◮ Initialize distinct seeds: C1 = P1, ..., Ck = Pk
◮ Repeat until convergence:
  ◮ Assign each point P to its closest centroid: Ci = {P ∈ P | BF(P : Ci) ≤ BF(P : Cj) ∀j ≠ i}
  ◮ Update cluster centroids by taking their centers of mass: Ci = (1/|Ci|) Σ_{P∈Ci} P.
The loss function monotonically decreases and converges to a local optimum. (Extend to weighted point sets using barycenters.)
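A compact sketch of this loop (illustrative, not the authors' implementation; the generator and test data are placeholders), with an empty-cluster guard:

```python
# Sketch: Bregman k-means with a generic generator F; shown with the
# generalized-KL generator F(x) = sum x*log(x) - x on positive data.
import numpy as np

def bregman_kmeans(X, k, F, gradF, iters=50, rng=np.random.default_rng(0)):
    C = X[rng.choice(len(X), size=k, replace=False)]      # distinct seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: B_F(x : c) = F(x) - F(c) - <x - c, gradF(c)>
        D = np.array([[F(x) - F(c) - (x - c) @ gradF(c) for c in C] for x in X])
        labels = D.argmin(axis=1)
        # Update step: right-sided centroid = center of mass of each cluster
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return C, labels

F_kl = lambda x: float(np.sum(x * np.log(x) - x))   # generalized KL generator
gradF_kl = np.log

X = np.abs(np.random.default_rng(1).normal(5.0, 1.0, size=(200, 3)))
centers, labels = bregman_kmeans(X, k=3, F=F_kl, gradF=gradF_kl)
```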
33. Bregman k-means++ [1]: Careful seeding (only?!)
(also called Bregman k-medians, since it minimizes min_x (1/n) Σ_i BF(pi : x))
Extends the D²-initialization of k-means++.
The seeding stage alone yields a probabilistically guaranteed global approximation factor.
Bregman k-means++:
◮ Choose C = {Pl} for l uniformly random in {1, ..., n}
◮ While |C| < k:
  ◮ Choose P ∈ P with probability BF(P : C) / Σ_{i=1}^n BF(Pi : C) = BF(P : C) / LF(P : C)
→ Yields an O(log k) approximation factor (with high probability). The constant in O(·) depends on the ratio of min/max ∇²F.
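The seeding step in code (an illustrative sketch reusing the B_F convention of the k-means sketch above):

```python
# Sketch: Bregman k-means++ seeding. Each new seed is drawn with probability
# proportional to its divergence B_F(x : C) to the current seed set.
import numpy as np

def bregman_kmeanspp_seeds(X, k, F, gradF, rng=np.random.default_rng(0)):
    bf = lambda x, c: F(x) - F(c) - (x - c) @ gradF(c)
    seeds = [X[rng.integers(len(X))]]                  # first seed: uniform
    while len(seeds) < k:
        d = np.array([min(bf(x, c) for c in seeds) for x in X])  # B_F(x : C)
        seeds.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(seeds)
```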
34. Anisotropic Voronoi diagram (for MVN MMs) [10, 14]
Learning mixtures: see the talk of O. Schwander on EM, k-MLE.
From the source color image (a), we build a 5D GMM with k = 32 components, and color each pixel with the mean color of the anisotropic Voronoi cell it belongs to (∼ weighted squared Mahalanobis distance per center).
log pF(x; θi) ∝ −BF∗(t(x) : ηi) + k(x)
Most likely component segmentation ≡ Bregman Voronoi diagram (squared Mahalanobis for Gaussians).
35. Voronoi diagrams in dually flat spaces...
Voronoi diagram, dual ⊥ Delaunay triangulation (general position)
36. Bregman dual bisectors: Hyperplanes & hypersurfaces [5, 24, 27]
Right-sided bisector → hyperplane (θ-hyperplane):
HF(p, q) = {x ∈ X | BF(x : p) = BF(x : q)}
HF: ⟨∇F(p) − ∇F(q), x⟩ + (F(p) − F(q) + ⟨q, ∇F(q)⟩ − ⟨p, ∇F(p)⟩) = 0
Left-sided bisector → hypersurface (η-hyperplane):
H′F(p, q) = {x ∈ X | BF(p : x) = BF(q : x)}
H′F: ⟨∇F(x), q − p⟩ + F(p) − F(q) = 0
37. Visualizing Bregman bisectors
Primal coordinates θ (natural parameters) and dual coordinates η (expectation parameters).
[Figure: Bregman bisectors drawn in the source space and in the gradient space, for the logistic loss (Bernoulli gradient space) and for Itakura-Saito. E.g., for the logistic loss, D(p, q) = 0.4956 while D(q, p) = 0.6065, and the dual divergences swap: D∗(p′, q′) = 0.6065, D∗(q′, p′) = 0.4956.]
38. Bregman Voronoi diagrams as minimization diagrams [5]
A subclass of affine diagrams which have all non-empty cells.
Minimization diagram of the n functions Di(x) = BF(x : pi) = F(x) − F(pi) − ⟨x − pi, ∇F(pi)⟩
≡ minimization of n linear functions: Hi(x) = ⟨pi − x, ∇F(pi)⟩ − F(pi)
39. Bregman dual Delaunay triangulations
[Figure: ordinary Delaunay, exponential, and Hellinger-like dual Delaunay triangulations.]
◮ empty Bregman sphere property,
◮ geodesic triangles.
BVDs extend Euclidean Voronoi diagrams with similar complexity/algorithms.
40. Non-commutative Bregman orthogonality
3-point property (generalized law of cosines):
BF(p : r) = BF(p : q) + BF(q : r) − ⟨p − q, ∇F(r) − ∇F(q)⟩
[Figure: triangle P, Q, R with geodesics Γ∗PQ and ΓQR, and divergences DF(P||R), DF(P||Q), DF(Q||R).]
(pq)θ is Bregman orthogonal to (qr)η iff BF(p : r) = BF(p : q) + BF(q : r)
(equivalent to ⟨θp − θq, ηr − ηq⟩ = 0).
Extends the Pythagorean theorem: (pq)θ ⊥F (qr)η
→ ⊥F is not commutative... except in the squared Euclidean/Mahalanobis case.
41. Dually orthogonal Bregman Voronoi & triangulations
The ordinary Voronoi diagram is perpendicular to the Delaunay triangulation.
Dual line segment geodesics:
(pq)θ = {θ = λθp + (1 − λ)θq | λ ∈ [0, 1]}
(pq)η = {η = ληp + (1 − λ)ηq | λ ∈ [0, 1]}
Bisectors:
Bθ(p, q): ⟨x, θq − θp⟩ + F(θp) − F(θq) = 0
Bη(p, q): ⟨x, ηq − ηp⟩ + F∗(ηp) − F∗(ηq) = 0
Dual orthogonality:
Bη(p, q) ⊥ (pq)η
(pq)θ ⊥ Bθ(p, q)
42. Dually orthogonal Bregman Voronoi & Triangulations
Bη (p, q) ⊥ (pq)η
(pq)θ ⊥ Bθ (p, q)
43. Simplifying mixtures: Kullback-Leibler projection theorem
An exponential family mixture model p̃ = Σ_{i=1}^k wi pF(x; θi).
The right-sided KL barycenter p̄∗ of the components is interpreted as the projection of the mixture model p̃ ∈ P onto the exponential family manifold EF [32]:
p̄∗ = arg min_{p∈EF} KL(p̃ : p)
[Figure: in the manifold P of probability distributions, the mixture p̃ lies off the exponential family manifold EF; p̄∗ is its m-geodesic projection onto EF (e-geodesic shown on EF).]
Right-sided KL centroid = left-sided Bregman centroid.
44. A simple proof
argmin_θ KL(m(x) : pθ(x)):
KL(m : pθ) = Em[log m] − Em[log pθ]
           = Em[log m] − Em[k(x)] − Em[⟨t(x), θ⟩ − F(θ)]
≡ argmax_θ Em[⟨t(x), θ⟩] − F(θ) = argmax_θ ⟨Em[t(x)], θ⟩ − F(θ)
Em[t(x)] = Σ_t wt E_{pθt}[t(x)] = Σ_t wt ηt = η̄
Setting the gradient to zero: ∇F(θ) = η̄, hence θ̂ = ∇F∗(η̄).
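For univariate Gaussians this recipe amounts to moment matching; here is an illustrative sketch using the sufficient statistic t(x) = (x, x²) (a sign convention differing from slide 15's t(x) = (x, −xxᵀ); the conclusion is unchanged):

```python
# Sketch: the right-sided KL projection of a Gaussian mixture onto the
# Gaussian family, via averaging expectation parameters eta = (mu, mu^2 + var)
# and converting back -- i.e., moment matching.
import numpy as np

w = np.array([0.3, 0.7])            # mixture weights
mu = np.array([0.0, 4.0])           # component means
var = np.array([1.0, 0.5])          # component variances

eta_bar = np.array([np.sum(w * mu), np.sum(w * (mu**2 + var))])  # eta-average
mu_star = eta_bar[0]                         # theta-hat = gradF*(eta_bar),
var_star = eta_bar[1] - eta_bar[0]**2        # written in source parameters
print(mu_star, var_star)   # the mean and variance of the mixture itself
```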
45. Left-sided or right-sided Kullback-Leibler centroids?
Left/right Bregman centroids = right/left entropic centroids (KL of exponential families).
Left-sided and right-sided centroids have different (statistical) properties:
◮ Right-sided entropic centroid: zero-avoiding (covers the support of the pdfs),
◮ Left-sided entropic centroid: zero-forcing (captures the highest mode).
[Figure: for N1 = N(−4, 4) and N2 = N(5, 0.64): the left-sided KL centroid (zero-forcing), the symmetrized centroid, and the right-sided KL centroid (zero-avoiding).]
46. Hierarchical clustering of GMMs (Burbea-Rao)
Hierarchical clustering of GMMs wrt. the Bhattacharyya distance.
Simplify the number of components of an initial GMM.
[Figure: (a) source image, (b) k = 48, (c) k = 16.]
47. Two symmetrizations of Bregman divergences
◮ Jeffreys-Bregman divergences:
SF(p; q) = (BF(p, q) + BF(q, p))/2 = (1/2) ⟨p − q, ∇F(p) − ∇F(q)⟩
◮ Jensen-Bregman divergences (diversity index):
JF(p; q) = (BF(p, (p+q)/2) + BF(q, (p+q)/2))/2 = (F(p) + F(q))/2 − F((p+q)/2) = BRF(p, q)
Skew Jensen divergences [21, 29]:
JF^(α)(p; q) = αF(p) + (1 − α)F(q) − F(αp + (1 − α)q) = BRF^(α)(p; q)
(These are the Jeffreys and Jensen-Shannon symmetrizations of the Kullback-Leibler divergence.)
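A numeric check (a sketch, not from the deck): with F(p) = Σ p log p (negative Shannon entropy), the Jensen-Bregman divergence coincides with the Jensen-Shannon divergence on normalized histograms.

```python
# Sketch: Jensen-Bregman with the negative Shannon entropy generator equals
# the Jensen-Shannon divergence (1/2)KL(p:m) + (1/2)KL(q:m), m = (p+q)/2.
import numpy as np

F = lambda p: np.sum(p * np.log(p))
kl = lambda p, q: np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
m = (p + q) / 2

jensen_bregman = (F(p) + F(q)) / 2 - F(m)
jensen_shannon = (kl(p, m) + kl(q, m)) / 2
print(jensen_bregman, jensen_shannon)   # same value
```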
48. Burbea-Rao centroids (α-skewed Jensen centroids)
Minimum average divergence:
OPT: c = arg min_x Σ_{i=1}^n wi JF^(α)(x, pi) = arg min_x L(x)
Equivalent to minimizing:
E(c) = (Σ_{i=1}^n wi α) F(c) − Σ_{i=1}^n wi F(αc + (1 − α)pi)
The sum E = F + G of a convex function F and a concave function G ⇒ Convex-ConCave Procedure (CCCP).
Start from an arbitrary c0, and iteratively update as: ∇F(ct+1) = −∇G(ct)
⇒ guaranteed convergence to a local minimum.
49. ConCave Convex Procedure (CCCP)
min_x E(x) = F(x) + G(x)
∇F(ct+1) = −∇G(ct)
50. Iterative algorithm for Burbea-Rao centroids
Apply the CCCP scheme:
∇F(ct+1) = Σ_{i=1}^n wi ∇F(αct + (1 − α)pi)
ct+1 = (∇F)⁻¹ (Σ_{i=1}^n wi ∇F(αct + (1 − α)pi))
Get arbitrarily fine approximations of the (skew) Burbea-Rao centroids and barycenters.
Unique GLOBAL minimum when the divergence is separable [21].
Unique GLOBAL minimum for the matrix mean [23] for the logDet divergence.
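A sketch of this fixed-point iteration for α = 1/2 and F(x) = Σ x log x − x (so ∇F = log up to an additive constant that cancels, and (∇F)⁻¹ = exp; the update becomes the geometric mean of midpoints). Illustrative only:

```python
# Sketch: CCCP iteration for the alpha = 1/2 Jensen (Jensen-Shannon-type)
# centroid of positive arrays:
#   c_{t+1} = exp( sum_i w_i * log((c_t + p_i) / 2) )
import numpy as np

def burbea_rao_centroid(points, weights, iters=100):
    c = np.average(points, axis=0, weights=weights)   # arbitrary start c_0
    for _ in range(iters):
        c = np.exp(np.average(np.log((c + points) / 2), axis=0, weights=weights))
    return c

points = np.array([[0.2, 0.5, 0.3],
                   [0.1, 0.1, 0.8],
                   [0.6, 0.2, 0.2]])
w = np.ones(3) / 3
print(burbea_rao_centroid(points, w))
```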
51. Information-geometric computing on statistical manifolds
◮ Cramér-Rao lower bound and information geometry [13] (overview article)
◮ Chernoff information on the statistical exponential family manifold [19]
◮ Hypothesis testing and Bregman Voronoi diagrams [18]
◮ Jeffreys centroid [20]
◮ Learning mixtures with k-MLE [17]
◮ Closed-form divergences for statistical mixtures [16]
52. Summary
Information-geometric pattern recognition:
◮ Statistical manifold (M, g): Rao’s distance and Fisher-Rao curved Riemannian geometry.
◮ Statistical manifold (M, g, ∇, ∇∗): dually flat spaces, Bregman divergences, geodesics are straight lines in either the θ- or η-parameter space. ±1-geometry, a special case of α-geometry.
◮ Clustering in dually flat spaces
◮ Software libraries: jMEF [8] (Java), pyMEF [31] (Python)
◮ ... but also many other geometries to explore: Hilbertian, Finsler [3], Kähler, Wasserstein, contact, symplectic, etc. (it is easy to require non-Euclidean geometry, but then the space is wide open!)
53. Edited book (MIG) and proceedings (GSI)
54. Thank you.
55. Exponential families & statistical distances
Universal density estimators [2] generalizing Gaussians/histograms (a single EF density approximates any smooth density).
Explicit formulas for:
◮ Shannon entropy, cross-entropy, and Kullback-Leibler divergence [26]
◮ Rényi/Tsallis entropy and divergence [28]
◮ Sharma-Mittal entropy and divergence [30]: a 2-parameter family extending the extensive Rényi (for β → 1) and non-extensive Tsallis (for β → α) entropies:
Hα,β(p) = (1/(1 − β)) ((∫ p(x)^α dx)^{(1−β)/(1−α)} − 1), with α > 0, α ≠ 1, β ≠ 1.
◮ Skew Jensen and Burbea-Rao divergences [21]
◮ Chernoff information and divergence [15]
◮ Mixtures: total least squares, Jensen-Rényi, Cauchy-Schwarz divergences [16].
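A numeric sketch (not from the deck) of the β → 1 limit for a standard Gaussian, via quadrature:

```python
# Sketch: Sharma-Mittal entropy of N(0, 1) by numerical integration; as
# beta -> 1 it approaches the Renyi entropy R_alpha = log(int p^alpha)/(1-alpha).
import numpy as np

x = np.linspace(-20, 20, 200001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def sharma_mittal(alpha, beta):
    I = np.trapz(p**alpha, x)                       # int p(x)^alpha dx
    return (I**((1 - beta) / (1 - alpha)) - 1) / (1 - beta)

alpha = 2.0
renyi = np.log(np.trapz(p**alpha, x)) / (1 - alpha)
print(sharma_mittal(alpha, beta=0.999), renyi)      # ~equal as beta -> 1
```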
56. Statistical invariance: Markov kernel
Probability family: p(x; θ).
(X, σ) and (X′, σ′): two measurable spaces.
σ: a σ-algebra on X (non-empty, closed under complementation and countable union).
Markov kernel = transition probability kernel K : X × σ′ → [0, 1]:
◮ ∀E′ ∈ σ′, K(·, E′) is a measurable map,
◮ ∀x ∈ X, K(x, ·) is a probability measure on (X′, σ′).
A probability measure p on (X, σ) induces a probability measure Kp, with Kp(E′) = ∫_X K(x, E′) p(dx), ∀E′ ∈ σ′.
57. Space of Bregman spheres and Bregman balls [5]
Dual Bregman balls (bounding Bregman spheres):
Ball^r_F(c, r) = {x ∈ X | BF(x : c) ≤ r}
Ball^l_F(c, r) = {x ∈ X | BF(c : x) ≤ r}
Legendre duality: Ball^l_F(c, r) = (∇F)⁻¹(Ball^r_{F∗}(∇F(c), r))
Illustration for the Itakura-Saito divergence, F(x) = −log x.
58. Space of Bregman spheres: Lifting map [5]
F: x → x̂ = (x, F(x)), a hypersurface in R^{d+1}.
Hp: tangent hyperplane at p̂, z = Hp(x) = ⟨x − p, ∇F(p)⟩ + F(p)
◮ A Bregman sphere σ lifts to σ̂ with supporting hyperplane Hσ: z = ⟨x − c, ∇F(c)⟩ + F(c) + r (parallel to Hc and shifted vertically by r); σ̂ = F ∩ Hσ.
◮ The intersection of any hyperplane H with F projects onto X as a Bregman sphere:
H: z = ⟨x, a⟩ + b → σ: BallF(c = (∇F)⁻¹(a), r = ⟨a, c⟩ − F(c) + b)
59. Bibliographic references I
Marcel R. Ackermann and Johannes Blömer. Bregman clustering for separable instances. In Scandinavian Workshop on Algorithm Theory (SWAT), pages 212–223, 2010.
Yasemin Altun, Alexander J. Smola, and Thomas Hofmann. Exponential families for conditional random fields. In Uncertainty in Artificial Intelligence (UAI), pages 2–9, 2004.
Marc Arnaudon and Frank Nielsen. Medians and means in Finsler geometry. LMS Journal of Computation and Mathematics, 15, 2012.
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
Nikolai Nikolaevich Chentsov. Statistical Decision Rules and Optimal Inferences. Translations of Mathematical Monographs, volume 53, 1982. Published in Russian in 1972.
60. Bibliographic references II
José Manuel Corcuera and Federica Giummolé. A characterization of monotone and regular divergences. Annals of the Institute of Statistical Mathematics, 50(3):433–450, 1998.
Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing (Elsevier), 90(12):3197–3212, 2010.
Vincent Garcia, Frank Nielsen, and Richard Nock. Levels of details for Gaussian mixture models. In Asian Conference on Computer Vision (ACCV), volume 2, pages 514–525, 2009.
François Labelle and Jonathan Richard Shewchuk. Anisotropic Voronoi diagrams and guaranteed-quality anisotropic mesh generation. In Proceedings of the nineteenth annual Symposium on Computational Geometry (SCG ’03), pages 191–200, New York, NY, USA, 2003. ACM.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to shape retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3463–3468, 2010.
61. Bibliographic references III
Frank Nielsen. Cramér-Rao lower bound and information geometry. In Rajendra Bhatia, C. S. Rajan, and Ajit Iqbal Singh, editors, Connected at Infinity II: A selection of mathematics by Indians. Hindustan Book Agency (Texts and Readings in Mathematics, TRIM). arXiv 1301.3578.
Frank Nielsen. Visual Computing: Geometry, Graphics, and Vision. Charles River Media / Thomson Delmar Learning, 2005.
Frank Nielsen. Chernoff information of exponential families. arXiv, abs/1102.2684, 2011.
Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In International Conference on Pattern Recognition (ICPR), pages 1723–1726, 2012.
Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012. Preliminary technical report on arXiv.
62. Bibliographic references IV
Frank Nielsen. Hypothesis testing, information divergence and computational geometry. In F. Nielsen and F. Barbaresco, editors, Geometric Science of Information (GSI), volume 8085 of LNCS, pages 241–248. Springer, 2013.
Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013.
Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters, PP(99):1–1, 2013.
Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards, 2009. arXiv.org:0911.4863.
63. Bibliographic references V
Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri. Jensen divergence based SPD matrix means and applications. In International Conference on Pattern Recognition (ICPR), 2012.
Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. In International Symposium on Voronoi Diagrams (ISVD), pages 71–78, 2009.
Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2048–2059, June 2009.
Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In International Conference on Image Processing (ICIP), pages 3621–3624, 2010.
Frank Nielsen and Richard Nock. Hyperbolic Voronoi diagrams made easy. In International Conference on Computational Science and its Applications (ICCSA), volume 1, pages 74–80, Los Alamitos, CA, USA, March 2010. IEEE Computer Society.
Frank Nielsen and Richard Nock. On Rényi and Tsallis entropies and divergences for exponential families. arXiv, abs/1105.3259, 2011.
64. Bibliographic references VI
Frank Nielsen and Richard Nock. Skew Jensen-Bregman Voronoi diagrams. Transactions on Computational Science, 14:102–128, 2011.
Frank Nielsen and Richard Nock. A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45(3), 2012.
Olivier Schwander and Frank Nielsen. PyMEF - A framework for exponential families in Python. In IEEE/SP Workshop on Statistical Signal Processing (SSP), 2011.
Olivier Schwander and Frank Nielsen. Learning mixtures by simplifying kernel density estimators. In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 403–426, 2012.
Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5):1314–1324, 2011.
Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, 2011.