Maximum likelihood estimation (MLE) is a popular method for parameter estimation in both applied probability and statistics, but MLE cannot handle incomplete or hidden data because the likelihood function of hidden data cannot be maximized directly. The expectation maximization (EM) algorithm is a powerful mathematical tool for solving this problem whenever there is a relationship between hidden data and observed data. Such a hinting relationship is specified by a mapping from hidden data to observed data or by a joint probability between hidden data and observed data; in other words, the relationship lets us learn about hidden data by surveying observed data. The essential idea of EM is to maximize the expectation of the likelihood function over observed data, based on the hinting relationship, instead of maximizing the likelihood function of hidden data directly. The pioneers of the EM algorithm proved its convergence, so EM produces parameter estimators just as MLE does. This tutorial aims to explain the EM algorithm so that researchers can comprehend it. Moreover, some improvements of the EM algorithm are proposed in the tutorial, such as the combination of EM with the third-order-convergence Newton-Raphson process, the combination of EM with gradient descent, and the combination of EM with the particle swarm optimization (PSO) algorithm.
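As a hedged illustration of this ideology (a minimal sketch, not code from the tutorial), the Python snippet below runs EM on a two-component one-dimensional Gaussian mixture, where the hidden data are the unknown component labels; the function name em_gmm_1d, the initialization, and the simulated sample are assumptions made only for the example.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    # crude initialization of mixing weight, means, and standard deviations
    pi, mu1, mu2, s1, s2 = 0.5, x.min(), x.max(), x.std(), x.std()
    for _ in range(n_iter):
        # E-step: responsibilities = expected values of the hidden component labels
        r1 = pi * norm.pdf(x, mu1, s1)
        r2 = (1 - pi) * norm.pdf(x, mu2, s2)
        gamma = r1 / (r1 + r2)
        # M-step: maximize the expected complete-data log-likelihood
        pi = gamma.mean()
        mu1 = np.sum(gamma * x) / np.sum(gamma)
        mu2 = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        s1 = np.sqrt(np.sum(gamma * (x - mu1) ** 2) / np.sum(gamma))
        s2 = np.sqrt(np.sum((1 - gamma) * (x - mu2) ** 2) / np.sum(1 - gamma))
    return pi, mu1, mu2, s1, s2

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))  # estimates approach the true mixture parameters
```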
Eigen-Decomposition: Eigenvalues and Eigenvectors.pdf - NehaVerma933923
Eigenvalues and eigenvectors are numbers and vectors associated with square matrices. Together they provide the eigen-decomposition, which analyzes a matrix's structure. The eigen-decomposition expresses a matrix as a linear combination of orthogonal eigenvectors multiplied by the corresponding eigenvalues. For positive semi-definite matrices like correlation and covariance matrices, the eigen-decomposition is particularly useful because the eigenvalues are positive and the eigenvectors are orthogonal.
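As a small, hedged illustration of that statement (not taken from the summarized slides), the NumPy snippet below rebuilds a symmetric positive semi-definite matrix from its eigenvalues and orthonormal eigenvectors:

```python
import numpy as np

# a covariance-like symmetric positive semi-definite matrix (values are arbitrary)
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])
vals, vecs = np.linalg.eigh(C)   # real eigenvalues, orthonormal eigenvectors (columns)

# reconstruct C as a sum of rank-one eigenvector outer products weighted by eigenvalues
C_rebuilt = sum(lam * np.outer(v, v) for lam, v in zip(vals, vecs.T))
print(np.allclose(C, C_rebuilt))  # True
```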
The document provides an overview of matrix theory, including:
1. The definition and notation of matrices, including that a matrix A is represented as A_{m×n}, where m is the number of rows and n is the number of columns.
2. The different types of matrices and operations that can be performed on matrices, such as scalar multiplication, matrix multiplication, and properties like the distributive law.
3. Methods for solving systems of linear equations using matrices, including writing the system in matrix form, reducing the augmented matrix to echelon form, and determining the solution based on the rank.
The document discusses matrix representations of operators and changes of basis in quantum mechanics. Some key points:
- Matrix elements of an operator are computed using a basis of kets. The expectation value of an operator is computed from its matrix elements and the state vectors.
- If two operators commute, they have the same set of eigenkets.
- A change of basis is a unitary transformation that relates two different sets of basis kets that span the same space. It establishes a link between the two basis representations.
- Linear algebra concepts like linear independence of eigenvectors and Hermitian operators having real eigenvalues are important in quantum mechanics.
This document discusses linear transformations and matrices. It introduces how linear transformations on physical quantities are usually described by matrices, where a column vector u representing a physical quantity is transformed into another column vector Au by a transformation matrix A. As an example, it discusses orthogonal transformations, where the transformation matrix A is orthogonal. It proves that for an orthogonal transformation, the inner product of two vectors remains invariant. It also discusses properties of other types of matrices like Hermitian, skew-Hermitian and unitary matrices.
This document describes an undergraduate research project on iterative methods for computing eigenvalues and eigenvectors of matrices. It introduces the standard eigenvalue problem and defines key terms like eigenvalues, eigenvectors, and dominant eigenpairs. The body of the document reviews three iterative methods - the power method, inverse power method, and shifted inverse power method. It explains how these methods use repeated matrix-vector multiplications to approximate dominant, smallest, and intermediate eigenvalues and their corresponding eigenvectors. The document is structured with chapters on introduction, literature review, applications, and conclusion.
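The power method reviewed there fits in a few lines of Python; the sketch below is an illustrative implementation under the usual assumption that the dominant eigenvalue is strictly larger in magnitude than the others, and is not code from the project:

```python
import numpy as np

def power_method(A, n_iter=200):
    """Approximate the dominant eigenpair by repeated matrix-vector multiplications."""
    x = np.random.default_rng(1).normal(size=A.shape[0])
    for _ in range(n_iter):
        x = A @ x                  # apply the matrix
        x /= np.linalg.norm(x)     # renormalize to avoid overflow/underflow
    lam = x @ A @ x                # Rayleigh quotient estimate of the eigenvalue
    return lam, x

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
print(power_method(A))             # dominant eigenvalue of this matrix is 5
```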
5. Linear Algebra for Machine Learning: Singular Value Decomposition and Prin... - Ceni Babaoglu, PhD
The seminar series focuses on the mathematical background needed for machine learning. The first set of seminars is on "Linear Algebra for Machine Learning". Here are the slides of the fifth part, which discusses singular value decomposition and principal component analysis.
Here are the slides of the first part, which discussed linear systems: https://www.slideshare.net/CeniBabaogluPhDinMat/linear-algebra-for-machine-learning-linear-systems/1
Here are the slides of the second part, which discussed basis and dimension:
https://www.slideshare.net/CeniBabaogluPhDinMat/2-linear-algebra-for-machine-learning-basis-and-dimension
Here are the slides of the third part, which discusses factorization and linear transformations:
https://www.slideshare.net/CeniBabaogluPhDinMat/3-linear-algebra-for-machine-learning-factorization-and-linear-transformations-130813437
Here are the slides of the fourth part, which discusses eigenvalues and eigenvectors:
https://www.slideshare.net/CeniBabaogluPhDinMat/4-linear-algebra-for-machine-learning-eigenvalues-eigenvectors-and-diagonalization
The document discusses weighted nuclear norm minimization and its applications to image denoising. It provides background on key concepts from linear algebra and optimization theory needed to understand the denoising problem, such as convex optimization, affine transformations, singular value decomposition, and eigendecomposition. The objective of denoising is to extract the low-rank original image from a noisy high-dimensional image, modeled as the sum of the original image and white noise.
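A hedged sketch of the underlying low-rank idea (plain truncated SVD standing in for the weighted nuclear norm machinery of the document) is shown below; the rank, noise level, and matrix sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))   # rank-3 "original image"
noisy = clean + 0.1 * rng.normal(size=clean.shape)            # add white noise

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 3
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]              # keep the k largest singular values

# the truncated reconstruction is usually closer to the clean matrix than the noisy one
print(np.linalg.norm(noisy - clean), np.linalg.norm(denoised - clean))
```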
Numerical solution of eigenvalues and applications 2 - SamsonAjibola
This document provides an overview of eigenvalues and their applications. It discusses:
1) Eigenvalues arise in applications across science and engineering, including mechanics, control theory, and quantum mechanics. Numerical methods are used to solve increasingly large eigenvalue problems.
2) Common methods for small problems include the QR and power methods. For large, sparse problems, techniques like the Krylov subspace and Arnoldi methods are used to compute a few desired eigenvalues/eigenvectors.
3) The document outlines the structure of the thesis, which will investigate methods for finding eigenvalues like Krylov subspace, power, and QR. It will also explore applications in areas like biology, statistics, and engineering.
The document discusses eigenvalue problems and algorithms for solving them. Eigenvalue problems involve finding the eigenvalues and eigenvectors of a matrix and occur across science and engineering. The properties of the eigenvalue problem, like whether the matrix is real or complex, affect the choice of algorithm. The Power Method is described as an iterative technique for determining the dominant eigenvalue and eigenvector of a matrix. It works by successively applying the matrix to a starting vector to isolate the component in the direction of the dominant eigenvector. Variants can find other eigenvalues like the smallest. General projection methods approximate eigenvectors within a subspace, while subspace iteration generalizes Power Method to compute multiple eigenvalues.
This document discusses techniques for setting linear algebra problems in a way that ensures relatively easy arithmetic. Some key techniques discussed include:
1. Using Pythagorean triples and sums of squares to generate vectors with integer norms in R2 and R3.
2. Using the PLU decomposition theorem to generate matrices with a given determinant, such as ±1, to avoid fractions.
3. Extending a basis for the kernel of a matrix to generate matrices with a given kernel.
4. Ensuring the coefficients for a Leontieff input-output model are nonnegative to generate a productive consumption matrix. Examples and Maple routines are provided.
Linear algebra concepts like vectors, matrices, and linear transformations are important for recommendation systems. Vectors represent items or users, matrices represent item-user preference data. Linear algebra allows analyzing this data to identify patterns and recommend new items. Key techniques include eigendecomposition to reduce dimensionality and identify important relationships in the data, and singular value decomposition to factor matrices for recommendations. These linear algebra concepts are essential mathematical tools for building personalized recommendation models.
Machine learning ppt and presentation code - sharma239172
Principal Component Analysis (PCA) is a technique for dimensionality reduction that projects high-dimensional data onto a lower-dimensional space in a way that maximizes variance. It works by finding the directions (principal components) along which the variance of the data is highest. These principal components become the new axes of the reduced space. PCA involves computing the covariance matrix of the data, performing eigendecomposition on the covariance matrix to obtain its eigenvectors, and projecting the data onto the top K eigenvectors corresponding to the largest eigenvalues, where K is the target dimensionality. This projection both reduces dimensionality and maximizes retained variance.
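Those steps can be sketched directly in NumPy; the snippet below is a minimal illustration of covariance-based PCA under the assumptions stated in its comments, not code from the presentation:

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k principal components (covariance eigendecomposition)."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix of the features
    vals, vecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]]    # eigenvectors of the k largest eigenvalues
    return Xc @ top                              # reduced-dimension representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X, 2).shape)                           # (100, 2)
```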
This document describes the Neville elimination process for factorizing matrices. It proves that Neville elimination can be described mathematically using elementary matrices and that it results in a PLU factorization for any n x m matrix. Specifically:
1) Neville elimination produces zeros in a column by adding row multiples, allowing row swaps, resulting in an upper triangular matrix.
2) It can be described using elementary matrices Eij and permutation matrices Pij, yielding a product of bidiagonal and permutation matrices equaling the upper triangular matrix.
3) This realizes a PLU factorization for any n x m matrix, where P is an n x n permutation matrix, L is an n x n lower triangular matrix expressed as a product of bidiagonal matrices, and U is the n x m upper triangular matrix produced by the elimination.
The document presents information on matrices, including:
- Definitions of matrices as rectangular arrangements of numbers arranged in rows and columns
- Common matrix operations such as addition, subtraction, scalar multiplication, and matrix multiplication
- Determinants and inverses of matrices
- How matrices can represent systems of linear equations
- Unique properties of matrices, such as the product of two non-zero matrices possibly being zero
- Applications of matrices in fields like geology, statistics, economics, and animation
Beginning direct3d gameprogrammingmath05_matrices_20160515_jintaeks - JinTaek Seo
This document provides an overview of linear systems and matrices. It defines key linear algebra concepts such as linear functions, linear maps, homogeneous and non-homogeneous linear systems, and plane equations. It also explains how to represent linear systems using matrices and describes common matrix operations including addition, scalar multiplication, transposition, and matrix multiplication. Finally, it discusses inverses, determinants, and using matrices to represent transformations such as rotations in 2D space.
Linear algebra may be defined as the branch of algebra that studies solutions of linear equations. To explain the name, note that it consists of two terms. The first is "linear", meaning straight: linear equations describe straight lines in the xy-plane and, more generally, straight or flat objects in three dimensions. Another view of linearity is flatness, in the sense that the set of points satisfying such an equation can be described in a very simple form, using only addition and multiplication.
This document outlines topics related to matrices, including:
- Types of matrices such as real, square, row, column, null, sub, diagonal, scalar, unit, upper triangular, lower triangular, and singular matrices
- Characteristic equations, eigenvectors, and eigenvalues of matrices
- Properties of eigenvalues including that the sum of eigenvalues is the trace and the product is the determinant
- Examples of finding the sum and product of eigenvalues without directly calculating them
The document provides definitions and examples of key matrix concepts.
Definitions matrices y determinantes fula 2010 english subir - HernanFula
The document discusses the history and properties of matrices. It describes how matrices were first introduced in 1850 and how their use has expanded. It then defines key matrix terms and concepts such as order, elements, types of matrices (e.g. triangular, diagonal), operations (e.g. addition, multiplication, inverse), and properties (e.g. of symmetric, banded and transpose matrices). It also provides examples of calculating the determinant of matrices using Sarrus' rule.
Handling missing data with expectation maximization algorithm - Loc Nguyen
Expectation maximization (EM) algorithm is a powerful mathematical tool for estimating the parameters of statistical models in the case of incomplete or hidden data. EM assumes that there is a relationship between hidden data and observed data, which can be a joint distribution or a mapping function; this in turn implies an implicit relationship between parameter estimation and data imputation. If missing data, which contains missing values, is considered as hidden data, it is very natural to handle missing data with the EM algorithm. Handling missing data is not new research, but this report focuses on the theoretical basis, with detailed mathematical proofs, for filling in missing values with EM. Besides, the multinormal distribution and the multinomial distribution are the two sample statistical models considered as holders of missing values.
Lecture 5 inverse of matrices - section 2-2 and 2-3 - njit-ronbrown
The document discusses the inverse of matrices. It defines an inverse matrix A^(-1) as a matrix such that AA^(-1) = I = A^(-1)A, where I is the identity matrix. It provides methods for determining if a 2x2 matrix is invertible based on its determinant, and for finding the inverse of a matrix using elementary row operations. It also presents the Invertible Matrix Theorem, which provides equivalent conditions for a matrix to be invertible.
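As a hedged numeric illustration of the 2x2 case described above (not taken from the lecture), the snippet below checks invertibility via the determinant and applies the closed-form 2x2 inverse:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [5.0, 3.0]])
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]      # 2*3 - 1*5 = 1, nonzero, so A is invertible
A_inv = np.array([[ A[1, 1], -A[0, 1]],
                  [-A[1, 0],  A[0, 0]]]) / det   # closed-form 2x2 inverse
print(np.allclose(A @ A_inv, np.eye(2)))         # AA^(-1) = I
print(np.allclose(A_inv, np.linalg.inv(A)))      # matches the library routine
```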
The document discusses pseudospectra as an alternative to eigenvalues for analyzing non-normal matrices and operators. It presents three equivalent definitions of the ε-pseudospectrum: (1) the set of points where the norm of the resolvent is larger than ε^(-1), (2) the set of points that are eigenvalues of a perturbed matrix with perturbation smaller than ε, and (3) the set of points for which some unit vector (an approximate eigenvector) is mapped by the shifted matrix to a vector of norm smaller than ε. It also shows that pseudospectra are nested sets and that their intersection is the spectrum. The definitions extend to operators on Hilbert spaces using singular values.
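A compact way to probe these definitions numerically is to evaluate the smallest singular value of zI - A, since z lies in the ε-pseudospectrum exactly when that value is at most ε; the sketch below uses an assumed, strongly non-normal 2x2 example and is not drawn from the document:

```python
import numpy as np

def smallest_singular_value(A, z):
    """sigma_min(zI - A): z is in the eps-pseudospectrum iff this value is <= eps."""
    n = A.shape[0]
    return np.linalg.svd(z * np.eye(n) - A, compute_uv=False)[-1]

# non-normal example: both eigenvalues equal 0.5, but the large off-diagonal entry
# makes the pseudospectra much larger than the spectrum
A = np.array([[0.5, 10.0],
              [0.0,  0.5]])

for z in [0.5, 0.6, 1.0]:
    # the printed values are far smaller than |z - 0.5|, the distance to the spectrum
    print(z, smallest_singular_value(A, z))
```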
This document discusses the rank of matrices and how it relates to the solvability of linear systems of equations. It contains the following key points:
1) The rank of a matrix is the number of leading entries in its row-reduced form and determines the number of independent variables in a linear system with that matrix as its coefficient matrix.
2) The rank of the coefficient matrix and augmented matrix determine whether a linear system has no solution, a unique solution, or infinitely many solutions.
3) Homogeneous systems always have at least one solution (the trivial solution of all zeros) and the rank of the coefficient matrix determines if that is the only solution or if there are infinitely many solutions.
Estimation Theory Class (Summary and Revision) - Ahmad Gomaa
Summary of important theories and formulas in Estimation theory:
1) Cramer-Rao lower bound (CRLB)
2) Linear Model
3) Best Linear Unbiased Estimate (BLUE)
4) Maximum Likelihood Estimation (MLE)
5) Least Squares Estimation (LSE)
6) Bayesian Estimation and MMSE estimation
K-Notes are concise study materials intended for quick revision near the end of preparation for exams like GATE. Each K-Note covers the concepts from a subject in 40 pages or less. They are useful for final preparation and travel. Students should use K-Notes in the last 2 months before the exam, practicing questions after reviewing each note. The document then provides a summary of key concepts in linear algebra and matrices, including matrix properties, operations, inverses, and systems of linear equations.
Applications of matrices are found in most scientific fields. In every branch of physics, including classical mechanics, optics, electromagnetism, quantum mechanics, and quantum electrodynamics, they are used to study physical phenomena, such as the motion of rigid bodies.
This document summarizes key concepts regarding eigenvalues and eigenvectors of matrices:
- Eigenvalues are scalars such that there exist non-zero eigenvectors satisfying Ax = λx.
- The characteristic equation states that λ is an eigenvalue if and only if it satisfies det(A - λI) = 0.
- A matrix is diagonalizable if it can be written as A = PDP^(-1), where D is a diagonal matrix of eigenvalues and P is a matrix of corresponding eigenvectors. Powers of a diagonalizable matrix are easy to compute by raising the eigenvalues to those powers.
This document discusses three main topics: positive definite matrices, solving linear systems, and the least squares method.
Positive definite matrices are symmetric matrices where all eigenvalues are positive. Solving linear systems involves finding a single solution that satisfies two or more linear equations with the same variables.
The least squares method determines the line of best fit for a data set by minimizing the sum of the squared differences between the independent variable values and the dependent variable values predicted by the line or curve. It provides the closest approximate solution when a linear system has no exact solution.
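A minimal worked example of the least squares method, with made-up data points and NumPy's lstsq solver, is sketched below:

```python
import numpy as np

# fit y ≈ a*x + b by minimizing the sum of squared residuals
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
X = np.column_stack([x, np.ones_like(x)])             # design matrix [x, 1]
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(coef)                                            # slope and intercept of the best-fit line
```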
This document provides an introduction to tensors. It discusses that familiar physics equations like F=ma are only strictly true for isotropic systems, and more general tensor forms are needed for anisotropic systems. It gives the example of a polarizability tensor to relate polarization and electric field for an anisotropic medium. It also discusses preliminaries of tensors, including how coordinate transformations define tensors and the Kronecker delta and Levi-Civita tensors used to represent identities and cross products.
Adversarial Variational Autoencoders to extend and improve generative model -... - Loc Nguyen
Generative artificial intelligence (GenAI) has been developing with many incredible achievements like ChatGPT and Bard. The deep generative model (DGM) is a branch of GenAI that is preeminent in generating raster data such as images and sound, owing to the strength of deep neural networks (DNN) in inference and recognition. The built-in inference mechanism of a DNN, which simulates and aims at the synaptic plasticity of the human neural network, fosters the generation ability of DGM, which produces surprising results with the support of statistical flexibility. Two popular approaches in DGM are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). Both VAE and GAN have their own strong points, although they share the same underlying statistical theory as well as incredible complexity in the hidden layers of a DNN, which becomes an effective encoding/decoding function without concrete specifications. In this research, I try to unify VAE and GAN into a consistent and consolidated model called Adversarial Variational Autoencoders (AVA), in which VAE and GAN complement each other: VAE is a good data generator, encoding data via the excellent ideology of Kullback-Leibler divergence, and GAN is a significantly important method for assessing whether data is realistic or fake. In other words, AVA aims to improve the accuracy of generative models, and AVA also extends the function of simple generative models. In methodology, this research focuses on combining applied mathematical concepts with skillful computer programming techniques in order to implement and solve complicated problems as simply as possible.
Conditional mixture model and its application for regression model - Loc Nguyen
Expectation maximization (EM) algorithm is a powerful mathematical tool for estimating statistical parameters when a data sample contains a hidden part and an observed part. EM is applied to learn the finite mixture model, in which the whole distribution of the observed variable is a weighted sum of partial distributions, and the coverage ratio of every partial distribution is specified by the probability of the hidden variable. An application of the mixture model is soft clustering, in which a cluster is modeled by the hidden variable, each data point can be assigned to more than one cluster, and the degree of such assignment is represented by the probability of the hidden variable. However, this probability in the traditional mixture model is simplified as a parameter, which can cause loss of valuable information. Therefore, in this research I propose a so-called conditional mixture model (CMM) in which the probability of the hidden variable is modeled as a full probability density function (PDF) with its own parameters. CMM aims to extend the mixture model. I also propose an application of CMM called the adaptive regression model (ARM). The traditional regression model is effective when the data sample is scattered evenly; if data points are grouped into clusters, the regression model tries to learn a unified regression function that goes through all data points, and such a unified function is obviously not effective for evaluating the response variable from grouped data points. The concept of "adaptive" in ARM means that ARM solves this ineffectiveness problem by first selecting the best cluster of data points and then evaluating the response variable within that best cluster. In other words, ARM reduces the estimation space of the regression model so as to gain high accuracy in calculation.
Keywords: expectation maximization (EM) algorithm, finite mixture model, conditional mixture model, regression model, adaptive regression model (ARM).
Nghịch dân chủ luận (tổng quan về dân chủ và thể chế chính trị liên quan đến ... - Loc Nguyen
The universe has matter and antimatter; society has conflict and harmony, through which it develops and declines, then declines and develops. I rely on this to justify an article of a reactionary nature that runs against the current of the times, but readers will find for themselves the inseparable meaning of social forms. Moreover, this article does not go deeply into legal research; it only offers an overview of democracy and political institutions in relation to philosophy and religion, and accordingly the article's contribution is the concept of the "temporary refuge" of the judiciary, which does not truly derive from election and does not truly derive from appointment either.
A Novel Collaborative Filtering Algorithm by Bit Mining Frequent Itemsets - Loc Nguyen
Collaborative filtering (CF) is a popular technique in recommendation research. Concretely, the items recommended to a user are determined by surveying her/his communities. There are two main CF approaches, memory-based and model-based. I propose a new model-based CF algorithm that mines frequent itemsets from the rating database, so that items belonging to frequent itemsets are recommended to the user. My CF algorithm gives an immediate response because the mining task is performed offline. I also propose another algorithm, called the Roller algorithm, for improving the process of mining frequent itemsets. The Roller algorithm is built on the heuristic assumption that "the larger the support of an item is, the more likely it is that this item will occur in some frequent itemset". It is modeled on the whitewashing task of rolling a roller across a wall in such a way that it picks up frequent itemsets. Moreover, I provide enhanced techniques such as bit representation, bit matching, and bit mining in order to speed up the recommendation process. These techniques take advantage of bitwise operations (AND, NOT) so as to reduce storage space and make the algorithms run faster.
Simple image deconvolution based on reverse image convolution and backpropaga... - Loc Nguyen
The deconvolution task is not important in convolutional neural networks (CNN) because it is not imperative to recover the convoluted image when the convolutional layer's role is to extract features. However, deconvolution is useful in some cases, for inspecting and reflecting a convolutional filter as well as for trying to improve a generated image when information loss is not serious, with regard to the trade-off between information loss and specific features such as edge detection and sharpening. This research proposes a duplicated and reversed process for recovering a filtered image. Firstly, the source layer and target layer are reversed with respect to traditional image convolution so as to train the convolutional filter. Secondly, the trained filter is reversed again to derive a deconvolutional operator for recovering the filtered image. The reverse process is associated with the backpropagation algorithm, which is most popular in learning neural networks. Experimental results show that the proposed technique is better at learning the filters that focus on discovering pixel differences. Therefore, the main contribution of this research is to inspect convolutional filters from data.
Technological Accessibility: Learning Platform Among Senior High School Students - Loc Nguyen
The document discusses technological accessibility among senior high school students. It covers several topics:
- Returning to nature through technology and ensuring technology accessibility is a life skill strengthened by advantages while avoiding problems for youth.
- Solutions to the dilemma of technology accessibility include letting youth develop independently, using security controls, and encouraging compassion.
- Researching career is optional but following passion and balancing positive and negative emotions can contribute to success.
The document discusses how engineering and technology can be used for social impact. It makes three key points:
1. Technology stems from knowledge and science, but assessing the value of knowledge is difficult. However, the importance of technology is clear through its fruits. Education fosters gaining knowledge and leads to technological development.
2. While technology increases wealth, it also widens inequality gaps. However, technology can indirectly address inequality through education by providing universal access to learning and bringing universities to people everywhere.
3. Diversity among countries should be embraced, and late innovations building on existing technologies can be particularly impactful, as was the strategy of a racing champion who overtook opponents in the final round through close observation and
Harnessing Technology for Research Education - Loc Nguyen
The document discusses harnessing technology for research and education. It notes some paradoxes of technology, such as how it can both increase inequality but also help solve it through education. It discusses the future of education being supported by distance learning tools, AI, and virtual/augmented reality. Open universities are seen as an intermediate step towards "home universities" and as a place where students can become lifelong learners and teachers. Understanding, emotion, and compassion are discussed as being more important than just remembering facts. The document ends by discussing connecting different subjects and technologies like a jigsaw puzzle.
Future of education with support of technology - Loc Nguyen
The document discusses the future of education with the support of technology. It notes that while technology deepens inequality by benefiting the rich, it can indirectly address inequality through education by allowing anyone to access knowledge and study from anywhere. Emerging forms of educational support include distance learning using tools like Zoom, artificial intelligence assistants, and virtual/augmented reality. Blended teaching combining different methods is emphasized. Understanding is seen as more important than memorization, and education should foster understanding, emotional development, and compassion through nature-based activities. The document argues technology can help education promote love by connecting various topics through both fusion and by assembling them like a jigsaw puzzle.
The document discusses generative artificial intelligence (AI) and its applications, using the metaphor of a dragon's flight. It introduces digital generative AI models that can generate images, sounds, and motions. As an example, it describes a model that could generate possible orbits or movements for a dragon and tiger in an image along with a bamboo background. The document then discusses how generative AI could be applied creatively in education to generate new learning materials and methods. It argues that while technology and societies are changing rapidly, climate change is a unifying challenge that nations must work together to overcome.
Adversarial Variational Autoencoders to extend and improve generative model - Loc Nguyen
Generative artificial intelligence (GenAI) has been developing with many incredible achievements like ChatGPT and Bard. The deep generative model (DGM) is a branch of GenAI that is preeminent in generating raster data such as images and sound, owing to the strength of deep neural networks (DNN) in inference and recognition. The built-in inference mechanism of a DNN, which simulates and aims at the synaptic plasticity of the human neural network, fosters the generation ability of DGM, which produces surprising results with the support of statistical flexibility. Two popular approaches in DGM are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). Both VAE and GAN have their own strong points, although they share the same underlying statistical theory as well as incredible complexity in the hidden layers of a DNN, which becomes an effective encoding/decoding function without concrete specifications. In this research, I try to unify VAE and GAN into a consistent and consolidated model called Adversarial Variational Autoencoders (AVA), in which VAE and GAN complement each other: VAE is a good generator, encoding data via the excellent ideology of Kullback-Leibler divergence, and GAN is a significantly important method for assessing whether data is realistic or fake. In other words, AVA aims to improve the accuracy of generative models, and AVA also extends the function of simple generative models. In methodology, this research focuses on combining applied mathematical concepts with skillful computer programming techniques in order to implement and solve complicated problems as simply as possible.
Learning dyadic data and predicting unaccomplished co-occurrent values by mix... - Loc Nguyen
Dyadic data, which is also called co-occurrence data (COD), contains co-occurrences of objects. Searching for statistical models to represent dyadic data is necessary. Fortunately, the finite mixture model is a solid statistical model for learning and making inference on dyadic data, because the mixture model is built smoothly and reliably by the expectation maximization (EM) algorithm, which suits the inherent sparseness of dyadic data. This research summarizes mixture models for dyadic data. When each co-occurrence in dyadic data is associated with a value, many values remain unaccomplished because a lot of co-occurrences are nonexistent. In this research, these unaccomplished values are estimated as the mean (expectation) of the random variable given the partial probabilistic distributions inside the dyadic mixture model.
Machine learning forks into three main branches: supervised learning, unsupervised learning, and reinforcement learning, where reinforcement learning has much potential for artificial intelligence (AI) applications because it solves real problems by a progressive process in which possible solutions are improved and fine-tuned continuously. The progressive approach, which reflects an ability to adapt, is appropriate to the real world, where most events occur and change continuously and unexpectedly. Moreover, data is getting too huge for supervised and unsupervised learning to draw valuable knowledge from at one time. Bayesian optimization (BO) models an optimization problem in a probabilistic form called a surrogate model and then directly maximizes an acquisition function created from that surrogate model, in order to maximize the target function implicitly and indirectly and so find a solution of the optimization problem. A popular surrogate model is the Gaussian process regression model. The process of maximizing the acquisition function is based on repeatedly updating the posterior probability of the surrogate model, which improves after every iteration. Taking advantage of an acquisition or utility function is also common in decision theory, but the semantic meaning behind BO is that it solves problems by a progressive and adaptive approach, updating the surrogate model from a small piece of data at each step, according to the ideology of reinforcement learning. Undoubtedly, BO is a reinforcement learning algorithm with many potential applications, and thus it is surveyed in this research with attention to its mathematical ideas. Moreover, the solution of the optimization problem is important not only to applied mathematics but also to AI.
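To make that loop concrete, the hedged sketch below runs a tiny Bayesian optimization with a Gaussian process surrogate (scikit-learn) and an expected-improvement acquisition function; the target function, kernel choice, and iteration counts are assumptions for illustration and do not come from the survey:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def target(x):                       # unknown function to maximize (assumed for the demo)
    return -(x - 2.0) ** 2 + 3.0

def expected_improvement(mu, sigma, best):
    """EI acquisition: expected gain over the best observation so far."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(3, 1))   # small initial sample
y = target(X).ravel()
grid = np.linspace(0, 5, 200).reshape(-1, 1)

for _ in range(10):                  # progressive refinement of the surrogate model
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, x_next])
    y = np.append(y, target(x_next[0]))

print(X[np.argmax(y)], y.max())      # should approach the maximizer x = 2
```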
Support vector machine is a powerful machine learning method for data classification. Using it in applied research is easy, but comprehending it for further development requires a lot of effort. This report is a tutorial on support vector machines, full of mathematical proofs and examples, which helps researchers understand it in the fastest way from theory to practice. The report focuses on the theory of optimization, which is the basis of the support vector machine.
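On the practical side, a minimal scikit-learn usage example (assumed here purely for illustration; the report itself concentrates on the optimization theory behind the classifier) looks like this:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)       # maximum-margin classifier solved as a constrained optimization
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))     # held-out accuracy
```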
A Proposal of Two-step Autoregressive Model - Loc Nguyen
Autoregressive (AR) model and conditional autoregressive (CAR) model are specific regressive models in which the independent variables and the dependent variable refer to the same object. They are powerful statistical tools for predicting values based on correlation in the time domain and the space domain, which is useful in epidemiological analysis. In this research, I combine them in a simple way in which AR and CAR are estimated in two separate steps so as to cover the time domain and the space domain in spatial-temporal data analysis. Moreover, I integrate a logistic model into the CAR model, which aims to improve the competence of autoregressive models.
Extreme bound analysis based on correlation coefficient for optimal regressio...Loc Nguyen
Regression analysis is an important tool in statistical analysis, in which there is a demand of discovering essential independent variables among many other ones, especially in case that there is a huge number of random variables. Extreme bound analysis is a powerful approach to extract such important variables called robust regressors. In this research, a so-called Regressive Expectation Maximization with RObust regressors (REMRO) algorithm is proposed as an alternative method beside other probabilistic methods for analyzing robust variables. By the different ideology from other probabilistic methods, REMRO searches for robust regressors forming optimal regression model and sorts them according to descending ordering given their fitness values determined by two proposed concepts of local correlation and global correlation. Local correlation represents sufficient explanatories to possible regressive models and global correlation reflects independence level and stand-alone capacity of regressors. Moreover, REMRO can resist incomplete data because it applies Regressive Expectation Maximization (REM) algorithm into filling missing values by estimated values based on ideology of expectation maximization (EM) algorithm. From experimental results, REMRO is more accurate for modeling numeric regressors than traditional probabilistic methods like Sala-I-Martin method but REMRO cannot be applied in case of nonnumeric regression model yet in this research.
The document proposes a "jagged stock investment" strategy where stocks are repeatedly purchased at intervals where the price is increasing, similar to rises and falls in a saw blade. It shows mathematically that this strategy can achieve equal or higher returns than bank depositing if the average growth rate and interest rate are the same. The strategy models purchasing a share and its replications over time periods where each replication is bought using the profits from the previous rise. It provides formulas to calculate the overall value and return on investment from this jagged stock investment approach.
This document presents a study on minima distribution and its application to explaining the convergence and convergence speed of optimization algorithms. The author proposes weaker conditions for the convergence and monotonicity properties of minima distribution compared to previous work. Specifically, the convergence condition requires the function τ(x) to be positive and non-minimizing everywhere except at minimizers of the target function f(x). The monotonicity condition requires f(x) and the derivative of τ(x) to be inversely proportional. The author then defines convergence speed as the derivative of the integral of f(x) with respect to the optimization iterations and shows it depends on the slope of the derivative of τ(x).
This is chapter 4 “Variants of EM algorithm” in my book “Tutorial on EM algorithm”, which focuses on EM variants. The main purpose of expectation maximization (EM) algorithm, also GEM algorithm, is to maximize the log-likelihood L(Θ) = log(g(Y|Θ)) with observed data Y by maximizing the conditional expectation Q(Θ’|Θ). Such Q(Θ’|Θ) is defined fixedly in E-step. Therefore, most variants of EM algorithm focus on how to maximize Q(Θ’|Θ) in M-step more effectively so that EM is faster or more accurate.
This is the chapter 3 "Properties and convergence of EM algorithm" in my book “Tutorial on EM algorithm”, which focuses on mathematical explanation of the convergence of GEM algorithm given by DLR (Dempster, Laird, & Rubin, 1977, pp. 6-9).
Tutorial on EM algorithm – Part 1
1. Tutorial on EM algorithm – Part 1
Prof. Dr. Loc Nguyen, PhD, PostDoc
Loc Nguyen’s Academic Network, Vietnam
Email: ng_phloc@yahoo.com
Homepage: www.locnguyen.net
2. Abstract
Maximum likelihood estimation (MLE) is a popular method for parameter estimation in both
applied probability and statistics but MLE cannot solve the problem of incomplete data or
hidden data because it is impossible to maximize likelihood function from hidden data.
Expectation maximum (EM) algorithm is a powerful mathematical tool for solving this problem
if there is a relationship between hidden data and observed data. Such hinting relationship is
specified by a mapping from hidden data to observed data or by a joint probability between
hidden data and observed data. In other words, the relationship helps us know hidden data by
surveying observed data. The essential ideology of EM is to maximize the expectation of
likelihood function over observed data based on the hinting relationship instead of maximizing
directly the likelihood function of hidden data. Pioneers in EM algorithm proved its
convergence. As a result, EM algorithm produces parameter estimators as well as MLE does.
This tutorial aims to provide explanations of EM algorithm in order to help researchers
comprehend it. Moreover some improvements of EM algorithm are also proposed in the tutorial
such as combination of EM and third-order convergence Newton-Raphson process, combination
of EM and gradient descent method, and combination of EM and particle swarm optimization
(PSO) algorithm.
3. Table of contents
This report focuses on the introduction of EM and related concepts.
1. Basic concepts
2. Parameter estimation
3. EM algorithm
4. 1. Basic concepts
Literature of the expectation maximization (EM) algorithm in this tutorial is mainly extracted from the preeminent article "Maximum Likelihood from Incomplete Data via the EM Algorithm" by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin (Dempster, Laird, & Rubin, 1977). For convenience, let DLR refer to these three authors. We begin the review of the EM algorithm with some basic concepts. By default, vectors are column vectors although a vector can be a column vector or a row vector. For example, given two vectors X and Y and two matrices A and B:
X = (x_1, x_2, …, x_r)^T, Y = (y_1, y_2, …, y_r)^T
A =
[ a_11  a_12  ⋯  a_1n ]
[ a_21  a_22  ⋯  a_2n ]
[ ⋮     ⋮     ⋱  ⋮    ]
[ a_m1  a_m2  ⋯  a_mn ]
B =
[ b_11  b_12  ⋯  b_1k ]
[ b_21  b_22  ⋯  b_2k ]
[ ⋮     ⋮     ⋱  ⋮    ]
[ b_n1  b_n2  ⋯  b_nk ]
X and Y above are column vectors. A row vector is written as Z = (z_1, z_2, …, z_r). The zero vector is denoted 0 = (0, 0, …, 0)^T and the zero matrix is denoted 0 = (0)_{m×n}, the m×n matrix whose entries are all zero.
5. 1. Basic concepts
Matrix Λ is diagonal if it is square and its elements outside the main diagonal are zero, Λ = diag(λ_1, λ_2, …, λ_r). Let I be the identity matrix or unit matrix, I = diag(1, 1, …, 1). Vector addition and matrix addition are defined entrywise, like numerical addition:
X ± Y = (x_1 ± y_1, x_2 ± y_2, …, x_r ± y_r)^T
A ± B =
[ a_11 ± b_11  a_12 ± b_12  ⋯  a_1n ± b_1n ]
[ a_21 ± b_21  a_22 ± b_22  ⋯  a_2n ± b_2n ]
[ ⋮            ⋮            ⋱  ⋮           ]
[ a_m1 ± b_m1  a_m2 ± b_m2  ⋯  a_mn ± b_mn ]
A vector or matrix can be multiplied with a scalar k:
kX = (kx_1, kx_2, …, kx_r)^T
kA =
[ ka_11  ka_12  ⋯  ka_1n ]
[ ka_21  ka_22  ⋯  ka_2n ]
[ ⋮      ⋮      ⋱  ⋮     ]
[ ka_m1  ka_m2  ⋯  ka_mn ]
6. 1. Basic concepts
Let superscript "T" denote the transposition operator for vectors and matrices:
X^T = (x_1, x_2, …, x_r), (A^T)_{ij} = a_ji
A square matrix A is symmetric if and only if A^T = A. The transposition operator is linear with respect to the addition operator:
(X + Y)^T = X^T + Y^T, (A + B)^T = A^T + B^T
The dot product or scalar product of two vectors can be written with the transposition operator: X^T Y = Σ_{i=1}^{r} x_i y_i. The outer product XY^T, by contrast, results in an r×r matrix whose (i, j) entry is x_i y_j, with (XY^T)^T = YX^T:
XY^T =
[ x_1 y_1  x_1 y_2  ⋯  x_1 y_r ]
[ x_2 y_1  x_2 y_2  ⋯  x_2 y_r ]
[ ⋮        ⋮        ⋱  ⋮       ]
[ x_r y_1  x_r y_2  ⋯  x_r y_r ]
7. 1. Basic concepts
The length or module of a vector X in Euclidean space is |X| = sqrt(X^T X) = sqrt(Σ_{i=1}^{r} x_i^2).
The notation |.| also denotes the absolute value of a scalar and the determinant of a square matrix. Note, the determinant is only defined for square matrices. Let A and B be two square n×n matrices; we have:
|cA| = c^n |A| where c is a scalar
|A^T| = |A|
|AB| = |A||B|
If A has nonzero determinant (|A| ≠ 0), there exists its inverse, denoted A^{-1}, such that A A^{-1} = A^{-1} A = I, where I is the identity matrix. If matrix A has an inverse, A is called invertible or non-singular. Let A and B be two invertible matrices; we have:
(AB)^{-1} = B^{-1} A^{-1}
|A^{-1}| = |A|^{-1} = 1 / |A|
(A^T)^{-1} = (A^{-1})^T
8. 1. Basic concepts
Given an invertible matrix A, it is called an orthogonal matrix if A^{-1} = A^T, which means A A^{-1} = A^{-1} A = A A^T = A^T A = I. Note that an orthogonal matrix need not be symmetric. The product (multiplication operator) of two matrices A_{m×n} and B_{n×k} is the m×k matrix C = AB with entries
c_ij = Σ_{v=1}^{n} a_iv b_vj
A square matrix A is symmetric if a_ij = a_ji for all i and j; in that case A^T = A. If A and B are symmetric matrices of the same size and they commute (AB = BA), then their product AB is itself a symmetric matrix. Given an invertible matrix A, if it is symmetric, its inverse A^{-1} is symmetric too. Given N matrices A_i such that their product (multiplication operator) is valid, we have:
(A_1 A_2 … A_N)^T = A_N^T A_{N-1}^T … A_1^T
Given a square matrix A, tr(A) is the trace operator, which takes the sum of its diagonal elements:
tr(A) = Σ_i a_ii
9. 1. Basic concepts
Given a symmetric invertible matrix A (n rows and n columns), the spectral (Jordan) decomposition theorem (Hardle & Simar, 2013, p. 63) states that A can always be decomposed as:
A = UΛU^{-1} = UΛU^T
where U is an orthogonal matrix composed of eigenvectors; hence U is called the eigenvector matrix,
U = (u_1, u_2, …, u_n), with u_i = (u_i1, u_i2, …, u_in)^T as the ith column.
The n column eigenvectors u_i in U are mutually orthogonal, u_i^T u_j = 0 for i ≠ j. Λ is the diagonal matrix composed of eigenvalues; hence Λ is called the eigenvalue matrix,
Λ = diag(λ_1, λ_2, …, λ_n)
where the λ_i are eigenvalues. When matrix A is decomposed in this way, we say A is diagonalized. If A can be diagonalized, it is called a diagonalizable matrix; in particular, every symmetric matrix is diagonalizable. There are many documents on matrix diagonalization.
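As a quick illustration (a minimal sketch, not part of the original slides, using an arbitrary symmetric matrix), the decomposition A = UΛU^T with an orthogonal eigenvector matrix can be checked numerically with NumPy:

import numpy as np

# A small symmetric matrix; for symmetric matrices the eigendecomposition
# A = U @ diag(lambda) @ U.T holds with an orthogonal eigenvector matrix U.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, U = np.linalg.eigh(A)          # eigh is specialized for symmetric matrices
Lam = np.diag(eigvals)

print(np.allclose(A, U @ Lam @ U.T))    # True: A = U Λ U^T
print(np.allclose(U.T @ U, np.eye(2)))  # True: U is orthogonal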
10. 1. Basic concepts
Given two diagonalizable matrices A and B of equal size (n×n) that commute with each other, they are simultaneously diagonalizable (Wikipedia, Commuting matrices, 2017) and hence there exists an orthogonal eigenvector matrix U such that (Wikipedia, Diagonalizable matrix, 2017) (StackExchange, 2013):
A = UΓU^{-1} = UΓU^T
B = UΛU^{-1} = UΛU^T
where Γ and Λ are the eigenvalue matrices of A and B, respectively. Given a symmetric matrix A, it is positive (negative) definite if and only if X^T A X > 0 (X^T A X < 0) for every vector X ≠ 0. It is positive (negative) semi-definite if and only if X^T A X ≥ 0 (X^T A X ≤ 0) for every vector X. When a diagonalizable A is diagonalized into UΛU^T, it is positive (negative) definite if and only if all eigenvalues in Λ are positive (negative). Similarly, it is positive (negative) semi-definite if and only if all eigenvalues in Λ are non-negative (non-positive). If A degenerates to a scalar, the concepts "positive definite", "positive semi-definite", "negative definite", and "negative semi-definite" become "positive", "non-negative", "negative", and "non-positive", respectively.
11. 1. Basic concepts
Suppose f(X) is a scalar-by-vector function, for instance f: R^n → R where R^n is the n-dimensional real vector space. The first-order derivative of f(X) is the gradient vector:
f'(X) = ∇f(X) = df(X)/dX = Df(X) = (∂f(X)/∂x_1, ∂f(X)/∂x_2, …, ∂f(X)/∂x_n)
where ∂f(X)/∂x_i is the partial first-order derivative of f with regard to x_i. So the gradient is a row vector. The second-order derivative of f(X) is called the Hessian matrix:
f''(X) = d²f(X)/dX² = D²f(X) =
[ ∂²f(X)/∂x_1²       ∂²f(X)/∂x_1∂x_2    ⋯  ∂²f(X)/∂x_1∂x_n ]
[ ∂²f(X)/∂x_2∂x_1    ∂²f(X)/∂x_2²       ⋯  ∂²f(X)/∂x_2∂x_n ]
[ ⋮                  ⋮                  ⋱  ⋮               ]
[ ∂²f(X)/∂x_n∂x_1    ∂²f(X)/∂x_n∂x_2    ⋯  ∂²f(X)/∂x_n²    ]
where ∂²f(X)/(∂x_i∂x_j) = ∂/∂x_i (∂f(X)/∂x_j) and ∂²f(X)/∂x_i² = ∂²f(X)/(∂x_i∂x_i). The second-order partial derivatives with respect to the x_i are on the diagonal of the Hessian matrix.
12. 1. Basic concepts
In general, vector calculus is a complex subject. Here we focus on scalar-by-vector functions with some properties. Let c, A, B, and M be a scalar constant, a vector constant, a vector constant, and a matrix constant, respectively. Supposing the vector and matrix operators are valid, we have:
dc/dX = 0^T, d(A^T X)/dX = d(X^T A)/dX = A^T, d(X^T X)/dX = 2X^T
d(A^T X X^T B)/dX = X^T (A B^T + B A^T), d(A^T M X)/dX = d(X^T M^T A)/dX = A^T M
If M is a square matrix constant, we have:
d(X^T M X)/dX = X^T (M + M^T), d²(X^T M X)/dX² = M + M^T
Function f(X) is called a Kth-order analytic function or Kth-order smooth function if the kth-order derivatives of f(X) exist and are continuous for k = 1, 2, …, K. Function f(X) is called a smooth enough function if K is large enough. According to Schwarz's theorem (Wikipedia, Symmetry of second derivatives, 2018), if f(X) is a second-order smooth function then its Hessian matrix is symmetric:
∂²f(X)/(∂x_i∂x_j) = ∂²f(X)/(∂x_j∂x_i)
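As a small sanity check (a sketch added here, not from the slides), the identity d(X^T M X)/dX = X^T (M + M^T) can be verified symbolically for a 2-dimensional X with SymPy:

import sympy as sp

# Symbolic check of d(X^T M X)/dX = X^T (M + M^T) for n = 2.
x = sp.Matrix(sp.symbols('x1 x2'))                 # column vector X
M = sp.Matrix(2, 2, sp.symbols('m11 m12 m21 m22')) # arbitrary 2x2 constant matrix

f = (x.T * M * x)[0, 0]                            # scalar-by-vector function X^T M X
grad = sp.Matrix([[sp.diff(f, xi) for xi in x]])   # row-vector gradient
claimed = x.T * (M + M.T)                          # closed form from the slide

print(sp.simplify(grad - claimed))                 # zero matrix: the identity holds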
13. 1. Basic concepts
Given f(X) being a second-order smooth function, f(X) is convex (strictly convex) in domain X if and only if its Hessian matrix is positive semi-definite (positive definite) in X. Similarly, f(X) is concave (strictly concave) in domain X if and only if its Hessian matrix is negative semi-definite (negative definite) in X. The extreme point, optimized point, optimal point, or optimizer X* is the minimum point (minimizer) of a convex function and the maximum point (maximizer) of a concave function:
X* = argmin_{X ∈ X} f(X) if f is convex in X
X* = argmax_{X ∈ X} f(X) if f is concave in X
Given a second-order smooth function f(X), f(X) has a stationary point X* if its gradient vector at X* is zero, Df(X*) = 0^T. The stationary point X* is a local minimum point if the Hessian matrix at X*, that is D²f(X*), is positive definite. Conversely, the stationary point X* is a local maximum point if the Hessian matrix D²f(X*) is negative definite. If a stationary point X* is not an extreme point, it is a saddle point, at which Df(X*) = 0^T but D²f(X*) is neither positive definite nor negative definite. Finding an extreme point is an optimization problem. Therefore, if f(X) is a second-order smooth function and both its gradient vector Df(X) and Hessian matrix D²f(X) are determined, the optimization problem is processed by solving the equation created from setting the gradient Df(X) to zero (Df(X) = 0^T) and then checking whether the Hessian matrix D²f(X*) is positive definite or negative definite, where X* is a solution of the equation Df(X) = 0^T. If such an equation cannot be solved due to its complexity, there are some popular methods to solve the optimization problem, such as Newton-Raphson (Burden & Faires, 2011, pp. 67-71) and gradient descent (Ta, 2014).
14. 1. Basic concepts
A short description of the Newton-Raphson method is necessary because it is helpful for solving the equation Df(X) = 0^T in the optimization problem in practice. Suppose f(X) is a second-order smooth function; according to the first-order Taylor series expansion of Df(X) at X = X_0 with very small residual, we have:
Df(X) ≈ Df(X_0) + (X − X_0)^T (D²f(X_0))^T
Because f(X) is a second-order smooth function, D²f(X_0) is a symmetric matrix according to Schwarz's theorem (Wikipedia, Symmetry of second derivatives, 2018), which implies:
D²f(X_0) = (D²f(X_0))^T
So we have:
Df(X) ≈ Df(X_0) + (X − X_0)^T D²f(X_0)
We expect that Df(X) = 0^T so that X is a solution:
0^T = Df(X) ≈ Df(X_0) + (X − X_0)^T D²f(X_0)
This means:
X ≈ X_0 − (D²f(X_0))^{-1} (Df(X_0))^T
15. 1. Basic concepts
Therefore, the Newton-Raphson method starts with an arbitrary value of X_0 as a solution candidate and then goes through some iterations. Suppose at the kth iteration the current value is X_k; the next value X_{k+1} is calculated by the following equation:
X_{k+1} ≈ X_k − (D²f(X_k))^{-1} (Df(X_k))^T
The value X_k is a solution of Df(X) = 0^T if Df(X_k) = 0^T, which means that X_{k+1} = X_k after some iterations. At that time X_{k+1} = X_k = X* is the local optimized point (local extreme point). So the terminating condition of the Newton-Raphson method is Df(X_k) = 0^T. Note, the X* resulting from the Newton-Raphson method is a local minimum point (local maximum point) if f(X) is a convex function (concave function) in the current domain.
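The update above can be coded in a few lines. The following is a minimal sketch (not part of the original slides); the quadratic objective f(X) = (x_1 − 1)² + 2(x_2 + 3)² and its gradient and Hessian are hypothetical examples chosen only for illustration:

import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton-Raphson iteration X_{k+1} = X_k - H(X_k)^{-1} Df(X_k)^T."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # terminating condition Df(X_k) ~ 0
            break
        x = x - np.linalg.solve(hess(x), g)  # solve H d = g instead of inverting H
    return x

# Hypothetical example: minimize f(X) = (x1 - 1)^2 + 2*(x2 + 3)^2.
grad = lambda X: np.array([2 * (X[0] - 1), 4 * (X[1] + 3)])
hess = lambda X: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_raphson(grad, hess, x0=[0.0, 0.0]))   # converges to [1, -3]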
16. 1. Basic concepts
The Newton-Raphson method computes the second-order derivative D²f(X) but the gradient descent method (Ta, 2014) does not. This difference is not significant, but a short description of the gradient descent method is necessary because it is also an important method for solving the optimization problem in case solving the equation Df(X) = 0^T directly is too complicated. Gradient descent is also an iterative method starting with an arbitrary value X_0 as a solution candidate. Suppose at the kth iteration the next candidate point X_{k+1} is computed from the current X_k as follows (Ta, 2014):
X_{k+1} = X_k + t_k d_k
The direction d_k is called the descending direction, which is the opposite of the gradient of f(X); hence d_k = −Df(X_k). The value t_k is the step length along the descending direction d_k. The value t_k is often selected as a minimizer (maximizer) of the function g(t) = f(X_k + t d_k) for minimization (maximization), where X_k and d_k are known at the kth iteration. Alternatively, t_k is selected by some advanced condition such as the Barzilai-Borwein condition (Wikipedia, Gradient descent, 2018). After some iterations, the point X_k converges to the local optimizer X* when d_k = 0^T; at that time we have X_{k+1} = X_k = X*. So the terminating condition of the gradient descent method is d_k = 0^T. Note, the X* resulting from the gradient descent method is a local minimum point (local maximum point) if f(X) is a convex function (concave function) in the current domain.
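For comparison with the Newton-Raphson sketch above, here is a minimal gradient descent sketch (again not from the slides); it uses a fixed step length t_k instead of the line search or Barzilai-Borwein rule mentioned above, and reuses the same hypothetical objective:

import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Gradient descent X_{k+1} = X_k + t_k * d_k with d_k = -Df(X_k) and fixed t_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -grad(x)                     # descending direction
        if np.linalg.norm(d) < tol:      # terminating condition d_k ~ 0
            break
        x = x + step * d
    return x

# Same hypothetical objective as before: f(X) = (x1 - 1)^2 + 2*(x2 + 3)^2.
grad = lambda X: np.array([2 * (X[0] - 1), 4 * (X[1] + 3)])
print(gradient_descent(grad, x0=[0.0, 0.0]))   # converges to [1, -3]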
17. 1. Basic concepts
In the case that the optimization problem has some constraints, Lagrange duality (Jia, 2013) is applied to solve the problem. Given a first-order smooth function f(X) and constraints g_i(X) ≤ 0 and h_j(X) = 0, the optimization problem is:
Optimize f(X)
subject to g_i(X) ≤ 0 for i = 1, …, m
           h_j(X) = 0 for j = 1, …, n
A so-called Lagrange function la(X, λ, μ) is established as the sum of f(X) and the constraints multiplied by the Lagrange multipliers λ and μ. In the case of a minimization problem, la(X, λ, μ) is
la(X, λ, μ) = f(X) + Σ_{i=1}^{m} λ_i g_i(X) + Σ_{j=1}^{n} μ_j h_j(X)
In the case of a maximization problem, la(X, λ, μ) is
la(X, λ, μ) = f(X) − Σ_{i=1}^{m} λ_i g_i(X) − Σ_{j=1}^{n} μ_j h_j(X)
where all λ_i ≥ 0. Note, λ = (λ_1, λ_2, …, λ_m)^T and μ = (μ_1, μ_2, …, μ_n)^T are called Lagrange multipliers and la(X, λ, μ) is a function of X, λ, and μ. Thus, optimizing f(X) subject to the constraints g_i(X) ≤ 0 and h_j(X) = 0 is equivalent to optimizing la(X, λ, μ), which is the reason that this method is called Lagrange duality.
18. 1. Basic concepts
In the case of a minimization problem, the gradient of la(X, λ, μ) w.r.t. X is Dla(X, λ, μ) = Df(X) + Σ_{i=1}^{m} λ_i Dg_i(X) + Σ_{j=1}^{n} μ_j Dh_j(X). In the case of a maximization problem, the gradient of la(X, λ, μ) w.r.t. X is Dla(X, λ, μ) = Df(X) − Σ_{i=1}^{m} λ_i Dg_i(X) − Σ_{j=1}^{n} μ_j Dh_j(X). According to the KKT condition, a local optimized point X* is a solution of the following equation system:
Dla(X, λ, μ) = 0^T
g_i(X) ≤ 0 for i = 1, …, m
h_j(X) = 0 for j = 1, …, n
λ_i ≥ 0 for i = 1, …, m
Σ_{i=1}^{m} λ_i g_i(X) = 0
The last equation in the KKT system above is called complementary slackness. The main task of the KKT problem is to solve the first equation Dla(X, λ, μ) = 0^T. Again, some practical methods such as the Newton-Raphson method can be used to solve the equation Dla(X, λ, μ) = 0^T. Alternatively, the gradient descent method can be used to optimize la(X, λ, μ) with the constraints specified in the KKT system:
g_i(X) ≤ 0 for i = 1, …, m
h_j(X) = 0 for j = 1, …, n
λ_i ≥ 0 for i = 1, …, m
Σ_{i=1}^{m} λ_i g_i(X) = 0
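In practice, constrained problems of this form are often handed to a solver that enforces the KKT conditions internally. The following is a minimal sketch using SciPy's SLSQP solver on a small hypothetical problem (objective and constraints are invented for illustration; note SciPy expects inequality constraints written as fun(X) ≥ 0, i.e. −g_i(X) ≥ 0):

import numpy as np
from scipy.optimize import minimize

# Hypothetical problem: minimize f(X) = x1^2 + x2^2
# subject to g(X) = 1 - x1 - x2 <= 0 (i.e. x1 + x2 >= 1) and h(X) = x1 - 2*x2 = 0.
f = lambda X: X[0] ** 2 + X[1] ** 2
constraints = [
    {"type": "ineq", "fun": lambda X: X[0] + X[1] - 1.0},  # -g(X) >= 0
    {"type": "eq",   "fun": lambda X: X[0] - 2.0 * X[1]},  # h(X) = 0
]
result = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(result.x)   # about [2/3, 1/3], the KKT point of this toy problem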
19. 1. Basic concepts
We need to skim some essential probabilistic rules such as the addition rule, multiplication rule, total probability rule, and Bayes' rule. Given two random events (or random variables) X and Y, the addition rule (Montgomery & Runger, 2003, p. 33) and multiplication rule (Montgomery & Runger, 2003, p. 44) are expressed as follows:
P(X ∪ Y) = P(X) + P(Y) − P(X ∩ Y)
P(X ∩ Y) = P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)
where the notations ∪ and ∩ denote the union operator and intersection operator in set theory (Wikipedia, Set (mathematics), 2014), (Rosen, 2012, pp. 1-12). The probability P(X, Y) is known as the joint probability. The probability P(X|Y) is called the conditional probability of X given Y:
P(X|Y) = P(X, Y) / P(Y) = P(X ∩ Y) / P(Y) = P(Y|X) P(X) / P(Y)
Conditional probability is the base of Bayes' rule mentioned later. If X and Y are mutually exclusive (X ∩ Y = ∅), then X ∪ Y is often denoted as X + Y and we have:
P(X + Y) = P(X) + P(Y)
20. 1. Basic concepts
X and Y are mutually independent if and only if one of the three following conditions is satisfied:
P(X ∩ Y) = P(X) P(Y)
P(X|Y) = P(X)
P(Y|X) = P(Y)
When X and Y are mutually independent, X ∩ Y is often denoted as XY and we have:
P(XY) = P(X, Y) = P(X ∩ Y) = P(X) P(Y)
Given a complete set of mutually exclusive events X_1, X_2, …, X_n such that
X_1 ∪ X_2 ∪ ⋯ ∪ X_n = X_1 + X_2 + ⋯ + X_n = Ω, where Ω is the probability space, and
X_i ∩ X_j = ∅ for all i ≠ j,
the total probability rule (Montgomery & Runger, 2003, p. 44) is specified as follows:
P(Y) = P(Y|X_1) P(X_1) + P(Y|X_2) P(X_2) + ⋯ + P(Y|X_n) P(X_n) = Σ_{i=1}^{n} P(Y|X_i) P(X_i)
If X and Y are continuous variables, the total probability rule is re-written in integral form as follows:
P(Y) = ∫_X P(Y|X) P(X) dX
Note, P(Y|X) and P(X) are continuous functions known as probability density functions, mentioned later.
21. 1. Basic concepts
A variable X is called a random variable if it conforms to a probabilistic distribution which is specified by a probability density function (PDF) or a cumulative distribution function (CDF) (Montgomery & Runger, 2003, pp. 64, 102). CDF and PDF have the same meaning and they are interchangeable because the PDF is the derivative of the CDF. In practical statistics, the PDF is used more commonly than the CDF. When X is discrete, the PDF degenerates to the probability of X. Let F(X) and f(X) be the CDF and PDF, respectively; equation 1.1 is the definition of CDF and PDF.
Continuous case:
F(X_0) = P(X ≤ X_0) = ∫_{−∞}^{X_0} f(X) dX, with ∫_{−∞}^{+∞} f(X) dX = 1
Discrete case:
F(X_0) = P(X ≤ X_0) = Σ_{X ≤ X_0} P(X), with f(X) = P(X) and Σ_X P(X) = 1
(1.1)
In the discrete case, the probability at a single point X_0 is determined as P(X_0) = f(X_0), but in the continuous case the probability is determined over an interval [a, b], (a, b), [a, b), or (a, b], where a, b, and X are real, as the integral of the PDF over that interval:
P(a ≤ X ≤ b) = ∫_{a}^{b} f(X) dX
22. 1. Basic concepts
It is easy to extend equation 1.1 to a multivariate variable when X is a vector. Let X = (x_1, x_2, …, x_n)^T be an n-dimensional random vector; its CDF and PDF are re-defined as follows (equation 1.2):
Continuous case:
F(X_0) = P(X ≤ X_0) = P(x_1 ≤ x_01, x_2 ≤ x_02, …, x_n ≤ x_0n) = ∫_{−∞}^{X_0} f(X) dX = ∫_{−∞}^{x_01} ∫_{−∞}^{x_02} … ∫_{−∞}^{x_0n} f(X) dx_1 dx_2 … dx_n
∫_{−∞}^{+∞} f(X) dX = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} … ∫_{−∞}^{+∞} f(X) dx_1 dx_2 … dx_n = 1
Discrete case:
F(X_0) = P(X ≤ X_0) = P(x_1 ≤ x_01, x_2 ≤ x_02, …, x_n ≤ x_0n) = Σ_{X ≤ X_0} P(X) = Σ_{x_1 ≤ x_01} Σ_{x_2 ≤ x_02} … Σ_{x_n ≤ x_0n} P(X)
f(X) = P(X), with Σ_X P(X) = 1 where the sum runs over all possible values of X
23. 1. Basic concepts
The marginal PDF of a partial variable x_i, where x_i is a component of X, is the integral of f(X) over all x_j except x_i:
f_{x_i}(x_i) = ∫_{−∞}^{+∞} … ∫_{−∞}^{+∞} f(X) dx_1 … dx_{i−1} dx_{i+1} … dx_n
where ∫_{−∞}^{+∞} f_{x_i}(x_i) dx_i = 1. The joint PDF of x_i and x_j is defined as the integral of f(X) over all x_k except x_i and x_j:
f_{x_i x_j}(x_i, x_j) = ∫_{−∞}^{+∞} … ∫_{−∞}^{+∞} f(X) dx_1 … dx_{i−1} dx_{i+1} … dx_{j−1} dx_{j+1} … dx_n
where ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f_{x_i x_j}(x_i, x_j) dx_i dx_j = 1. The conditional PDF of x_i given x_j is defined as follows:
f_{x_i | x_j}(x_i) = f_{x_i x_j}(x_i, x_j) / f_{x_j}(x_j)
Indeed, the conditional PDF implies conditional probability.
24. 1. Basic concepts
Given a random variable X and its PDF f(X), the theoretical expectation E(X) and theoretical variance V(X) of X are:
E(X) = ∫_X X f(X) dX   (1.3)
V(X) = E((X − E(X))(X − E(X))^T) = ∫_X (X − E(X))(X − E(X))^T f(X) dX = E(X X^T) − E(X) E(X)^T   (1.4)
The expectation E(X) of X is often called the theoretical mean. When X is a multivariate vector, E(X) is the mean vector and V(X) is the covariance matrix, which is always symmetric. When X = (x_1, x_2, …, x_n)^T is multivariate, E(X) and V(X) have the following forms:
E(X) = (E(x_1), E(x_2), …, E(x_n))^T
V(X) =
[ V(x_1)       V(x_1, x_2)  ⋯  V(x_1, x_n) ]
[ V(x_2, x_1)  V(x_2)       ⋯  V(x_2, x_n) ]
[ ⋮            ⋮            ⋱  ⋮           ]
[ V(x_n, x_1)  V(x_n, x_2)  ⋯  V(x_n)      ]
25. 1. Basic concepts
Therefore, theoretical means and variances of the partial variables x_i can be determined separately. For instance, each E(x_i) is the theoretical mean of partial variable x_i given the marginal PDF f_{x_i}(x_i):
E(x_i) = ∫_X x_i f(X) dX = ∫_{x_i} x_i f_{x_i}(x_i) dx_i
Each V(x_i, x_j) is the theoretical covariance of partial variables x_i and x_j given the joint PDF f_{x_i x_j}(x_i, x_j):
V(x_i, x_j) = V(x_j, x_i) = E((x_i − E(x_i))(x_j − E(x_j))) = ∫_X (x_i − E(x_i))(x_j − E(x_j)) f(X) dX
            = ∫_{x_i} ∫_{x_j} (x_i − E(x_i))(x_j − E(x_j)) f_{x_i x_j}(x_i, x_j) dx_i dx_j
Note, E(x_i x_j) = ∫_X x_i x_j f(X) dX = ∫_{x_i} ∫_{x_j} x_i x_j f_{x_i x_j}(x_i, x_j) dx_i dx_j = V(x_i, x_j) + E(x_i) E(x_j). Each V(x_i) on the diagonal of V(X) is the theoretical variance of partial variable x_i given the marginal PDF f_{x_i}(x_i):
V(x_i) = E((x_i − E(x_i))²) = ∫_X (x_i − E(x_i))² f(X) dX = ∫_{x_i} (x_i − E(x_i))² f_{x_i}(x_i) dx_i
Note, E(x_i²) = ∫_X x_i² f(X) dX = ∫_{x_i} x_i² f_{x_i}(x_i) dx_i = V(x_i) + E(x_i)².
26. 1. Basic concepts
Given two random variables X and Y along with a joint PDF f(X, Y), the theoretical covariance of X and Y is defined as follows:
V(X, Y) = E((X − E(X))(Y − E(Y))^T) = ∫_X ∫_Y (X − E(X))(Y − E(Y))^T f(X, Y) dX dY   (1.5)
If the random variables X and Y are mutually independent given the joint PDF f(X, Y), their covariance is zero, V(X, Y) = 0. However, it is not sure that X and Y are mutually independent if V(X, Y) = 0. Note, a joint PDF is a PDF having two or more random variables. If X and Y are multivariate vectors, V(X, Y) is the theoretical covariance matrix of X and Y given the joint PDF f(X, Y). When X = (x_1, x_2, …, x_m)^T and Y = (y_1, y_2, …, y_n)^T are multivariate, V(X, Y) has the following form:
V(X, Y) =
[ V(x_1, y_1)  V(x_1, y_2)  ⋯  V(x_1, y_n) ]
[ V(x_2, y_1)  V(x_2, y_2)  ⋯  V(x_2, y_n) ]
[ ⋮            ⋮            ⋱  ⋮           ]
[ V(x_m, y_1)  V(x_m, y_2)  ⋯  V(x_m, y_n) ]
where V(x_i, y_j) is the covariance of x_i and y_j. We have:
V(x_i, y_j) = E((x_i − E(x_i))(y_j − E(y_j))) = ∫_X ∫_Y (x_i − E(x_i))(y_j − E(y_j)) f(X, Y) dX dY
            = ∫_{x_i} ∫_{y_j} (x_i − E(x_i))(y_j − E(y_j)) f_{x_i y_j}(x_i, y_j) dx_i dy_j = E(x_i y_j) − E(x_i) E(y_j)
28. 1. Basic concepts
When X is univariate, Σ is often denoted as σ² (if it is a parameter of the PDF). For example, if X is univariate and follows the normal distribution, its PDF is
f(X) = (2πσ²)^{-1/2} exp(−(X − μ)² / (2σ²))
If X = (x_1, x_2, …, x_n)^T is multivariate and follows the multinormal (multivariate normal) distribution, its PDF is:
f(X) = (2π)^{-n/2} |Σ|^{-1/2} exp(−(1/2) (X − μ)^T Σ^{-1} (X − μ))
In this case, the parameters μ and Σ have the following forms:
μ = E(X) = (μ_1, μ_2, …, μ_n)^T
Σ = V(X) =
[ σ_11  σ_12  ⋯  σ_1n ]
[ σ_21  σ_22  ⋯  σ_2n ]
[ ⋮     ⋮     ⋱  ⋮    ]
[ σ_n1  σ_n2  ⋯  σ_nn ]
Each μ_i is the theoretical mean of partial variable x_i as usual: μ_i = E(x_i). Each σ_ij where i ≠ j is the theoretical covariance of partial variables x_i and x_j as usual: σ_ij = σ_ji = V(x_i, x_j) = V(x_j, x_i). Note, E(x_i x_j) = σ_ij + μ_i μ_j. Each σ_ii on the diagonal of Σ is the theoretical variance of partial variable x_i as usual: σ_ii = σ_i² = V(x_i). Note, E(x_i²) = σ_i² + μ_i².
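As a quick numerical check (a sketch, not from the slides; the mean vector and covariance matrix below are arbitrary), the multinormal PDF formula above can be compared against SciPy's implementation:

import numpy as np
from scipy.stats import multivariate_normal

# Compare f(X) = (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-0.5 (X-mu)^T Sigma^{-1} (X-mu))
# with scipy.stats.multivariate_normal for hypothetical mu, Sigma, and X.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
X = np.array([0.5, -1.5])

n = len(mu)
diff = X - mu
manual = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5) \
         * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(np.isclose(manual, multivariate_normal(mean=mu, cov=Sigma).pdf(X)))  # True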
29. 1. Basic concepts
Let a and A be a scalar constant and a vector constant, respectively; we have:
E(aX + A) = aE(X) + A
V(aX + A) = a² V(X)
Given a set of random variables 𝒳 = {X_1, X_2, …, X_N} and N scalar constants c_i, we have:
E(Σ_{i=1}^{N} c_i X_i) = Σ_{i=1}^{N} c_i E(X_i)
V(Σ_{i=1}^{N} c_i X_i) = Σ_{i=1}^{N} c_i² V(X_i) + 2 Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} c_i c_j V(X_i, X_j)
where V(X_i, X_j) is the covariance of X_i and X_j. If all X_i are mutually independent, then
E(Σ_{i=1}^{N} c_i X_i) = Σ_{i=1}^{N} c_i E(X_i)
V(Σ_{i=1}^{N} c_i X_i) = Σ_{i=1}^{N} c_i² V(X_i)
30. 1. Basic concepts
Note, given a joint PDF f(X_1, X_2, …, X_N), two random variables X_i and X_j are mutually independent if f(X_i, X_j) = f(X_i) f(X_j), where f(X_i, X_j), f(X_i), and f(X_j) are defined as the aforementioned integrals of f(X_1, X_2, …, X_N). Therefore, if the X_i share a single PDF f(X) and each X_i is generated independently from it, then X_1, X_2, …, X_N are both mutually independent and identically distributed. If all X_i are identically distributed, which implies that every X_i has the same distribution (the same PDF) with the same parameter, then
E(Σ_{i=1}^{N} c_i X_i) = (Σ_{i=1}^{N} c_i) E(X)
V(Σ_{i=1}^{N} c_i X_i) = (Σ_{i=1}^{N} c_i²) V(X) + 2 Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} c_i c_j V(X_i, X_j)
Note, if all X_i are identically distributed, every X_i can be represented by the same random variable X. If all X_i are mutually independent and identically distributed (iid), then
E(Σ_{i=1}^{N} c_i X_i) = (Σ_{i=1}^{N} c_i) E(X)
V(Σ_{i=1}^{N} c_i X_i) = (Σ_{i=1}^{N} c_i²) V(X)
31. 2. Parameter estimation
Because the EM algorithm is essentially an advanced version of the maximum likelihood estimation (MLE) method, it is necessary to describe MLE briefly. Suppose random variable X conforms to a distribution specified by the PDF denoted f(X | Θ) with parameter Θ. For example, if X is a vector and follows the normal distribution then
f(X | Θ) = (2π)^{-n/2} |Σ|^{-1/2} exp(−(1/2) (X − μ)^T Σ^{-1} (X − μ))
where μ and Σ are the theoretical mean vector and covariance matrix, respectively, with note that Θ = (μ, Σ)^T. The notation |.| denotes the determinant of a given matrix and the notation Σ^{-1} denotes the inverse of matrix Σ. Note, Σ is invertible and symmetric. The parameter of the normal distribution is the theoretical mean and theoretical variance:
μ = E(X)
Σ = V(X) = E((X − μ)(X − μ)^T)
But parameters of different distributions can differ from such mean and variance. Anyhow, the theoretical mean and theoretical variance can be calculated based on parameter Θ. For example, suppose X = (x_1, x_2, …, x_n)^T follows the multinomial distribution of K trials; its PDF is:
f(X | Θ) = K! / (∏_{j=1}^{n} x_j!) · ∏_{j=1}^{n} p_j^{x_j}
where the x_j are integers and Θ = (p_1, p_2, …, p_n)^T is the set of probabilities such that Σ_{j=1}^{n} p_j = 1, Σ_{j=1}^{n} x_j = K, x_j ∈ {0, 1, …, K}. Note, x_j is the number of trials generating nominal value j. Obviously, the parameter Θ = (p_1, p_2, …, p_n)^T does not include the theoretical mean E(X) and theoretical variance V(X), but E(X) and V(X) of the multinomial distribution can be calculated based on Θ as follows:
E(x_j) = K p_j and V(x_j) = K p_j (1 − p_j)
32. 2. Parameter estimation
When random variable X is considered as an observation, a statistic denoted τ(X) is a function of X. For example, τ(X) = X, τ(X) = aX + A where a is a scalar constant and A is a vector constant, and τ(X) = XX^T are statistics of X. The statistic τ(X) can be a vector-by-vector function, for example, τ(X) = (X, XX^T)^T is a very popular statistic of X. In practice, if X is replaced by a sample 𝒳 = {X_1, X_2, …, X_N} including N observations X_i, a statistic is now a function of the X_i; for instance, the quantities X̄ and S defined below are statistics:
X̄ = (1/N) Σ_{i=1}^{N} X_i
S = (1/N) Σ_{i=1}^{N} (X_i − X̄)(X_i − X̄)^T = (1/N) Σ_{i=1}^{N} X_i X_i^T − X̄ X̄^T
For the multinormal distribution, X̄ and S are estimates of the theoretical mean μ and theoretical covariance matrix Σ. They are called the sample mean and sample variance, respectively. Essentially, X is a special case of 𝒳 when 𝒳 has only one observation, 𝒳 = {X}.
A statistic τ(X) is called a sufficient statistic if it has all and only the information needed to estimate parameter Θ. For example, a sufficient statistic of the normal PDF is τ(X) = (X, XX^T)^T. In fact, the parameter Θ = (μ, Σ)^T of the normal PDF, which includes the theoretical mean μ and theoretical covariance matrix Σ, is totally determined based on all and only X and XX^T (there is no redundant information in τ(X)), where X is an observation considered as a random variable, as follows:
μ = E(X) = ∫_X X f(X | Θ) dX
Σ = E((X − μ)(X − μ)^T) = E(XX^T) − μμ^T
Similarly, given X = (x_1, x_2, …, x_n)^T, a sufficient statistic of the multinomial PDF of K trials is τ(X) = (x_1, x_2, …, x_n)^T due to p_j = E(x_j)/K for all j = 1, …, n.
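A minimal sketch of the sample mean X̄ and sample variance S (with divisor N, as defined above) on a simulated multinormal sample; the true mean and covariance below are hypothetical values used only to generate data:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.multivariate_normal(mean=[1.0, -2.0],
                                 cov=[[2.0, 0.3], [0.3, 1.0]],
                                 size=1000)            # shape (N, n)

N = sample.shape[0]
x_bar = sample.mean(axis=0)                            # X̄ = (1/N) Σ X_i
S = (sample - x_bar).T @ (sample - x_bar) / N          # S = (1/N) Σ (X_i - X̄)(X_i - X̄)^T

print(x_bar)   # close to the true mean [1, -2]
print(S)       # close to the true covariance matrix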
33. 2. Parameter estimation
Given a sample containing observations, the purpose of point estimation is to estimate the unknown parameter Θ based on such a sample. The result of the estimation process is the estimate Θ̂ as an approximation of the unknown Θ. A formula to calculate Θ̂ based on the sample is called an estimator of Θ. As a convention, the estimator of Θ is denoted Θ̂(X) or Θ̂(𝒳), where X is an observation and 𝒳 is a sample including many observations. Actually, Θ̂(X) or Θ̂(𝒳) is the same as Θ̂, but the notation Θ̂(X) or Θ̂(𝒳) implies that Θ̂ is calculated based on observations. For example, given a sample 𝒳 = {X_1, X_2, …, X_N} including N iid observations X_i, the estimator of the theoretical mean μ of the normal distribution is:
μ̂ = μ̂(𝒳) = X̄ = (1/N) Σ_{i=1}^{N} X_i
As usual, the estimator of Θ is determined based on sufficient statistics, which in turn are functions of observations, where observations are considered as random variables. The estimation methods mentioned in this research are MLE, Maximum A Posteriori (MAP), and EM, in which MAP and EM are variants of MLE.
34. 2. Parameter estimation
According to the viewpoint of Bayesian statistics, the parameter Θ is a random variable and it conforms to some distribution. In some research, Θ represents a hypothesis. Equation 1.6 specifies Bayes' rule, in which f(Θ|ξ) is called the prior PDF (prior distribution) of Θ whereas f(Θ|X) is called the posterior PDF (posterior distribution) of Θ given observation X. Note, ξ is the parameter of the prior f(Θ|ξ), which is known as a second-level parameter. For instance, if the prior f(Θ|ξ) is a multinormal (multivariate normal) PDF, we have ξ = (μ_0, Σ_0)^T, which are the theoretical mean and theoretical covariance matrix of the random variable Θ. Because ξ is constant, the prior PDF f(Θ|ξ) can be denoted f(Θ). The posterior PDF f(Θ|X) ignores ξ because ξ is constant in f(Θ|X).
f(Θ|X) = f(X|Θ) f(Θ|ξ) / ∫_Θ f(X|Θ) f(Θ|ξ) dΘ   (1.6)
In Bayes' rule, the PDF f(X | Θ) is called the likelihood function. If the posterior PDF f(Θ|X) has the same form as the prior PDF f(Θ|ξ), such posterior PDF and prior PDF are called conjugate PDFs (conjugate distributions, conjugate probabilities) and f(Θ|ξ) is called a conjugate prior (Wikipedia, Conjugate prior, 2018) for the likelihood function f(X|Θ). Such a pair of f(Θ|ξ) and f(X|Θ) is called a conjugate pair. For example, if the prior PDF f(Θ|ξ) is a beta distribution and the likelihood function P(X|Θ) follows a binomial distribution then the posterior PDF f(Θ|X) is a beta distribution and hence f(Θ|ξ) and f(Θ|X) are conjugate distributions. In short, whether the posterior PDF and prior PDF are conjugate PDFs depends on the prior PDF and the likelihood function.
35. 2. Parameter estimation
There is a special conjugate pair where both the prior PDF f(Θ|ξ) and the likelihood function f(X|Θ) are multinormal, which results in the posterior PDF f(Θ|X) being multinormal as well. For instance, when X = (x_1, x_2, …, x_n)^T, the likelihood function f(X|Θ) is multinormal as follows:
f(X|Θ) = 𝒩(μ, Σ) = (2π)^{-n/2} |Σ|^{-1/2} exp(−(1/2) (X − μ)^T Σ^{-1} (X − μ))
where Θ = (μ, Σ)^T and μ = (μ_1, μ_2, …, μ_n)^T. Suppose only μ is a random variable which follows the multinormal distribution with parameter ξ = (μ_0, Σ_0)^T where μ_0 = (μ_01, μ_02, …, μ_0n)^T. The prior PDF f(Θ|ξ) is:
f(Θ|ξ) = f(μ|ξ) = 𝒩(μ_0, Σ_0) = (2π)^{-n/2} |Σ_0|^{-1/2} exp(−(1/2) (μ − μ_0)^T Σ_0^{-1} (μ − μ_0))
It is proved that the posterior PDF f(Θ|X) = f(μ|X) distributes normally with theoretical mean M_μ and covariance matrix Σ_μ as follows (Steorts, 2018, p. 13):
f(Θ|X) = f(μ|X) ∝ f(X|Θ) f(μ|ξ) ∝ 𝒩(M_μ, Σ_μ) = (2π)^{-n/2} |Σ_μ|^{-1/2} exp(−(1/2) (μ − M_μ)^T Σ_μ^{-1} (μ − M_μ))
The sign "∝" indicates proportionality, where (Steorts, 2018, p. 13):
M_μ = (Σ^{-1} + Σ_0^{-1})^{-1} (Σ_0^{-1} μ_0 + Σ^{-1} X)
Σ_μ = (Σ^{-1} + Σ_0^{-1})^{-1}
36. 2. Parameter estimation
When X is evaluated as an observation, let Θ̂ be the estimate of Θ. It is calculated as a maximizer of the posterior PDF f(Θ|X) given X. Here the data sample 𝒳 has only one observation X, 𝒳 = {X}; in other words, X is a special case of 𝒳 here.
Θ̂ = argmax_Θ f(Θ|X) = argmax_Θ [ f(X|Θ) f(Θ|ξ) / ∫_Θ f(X|Θ) f(Θ|ξ) dΘ ]
Because the prior PDF f(Θ|ξ) is assumed to be fixed (constant with regard to Θ) and the denominator ∫_Θ f(X|Θ) f(Θ|ξ) dΘ is constant with regard to Θ, we have:
Θ̂ = argmax_Θ f(Θ|X) = argmax_Θ f(X|Θ)
Obviously, the MLE method determines Θ̂ as a maximizer of the likelihood function f(X | Θ) with regard to Θ when X is evaluated as an observation. It is interesting that the likelihood function f(X|Θ) is the PDF of X with parameter Θ. For convenience, MLE maximizes the natural logarithm of the likelihood function, denoted l(Θ), instead of maximizing the likelihood function itself.
Θ̂ = argmax_Θ l(Θ) = argmax_Θ log(f(X|Θ))   (1.7)
where l(Θ) = log(f(X | Θ)) is called the log-likelihood function of Θ. Recall that equation 1.7 implies an optimization problem. Note, l(Θ) is a function of Θ if X is evaluated as an observation.
l(Θ) = l(Θ|X) = log(f(X|Θ))   (1.8)
37. 2. Parameter estimation
Equation 1.7 is the simple result of MLE for estimating a parameter based on an observed sample. The notation l(Θ|X) implies that l(Θ) is determined based on X. If the log-likelihood function l(Θ) is a first-order smooth function then, from equation 1.7, the estimate Θ̂ can be the solution of the equation created by setting the first-order derivative of l(Θ) regarding Θ to zero, Dl(Θ) = 0^T. If solving such an equation is too complex or impossible, some popular methods to solve the optimization problem are Newton-Raphson (Burden & Faires, 2011, pp. 67-71), gradient descent (Ta, 2014), and Lagrange duality (Wikipedia, Karush–Kuhn–Tucker conditions, 2014). Note, solving the equation Dl(Θ) = 0^T may be incorrect in some cases; for instance, in theory, a Θ such that Dl(Θ) = 0^T may be a saddle point (not a maximizer).
For example, suppose X = (x_1, x_2, …, x_n)^T is a vector and follows the multinormal distribution,
f(X|Θ) = (2π)^{-n/2} |Σ|^{-1/2} exp(−(1/2) (X − μ)^T Σ^{-1} (X − μ))
Then the log-likelihood function is
l(Θ) = −(n/2) log(2π) − (1/2) log|Σ| − (1/2) (X − μ)^T Σ^{-1} (X − μ)
where μ and Σ are the mean vector and covariance matrix of f(X | Θ), respectively, with note that Θ = (μ, Σ)^T. From equation 1.7, the estimate Θ̂ = (μ̂, Σ̂)^T is the solution of the equation created by setting the first-order derivative of l(Θ) regarding μ and Σ to zero.
38. 2. Parameter estimation
The first-order partial derivative of l(Θ) with respect to μ is (Nguyen, 2015, p. 35):
∂l(Θ)/∂μ = (X − μ)^T Σ^{-1}
Setting this partial derivative to zero, we obtain:
(X − μ)^T Σ^{-1} = 0^T ⇒ X − μ = 0 ⇒ μ̂ = X
The first-order partial derivative of l(Θ) with respect to Σ is:
∂l(Θ)/∂Σ = −(1/2) Σ^{-1} + (1/2) Σ^{-1} (X − μ)(X − μ)^T Σ^{-1}
This is due to ∂log|Σ|/∂Σ = Σ^{-1} and ∂[(X − μ)^T Σ^{-1} (X − μ)]/∂Σ = ∂tr[(X − μ)(X − μ)^T Σ^{-1}]/∂Σ, because Bilmes (Bilmes, 1998, p. 5) mentioned:
(X − μ)^T Σ^{-1} (X − μ) = tr[(X − μ)(X − μ)^T Σ^{-1}]
where tr(A) is the trace operator which takes the sum of the diagonal elements of a square matrix, tr(A) = Σ_i a_ii. This implies (Nguyen, 2015, p. 45):
∂[(X − μ)^T Σ^{-1} (X − μ)]/∂Σ = ∂tr[(X − μ)(X − μ)^T Σ^{-1}]/∂Σ = −Σ^{-1} (X − μ)(X − μ)^T Σ^{-1}
Substituting the estimate μ̂ into the first-order partial derivative of l(Θ) with respect to Σ, we have:
∂l(Θ)/∂Σ = −(1/2) Σ^{-1} + (1/2) Σ^{-1} (X − μ̂)(X − μ̂)^T Σ^{-1}
The estimate Σ̂ is the solution of the equation formed by setting this partial derivative ∂l(Θ)/∂Σ to the zero matrix (0):
∂l(Θ)/∂Σ = (0) ⇒ Σ̂ = (X − μ̂)(X − μ̂)^T
Finally, MLE gives the estimate Θ̂ for the normal distribution given observation X as follows: Θ̂ = (μ̂ = X, Σ̂ = (X − μ̂)(X − μ̂)^T)^T
39. 2. Parameter estimation
When μ̂ = X then Σ̂ = (0), which implies that the estimate Σ̂ of the covariance matrix is arbitrary with the constraint that it is symmetric and invertible. This is reasonable because the sample is too small, with only one observation X. When X is replaced by a sample 𝒳 = {X_1, X_2, …, X_N} in which all X_i are mutually independent and identically distributed (iid), it is easy to draw the following result in a similar way, with equation 1.11 mentioned later:
μ̂ = X̄ = (1/N) Σ_{i=1}^{N} X_i
Σ̂ = S = (1/N) Σ_{i=1}^{N} (X_i − μ̂)(X_i − μ̂)^T = (1/N) Σ_{i=1}^{N} X_i X_i^T − μ̂ μ̂^T
Here, μ̂ and Σ̂ are the sample mean and sample variance ■
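A small sketch (not from the slides) that cross-checks the closed-form MLE solution: for a simulated univariate normal sample, numerically maximizing the log-likelihood should reproduce μ̂ = X̄ and σ̂² = (1/N) Σ (X_i − X̄)². The true parameters below are hypothetical and only used to generate data:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # sigma parameterized on the log scale for stability
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(mu_hat, data.mean())          # both approximately the sample mean
print(sigma_hat ** 2, data.var())   # both approximately the divisor-N sample variance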
40. 2. Parameter estimation
In practice and in general, X is observed as N particular observations X_1, X_2, …, X_N. Let 𝒳 = {X_1, X_2, …, X_N} be the observed sample of size N in which all X_i are iid. The Bayes' rule specified by equation 1.6 is re-written as follows:
f(Θ|𝒳) = f(𝒳|Θ) f(Θ|ξ) / ∫_Θ f(𝒳|Θ) f(Θ|ξ) dΘ
However, the meaning of Bayes' rule does not change. Because all X_i are iid, the likelihood function becomes the product of partial likelihood functions as follows:
f(𝒳|Θ) = ∏_{i=1}^{N} f(X_i|Θ)   (1.9)
The log-likelihood function of Θ becomes:
l(Θ) = l(Θ|𝒳) = log(f(𝒳|Θ)) = log(∏_{i=1}^{N} f(X_i|Θ)) = Σ_{i=1}^{N} log(f(X_i|Θ))   (1.10)
The notation l(Θ|𝒳) implies that l(Θ) is determined based on 𝒳. We have:
Θ̂ = argmax_Θ l(Θ) = argmax_Θ Σ_{i=1}^{N} log(f(X_i|Θ))   (1.11)
Equation 1.11 is the main result of MLE for estimating a parameter based on an observed sample.
41. 2. Parameter estimation
For example, suppose each X_i = (x_i1, x_i2, …, x_in)^T is a vector and follows the multinomial distribution of K trials,
f(X_i|Θ) = K! / (∏_{j=1}^{n} x_ij!) · ∏_{j=1}^{n} p_j^{x_ij}
where the x_ij are integers and Θ = (p_1, p_2, …, p_n)^T is the set of probabilities such that Σ_{j=1}^{n} p_j = 1, Σ_{j=1}^{n} x_ij = K, x_ij ∈ {0, 1, …, K}. Note, x_ij is the number of trials generating nominal value j. Given the sample 𝒳 = {X_1, X_2, …, X_N} in which all X_i are iid, according to equation 1.10 the log-likelihood function is
l(Θ) = Σ_{i=1}^{N} [ log(K!) − Σ_{j=1}^{n} log(x_ij!) + Σ_{j=1}^{n} x_ij log(p_j) ]
Because there is the constraint Σ_{j=1}^{n} p_j = 1, we use the Lagrange duality method to maximize l(Θ). The Lagrange function la(Θ, λ) is the sum of l(Θ) and the constraint Σ_{j=1}^{n} p_j = 1 as follows:
la(Θ, λ) = l(Θ) + λ(1 − Σ_{j=1}^{n} p_j) = Σ_{i=1}^{N} [ log(K!) − Σ_{j=1}^{n} log(x_ij!) + Σ_{j=1}^{n} x_ij log(p_j) ] + λ(1 − Σ_{j=1}^{n} p_j)
42. 2. Parameter estimation
Of course, la(Θ, λ) is a function of Θ and λ, where λ is called a Lagrange multiplier. Because the multinomial PDF is smooth enough, the estimate Θ̂ = (p̂_1, p̂_2, …, p̂_n)^T is the solution of the equation created by setting the first-order derivative of la(Θ, λ) regarding p_j and λ to zero. The first-order partial derivative of la(Θ, λ) with respect to p_j is:
∂la(Θ, λ)/∂p_j = Σ_{i=1}^{N} x_ij / p_j − λ
Setting this partial derivative to zero, we obtain the following equation: Σ_{i=1}^{N} x_ij / p_j − λ = 0 ⇒ Σ_{i=1}^{N} x_ij − λ p_j = 0. Summing this equation over the n variables p_j, we obtain: Σ_{j=1}^{n} (Σ_{i=1}^{N} x_ij − λ p_j) = Σ_{i=1}^{N} Σ_{j=1}^{n} x_ij − λ Σ_{j=1}^{n} p_j = 0. Due to Σ_{j=1}^{n} p_j = 1 and Σ_{j=1}^{n} x_ij = K, we have
KN − λ = 0 ⇒ λ = KN
Substituting λ = KN into the equation Σ_{i=1}^{N} x_ij − λ p_j = 0, we get the estimate Θ̂ = (p̂_1, p̂_2, …, p̂_n)^T as follows:
p̂_j = Σ_{i=1}^{N} x_ij / (KN) ∎
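A quick numerical sketch (not from the slides; the true probabilities below are hypothetical) showing that the closed-form estimate p̂_j = Σ_i x_ij / (KN) recovers the generating probabilities on simulated multinomial data:

import numpy as np

rng = np.random.default_rng(2)
p_true = np.array([0.5, 0.3, 0.2])
K, N = 10, 2000                                   # K trials per observation, N observations
X = rng.multinomial(K, p_true, size=N)            # shape (N, n), each row sums to K

p_hat = X.sum(axis=0) / (K * N)                   # p̂_j = Σ_i x_ij / (K N)
print(p_hat)                                      # close to p_true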
43. 2. Parameter estimation
The quality of estimation is measured by the mean and variance of the estimate Θ̂. The mean of Θ̂ is:
E(Θ̂) = ∫_X Θ̂(X) f(X|Θ) dX   (1.12)
The notation Θ̂(X) implies the formula to calculate Θ̂, which results from MLE, MAP, or EM. Hence, Θ̂(X) is considered as a function of X in the integral ∫_X Θ̂(X) f(X|Θ) dX. The estimate Θ̂ is unbiased if E(Θ̂) = Θ. Otherwise, if E(Θ̂) ≠ Θ then Θ̂ is a biased estimate. As usual, an unbiased estimate is better than a biased estimate. The condition E(Θ̂) = Θ is the criterion to check whether an estimate is unbiased, which applies to all estimation methods. The variance of Θ̂ is:
V(Θ̂) = ∫_X (Θ̂(X) − E(Θ̂))(Θ̂(X) − E(Θ̂))^T f(X|Θ) dX   (1.13)
The smaller the variance V(Θ̂), the better the estimate Θ̂.
44. 2. Parameter estimation
For example, given the multinormal distribution and given a sample 𝒳 = {X_1, X_2, …, X_N} where all X_i are iid, the estimate Θ̂ = (μ̂, Σ̂)^T from MLE is:
μ̂ = X̄ = (1/N) Σ_{i=1}^{N} X_i
Σ̂ = (1/N) Σ_{i=1}^{N} (X_i − μ̂)(X_i − μ̂)^T
Due to:
E(μ̂) = E((1/N) Σ_{i=1}^{N} X_i) = (1/N) Σ_{i=1}^{N} E(X_i) = (1/N) Σ_{i=1}^{N} E(X) = μ
then μ̂ is an unbiased estimate. We also have:
E(Σ̂) = E((1/N) Σ_{i=1}^{N} (X_i − μ̂)(X_i − μ̂)^T) = (Σ + μμ^T) − (V(μ̂) + μμ^T) = Σ − V(μ̂)
It is necessary to calculate the variance V(μ̂); we have:
V(μ̂) = V((1/N) Σ_{i=1}^{N} X_i) = (1/N²) Σ_{i=1}^{N} V(X_i) = (1/N²) Σ_{i=1}^{N} V(X) = (1/N) V(X) = (1/N) Σ
Therefore, we have:
E(Σ̂) = Σ − (1/N) Σ = ((N − 1)/N) Σ
Hence, we conclude that Σ̂ is a biased estimate because E(Σ̂) ≠ Σ.
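A Monte Carlo sketch (not from the slides; sample size and true variance are hypothetical) illustrating the bias E(Σ̂) = ((N − 1)/N) Σ for the univariate case:

import numpy as np

rng = np.random.default_rng(3)
sigma2_true, N, trials = 4.0, 5, 100_000

mle_estimates = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2_true), size=N)
    mle_estimates[t] = ((x - x.mean()) ** 2).mean()   # MLE variance estimate, divisor N

print(mle_estimates.mean())                   # about (N-1)/N * sigma2_true = 3.2
print(mle_estimates.mean() * N / (N - 1))     # about sigma2_true = 4.0 (bias removed)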
45. 2. Parameter estimation
Without loss of generality, suppose parameter Θ is a vector; the second-order derivative of the log-likelihood function l(Θ) is called the likelihood Hessian matrix (Zivot, 2009, p. 7), denoted S(Θ).
S(Θ) = S(Θ|X) = D²l(Θ|X)   (1.14)
Suppose Θ = (θ_1, θ_2, …, θ_r)^T where there are r partial parameters θ_k; equation 1.14 is expanded as follows:
D²l(Θ|X) = d²l(Θ|X)/dΘ² =
[ ∂²l(Θ|X)/∂θ_1²       ∂²l(Θ|X)/∂θ_1∂θ_2    ⋯  ∂²l(Θ|X)/∂θ_1∂θ_r ]
[ ∂²l(Θ|X)/∂θ_2∂θ_1    ∂²l(Θ|X)/∂θ_2²       ⋯  ∂²l(Θ|X)/∂θ_2∂θ_r ]
[ ⋮                    ⋮                    ⋱  ⋮                 ]
[ ∂²l(Θ|X)/∂θ_r∂θ_1    ∂²l(Θ|X)/∂θ_r∂θ_2    ⋯  ∂²l(Θ|X)/∂θ_r²    ]
where ∂²l(Θ|X)/(∂θ_i∂θ_j) = ∂/∂θ_i (∂l(Θ|X)/∂θ_j) and ∂²l(Θ|X)/∂θ_i² = ∂²l(Θ|X)/(∂θ_i∂θ_i). The notation S(Θ|X) implies S(Θ) is calculated based on X. If the sample 𝒳 = {X_1, X_2, …, X_N} replaces X then:
S(Θ) = S(Θ|𝒳) = D²l(Θ|𝒳)   (1.15)
46. 2. Parameter estimation
The negative expectation of the likelihood Hessian matrix is called the information matrix or Fisher information matrix, denoted I(Θ). Please distinguish the information matrix I(Θ) from the identity matrix I.
I(Θ) = −E(S(Θ))   (1.16)
If S(Θ) is calculated by equation 1.14 with observation X, then I(Θ) becomes:
I(Θ) = I(Θ|X) = −E(S(Θ|X)) = −∫_X D²l(Θ|X) f(X|Θ) dX   (1.17)
The notation I(Θ|X) implies I(Θ) is calculated based on X. Note, D²l(Θ|X) is considered as a function of X in the integral ∫_X D²l(Θ|X) f(X|Θ) dX. If S(Θ) is calculated by equation 1.15 with the observation sample 𝒳 = {X_1, X_2, …, X_N}, in which all X_i are iid, then I(Θ) becomes:
I(Θ) = I(Θ|𝒳) = −E(S(Θ|𝒳)) = N · I(Θ|X) = −N ∫_X D²l(Θ|X) f(X|Θ) dX   (1.18)
Note, D²l(Θ|X) is considered as a function of X in the integral ∫_X D²l(Θ|X) f(X|Θ) dX.
47. 2. Parameter estimation
For the MLE method, the inverse of the information matrix is called the Cramer-Rao lower bound, denoted $CR(\hat{\Theta})$.
$$CR(\hat{\Theta}) = I(\Theta)^{-1} \quad (1.19)$$
The covariance matrix of any MLE estimate $\hat{\Theta}$ has this Cramer-Rao lower bound, and the lower bound is attained by $V(\hat{\Theta})$ if and only if $\hat{\Theta}$ is unbiased (Zivot, 2009, p. 11):
$$V(\hat{\Theta}) \geq CR(\hat{\Theta}) \text{ if } \hat{\Theta} \text{ is biased}, \qquad V(\hat{\Theta}) = CR(\hat{\Theta}) \text{ if } \hat{\Theta} \text{ is unbiased} \quad (1.20)$$
Note, equation 1.19 and equation 1.20 are only valid for the MLE method. The sign "$\geq$" implies a lower bound; in other words, the Cramer-Rao lower bound is the variance of the optimal MLE estimate. Moreover, besides the criterion $E(\hat{\Theta}) = \Theta$, equation 1.20 can be used as another criterion to check whether an estimate is unbiased. However, the criterion $E(\hat{\Theta}) = \Theta$ applies to all estimation methods, whereas equation 1.20 applies only to MLE.
48. 2. Parameter estimation
Suppose $\Theta = (\theta_1, \theta_2, \dots, \theta_r)^T$ where there are $r$ partial parameters $\theta_k$, so the estimate is $\hat{\Theta} = (\hat{\theta}_1, \hat{\theta}_2, \dots, \hat{\theta}_r)^T$. Each element on the diagonal of the Cramer-Rao lower bound is the lower bound of the variance of one $\hat{\theta}_k$, denoted $V(\hat{\theta}_k)$. Let $CR(\hat{\theta}_k)$ be the lower bound of $V(\hat{\theta}_k)$; of course we have:
$$V(\hat{\theta}_k) \geq CR(\hat{\theta}_k) \text{ if } \hat{\theta}_k \text{ is biased}, \qquad V(\hat{\theta}_k) = CR(\hat{\theta}_k) \text{ if } \hat{\theta}_k \text{ is unbiased} \quad (1.21)$$
Derived from equation 1.18 and equation 1.19, $CR(\hat{\theta}_k)$ is specified by equation 1.22.
$$I(\theta_k) = -N \cdot E\left(\frac{\partial^2 l(\Theta|X)}{\partial\theta_k^2}\right) = -N\int_X \frac{\partial^2 l(\Theta|X)}{\partial\theta_k^2}\, f(X|\Theta)\,\mathrm{d}X$$
$$CR(\hat{\theta}_k) = I(\theta_k)^{-1} = -\frac{1}{N}\left(\int_X \frac{\partial^2 l(\Theta|X)}{\partial\theta_k^2}\, f(X|\Theta)\,\mathrm{d}X\right)^{-1} \quad (1.22)$$
Where $N$ is the size of the sample $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ in which all $X_i$ are iid. Of course, $I(\theta_k)$ is the information matrix of $\theta_k$; if $\theta_k$ is univariate, $I(\theta_k)$ is a scalar, which is called the information value.
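For instance, equation 1.22 can be checked numerically for a scalar parameter. The sketch below is illustrative only (the exponential distribution with rate $\lambda$ and the chosen $N$ are assumptions): it computes the information value $I(\lambda)$ by numerical integration of $-N\,\partial^2 l/\partial\lambda^2\, f(x|\lambda)$ and compares $CR(\hat{\lambda}) = I(\lambda)^{-1}$ with the known value $\lambda^2/N$.

```python
import numpy as np
from scipy.integrate import quad

# Numeric check of equation 1.22 (illustrative, not from the slides).
# Model: exponential distribution f(x|lam) = lam * exp(-lam * x),
# log-likelihood l(lam|x) = log(lam) - lam*x, so d2l/dlam2 = -1/lam^2.
lam, N = 2.0, 50

def integrand(x):
    d2l = -1.0 / lam ** 2                      # second derivative (constant in x here)
    return d2l * lam * np.exp(-lam * x)        # times f(x | lam)

integral, _ = quad(integrand, 0, np.inf)
I_lam = -N * integral                          # information value I(lambda)
CR = 1.0 / I_lam                               # Cramer-Rao lower bound
print("I(lambda) =", I_lam, " exact:", N / lam ** 2)
print("CR        =", CR, " exact:", lam ** 2 / N)
```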
49. 2. Parameter estimation
For example, let $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ be the observed sample of size $N$, with all $X_i$ iid, given the multinormal PDF:
$$f(X|\Theta) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(X-\mu)^T\Sigma^{-1}(X-\mu)\right)$$
Where $n$ is the dimension of vector $X$ and $\Theta = (\mu, \Sigma)^T$, with $\mu$ the theoretical mean vector and $\Sigma$ the theoretical covariance matrix. Note, $\Sigma$ is invertible and symmetric. From the previous example, the MLE estimate $\hat{\Theta} = (\hat{\mu}, \hat{\Sigma})^T$ given $\mathcal{X}$ is:
$$\hat{\mu} = \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (X_i - \hat{\mu})(X_i - \hat{\mu})^T$$
The mean and variance of $\hat{\mu}$ from the previous example are $E(\hat{\mu}) = \mu$ and $V(\hat{\mu}) = \frac{1}{N}\Sigma$. We knew that $\hat{\mu}$ is an unbiased estimate according to the criterion $E(\hat{\mu}) = \mu$. Now we check again whether $\hat{\mu}$ is unbiased using equation 1.21 with the Cramer-Rao lower bound.
50. 2. Parameter estimation
Hence, we first calculate the lower bound $CR(\hat{\mu})$ and then compare it with the variance $V(\hat{\mu})$. In fact, according to equation 1.8, the log-likelihood function is:
$$l(\Theta|X) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(X-\mu)^T\Sigma^{-1}(X-\mu)$$
The partial first-order derivative of $l(\Theta|X)$ with regard to $\mu$ is (Nguyen, 2015, p. 35):
$$\frac{\partial l(\Theta|X)}{\partial\mu} = (X-\mu)^T\Sigma^{-1}$$
The partial second-order derivative of $l(\Theta|X)$ with regard to $\mu$ is (Nguyen, 2015, p. 36):
$$\frac{\partial^2 l(\Theta|X)}{\partial\mu^2} = \frac{\partial}{\partial\mu}\left(\frac{\partial l(\Theta|X)}{\partial\mu}\right) = \frac{\partial}{\partial\mu}\big((X-\mu)^T\Sigma^{-1}\big) = -\big(\Sigma^{-1}\big)^T = -\Sigma^{-1}$$
According to equation 1.22, the lower bound $CR(\hat{\mu})$ is:
$$CR(\hat{\mu}) = -\frac{1}{N}\left(\int_X \frac{\partial^2 l(\Theta|X)}{\partial\mu^2}\, f(X|\Theta)\,\mathrm{d}X\right)^{-1} = \frac{1}{N}\left(\int_X \Sigma^{-1} f(X|\Theta)\,\mathrm{d}X\right)^{-1} = \frac{1}{N}\left(\Sigma^{-1}\int_X f(X|\Theta)\,\mathrm{d}X\right)^{-1} = \frac{1}{N}\Sigma = V(\hat{\mu})$$
Due to $V(\hat{\mu}) = CR(\hat{\mu})$, $\hat{\mu}$ is an unbiased estimate according to the criterion specified by equation 1.21.
51. 2. Parameter estimation
The mean of $\hat{\Sigma}$ from the previous example is:
$$E(\hat{\Sigma}) = \frac{N-1}{N}\Sigma$$
We knew that $\hat{\Sigma}$ is a biased estimate because $E(\hat{\Sigma}) \neq \Sigma$. Now we check again whether $\hat{\Sigma}$ is biased using the Cramer-Rao lower bound. The partial first-order derivative of $l(\Theta|X)$ with regard to $\Sigma$ is:
$$\frac{\partial l(\Theta|X)}{\partial\Sigma} = -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(X-\mu)(X-\mu)^T\Sigma^{-1}$$
According to equation 1.22, the lower bound $CR(\hat{\Sigma})$ is:
$$CR(\hat{\Sigma}) = -\frac{1}{N}\left(\int_X \frac{\partial^2 l(\Theta|X)}{\partial\Sigma^2}\, f(X|\Theta)\,\mathrm{d}X\right)^{-1} = \dots = -\frac{1}{N}\left(\frac{\partial}{\partial\Sigma}(\mathbf{0})\right)^{-1}$$
Where $(\mathbf{0})$ is the zero matrix. This implies that the lower bound $CR(\hat{\Sigma})$ does not exist. Hence, $\hat{\Sigma}$ is a biased estimate. Indeed, MLE provides no unbiased estimate of the variance for the normal distribution.
52. 2. Parameter estimation
MLE ignores the prior PDF $f(\Theta|\xi)$ because $f(\Theta|\xi)$ is assumed to be fixed, but the Maximum A Posteriori (MAP) method (Wikipedia, Maximum a posteriori estimation, 2017) involves $f(\Theta|\xi)$ in the maximization task. Because $\int_\Theta f(X|\Theta) f(\Theta|\xi)\,\mathrm{d}\Theta$ is constant with regard to $\Theta$, we have:
$$\hat{\Theta} = \underset{\Theta}{\mathrm{argmax}}\, f(\Theta|X) = \underset{\Theta}{\mathrm{argmax}}\, \frac{f(X|\Theta)\, f(\Theta|\xi)}{\int_\Theta f(X|\Theta)\, f(\Theta|\xi)\,\mathrm{d}\Theta} = \underset{\Theta}{\mathrm{argmax}}\, f(X|\Theta)\, f(\Theta|\xi)$$
Let $f(X, \Theta|\xi)$ be the joint PDF of $X$ and $\Theta$, where $\Theta$ is also a random variable. Note, $\xi$ is the parameter of the prior PDF $f(\Theta|\xi)$. The likelihood function in MAP is this joint PDF $f(X, \Theta|\xi)$.
$$f(X, \Theta|\xi) = f(X|\Theta)\, f(\Theta|\xi) \quad (1.23)$$
The theoretical mean and variance of $X$ are based on the joint PDF $f(X, \Theta|\xi)$ as follows:
$$E(X) = \int_X \int_\Theta X\, f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta \quad (1.24)$$
$$V(X) = \int_X \int_\Theta \big(X - E(X)\big)\big(X - E(X)\big)^T f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta \quad (1.25)$$
The theoretical mean and variance of $\Theta$ are based on $f(\Theta|\xi)$, because $f(\Theta|\xi)$ is a function of only $\Theta$ when $\xi$ is constant.
$$E(\Theta) = \int_X \int_\Theta \Theta\, f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta = \int_\Theta \Theta\, f(\Theta|\xi)\,\mathrm{d}\Theta \quad (1.26)$$
$$V(\Theta) = \int_\Theta \big(\Theta - E(\Theta)\big)\big(\Theta - E(\Theta)\big)^T f(\Theta|\xi)\,\mathrm{d}\Theta = E(\Theta\Theta^T|\xi) - E(\Theta|\xi)\,E(\Theta^T|\xi) \quad (1.27)$$
53. 2. Parameter estimation
In general, statistics of $\Theta$ are still based on $f(\Theta|\xi)$. Given a sample $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ in which all $X_i$ are iid, the likelihood function becomes:
$$f(\mathcal{X}, \Theta|\xi) = \prod_{i=1}^{N} f(X_i, \Theta|\xi) \quad (1.28)$$
The log-likelihood function $\ell(\Theta)$ in MAP is re-defined with observation $X$ or sample $\mathcal{X}$ as follows:
$$\ell(\Theta) = \log f(X, \Theta|\xi) = l(\Theta) + \log f(\Theta|\xi) \quad (1.29)$$
$$\ell(\Theta) = \log f(\mathcal{X}, \Theta|\xi) = l(\Theta) + \log f(\Theta|\xi) \quad (1.30)$$
Where $l(\Theta)$ is specified by equation 1.8 with observation $X$ or by equation 1.10 with sample $\mathcal{X}$. Therefore, the estimate $\hat{\Theta}$ is determined according to MAP as follows:
$$\hat{\Theta} = \underset{\Theta}{\mathrm{argmax}}\, \ell(\Theta) = \underset{\Theta}{\mathrm{argmax}}\, \big(l(\Theta) + \log f(\Theta|\xi)\big) \quad (1.31)$$
Good information provided by the prior $f(\Theta|\xi)$ can improve the quality of estimation. Essentially, MAP is an improved variant of MLE. Later on, we will also see that the EM algorithm is a variant of MLE; all of them aim to maximize log-likelihood functions.
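As a concrete illustration of equation 1.31 (my own sketch, not the slides' code; the univariate normal model with known variance, the normal prior, and all numeric values are assumptions), the script below maximizes $l(\theta) + \log f(\theta|\xi)$ numerically and compares the result with the closed-form posterior mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative MAP estimation for a univariate normal mean (assumed setup):
# X_i ~ N(theta, sigma^2) with sigma known, prior theta ~ N(mu0, tau^2).
rng = np.random.default_rng(3)
sigma, mu0, tau = 1.0, 0.0, 0.5
X = rng.normal(2.0, sigma, size=20)

def neg_map_objective(theta):
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (X - theta)**2 / (2 * sigma**2))
    log_prior = -0.5 * np.log(2 * np.pi * tau**2) - (theta - mu0)**2 / (2 * tau**2)
    return -(log_lik + log_prior)          # negative of the objective in equation 1.31

theta_map = minimize_scalar(neg_map_objective).x
closed_form = (mu0 / tau**2 + X.sum() / sigma**2) / (1 / tau**2 + len(X) / sigma**2)
print("numerical MAP:", theta_map, " closed form:", closed_form)
```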
54. 2. Parameter estimation
The likelihood Hessian matrix $S(\Theta)$, the information matrix $I(\Theta)$, and the Cramer-Rao lower bounds $CR(\hat{\Theta})$, $CR(\hat{\theta}_k)$ are extended in MAP with the new log-likelihood function $\ell(\Theta)$.
$$S(\Theta) = D^2\ell(\Theta), \qquad I(\Theta) = -E\big(S(\Theta)\big), \qquad CR(\hat{\Theta}) = I(\Theta)^{-1}$$
$$I(\theta_k) = -N \cdot E\left(\frac{\partial^2\ell(\Theta)}{\partial\theta_k^2}\right) = -N\int_X\int_\Theta \frac{\partial^2\ell(\Theta)}{\partial\theta_k^2}\, f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta, \qquad CR(\hat{\theta}_k) = I(\theta_k)^{-1}$$
Where $N$ is the size of the sample $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ in which all $X_i$ are iid. The mean and variance of the estimate $\hat{\Theta}$, which are used to measure estimation quality, are unchanged except that the joint PDF $f(X, \Theta|\xi)$ is used instead.
$$E(\hat{\Theta}) = \int_X\int_\Theta \hat{\Theta}(X, \Theta)\, f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta \quad (1.32)$$
$$V(\hat{\Theta}) = \int_X\int_\Theta \big(\hat{\Theta}(X, \Theta) - E(\hat{\Theta})\big)\big(\hat{\Theta}(X, \Theta) - E(\hat{\Theta})\big)^T f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta \quad (1.33)$$
As an extension for MAP, recall that there are two criteria to check whether $\hat{\Theta}$ is an unbiased estimate:
$$E(\hat{\Theta}) = \Theta, \qquad V(\hat{\Theta}) = CR(\hat{\Theta})$$
55. 2. Parameter estimation
It is necessary to have an example of parameter estimation with MAP. Given a sample $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ in which all $X_i$ are iid, each $n$-dimensional $X_i$ has the following multinormal PDF:
$$f(X_i|\Theta) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(X_i-\mu)^T\Sigma^{-1}(X_i-\mu)\right)$$
Where $\mu$ and $\Sigma$ are the mean vector and covariance matrix of $f(X|\Theta)$, respectively, with $\Theta = (\mu, \Sigma)^T$. In $\Theta = (\mu, \Sigma)^T$, suppose only $\mu$ is normally distributed with parameter $\xi = (\mu_0, \Sigma_0)$, where $\mu_0$ and $\Sigma_0$ are the theoretical mean and covariance matrix of $\mu$. Thus, $\Sigma$ is a variable but not a random variable, and the second-level parameter $\xi$ is constant. The prior PDF $f(\Theta|\xi)$ becomes $f(\mu|\xi)$, which is specified as follows:
$$f(\Theta|\xi) = f(\mu|\mu_0, \Sigma_0) = (2\pi)^{-\frac{n}{2}} |\Sigma_0|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(\mu-\mu_0)^T\Sigma_0^{-1}(\mu-\mu_0)\right)$$
Note, $\mu_0$ is an $n$-element vector like $\mu$, and $\Sigma_0$ is an $n \times n$ matrix like $\Sigma$; of course, $\Sigma_0$ is also invertible and symmetric. The theoretical mean and theoretical variance of $X$ are:
$$E(X) = \int_X\int_\Theta X\, f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta = \dots = \int_\mu \mu\, f(\mu|\mu_0, \Sigma_0)\,\mathrm{d}\mu = E(\mu) = \mu_0$$
$$V(X) = \int_X\int_\Theta \big(X - E(X)\big)\big(X - E(X)\big)^T f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta = \dots = \int_\mu \Sigma\, f(\mu|\mu_0, \Sigma_0)\,\mathrm{d}\mu = \Sigma$$
56. 2. Parameter estimation
The log-likelihood function in MAP is:
$$\ell(\Theta) = \log f(\mu|\xi) + l(\Theta) = \log f(\mu|\xi) + \sum_{i=1}^{N}\log f(X_i|\Theta)$$
$$= -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_0| - \frac{1}{2}(\mu-\mu_0)^T\Sigma_0^{-1}(\mu-\mu_0) + \sum_{i=1}^{N}\left(-\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(X_i-\mu)^T\Sigma^{-1}(X_i-\mu)\right)$$
Because the normal PDF is smooth enough, from equation 1.31 the estimate $\hat{\Theta} = (\hat{\mu}, \hat{\Sigma})^T$ is the solution of the equations created by setting the first-order derivatives of $\ell(\Theta)$ with regard to $\mu$ and $\Sigma$ to zero. The first-order partial derivative of $\ell(\Theta)$ with respect to $\mu$ is:
$$\frac{\partial\ell(\Theta)}{\partial\mu} = -\mu^T\big(\Sigma_0^{-1} + N\Sigma^{-1}\big) + \mu_0^T\Sigma_0^{-1} + \sum_{i=1}^{N} X_i^T\Sigma^{-1}$$
Setting this partial derivative to zero, we obtain:
$$\mu = \big(\Sigma\Sigma_0^{-1} + NI\big)^{-1}\big(\Sigma\Sigma_0^{-1}\mu_0 + N\bar{X}\big)$$
Where $I$ is the identity matrix and $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$.
57. 2. Parameter estimation
The first-order partial derivative of $\ell(\Theta)$ with respect to $\Sigma$ is:
$$\frac{\partial\ell(\Theta)}{\partial\Sigma} = \sum_{i=1}^{N}\left(-\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1}\right)$$
Setting this partial derivative to zero, we obtain:
$$\Sigma = \frac{1}{N}\sum_{i=1}^{N} X_iX_i^T - \frac{2}{N}\mu\sum_{i=1}^{N} X_i^T + \mu\mu^T = \frac{1}{N}\sum_{i=1}^{N} X_iX_i^T - 2\mu\bar{X}^T + \mu\mu^T$$
Because $\Sigma$ is independent of the prior PDF $f(\mu|\mu_0, \Sigma_0)$, it is estimated by MLE as usual:
$$\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} X_iX_i^T - \bar{X}\bar{X}^T$$
The estimate $\hat{\Sigma}$ in MAP here is the same as the one in MLE, and so it is biased. Substituting $\hat{\Sigma}$ for $\Sigma$, we obtain the estimate $\hat{\mu}$ in MAP:
$$\hat{\mu} = \big(\hat{\Sigma}\Sigma_0^{-1} + NI\big)^{-1}\big(\hat{\Sigma}\Sigma_0^{-1}\mu_0 + N\bar{X}\big)$$
Note, $E(\bar{X}) = \mu_0$ and $V(\bar{X}) = \frac{1}{N}\Sigma$.
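The closed-form MAP update for $\hat{\mu}$ can be coded directly. The sketch below is illustrative only (the dimensions, the data generator, and the values of $\mu_0$ and $\Sigma_0$ are assumptions): it computes $\hat{\Sigma}$ and then $\hat{\mu} = (\hat{\Sigma}\Sigma_0^{-1} + NI)^{-1}(\hat{\Sigma}\Sigma_0^{-1}\mu_0 + N\bar{X})$, which pulls the sample mean toward the prior mean $\mu_0$.

```python
import numpy as np

# Illustrative computation of the MAP estimates above (assumed values).
rng = np.random.default_rng(4)
n, N = 2, 30
mu_true = np.array([1.0, 2.0])
Sigma_true = np.array([[1.0, 0.3], [0.3, 0.5]])
mu0 = np.zeros(n)                       # prior mean of mu
Sigma0 = 0.25 * np.eye(n)               # prior covariance of mu

X = rng.multivariate_normal(mu_true, Sigma_true, size=N)
X_bar = X.mean(axis=0)
Sigma_hat = X.T @ X / N - np.outer(X_bar, X_bar)   # MLE covariance (biased)

A = Sigma_hat @ np.linalg.inv(Sigma0) + N * np.eye(n)
mu_hat = np.linalg.solve(A, Sigma_hat @ np.linalg.inv(Sigma0) @ mu0 + N * X_bar)

print("sample mean X_bar:", X_bar)
print("MAP estimate mu_hat:", mu_hat)   # shifted toward mu0 relative to X_bar
```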
58. 2. Parameter estimation
Now we check whether $\hat{\mu}$ is an unbiased estimate. In fact, we have:
$$E(\hat{\mu}) = E\Big(\big(\hat{\Sigma}\Sigma_0^{-1} + NI\big)^{-1}\big(\hat{\Sigma}\Sigma_0^{-1}\mu_0 + N\bar{X}\big)\Big) = \dots = \mu_0$$
Therefore, the estimate $\hat{\mu}$ is biased because the variable $\mu$ does not always equal $\mu_0$. Now we check again whether $\hat{\mu}$ is unbiased with the Cramer-Rao lower bound. The second-order partial derivative of $\ell(\Theta)$ with regard to $\mu$ is:
$$\frac{\partial^2\ell(\Theta)}{\partial\mu^2} = \frac{\partial}{\partial\mu}\left(\frac{\partial\ell(\Theta)}{\partial\mu}\right) = \dots = -\big(\Sigma_0^{-1} + N\Sigma^{-1}\big)$$
The Cramer-Rao lower bound of $\hat{\mu}$ is:
$$CR(\hat{\mu}) = -\frac{1}{N}\left(\int_X\int_\Theta \frac{\partial^2\ell(\Theta)}{\partial\mu^2}\, f(X, \Theta|\xi)\,\mathrm{d}X\,\mathrm{d}\Theta\right)^{-1} = \dots = \frac{1}{N}\big(\Sigma_0^{-1} + N\Sigma^{-1}\big)^{-1}$$
Suppose we fix $\Sigma$ so that $\hat{\Sigma} = \Sigma_0 = \Sigma$; the variance of $\hat{\mu}$ is then calculated as follows:
$$V(\hat{\mu}) = V\Big(\big(\hat{\Sigma}\Sigma_0^{-1} + NI\big)^{-1}\big(\hat{\Sigma}\Sigma_0^{-1}\mu_0 + N\bar{X}\big)\Big) = \dots = \frac{N}{(N+1)^2}\Sigma$$
By fixing $\hat{\Sigma} = \Sigma_0 = \Sigma$, the Cramer-Rao lower bound of $\hat{\mu}$ is re-written as:
$$CR(\hat{\mu}) = \frac{1}{N}\big(\Sigma_0^{-1} + N\Sigma^{-1}\big)^{-1} = \frac{1}{N(N+1)}\Sigma$$
Obviously, $\hat{\mu}$ is a biased estimate due to $V(\hat{\mu}) \neq CR(\hat{\mu})$. In general, the estimate $\hat{\Theta}$ in MAP is affected by the prior PDF $f(\Theta|\xi)$. Even though it is biased, it can be better than the one resulting from MLE because of the valuable information in $f(\Theta|\xi)$. For instance, when $\Sigma$ is fixed, the variance of $\hat{\mu}$ from MAP, $\frac{N}{(N+1)^2}\Sigma$, is "smaller" (lower bounded) than the one from MLE, $\frac{1}{N}\Sigma$.
59. 3. EM algorithm
Now we skim through an introduction to the EM algorithm. Suppose there are two spaces $\boldsymbol{X}$ and $\boldsymbol{Y}$, in which $\boldsymbol{X}$ is the hidden space whereas $\boldsymbol{Y}$ is the observed space. We do not know $\boldsymbol{X}$, but there is a mapping from $\boldsymbol{X}$ to $\boldsymbol{Y}$ so that we can survey $\boldsymbol{X}$ by observing $\boldsymbol{Y}$. The mapping is a many-one function $\varphi: \boldsymbol{X} \to \boldsymbol{Y}$, and we denote $\varphi^{-1}(Y) = \{X \in \boldsymbol{X}: \varphi(X) = Y\}$, the set of all $X \in \boldsymbol{X}$ such that $\varphi(X) = Y$. We also denote $X(Y) = \varphi^{-1}(Y)$. Let $f(X|\Theta)$ be the PDF of the random variable $X \in \boldsymbol{X}$ and let $g(Y|\Theta)$ be the PDF of the random variable $Y \in \boldsymbol{Y}$; $Y$ is also called the observation. Equation 1.34 specifies $g(Y|\Theta)$ as the integral of $f(X|\Theta)$ over $\varphi^{-1}(Y)$.
$$g(Y|\Theta) = \int_{\varphi^{-1}(Y)} f(X|\Theta)\,\mathrm{d}X \quad (1.34)$$
Where $\Theta$ is the probabilistic parameter represented as a column vector, $\Theta = (\theta_1, \theta_2, \dots, \theta_r)^T$, in which each $\theta_i$ is a particular parameter. If $X$ and $Y$ are discrete, equation 1.34 is re-written as follows:
$$g(Y|\Theta) = \sum_{X \in \varphi^{-1}(Y)} f(X|\Theta)$$
According to the viewpoint of Bayesian statistics, $\Theta$ is also a random variable. As a convention, let $\Omega$ be the domain of $\Theta$ such that $\Theta \in \Omega$, and the dimension of $\Omega$ is $r$. For example, the normal distribution has two particular parameters, the mean $\mu$ and the variance $\sigma^2$, so we have $\Theta = (\mu, \sigma^2)^T$. Note that $\Theta$ can degrade into a scalar, $\Theta = \theta$.
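For a quick feel for the discrete form of equation 1.34, the toy sketch below (purely illustrative; the two-coordinate hidden space, the mapping $\varphi(X) = x_1 + x_2$, and the probabilities are assumptions) computes $g(Y|\Theta)$ by summing $f(X|\Theta)$ over $\varphi^{-1}(Y)$.

```python
from itertools import product

# Toy discrete illustration of g(Y|Theta) = sum of f(X|Theta) over phi^{-1}(Y).
# Hidden X = (x1, x2) with x1, x2 in {0, 1, 2}; observed Y = phi(X) = x1 + x2.
p = {0: 0.5, 1: 0.3, 2: 0.2}                       # assumed marginal probabilities

def f(X):                                          # f(X|Theta): x1, x2 independent here
    x1, x2 = X
    return p[x1] * p[x2]

def g(Y):                                          # discrete version of equation 1.34
    return sum(f(X) for X in product(p, p) if sum(X) == Y)

for Y in range(5):
    print(Y, g(Y))
print("total:", sum(g(Y) for Y in range(5)))       # sums to 1, as a PDF should
```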
60. 3. EM algorithm
The conditional PDF of $X$ given $Y$, denoted $k(X|Y, \Theta)$, is specified by equation 1.35.
$$k(X|Y, \Theta) = \frac{f(X|\Theta)}{g(Y|\Theta)} \quad (1.35)$$
According to DLR (Dempster, Laird, & Rubin, 1977, p. 1), $X$ is called complete data, and the term "incomplete data" implies the existence of $X$ and $Y$ where $X$ is not observed directly and is only known through the many-one mapping $\varphi: \boldsymbol{X} \to \boldsymbol{Y}$. In general, we only know $Y$, $f(X|\Theta)$, and $k(X|Y, \Theta)$, and so our purpose is to estimate $\Theta$ based on them. Like the MLE approach, the EM algorithm also maximizes a likelihood function to estimate $\Theta$, but the likelihood function in EM concerns $Y$, and there are some further aspects of EM that will be described later. Pioneers of the EM algorithm first assumed that $f(X|\Theta)$ belongs to the exponential family, noting that many popular distributions such as the normal, multinomial, and Poisson distributions belong to the exponential family. Although DLR (Dempster, Laird, & Rubin, 1977) proposed a generalization of the EM algorithm in which $f(X|\Theta)$ is distributed arbitrarily, we should look at the exponential family a little. The exponential family (Wikipedia, Exponential family, 2016) refers to a set of probabilistic distributions whose PDFs share the same exponential form according to equation 1.36 (Dempster, Laird, & Rubin, 1977, p. 3):
$$f(X|\Theta) = \frac{b(X)\exp\big(\Theta^T\tau(X)\big)}{a(\Theta)} \quad (1.36)$$
61. 3. EM algorithm
$$f(X|\Theta) = \frac{b(X)\exp\big(\Theta^T\tau(X)\big)}{a(\Theta)} \quad (1.36)$$
Where $b(X)$ is a function of $X$, called the base measure, and $\tau(X)$ is a vector function of $X$, the sufficient statistic. For example, the sufficient statistic of the normal distribution is $\tau(X) = (X, XX^T)^T$. Equation 1.36 expresses the canonical form of the exponential family. Recall that $\Omega$ is the domain of $\Theta$ such that $\Theta \in \Omega$, and suppose that $\Omega$ is a convex set. If $\Theta$ is restricted only to $\Omega$ then $f(X|\Theta)$ specifies a regular exponential family; if $\Theta$ lies in a curved sub-manifold $\Omega_0$ of $\Omega$ then $f(X|\Theta)$ specifies a curved exponential family. The $a(\Theta)$ is the partition function for variable $X$, which is used for normalization.
$$a(\Theta) = \int_X b(X)\exp\big(\Theta^T\tau(X)\big)\,\mathrm{d}X$$
As usual, a PDF is known in its popular form, and its exponential family form (the canonical form specified by equation 1.36) may look unlike the popular form even though both represent the same distribution. Therefore, the parameter in the popular form is different from the parameter in the exponential family form.
62. 3. EM algorithm
For example, the multinormal distribution with theoretical mean $\mu$ and covariance matrix $\Sigma$ of random variable $X = (x_1, x_2, \dots, x_n)^T$ has the PDF in popular form:
$$f(X|\mu, \Sigma) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(X-\mu)^T\Sigma^{-1}(X-\mu)\right)$$
Hence, the parameter in popular form is $\Theta = (\mu, \Sigma)^T$. The exponential family form of this PDF is:
$$f(X|\theta_1, \theta_2) = (2\pi)^{-\frac{n}{2}} \exp\left((\theta_1, \theta_2)\begin{pmatrix} X \\ XX^T \end{pmatrix}\right) \Big/ \exp\left(-\frac{1}{4}\theta_1^T\theta_2^{-1}\theta_1 - \frac{1}{2}\log|-2\theta_2|\right)$$
Where
$$\Theta = \begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix}, \quad \theta_1 = \Sigma^{-1}\mu, \quad \theta_2 = -\frac{1}{2}\Sigma^{-1}$$
$$b(X) = (2\pi)^{-\frac{n}{2}}, \quad \tau(X) = \begin{pmatrix}X\\ XX^T\end{pmatrix}, \quad a(\Theta) = \exp\left(-\frac{1}{4}\theta_1^T\theta_2^{-1}\theta_1 - \frac{1}{2}\log|-2\theta_2|\right)$$
Hence, the parameter in exponential family form is $\Theta = (\theta_1, \theta_2)^T$. Although $f(X|\theta_1, \theta_2)$ looks unlike $f(X|\mu, \Sigma)$, they are the same: $f(X|\theta_1, \theta_2) = f(X|\mu, \Sigma)$.
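To convince ourselves that the two parameterizations agree, the sketch below (illustrative; the chosen $\mu$, $\Sigma$, and test point are assumptions) evaluates both forms of the multinormal density at the same point. For the matrix-valued part of $\Theta^T\tau(X)$, the inner product of $\theta_2$ with $XX^T$ is computed as a trace.

```python
import numpy as np

# Check that the popular form and the exponential family form of the
# multinormal PDF coincide (assumed values, purely illustrative).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = np.array([0.3, 0.7])
n = len(mu)

# Popular form f(X | mu, Sigma)
Si = np.linalg.inv(Sigma)
popular = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5) \
          * np.exp(-0.5 * (X - mu) @ Si @ (X - mu))

# Exponential family form f(X | theta1, theta2)
theta1 = Si @ mu
theta2 = -0.5 * Si
inner = theta1 @ X + np.trace(theta2 @ np.outer(X, X))     # Theta^T tau(X)
log_a = -0.25 * theta1 @ np.linalg.inv(theta2) @ theta1 \
        - 0.5 * np.log(np.linalg.det(-2 * theta2))
exp_family = (2 * np.pi) ** (-n / 2) * np.exp(inner) / np.exp(log_a)

print(popular, exp_family)   # the two values agree
```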
63. 3. EM algorithm
The exponential family form is used to represent all distributions belonging to the exponential family in canonical form. The parameter in exponential family form is called the exponential family parameter. As a convention, the parameter $\Theta$ mentioned in the EM algorithm is often the exponential family parameter if the PDF belongs to the exponential family and there is no additional information. Table 1.1 shows some popular distributions belonging to the exponential family along with their canonical forms (Wikipedia, Exponential family, 2016). In the case of multivariate distributions, the dimension of the random variable $X = (x_1, x_2, \dots, x_n)^T$ is $n$.
64. 3. EM algorithm
It is necessary to survey some features of the exponential family. The first-order derivative of $\log a(\Theta)$ is the expectation of the transposed $\tau(X)$:
$$\log' a(\Theta) = \frac{\mathrm{d}\log a(\Theta)}{\mathrm{d}\Theta} = \int_X \tau(X)^T \frac{b(X)\exp\big(\Theta^T\tau(X)\big)}{a(\Theta)}\,\mathrm{d}X = E\big(\tau(X)|\Theta\big)^T$$
The second-order derivative of $\log a(\Theta)$ is (Jebara, 2015):
$$\log'' a(\Theta) = V\big(\tau(X)|\Theta\big) = \int_X \big(\tau(X) - E(\tau(X)|\Theta)\big)\big(\tau(X) - E(\tau(X)|\Theta)\big)^T f(X|\Theta)\,\mathrm{d}X$$
Where $V(\tau(X)|\Theta)$ is the central covariance matrix of $\tau(X)$. If $f(X|\Theta)$ follows the exponential family then $k(X|Y, \Theta)$ also follows the exponential family:
$$k(X|Y, \Theta) = \frac{b(X)\exp\big(\Theta^T\tau(X)\big)}{a(\Theta|Y)}$$
Note, $a(\Theta|Y)$ is the observed partition function for observation $Y$:
$$a(\Theta|Y) = \int_{\varphi^{-1}(Y)} b(X)\exp\big(\Theta^T\tau(X)\big)\,\mathrm{d}X$$
Similarly, we obtain that the first-order derivative of $\log a(\Theta|Y)$ is the expectation of the transposed $\tau(X)$ based on $Y$:
$$\log' a(\Theta|Y) = E\big(\tau(X)|Y, \Theta\big)^T$$
The second-order derivative of $\log a(\Theta|Y)$ is:
$$\log'' a(\Theta|Y) = V\big(\tau(X)|Y, \Theta\big) = \int_{\varphi^{-1}(Y)} \big(\tau(X) - E(\tau(X)|Y, \Theta)\big)\big(\tau(X) - E(\tau(X)|Y, \Theta)\big)^T k(X|Y, \Theta)\,\mathrm{d}X$$
Where $V(\tau(X)|Y, \Theta)$ is the central covariance matrix of $\tau(X)$ given the observed $Y$.
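These identities are easy to verify numerically in a one-dimensional case. The sketch below is illustrative: it uses the Poisson distribution with natural parameter $\theta = \log\lambda$, base measure $b(x) = 1/x!$, sufficient statistic $\tau(x) = x$, and partition function $a(\theta) = \exp(e^\theta)$ (standard facts about the Poisson family, not content of the slides), and compares finite-difference derivatives of $\log a(\theta)$ with $E(\tau(X)|\theta) = \lambda$ and $V(\tau(X)|\theta) = \lambda$.

```python
import numpy as np

# Numeric check of log'a(Theta) = E(tau(X)|Theta) and log''a(Theta) = V(tau(X)|Theta)
# for the Poisson family (assumed example): log a(theta) = exp(theta), theta = log(lambda).
lam = 3.0
theta = np.log(lam)

def log_a(t):
    return np.exp(t)            # log a(theta) = e^theta for the Poisson family

h = 1e-5
d1 = (log_a(theta + h) - log_a(theta - h)) / (2 * h)                  # log'a
d2 = (log_a(theta + h) - 2 * log_a(theta) + log_a(theta - h)) / h**2  # log''a

print("log'a  =", d1, " E[tau(X)] = lambda =", lam)    # both close to 3
print("log''a =", d2, " V[tau(X)] = lambda =", lam)    # both close to 3
```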
65. 3. EM algorithm
Table 1.2 summarizes $f(X|\Theta)$, $g(Y|\Theta)$, $k(X|Y, \Theta)$, $a(\Theta)$, $\log' a(\Theta)$, $\log'' a(\Theta)$, $a(\Theta|Y)$, $\log' a(\Theta|Y)$, and $\log'' a(\Theta|Y)$ for the exponential family.
Simply put, the EM algorithm is an iterative process consisting of many iterations, in which each iteration has an expectation step (E-step) and a maximization step (M-step). The E-step aims to estimate the sufficient statistic given the current parameter and the observed data $Y$, whereas the M-step aims to re-estimate the parameter based on that sufficient statistic by maximizing the likelihood function of $X$ related to $Y$. The EM algorithm is described in detail in the next section. As an introduction, DLR gave an example illustrating the EM algorithm (Dempster, Laird, & Rubin, 1977, pp. 2-3).
66. 3. EM algorithm
Example 1.1. Rao (Rao, 1955) presents observed data $Y$ of 197 animals following a multinomial distribution with four categories, $Y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$. The PDF of $Y$ is:
$$g(Y|\theta) = \frac{\left(\sum_{i=1}^{4} y_i\right)!}{\prod_{i=1}^{4} y_i!} \left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1} \left(\frac{1}{4} - \frac{\theta}{4}\right)^{y_2} \left(\frac{1}{4} - \frac{\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}$$
Note, the probabilities $p_{y_1}$, $p_{y_2}$, $p_{y_3}$, and $p_{y_4}$ in $g(Y|\theta)$ are $1/2 + \theta/4$, $1/4 - \theta/4$, $1/4 - \theta/4$, and $\theta/4$, respectively, as parameters. The expectation of any sufficient statistic $y_i$ with regard to $g(Y|\theta)$ is:
$$E(y_i|Y, \theta) = y_i p_{y_i}$$
The observed data $Y$ is associated with hidden data $X$ following a multinomial distribution with five categories, $X = (x_1, x_2, x_3, x_4, x_5)$, where $y_1 = x_1 + x_2$, $y_2 = x_3$, $y_3 = x_4$, $y_4 = x_5$. The PDF of $X$ is:
$$f(X|\theta) = \frac{\left(\sum_{i=1}^{5} x_i\right)!}{\prod_{i=1}^{5} x_i!} \left(\frac{1}{2}\right)^{x_1} \left(\frac{\theta}{4}\right)^{x_2} \left(\frac{1}{4} - \frac{\theta}{4}\right)^{x_3} \left(\frac{1}{4} - \frac{\theta}{4}\right)^{x_4} \left(\frac{\theta}{4}\right)^{x_5}$$
Note, the probabilities $p_{x_1}$, $p_{x_2}$, $p_{x_3}$, $p_{x_4}$, and $p_{x_5}$ in $f(X|\theta)$ are $1/2$, $\theta/4$, $1/4 - \theta/4$, $1/4 - \theta/4$, and $\theta/4$, respectively, as parameters. The expectation of any sufficient statistic $x_i$ with regard to $f(X|\theta)$ is $E(x_i|\theta) = x_i p_{x_i}$.
Due to $y_1 = x_1 + x_2$, $y_2 = x_3$, $y_3 = x_4$, $y_4 = x_5$, the mapping function $\varphi$ between $X$ and $Y$ is $y_1 = \varphi(x_1, x_2) = x_1 + x_2$. Therefore, $g(Y|\theta)$ is the sum of $f(X|\theta)$ over $x_1$ and $x_2$ such that $x_1 + x_2 = y_1$, according to equation 1.34. In other words, $g(Y|\theta)$ results from summing $f(X|\theta)$ over all $(x_1, x_2)$ pairs such as $(0, 125), (1, 124), \dots, (125, 0)$ and then substituting $(18, 20, 34)$ for $(x_3, x_4, x_5)$, because $y_1 = 125$ in the observed $Y$.
$$g(Y|\theta) = \sum_{x_1=0}^{125} f(x_1,\, 125 - x_1,\, 18,\, 20,\, 34 \,|\, \theta)$$
67. 3. EM algorithm
The EM algorithm is applied to Rao's data (Rao, 1955) to determine the optimal estimate $\theta^*$. Note, $y_2 = x_3$, $y_3 = x_4$, $y_4 = x_5$ are known, so only the sufficient statistics $x_1$ and $x_2$ are unknown. Given the $t$-th iteration, the sufficient statistics $x_1$ and $x_2$ are estimated as $x_1^{(t)}$ and $x_2^{(t)}$ based on the current parameter $\theta^{(t)}$ and $g(Y|\theta)$ in the E-step below:
$$x_1^{(t)} + x_2^{(t)} = y_1^{(t)} = E\big(y_1|Y, \theta^{(t)}\big)$$
Given $p_{y_1} = 1/2 + \theta/4$, this implies that:
$$y_1^{(t)} = E\big(y_1|Y, \theta^{(t)}\big) = y_1 p_{y_1} = y_1\left(\frac{1}{2} + \frac{\theta^{(t)}}{4}\right)$$
When $y_1 = 125$, we have:
$$x_1^{(t)} + x_2^{(t)} = 125\left(\frac{1}{2} + \frac{\theta^{(t)}}{4}\right)$$
This suggests that we select:
$$x_1^{(t)} = E\big(x_1|Y, \theta^{(t)}\big) = 125\,\frac{1/2}{1/2 + \theta^{(t)}/4}, \qquad x_2^{(t)} = E\big(x_2|Y, \theta^{(t)}\big) = 125\,\frac{\theta^{(t)}/4}{1/2 + \theta^{(t)}/4}$$
68. 3. EM algorithm
According to the M-step, the next estimate $\theta^{(t+1)}$ is a maximizer of the log-likelihood function of $X$ related to $Y$. This log-likelihood function is:
$$\log f(X|\theta) = \log\frac{\left(\sum_{i=1}^{5} x_i\right)!}{\prod_{i=1}^{5} x_i!} - (x_1 + 2x_2 + 2x_3 + 2x_4 + 2x_5)\log 2 + (x_2 + x_5)\log\theta + (x_3 + x_4)\log(1-\theta)$$
The first-order derivative of $\log f(X|\theta)$ is:
$$\frac{\mathrm{d}\log f(X|\theta)}{\mathrm{d}\theta} = \frac{x_2 + x_5}{\theta} - \frac{x_3 + x_4}{1-\theta} = \frac{x_2 + x_5 - (x_2 + x_3 + x_4 + x_5)\theta}{\theta(1-\theta)}$$
Because $y_2 = x_3 = 18$, $y_3 = x_4 = 20$, $y_4 = x_5 = 34$, and $x_2$ is approximated by $x_2^{(t)}$, we have:
$$\frac{\mathrm{d}\log f(X|\theta)}{\mathrm{d}\theta} = \frac{x_2^{(t)} + 34 - \big(x_2^{(t)} + 72\big)\theta}{\theta(1-\theta)}$$
As a maximizer of $\log f(X|\theta)$, the next estimate $\theta^{(t+1)}$ is the solution of the equation:
$$\frac{\mathrm{d}\log f(X|\theta)}{\mathrm{d}\theta} = \frac{x_2^{(t)} + 34 - \big(x_2^{(t)} + 72\big)\theta}{\theta(1-\theta)} = 0$$
So we have:
$$\theta^{(t+1)} = \frac{x_2^{(t)} + 34}{x_2^{(t)} + 72}, \qquad \text{where } x_2^{(t)} = 125\,\frac{\theta^{(t)}/4}{1/2 + \theta^{(t)}/4}$$
69. 3. EM algorithm
For example, given the initial value $\theta^{(1)} = 0.5$, at the first iteration we have:
$$x_2^{(1)} = 125\,\frac{\theta^{(1)}/4}{1/2 + \theta^{(1)}/4} = \frac{125 \times 0.5/4}{0.5 + 0.5/4} = 25$$
$$\theta^{(2)} = \frac{x_2^{(1)} + 34}{x_2^{(1)} + 72} = \frac{25 + 34}{25 + 72} = 0.6082$$
After five iterations we get the optimal estimate $\theta^*$:
$$\theta^* = \theta^{(5)} = \theta^{(6)} = 0.6268$$
Table 1.3 (Dempster, Laird, & Rubin, 1977, p. 3) lists the estimates of $\theta$ over five iterations ($t = 1, 2, 3, 4, 5$), with the note that $\theta^{(1)}$ is initialized arbitrarily and $\theta^* = \theta^{(5)} = \theta^{(6)}$ is determined at the 5th iteration. The third column gives the deviation between $\theta^*$ and $\theta^{(t)}$, whereas the fourth column gives the ratio of successive deviations. Later on, we will see that this ratio implies the convergence rate. For example, at the first iteration, we have:
$$\theta^* - \theta^{(1)} = 0.6268 - 0.5 = 0.1268$$
$$\frac{\theta^* - \theta^{(2)}}{\theta^* - \theta^{(1)}} = \frac{\theta^{(2)} - \theta^*}{\theta^{(1)} - \theta^*} = \frac{0.6082 - 0.6268}{0.5 - 0.6268} = 0.1465$$
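This E-step/M-step loop is only a few lines of code. The sketch below is my own illustration of the updates above (not the authors' code); it reproduces $\theta^{(2)} \approx 0.6082$ and the limit $\theta^* \approx 0.6268$.

```python
# EM iteration for Example 1.1: E-step estimates x2, M-step updates theta.
y1, y2, y3, y4 = 125, 18, 20, 34

theta = 0.5                                   # theta(1), initialized arbitrarily
for t in range(1, 11):
    # E-step: expected count x2 given the current parameter
    x2 = y1 * (theta / 4) / (0.5 + theta / 4)
    # M-step: maximize log f(X|theta) with x2 fixed at its expectation
    theta = (x2 + y4) / (x2 + y2 + y3 + y4)   # = (x2 + 34) / (x2 + 72)
    print(f"t={t:2d}  x2={x2:8.4f}  theta={theta:.6f}")
# theta converges to about 0.626821, i.e. theta* ~ 0.6268
```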
70. References
1. Bilmes, J. A. (1998). A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden
Markov Models. International Computer Science Institute, Department of Electrical Engineering and Computer Science. Berkeley:
University of Washington. Retrieved from http://melodi.ee.washington.edu/people/bilmes/mypubs/bilmes1997-em.pdf
2. Burden, R. L., & Faires, D. J. (2011). Numerical Analysis (9th Edition ed.). (M. Julet, Ed.) Brooks/Cole Cengage Learning.
3. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. (M. Stone, Ed.)
Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1-38.
4. Hardle, W., & Simar, L. (2013). Applied Multivariate Statistical Analysis. Berlin, Germany: Research Data Center, School of Business and
Economics, Humboldt University.
5. Jebara, T. (2015). The Exponential Family of Distributions. Columbia University, Computer Science Department. New York: Columbia
Machine Learning Lab. Retrieved April 27, 2016, from http://www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf
6. Jia, Y.-B. (2013). Lagrange Multipliers. Lecture notes on course “Problem Solving Techniques for Applied Computer Science”, Iowa State
University of Science and Technology, USA.
7. Montgomery, D. C., & Runger, G. C. (2003). Applied Statistics and Probability for Engineers (3rd Edition ed.). New York, NY, USA: John
Wiley & Sons, Inc.
8. Nguyen, L. (2015). Matrix Analysis and Calculus (1st ed.). (C. Evans, Ed.) Hanoi, Vietnam: Lambert Academic Publishing. Retrieved
from https://www.shuyuan.sg/store/gb/book/matrix-analysis-and-calculus/isbn/978-3-659-69400-4
9. Rao, R. C. (1955, June). Estimation and tests of significance in factor analysis. Psychometrika, 20(2), 93-111. doi:10.1007/BF02288983
10. Rosen, K. H. (2012). Discrete Mathematics and Its Applications (7th Edition ed.). (M. Lange, Ed.) McGraw-Hill Companies.
11.StackExchange. (2013, November 19). Eigenvalues of the product of 2 symmetric matrices. (Stack Exchange Network) Retrieved
February 9, 2018, from Mathematics StackExchange: https://math.stackexchange.com/questions/573583/eigenvalues-of-the-product-of-2-
symmetric-matrices
71. References
12. Steorts, R. (2018). The Multivariate Distributions: Normal and Wishart. Duke University. Rebecca Steorts Homepage. Retrieved October 21, 2020, from
http://www2.stat.duke.edu/~rcs46/lecturesModernBayes/601-module10-multivariate-normal/multivariate-normal.pdf
13. Ta, P. D. (2014). Numerical Analysis Lecture Notes. Vietnam Institute of Mathematics, Numerical Analysis and Scientific Computing. Hanoi: Vietnam
Institute of Mathematics. Retrieved 2014
14. Wikipedia. (2014, August 4). Karush–Kuhn–Tucker conditions. (Wikimedia Foundation) Retrieved November 16, 2014, from Wikipedia website:
http://en.wikipedia.org/wiki/Karush–Kuhn–Tucker_conditions
15. Wikipedia. (2014, October 10). Set (mathematics). (A. Rubin, Editor, & Wikimedia Foundation) Retrieved October 11, 2014, from Wikipedia website:
http://en.wikipedia.org/wiki/Set_(mathematics)
16. Wikipedia. (2016, March September). Exponential family. (Wikimedia Foundation) Retrieved 2015, from Wikipedia website:
https://en.wikipedia.org/wiki/Exponential_family
17. Wikipedia. (2017, February 27). Commuting matrices. (Wikimedia Foundation) Retrieved February 9, 2018, from Wikipedia website:
https://en.wikipedia.org/wiki/Commuting_matrices
18. Wikipedia. (2017, November 27). Diagonalizable matrix. (Wikimedia Foundation) Retrieved February 10, 2018, from Wikipedia website:
https://en.wikipedia.org/wiki/Diagonalizable_matrix#Simultaneous_diagonalization
19. Wikipedia. (2017, March 2). Maximum a posteriori estimation. (Wikimedia Foundation) Retrieved April 15, 2017, from Wikipedia website:
https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
20. Wikipedia. (2018, January 15). Conjugate prior. (Wikimedia Foundation) Retrieved February 15, 2018, from Wikipedia website:
https://en.wikipedia.org/wiki/Conjugate_prior
21. Wikipedia. (2018, January 28). Gradient descent. (Wikimedia Foundation) Retrieved February 20, 2018, from Wikipedia website:
https://en.wikipedia.org/wiki/Gradient_descent
22. Wikipedia. (2018, January 7). Symmetry of second derivatives. (Wikimedia Foundation) Retrieved February 10, 2018, from Wikipedia website:
https://en.wikipedia.org/wiki/Symmetry_of_second_derivatives
23. Zivot, E. (2009). Maximum Likelihood Estimation. Lecture Notes on course "Econometric Theory I: Estimation and Inference (first quarter, second year
PhD)", University of Washington, Seattle, Washington, USA.
72. Thank you for your attention