This document discusses conditional mixture models, including mixtures of linear regression models, mixtures of logistic models, and mixtures of experts models. It provides details on learning the parameters of these models using the EM algorithm. Mixtures of experts models use a gating network to determine which expert network is responsible for different regions of the input space. Hierarchical mixtures of experts extend this idea by incorporating multiple levels of gating networks.
5. Conditional Mixture Models
Positive
soft, probabilistic splits of the input space
splits are functions of all of the input variables
Negative
reduced interpretability
Fig. 14.9: two candidate models (Model A and Model B) fit to the data.
6. Mixture of Experts Model (MoE)
$p(\mathbf{t} \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, p_k(\mathbf{t} \mid \mathbf{x})$
Hierarchical Mixture of Experts (HME): a fully probabilistic tree-based model.
$p(\mathbf{t} \mid \mathbf{x}) = \sum_{i} g_i(\mathbf{x}) \sum_{j} g_{j \mid i}(\mathbf{x})\, p_{ij}(\mathbf{t} \mid \mathbf{x})$
8. Gaussian Mixture Models
Clustering (unsupervised learning)
Fig. 9.5: observed data drawn from a Gaussian mixture with K = 3 components.

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
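As a quick illustration, a minimal NumPy sketch that evaluates this K = 3 mixture density at a point; the parameter values here are arbitrary, not taken from the figure:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gmm_density(x, pis, means, variances):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    return sum(pi * gaussian_pdf(x, m, v)
               for pi, m, v in zip(pis, means, variances))

# A K = 3 mixture with arbitrary illustrative parameters.
pis = [0.5, 0.3, 0.2]          # mixing coefficients, sum to 1
means = [-2.0, 0.0, 3.0]
variances = [1.0, 0.5, 2.0]

print(gmm_density(0.0, pis, means, variances))
```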
9. Mixtures of linear regression models
$p(t \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(t \mid \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}, \beta^{-1})$
A generalization of switching regression: each mixture component is a linear regression model.
10. Learning the parameters
EM algorithm
The log likelihood:

$\ln p(\mathbf{t} \mid \theta) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(t_n \mid \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}_n, \beta^{-1}) \right)$

Parameters: $\theta = \{\mathbf{W}, \boldsymbol{\pi}, \beta\}$
Data set: $\{\boldsymbol{\phi}_n, t_n\}$, $n = 1, \dots, N$
11. E step
Responsibilities, computed with the parameters initialized to $\theta^{\text{old}}$:

$\gamma_{nk} = \mathbb{E}[z_{nk}] = \dfrac{\pi_k\, \mathcal{N}(t_n \mid \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}_n, \beta^{-1})}{\sum_{j} \pi_j\, \mathcal{N}(t_n \mid \mathbf{w}_j^{\mathsf T} \boldsymbol{\phi}_n, \beta^{-1})}$

The expectation of the complete-data log likelihood:

$Q(\theta, \theta^{\text{old}}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(t_n \mid \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}_n, \beta^{-1}) \right\}$
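A minimal NumPy sketch of this E step, assuming a design matrix Phi, targets t, and current parameters pi, W, beta; all variable names here are my own, not the text's:

```python
import numpy as np

def e_step(Phi, t, pi, W, beta):
    """Responsibilities gamma[n, k] for a mixture of linear regressions.

    Phi : (N, M) design matrix, t : (N,) targets,
    pi : (K,) mixing coefficients, W : (K, M) regression weights,
    beta : scalar noise precision.
    """
    means = Phi @ W.T                                   # (N, K) predicted means
    # N(t_n | w_k^T phi_n, beta^{-1}) for every n, k
    dens = np.sqrt(beta / (2 * np.pi)) * np.exp(
        -0.5 * beta * (t[:, None] - means) ** 2)
    unnorm = pi[None, :] * dens                         # pi_k * N(...)
    return unnorm / unnorm.sum(axis=1, keepdims=True)   # normalize over k

# Tiny synthetic example: N = 5 points, M = 2 features, K = 2 components.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 2))
t = rng.normal(size=5)
gamma = e_step(Phi, t, pi=np.array([0.5, 0.5]),
               W=rng.normal(size=(2, 2)), beta=1.0)
print(gamma.sum(axis=1))  # each row sums to 1
```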
12. M step
With the responsibilities $\gamma_{nk}$ held fixed, maximize $Q(\theta, \theta^{\text{old}})$ with respect to $\theta$.

1) The mixing coefficients $\pi_k$, subject to the constraint $\sum_k \pi_k = 1$. Using a Lagrange multiplier:

$\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}$
13. M step
2) The parameters $\mathbf{w}_k$ of the k-th linear regression model. Keeping only the terms of $Q$ that depend on $\mathbf{w}_k$:

$Q = \sum_{n=1}^{N} \gamma_{nk} \left\{ -\tfrac{\beta}{2} \left( t_n - \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}_n \right)^2 \right\} + \text{const}$

This is a weighted least squares problem: the linear regression model of (3.12), with each data point $n = 1, \dots, N$ weighted by its responsibility $\gamma_{nk}$ (for K = 3, each point carries three weights).
14. M step
Setting the derivative of $Q$ with respect to $\mathbf{w}_k$, the parameters of the k-th linear regression model, to zero:

$0 = \sum_{n=1}^{N} \gamma_{nk} \left( t_n - \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}_n \right) \boldsymbol{\phi}_n$

In matrix notation:

$\mathbf{w}_k = \left( \boldsymbol{\Phi}^{\mathsf T} \mathbf{R}_k \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathsf T} \mathbf{R}_k \mathbf{t}$, where $\mathbf{R}_k = \mathrm{diag}(\gamma_{nk})$ is a diagonal matrix of responsibilities.

Each $\mathbf{w}_k$ is learned mainly from the data points that assign it high responsibility.
15. M step
3) The precision parameter $\beta$:

$\frac{1}{\beta} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left( t_n - \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}_n \right)^2$
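The three M-step updates combined into one NumPy sketch, continuing the hypothetical names from the E-step sketch (gamma comes from the E step):

```python
import numpy as np

def m_step(Phi, t, gamma):
    """M-step updates for a mixture of linear regressions.

    gamma : (N, K) responsibilities from the E step.
    Returns updated (pi, W, beta).
    """
    N, K = gamma.shape
    pi = gamma.mean(axis=0)                      # pi_k = (1/N) sum_n gamma_nk
    W = np.empty((K, Phi.shape[1]))
    for k in range(K):
        R_k = np.diag(gamma[:, k])               # R_k = diag(gamma_nk)
        # Weighted least squares: w_k = (Phi^T R_k Phi)^{-1} Phi^T R_k t
        W[k] = np.linalg.solve(Phi.T @ R_k @ Phi, Phi.T @ R_k @ t)
    resid_sq = (t[:, None] - Phi @ W.T) ** 2     # (t_n - w_k^T phi_n)^2
    beta = 1.0 / ((gamma * resid_sq).sum() / N)  # 1/beta = weighted mean residual
    return pi, W, beta

# Stand-in data and responsibilities, just to exercise the updates.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 2)); t = rng.normal(size=50)
gamma = rng.dirichlet(np.ones(2), size=50)
pi, W, beta = m_step(Phi, t, gamma)
print(pi, beta)
```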
17. The predictive conditional density
$p(t \mid \boldsymbol{\phi}, \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(t \mid \mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}, \beta^{-1})$

Limitation: the mixing coefficients $\pi_k$ are independent of the input.
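A short sketch evaluating this predictive density, and its mean, at a new input; the parameter values are illustrative only:

```python
import numpy as np

def predictive_density(t_grid, phi, pi, W, beta):
    """p(t | phi) = sum_k pi_k N(t | w_k^T phi, beta^{-1}) on a grid of t values."""
    means = W @ phi                                   # (K,) component means
    dens = np.sqrt(beta / (2 * np.pi)) * np.exp(
        -0.5 * beta * (t_grid[:, None] - means[None, :]) ** 2)
    return dens @ pi                                  # mix with input-independent pi_k

def predictive_mean(phi, pi, W):
    """E[t | phi] = sum_k pi_k w_k^T phi."""
    return pi @ (W @ phi)

pi = np.array([0.6, 0.4]); W = np.array([[1.0, 0.5], [-1.0, 0.2]]); beta = 4.0
phi = np.array([1.0, 2.0])
print(predictive_density(np.linspace(-3, 3, 5), phi, pi, W, beta))
print(predictive_mean(phi, pi, W))
```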
22. M step
With the responsibilities $\gamma_{nk}$ held fixed, maximize the expected complete-data log likelihood

$Q(\theta, \theta^{\text{old}}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + t_n \ln y_{nk} + (1 - t_n) \ln (1 - y_{nk}) \right\}$, where $y_{nk} = \sigma(\mathbf{w}_k^{\mathsf T} \boldsymbol{\phi}_n)$.

1) The mixing coefficients: as before, $\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}$.
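A NumPy sketch of this objective, assuming responsibilities gamma from the E step; names are hypothetical:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def q_function(Phi, t, gamma, pi, W):
    """Expected complete-data log likelihood for a mixture of logistic regressions."""
    Y = sigmoid(Phi @ W.T)                             # y_nk = sigmoid(w_k^T phi_n)
    log_lik = (t[:, None] * np.log(Y)
               + (1 - t[:, None]) * np.log(1 - Y))     # Bernoulli log likelihood
    return np.sum(gamma * (np.log(pi)[None, :] + log_lik))

# M-step update of the mixing coefficients, same form as before:
# pi_new = gamma.mean(axis=0)

rng = np.random.default_rng(4)
Phi = rng.normal(size=(6, 2)); t = rng.integers(0, 2, size=6).astype(float)
gamma = rng.dirichlet(np.ones(2), size=6)
print(q_function(Phi, t, gamma, np.array([0.5, 0.5]), rng.normal(size=(2, 2))))
```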
23. M step
2) The parameters $\mathbf{w}_k$ of the k-th logistic regression model. Maximizing $Q$ with respect to $\mathbf{w}_k$ does not have a closed-form solution, so we use the iterative reweighted least squares (IRLS) algorithm, which needs the gradient and Hessian:

$\nabla_{\mathbf{w}_k} Q = \sum_{n=1}^{N} \gamma_{nk} \left( t_n - y_{nk} \right) \boldsymbol{\phi}_n$

$\mathbf{H}_k = -\sum_{n=1}^{N} \gamma_{nk}\, y_{nk} (1 - y_{nk})\, \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathsf T}$
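A sketch of a single Newton/IRLS update for one component's weights, built from the gradient and Hessian above; a real implementation would iterate to convergence and guard against a singular Hessian:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_step(Phi, t, gamma_k, w_k):
    """One Newton/IRLS update of w_k for the k-th logistic component.

    gamma_k : (N,) responsibilities of component k.
    """
    y = sigmoid(Phi @ w_k)                           # y_nk
    grad = Phi.T @ (gamma_k * (t - y))               # sum_n gamma_nk (t_n - y_nk) phi_n
    # Hessian: H = -sum_n gamma_nk y_nk (1 - y_nk) phi_n phi_n^T
    H = -(Phi * (gamma_k * y * (1 - y))[:, None]).T @ Phi
    return w_k - np.linalg.solve(H, grad)            # Newton step: w - H^{-1} grad

rng = np.random.default_rng(5)
Phi = rng.normal(size=(20, 3)); t = (rng.random(20) < 0.5).astype(float)
print(irls_step(Phi, t, rng.random(20), np.zeros(3)))
```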
25. A mixture of logistic regression models
Figure: the true probability of the class label (left), a fit with a single logistic regression (center), and a mixture of two logistic regressions (right; components Model A and Model B).
27. Mixtures of experts
A mixture of experts model:

$p(\mathbf{t} \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, p_k(\mathbf{t} \mid \mathbf{x})$

Gating functions $\pi_k(\mathbf{x})$:
determine which expert is dominant in which region of the input space
can be represented by a linear softmax, e.g. $\pi_k(\mathbf{x}) = \dfrac{\exp(\mathbf{v}_k^{\mathsf T} \mathbf{x})}{\sum_j \exp(\mathbf{v}_j^{\mathsf T} \mathbf{x})}$ (or a sigmoid for two experts)

Experts $p_k(\mathbf{t} \mid \mathbf{x})$:
model the target in different regions of the input space
predict in their own region
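A minimal sketch of this architecture with a linear-softmax gate and linear experts; the weights and shapes here are arbitrary illustrations:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def moe_predictive_mean(x, V, W):
    """Mixture-of-experts mean: sum_k pi_k(x) * (w_k^T x).

    V : (K, D) gating weights -> pi(x) = softmax(V x)
    W : (K, D) expert weights -> expert k predicts w_k^T x
    """
    pi = softmax(V @ x)          # input-dependent gating
    experts = W @ x              # one linear prediction per expert
    return pi @ experts

rng = np.random.default_rng(2)
x = rng.normal(size=3)
print(moe_predictive_mean(x, V=rng.normal(size=(4, 3)),
                          W=rng.normal(size=(4, 3))))
```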
30. Hierarchical Mixture of Experts
A two-level hierarchical mixture:

$p(\mathbf{t} \mid \mathbf{x}) = \sum_{i} g_i(\mathbf{x}) \sum_{j} g_{j \mid i}(\mathbf{x})\, p_{ij}(\mathbf{t} \mid \mathbf{x})$

where $g_i(\mathbf{x})$ are the top-level gating functions, $g_{j \mid i}(\mathbf{x})$ the second-level gates within branch $i$, and $p_{ij}(\mathbf{t} \mid \mathbf{x})$ the experts.
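A sketch of this two-level computation with softmax gates at both levels; shapes and names are illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def hme_predictive_mean(x, V_top, V_sub, W):
    """Two-level HME mean: sum_i g_i(x) sum_j g_{j|i}(x) * (w_ij^T x).

    V_top : (I, D) top-level gate, V_sub : (I, J, D) per-branch gates,
    W : (I, J, D) expert weights.
    """
    g_top = softmax(V_top @ x)                 # g_i(x)
    mean = 0.0
    for i, g_i in enumerate(g_top):
        g_sub = softmax(V_sub[i] @ x)          # g_{j|i}(x)
        mean += g_i * (g_sub @ (W[i] @ x))     # mix experts within branch i
    return mean

rng = np.random.default_rng(6)
x = rng.normal(size=2)
print(hme_predictive_mean(x, rng.normal(size=(2, 2)),
                          rng.normal(size=(2, 3, 2)), rng.normal(size=(2, 3, 2))))
```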
HME Negative:
a large number of parameters => Bayesian HME (2003)
31. 5.6 Mixture density networks
A neural network whose outputs parameterize a mixture model:

$p(\mathbf{t} \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, \mathcal{N}\left(\mathbf{t} \mid \boldsymbol{\mu}_k(\mathbf{x}), \sigma_k^2(\mathbf{x})\right)$

Outputs: the mixing coefficients $\pi_k(\mathbf{x})$ (via a softmax), the means $\boldsymbol{\mu}_k(\mathbf{x})$, and the variances $\sigma_k^2(\mathbf{x})$ (kept positive, e.g. via exponentials).

The gating and component functions share the hidden units of the neural network.
The splits of the input space are relaxed and can be nonlinear!
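A compact sketch of an MDN forward pass: one shared tanh hidden layer, a softmax head for $\pi(\mathbf{x})$, a linear head for the means, and an exponential head for the variances; all weights here are random and purely illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mdn_forward(x, params):
    """Map input x to mixture parameters (pi, mu, sigma2) with shared hidden units."""
    W_h, b_h, W_pi, W_mu, W_s = params
    h = np.tanh(W_h @ x + b_h)          # shared hidden layer
    pi = softmax(W_pi @ h)              # mixing coefficients pi_k(x)
    mu = W_mu @ h                       # component means mu_k(x)
    sigma2 = np.exp(W_s @ h)            # variances, kept positive via exp
    return pi, mu, sigma2

# Illustrative shapes: D = 2 inputs, H = 8 hidden units, K = 3 components.
rng = np.random.default_rng(3)
D, H, K = 2, 8, 3
params = (rng.normal(size=(H, D)), np.zeros(H),
          rng.normal(size=(K, H)), rng.normal(size=(K, H)),
          rng.normal(size=(K, H)))
pi, mu, sigma2 = mdn_forward(rng.normal(size=D), params)
print(pi.sum(), mu.shape, sigma2.min() > 0)
```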
32. Bayesian Hierarchical Mixtures of Experts
Bishop, C. M. and M. Svensén (2003). Bayesian hierarchical mixtures of experts. In U. Kjaerulff and C. Meek (Eds.), Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pp. 57–64. Morgan Kaufmann.
Application: the kinematics of robot arms.
An inverse problem with two solution branches.
Input: the parameters and angles of the robot arm.
Output: the position of the end effector of the robot arm.