- 1. Principle of Maximum Entropy Jiawang Liu liujiawang@baidu.com 2012.6
- 2. Outline
  - What is Entropy
  - Principle of Maximum Entropy
  - Relation to Maximum Likelihood
  - MaxEnt methods and Bayesian
  - Applications: NLP (POS tagging), logistic regression
  - Q&A
- 3. What is Entropy
  - In information theory, entropy is a measure of the amount of information that is missing before reception; it is sometimes referred to as Shannon entropy.
  - Entropy quantifies uncertainty.
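As a concrete illustration of entropy as uncertainty, here is a minimal sketch in Python (not from the slides; the function name is my own):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), in bits.
    Zero-probability outcomes contribute nothing (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: 1 bit of missing information.
print(shannon_entropy([0.5, 0.5]))   # 1.0
# A biased coin carries less missing information.
print(shannon_entropy([0.9, 0.1]))
```

The more uniform the distribution, the higher the entropy, which is exactly the intuition the maximum-entropy principle builds on.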
- 4. Principle of Maximum Entropy
  - Subject to precisely stated prior data, which must be a proposition that expresses testable information, the probability distribution which best represents the current state of knowledge is the one with the largest information-theoretical entropy.
  - Why maximum entropy?
    - Minimize commitment.
    - Model all that is known and assume nothing about what is unknown.
- 5. Principle of Maximum Entropy: Overview
  - Should guarantee the uniqueness and consistency of probability assignments obtained by different methods.
  - Makes explicit our freedom in using different forms of prior data.
  - Admits the most ignorance beyond the stated prior data.
- 6. Principle of Maximum Entropy: Testable information
  - The principle of maximum entropy is useful explicitly only when applied to testable information.
  - A piece of information is testable if it can be determined whether a given distribution is consistent with it.
  - An example: the expectation of the variable x is 2.87, and p2 + p3 > 0.6.
- 7. Principle of Maximum Entropy: General solution
  - Entropy maximization with no testable information: nothing constrains the distribution.
  - Given testable information: seek the probability distribution which maximizes information entropy, subject to the constraints of the information.
  - This is a constrained optimization problem. It can typically be solved using the method of Lagrange multipliers.
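A small numerical sketch of this constrained optimization (not from the slides; names and the bisection approach are my own). The maximum-entropy solution under a mean constraint has the exponential form p_i ∝ exp(λ·x_i), so the problem reduces to finding the single Lagrange multiplier λ:

```python
import math

def maxent_dist(xs, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum-entropy distribution on the points xs subject to a fixed mean.
    The solution has the Gibbs form p_i proportional to exp(lam * x_i); the
    Lagrange multiplier lam is found by bisection, since the resulting mean
    is monotone increasing in lam."""
    def mean(lam):
        w = [math.exp(lam * x) for x in xs]
        z = sum(w)
        return sum(x * wi for x, wi in zip(xs, w)) / z
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * x) for x in xs]
    z = sum(w)
    return [wi / z for wi in w]

# Constraining the mean of {1, 2, 3} to 2 recovers the uniform distribution,
# the "most ignorant" distribution consistent with that constraint.
print(maxent_dist([1, 2, 3], 2.0))
```

Shifting the constraint (e.g. a target mean of 2.5) tilts the distribution toward the larger values while staying as uniform as the constraint allows.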
- 8. Principle of Maximum Entropy: General solution
  - Question: seek the probability distribution which maximizes information entropy, subject to some linear constraints.
  - Mathematical problem: an optimization problem, namely non-linear programming with linear constraints.
  - Idea: non-linear programming with linear constraints (introduce Lagrange multipliers) -> non-linear programming with no constraints (set the partial derivatives to 0) -> get the result.
- 9. Principle of Maximum Entropy: General solution
  - Constraints: some testable information I about a quantity x taking values in {x1, x2, ..., xn}.
  - Express this information as m constraints on the expectations of the functions fk; that is, we require our probability distribution to satisfy these expectation constraints.
  - Furthermore, the probabilities must sum to one, giving the normalization constraint.
  - Objective function: the information entropy of the distribution.
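The formulas on this slide appeared as images in the original deck; a plausible reconstruction of the standard constraint set and objective (symbols as named on the slide):

```latex
\sum_{i=1}^{n} p(x_i)\, f_k(x_i) = F_k \quad (k = 1,\dots,m),
\qquad
\sum_{i=1}^{n} p(x_i) = 1,
\qquad
H(p) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)
```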
- 10. Principle of Maximum Entropy: General solution
  - The probability distribution with maximum information entropy subject to these constraints takes an exponential (Gibbs) form.
  - The normalization constant is determined by requiring the probabilities to sum to one.
  - The λk parameters are Lagrange multipliers whose particular values are determined by the constraints.
  - These m simultaneous equations do not generally possess a closed-form solution and are usually solved by numerical methods.
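The solution formulas on this slide were also images; a plausible reconstruction of the standard result, consistent with the constraints above:

```latex
p(x_i) = \frac{1}{Z(\lambda_1,\dots,\lambda_m)}
         \exp\Big(\sum_{k=1}^{m} \lambda_k f_k(x_i)\Big),
\qquad
Z = \sum_{i=1}^{n} \exp\Big(\sum_{k=1}^{m} \lambda_k f_k(x_i)\Big),
\qquad
F_k = \frac{\partial}{\partial \lambda_k} \log Z
```

The last relation is the system of m simultaneous equations the slide refers to: choosing the λk so that the model expectations match the given Fk.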
- 11. Principle of Maximum Entropy: Training the model
  - Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972)
  - Improved Iterative Scaling (IIS) (Della Pietra et al., 1995)
- 12. Principle of Maximum Entropy: Training the model
  - Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972)
    - Compute the empirical expectations dj, j = 1, ..., k+1.
    - Initialize the parameters (any values, e.g., 0).
    - Repeat until convergence: for each j, compute the model expectation, compute the update factor, and update the parameter.
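A minimal sketch of the GIS loop above for an unconditional model (not the slides' code; the data layout and names are my own, and the slack-feature handling is simplified by using the maximum row sum as the GIS constant C):

```python
import math

def gis(feats, emp_exp, n_iters=50):
    """Generalized Iterative Scaling sketch (Darroch and Ratcliff, 1972).
    feats[i][j]  = value of feature j on outcome i (non-negative)
    emp_exp[j]   = empirical expectation d_j of feature j
    GIS assumes the features sum to a constant C on every outcome; here
    C is taken as the maximum row sum (a common simplification)."""
    n, k = len(feats), len(feats[0])
    C = max(sum(row) for row in feats)
    lam = [0.0] * k
    for _ in range(n_iters):
        # Current model distribution: p(i) proportional to exp(sum_j lam_j f_j(i))
        scores = [math.exp(sum(l * f for l, f in zip(lam, row))) for row in feats]
        z = sum(scores)
        p = [s / z for s in scores]
        # Model expectation of each feature, then the multiplicative GIS update
        for j in range(k):
            model_exp = sum(p[i] * feats[i][j] for i in range(n))
            lam[j] += math.log(emp_exp[j] / model_exp) / C
    return lam
```

With two outcomes carrying one indicator feature each and empirical expectations [0.7, 0.3], the fitted model reproduces exactly those probabilities.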
- 13. Principle of Maximum Entropy: Training the model
  - The running time of each GIS iteration is O(NPA), where:
    - N: the training set size
    - P: the number of classes
    - A: the average number of features that are active for a given event (a, b)
- 14. Principle of Maximum Entropy: Relation to Maximum Likelihood
  - Likelihood function: p(x) is the distribution to be estimated; the empirical distribution is obtained from the training data.
  - The log-likelihood function is the logarithm of the likelihood.
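The likelihood formulas on this slide were images; a plausible reconstruction in the standard notation, with \(\tilde{p}\) the empirical distribution and N the sample size:

```latex
L(p) = \prod_{x} p(x)^{N\,\tilde{p}(x)}
\qquad\Longrightarrow\qquad
\log L(p) \;\propto\; \sum_{x} \tilde{p}(x) \log p(x)
```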
- 15. Principle of Maximum Entropy: Relation to Maximum Likelihood
  - Theorem: the model p*C with maximum entropy is the model in the parametric family p(y|x) that maximizes the likelihood of the training sample.
  - Coincidence?
    - Entropy: the measure of uncertainty.
    - Likelihood: the degree of agreement with the knowledge.
    - Maximum entropy: assume nothing about what is unknown.
    - Maximum likelihood: impartially represent the knowledge.
    - Knowledge is the complement of uncertainty, so the two principles coincide.
- 16. Principle of Maximum Entropy: MaxEnt methods and Bayesian
  - Bayesian methods: p(H|DI) = p(H|I) p(D|HI) / p(D|I), where
    - H stands for some hypothesis whose truth we want to judge,
    - D for a set of data,
    - I for prior information.
  - Difference
    - A single application of Bayes' theorem gives us only a probability, not a probability distribution.
    - MaxEnt necessarily gives us a probability distribution, not just a probability.
- 17. Principle of Maximum Entropy: MaxEnt methods and Bayesian
  - Difference (continued)
    - Bayes' theorem cannot determine the numerical value of any probability directly from our information. To apply it, one must first use some other principle to translate information into numerical values for p(H|I), p(D|HI), and p(D|I).
    - MaxEnt does not require as input the numerical values of any probabilities on the hypothesis space.
  - In common
    - Both describe the updating of a state of knowledge.
    - Bayes' rule and MaxEnt are completely compatible and can be seen as special cases of the method of MaxEnt (Giffin et al., 2007).
- 18. Applications: Maximum Entropy Model
  - NLP: POS tagging, parsing, PP attachment, text classification, language modeling, ...
  - POS tagging: features and model
- 19. Applications: Maximum Entropy Model
  - POS tagging with a MaxEnt model
    - The model gives the conditional probability of a tag sequence t1, ..., tn given a sentence w1, ..., wn and contexts C1, ..., Cn.
  - Model estimation
    - The model should reflect the data: use the data to constrain the model.
    - What form should the constraints take? Constrain the expected value of each feature.
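The model equation on this slide was an image; a plausible reconstruction of the usual MaxEnt tagger factorization (the feature functions f_j and weights λ_j are the standard names, my notation):

```latex
P(t_1,\dots,t_n \mid w_1,\dots,w_n) = \prod_{i=1}^{n} p(t_i \mid C_i),
\qquad
p(t \mid C) = \frac{1}{Z(C)} \exp\Big(\sum_j \lambda_j f_j(C, t)\Big)
```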
- 20. Applications: Maximum Entropy Model
  - POS tagging: the constraints
    - The expected value of each feature must satisfy some constraint Ki.
    - A natural choice for Ki is the average empirical count, derived from the training data (C1, t1), (C2, t2), ..., (Cn, tn).
- 21. Applications: Maximum Entropy Model
  - POS tagging: the MaxEnt model
    - The constraints do not uniquely identify a model.
    - The maximum entropy model is the most uniform model: it makes no assumptions in addition to what we know from the data.
    - Set the weights to give the MaxEnt model satisfying the constraints, using Generalised Iterative Scaling (GIS).
  - Smoothing
    - Empirical counts for low-frequency features can be unreliable.
    - A common smoothing technique is to ignore low-frequency features.
    - Alternatively, use a prior distribution on the parameters.
- 22. Applications: Maximum Entropy Model
  - Logistic regression: classification
    - Linear regression for classification
    - The problems of linear regression for classification
- 23. Applications: Maximum Entropy Model
  - Logistic regression: hypothesis representation
    - What function is used to represent our hypothesis in classification?
    - We want our classifier to output values between 0 and 1.
    - When using linear regression we had hθ(x) = θT x.
    - For the classification hypothesis we use hθ(x) = g(θT x), where g(z) = 1 / (1 + e-z) for real z.
    - This is the sigmoid function, or the logistic function.
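The hypothesis above is two lines of Python (a sketch, not the slides' code; function names are my own):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}); maps the reals to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Classification hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0))   # 0.5: maximal uncertainty at the decision boundary
```

Outputs above 0.5 are read as class 1, below 0.5 as class 0.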
- 24. Applications: Maximum Entropy Model
  - Logistic regression: cost function
    - Linear regression determines θ by minimizing the squared error cost(hθ(xi), yi) = 1/2 (hθ(xi) - yi)2, summed into J(θ).
    - This J(θ) does not work for logistic regression, since with the sigmoid hypothesis it is a non-convex function of θ.
- 25. Applications: Maximum Entropy Model
  - Logistic regression: a convex cost function exists for logistic regression.
- 26. Applications: Maximum Entropy Model
  - Logistic regression: simplified cost function
    - For binary classification problems, y is always 0 or 1.
    - So we can write the cost function as cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x)).
    - In summary, the cost function J(θ) for the parameters θ is the average of this cost over the training examples.
    - Find the parameters θ which minimize J(θ).
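The simplified cost function, written out directly (a sketch, not the slides' code; names are my own):

```python
import math

def cost(hx, y):
    """Per-example logistic loss: -y log(h) - (1 - y) log(1 - h)."""
    return -y * math.log(hx) - (1 - y) * math.log(1 - hx)

def J(theta, X, ys):
    """J(theta): average logistic loss over the training set,
    with h_theta(x) = sigmoid(theta^T x)."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    total = 0.0
    for x, y in zip(X, ys):
        hx = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += cost(hx, y)
    return total / len(X)
```

Note the behavior the slide relies on: a confident correct prediction (h near y) costs almost nothing, while a confident wrong prediction is penalized without bound.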
- 27. Applications: Maximum Entropy Model
  - Logistic regression: minimizing the cost function
    - Use gradient descent to minimize J(θ).
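A self-contained batch gradient descent sketch for the cost above (not the slides' code; the toy data, step size, and iteration count are my own choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, ys, alpha=0.5, n_iters=2000):
    """Batch gradient descent for logistic regression.
    Update rule: theta_j -= alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(n_iters):
        grad = [0.0] * n
        for x, y in zip(X, ys):
            err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
            for j in range(n):
                grad[j] += err * x[j]
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

# Toy one-feature data with an intercept column: label 1 iff x > 2.
X  = [[1, 0], [1, 1], [1, 3], [1, 4]]   # column 0 is the intercept
ys = [0, 0, 1, 1]
theta = gradient_descent(X, ys)
```

After training, the learned decision boundary separates the two groups, so points on either side get the expected label.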
- 28. Applications: Maximum Entropy Model
  - Logistic regression: advanced optimization
    - Good for large machine learning problems (e.g., a huge feature set).
    - What is gradient descent actually doing? Compute J(θ) and its derivatives, then plug these values into the gradient descent update.
    - Alternatively, instead of gradient descent we could minimize the cost function with:
      - Conjugate gradient
      - BFGS (Broyden-Fletcher-Goldfarb-Shanno)
      - L-BFGS (limited-memory BFGS)
- 29. Applications: Maximum Entropy Model
  - Logistic regression: why do we choose this function when other cost functions exist?
    - This cost function can be derived from statistics using the principle of maximum likelihood estimation. Note this does mean there is an underlying Gaussian assumption relating to the distribution of features.
    - It also has the nice property that it is convex.
- 30. Q&A Thanks!
- 31. References
  - Jaynes, E. T., 1988, "The Relation of Bayesian and Maximum Entropy Methods", in Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1), Kluwer Academic Publishers, pp. 25-26.
  - https://www.coursera.org/course/ml
  - The Elements of Statistical Learning, Section 4.4.
  - Kitamura, Y., 2006, "Empirical Likelihood Methods in Econometrics: Theory and Practice", Cowles Foundation Discussion Papers 1569, Cowles Foundation, Yale University.
  - http://en.wikipedia.org/wiki/Principle_of_maximum_entropy
  - Lazar, N., 2003, "Bayesian Empirical Likelihood", Biometrika, 90, 319-326.
  - Giffin, A. and Caticha, A., 2007, "Updating Probabilities with Data and Moments".
  - Guiasu, S. and Shenitzer, A., 1985, "The Principle of Maximum Entropy", The Mathematical Intelligencer, 7(1), 42-48.
  - Harremoës, P. and Topsøe, F., 2001, "Maximum Entropy Fundamentals", Entropy, 3(3), 191-226.
  - http://en.wikipedia.org/wiki/Logistic_regression
