Machine Learning for Data Mining
Probability Review
Andres Mendez-Vazquez
May 14, 2015
1 / 87
Outline
1 Basic Theory
Intuitive Formulation
Axioms
2 Independence
Unconditional and Conditional Probability
Posterior (Conditional) Probability
3 Random Variables
Types of Random Variables
Cumulative Distributive Function
Properties of the PMF/PDF
Expected Value and Variance
4 Statistical Decision
Statistical Decision Model
Hypothesis Testing
Estimation
2 / 87
Gerolamo Cardano: Gambling out of Darkness
Gambling
Gambling reflects humanity's millennia-old interest in quantifying the ideas of
probability, but exact mathematical descriptions of it arose much later.
Gerolamo Cardano (16th century)
While gambling he developed the following rule!!!
Equal conditions
“The most fundamental principle of all in gambling is simply equal
conditions, e.g. of opponents, of bystanders, of money, of situation, of the
dice box and of the dice itself. To the extent to which you depart from
that equity, if it is in your opponent’s favour, you are a fool, and if in your
own, you are unjust.”
4 / 87
Gerolamo Cardano’s Definition
Probability
“If therefore, someone should say, I want an ace, a deuce, or a trey, you
know that there are 27 favourable throws, and since the circuit is 36, the
rest of the throws in which these points will not turn up will be 9; the
odds will therefore be 3 to 1.”
Meaning
Probability as a ratio of favorable to all possible outcomes!!! As long as all
events are equiprobable...
Thus, we get
P(\text{All favourable throws}) = \frac{\text{Number of all favourable throws}}{\text{Number of all throws}} \quad (1)
5 / 87
Intuitive Formulation
Empiric Definition
Intuitively, the probability of an event A could be defined as:
P(A) = \lim_{n \to \infty} \frac{N(A)}{n}
Where N(A) is the number of times that event A happens in n trials.
Example
Imagine you have three dice, then
The total number of outcomes is 6^3
If we have event A = all numbers are equal, |A| = 6
Then, we have that P(A) = \frac{6}{6^3} = \frac{1}{36}
6 / 87
Axioms of Probability
Axioms
Given a sample space S of events, we have that
1 0 ≤ P(A) ≤ 1
2 P(S) = 1
3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0 for i ≠ j),
then:
P(A_1 \cup A_2 \cup \dots \cup A_n) = \sum_{i=1}^{n} P(A_i)
8 / 87
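As a quick sanity check of the axioms (a minimal Python sketch, not part of the original slides; the sample space reuses the biased-coin table that appears later in the deck):

# A finite probability space as a dict mapping outcomes to probabilities
space = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}

def prob(event):
    # P(A) is the sum of the probabilities of the outcomes in A
    return sum(space[s] for s in event)

# Axiom 2: P(S) = 1
assert abs(prob(space.keys()) - 1.0) < 1e-12

# Axiom 3: additivity for mutually exclusive events
A = {"HH", "HT"}   # head on the first toss
B = {"TT"}         # two tails, disjoint from A
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-12
print(prob(A), prob(B), prob(A | B))   # 0.6 0.16 0.76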
Set Operations
We are using
Set Notation
Thus
What Operations?
9 / 87
Example
Setup
Throw a biased coin twice; the outcome probabilities are
HH: .36   HT: .24
TH: .24   TT: .16
We have the following event
At least one head!!! Can you tell me which outcomes are part of it?
What about this one?
Tail on first toss.
10 / 87
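As a hedged illustration (a Python sketch, not from the slides), the two events can be read off from the table above:

# Outcome probabilities from the slide (biased coin, P(H) = 0.6 on each toss)
outcomes = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}

at_least_one_head = {s for s in outcomes if "H" in s}        # {HH, HT, TH}
tail_on_first_toss = {s for s in outcomes if s[0] == "T"}    # {TH, TT}

print(sum(outcomes[s] for s in at_least_one_head))   # 0.84
print(sum(outcomes[s] for s in tail_on_first_toss))  # 0.40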
We need to count!!!
We have four main methods of counting
1 Ordered samples of size r with replacement
2 Ordered samples of size r without replacement
3 Unordered samples of size r without replacement
4 Unordered samples of size r with replacement
11 / 87
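The four counting formulas above can be checked numerically; the sketch below (illustrative Python, not part of the slides) evaluates each one, using three draws from {1, ..., 6} as in the dice examples that follow.

from math import comb, perm

def ordered_with_replacement(n, r):
    return n ** r                      # n x n x ... x n

def ordered_without_replacement(n, r):
    return perm(n, r)                  # n! / (n - r)!

def unordered_without_replacement(n, r):
    return comb(n, r)                  # n! / (r! (n - r)!)

def unordered_with_replacement(n, r):
    return comb(n + r - 1, r)          # the "digit trick" count

n, r = 6, 3
print(ordered_with_replacement(n, r))       # 216
print(ordered_without_replacement(n, r))    # 120
print(unordered_without_replacement(n, r))  # 20
print(unordered_with_replacement(n, r))     # 56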
Ordered samples of size r with replacement
Definition
The number of possible sequences (a_{i_1}, ..., a_{i_r}) from n different elements is
n × n × ... × n = n^r
Example
If you throw three dice you have 6 × 6 × 6 = 216 outcomes
12 / 87
Ordered samples of size r without replacement
Definition
The number of possible sequences (a_{i_1}, ..., a_{i_r}) from n different elements is
n × (n − 1) × ... × (n − (r − 1)) = \frac{n!}{(n-r)!}
Example
The number of different numbers that can be formed if no digit can be
repeated. For example, with 4 digits and numbers of size 3, there are
4 × 3 × 2 = 24 such numbers.
13 / 87
Unordered samples of size r without replacement
Definition
Actually, we want the number of possible unordered sets.
However
We have \frac{n!}{(n-r)!} collections where we care about the order, and each
unordered set of size r corresponds to r! of them. Thus
\frac{n!/(n-r)!}{r!} = \frac{n!}{r!\,(n-r)!} = \binom{n}{r} \quad (2)
14 / 87
Unordered samples of size r with replacement
Definition
We want to count the unordered sets \{a_{i_1}, ..., a_{i_r}\} drawn with replacement
Use a digit trick for that
Look at the Board
Thus
\binom{n + r - 1}{r} \quad (3)
15 / 87
How?
Change encoding by adding more signs
Imagine all the non-decreasing strings of three numbers over {1, 2, 3}
We have
Old String    New String
111           1+0,1+1,1+2 = 123
112           1+0,1+1,2+2 = 124
113           1+0,1+1,3+2 = 125
122           1+0,2+1,2+2 = 134
123           1+0,2+1,3+2 = 135
133           1+0,3+1,3+2 = 145
222           2+0,2+1,2+2 = 234
223           2+0,2+1,3+2 = 235
233           2+0,3+1,3+2 = 245
333           3+0,3+1,3+2 = 345
16 / 87
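The encoding in the table can be reproduced programmatically; the sketch below (assumed Python, not part of the deck) adds 0, 1, 2 to the sorted entries of every multiset of size 3 over {1, 2, 3}, turning non-decreasing strings into strictly increasing ones and confirming the binomial(n + r − 1, r) count.

from itertools import combinations_with_replacement
from math import comb

n, r = 3, 3
multisets = list(combinations_with_replacement(range(1, n + 1), r))

# Non-decreasing -> strictly increasing: add 0, 1, 2 to the sorted entries
encoded = [tuple(x + i for i, x in enumerate(m)) for m in multisets]

for old, new in zip(multisets, encoded):
    print("".join(map(str, old)), "->", "".join(map(str, new)))

# The encoded strings are r-subsets of {1, ..., n + r - 1}
assert len(set(encoded)) == comb(n + r - 1, r)   # 10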
Independence
Definition
Two events A and B are independent if and only if
P(A, B) = P(A ∩ B) = P(A)P(B)
17 / 87
Example
We have two dice
Thus, we have all pairs (i, j) such that i, j = 1, 2, ..., 6
We have the following events
A ={First dice 1,2 or 3}
B = {First dice 3, 4 or 5}
C = {The sum of two faces is 9}
So, we can do
Look at the board!!! Independence between A, B, C
18 / 87
We can use this to derive the Binomial Distribution
WHAT?????
19 / 87
First, we use a sequence of n Bernoulli Trials
We have this
“Success” has a probability p.
“Failure” has a probability 1 − p.
Examples
Toss a coin independently n times.
Examine components produced on an assembly line.
Now
We take S = all 2^n ordered sequences of length n, with components
0 (failure) and 1 (success).
20 / 87
Thus, taking a sample ω
ω = 11 · · · 10 · · · 0
k 1’s followed by n − k 0’s.
We have then
P(\omega) = P\left(A_1 \cap A_2 \cap \dots \cap A_k \cap A_{k+1}^c \cap \dots \cap A_n^c\right)
= P(A_1) P(A_2) \cdots P(A_k) P\left(A_{k+1}^c\right) \cdots P\left(A_n^c\right)
= p^k (1-p)^{n-k}
Important
The number of such samples is the number of k-element subsets of the n positions.... or...
\binom{n}{k}
21 / 87
Did you notice?
We do not care where the 1’s and 0’s are
Thus all such sample points have the same probability p^k (1 − p)^{n−k}
Thus, we are looking to sum the probabilities of all those
combinations of 1’s and 0’s with k 1’s:
\sum_{\omega \text{ with } k \text{ 1's}} P(\omega)
Then
\sum_{\omega \text{ with } k \text{ 1's}} P(\omega) = \binom{n}{k} p^k (1-p)^{n-k}
22 / 87
Proving this is a probability
Sum of these probabilities is equal to 1
\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + (1-p))^n = 1
The other is simple
0 \le \binom{n}{k} p^k (1-p)^{n-k} \le 1 \quad \forall k
This is known as
The Binomial probability function!!!
23 / 87
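A quick numerical companion (a hedged Python sketch, not from the original deck) confirming that the binomial probabilities are between 0 and 1 and sum to 1:

from math import comb

def binomial_pmf(k, n, p):
    # P(exactly k successes in n Bernoulli trials)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

assert all(0.0 <= q <= 1.0 for q in probs)
assert abs(sum(probs) - 1.0) < 1e-12
print(sum(probs))   # 1.0 up to floating-point rounding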
Different Probabilities
Unconditional
This is the probability of an event A prior to arrival of any evidence, it is
denoted by P(A). For example:
P(Cavity)=0.1 means that “in the absence of any other information,
there is a 10% chance that the patient is having a cavity”.
Conditional
This is the probability of an event A given some evidence B, it is denoted
P(A|B). For example:
P(Cavity|Toothache)=0.8 means that “there is an 80% chance that
the patient has a cavity given that he has a toothache”
25 / 87
Posterior Probabilities
Relation between conditional and unconditional probabilities
Conditional probabilities can be defined in terms of unconditional probabilities:
P(A|B) = \frac{P(A, B)}{P(B)}
which generalizes to the chain rule P(A, B) = P(B)P(A|B) = P(A)P(B|A).
Law of Total Probability
If B1, B2, ..., Bn is a partition of mutually exclusive events and A is an event, then
P(A) = \sum_{i=1}^{n} P(A \cap B_i). A special case is P(A) = P(A, B) + P(A, \bar{B}).
In addition, this can be rewritten as P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i).
27 / 87
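As a small illustration of the chain rule and the law of total probability (a Python sketch with made-up numbers, not taken from the slides):

# Hypothetical partition B1, B2, B3 with P(Bi) and conditional P(A|Bi)
P_B = {"B1": 0.5, "B2": 0.3, "B3": 0.2}
P_A_given_B = {"B1": 0.9, "B2": 0.5, "B3": 0.1}

# Chain rule: P(A, Bi) = P(A|Bi) P(Bi)
P_A_and_B = {b: P_A_given_B[b] * P_B[b] for b in P_B}

# Law of total probability: P(A) = sum_i P(A|Bi) P(Bi)
P_A = sum(P_A_and_B.values())
print(P_A)   # 0.45 + 0.15 + 0.02 = 0.62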
Example
Three cards are drawn from a deck
Find the probability of not obtaining a heart
We have
52 cards
39 of them are not hearts
Define
Ai = {Card i is not a heart}. Then?
28 / 87
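By the chain rule, P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1, A2); the sketch below (illustrative Python, not the board solution from the lecture) evaluates it.

from fractions import Fraction

# Ai = {card i is not a heart}; cards are drawn without replacement
p = Fraction(39, 52) * Fraction(38, 51) * Fraction(37, 50)
print(p, float(p))   # 703/1700, approximately 0.4135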
Independence and Conditional
From here, we have that...
P(A|B) = P(A) and P(B|A) = P(B).
Conditional independence
A and B are conditionally independent given C if and only if
P(A|B, C) = P(A|C)
Example: P(WetGrass|Season, Rain) = P(WetGrass|Rain).
29 / 87
Bayes Theorem
One Version
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
Where
P(A) is the prior probability or marginal probability of A. It is
"prior" in the sense that it does not take into account any information
about B.
P(A|B) is the conditional probability of A, given B. It is also called
the posterior probability because it is derived from or depends upon
the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called
the likelihood.
P(B) is the prior or marginal probability of B, and acts as a
normalizing constant.
30 / 87
General Form of the Bayes Rule
Definition
If A1, A2, ..., An is a partition of mutually exclusive events and B any
event, then:
P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)}
where
P(B) = \sum_{j=1}^{n} P(B \cap A_j) = \sum_{j=1}^{n} P(B|A_j)P(A_j)
31 / 87
Example
Setup
Throw two unbiased dice independently.
Let
1 A ={sum of the faces =8}
2 B ={faces are equal}
Then calculate P (B|A)
Look at the board
32 / 87
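Since the computation is left to the board, here is a hedged enumeration (a Python sketch, not part of the slides) of P(B|A) = P(A ∩ B)/P(A):

from fractions import Fraction
from itertools import product

pairs = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes

A = {(i, j) for (i, j) in pairs if i + j == 8}  # sum of the faces is 8
B = {(i, j) for (i, j) in pairs if i == j}      # faces are equal

P_A = Fraction(len(A), len(pairs))              # 5/36
P_A_and_B = Fraction(len(A & B), len(pairs))    # only (4, 4): 1/36
print(P_A_and_B / P_A)                          # P(B|A) = 1/5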
Another Example
We have the following
Two coins are available, one unbiased and the other two-headed
Assume
That you have a probability of 3/4 of choosing the unbiased coin
Events
A = {head comes up}
B1 = {Unbiased coin chosen}
B2 = {Biased coin chosen}
If a head comes up, find the probability that the two-headed
coin was chosen
33 / 87
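Applying Bayes rule to this setup (a sketch, not the author's board solution; it takes P(A|B1) = 1/2 for the unbiased coin and P(A|B2) = 1 for the two-headed coin):

from fractions import Fraction

P_B1, P_B2 = Fraction(3, 4), Fraction(1, 4)       # prior on which coin is chosen
P_A_given_B1, P_A_given_B2 = Fraction(1, 2), Fraction(1)

P_A = P_A_given_B1 * P_B1 + P_A_given_B2 * P_B2   # total probability: 5/8
print(P_A_given_B2 * P_B2 / P_A)                  # P(B2|A) = 2/5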
Random Variables I
Definition
In many experiments, it is easier to deal with a summary variable than
with the original probability structure.
Example
In an opinion poll, we ask 50 people whether they agree or disagree with a
certain issue.
Suppose we record a “1” for agree and “0” for disagree.
The sample space for this experiment has 2^50 elements. Why?
Suppose we are only interested in the number of people who agree.
Define the variable X = number of “1” ’s recorded out of 50.
It is easier to deal with this sample space (it has only 51 elements).
34 / 87
Thus...
It is necessary to define a function, called a “random variable”, as follows:
X : S → R
Graphically
35 / 87
Random Variables II
How?
How is the probability function of the random variable defined from
the probability function of the original sample space?
Suppose the sample space is S = {s_1, s_2, ..., s_n}
Suppose the range of the random variable X is {x_1, x_2, ..., x_m}
Then, we observe X = x_j if and only if the outcome of the random
experiment is an s_i ∈ S such that X(s_i) = x_j, or
P(X = x_j) = P({s_i ∈ S | X(s_i) = x_j})
36 / 87
Example
Setup
Throw a coin 10 times, and let R be the number of heads.
Then
S = all sequences of length 10 with components H and T
We have for
ω =HHHHTTHTTH ⇒ R (ω) = 6
37 / 87
Example
Setup
Let R be the number of heads in two independent tosses of a coin.
Probability of head is .6
What are the probabilities?
Ω ={HH,HT,TH,TT}
Thus, we can calculate
P (R = 0) , P (R = 1) , P (R = 2)
38 / 87
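The probabilities can be enumerated directly (a small Python sketch, not from the slides); the values match the .16/.48/.36 figures used elsewhere in the deck.

from itertools import product

p_head = 0.6
P = {0: 0.0, 1: 0.0, 2: 0.0}
for tosses in product("HT", repeat=2):              # HH, HT, TH, TT
    prob = 1.0
    for t in tosses:
        prob *= p_head if t == "H" else 1.0 - p_head
    P[tosses.count("H")] += prob                    # value of R for this outcome

print(P)   # approximately {0: 0.16, 1: 0.48, 2: 0.36}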
Types of Random Variables
Discrete
A discrete random variable can assume only a countable number of values.
Continuous
A continuous random variable can assume a continuous range of values.
40 / 87
Properties
Probability Mass Function (PMF) and Probability Density Function (PDF)
The pmf /pdf of a random variable X assigns a probability for each
possible value of X.
Properties of the pmf and pdf
Some properties of the pmf:
\sum_{x} p(x) = 1 and P(a \le X \le b) = \sum_{k=a}^{b} p(k).
In a similar way for the pdf:
\int_{-\infty}^{\infty} p(x)\, dx = 1 and P(a < X < b) = \int_{a}^{b} p(t)\, dt.
41 / 87
Cumulative Distributive Function I
Cumulative Distribution Function
With every random variable, we associate a function called the
Cumulative Distribution Function (CDF), which is defined as follows:
F_X(x) = P(X \le x)
With properties:
F_X(x) \ge 0
F_X(x) is a non-decreasing function of x.
Example
If X is discrete, its CDF can be computed as follows:
F_X(x) = P(X \le x) = \sum_{x_k \le x} P(X = x_k).
44 / 87
Example: Discrete Function
[Figure: bar plot of a PMF taking the values .16, .48 and .36, together with its staircase cumulative distribution function rising to 1]
45 / 87
Cumulative Distributive Function II
Continuous Function
If X is continuous, its CDF can be computed as follows:
F(x) = \int_{-\infty}^{x} f(t)\, dt.
Remark
Based on the fundamental theorem of calculus, we have the following
equality:
p(x) = \frac{dF}{dx}(x)
Note
In the continuous case this p(x) is known as the Probability Density
Function (PDF); its discrete counterpart is the Probability Mass Function (PMF).
46 / 87
Example: Continuous Function
Setup
A number X is chosen at random between a and b
X has a uniform distribution:
f_X(x) = \frac{1}{b-a} for a \le x \le b
f_X(x) = 0 for x < a and x > b
We have
F_X(x) = P\{X \le x\} = \int_{-\infty}^{x} f_X(t)\, dt \quad (4)
P\{a < X \le b\} = \int_{a}^{b} f_X(t)\, dt \quad (5)
47 / 87
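A hedged numerical companion to equations (4) and (5) (not part of the slides), with illustrative endpoints a = 2 and b = 5:

a, b = 2.0, 5.0   # illustrative endpoints, an assumption for this sketch

def cdf(x):
    # F_X(x) for the uniform distribution on [a, b]
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

print(cdf(3.5))          # 0.5
print(cdf(b) - cdf(a))   # P{a < X <= b} = 1.0, as in equation (5)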
Graphically
[Figure: example uniform distribution — the CDF rises linearly from 0 to 1 between a and b]
48 / 87
Properties of the PMF/PDF I
Conditional PMF/PDF
We have the conditional pdf:
p(y|x) = \frac{p(x, y)}{p(x)}.
From this, we have the general chain rule
p(x1, x2, ..., xn) = p(x1|x2, ..., xn)p(x2|x3, ..., xn)...p(xn).
Independence
If X and Y are independent, then:
p(x, y) = p(x)p(y).
50 / 87
Properties of the PMF/PDF II
Law of Total Probability
p(y) = \sum_{x} p(y|x)\, p(x).
51 / 87
Expectation
Something Notable
You have two random variables R1, R2 representing how long a call lasts and
how much you pay for an international call:
if 0 ≤ R1 ≤ 3 (minutes), then R2 = 10 (cents)
if 3 < R1 ≤ 6 (minutes), then R2 = 20 (cents)
if 6 < R1 ≤ 9 (minutes), then R2 = 30 (cents)
We have then the probabilities
P {R2 = 10} = 0.6, P {R2 = 20} = 0.25, P {R2 = 30} = 0.15
If we observe N calls and N is very large
We can say that about 0.6N calls cost 10 cents each, for a total cost of
10 × 0.6N = 6N cents
53 / 87
Expectation
Similarly
{R2 = 20} =⇒ 0.25N calls and total cost 20 × 0.25N = 5N
{R2 = 30} =⇒ 0.15N calls and total cost 30 × 0.15N = 4.5N
We have then
The total cost is 6N + 5N + 4.5N = 15.5N, or on average 15.5 cents per
call
The average
\frac{10(0.6N) + 20(0.25N) + 30(0.15N)}{N} = 10(0.6) + 20(0.25) + 30(0.15) = \sum_{y} y\, P\{R_2 = y\}
54 / 87
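The averaging argument above is exactly the expected value of R2; a minimal Python sketch (not from the deck):

# Distribution of the cost R2 in cents, taken from the slide
pmf = {10: 0.60, 20: 0.25, 30: 0.15}

expected_cost = sum(y * p for y, p in pmf.items())
print(expected_cost)   # 10(0.6) + 20(0.25) + 30(0.15) = 15.5 cents per call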
Expected Value
Definition
Discrete random variable X: E(X) = \sum_{x} x\, p(x).
Continuous random variable Y: E(Y) = \int y\, p(y)\, dy.
Extension to a function g(X)
E(g(X)) = \sum_{x} g(x)\, p(x) (Discrete case).
E(g(X)) = \int_{-\infty}^{\infty} g(x)\, p(x)\, dx (Continuous case)
Linearity property
E(a f(X) + b g(Y)) = a E(f(X)) + b E(g(Y))
55 / 87
Example
Imagine the following
We have the following density:
1 f (x) = e^{-x}, x ≥ 0
2 f (x) = 0, x < 0
Find
The expected Value
56 / 87
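A worked solution, under the natural reading that f is the density of X (integration by parts):

E(X) = \int_{0}^{\infty} x e^{-x}\, dx = \left[-x e^{-x}\right]_{0}^{\infty} + \int_{0}^{\infty} e^{-x}\, dx = 0 + 1 = 1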
Variance
Definition
Var(X) = E((X − µ)^2), where µ = E(X)
Standard Deviation
The standard deviation is simply σ = \sqrt{Var(X)}.
57 / 87
Example
Suppose
You have that the number of calls made per day at a given exchange has a
Poisson distribution with an unknown parameter θ:
p(x|θ) = \frac{\theta^{x} e^{-\theta}}{x!}, \quad x = 0, 1, 2, ... \quad (6)
We need to obtain information about θ
For this, we observe that certain information is needed!!!
For example
We could need more of certain equipment if θ > θ0
We do not need it if θ ≤ θ0
59 / 87
Thus, we want to take a decision about θ
To avoid making an incorrect decision
To avoid losing money!!!
60 / 87
Ingredients of statistical decision models
First
N, the set of states
Second
A random variable or random vector X, the observable, whose distribution
Fθ depends on θ ∈ N
Third
A, the set of possible actions:
A = N = (0, ∞)
Fourth
A loss (cost) function L (θ, a), θ ∈ N, a ∈ A:
It represents the loss incurred by taking action a when the true state is θ.
61 / 87
Hypothesis Testing
Suppose
H0 and H1 two subset such that
H0 ∩ H1 = ∅
H0 ∪ H1 = N
In the telephone example
H0 = {θ|θ ≤ θ0}
H1 = {θ|θ > θ0}
In other words
“θ ∈ H0”
“θ ∈ H1”
63 / 87
Simple Hypothesis Vs. Simple Alternative
In this specific case
Each H0 and H1 contains one element, θ0 and θ1
Thus
We have that our random variable X which depends on θ:
If we are in H0, X ∼ f0
If we are in H1, X ∼ f1
Thus, the problem
It is deciding whether X has density f0 or f1
64 / 87
What do we do?
We define a function
ϕ : E → [0, 1], interpreted as the probability of rejecting H0 when x is
observed
We have then
If ϕ (x) = 1, we reject H0
If ϕ (x) = 0, we accept H0
if 0 < ϕ (x) < 1, we toss a coin with probability ϕ (x) of heads:
if the coin comes up heads, reject H0
if the coin comes up tails, accept H0
65 / 87
Thus
{x|ϕ (x) = 1}
It is called the rejection region or critical region.
And
ϕ is called a test!!!
Clearly the decision could be erroneous!!!
A type 1 error occurs if we reject H0 when H0 is true!!!
A type 2 error occurs if we accept H0 when H1 is true!!!
66 / 87
Thus the probability of error when X = x
If H0 is rejected when true
Probability of a type 1 error:
α = \int_{-\infty}^{\infty} ϕ(x)\, f_0(x)\, dx \quad (7)
If H0 is accepted when false
Probability of a type 2 error:
β = \int_{-\infty}^{\infty} (1 − ϕ(x))\, f_1(x)\, dx \quad (8)
67 / 87
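To make equations (7) and (8) concrete, the sketch below assumes, purely as an illustration (the densities are not specified at this point in the slides), f0 = N(0, 1), f1 = N(1, 1) and the threshold test ϕ(x) = 1 for x > c, then evaluates α and β with the normal CDF.

from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Phi((x - mu) / sigma) via the error function
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

c = 0.5                               # reject H0 when x > c (illustrative choice)
alpha = 1.0 - normal_cdf(c, mu=0.0)   # eq. (7): integral of phi(x) f0(x) dx
beta = normal_cdf(c, mu=1.0)          # eq. (8): integral of (1 - phi(x)) f1(x) dx
print(alpha, beta)                    # about 0.3085 each for this symmetric choice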
Actually
If the test is an indicator function, ϕ (x) = I_{Reject H0}(x) and
1 − ϕ (x) = I_{Accept H0}(x)
[Table: the four possible outcomes — retain or reject H0 against which hypothesis is actually true]
68 / 87
Problem!!!
There is not a unique answer to the question of what a good test is
Thus, we suppose there is a nonnegative cost ci associated with a type i error.
In addition, we have a prior probability p of H0 being true.
The overall average cost associated with ϕ is
B (ϕ) = p × c1 × α (ϕ) + (1 − p) × c2 × β (ϕ) (9)
69 / 87
We can do the following
The overall average cost associated with ϕ is
B (ϕ) = p × c1 × ∫_{−∞}^{∞} ϕ (x) f0 (x) dx + (1 − p) × c2 × ∫_{−∞}^{∞} (1 − ϕ (x)) f1 (x) dx
Thus
B (ϕ) = ∫_{−∞}^{∞} [pc1ϕ (x) f0 (x) + (1 − p) c2 (1 − ϕ (x)) f1 (x)] dx
      = ∫_{−∞}^{∞} [pc1ϕ (x) f0 (x) − (1 − p) c2ϕ (x) f1 (x) + (1 − p) c2f1 (x)] dx
      = ∫_{−∞}^{∞} [pc1ϕ (x) f0 (x) − (1 − p) c2ϕ (x) f1 (x)] dx + (1 − p) c2 ∫_{−∞}^{∞} f1 (x) dx
We have that, since ∫_{−∞}^{∞} f1 (x) dx = 1,
B (ϕ) = ∫_{−∞}^{∞} ϕ (x) [pc1f0 (x) − (1 − p) c2f1 (x)] dx + (1 − p) c2
70 / 87
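To make this algebra tangible, the following sketch (with the same assumed Gaussian f0, f1 and threshold test as before, plus assumed values of p, c1, c2) evaluates B (ϕ) both from the definition in (9) and from the rearranged expression derived above; the two numbers should agree up to numerical integration error.

```python
import numpy as np
from scipy.stats import norm

# Assumed problem data (illustrative only, not from the slides)
f0, f1 = norm(0.0, 1.0).pdf, norm(1.0, 1.0).pdf
p, c1, c2 = 0.6, 1.0, 2.0
phi = lambda x: (x > 0.5).astype(float)     # assumed threshold test

x = np.linspace(-10, 10, 200_001)
alpha = np.trapz(phi(x) * f0(x), x)         # type 1 error probability
beta = np.trapz((1 - phi(x)) * f1(x), x)    # type 2 error probability

# Definition, Eq. (9)
B_def = p * c1 * alpha + (1 - p) * c2 * beta
# Rearranged form derived on this slide
B_alt = np.trapz(phi(x) * (p * c1 * f0(x) - (1 - p) * c2 * f1(x)), x) + (1 - p) * c2

print(f"B from definition : {B_def:.6f}")
print(f"B from derivation : {B_alt:.6f}")   # should match up to integration error
```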
Bayes Risk
We have that...
B (ϕ) is called the Bayes risk associated with the test function ϕ
In addition
A test that minimizes B (ϕ) is called a Bayes test corresponding to the
given p, c1, c2, f0 and f1.
71 / 87
What do we want?
We want
To minimize ∫_S ϕ (x) g (x) dx
We want to find g (x)!!!
This will tell us how to select the correct hypothesis!!!
72 / 87
What do we want?
Case 1
At points x where g (x) < 0, it is best to take ϕ (x) = 1.
Case 2
At points x where g (x) > 0, it is best to take ϕ (x) = 0.
Case 3
Where g (x) = 0, ϕ (x) may be chosen arbitrarily.
73 / 87
Finally
We choose
g (x) = pc1f0 (x) − (1 − p) c2f1 (x) (10)
We look at the boundary case where g (x) = 0:
pc1f0 (x) − (1 − p) c2f1 (x) = 0
pc1f0 (x) = (1 − p) c2f1 (x)
pc1 / [(1 − p) c2] = f1 (x) / f0 (x)
74 / 87
Bayes Solution
Thus, we have
Let L (x) = f1 (x) / f0 (x)
If L (x) > pc1 / [(1 − p) c2], then take ϕ (x) = 1, i.e. reject H0.
If L (x) < pc1 / [(1 − p) c2], then take ϕ (x) = 0, i.e. accept H0.
If L (x) = pc1 / [(1 − p) c2], then ϕ (x) may be chosen arbitrarily.
75 / 87
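A minimal sketch of this decision rule (the function name, densities and numbers below are assumptions, not from the slides): compute L (x) = f1 (x)/f0 (x) and compare it with the threshold pc1 / [(1 − p) c2].

```python
from scipy.stats import norm

def bayes_test(x, f0, f1, p, c1, c2):
    """Return 1 to reject H0, 0 to accept H0 (ties resolved arbitrarily as accept)."""
    lam = (p * c1) / ((1 - p) * c2)   # Bayes threshold
    L = f1(x) / f0(x)                 # likelihood ratio
    return 1 if L > lam else 0

# Assumed example: H0 ~ N(0, 1), H1 ~ N(1, 1), p = 0.5, c1 = c2 = 1
f0, f1 = norm(0, 1).pdf, norm(1, 1).pdf
for obs in (-1.0, 0.5, 2.0):
    verdict = "reject H0" if bayes_test(obs, f0, f1, 0.5, 1.0, 1.0) else "accept H0"
    print(obs, "->", verdict)
```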
Likelihood Ratio
We have
L is called the likelihood ratio.
For the test ϕ
There is a constant 0 ≤ λ ≤ ∞ such that
ϕ (x) = 1 when L (x) > λ
ϕ (x) = 0 when L (x) < λ
Remark: This is known as the Likelihood Ratio Test (LRT).
76 / 87
Example
Let X be a discrete random variable
x ∈ {0, 1, 2, 3}
We have then
x        0     1     2     3
p0 (x)   .1    .2    .3    .4
p1 (x)   .2    .1    .4    .3
We have the following likelihood ratios (in increasing order)
x        1     3     2     0
L (x)    1/2   3/4   4/3   2
77 / 87
Example
We have the following situation
LRT λ              Reject Region    Acceptance Region    α     β
0 ≤ λ < 1/2        All x            Empty                1     0
1/2 < λ < 3/4      x = 0, 2, 3      x = 1                .8    .1
3/4 < λ < 4/3      x = 0, 2         x = 1, 3             .4    .4
4/3 < λ < 2        x = 0            x = 1, 2, 3          .1    .8
2 < λ ≤ ∞          Empty            All x                0     1
78 / 87
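This table can be reproduced mechanically. The short sketch below (illustrative only; it just reuses the pmfs of the example) computes L (x) and, for one representative λ per row, the rejection region together with the resulting α and β.

```python
from fractions import Fraction as F

p0 = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
p1 = {0: F(2, 10), 1: F(1, 10), 2: F(4, 10), 3: F(3, 10)}
L = {x: p1[x] / p0[x] for x in p0}   # likelihood ratio L(x) = p1(x)/p0(x)
print("L:", {x: str(L[x]) for x in sorted(L)})

# One representative lambda per row of the table
for lam in (F(1, 4), F(5, 8), F(1, 1), F(3, 2), F(3, 1)):
    reject = [x for x in p0 if L[x] > lam]              # LRT: reject H0 when L(x) > lam
    alpha = sum(p0[x] for x in reject)                  # P(reject | H0 true)
    beta = sum(p1[x] for x in p0 if x not in reject)    # P(accept | H1 true)
    print(f"lambda={lam}: reject {sorted(reject)}, alpha={alpha}, beta={beta}")
```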
Example
Assume λ = 3/4
Reject H0 if x = 0, 2
Accept H0 if x = 1
If x = 3, we randomize
i.e. reject H0 with probability a, 0 ≤ a ≤ 1, thus
α = p0 (0) + p0 (2) + ap0 (3) = 0.4 + 0.4a
β = p1 (1) + (1 − a) p1 (3) = 0.1 + 0.3 (1 − a)
79 / 87
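As a quick numeric check of these two formulas (a hypothetical sketch reusing the pmfs of the example), we can evaluate α and β for a few values of the randomization probability a; at a = 0 the test reduces to rejecting on {0, 2}.

```python
p0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
p1 = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}

def alpha_beta(a):
    """alpha, beta for the lambda = 3/4 test: reject {0, 2}, randomize at x = 3."""
    alpha = p0[0] + p0[2] + a * p0[3]       # = 0.4 + 0.4a
    beta = p1[1] + (1 - a) * p1[3]          # = 0.1 + 0.3(1 - a)
    return alpha, beta

for a in (0.0, 0.5, 1.0):
    alpha, beta = alpha_beta(a)
    print(f"a={a}: alpha={alpha:.2f}, beta={beta:.2f}")
```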
The Graph of B (ϕ)
Thus, we obtain a value of the risk B (ϕ) for each λ value
80 / 87
Thus, we have several tests
The classic one: the Minimax Test
The test that minimizes max {α, β}
Which
An admissible test with constant risk (α = β) is minimax
Then
We have only one test where α = β = 0.4, namely 3/4 < λ < 4/3. Thus
We reject H0 when x = 0 or 2
We accept H0 when x = 1 or 3
81 / 87
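As an illustrative sketch (not part of the slides), one can confirm this by brute force: enumerate every deterministic test on {0, 1, 2, 3} and keep the one minimizing max {α, β}.

```python
from itertools import product

p0 = [0.1, 0.2, 0.3, 0.4]
p1 = [0.2, 0.1, 0.4, 0.3]

best = None
for rejects in product([0, 1], repeat=4):   # every deterministic test on {0, 1, 2, 3}
    alpha = sum(p0[x] for x in range(4) if rejects[x] == 1)   # P(reject | H0)
    beta = sum(p1[x] for x in range(4) if rejects[x] == 0)    # P(accept | H1)
    risk = max(alpha, beta)
    if best is None or risk < best[0]:
        best = (risk, rejects, alpha, beta)

risk, rejects, alpha, beta = best
print("rejection region:", [x for x in range(4) if rejects[x]],
      "alpha =", round(alpha, 2), "beta =", round(beta, 2), "max =", round(risk, 2))
```

The search returns the rejection region {0, 2} with α = β = 0.4, matching the slide.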
Remark
From these ideas
We can work out the classics of hypothesis testing
82 / 87
Outline
1 Basic Theory
Intuitive Formulation
Axioms
2 Independence
Unconditional and Conditional Probability
Posterior (Conditional) Probability
3 Random Variables
Types of Random Variables
Cumulative Distributive Function
Properties of the PMF/PDF
Expected Value and Variance
4 Statistical Decision
Statistical Decision Model
Hypothesis Testing
Estimation
83 / 87
Introduction
Suppose
γ is a real-valued function on the set N of states of nature.
Now, when we observe X = x, we want to produce a number ψ (x) that is
close to γ (θ).
There are different ways of doing this
Maximum Likelihood (ML).
Expectation Maximization (EM).
Maximum A Posteriori (MAP).
84 / 87
Maximum Likelihood Estimation
Suppose the following
Let fθ be a density or probability mass function corresponding to the state of
nature θ.
Assume for simplicity that γ (θ) = θ
If X = x, the ML estimate of θ is the value ˆθ of θ that maximizes fθ (x).
85 / 87
Example
Let X have a binomial distribution
With parameters n and θ, 0 ≤ θ ≤ 1
The pmf is
pθ (x) = (n choose x) θ^x (1 − θ)^(n−x), with x = 0, 1, 2, ..., n
Differentiate the log-likelihood with respect to θ and set it to zero:
∂/∂θ ln pθ (x) = 0
86 / 87
Example
We get
x/θ − (n − x)/(1 − θ) = 0 =⇒ ˆθ = x/n
Now, we can regard X as a sum of independent variables
X = X1 + X2 + ... + Xn
where Xi is 1 with probability θ or 0 with probability 1 − θ
We get finally, by the Law of Large Numbers,
ˆθ (X) = (1/n) Σ_{i=1}^{n} Xi ⇒ lim_{n→∞} ˆθ (X) = E (Xi) = θ
87 / 87
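A small numerical sketch of this example (the observation, the true θ and the sample sizes below are assumed values, not from the slides): maximize the binomial log-likelihood over θ and compare with the closed form ˆθ = x/n, then check that with simulated Bernoulli data the estimate approaches the true θ as n grows.

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize_scalar

n, x = 50, 18                                   # assumed observation
# Numerical MLE: maximize log pmf over theta, i.e. minimize its negative
res = minimize_scalar(lambda th: -binom.logpmf(x, n, th),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", round(res.x, 4), " closed form x/n:", x / n)

# Consistency: theta_hat -> theta as n grows (Law of Large Numbers)
rng = np.random.default_rng(0)
theta_true = 0.3
for m in (10, 1_000, 100_000):
    sample = rng.binomial(1, theta_true, size=m)   # m Bernoulli(theta) trials
    print(f"n={m:>6}: theta_hat = {sample.mean():.4f}")
```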

More Related Content

What's hot

L03 ai - knowledge representation using logic
L03 ai - knowledge representation using logicL03 ai - knowledge representation using logic
L03 ai - knowledge representation using logic
Manjula V
 
Simulated Annealing
Simulated AnnealingSimulated Annealing
Simulated Annealing
Joy Dutta
 
Linear regression with gradient descent
Linear regression with gradient descentLinear regression with gradient descent
Linear regression with gradient descent
Suraj Parmar
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
Adnan Masood
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
swapnac12
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3Xueping Peng
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)
Tajim Md. Niamat Ullah Akhund
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
sathish sak
 
Probability Theory for Data Scientists
Probability Theory for Data ScientistsProbability Theory for Data Scientists
Probability Theory for Data Scientists
Ferdin Joe John Joseph PhD
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer PerceptronsESCOM
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
Md. Enamul Haque Chowdhury
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
Sai Kumar Kodam
 
Automata theory
Automata theoryAutomata theory
Automata theory
Pardeep Vats
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
Md. Ariful Hoque
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
Sara Hooker
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
Massimiliano Patacchiola
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
Krish_ver2
 

What's hot (20)

L03 ai - knowledge representation using logic
L03 ai - knowledge representation using logicL03 ai - knowledge representation using logic
L03 ai - knowledge representation using logic
 
Simulated Annealing
Simulated AnnealingSimulated Annealing
Simulated Annealing
 
Linear regression with gradient descent
Linear regression with gradient descentLinear regression with gradient descent
Linear regression with gradient descent
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
 
Probability Theory for Data Scientists
Probability Theory for Data ScientistsProbability Theory for Data Scientists
Probability Theory for Data Scientists
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
 
Automata theory
Automata theoryAutomata theory
Automata theory
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
 

Viewers also liked

Artificial Intelligence 06.2 More on Causality Bayesian Networks
Artificial Intelligence 06.2 More on  Causality Bayesian NetworksArtificial Intelligence 06.2 More on  Causality Bayesian Networks
Artificial Intelligence 06.2 More on Causality Bayesian Networks
Andres Mendez-Vazquez
 
Instagram
InstagramInstagram
Linux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdfLinux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdf
pangoo
 
Spanish vocabulary vh02
Spanish vocabulary vh02Spanish vocabulary vh02
Spanish vocabulary vh02
Patrick Auta
 
Heidi Pollock FOWA '07
Heidi Pollock FOWA '07Heidi Pollock FOWA '07
Heidi Pollock FOWA '07
heidipollock
 
T.K. morning
T.K. morning T.K. morning
T.K. morning
makotitob
 
Austep group general presentation usa
Austep group general presentation usaAustep group general presentation usa
Austep group general presentation usa
Ian Harris
 
The missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team PowerThe missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team Power
Luigi Buglione
 
Motion Study
Motion StudyMotion Study
Motion Study
ahmad bassiouny
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
ahmad bassiouny
 
Afm nudge
Afm nudgeAfm nudge
Algorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptesAlgorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptes
Christophe Benavent
 
Razlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelijeRazlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelije
Ivana Damnjanović
 
What's the Matter?
What's the Matter?What's the Matter?
What's the Matter?
Stephen Taylor
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
Andres Mendez-Vazquez
 
Preparation Data Structures 10 trees
Preparation Data Structures 10 treesPreparation Data Structures 10 trees
Preparation Data Structures 10 trees
Andres Mendez-Vazquez
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
Andres Mendez-Vazquez
 

Viewers also liked (20)

Artificial Intelligence 06.2 More on Causality Bayesian Networks
Artificial Intelligence 06.2 More on  Causality Bayesian NetworksArtificial Intelligence 06.2 More on  Causality Bayesian Networks
Artificial Intelligence 06.2 More on Causality Bayesian Networks
 
Instagram
InstagramInstagram
Instagram
 
Linux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdfLinux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdf
 
Sailing
SailingSailing
Sailing
 
Spanish vocabulary vh02
Spanish vocabulary vh02Spanish vocabulary vh02
Spanish vocabulary vh02
 
Heidi Pollock FOWA '07
Heidi Pollock FOWA '07Heidi Pollock FOWA '07
Heidi Pollock FOWA '07
 
Festa da ..
Festa da ..Festa da ..
Festa da ..
 
T.K. morning
T.K. morning T.K. morning
T.K. morning
 
Austep group general presentation usa
Austep group general presentation usaAustep group general presentation usa
Austep group general presentation usa
 
The missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team PowerThe missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team Power
 
Motion Study
Motion StudyMotion Study
Motion Study
 
Talent Digital
Talent DigitalTalent Digital
Talent Digital
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Afm nudge
Afm nudgeAfm nudge
Afm nudge
 
Algorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptesAlgorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptes
 
Razlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelijeRazlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelije
 
What's the Matter?
What's the Matter?What's the Matter?
What's the Matter?
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
 
Preparation Data Structures 10 trees
Preparation Data Structures 10 treesPreparation Data Structures 10 trees
Preparation Data Structures 10 trees
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
 

Similar to 02 Machine Learning - Introduction probability

03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms
Andres Mendez-Vazquez
 
Probability Assignment Help
Probability Assignment HelpProbability Assignment Help
Probability Assignment Help
Statistics Assignment Help
 
1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt
Vivek Bhartiya
 
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
tamicawaysmith
 
x13.pdf
x13.pdfx13.pdf
x13.pdf
TarikuArega1
 
M.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdfM.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdf
satyamkumarkashyap12
 
STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)Danny Cao
 
Probability theory discrete probability distribution
Probability theory discrete probability distributionProbability theory discrete probability distribution
Probability theory discrete probability distribution
samarthpawar9890
 
Chapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.docChapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.doc
Desalechali1
 
4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability
mlong24
 
Brian Prior - Probability and gambling
Brian Prior - Probability and gamblingBrian Prior - Probability and gambling
Brian Prior - Probability and gambling
onthewight
 
Unit 2 Probability
Unit 2 ProbabilityUnit 2 Probability
Unit 2 Probability
Rai University
 
Counting
CountingCounting
2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.
WeihanKhor2
 
Probability
ProbabilityProbability
Probability
Anjali Devi J S
 
Principles of Counting
Principles of CountingPrinciples of Counting
Principles of Counting
Amelita Martinez
 
Basic concepts of probability
Basic concepts of probability Basic concepts of probability
Basic concepts of probability
Long Beach City College
 
Chapter7ppt.pdf
Chapter7ppt.pdfChapter7ppt.pdf
Chapter7ppt.pdf
SohailBhatti21
 
Lecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdfLecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdf
MICAHJAMELLEICAWAT1
 
Simple probability
Simple probabilitySimple probability
Simple probability
06426345
 

Similar to 02 Machine Learning - Introduction probability (20)

03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms
 
Probability Assignment Help
Probability Assignment HelpProbability Assignment Help
Probability Assignment Help
 
1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt
 
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
 
x13.pdf
x13.pdfx13.pdf
x13.pdf
 
M.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdfM.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdf
 
STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)
 
Probability theory discrete probability distribution
Probability theory discrete probability distributionProbability theory discrete probability distribution
Probability theory discrete probability distribution
 
Chapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.docChapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.doc
 
4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability
 
Brian Prior - Probability and gambling
Brian Prior - Probability and gamblingBrian Prior - Probability and gambling
Brian Prior - Probability and gambling
 
Unit 2 Probability
Unit 2 ProbabilityUnit 2 Probability
Unit 2 Probability
 
Counting
CountingCounting
Counting
 
2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.
 
Probability
ProbabilityProbability
Probability
 
Principles of Counting
Principles of CountingPrinciples of Counting
Principles of Counting
 
Basic concepts of probability
Basic concepts of probability Basic concepts of probability
Basic concepts of probability
 
Chapter7ppt.pdf
Chapter7ppt.pdfChapter7ppt.pdf
Chapter7ppt.pdf
 
Lecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdfLecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdf
 
Simple probability
Simple probabilitySimple probability
Simple probability
 

More from Andres Mendez-Vazquez

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
Andres Mendez-Vazquez
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
Andres Mendez-Vazquez
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
Andres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
Andres Mendez-Vazquez
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
Andres Mendez-Vazquez
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
Andres Mendez-Vazquez
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
Andres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
Andres Mendez-Vazquez
 
Zetta global
Zetta globalZetta global
Zetta global
Andres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
Andres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
Andres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
Andres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
Andres Mendez-Vazquez
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
Andres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
Andres Mendez-Vazquez
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
Andres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
Andres Mendez-Vazquez
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
Andres Mendez-Vazquez
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
Andres Mendez-Vazquez
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
Andres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
manasideore6
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABSDESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
itech2017
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
anoopmanoharan2
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
veerababupersonal22
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
symbo111
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 

Recently uploaded (20)

Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABSDESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 

02 Machine Learning - Introduction probability

  • 1. Machine Learning for Data Mining Probability Review Andres Mendez-Vazquez May 14, 2015 1 / 87
  • 2. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 2 / 87
  • 3. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 3 / 87
  • 4. Gerolamo Cardano: Gambling out of Darkness Gambling Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later. Gerolamo Cardano (16th century) While gambling he developed the following rule!!! Equal conditions “The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent’s favour, you are a fool, and if in your own, you are unjust.” 4 / 87
  • 5. Gerolamo Cardano: Gambling out of Darkness Gambling Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later. Gerolamo Cardano (16th century) While gambling he developed the following rule!!! Equal conditions “The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent’s favour, you are a fool, and if in your own, you are unjust.” 4 / 87
  • 6. Gerolamo Cardano: Gambling out of Darkness Gambling Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later. Gerolamo Cardano (16th century) While gambling he developed the following rule!!! Equal conditions “The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent’s favour, you are a fool, and if in your own, you are unjust.” 4 / 87
  • 7. Gerolamo Cardano’s Definition Probability “If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.” Meaning Probability as a ratio of favorable to all possible outcomes!!! As long all events are equiprobable... Thus, we get P(All favourable throws) = Number All favourable throws Number of All throws (1) 5 / 87
  • 8. Gerolamo Cardano’s Definition Probability “If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.” Meaning Probability as a ratio of favorable to all possible outcomes!!! As long all events are equiprobable... Thus, we get P(All favourable throws) = Number All favourable throws Number of All throws (1) 5 / 87
  • 9. Gerolamo Cardano’s Definition Probability “If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.” Meaning Probability as a ratio of favorable to all possible outcomes!!! As long all events are equiprobable... Thus, we get P(All favourable throws) = Number All favourable throws Number of All throws (1) 5 / 87
  • 10. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 11. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 12. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 13. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 14. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 7 / 87
  • 15. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 16. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 17. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 18. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 19. Set Operations We are using Set Notation Thus What Operations? 9 / 87
  • 20. Set Operations We are using Set Notation Thus What Operations? 9 / 87
  • 21. Example Setup Throw a biased coin twice HH .36 HT .24 TH .24 TT .16 We have the following event At least one head!!! Can you tell me which events are part of it? What about this one? Tail on first toss. 10 / 87
  • 22. Example Setup Throw a biased coin twice HH .36 HT .24 TH .24 TT .16 We have the following event At least one head!!! Can you tell me which events are part of it? What about this one? Tail on first toss. 10 / 87
  • 23. Example Setup Throw a biased coin twice HH .36 HT .24 TH .24 TT .16 We have the following event At least one head!!! Can you tell me which events are part of it? What about this one? Tail on first toss. 10 / 87
  • 24. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 25. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 26. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 27. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 28. Ordered samples of size r with replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n × ... × n = nr Example If you throw three dices you have 6 × 6 × 6 = 216 12 / 87
  • 29. Ordered samples of size r with replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n × ... × n = nr Example If you throw three dices you have 6 × 6 × 6 = 216 12 / 87
  • 30. Ordered samples of size r without replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n − 1 × ... × (n − (r − 1)) = n! (n−r)! Example The number of different numbers that can be formed if no digit can be repeated. For example, if you have 4 digits and you want numbers of size 3. 13 / 87
  • 31. Ordered samples of size r without replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n − 1 × ... × (n − (r − 1)) = n! (n−r)! Example The number of different numbers that can be formed if no digit can be repeated. For example, if you have 4 digits and you want numbers of size 3. 13 / 87
  • 32. Unordered samples of size r without replacement Definition Actually, we want the number of possible unordered sets. However We have n! (n−r)! collections where we care about the order. Thus n! (n−r)! r! = n! r! (n − r)! = n r (2) 14 / 87
  • 33. Unordered samples of size r without replacement Definition Actually, we want the number of possible unordered sets. However We have n! (n−r)! collections where we care about the order. Thus n! (n−r)! r! = n! r! (n − r)! = n r (2) 14 / 87
  • 34. Unordered samples of size r with replacement Definition We want to find an unordered set {ai1 , ..., air } with replacement Use a digit trick for that Look at the Board Thus n + r − 1 r (3) 15 / 87
  • 35. Unordered samples of size r with replacement Definition We want to find an unordered set {ai1 , ..., air } with replacement Use a digit trick for that Look at the Board Thus n + r − 1 r (3) 15 / 87
  • 36. Unordered samples of size r with replacement Definition We want to find an unordered set {ai1 , ..., air } with replacement Use a digit trick for that Look at the Board Thus n + r − 1 r (3) 15 / 87
  • 37. How? Change encoding by adding more signs Imagine all the strings of three numbers with {1, 2, 3} We have Old String New String 111 1+0,1+1,1+2=123 112 1+0,1+1,2+2=124 113 1+0,1+1,3+2=125 122 1+0,2+1,2+2=134 123 1+0,2+1,3+2=135 133 1+0,3+1,3+2=145 222 2+0,2+1,2+2=234 223 2+0,2+1,3+2=225 233 1+0,3+1,3+2=233 333 3+0,3+1,3+2=345 16 / 87
  • 38. How? Change encoding by adding more signs Imagine all the strings of three numbers with {1, 2, 3} We have Old String New String 111 1+0,1+1,1+2=123 112 1+0,1+1,2+2=124 113 1+0,1+1,3+2=125 122 1+0,2+1,2+2=134 123 1+0,2+1,3+2=135 133 1+0,3+1,3+2=145 222 2+0,2+1,2+2=234 223 2+0,2+1,3+2=225 233 1+0,3+1,3+2=233 333 3+0,3+1,3+2=345 16 / 87
  • 39. Independence Definition Two events A and B are independent if and only if P(A, B) = P(A ∩ B) = P(A)P(B) 17 / 87
  • 40. Example We have two dice Thus, we have all pairs (i, j) such that i, j = 1, 2, 3, ..., 6 We have the following events A = {First die 1, 2 or 3} B = {First die 3, 4 or 5} C = {The sum of the two faces is 9} So, we can do Look at the board!!! Independence between A, B, C 18 / 87
  • 45. We can use this to derive the Binomial Distribution WHAT????? 19 / 87
  • 46. First, we use a sequence of n Bernoulli Trials We have this “Success” has a probability p. “Failure” has a probability 1 − p. Examples Toss a coin independently n times. Examine components produced on an assembly line. Now We take S = all 2^n ordered sequences of length n, with components 0 (failure) and 1 (success). 20 / 87
  • 51. Thus, taking a sample ω ω = 11···10···0, k 1’s followed by n − k 0’s. We have then P(ω) = P(A_1 ∩ A_2 ∩ ... ∩ A_k ∩ A^c_{k+1} ∩ ... ∩ A^c_n) = P(A_1) P(A_2) ··· P(A_k) P(A^c_{k+1}) ··· P(A^c_n) = p^k (1 − p)^{n−k} Important The number of such samples is the number of sets with k elements.... or... \binom{n}{k} 21 / 87
  • 54. Did you notice? We do not care where the 1’s and 0’s are Thus all the probabilities are equal to p^k (1 − p)^{n−k} Thus, we are looking to sum the probabilities of all those combinations of 1’s and 0’s with exactly k 1’s Then \sum_{\omega \text{ with } k \text{ 1's}} P(\omega) = \binom{n}{k} p^k (1 − p)^{n−k} 22 / 87
  • 57. Proving this is a probability Sum of these probabilities is equal to 1 \sum_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1 The other is simple 0 ≤ \binom{n}{k} p^k (1 − p)^{n−k} ≤ 1 ∀k This is known as The Binomial probability function!!! 23 / 87
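As a small illustration of the result just derived, the following sketch builds the binomial probability function and checks that it sums to 1; the values n = 10 and p = 0.3 are assumed example values, not anything from the slides.

```python
# Binomial probability function, checked to sum to 1 over k = 0, ..., n.
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(k successes in n independent Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
assert abs(sum(probs) - 1.0) < 1e-12          # (p + (1 - p))^n = 1
assert all(0 <= q <= 1 for q in probs)
print(probs[3])                                # P(exactly 3 successes) ≈ 0.2668
```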
  • 60. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 24 / 87
  • 61. Different Probabilities Unconditional This is the probability of an event A prior to the arrival of any evidence; it is denoted by P(A). For example: P(Cavity) = 0.1 means that “in the absence of any other information, there is a 10% chance that the patient has a cavity”. Conditional This is the probability of an event A given some evidence B; it is denoted P(A|B). For example: P(Cavity|Toothache) = 0.8 means that “there is an 80% chance that the patient has a cavity given that he has a toothache” 25 / 87
  • 65. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 26 / 87
  • 66. Posterior Probabilities Relation between conditional and unconditional probabilities Conditional probabilities can be defined in terms of unconditional probabilities: P(A|B) = \frac{P(A, B)}{P(B)}, which generalizes to the chain rule P(A, B) = P(B)P(A|B) = P(A)P(B|A). Law of Total Probability If B_1, B_2, ..., B_n is a partition of mutually exclusive events and A is an event, then P(A) = \sum_{i=1}^{n} P(A ∩ B_i). A special case: P(A) = P(A, B) + P(A, B^c). In addition, this can be rewritten as P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i). 27 / 87
  • 69. Example Three cards are drawn from a deck Find the probability of not obtaining a heart We have 52 cards, 39 of them not a heart Define A_i = {Card i is not a heart} Then? 28 / 87
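The chain rule P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2|A_1) P(A_3|A_1, A_2) gives the answer directly; a one-line check:

```python
# Chain-rule computation for the card example: P(no heart in 3 draws)
# with 39 non-hearts among 52 cards, drawn without replacement.
p_no_heart = (39 / 52) * (38 / 51) * (37 / 50)
print(round(p_no_heart, 4))   # ≈ 0.4135
```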
  • 72. Independence and Conditional From here, we have that... P(A|B) = P(A) and P(B|A) = P(B). Conditional independence A and B are conditionally independent given C if and only if P(A|B, C) = P(A|C) Example: P(WetGrass|Season, Rain) = P(WetGrass|Rain). 29 / 87
  • 74. Bayes Theorem One Version P(A|B) = \frac{P(B|A)P(A)}{P(B)} Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 30 / 87
  • 79. General Form of the Bayes Rule Definition If A_1, A_2, ..., A_n is a partition of mutually exclusive events and B any event, then: P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)} where P(B) = \sum_{j=1}^{n} P(B ∩ A_j) = \sum_{j=1}^{n} P(B|A_j)P(A_j) 31 / 87
  • 81. Example Setup Throw two unbiased dice independently. Let 1 A ={sum of the faces =8} 2 B ={faces are equal} Then calculate P (B|A) Look at the board 32 / 87
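A brute-force enumeration of the 36 equally likely pairs confirms the board computation; this is only a small sketch of P(B|A) = P(A ∩ B)/P(A).

```python
# Dice example: A = {sum of faces = 8}, B = {faces are equal}; compute P(B|A).
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))     # 36 equally likely pairs
A = [(i, j) for i, j in outcomes if i + j == 8]     # 5 outcomes
A_and_B = [(i, j) for i, j in A if i == j]          # only (4, 4)
print(len(A_and_B) / len(A))                        # 1/5 = 0.2
```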
  • 84. Another Example We have the following Two coins are available, one unbiased and the other two-headed Assume That you have a probability of 3/4 of choosing the unbiased one Events A = {head comes up} B_1 = {Unbiased coin chosen} B_2 = {Biased coin chosen} If a head comes up, find the probability that the two-headed coin was chosen 33 / 87
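A minimal sketch of the Bayes-rule computation for this coin example; the numbers follow directly from the setup above.

```python
# P(two-headed coin | head) via Bayes rule and the law of total probability.
p_B1, p_B2 = 3/4, 1/4                    # prior: unbiased vs two-headed coin
p_A_given_B1, p_A_given_B2 = 1/2, 1.0    # likelihood of a head for each coin
p_A = p_A_given_B1 * p_B1 + p_A_given_B2 * p_B2
print(p_A_given_B2 * p_B2 / p_A)         # 0.4
```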
  • 89. Random Variables I Definition In many experiments, it is easier to deal with a summary variable than with the original probability structure. Example In an opinion poll, we ask 50 people whether they agree or disagree with a certain issue. Suppose we record a “1” for agree and a “0” for disagree. The sample space for this experiment has 2^50 elements. Why? Suppose we are only interested in the number of people who agree. Define the variable X = number of “1” ’s recorded out of 50. It is easier to deal with this sample space (it has only 51 elements). 34 / 87
  • 96. Thus... It is necessary to define a function, the “random variable”, as follows X : S → R Graphically 35 / 87
  • 98. Random Variables II How? How is the probability function of the random variable defined from the probability function of the original sample space? Suppose the sample space is S = {s_1, s_2, ..., s_n} Suppose the range of the random variable X is {x_1, x_2, ..., x_m} Then, we observe X = x_j if and only if the outcome of the random experiment is an s ∈ S such that X(s) = x_j, or P(X = x_j) = P({s ∈ S | X(s) = x_j}) 36 / 87
  • 102. Example Setup Throw a coin 10 times, and let R be the number of heads. Then S = all sequences of length 10 with components H and T We have for ω =HHHHTTHTTH ⇒ R (ω) = 6 37 / 87
  • 105. Example Setup Let R be the number of heads in two independent tosses of a coin. Probability of head is .6 What are the probabilities? Ω ={HH,HT,TH,TT} Thus, we can calculate P (R = 0) , P (R = 1) , P (R = 2) 38 / 87
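A short sketch that carries out the calculation by summing outcome probabilities in Ω, with P(head) = 0.6 as stated above.

```python
# PMF of R = number of heads in two independent tosses with P(head) = 0.6.
p = 0.6
omega = {"HH": p * p, "HT": p * (1 - p), "TH": (1 - p) * p, "TT": (1 - p) * (1 - p)}
pmf = {r: sum(prob for seq, prob in omega.items() if seq.count("H") == r)
       for r in (0, 1, 2)}
print(pmf)   # ≈ {0: 0.16, 1: 0.48, 2: 0.36}
```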
  • 108. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 39 / 87
  • 109. Types of Random Variables Discrete A discrete random variable can assume only a countable number of values. Continuous A continuous random variable can assume a continuous range of values. 40 / 87
  • 111. Properties Probability Mass Function (PMF) and Probability Density Function (PDF) The pmf/pdf of a random variable X assigns a probability to each possible value of X. Properties of the pmf and pdf Some properties of the pmf: \sum_x p(x) = 1 and P(a ≤ X ≤ b) = \sum_{k=a}^{b} p(k). In a similar way for the pdf: \int_{-∞}^{∞} p(x)dx = 1 and P(a < X < b) = \int_a^b p(t)dt. 41 / 87
  • 117. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 43 / 87
  • 118. Cumulative Distributive Function I Cumulative Distribution Function With every random variable, we associate a function called the Cumulative Distribution Function (CDF) which is defined as follows: F_X(x) = P(X ≤ x) With properties: F_X(x) ≥ 0 F_X(x) is a non-decreasing function of x. Example If X is discrete, its CDF can be computed as follows: F_X(x) = P(X ≤ x) = \sum_{x_k ≤ x} P(X = x_k). 44 / 87
  • 123. Cumulative Distributive Function II Continuous Function If X is continuous, its CDF can be computed as follows: F(x) = \int_{-∞}^{x} f(t)dt. Remark Based on the fundamental theorem of calculus, we have the following equality: p(x) = \frac{dF}{dx}(x) Note This p(x) is known as the Probability Density Function (PDF); in the discrete case the corresponding p(x) is the Probability Mass Function (PMF). 46 / 87
  • 126. Example: Continuous Function Setup A number X is chosen at random between a and b X has a uniform distribution f_X(x) = \frac{1}{b−a} for a ≤ x ≤ b f_X(x) = 0 for x < a and x > b We have F_X(x) = P{X ≤ x} = \int_{-∞}^{x} f_X(t) dt (4) P{a < X ≤ b} = \int_a^b f_X(t) dt (5) 47 / 87
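A minimal sketch of the uniform pdf and CDF from equations (4) and (5); the endpoints a = 2 and b = 5 are arbitrary example values, not from the slides.

```python
# Uniform pdf on [a, b] and its CDF obtained by integrating the pdf.
def uniform_pdf(x: float, a: float, b: float) -> float:
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x: float, a: float, b: float) -> float:
    """F_X(x) = integral of the pdf from -infinity to x."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2.0, 5.0
print(uniform_cdf(3.5, a, b))                        # 0.5: midpoint of [a, b]
print(uniform_cdf(b, a, b) - uniform_cdf(a, a, b))   # P(a < X <= b) = 1
```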
  • 133. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 49 / 87
  • 134. Properties of the PMF/PDF I Conditional PMF/PDF We have the conditional pdf: p(y|x) = \frac{p(x, y)}{p(x)}. From this, we have the general chain rule p(x_1, x_2, ..., x_n) = p(x_1|x_2, ..., x_n) p(x_2|x_3, ..., x_n) ··· p(x_n). Independence If X and Y are independent, then: p(x, y) = p(x)p(y). 50 / 87
  • 136. Properties of the PMF/PDF II Law of Total Probability p(y) = \sum_x p(y|x)p(x). 51 / 87
  • 137. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 52 / 87
  • 138. Expectation Something Notable You have the random variables R_1, R_2 representing how long a call is and how much you pay for an international call: if 0 ≤ R_1 ≤ 3 (minutes), R_2 = 10 (cents); if 3 < R_1 ≤ 6 (minutes), R_2 = 20 (cents); if 6 < R_1 ≤ 9 (minutes), R_2 = 30 (cents) We have then the probabilities P{R_2 = 10} = 0.6, P{R_2 = 20} = 0.25, P{R_2 = 30} = 0.15 If we observe N calls and N is very large We can say that about 0.6N calls have R_2 = 10, with total cost 10 × 0.6N = 6N cents 53 / 87
  • 141. Expectation Similarly {R_2 = 20} =⇒ about 0.25N calls and total cost 5N {R_2 = 30} =⇒ about 0.15N calls and total cost 4.5N We have then The total cost is 6N + 5N + 4.5N = 15.5N, or on average 15.5 cents per call The average \frac{10(0.6N) + 20(0.25N) + 30(0.15N)}{N} = 10(0.6) + 20(0.25) + 30(0.15) = \sum_y y P{R_2 = y} 54 / 87
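The "average cost per call" computed above is exactly the expected value of R_2; as a quick check:

```python
# Expected value of R2 from its pmf: sum of y * P(R2 = y).
pmf_R2 = {10: 0.60, 20: 0.25, 30: 0.15}
expected_cost = sum(y * p for y, p in pmf_R2.items())
print(expected_cost)   # 15.5 cents per call
```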
  • 144. Expected Value Definition Discrete random variable X: E(X) = \sum_x x p(x). Continuous random variable Y: E(Y) = \int_{-∞}^{∞} y p(y) dy. Extension to a function g(X) E(g(X)) = \sum_x g(x)p(x) (Discrete case). E(g(X)) = \int_{-∞}^{∞} g(x)p(x)dx (Continuous case) Linearity property E(a f(X) + b g(Y)) = a E(f(X)) + b E(g(Y)) 55 / 87
  • 147. Example Imagine the following We have the density f(x) = e^{−x} for x ≥ 0 and f(x) = 0 for x < 0 Find The expected value 56 / 87
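For this density the expected value works out to \int_0^∞ x e^{−x} dx = 1 (integration by parts); below is a crude numeric check of that integral, offered only as a sketch.

```python
# Approximate E(X) = integral of x * e^{-x} over [0, 50] by a Riemann sum;
# the tail beyond 50 is negligible, so the result should be close to 1.
from math import exp

dx = 1e-4
approx = sum(x * exp(-x) * dx for x in (k * dx for k in range(1, int(50 / dx))))
print(round(approx, 3))   # ≈ 1.0
```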
  • 151. Variance Definition Var(X) = E((X − µ)^2) where µ = E(X) Standard Deviation The standard deviation is simply σ = \sqrt{Var(X)}. 57 / 87
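As a small worked example reusing the earlier two-toss variable R (with p = 0.6), the variance can be computed directly from the definition:

```python
# Variance and standard deviation of R = heads in two tosses with p = 0.6,
# computed from Var(X) = E((X - mu)^2) with mu = E(X).
from math import sqrt

pmf = {0: 0.16, 1: 0.48, 2: 0.36}
mu = sum(x * p for x, p in pmf.items())                  # ≈ 1.2
var = sum((x - mu) ** 2 * p for x, p in pmf.items())     # ≈ 0.48
print(mu, var, sqrt(var))                                # ≈ 1.2 0.48 0.6928
```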
  • 153. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 58 / 87
  • 154. Example Suppose You have that the number of calls made per day at a given exchange has a Poisson distribution with an unknown parameter θ: p(x|θ) = \frac{θ^x e^{−θ}}{x!}, x = 0, 1, 2, ... (6) We need to obtain information about θ For this, we observe that certain information is needed!!! For example We could need more of certain equipment if θ > θ_0 We do not need it if θ ≤ θ_0 59 / 87
  • 157. Thus, we want to take a decision about θ To avoid making an incorrect decision To avoid losing money!!! 60 / 87
  • 158. Ingredients of statistical decision models First N, the set of states of nature Second A random variable or random vector X, the observable, whose distribution F_θ depends on θ ∈ N Third A, the set of possible actions; here A = N = (0, ∞) Fourth A loss (cost) function L(θ, a), θ ∈ N, a ∈ A: it represents the loss of taking action a when the state is θ. 61 / 87
  • 162. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 62 / 87
  • 163. Hypothesis Testing Suppose H_0 and H_1 are two subsets such that H_0 ∩ H_1 = ∅ H_0 ∪ H_1 = N In the telephone example H_0 = {θ | θ ≤ θ_0} H_1 = {θ | θ > θ_0} In other words, the hypotheses are “θ ∈ H_0” versus “θ ∈ H_1” 63 / 87
  • 170. Simple Hypothesis Vs. Simple Alternative In this specific case Each of H_0 and H_1 contains one element, θ_0 and θ_1 Thus We have that our random variable X depends on θ: If we are in H_0, X ∼ f_0 If we are in H_1, X ∼ f_1 Thus, the problem is deciding whether X has density f_0 or f_1 64 / 87
  • 174. What do we do? We define a function ϕ : E → [0, 1], interpreted as the probability of rejecting H_0 when x is observed We have then If ϕ(x) = 1, we reject H_0 If ϕ(x) = 0, we accept H_0 If 0 < ϕ(x) < 1, we toss a coin with probability ϕ(x) of heads: if the coin comes up heads, reject H_0; if it comes up tails, accept H_0 65 / 87
  • 180. Thus {x | ϕ(x) = 1} It is called the rejection region or critical region. And ϕ is called a test!!! Clearly the decision could be erroneous!!! A type 1 error occurs if we reject H_0 when H_0 is true!!! A type 2 error occurs if we accept H_0 when H_1 is true!!! 66 / 87
  • 184. Thus the probability of error when X = x If H_0 is rejected when true Probability of a type 1 error α = \int_{-∞}^{∞} ϕ(x) f_0(x) dx (7) If H_0 is accepted when false Probability of a type 2 error β = \int_{-∞}^{∞} (1 − ϕ(x)) f_1(x) dx (8) 67 / 87
  • 186. Actually If the test is an indicator function, ϕ(x) = I_{Reject H_0}(x) and 1 − ϕ(x) = I_{Accept H_0}(x) (the slide shows the usual 2 × 2 table of retaining or rejecting H_0 against which hypothesis is true) 68 / 87
  • 187. Problem!!! There is not a unique answer to the question of what is a good test Thus, we suppose there is a nonnegative cost c_i associated with a type i error. In addition, we have a prior probability p of H_0 being true. The over-all average cost associated with ϕ is B(ϕ) = p × c_1 × α(ϕ) + (1 − p) × c_2 × β(ϕ) (9) 69 / 87
  • 190. We can do the following The over-all average cost associated with ϕ is B(ϕ) = p c_1 \int_{-∞}^{∞} ϕ(x) f_0(x) dx + (1 − p) c_2 \int_{-∞}^{∞} (1 − ϕ(x)) f_1(x) dx Thus B(ϕ) = \int_{-∞}^{∞} [p c_1 ϕ(x) f_0(x) + (1 − p) c_2 (1 − ϕ(x)) f_1(x)] dx = \int_{-∞}^{∞} [p c_1 ϕ(x) f_0(x) − (1 − p) c_2 ϕ(x) f_1(x) + (1 − p) c_2 f_1(x)] dx = \int_{-∞}^{∞} [p c_1 ϕ(x) f_0(x) − (1 − p) c_2 ϕ(x) f_1(x)] dx + (1 − p) c_2 \int_{-∞}^{∞} f_1(x) dx We have that B(ϕ) = \int_{-∞}^{∞} ϕ(x) [p c_1 f_0(x) − (1 − p) c_2 f_1(x)] dx + (1 − p) c_2 70 / 87
  • 193. Bayes Risk We have that... B (ϕ) is called the Bayes risk associated to the test function ϕ In addition A test that minimizes B (ϕ) is called a Bayes test corresponding to the given p, c1, c2, f0 and f1. 71 / 87
  • 195. What do we want? We want To minimize \int_S ϕ(x) g(x) dx, the ϕ-dependent part of B(ϕ) We want to find g(x)!!! This will tell us how to select the correct hypothesis!!! 72 / 87
  • 198. What do we want? Case 1 If g(x) < 0, it is best to take ϕ(x) = 1 at that x. Case 2 If g(x) > 0, it is best to take ϕ(x) = 0 at that x. Case 3 If g(x) = 0, ϕ(x) may be chosen arbitrarily. 73 / 87
  • 201. Finally We choose g(x) = p c_1 f_0(x) − (1 − p) c_2 f_1(x) (10) We look at the case where g(x) = 0: p c_1 f_0(x) − (1 − p) c_2 f_1(x) = 0 ⟺ p c_1 f_0(x) = (1 − p) c_2 f_1(x) ⟺ \frac{p c_1}{(1 − p) c_2} = \frac{f_1(x)}{f_0(x)} 74 / 87
  • 203. Bayes Solution Thus, we have Let L(x) = \frac{f_1(x)}{f_0(x)} If L(x) > \frac{p c_1}{(1−p) c_2} then take ϕ(x) = 1, i.e. reject H_0. If L(x) < \frac{p c_1}{(1−p) c_2} then take ϕ(x) = 0, i.e. accept H_0. If L(x) = \frac{p c_1}{(1−p) c_2} then ϕ(x) may be anything 75 / 87
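A minimal sketch of this decision rule in code; f0, f1, p, c1 and c2 are placeholders to be supplied by the modeler, not anything defined in the slides.

```python
# Bayes test: reject H0 when the likelihood ratio L(x) = f1(x)/f0(x)
# exceeds the threshold p*c1 / ((1 - p)*c2).
from typing import Callable

def bayes_test(x: float,
               f0: Callable[[float], float],
               f1: Callable[[float], float],
               p: float, c1: float, c2: float) -> int:
    """Return 1 to reject H0, 0 to accept (ties broken arbitrarily toward accept)."""
    threshold = p * c1 / ((1 - p) * c2)
    L = f1(x) / f0(x)
    return 1 if L > threshold else 0
```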
  • 207. Likelihood Ratio We have L is called the likelihood ratio. For the test ϕ There is a constant 0 ≤ λ ≤ ∞ such that ϕ(x) = 1 when L(x) > λ ϕ(x) = 0 when L(x) < λ Remark: This is known as the Likelihood Ratio Test (LRT) 76 / 87
  • 212. Example Let X be a discrete random variable, x ∈ {0, 1, 2, 3} We have then x: 0, 1, 2, 3; p_0(x): .1, .2, .3, .4; p_1(x): .2, .1, .4, .3 We have the following likelihood ratio (in increasing order) x: 1, 3, 2, 0; L(x): 1/2, 3/4, 4/3, 2 77 / 87
  • 215. Example We have the following situation for the LRT: for 0 ≤ λ < 1/2, reject region = all x, acceptance region = empty, α = 1, β = 0; for 1/2 < λ < 3/4, reject x = 0, 2, 3, accept x = 1, α = .8, β = .1; for 3/4 < λ < 4/3, reject x = 0, 2, accept x = 1, 3, α = .4, β = .4; for 4/3 < λ < 2, reject x = 0, accept x = 1, 2, 3, α = .1, β = .8; for 2 < λ ≤ ∞, reject region = empty, accept all x, α = 0, β = 1. 78 / 87
  • 216. Example Assume λ = 3/4 Reject H_0 if x = 0, 2 Accept H_0 if x = 1 If x = 3, we randomize, i.e. reject H_0 with probability a, 0 ≤ a ≤ 1; thus α = p_0(0) + p_0(2) + a p_0(3) = 0.4 + 0.4a β = p_1(1) + (1 − a) p_1(3) = 0.1 + 0.3(1 − a) 79 / 87
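The discrete example can be checked in a few lines; a = 1/2 below is an arbitrary randomization probability used only for illustration, and exact fractions avoid floating-point ties at L(x) = λ.

```python
# Randomized LRT for the discrete example: threshold lambda = 3/4,
# randomize with probability a at the boundary point x = 3.
from fractions import Fraction as F

p0 = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
p1 = {0: F(2, 10), 1: F(1, 10), 2: F(4, 10), 3: F(3, 10)}
lam, a = F(3, 4), F(1, 2)

def phi(x):
    """Probability of rejecting H0 when x is observed."""
    L = p1[x] / p0[x]          # likelihood ratio
    if L > lam:
        return F(1)
    if L < lam:
        return F(0)
    return a                   # L(x) = lambda: randomize

alpha = sum(phi(x) * p0[x] for x in p0)         # 0.4 + 0.4a
beta = sum((1 - phi(x)) * p1[x] for x in p1)    # 0.1 + 0.3(1 - a)
print(alpha, beta)                              # 3/5 and 1/4 when a = 1/2
```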
  • 221. The Graph of B (ϕ) Thus, we have for each λ value 80 / 87
  • 222. Thus, we have several tests The classic one: the Minimax Test The test that minimizes max{α, β} Which An admissible test with constant risk (α = β) is minimax Then We have only one test where α = β = 0.4, namely 3/4 < λ < 4/3, Thus We reject H_0 when x = 0 or 2 We accept H_0 when x = 1 or 3 81 / 87
  • 226. Remark From these ideas We can work out the classics of hypothesis testing 82 / 87
  • 227. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 83 / 87
  • 228. Introduction Suppose γ is a real-valued function on the set N of states of nature. Now, when we observe X = x, we want to produce a number ψ(x) that is close to γ(θ). There are different ways of doing this Maximum Likelihood (ML). Expectation Maximization (EM). Maximum A Posteriori (MAP) 84 / 87
  • 233. Maximum Likelihood Estimation Suppose the following Let f_θ be a density or probability function corresponding to the state of nature θ. Assume for simplicity that γ(θ) = θ If X = x, the ML estimate of θ is \hat{θ}(x), the value of θ that maximizes f_θ(x) 85 / 87
  • 236. Example Let X have a binomial distribution With parameters n and θ, 0 ≤ θ ≤ 1 The pmf p_θ(x) = \binom{n}{x} θ^x (1 − θ)^{n−x} with x = 0, 1, 2, ..., n Differentiate the log-likelihood with respect to θ and set it to zero: \frac{∂}{∂θ} \ln p_θ(x) = 0 86 / 87
  • 239. Example We get \frac{x}{θ} − \frac{n − x}{1 − θ} = 0 =⇒ \hat{θ} = \frac{x}{n} Now, we can regard X as a sum of independent variables X = X_1 + X_2 + ... + X_n where: X_i is 1 with probability θ or 0 with probability 1 − θ We get finally \hat{θ}(X) = \frac{\sum_{i=1}^{n} X_i}{n} ⇒ \lim_{n→∞} \hat{θ}(X) = E(X_i) = θ 87 / 87
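A small simulation sketch of this last point; the true θ = 0.3 and n = 10000 are assumed illustration values, and the ML estimate x/n should come out close to θ for large n.

```python
# Simulate n Bernoulli(theta) trials and form the ML estimate theta_hat = x/n.
import random

theta_true, n = 0.3, 10_000
x = sum(random.random() < theta_true for _ in range(n))   # X = X1 + ... + Xn
theta_hat = x / n
print(theta_hat)    # close to 0.3, as the last slide argues
```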