# Dm week01 prob-refresher.handout

Published in: Education

1. Christof Monz, Informatics Institute, University of Amsterdam. Data Mining, Week 1: Probabilities Refresher

   Today’s Class

   • Quick refresher of probabilities
   • Essential Information Theory
   • Calculus in one slide
2. Probabilities: Refresher

   Experiment (trial): a repeatable procedure with well-defined possible outcomes.

   Sample space (S): the set of all possible outcomes (finite or infinite).

   • Example: coin toss experiment, possible outcomes: S = {heads, tails}
   • Example: die toss experiment, possible outcomes: S = {1,2,3,4,5,6}

   Probabilities: Sample Space

   The definition of the sample space depends on what we are asking.

   Example: die toss experiment for whether the number is even or odd

   • possible outcomes: {even, odd}
   • not {1,2,3,4,5,6}
3. Probabilities: Definitions

   An event is any subset of outcomes from the sample space.

   Example: let A represent the event that the outcome of the die toss experiment is divisible by 3

   • A = {3,6}
   • A is a subset of the sample space S = {1,2,3,4,5,6}

   Example: suppose the sample space is S = {heart, spade, club, diamond} (deck of cards)

   • let A represent the event of drawing a heart: A = {heart}
   • let B represent the event of drawing a red card: B = {heart, diamond}

   Probability Function

   The probability law assigns to each event A a nonnegative number P(A), called the probability of A.

   P(A) encodes our knowledge or belief about the collective likelihood of all the elements of A.

   The probability law must satisfy certain properties.
4. Probability Axioms

   • Non-negativity: P(A) ≥ 0 for every event A
   • Additivity: if A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B)
   • Normalization: the probability of the entire sample space S is equal to 1, i.e. P(S) = 1

   Probabilities: Example

   An experiment involving a single coin toss. There are two possible outcomes, H and T, i.e. the sample space is S = {H,T}. If the coin is fair, one should assign equal probabilities to the 2 outcomes:

   • P({H}) = 0.5
   • P({T}) = 0.5
   • P({H,T}) = P({H}) + P({T}) = 1.0
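The three axioms can be checked mechanically on the fair-coin example. The following is a minimal sketch, assuming a finite sample space stored as a dictionary of outcome probabilities; the names `outcome_prob` and `prob` are illustrative and do not come from the slides.

```python
from fractions import Fraction

# Fair-coin probability law from the example: each outcome gets 1/2.
outcome_prob = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
sample_space = frozenset(outcome_prob)


def prob(event):
    """P(event): sum of the probabilities of the outcomes contained in the event."""
    return sum((outcome_prob[o] for o in event), Fraction(0))


heads = frozenset({"H"})
tails = frozenset({"T"})

# Non-negativity: P(A) >= 0 for every event (here: all four subsets of S).
non_negative = all(prob(e) >= 0 for e in (frozenset(), heads, tails, sample_space))

# Additivity: heads and tails are disjoint, so P(heads ∪ tails) = P(heads) + P(tails).
additive = prob(heads | tails) == prob(heads) + prob(tails)

# Normalization: P(S) = 1.
normalized = prob(sample_space) == 1

print(non_negative, additive, normalized)  # True True True
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.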
5. Probabilities: Example II

   Experiment involving 3 coin tosses. The outcome is a 3-long string of H or T:

   S = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}

   Assume each outcome is equiprobable (“uniform distribution”). What is the probability of the event that exactly 2 heads occur?

   A = {HHT,HTH,THH}

   $$P(A) = P(\{HHT\}) + P(\{HTH\}) + P(\{THH\}) = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8}$$

   Joint and Conditional Probabilities

   The joint probability P(A,B) is the probability of two events (A and B) occurring together.

   The conditional probability P(A|B): assume event B is the case; what is the probability of event A being the case as well?

   Note: in general $P(A \mid B) \neq P(A,B)$ (the two are not necessarily equal).

   Definition:

   $$P(A \mid B) = \frac{P(A,B)}{P(B)}$$

   $P(A,B) = P(B,A)$, but in general $P(A \mid B) \neq P(B \mid A)$.
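The three-coin-toss example and the definition of conditional probability can be reproduced by enumerating the sample space. This is a sketch under the uniform-distribution assumption stated above; the second event, `first_is_head`, is a hypothetical event B introduced only to illustrate conditioning.

```python
from fractions import Fraction
from itertools import product

# Sample space for three tosses: all length-3 strings over {H, T}.
sample_space = ["".join(tosses) for tosses in product("HT", repeat=3)]


def prob(event):
    """P(event) under the uniform distribution on sample_space."""
    return Fraction(len(event), len(sample_space))


# Event A from the example: exactly two heads occur.
exactly_two_heads = {s for s in sample_space if s.count("H") == 2}
# Hypothetical event B (not in the slides): the first toss is a head.
first_is_head = {s for s in sample_space if s[0] == "H"}

p_a = prob(exactly_two_heads)
p_b = prob(first_is_head)
p_joint = prob(exactly_two_heads & first_is_head)  # P(A,B)

p_a_given_b = p_joint / p_b  # P(A|B) = P(A,B) / P(B)
p_b_given_a = p_joint / p_a  # P(B|A) = P(A,B) / P(A)

print(p_a)          # 3/8
print(p_joint)      # 1/4
print(p_a_given_b)  # 1/2
print(p_b_given_a)  # 2/3
```

For this pair of events P(A|B), P(B|A), and P(A,B) all differ, which is the point of the two notes in the slide.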
6. Bayes’ Rule

   Chain rule: P(A,B) = P(A|B)P(B) = P(B|A)P(A)

   Bayes’ rule lets us swap the order of dependence between events:

   $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

   Determining Probabilities

   So far we have assumed that the values that P assigns to events are given.

   Determining P is an important part of machine learning.

   In an empirical setting, P is often estimated by using relative frequencies:

   • $P(A) = \frac{\mathrm{freq}(A)}{N}$, where freq(A) is the frequency of A in a sample set, and N is the size of the sample set
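Relative-frequency estimation and Bayes’ rule can be combined in a few lines. The sketch below uses a made-up sample of eight observations (purely hypothetical data, not from the slides) and checks that Bayes’ rule gives the same P(A|B) as the direct definition.

```python
from fractions import Fraction

# Hypothetical sample of N = 8 observations, invented for illustration.
# Each pair (a, b) records whether event A and event B occurred.
sample = [
    (True, True), (True, True), (True, False), (False, True),
    (False, False), (False, False), (False, True), (True, True),
]
n = len(sample)


def rel_freq(predicate):
    """Relative-frequency estimate freq / N over the sample."""
    return Fraction(sum(1 for obs in sample if predicate(obs)), n)


p_a = rel_freq(lambda obs: obs[0])              # P(A)
p_b = rel_freq(lambda obs: obs[1])              # P(B)
p_ab = rel_freq(lambda obs: obs[0] and obs[1])  # P(A,B)

p_b_given_a = p_ab / p_a                        # P(B|A) from the definition
p_a_given_b_direct = p_ab / p_b                 # P(A|B) from the definition
p_a_given_b_bayes = p_b_given_a * p_a / p_b     # P(A|B) via Bayes' rule

print(p_a, p_b, p_ab)                         # 1/2 5/8 3/8
print(p_a_given_b_direct, p_a_given_b_bayes)  # 3/5 3/5
```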
7. Entropy

   Entropy measures the amount of uncertainty in a variable (the variable ranges over points in the sample space).

   The amount of uncertainty is commonly measured in bits:

   $$H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

   Entropy: Example

   Let x represent the result of rolling a (fair) 8-sided die:

   $$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = -\sum_{x \in X} \frac{1}{8} \log_2 \frac{1}{8} = -\sum_{x \in X} \frac{1}{8} \cdot (-3) = 3$$

   Each equiprobable outcome can be represented by 3 bits:

   | value | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
   |-------|-----|-----|-----|-----|-----|-----|-----|-----|
   | code | 001 | 010 | 011 | 100 | 101 | 110 | 111 | 000 |
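The entropy formula translates directly into code. This is a minimal sketch; the function name `entropy` and the convention of skipping zero-probability outcomes (treating 0 · log 0 as 0) are assumptions added here, not something the slides state.

```python
import math


def entropy(probs):
    """Shannon entropy in bits: -sum of p * log2(p) over the given probabilities.

    Outcomes with p == 0 are skipped, following the usual 0 * log 0 = 0 convention.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


fair_d8 = [1 / 8] * 8  # fair 8-sided die from the example
fair_coin = [0.5, 0.5]

print(entropy(fair_d8))    # 3.0
print(entropy(fair_coin))  # 1.0
```

The fair 8-sided die needs 3 bits per outcome, matching the 3-bit codes in the table above.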
8. Entropy: Better Encoding

   If the probability distribution is not uniform, one can achieve lower entropy.

   Example: consider an unfair 4-sided die

   | value | probability |
   |-------|-------------|
   | 1 | 0.5 |
   | 2 | 0.125 |
   | 3 | 0.125 |
   | 4 | 0.25 |

   $$H(X) = -\left(0.5 \log_2 0.5 + 0.25 \log_2 0.25 + 2 \cdot 0.125 \log_2 0.125\right) = 1.75$$

   Entropy: Better Encoding

   | value | probability | code1 | code2 |
   |-------|-------------|-------|-------|
   | 1 | 0.5 | 00 | 0 |
   | 2 | 0.125 | 01 | 110 |
   | 3 | 0.125 | 10 | 111 |
   | 4 | 0.25 | 11 | 10 |

   Average number of bits (the middle term combines values 2 and 3, whose probabilities sum to 0.25):

   • code1: 0.5 · 2 bits + 0.25 · 2 bits + 0.25 · 2 bits = 2 bits
   • code2: 0.5 · 1 bit + 0.25 · 3 bits + 0.25 · 2 bits = 1.75 bits
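The average code lengths and the entropy of the unfair die can be recomputed from the table. A small sketch, with the dictionaries `probs`, `code1`, and `code2` transcribed from the slide and the helper `average_length` introduced here for illustration:

```python
import math

# Unfair 4-sided die and the two codes from the table.
probs = {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.25}
code1 = {1: "00", 2: "01", 3: "10", 4: "11"}
code2 = {1: "0", 2: "110", 3: "111", 4: "10"}


def average_length(code):
    """Expected number of bits per value: sum of p(value) * len(codeword)."""
    return sum(probs[value] * len(code[value]) for value in probs)


# Entropy of the die, including the minus sign and both 0.125 terms.
die_entropy = -sum(p * math.log2(p) for p in probs.values())

print(average_length(code1))  # 2.0
print(average_length(code2))  # 1.75
print(die_entropy)            # 1.75
```

Here the average length of code2 equals the entropy, so for this distribution the variable-length code wastes no bits.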
9. Entropy: Saving Bits

   Coding tree: how many yes-no questions must be asked to determine each message?

   [Figure: coding tree with leaves 0, 10, 110, 111]

   In general, the optimal number of bits can be computed as:

   $$-\log_2 p(x) \text{ bits for each message } x \in X$$

   or:

   $$\log_2 \frac{1}{p(x)} \text{ bits for each message } x \in X$$

   Tiny Calculus Refresher

   Derivative: the (first) derivative of a function allows us to compute the rate of change at any point.

   Rate of change: slope of the tangent.

   For multi-variable functions we compute the partial derivatives for each variable separately.

   Derivatives are computed by applying differentiation rules:

   • $\frac{\partial}{\partial x}(\varphi + \psi) = \frac{\partial}{\partial x}\varphi + \frac{\partial}{\partial x}\psi$
   • $\frac{\partial}{\partial x}\, c x^n = c n x^{n-1}$
   • $\frac{\partial}{\partial x} f(g(x)) = f'(g(x))\, g'(x)$ (chain rule)
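The differentiation rules can be sanity-checked numerically by comparing them against a finite-difference estimate of the slope. This is only a sketch: the central-difference helper, the constants `c` and `n`, and the example functions `f` and `g` are all hypothetical choices made for illustration.

```python
import math


def numeric_derivative(func, x, h=1e-6):
    """Central-difference approximation of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


c, n = 3.0, 4  # hypothetical constants for the power rule


def power(x):
    return c * x ** n


def power_rule(x):
    # d/dx c*x^n = c*n*x^(n-1)
    return c * n * x ** (n - 1)


def g(x):
    return 3 * x + 1  # inner function, g'(x) = 3


def f(u):
    return u ** 2  # outer function, f'(u) = 2u


def composite(x):
    return f(g(x))


def chain_rule(x):
    # d/dx f(g(x)) = f'(g(x)) * g'(x)
    return 2 * g(x) * 3


def sum_rule(x):
    # d/dx (power + composite) = power' + composite'
    return power_rule(x) + chain_rule(x)


x0 = 2.0
print(power_rule(x0), chain_rule(x0))  # 96.0 42.0
print(math.isclose(numeric_derivative(power, x0), power_rule(x0), rel_tol=1e-6))
print(math.isclose(numeric_derivative(composite, x0), chain_rule(x0), rel_tol=1e-6))
```

The finite-difference values agree with the rules only up to a small numerical error, which is why the comparison uses a tolerance instead of exact equality.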
10. Recap

    • Probability distributions (joint, conditional)
    • Bayes’ rule
    • Entropy