Machine Learning for Data Mining
Probability Review
Andres Mendez-Vazquez
May 14, 2015
1 / 87
Outline
1 Basic Theory
Intuitive Formulation
Axioms
2 Independence
Unconditional and Conditional Probability
Posterior (Conditional) Probability
3 Random Variables
Types of Random Variables
Cumulative Distributive Function
Properties of the PMF/PDF
Expected Value and Variance
4 Statistical Decision
Statistical Decision Model
Hypothesis Testing
Estimation
2 / 87
Gerolamo Cardano: Gambling out of Darkness
Gambling
Gambling reflects humanity's millennia-old interest in quantifying the ideas of
probability, but exact mathematical descriptions of it arose much later.
Gerolamo Cardano (16th century)
While gambling he developed the following rule!!!
Equal conditions
“The most fundamental principle of all in gambling is simply equal
conditions, e.g. of opponents, of bystanders, of money, of situation, of the
dice box and of the dice itself. To the extent to which you depart from
that equity, if it is in your opponent’s favour, you are a fool, and if in your
own, you are unjust.”
4 / 87
Gerolamo Cardano’s Definition
Probability
“If therefore, someone should say, I want an ace, a deuce, or a trey, you
know that there are 27 favourable throws, and since the circuit is 36, the
rest of the throws in which these points will not turn up will be 9; the
odds will therefore be 3 to 1.”
Meaning
Probability as a ratio of favorable to all possible outcomes!!! As long as all
events are equiprobable...
Thus, we get
P(\text{All favourable throws}) = \frac{\text{Number of all favourable throws}}{\text{Number of all throws}} \quad (1)
5 / 87
Intuitive Formulation
Empiric Definition
Intuitively, the probability of an event A could be defined as:
P(A) = \lim_{n \to \infty} \frac{N(A)}{n}
Where N(A) is the number of times that event A happens in n trials.
Example
Imagine you have three dice, then
The total number of outcomes is 6^3
If we have event A = all numbers are equal, |A| = 6
Then, we have that P(A) = \frac{6}{6^3} = \frac{1}{36}
6 / 87
Axioms of Probability
Axioms
Given a sample space S of events, we have that
1 0 ≤ P(A) ≤ 1
2 P(S) = 1
3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0 for i ≠ j),
then:
P(A_1 \cup A_2 \cup \dots \cup A_n) = \sum_{i=1}^{n} P(A_i)
8 / 87
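As a quick sanity check of the axioms (a minimal Python sketch, not part of the original slides; the sample space reuses the biased-coin table that appears later in the deck):

# A finite probability space as a dict mapping outcomes to probabilities
space = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}

def prob(event):
    # P(A) is the sum of the probabilities of the outcomes in A
    return sum(space[s] for s in event)

# Axiom 2: P(S) = 1
assert abs(prob(space.keys()) - 1.0) < 1e-12

# Axiom 3: additivity for mutually exclusive events
A = {"HH", "HT"}   # head on the first toss
B = {"TT"}         # two tails, disjoint from A
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-12
print(prob(A), prob(B), prob(A | B))   # 0.6 0.16 0.76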
Set Operations
We are using
Set Notation
Thus
What Operations?
9 / 87
Example
Setup
Throw a biased coin twice; the outcome probabilities are
HH: .36   HT: .24
TH: .24   TT: .16
We have the following event
At least one head!!! Can you tell me which outcomes are part of it?
What about this one?
Tail on first toss.
10 / 87
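As a hedged illustration (a Python sketch, not from the slides), the two events can be read off from the table above:

# Outcome probabilities from the slide (biased coin, P(H) = 0.6 on each toss)
outcomes = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}

at_least_one_head = {s for s in outcomes if "H" in s}        # {HH, HT, TH}
tail_on_first_toss = {s for s in outcomes if s[0] == "T"}    # {TH, TT}

print(sum(outcomes[s] for s in at_least_one_head))   # 0.84
print(sum(outcomes[s] for s in tail_on_first_toss))  # 0.40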
We need to count!!!
We have four main methods of counting
1 Ordered samples of size r with replacement
2 Ordered samples of size r without replacement
3 Unordered samples of size r without replacement
4 Unordered samples of size r with replacement
11 / 87
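The four counting formulas above can be checked numerically; the sketch below (illustrative Python, not part of the slides) evaluates each one, using three draws from {1, ..., 6} as in the dice examples that follow.

from math import comb, perm

def ordered_with_replacement(n, r):
    return n ** r                      # n x n x ... x n

def ordered_without_replacement(n, r):
    return perm(n, r)                  # n! / (n - r)!

def unordered_without_replacement(n, r):
    return comb(n, r)                  # n! / (r! (n - r)!)

def unordered_with_replacement(n, r):
    return comb(n + r - 1, r)          # the "digit trick" count

n, r = 6, 3
print(ordered_with_replacement(n, r))       # 216
print(ordered_without_replacement(n, r))    # 120
print(unordered_without_replacement(n, r))  # 20
print(unordered_with_replacement(n, r))     # 56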
Ordered samples of size r with replacement
Definition
The number of possible sequences (a_{i_1}, ..., a_{i_r}) from n different elements is
n × n × ... × n = n^r
Example
If you throw three dice you have 6 × 6 × 6 = 216 outcomes
12 / 87
Ordered samples of size r without replacement
Definition
The number of possible sequences (a_{i_1}, ..., a_{i_r}) from n different elements is
n × (n − 1) × ... × (n − (r − 1)) = \frac{n!}{(n-r)!}
Example
The number of different numbers that can be formed if no digit can be
repeated. For example, with 4 digits and numbers of size 3, there are
4 × 3 × 2 = 24 such numbers.
13 / 87
Unordered samples of size r without replacement
Definition
Actually, we want the number of possible unordered sets.
However
We have \frac{n!}{(n-r)!} collections where we care about the order, and each
unordered set of size r corresponds to r! of them. Thus
\frac{n!/(n-r)!}{r!} = \frac{n!}{r!\,(n-r)!} = \binom{n}{r} \quad (2)
14 / 87
Unordered samples of size r with replacement
Definition
We want to count the unordered sets \{a_{i_1}, ..., a_{i_r}\} drawn with replacement
Use a digit trick for that
Look at the Board
Thus
\binom{n + r - 1}{r} \quad (3)
15 / 87
How?
Change encoding by adding more signs
Imagine all the non-decreasing strings of three numbers over {1, 2, 3}
We have
Old String    New String
111           1+0,1+1,1+2 = 123
112           1+0,1+1,2+2 = 124
113           1+0,1+1,3+2 = 125
122           1+0,2+1,2+2 = 134
123           1+0,2+1,3+2 = 135
133           1+0,3+1,3+2 = 145
222           2+0,2+1,2+2 = 234
223           2+0,2+1,3+2 = 235
233           2+0,3+1,3+2 = 245
333           3+0,3+1,3+2 = 345
16 / 87
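The encoding in the table can be reproduced programmatically; the sketch below (assumed Python, not part of the deck) adds 0, 1, 2 to the sorted entries of every multiset of size 3 over {1, 2, 3}, turning non-decreasing strings into strictly increasing ones and confirming the binomial(n + r − 1, r) count.

from itertools import combinations_with_replacement
from math import comb

n, r = 3, 3
multisets = list(combinations_with_replacement(range(1, n + 1), r))

# Non-decreasing -> strictly increasing: add 0, 1, 2 to the sorted entries
encoded = [tuple(x + i for i, x in enumerate(m)) for m in multisets]

for old, new in zip(multisets, encoded):
    print("".join(map(str, old)), "->", "".join(map(str, new)))

# The encoded strings are r-subsets of {1, ..., n + r - 1}
assert len(set(encoded)) == comb(n + r - 1, r)   # 10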
Independence
Definition
Two events A and B are independent if and only if
P(A, B) = P(A ∩ B) = P(A)P(B)
17 / 87
Example
We have two dice
Thus, we have all pairs (i, j) such that i, j = 1, 2, ..., 6
We have the following events
A ={First dice 1,2 or 3}
B = {First dice 3, 4 or 5}
C = {The sum of two faces is 9}
So, we can do
Look at the board!!! Independence between A, B, C
18 / 87
We can use this to derive the Binomial Distribution
WHAT?????
19 / 87
First, we use a sequence of n Bernoulli Trials
We have this
“Success” has a probability p.
“Failure” has a probability 1 − p.
Examples
Toss a coin independently n times.
Examine components produced on an assembly line.
Now
We take S = all 2^n ordered sequences of length n, with components
0 (failure) and 1 (success).
20 / 87
Thus, taking a sample ω
ω = 11 · · · 10 · · · 0
k 1’s followed by n − k 0’s.
We have then
P(\omega) = P\left(A_1 \cap A_2 \cap \dots \cap A_k \cap A_{k+1}^c \cap \dots \cap A_n^c\right)
= P(A_1) P(A_2) \cdots P(A_k) P\left(A_{k+1}^c\right) \cdots P\left(A_n^c\right)
= p^k (1-p)^{n-k}
Important
The number of such samples is the number of k-element subsets of the n positions.... or...
\binom{n}{k}
21 / 87
Did you notice?
We do not care where the 1’s and 0’s are
Thus all such sample points have the same probability p^k (1 − p)^{n−k}
Thus, we are looking to sum the probabilities of all those
combinations of 1’s and 0’s with k 1’s:
\sum_{\omega \text{ with } k \text{ 1's}} P(\omega)
Then
\sum_{\omega \text{ with } k \text{ 1's}} P(\omega) = \binom{n}{k} p^k (1-p)^{n-k}
22 / 87
Proving this is a probability
Sum of these probabilities is equal to 1
\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + (1-p))^n = 1
The other is simple
0 \le \binom{n}{k} p^k (1-p)^{n-k} \le 1 \quad \forall k
This is known as
The Binomial probability function!!!
23 / 87
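A quick numerical companion (a hedged Python sketch, not from the original deck) confirming that the binomial probabilities are between 0 and 1 and sum to 1:

from math import comb

def binomial_pmf(k, n, p):
    # P(exactly k successes in n Bernoulli trials)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

assert all(0.0 <= q <= 1.0 for q in probs)
assert abs(sum(probs) - 1.0) < 1e-12
print(sum(probs))   # 1.0 up to floating-point rounding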
Different Probabilities
Unconditional
This is the probability of an event A prior to arrival of any evidence, it is
denoted by P(A). For example:
P(Cavity)=0.1 means that “in the absence of any other information,
there is a 10% chance that the patient is having a cavity”.
Conditional
This is the probability of an event A given some evidence B, it is denoted
P(A|B). For example:
P(Cavity|Toothache)=0.8 means that “there is an 80% chance that
the patient has a cavity given that he has a toothache”
25 / 87
Posterior Probabilities
Relation between conditional and unconditional probabilities
Conditional probabilities can be defined in terms of unconditional probabilities:
P(A|B) = \frac{P(A, B)}{P(B)}
which generalizes to the chain rule P(A, B) = P(B)P(A|B) = P(A)P(B|A).
Law of Total Probability
If B1, B2, ..., Bn is a partition of mutually exclusive events and A is an event, then
P(A) = \sum_{i=1}^{n} P(A \cap B_i). A special case is P(A) = P(A, B) + P(A, \bar{B}).
In addition, this can be rewritten as P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i).
27 / 87
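As a small illustration of the chain rule and the law of total probability (a Python sketch with made-up numbers, not taken from the slides):

# Hypothetical partition B1, B2, B3 with P(Bi) and conditional P(A|Bi)
P_B = {"B1": 0.5, "B2": 0.3, "B3": 0.2}
P_A_given_B = {"B1": 0.9, "B2": 0.5, "B3": 0.1}

# Chain rule: P(A, Bi) = P(A|Bi) P(Bi)
P_A_and_B = {b: P_A_given_B[b] * P_B[b] for b in P_B}

# Law of total probability: P(A) = sum_i P(A|Bi) P(Bi)
P_A = sum(P_A_and_B.values())
print(P_A)   # 0.45 + 0.15 + 0.02 = 0.62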
Example
Three cards are drawn from a deck
Find the probability of not obtaining a heart
We have
52 cards
39 of them are not hearts
Define
Ai = {Card i is not a heart}. Then?
28 / 87
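By the chain rule, P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1, A2); the sketch below (illustrative Python, not the board solution from the lecture) evaluates it.

from fractions import Fraction

# Ai = {card i is not a heart}; cards are drawn without replacement
p = Fraction(39, 52) * Fraction(38, 51) * Fraction(37, 50)
print(p, float(p))   # 703/1700, approximately 0.4135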
Independence and Conditional
From here, we have that...
P(A|B) = P(A) and P(B|A) = P(B).
Conditional independence
A and B are conditionally independent given C if and only if
P(A|B, C) = P(A|C)
Example: P(WetGrass|Season, Rain) = P(WetGrass|Rain).
29 / 87
Bayes Theorem
One Version
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
Where
P(A) is the prior probability or marginal probability of A. It is
"prior" in the sense that it does not take into account any information
about B.
P(A|B) is the conditional probability of A, given B. It is also called
the posterior probability because it is derived from or depends upon
the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called
the likelihood.
P(B) is the prior or marginal probability of B, and acts as a
normalizing constant.
30 / 87
General Form of the Bayes Rule
Definition
If A1, A2, ..., An is a partition of mutually exclusive events and B any
event, then:
P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)}
where
P(B) = \sum_{j=1}^{n} P(B \cap A_j) = \sum_{j=1}^{n} P(B|A_j)P(A_j)
31 / 87
Example
Setup
Throw two unbiased dice independently.
Let
1 A ={sum of the faces =8}
2 B ={faces are equal}
Then calculate P (B|A)
Look at the board
32 / 87
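Since the computation is left to the board, here is a hedged enumeration (a Python sketch, not part of the slides) of P(B|A) = P(A ∩ B)/P(A):

from fractions import Fraction
from itertools import product

pairs = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes

A = {(i, j) for (i, j) in pairs if i + j == 8}  # sum of the faces is 8
B = {(i, j) for (i, j) in pairs if i == j}      # faces are equal

P_A = Fraction(len(A), len(pairs))              # 5/36
P_A_and_B = Fraction(len(A & B), len(pairs))    # only (4, 4): 1/36
print(P_A_and_B / P_A)                          # P(B|A) = 1/5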
Another Example
We have the following
Two coins are available, one unbiased and the other two-headed
Assume
That you have a probability of 3/4 of choosing the unbiased coin
Events
A = {head comes up}
B1 = {Unbiased coin chosen}
B2 = {Biased coin chosen}
If a head comes up, find the probability that the two-headed
coin was chosen
33 / 87
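Applying Bayes rule to this setup (a sketch, not the author's board solution; it takes P(A|B1) = 1/2 for the unbiased coin and P(A|B2) = 1 for the two-headed coin):

from fractions import Fraction

P_B1, P_B2 = Fraction(3, 4), Fraction(1, 4)       # prior on which coin is chosen
P_A_given_B1, P_A_given_B2 = Fraction(1, 2), Fraction(1)

P_A = P_A_given_B1 * P_B1 + P_A_given_B2 * P_B2   # total probability: 5/8
print(P_A_given_B2 * P_B2 / P_A)                  # P(B2|A) = 2/5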
Random Variables I
Definition
In many experiments, it is easier to deal with a summary variable than
with the original probability structure.
Example
In an opinion poll, we ask 50 people whether they agree or disagree with a
certain issue.
Suppose we record a “1” for agree and “0” for disagree.
The sample space for this experiment has 2^50 elements. Why?
Suppose we are only interested in the number of people who agree.
Define the variable X = number of “1” ’s recorded out of 50.
It is easier to deal with this sample space (it has only 51 elements).
34 / 87
Thus...
It is necessary to define a function, called a “random variable”, as follows:
X : S → R
Graphically
35 / 87
Random Variables II
How?
How is the probability function of the random variable defined from
the probability function of the original sample space?
Suppose the sample space is S = {s_1, s_2, ..., s_n}
Suppose the range of the random variable X is {x_1, x_2, ..., x_m}
Then, we observe X = x_j if and only if the outcome of the random
experiment is an s_i ∈ S such that X(s_i) = x_j, or
P(X = x_j) = P({s_i ∈ S | X(s_i) = x_j})
36 / 87
Example
Setup
Throw a coin 10 times, and let R be the number of heads.
Then
S = all sequences of length 10 with components H and T
We have for
ω =HHHHTTHTTH ⇒ R (ω) = 6
37 / 87
Example
Setup
Let R be the number of heads in two independent tosses of a coin.
Probability of head is .6
What are the probabilities?
Ω ={HH,HT,TH,TT}
Thus, we can calculate
P (R = 0) , P (R = 1) , P (R = 2)
38 / 87
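The probabilities can be enumerated directly (a small Python sketch, not from the slides); the values match the .16/.48/.36 figures used elsewhere in the deck.

from itertools import product

p_head = 0.6
P = {0: 0.0, 1: 0.0, 2: 0.0}
for tosses in product("HT", repeat=2):              # HH, HT, TH, TT
    prob = 1.0
    for t in tosses:
        prob *= p_head if t == "H" else 1.0 - p_head
    P[tosses.count("H")] += prob                    # value of R for this outcome

print(P)   # approximately {0: 0.16, 1: 0.48, 2: 0.36}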
Types of Random Variables
Discrete
A discrete random variable can assume only a countable number of values.
Continuous
A continuous random variable can assume a continuous range of values.
40 / 87
Properties
Probability Mass Function (PMF) and Probability Density Function (PDF)
The pmf /pdf of a random variable X assigns a probability for each
possible value of X.
Properties of the pmf and pdf
Some properties of the pmf:
\sum_{x} p(x) = 1 and P(a \le X \le b) = \sum_{k=a}^{b} p(k).
In a similar way for the pdf:
\int_{-\infty}^{\infty} p(x)\, dx = 1 and P(a < X < b) = \int_{a}^{b} p(t)\, dt.
41 / 87
Cumulative Distributive Function I
Cumulative Distribution Function
With every random variable, we associate a function called the
Cumulative Distribution Function (CDF), which is defined as follows:
F_X(x) = P(X \le x)
With properties:
F_X(x) \ge 0
F_X(x) is a non-decreasing function of x.
Example
If X is discrete, its CDF can be computed as follows:
F_X(x) = P(X \le x) = \sum_{x_k \le x} P(X = x_k).
44 / 87
Example: Discrete Function
[Figure: bar plot of a PMF taking the values .16, .48 and .36, together with its staircase cumulative distribution function rising to 1]
45 / 87
Cumulative Distributive Function II
Continuous Function
If X is continuous, its CDF can be computed as follows:
F(x) = \int_{-\infty}^{x} f(t)\, dt.
Remark
Based on the fundamental theorem of calculus, we have the following
equality:
p(x) = \frac{dF}{dx}(x)
Note
In the continuous case this p(x) is known as the Probability Density
Function (PDF); its discrete counterpart is the Probability Mass Function (PMF).
46 / 87
Example: Continuous Function
Setup
A number X is chosen at random between a and b
X has a uniform distribution:
f_X(x) = \frac{1}{b-a} for a \le x \le b
f_X(x) = 0 for x < a and x > b
We have
F_X(x) = P\{X \le x\} = \int_{-\infty}^{x} f_X(t)\, dt \quad (4)
P\{a < X \le b\} = \int_{a}^{b} f_X(t)\, dt \quad (5)
47 / 87
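A hedged numerical companion to equations (4) and (5) (not part of the slides), with illustrative endpoints a = 2 and b = 5:

a, b = 2.0, 5.0   # illustrative endpoints, an assumption for this sketch

def cdf(x):
    # F_X(x) for the uniform distribution on [a, b]
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

print(cdf(3.5))          # 0.5
print(cdf(b) - cdf(a))   # P{a < X <= b} = 1.0, as in equation (5)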
Graphically
[Figure: example uniform distribution — the CDF rises linearly from 0 to 1 between a and b]
48 / 87
Properties of the PMF/PDF I
Conditional PMF/PDF
We have the conditional pdf:
p(y|x) = \frac{p(x, y)}{p(x)}.
From this, we have the general chain rule
p(x1, x2, ..., xn) = p(x1|x2, ..., xn)p(x2|x3, ..., xn)...p(xn).
Independence
If X and Y are independent, then:
p(x, y) = p(x)p(y).
50 / 87
Properties of the PMF/PDF II
Law of Total Probability
p(y) = \sum_{x} p(y|x)\, p(x).
51 / 87
Expectation
Something Notable
You have two random variables R1, R2 representing how long a call lasts and
how much you pay for an international call:
if 0 ≤ R1 ≤ 3 (minutes), then R2 = 10 (cents)
if 3 < R1 ≤ 6 (minutes), then R2 = 20 (cents)
if 6 < R1 ≤ 9 (minutes), then R2 = 30 (cents)
We have then the probabilities
P {R2 = 10} = 0.6, P {R2 = 20} = 0.25, P {R2 = 30} = 0.15
If we observe N calls and N is very large
We can say that about 0.6N calls cost 10 cents each, for a total cost of
10 × 0.6N = 6N cents
53 / 87
Expectation
Similarly
{R2 = 20} =⇒ 0.25N calls and total cost 20 × 0.25N = 5N
{R2 = 30} =⇒ 0.15N calls and total cost 30 × 0.15N = 4.5N
We have then
The total cost is 6N + 5N + 4.5N = 15.5N, or on average 15.5 cents per
call
The average
\frac{10(0.6N) + 20(0.25N) + 30(0.15N)}{N} = 10(0.6) + 20(0.25) + 30(0.15) = \sum_{y} y\, P\{R_2 = y\}
54 / 87
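The averaging argument above is exactly the expected value of R2; a minimal Python sketch (not from the deck):

# Distribution of the cost R2 in cents, taken from the slide
pmf = {10: 0.60, 20: 0.25, 30: 0.15}

expected_cost = sum(y * p for y, p in pmf.items())
print(expected_cost)   # 10(0.6) + 20(0.25) + 30(0.15) = 15.5 cents per call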
Expected Value
Definition
Discrete random variable X: E(X) = \sum_{x} x\, p(x).
Continuous random variable Y: E(Y) = \int y\, p(y)\, dy.
Extension to a function g(X)
E(g(X)) = \sum_{x} g(x)\, p(x) (Discrete case).
E(g(X)) = \int_{-\infty}^{\infty} g(x)\, p(x)\, dx (Continuous case)
Linearity property
E(a f(X) + b g(Y)) = a E(f(X)) + b E(g(Y))
55 / 87
Example
Imagine the following
We have the following density:
1 f (x) = e^{-x}, x ≥ 0
2 f (x) = 0, x < 0
Find
The expected Value
56 / 87
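A worked solution, under the natural reading that f is the density of X (integration by parts):

E(X) = \int_{0}^{\infty} x e^{-x}\, dx = \left[-x e^{-x}\right]_{0}^{\infty} + \int_{0}^{\infty} e^{-x}\, dx = 0 + 1 = 1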
Variance
Definition
Var(X) = E((X − µ)^2), where µ = E(X)
Standard Deviation
The standard deviation is simply σ = \sqrt{Var(X)}.
57 / 87
Example
Suppose
You have that the number of calls made per day at a given exchange has a
Poisson distribution with an unknown parameter θ:
p(x|θ) = \frac{\theta^{x} e^{-\theta}}{x!}, \quad x = 0, 1, 2, ... \quad (6)
We need to obtain information about θ
For this, we observe that certain information is needed!!!
For example
We could need more of certain equipment if θ > θ0
We do not need it if θ ≤ θ0
59 / 87
Thus, we want to take a decision about θ
To avoid making an incorrect decision
To avoid losing money!!!
60 / 87
Ingredients of statistical decision models
First
N, the set of states
Second
A random variable or random vector X, the observable, whose distribution
Fθ depends on θ ∈ N
Third
A, the set of possible actions:
A = N = (0, ∞)
Fourth
A loss (cost) function L (θ, a), θ ∈ N, a ∈ A:
It represents the loss incurred by taking action a when the true state is θ.
61 / 87
Hypothesis Testing
Suppose
H0 and H1 two subset such that
H0 ∩ H1 = ∅
H0 ∪ H1 = N
In the telephone example
H0 = {θ|θ ≤ θ0}
H1 = {θ|θ > θ0}
In other words
“θ ∈ H0”
“θ ∈ H1”
63 / 87
Simple Hypothesis Vs. Simple Alternative
In this specific case
Each H0 and H1 contains one element, θ0 and θ1
Thus
We have that our random variable X which depends on θ:
If we are in H0, X ∼ f0
If we are in H1, X ∼ f1
Thus, the problem
It is deciding whether X has density f0 or f1
64 / 87
What do we do?
We define a function
ϕ : E → [0, 1], interpreted as the probability of rejecting H0 when x is
observed
We have then
If ϕ (x) = 1, we reject H0
If ϕ (x) = 0, we accept H0
if 0 < ϕ (x) < 1, we toss a coin with probability ϕ (x) of heads:
if the coin comes up heads, reject H0
if the coin comes up tails, accept H0
65 / 87
Thus
{x|ϕ (x) = 1}
It is called the rejection region or critical region.
And
ϕ is called a test!!!
Clearly the decision could be erroneous!!!
A type 1 error occurs if we reject H0 when H0 is true!!!
A type 2 error occurs if we accept H0 when H1 is true!!!
66 / 87
Thus the probability of error when X = x
If H0 is rejected when true
Probability of a type 1 error:
α = \int_{-\infty}^{\infty} ϕ(x)\, f_0(x)\, dx \quad (7)
If H0 is accepted when false
Probability of a type 2 error:
β = \int_{-\infty}^{\infty} (1 − ϕ(x))\, f_1(x)\, dx \quad (8)
67 / 87
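To make equations (7) and (8) concrete, the sketch below assumes, purely as an illustration (the densities are not specified at this point in the slides), f0 = N(0, 1), f1 = N(1, 1) and the threshold test ϕ(x) = 1 for x > c, then evaluates α and β with the normal CDF.

from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Phi((x - mu) / sigma) via the error function
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

c = 0.5                               # reject H0 when x > c (illustrative choice)
alpha = 1.0 - normal_cdf(c, mu=0.0)   # eq. (7): integral of phi(x) f0(x) dx
beta = normal_cdf(c, mu=1.0)          # eq. (8): integral of (1 - phi(x)) f1(x) dx
print(alpha, beta)                    # about 0.3085 each for this symmetric choice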
Actually
If the test is an indicator function, ϕ (x) = I_{Reject H0}(x) and
1 − ϕ (x) = I_{Accept H0}(x)
[Table: the four possible outcomes — retain or reject H0 against which hypothesis is actually true]
68 / 87
Problem!!!
There is not a unique answer to the question of what a good test is
Thus, we suppose there is a nonnegative cost ci associated with a type i error.
In addition, we have a prior probability p of H0 being true.
The overall average cost associated with ϕ is
B (ϕ) = p × c1 × α (ϕ) + (1 − p) × c2 × β (ϕ) (9)
69 / 87
We can do the following
The overall average cost associated with ϕ is
B (ϕ) = p × c1 × ∫_{−∞}^{∞} ϕ (x) f0 (x) dx + (1 − p) × c2 × ∫_{−∞}^{∞} (1 − ϕ (x)) f1 (x) dx
Thus
B (ϕ) = ∫_{−∞}^{∞} [pc1ϕ (x) f0 (x) + (1 − p) c2 (1 − ϕ (x)) f1 (x)] dx
      = ∫_{−∞}^{∞} [pc1ϕ (x) f0 (x) − (1 − p) c2ϕ (x) f1 (x) + (1 − p) c2f1 (x)] dx
      = ∫_{−∞}^{∞} [pc1ϕ (x) f0 (x) − (1 − p) c2ϕ (x) f1 (x)] dx + (1 − p) c2 ∫_{−∞}^{∞} f1 (x) dx
We have that, since ∫_{−∞}^{∞} f1 (x) dx = 1,
B (ϕ) = ∫_{−∞}^{∞} ϕ (x) [pc1f0 (x) − (1 − p) c2f1 (x)] dx + (1 − p) c2
70 / 87
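To make this algebra tangible, the following sketch (with the same assumed Gaussian f0, f1 and threshold test as before, plus assumed values of p, c1, c2) evaluates B (ϕ) both from the definition in (9) and from the rearranged expression derived above; the two numbers should agree up to numerical integration error.

```python
import numpy as np
from scipy.stats import norm

# Assumed problem data (illustrative only, not from the slides)
f0, f1 = norm(0.0, 1.0).pdf, norm(1.0, 1.0).pdf
p, c1, c2 = 0.6, 1.0, 2.0
phi = lambda x: (x > 0.5).astype(float)     # assumed threshold test

x = np.linspace(-10, 10, 200_001)
alpha = np.trapz(phi(x) * f0(x), x)         # type 1 error probability
beta = np.trapz((1 - phi(x)) * f1(x), x)    # type 2 error probability

# Definition, Eq. (9)
B_def = p * c1 * alpha + (1 - p) * c2 * beta
# Rearranged form derived on this slide
B_alt = np.trapz(phi(x) * (p * c1 * f0(x) - (1 - p) * c2 * f1(x)), x) + (1 - p) * c2

print(f"B from definition : {B_def:.6f}")
print(f"B from derivation : {B_alt:.6f}")   # should match up to integration error
```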
Bayes Risk
We have that...
B (ϕ) is called the Bayes risk associated with the test function ϕ
In addition
A test that minimizes B (ϕ) is called a Bayes test corresponding to the
given p, c1, c2, f0 and f1.
71 / 87
What do we want?
We want
To minimize ∫_S ϕ (x) g (x) dx
We want to find g (x)!!!
This will tell us how to select the correct hypothesis!!!
72 / 87
What do we want?
Case 1
At points x where g (x) < 0, it is best to take ϕ (x) = 1.
Case 2
At points x where g (x) > 0, it is best to take ϕ (x) = 0.
Case 3
Where g (x) = 0, ϕ (x) may be chosen arbitrarily.
73 / 87
Finally
We choose
g (x) = pc1f0 (x) − (1 − p) c2f1 (x) (10)
We look at the boundary case where g (x) = 0:
pc1f0 (x) − (1 − p) c2f1 (x) = 0
pc1f0 (x) = (1 − p) c2f1 (x)
pc1 / [(1 − p) c2] = f1 (x) / f0 (x)
74 / 87
Bayes Solution
Thus, we have
Let L (x) = f1 (x) / f0 (x)
If L (x) > pc1 / [(1 − p) c2], then take ϕ (x) = 1, i.e. reject H0.
If L (x) < pc1 / [(1 − p) c2], then take ϕ (x) = 0, i.e. accept H0.
If L (x) = pc1 / [(1 − p) c2], then ϕ (x) may be chosen arbitrarily.
75 / 87
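A minimal sketch of this decision rule (the function name, densities and numbers below are assumptions, not from the slides): compute L (x) = f1 (x)/f0 (x) and compare it with the threshold pc1 / [(1 − p) c2].

```python
from scipy.stats import norm

def bayes_test(x, f0, f1, p, c1, c2):
    """Return 1 to reject H0, 0 to accept H0 (ties resolved arbitrarily as accept)."""
    lam = (p * c1) / ((1 - p) * c2)   # Bayes threshold
    L = f1(x) / f0(x)                 # likelihood ratio
    return 1 if L > lam else 0

# Assumed example: H0 ~ N(0, 1), H1 ~ N(1, 1), p = 0.5, c1 = c2 = 1
f0, f1 = norm(0, 1).pdf, norm(1, 1).pdf
for obs in (-1.0, 0.5, 2.0):
    verdict = "reject H0" if bayes_test(obs, f0, f1, 0.5, 1.0, 1.0) else "accept H0"
    print(obs, "->", verdict)
```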
Likelihood Ratio
We have
L is called the likelihood ratio.
For the test ϕ
There is a constant 0 ≤ λ ≤ ∞ such that
ϕ (x) = 1 when L (x) > λ
ϕ (x) = 0 when L (x) < λ
Remark: This is known as the Likelihood Ratio Test (LRT).
76 / 87
Example
Let X be a discrete random variable
x ∈ {0, 1, 2, 3}
We have then
x        0     1     2     3
p0 (x)   .1    .2    .3    .4
p1 (x)   .2    .1    .4    .3
We have the following likelihood ratios (in increasing order)
x        1     3     2     0
L (x)    1/2   3/4   4/3   2
77 / 87
Example
We have the following situation
LRT λ              Reject Region    Acceptance Region    α     β
0 ≤ λ < 1/2        All x            Empty                1     0
1/2 < λ < 3/4      x = 0, 2, 3      x = 1                .8    .1
3/4 < λ < 4/3      x = 0, 2         x = 1, 3             .4    .4
4/3 < λ < 2        x = 0            x = 1, 2, 3          .1    .8
2 < λ ≤ ∞          Empty            All x                0     1
78 / 87
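This table can be reproduced mechanically. The short sketch below (illustrative only; it just reuses the pmfs of the example) computes L (x) and, for one representative λ per row, the rejection region together with the resulting α and β.

```python
from fractions import Fraction as F

p0 = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
p1 = {0: F(2, 10), 1: F(1, 10), 2: F(4, 10), 3: F(3, 10)}
L = {x: p1[x] / p0[x] for x in p0}   # likelihood ratio L(x) = p1(x)/p0(x)
print("L:", {x: str(L[x]) for x in sorted(L)})

# One representative lambda per row of the table
for lam in (F(1, 4), F(5, 8), F(1, 1), F(3, 2), F(3, 1)):
    reject = [x for x in p0 if L[x] > lam]              # LRT: reject H0 when L(x) > lam
    alpha = sum(p0[x] for x in reject)                  # P(reject | H0 true)
    beta = sum(p1[x] for x in p0 if x not in reject)    # P(accept | H1 true)
    print(f"lambda={lam}: reject {sorted(reject)}, alpha={alpha}, beta={beta}")
```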
Example
Assume λ = 3/4
Reject H0 if x = 0, 2
Accept H0 if x = 1
If x = 3, we randomize
i.e. reject H0 with probability a, 0 ≤ a ≤ 1, thus
α = p0 (0) + p0 (2) + ap0 (3) = 0.4 + 0.4a
β = p1 (1) + (1 − a) p1 (3) = 0.1 + 0.3 (1 − a)
79 / 87
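As a quick numeric check of these two formulas (a hypothetical sketch reusing the pmfs of the example), we can evaluate α and β for a few values of the randomization probability a; at a = 0 the test reduces to rejecting on {0, 2}.

```python
p0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
p1 = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}

def alpha_beta(a):
    """alpha, beta for the lambda = 3/4 test: reject {0, 2}, randomize at x = 3."""
    alpha = p0[0] + p0[2] + a * p0[3]       # = 0.4 + 0.4a
    beta = p1[1] + (1 - a) * p1[3]          # = 0.1 + 0.3(1 - a)
    return alpha, beta

for a in (0.0, 0.5, 1.0):
    alpha, beta = alpha_beta(a)
    print(f"a={a}: alpha={alpha:.2f}, beta={beta:.2f}")
```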
The Graph of B (ϕ)
Thus, we obtain a value of the risk B (ϕ) for each λ value
80 / 87
Thus, we have several tests
The classic one: the Minimax Test
The test that minimizes max {α, β}
Which
An admissible test with constant risk (α = β) is minimax
Then
We have only one test where α = β = 0.4, namely 3/4 < λ < 4/3. Thus
We reject H0 when x = 0 or 2
We accept H0 when x = 1 or 3
81 / 87
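As an illustrative sketch (not part of the slides), one can confirm this by brute force: enumerate every deterministic test on {0, 1, 2, 3} and keep the one minimizing max {α, β}.

```python
from itertools import product

p0 = [0.1, 0.2, 0.3, 0.4]
p1 = [0.2, 0.1, 0.4, 0.3]

best = None
for rejects in product([0, 1], repeat=4):   # every deterministic test on {0, 1, 2, 3}
    alpha = sum(p0[x] for x in range(4) if rejects[x] == 1)   # P(reject | H0)
    beta = sum(p1[x] for x in range(4) if rejects[x] == 0)    # P(accept | H1)
    risk = max(alpha, beta)
    if best is None or risk < best[0]:
        best = (risk, rejects, alpha, beta)

risk, rejects, alpha, beta = best
print("rejection region:", [x for x in range(4) if rejects[x]],
      "alpha =", round(alpha, 2), "beta =", round(beta, 2), "max =", round(risk, 2))
```

The search returns the rejection region {0, 2} with α = β = 0.4, matching the slide.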
Remark
From these ideas
We can work out the classics of hypothesis testing
82 / 87
Outline
1 Basic Theory
Intuitive Formulation
Axioms
2 Independence
Unconditional and Conditional Probability
Posterior (Conditional) Probability
3 Random Variables
Types of Random Variables
Cumulative Distributive Function
Properties of the PMF/PDF
Expected Value and Variance
4 Statistical Decision
Statistical Decision Model
Hypothesis Testing
Estimation
83 / 87
Introduction
Suppose
γ is a real-valued function on the set N of states of nature.
Now, when we observe X = x, we want to produce a number ψ (x) that is
close to γ (θ).
There are different ways of doing this
Maximum Likelihood (ML).
Expectation Maximization (EM).
Maximum A Posteriori (MAP).
84 / 87
Maximum Likelihood Estimation
Suppose the following
Let fθ be a density or probability mass function corresponding to the state of
nature θ.
Assume for simplicity that γ (θ) = θ
If X = x, the ML estimate of θ is the value ˆθ of θ that maximizes fθ (x).
85 / 87
Example
Let X have a binomial distribution
With parameters n and θ, 0 ≤ θ ≤ 1
The pmf is
pθ (x) = (n choose x) θ^x (1 − θ)^(n−x), with x = 0, 1, 2, ..., n
Differentiate the log-likelihood with respect to θ and set it to zero:
∂/∂θ ln pθ (x) = 0
86 / 87
Example
We get
x/θ − (n − x)/(1 − θ) = 0 =⇒ ˆθ = x/n
Now, we can regard X as a sum of independent variables
X = X1 + X2 + ... + Xn
where Xi is 1 with probability θ or 0 with probability 1 − θ
We get finally, by the Law of Large Numbers,
ˆθ (X) = (1/n) Σ_{i=1}^{n} Xi ⇒ lim_{n→∞} ˆθ (X) = E (Xi) = θ
87 / 87
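A small numerical sketch of this example (the observation, the true θ and the sample sizes below are assumed values, not from the slides): maximize the binomial log-likelihood over θ and compare with the closed form ˆθ = x/n, then check that with simulated Bernoulli data the estimate approaches the true θ as n grows.

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize_scalar

n, x = 50, 18                                   # assumed observation
# Numerical MLE: maximize log pmf over theta, i.e. minimize its negative
res = minimize_scalar(lambda th: -binom.logpmf(x, n, th),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", round(res.x, 4), " closed form x/n:", x / n)

# Consistency: theta_hat -> theta as n grows (Law of Large Numbers)
rng = np.random.default_rng(0)
theta_true = 0.3
for m in (10, 1_000, 100_000):
    sample = rng.binomial(1, theta_true, size=m)   # m Bernoulli(theta) trials
    print(f"n={m:>6}: theta_hat = {sample.mean():.4f}")
```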

More Related Content

What's hot

L03 ai - knowledge representation using logic
L03 ai - knowledge representation using logicL03 ai - knowledge representation using logic
L03 ai - knowledge representation using logic
Manjula V
 
Simulated Annealing
Simulated AnnealingSimulated Annealing
Simulated Annealing
Joy Dutta
 
Linear regression with gradient descent
Linear regression with gradient descentLinear regression with gradient descent
Linear regression with gradient descent
Suraj Parmar
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
Adnan Masood
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
swapnac12
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3Xueping Peng
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)
Tajim Md. Niamat Ullah Akhund
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
sathish sak
 
Probability Theory for Data Scientists
Probability Theory for Data ScientistsProbability Theory for Data Scientists
Probability Theory for Data Scientists
Ferdin Joe John Joseph PhD
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer PerceptronsESCOM
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
Md. Enamul Haque Chowdhury
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
Sai Kumar Kodam
 
Automata theory
Automata theoryAutomata theory
Automata theory
Pardeep Vats
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
Md. Ariful Hoque
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
Sara Hooker
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
Massimiliano Patacchiola
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
Krish_ver2
 

What's hot (20)

L03 ai - knowledge representation using logic
L03 ai - knowledge representation using logicL03 ai - knowledge representation using logic
L03 ai - knowledge representation using logic
 
Simulated Annealing
Simulated AnnealingSimulated Annealing
Simulated Annealing
 
Linear regression with gradient descent
Linear regression with gradient descentLinear regression with gradient descent
Linear regression with gradient descent
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
 
Probability Theory for Data Scientists
Probability Theory for Data ScientistsProbability Theory for Data Scientists
Probability Theory for Data Scientists
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
 
Automata theory
Automata theoryAutomata theory
Automata theory
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
 

Viewers also liked

Artificial Intelligence 06.2 More on Causality Bayesian Networks
Artificial Intelligence 06.2 More on  Causality Bayesian NetworksArtificial Intelligence 06.2 More on  Causality Bayesian Networks
Artificial Intelligence 06.2 More on Causality Bayesian Networks
Andres Mendez-Vazquez
 
Instagram
InstagramInstagram
Linux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdfLinux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdf
pangoo
 
Spanish vocabulary vh02
Spanish vocabulary vh02Spanish vocabulary vh02
Spanish vocabulary vh02
Patrick Auta
 
Heidi Pollock FOWA '07
Heidi Pollock FOWA '07Heidi Pollock FOWA '07
Heidi Pollock FOWA '07
heidipollock
 
T.K. morning
T.K. morning T.K. morning
T.K. morning
makotitob
 
Austep group general presentation usa
Austep group general presentation usaAustep group general presentation usa
Austep group general presentation usa
Ian Harris
 
The missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team PowerThe missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team Power
Luigi Buglione
 
Motion Study
Motion StudyMotion Study
Motion Study
ahmad bassiouny
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
ahmad bassiouny
 
Afm nudge
Afm nudgeAfm nudge
Algorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptesAlgorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptes
Christophe Benavent
 
Razlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelijeRazlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelije
Ivana Damnjanović
 
What's the Matter?
What's the Matter?What's the Matter?
What's the Matter?
Stephen Taylor
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
Andres Mendez-Vazquez
 
Preparation Data Structures 10 trees
Preparation Data Structures 10 treesPreparation Data Structures 10 trees
Preparation Data Structures 10 trees
Andres Mendez-Vazquez
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
Andres Mendez-Vazquez
 

Viewers also liked (20)

Artificial Intelligence 06.2 More on Causality Bayesian Networks
Artificial Intelligence 06.2 More on  Causality Bayesian NetworksArtificial Intelligence 06.2 More on  Causality Bayesian Networks
Artificial Intelligence 06.2 More on Causality Bayesian Networks
 
Instagram
InstagramInstagram
Instagram
 
Linux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdfLinux教程:Windows用户转向Linux的12个步骤.pdf
Linux教程:Windows用户转向Linux的12个步骤.pdf
 
Sailing
SailingSailing
Sailing
 
Spanish vocabulary vh02
Spanish vocabulary vh02Spanish vocabulary vh02
Spanish vocabulary vh02
 
Heidi Pollock FOWA '07
Heidi Pollock FOWA '07Heidi Pollock FOWA '07
Heidi Pollock FOWA '07
 
Festa da ..
Festa da ..Festa da ..
Festa da ..
 
T.K. morning
T.K. morning T.K. morning
T.K. morning
 
Austep group general presentation usa
Austep group general presentation usaAustep group general presentation usa
Austep group general presentation usa
 
The missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team PowerThe missing links in software estimation: Work, Team Loading and Team Power
The missing links in software estimation: Work, Team Loading and Team Power
 
Motion Study
Motion StudyMotion Study
Motion Study
 
Talent Digital
Talent DigitalTalent Digital
Talent Digital
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Afm nudge
Afm nudgeAfm nudge
Afm nudge
 
Algorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptesAlgorithmes et marketing : rendre des comptes
Algorithmes et marketing : rendre des comptes
 
Razlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelijeRazlike između biljne i životinjske ćelije
Razlike između biljne i životinjske ćelije
 
What's the Matter?
What's the Matter?What's the Matter?
What's the Matter?
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
 
Preparation Data Structures 10 trees
Preparation Data Structures 10 treesPreparation Data Structures 10 trees
Preparation Data Structures 10 trees
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
 

Similar to 02 Machine Learning - Introduction probability

03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms
Andres Mendez-Vazquez
 
Probability Assignment Help
Probability Assignment HelpProbability Assignment Help
Probability Assignment Help
Statistics Assignment Help
 
1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt
Vivek Bhartiya
 
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
tamicawaysmith
 
x13.pdf
x13.pdfx13.pdf
x13.pdf
TarikuArega1
 
M.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdfM.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdf
satyamkumarkashyap12
 
STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)Danny Cao
 
Probability theory discrete probability distribution
Probability theory discrete probability distributionProbability theory discrete probability distribution
Probability theory discrete probability distribution
samarthpawar9890
 
Chapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.docChapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.doc
Desalechali1
 
4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability
mlong24
 
Brian Prior - Probability and gambling
Brian Prior - Probability and gamblingBrian Prior - Probability and gambling
Brian Prior - Probability and gambling
onthewight
 
Unit 2 Probability
Unit 2 ProbabilityUnit 2 Probability
Unit 2 Probability
Rai University
 
Counting
CountingCounting
2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.
WeihanKhor2
 
Probability
ProbabilityProbability
Probability
Anjali Devi J S
 
Principles of Counting
Principles of CountingPrinciples of Counting
Principles of Counting
Amelita Martinez
 
Basic concepts of probability
Basic concepts of probability Basic concepts of probability
Basic concepts of probability
Long Beach City College
 
Chapter7ppt.pdf
Chapter7ppt.pdfChapter7ppt.pdf
Chapter7ppt.pdf
SohailBhatti21
 
Lecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdfLecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdf
MICAHJAMELLEICAWAT1
 
Simple probability
Simple probabilitySimple probability
Simple probability
06426345
 

Similar to 02 Machine Learning - Introduction probability (20)

03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms03 Probability Review for Analysis of Algorithms
03 Probability Review for Analysis of Algorithms
 
Probability Assignment Help
Probability Assignment HelpProbability Assignment Help
Probability Assignment Help
 
1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt1 - Probabilty Introduction .ppt
1 - Probabilty Introduction .ppt
 
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx3  PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
3 PROBABILITY TOPICSFigure 3.1 Meteor showers are rare, .docx
 
x13.pdf
x13.pdfx13.pdf
x13.pdf
 
M.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdfM.C.A. (Sem - II) Probability and Statistics.pdf
M.C.A. (Sem - II) Probability and Statistics.pdf
 
STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)STAB52 Lecture Notes (Week 2)
STAB52 Lecture Notes (Week 2)
 
Probability theory discrete probability distribution
Probability theory discrete probability distributionProbability theory discrete probability distribution
Probability theory discrete probability distribution
 
Chapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.docChapter-6-Random Variables & Probability distributions-3.doc
Chapter-6-Random Variables & Probability distributions-3.doc
 
4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability4.1-4.2 Sample Spaces and Probability
4.1-4.2 Sample Spaces and Probability
 
Brian Prior - Probability and gambling
Brian Prior - Probability and gamblingBrian Prior - Probability and gambling
Brian Prior - Probability and gambling
 
Unit 2 Probability
Unit 2 ProbabilityUnit 2 Probability
Unit 2 Probability
 
Counting
CountingCounting
Counting
 
2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.
 
Probability
ProbabilityProbability
Probability
 
Principles of Counting
Principles of CountingPrinciples of Counting
Principles of Counting
 
Basic concepts of probability
Basic concepts of probability Basic concepts of probability
Basic concepts of probability
 
Chapter7ppt.pdf
Chapter7ppt.pdfChapter7ppt.pdf
Chapter7ppt.pdf
 
Lecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdfLecture-1-Probability-Theory-Part-1.pdf
Lecture-1-Probability-Theory-Part-1.pdf
 
Simple probability
Simple probabilitySimple probability
Simple probability
 

More from Andres Mendez-Vazquez

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
Andres Mendez-Vazquez
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
Andres Mendez-Vazquez
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
Andres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
Andres Mendez-Vazquez
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
Andres Mendez-Vazquez
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
Andres Mendez-Vazquez
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
Andres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
Andres Mendez-Vazquez
 
Zetta global
Zetta globalZetta global
Zetta global
Andres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
Andres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
Andres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
Andres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
Andres Mendez-Vazquez
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
Andres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
Andres Mendez-Vazquez
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
Andres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
Andres Mendez-Vazquez
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
Andres Mendez-Vazquez
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
Andres Mendez-Vazquez
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
Andres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
manasideore6
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABSDESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
itech2017
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
anoopmanoharan2
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
veerababupersonal22
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
symbo111
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 

Recently uploaded (20)

Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABSDESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 

02 Machine Learning - Introduction probability

  • 1. Machine Learning for Data Mining Probability Review Andres Mendez-Vazquez May 14, 2015 1 / 87
  • 2. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 2 / 87
  • 3. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 3 / 87
  • 4. Gerolamo Cardano: Gambling out of Darkness Gambling Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later. Gerolamo Cardano (16th century) While gambling he developed the following rule!!! Equal conditions “The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent’s favour, you are a fool, and if in your own, you are unjust.” 4 / 87
  • 5. Gerolamo Cardano: Gambling out of Darkness Gambling Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later. Gerolamo Cardano (16th century) While gambling he developed the following rule!!! Equal conditions “The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent’s favour, you are a fool, and if in your own, you are unjust.” 4 / 87
  • 6. Gerolamo Cardano: Gambling out of Darkness Gambling Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later. Gerolamo Cardano (16th century) While gambling he developed the following rule!!! Equal conditions “The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent’s favour, you are a fool, and if in your own, you are unjust.” 4 / 87
  • 7. Gerolamo Cardano’s Definition Probability “If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.” Meaning Probability as a ratio of favorable to all possible outcomes!!! As long all events are equiprobable... Thus, we get P(All favourable throws) = Number All favourable throws Number of All throws (1) 5 / 87
  • 8. Gerolamo Cardano’s Definition Probability “If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.” Meaning Probability as a ratio of favorable to all possible outcomes!!! As long all events are equiprobable... Thus, we get P(All favourable throws) = Number All favourable throws Number of All throws (1) 5 / 87
  • 9. Gerolamo Cardano’s Definition Probability “If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.” Meaning Probability as a ratio of favorable to all possible outcomes!!! As long all events are equiprobable... Thus, we get P(All favourable throws) = Number All favourable throws Number of All throws (1) 5 / 87
  • 10. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 11. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 12. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 13. Intuitive Formulation Empiric Definition Intuitively, the probability of an event A could be defined as: P(A) = lim n→∞ N(A) n Where N(A) is the number that event a happens in n trials. Example Imagine you have three dices, then The total number of outcomes is 63 If we have event A = all numbers are equal, |A| = 6 Then, we have that P(A) = 6 63 = 1 36 6 / 87
  • 14. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 7 / 87
  • 15. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 16. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 17. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 18. Axioms of Probability Axioms Given a sample space S of events, we have that 1 0 ≤ P(A) ≤ 1 2 P(S) = 1 3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then: P(A1 ∪ A2 ∪ ... ∪ An) = n i=1 P(Ai) 8 / 87
  • 19. Set Operations We are using Set Notation Thus What Operations? 9 / 87
  • 20. Set Operations We are using Set Notation Thus What Operations? 9 / 87
  • 21. Example Setup Throw a biased coin twice HH .36 HT .24 TH .24 TT .16 We have the following event At least one head!!! Can you tell me which events are part of it? What about this one? Tail on first toss. 10 / 87
  • 22. Example Setup Throw a biased coin twice HH .36 HT .24 TH .24 TT .16 We have the following event At least one head!!! Can you tell me which events are part of it? What about this one? Tail on first toss. 10 / 87
  • 23. Example Setup Throw a biased coin twice HH .36 HT .24 TH .24 TT .16 We have the following event At least one head!!! Can you tell me which events are part of it? What about this one? Tail on first toss. 10 / 87
  • 24. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 25. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 26. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 27. We need to count!!! We have four main methods of counting 1 Ordered samples of size r with replacement 2 Ordered samples of size r without replacement 3 Unordered samples of size r without replacement 4 Unordered samples of size r with replacement 11 / 87
  • 28. Ordered samples of size r with replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n × ... × n = nr Example If you throw three dices you have 6 × 6 × 6 = 216 12 / 87
  • 29. Ordered samples of size r with replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n × ... × n = nr Example If you throw three dices you have 6 × 6 × 6 = 216 12 / 87
  • 30. Ordered samples of size r without replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n − 1 × ... × (n − (r − 1)) = n! (n−r)! Example The number of different numbers that can be formed if no digit can be repeated. For example, if you have 4 digits and you want numbers of size 3. 13 / 87
  • 31. Ordered samples of size r without replacement Definition The number of possible sequences (ai1 , ..., air ) for n different numbers is n × n − 1 × ... × (n − (r − 1)) = n! (n−r)! Example The number of different numbers that can be formed if no digit can be repeated. For example, if you have 4 digits and you want numbers of size 3. 13 / 87
  • 32. Unordered samples of size r without replacement Definition Actually, we want the number of possible unordered sets. However We have n! (n−r)! collections where we care about the order. Thus n! (n−r)! r! = n! r! (n − r)! = n r (2) 14 / 87
  • 33. Unordered samples of size r without replacement Definition Actually, we want the number of possible unordered sets. However We have n! (n−r)! collections where we care about the order. Thus n! (n−r)! r! = n! r! (n − r)! = n r (2) 14 / 87
  • 34. Unordered samples of size r with replacement Definition We want to find an unordered set {ai1 , ..., air } with replacement Use a digit trick for that Look at the Board Thus n + r − 1 r (3) 15 / 87
  • 35. Unordered samples of size r with replacement Definition We want to find an unordered set {ai1 , ..., air } with replacement Use a digit trick for that Look at the Board Thus n + r − 1 r (3) 15 / 87
  • 36. Unordered samples of size r with replacement Definition We want to find an unordered set {ai1 , ..., air } with replacement Use a digit trick for that Look at the Board Thus n + r − 1 r (3) 15 / 87
  • 37. How? Change encoding by adding more signs Imagine all the strings of three numbers with {1, 2, 3} We have Old String New String 111 1+0,1+1,1+2=123 112 1+0,1+1,2+2=124 113 1+0,1+1,3+2=125 122 1+0,2+1,2+2=134 123 1+0,2+1,3+2=135 133 1+0,3+1,3+2=145 222 2+0,2+1,2+2=234 223 2+0,2+1,3+2=225 233 1+0,3+1,3+2=233 333 3+0,3+1,3+2=345 16 / 87
  • 38. How? Change encoding by adding more signs Imagine all the strings of three numbers with {1, 2, 3} We have Old String New String 111 1+0,1+1,1+2=123 112 1+0,1+1,2+2=124 113 1+0,1+1,3+2=125 122 1+0,2+1,2+2=134 123 1+0,2+1,3+2=135 133 1+0,3+1,3+2=145 222 2+0,2+1,2+2=234 223 2+0,2+1,3+2=225 233 1+0,3+1,3+2=233 333 3+0,3+1,3+2=345 16 / 87
  • 39. Independence Definition Two events A and B are independent if and only if P(A, B) = P(A ∩ B) = P(A)P(B) 17 / 87
  • 40. Example We have two dice Thus, we have all pairs (i, j) such that i, j = 1, 2, 3, ..., 6 We have the following events A = {First die 1, 2 or 3} B = {First die 3, 4 or 5} C = {The sum of the two faces is 9} So, we can do Look at the board!!! Independence between A, B, C 18 / 87
  • 45. We can use this to derive the Binomial Distribution WHAT????? 19 / 87
  • 46. First, we use a sequence of n Bernoulli Trials We have this “Success” has a probability p. “Failure” has a probability 1 − p. Examples Toss a coin independently n times. Examine components produced on an assembly line. Now We take S = all 2^n ordered sequences of length n, with components 0 (failure) and 1 (success). 20 / 87
  • 51. Thus, taking a sample ω ω = 11···10···0, k 1’s followed by n − k 0’s. We have then P(ω) = P(A_1 ∩ A_2 ∩ ... ∩ A_k ∩ A^c_{k+1} ∩ ... ∩ A^c_n) = P(A_1) P(A_2) ··· P(A_k) P(A^c_{k+1}) ··· P(A^c_n) = p^k (1 − p)^{n−k} Important The number of such samples is the number of sets with k elements.... or... \binom{n}{k} 21 / 87
  • 54. Did you notice? We do not care where the 1’s and 0’s are Thus all the probabilities are equal to p^k (1 − p)^{n−k} Thus, we are looking to sum the probabilities of all those combinations of 1’s and 0’s with exactly k 1’s Then \sum_{\omega \text{ with } k \text{ 1's}} P(\omega) = \binom{n}{k} p^k (1 − p)^{n−k} 22 / 87
  • 57. Proving this is a probability Sum of these probabilities is equal to 1 \sum_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1 The other is simple 0 ≤ \binom{n}{k} p^k (1 − p)^{n−k} ≤ 1 ∀k This is known as The Binomial probability function!!! 23 / 87
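As a small illustration of the result just derived, the following sketch builds the binomial probability function and checks that it sums to 1; the values n = 10 and p = 0.3 are assumed example values, not anything from the slides.

```python
# Binomial probability function, checked to sum to 1 over k = 0, ..., n.
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(k successes in n independent Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
assert abs(sum(probs) - 1.0) < 1e-12          # (p + (1 - p))^n = 1
assert all(0 <= q <= 1 for q in probs)
print(probs[3])                                # P(exactly 3 successes) ≈ 0.2668
```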
  • 60. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 24 / 87
  • 61. Different Probabilities Unconditional This is the probability of an event A prior to the arrival of any evidence; it is denoted by P(A). For example: P(Cavity) = 0.1 means that “in the absence of any other information, there is a 10% chance that the patient has a cavity”. Conditional This is the probability of an event A given some evidence B; it is denoted P(A|B). For example: P(Cavity|Toothache) = 0.8 means that “there is an 80% chance that the patient has a cavity given that he has a toothache” 25 / 87
  • 65. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 26 / 87
  • 66. Posterior Probabilities Relation between conditional and unconditional probabilities Conditional probabilities can be defined in terms of unconditional probabilities: P(A|B) = \frac{P(A, B)}{P(B)}, which generalizes to the chain rule P(A, B) = P(B)P(A|B) = P(A)P(B|A). Law of Total Probability If B_1, B_2, ..., B_n is a partition of mutually exclusive events and A is an event, then P(A) = \sum_{i=1}^{n} P(A ∩ B_i). A special case: P(A) = P(A, B) + P(A, B^c). In addition, this can be rewritten as P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i). 27 / 87
  • 69. Example Three cards are drawn from a deck Find the probability of not obtaining a heart We have 52 cards, 39 of them not a heart Define A_i = {Card i is not a heart} Then? 28 / 87
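The chain rule P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2|A_1) P(A_3|A_1, A_2) gives the answer directly; a one-line check:

```python
# Chain-rule computation for the card example: P(no heart in 3 draws)
# with 39 non-hearts among 52 cards, drawn without replacement.
p_no_heart = (39 / 52) * (38 / 51) * (37 / 50)
print(round(p_no_heart, 4))   # ≈ 0.4135
```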
  • 72. Independence and Conditional From here, we have that... P(A|B) = P(A) and P(B|A) = P(B). Conditional independence A and B are conditionally independent given C if and only if P(A|B, C) = P(A|C) Example: P(WetGrass|Season, Rain) = P(WetGrass|Rain). 29 / 87
  • 74. Bayes Theorem One Version P(A|B) = \frac{P(B|A)P(A)}{P(B)} Where P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B. P(B|A) is the conditional probability of B given A. It is also called the likelihood. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. 30 / 87
  • 79. General Form of the Bayes Rule Definition If A_1, A_2, ..., A_n is a partition of mutually exclusive events and B any event, then: P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)} where P(B) = \sum_{j=1}^{n} P(B ∩ A_j) = \sum_{j=1}^{n} P(B|A_j)P(A_j) 31 / 87
  • 81. Example Setup Throw two unbiased dice independently. Let 1 A ={sum of the faces =8} 2 B ={faces are equal} Then calculate P (B|A) Look at the board 32 / 87
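A brute-force enumeration of the 36 equally likely pairs confirms the board computation; this is only a small sketch of P(B|A) = P(A ∩ B)/P(A).

```python
# Dice example: A = {sum of faces = 8}, B = {faces are equal}; compute P(B|A).
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))     # 36 equally likely pairs
A = [(i, j) for i, j in outcomes if i + j == 8]     # 5 outcomes
A_and_B = [(i, j) for i, j in A if i == j]          # only (4, 4)
print(len(A_and_B) / len(A))                        # 1/5 = 0.2
```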
  • 84. Another Example We have the following Two coins are available, one unbiased and the other two-headed Assume That you have a probability of 3/4 of choosing the unbiased one Events A = {head comes up} B_1 = {Unbiased coin chosen} B_2 = {Biased coin chosen} If a head comes up, find the probability that the two-headed coin was chosen 33 / 87
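A minimal sketch of the Bayes-rule computation for this coin example; the numbers follow directly from the setup above.

```python
# P(two-headed coin | head) via Bayes rule and the law of total probability.
p_B1, p_B2 = 3/4, 1/4                    # prior: unbiased vs two-headed coin
p_A_given_B1, p_A_given_B2 = 1/2, 1.0    # likelihood of a head for each coin
p_A = p_A_given_B1 * p_B1 + p_A_given_B2 * p_B2
print(p_A_given_B2 * p_B2 / p_A)         # 0.4
```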
  • 89. Random Variables I Definition In many experiments, it is easier to deal with a summary variable than with the original probability structure. Example In an opinion poll, we ask 50 people whether they agree or disagree with a certain issue. Suppose we record a “1” for agree and a “0” for disagree. The sample space for this experiment has 2^50 elements. Why? Suppose we are only interested in the number of people who agree. Define the variable X = number of “1” ’s recorded out of 50. It is easier to deal with this sample space (it has only 51 elements). 34 / 87
  • 96. Thus... It is necessary to define a function, the “random variable”, as follows X : S → R Graphically 35 / 87
  • 98. Random Variables II How? How is the probability function of the random variable defined from the probability function of the original sample space? Suppose the sample space is S = {s_1, s_2, ..., s_n} Suppose the range of the random variable X is {x_1, x_2, ..., x_m} Then, we observe X = x_j if and only if the outcome of the random experiment is an s ∈ S such that X(s) = x_j, or P(X = x_j) = P({s ∈ S | X(s) = x_j}) 36 / 87
  • 102. Example Setup Throw a coin 10 times, and let R be the number of heads. Then S = all sequences of length 10 with components H and T We have for ω =HHHHTTHTTH ⇒ R (ω) = 6 37 / 87
  • 105. Example Setup Let R be the number of heads in two independent tosses of a coin. Probability of head is .6 What are the probabilities? Ω ={HH,HT,TH,TT} Thus, we can calculate P (R = 0) , P (R = 1) , P (R = 2) 38 / 87
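A short sketch that carries out the calculation by summing outcome probabilities in Ω, with P(head) = 0.6 as stated above.

```python
# PMF of R = number of heads in two independent tosses with P(head) = 0.6.
p = 0.6
omega = {"HH": p * p, "HT": p * (1 - p), "TH": (1 - p) * p, "TT": (1 - p) * (1 - p)}
pmf = {r: sum(prob for seq, prob in omega.items() if seq.count("H") == r)
       for r in (0, 1, 2)}
print(pmf)   # ≈ {0: 0.16, 1: 0.48, 2: 0.36}
```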
  • 108. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 39 / 87
  • 109. Types of Random Variables Discrete A discrete random variable can assume only a countable number of values. Continuous A continuous random variable can assume a continuous range of values. 40 / 87
  • 111. Properties Probability Mass Function (PMF) and Probability Density Function (PDF) The pmf/pdf of a random variable X assigns a probability to each possible value of X. Properties of the pmf and pdf Some properties of the pmf: \sum_x p(x) = 1 and P(a ≤ X ≤ b) = \sum_{k=a}^{b} p(k). In a similar way for the pdf: \int_{-∞}^{∞} p(x)dx = 1 and P(a < X < b) = \int_a^b p(t)dt. 41 / 87
  • 117. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 43 / 87
  • 118. Cumulative Distributive Function I Cumulative Distribution Function With every random variable, we associate a function called the Cumulative Distribution Function (CDF) which is defined as follows: F_X(x) = P(X ≤ x) With properties: F_X(x) ≥ 0 F_X(x) is a non-decreasing function of x. Example If X is discrete, its CDF can be computed as follows: F_X(x) = P(X ≤ x) = \sum_{x_k ≤ x} P(X = x_k). 44 / 87
  • 123. Cumulative Distributive Function II Continuous Function If X is continuous, its CDF can be computed as follows: F(x) = \int_{-∞}^{x} f(t)dt. Remark Based on the fundamental theorem of calculus, we have the following equality: p(x) = \frac{dF}{dx}(x) Note This p(x) is known as the Probability Density Function (PDF); in the discrete case the corresponding p(x) is the Probability Mass Function (PMF). 46 / 87
  • 126. Example: Continuous Function Setup A number X is chosen at random between a and b X has a uniform distribution f_X(x) = \frac{1}{b−a} for a ≤ x ≤ b f_X(x) = 0 for x < a and x > b We have F_X(x) = P{X ≤ x} = \int_{-∞}^{x} f_X(t) dt (4) P{a < X ≤ b} = \int_a^b f_X(t) dt (5) 47 / 87
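A minimal sketch of the uniform pdf and CDF from equations (4) and (5); the endpoints a = 2 and b = 5 are arbitrary example values, not from the slides.

```python
# Uniform pdf on [a, b] and its CDF obtained by integrating the pdf.
def uniform_pdf(x: float, a: float, b: float) -> float:
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x: float, a: float, b: float) -> float:
    """F_X(x) = integral of the pdf from -infinity to x."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2.0, 5.0
print(uniform_cdf(3.5, a, b))                        # 0.5: midpoint of [a, b]
print(uniform_cdf(b, a, b) - uniform_cdf(a, a, b))   # P(a < X <= b) = 1
```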
  • 133. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 49 / 87
  • 134. Properties of the PMF/PDF I Conditional PMF/PDF We have the conditional pdf: p(y|x) = \frac{p(x, y)}{p(x)}. From this, we have the general chain rule p(x_1, x_2, ..., x_n) = p(x_1|x_2, ..., x_n) p(x_2|x_3, ..., x_n) ··· p(x_n). Independence If X and Y are independent, then: p(x, y) = p(x)p(y). 50 / 87
  • 136. Properties of the PMF/PDF II Law of Total Probability p(y) = \sum_x p(y|x)p(x). 51 / 87
  • 137. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 52 / 87
  • 138. Expectation Something Notable You have the random variables R_1, R_2 representing how long a call is and how much you pay for an international call: if 0 ≤ R_1 ≤ 3 (minutes), R_2 = 10 (cents); if 3 < R_1 ≤ 6 (minutes), R_2 = 20 (cents); if 6 < R_1 ≤ 9 (minutes), R_2 = 30 (cents) We have then the probabilities P{R_2 = 10} = 0.6, P{R_2 = 20} = 0.25, P{R_2 = 30} = 0.15 If we observe N calls and N is very large We can say that about 0.6N calls have R_2 = 10, with total cost 10 × 0.6N = 6N cents 53 / 87
  • 141. Expectation Similarly {R_2 = 20} =⇒ about 0.25N calls and total cost 5N {R_2 = 30} =⇒ about 0.15N calls and total cost 4.5N We have then The total cost is 6N + 5N + 4.5N = 15.5N, or on average 15.5 cents per call The average \frac{10(0.6N) + 20(0.25N) + 30(0.15N)}{N} = 10(0.6) + 20(0.25) + 30(0.15) = \sum_y y P{R_2 = y} 54 / 87
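The "average cost per call" computed above is exactly the expected value of R_2; as a quick check:

```python
# Expected value of R2 from its pmf: sum of y * P(R2 = y).
pmf_R2 = {10: 0.60, 20: 0.25, 30: 0.15}
expected_cost = sum(y * p for y, p in pmf_R2.items())
print(expected_cost)   # 15.5 cents per call
```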
  • 144. Expected Value Definition Discrete random variable X: E(X) = \sum_x x p(x). Continuous random variable Y: E(Y) = \int_{-∞}^{∞} y p(y) dy. Extension to a function g(X) E(g(X)) = \sum_x g(x)p(x) (Discrete case). E(g(X)) = \int_{-∞}^{∞} g(x)p(x)dx (Continuous case) Linearity property E(a f(X) + b g(Y)) = a E(f(X)) + b E(g(Y)) 55 / 87
  • 147. Example Imagine the following We have the density f(x) = e^{−x} for x ≥ 0 and f(x) = 0 for x < 0 Find The expected value 56 / 87
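For this density the expected value works out to \int_0^∞ x e^{−x} dx = 1 (integration by parts); below is a crude numeric check of that integral, offered only as a sketch.

```python
# Approximate E(X) = integral of x * e^{-x} over [0, 50] by a Riemann sum;
# the tail beyond 50 is negligible, so the result should be close to 1.
from math import exp

dx = 1e-4
approx = sum(x * exp(-x) * dx for x in (k * dx for k in range(1, int(50 / dx))))
print(round(approx, 3))   # ≈ 1.0
```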
  • 151. Variance Definition Var(X) = E((X − µ)^2) where µ = E(X) Standard Deviation The standard deviation is simply σ = \sqrt{Var(X)}. 57 / 87
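As a small worked example reusing the earlier two-toss variable R (with p = 0.6), the variance can be computed directly from the definition:

```python
# Variance and standard deviation of R = heads in two tosses with p = 0.6,
# computed from Var(X) = E((X - mu)^2) with mu = E(X).
from math import sqrt

pmf = {0: 0.16, 1: 0.48, 2: 0.36}
mu = sum(x * p for x, p in pmf.items())                  # ≈ 1.2
var = sum((x - mu) ** 2 * p for x, p in pmf.items())     # ≈ 0.48
print(mu, var, sqrt(var))                                # ≈ 1.2 0.48 0.6928
```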
  • 153. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 58 / 87
  • 154. Example Suppose You have that the number of calls made per day at a given exchange has a Poisson distribution with an unknown parameter θ: p(x|θ) = \frac{θ^x e^{−θ}}{x!}, x = 0, 1, 2, ... (6) We need to obtain information about θ For this, we observe that certain information is needed!!! For example We could need more of certain equipment if θ > θ_0 We do not need it if θ ≤ θ_0 59 / 87
  • 157. Thus, we want to take a decision about θ To avoid making an incorrect decision To avoid losing money!!! 60 / 87
  • 158. Ingredients of statistical decision models First N, the set of states of nature Second A random variable or random vector X, the observable, whose distribution F_θ depends on θ ∈ N Third A, the set of possible actions; here A = N = (0, ∞) Fourth A loss (cost) function L(θ, a), θ ∈ N, a ∈ A: it represents the loss of taking action a when the state is θ. 61 / 87
  • 162. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 62 / 87
  • 163. Hypothesis Testing Suppose H_0 and H_1 are two subsets such that H_0 ∩ H_1 = ∅ H_0 ∪ H_1 = N In the telephone example H_0 = {θ | θ ≤ θ_0} H_1 = {θ | θ > θ_0} In other words, the hypotheses are “θ ∈ H_0” versus “θ ∈ H_1” 63 / 87
  • 170. Simple Hypothesis Vs. Simple Alternative In this specific case Each of H_0 and H_1 contains one element, θ_0 and θ_1 Thus We have that our random variable X depends on θ: If we are in H_0, X ∼ f_0 If we are in H_1, X ∼ f_1 Thus, the problem is deciding whether X has density f_0 or f_1 64 / 87
  • 174. What do we do? We define a function ϕ : E → [0, 1], interpreted as the probability of rejecting H_0 when x is observed We have then If ϕ(x) = 1, we reject H_0 If ϕ(x) = 0, we accept H_0 If 0 < ϕ(x) < 1, we toss a coin with probability ϕ(x) of heads: if the coin comes up heads, reject H_0; if it comes up tails, accept H_0 65 / 87
  • 180. Thus {x | ϕ(x) = 1} It is called the rejection region or critical region. And ϕ is called a test!!! Clearly the decision could be erroneous!!! A type 1 error occurs if we reject H_0 when H_0 is true!!! A type 2 error occurs if we accept H_0 when H_1 is true!!! 66 / 87
  • 184. Thus the probability of error when X = x If H_0 is rejected when true Probability of a type 1 error α = \int_{-∞}^{∞} ϕ(x) f_0(x) dx (7) If H_0 is accepted when false Probability of a type 2 error β = \int_{-∞}^{∞} (1 − ϕ(x)) f_1(x) dx (8) 67 / 87
  • 186. Actually If the test is an indicator function, ϕ(x) = I_{Reject H_0}(x) and 1 − ϕ(x) = I_{Accept H_0}(x) (the slide shows the usual 2 × 2 table of retaining or rejecting H_0 against which hypothesis is true) 68 / 87
  • 187. Problem!!! There is not a unique answer to the question of what is a good test Thus, we suppose there is a nonnegative cost c_i associated with a type i error. In addition, we have a prior probability p of H_0 being true. The over-all average cost associated with ϕ is B(ϕ) = p × c_1 × α(ϕ) + (1 − p) × c_2 × β(ϕ) (9) 69 / 87
  • 190. We can do the following The over-all average cost associated with ϕ is B(ϕ) = p c_1 \int_{-∞}^{∞} ϕ(x) f_0(x) dx + (1 − p) c_2 \int_{-∞}^{∞} (1 − ϕ(x)) f_1(x) dx Thus B(ϕ) = \int_{-∞}^{∞} [p c_1 ϕ(x) f_0(x) + (1 − p) c_2 (1 − ϕ(x)) f_1(x)] dx = \int_{-∞}^{∞} [p c_1 ϕ(x) f_0(x) − (1 − p) c_2 ϕ(x) f_1(x) + (1 − p) c_2 f_1(x)] dx = \int_{-∞}^{∞} [p c_1 ϕ(x) f_0(x) − (1 − p) c_2 ϕ(x) f_1(x)] dx + (1 − p) c_2 \int_{-∞}^{∞} f_1(x) dx We have that B(ϕ) = \int_{-∞}^{∞} ϕ(x) [p c_1 f_0(x) − (1 − p) c_2 f_1(x)] dx + (1 − p) c_2 70 / 87
  • 193. Bayes Risk We have that... B (ϕ) is called the Bayes risk associated to the test function ϕ In addition A test that minimizes B (ϕ) is called a Bayes test corresponding to the given p, c1, c2, f0 and f1. 71 / 87
  • 195. What do we want? We want To minimize \int_S ϕ(x) g(x) dx, the ϕ-dependent part of B(ϕ) We want to find g(x)!!! This will tell us how to select the correct hypothesis!!! 72 / 87
  • 198. What do we want? Case 1 If g(x) < 0, it is best to take ϕ(x) = 1 at that x. Case 2 If g(x) > 0, it is best to take ϕ(x) = 0 at that x. Case 3 If g(x) = 0, ϕ(x) may be chosen arbitrarily. 73 / 87
  • 201. Finally We choose g(x) = p c_1 f_0(x) − (1 − p) c_2 f_1(x) (10) We look at the case where g(x) = 0: p c_1 f_0(x) − (1 − p) c_2 f_1(x) = 0 ⟺ p c_1 f_0(x) = (1 − p) c_2 f_1(x) ⟺ \frac{p c_1}{(1 − p) c_2} = \frac{f_1(x)}{f_0(x)} 74 / 87
  • 203. Bayes Solution Thus, we have Let L(x) = \frac{f_1(x)}{f_0(x)} If L(x) > \frac{p c_1}{(1−p) c_2} then take ϕ(x) = 1, i.e. reject H_0. If L(x) < \frac{p c_1}{(1−p) c_2} then take ϕ(x) = 0, i.e. accept H_0. If L(x) = \frac{p c_1}{(1−p) c_2} then ϕ(x) may be anything 75 / 87
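A minimal sketch of this decision rule in code; f0, f1, p, c1 and c2 are placeholders to be supplied by the modeler, not anything defined in the slides.

```python
# Bayes test: reject H0 when the likelihood ratio L(x) = f1(x)/f0(x)
# exceeds the threshold p*c1 / ((1 - p)*c2).
from typing import Callable

def bayes_test(x: float,
               f0: Callable[[float], float],
               f1: Callable[[float], float],
               p: float, c1: float, c2: float) -> int:
    """Return 1 to reject H0, 0 to accept (ties broken arbitrarily toward accept)."""
    threshold = p * c1 / ((1 - p) * c2)
    L = f1(x) / f0(x)
    return 1 if L > threshold else 0
```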
  • 207. Likelihood Ratio We have L is called the likelihood ratio. For the test ϕ There is a constant 0 ≤ λ ≤ ∞ such that ϕ(x) = 1 when L(x) > λ ϕ(x) = 0 when L(x) < λ Remark: This is known as the Likelihood Ratio Test (LRT) 76 / 87
  • 212. Example Let X be a discrete random variable, x ∈ {0, 1, 2, 3} We have then x: 0, 1, 2, 3; p_0(x): .1, .2, .3, .4; p_1(x): .2, .1, .4, .3 We have the following likelihood ratio (in increasing order) x: 1, 3, 2, 0; L(x): 1/2, 3/4, 4/3, 2 77 / 87
  • 215. Example We have the following situation for the LRT: for 0 ≤ λ < 1/2, reject region = all x, acceptance region = empty, α = 1, β = 0; for 1/2 < λ < 3/4, reject x = 0, 2, 3, accept x = 1, α = .8, β = .1; for 3/4 < λ < 4/3, reject x = 0, 2, accept x = 1, 3, α = .4, β = .4; for 4/3 < λ < 2, reject x = 0, accept x = 1, 2, 3, α = .1, β = .8; for 2 < λ ≤ ∞, reject region = empty, accept all x, α = 0, β = 1. 78 / 87
  • 216. Example Assume λ = 3/4 Reject H_0 if x = 0, 2 Accept H_0 if x = 1 If x = 3, we randomize, i.e. reject H_0 with probability a, 0 ≤ a ≤ 1; thus α = p_0(0) + p_0(2) + a p_0(3) = 0.4 + 0.4a β = p_1(1) + (1 − a) p_1(3) = 0.1 + 0.3(1 − a) 79 / 87
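The discrete example can be checked in a few lines; a = 1/2 below is an arbitrary randomization probability used only for illustration, and exact fractions avoid floating-point ties at L(x) = λ.

```python
# Randomized LRT for the discrete example: threshold lambda = 3/4,
# randomize with probability a at the boundary point x = 3.
from fractions import Fraction as F

p0 = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
p1 = {0: F(2, 10), 1: F(1, 10), 2: F(4, 10), 3: F(3, 10)}
lam, a = F(3, 4), F(1, 2)

def phi(x):
    """Probability of rejecting H0 when x is observed."""
    L = p1[x] / p0[x]          # likelihood ratio
    if L > lam:
        return F(1)
    if L < lam:
        return F(0)
    return a                   # L(x) = lambda: randomize

alpha = sum(phi(x) * p0[x] for x in p0)         # 0.4 + 0.4a
beta = sum((1 - phi(x)) * p1[x] for x in p1)    # 0.1 + 0.3(1 - a)
print(alpha, beta)                              # 3/5 and 1/4 when a = 1/2
```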
  • 221. The Graph of B (ϕ) Thus, we have for each λ value 80 / 87
  • 222. Thus, we have several tests The classic one: the Minimax Test The test that minimizes max{α, β} Which An admissible test with constant risk (α = β) is minimax Then We have only one test where α = β = 0.4, namely 3/4 < λ < 4/3, Thus We reject H_0 when x = 0 or 2 We accept H_0 when x = 1 or 3 81 / 87
  • 226. Remark From these ideas We can work out the classics of hypothesis testing 82 / 87
  • 227. Outline 1 Basic Theory Intuitive Formulation Axioms 2 Independence Unconditional and Conditional Probability Posterior (Conditional) Probability 3 Random Variables Types of Random Variables Cumulative Distributive Function Properties of the PMF/PDF Expected Value and Variance 4 Statistical Decision Statistical Decision Model Hypothesis Testing Estimation 83 / 87
  • 228. Introduction Suppose γ is a real-valued function on the set N of states of nature. Now, when we observe X = x, we want to produce a number ψ(x) that is close to γ(θ). There are different ways of doing this Maximum Likelihood (ML). Expectation Maximization (EM). Maximum A Posteriori (MAP) 84 / 87
  • 233. Maximum Likelihood Estimation Suppose the following Let f_θ be a density or probability function corresponding to the state of nature θ. Assume for simplicity that γ(θ) = θ If X = x, the ML estimate of θ is \hat{θ}(x), the value of θ that maximizes f_θ(x) 85 / 87
  • 236. Example Let X have a binomial distribution With parameters n and θ, 0 ≤ θ ≤ 1 The pmf p_θ(x) = \binom{n}{x} θ^x (1 − θ)^{n−x} with x = 0, 1, 2, ..., n Differentiate the log-likelihood with respect to θ and set it to zero: \frac{∂}{∂θ} \ln p_θ(x) = 0 86 / 87
  • 239. Example We get \frac{x}{θ} − \frac{n − x}{1 − θ} = 0 =⇒ \hat{θ} = \frac{x}{n} Now, we can regard X as a sum of independent variables X = X_1 + X_2 + ... + X_n where: X_i is 1 with probability θ or 0 with probability 1 − θ We get finally \hat{θ}(X) = \frac{\sum_{i=1}^{n} X_i}{n} ⇒ \lim_{n→∞} \hat{θ}(X) = E(X_i) = θ 87 / 87
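A small simulation sketch of this last point; the true θ = 0.3 and n = 10000 are assumed illustration values, and the ML estimate x/n should come out close to θ for large n.

```python
# Simulate n Bernoulli(theta) trials and form the ML estimate theta_hat = x/n.
import random

theta_true, n = 0.3, 10_000
x = sum(random.random() < theta_true for _ in range(n))   # X = X1 + ... + Xn
theta_hat = x / n
print(theta_hat)    # close to 0.3, as the last slide argues
```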