This document contains lecture notes for a pattern recognition course taught by Dr. Mostafa Gadal-Haqq at Ain Shams University. The notes cover mathematical foundations of pattern recognition including probability theory, statistics, and mathematical notations. Specifically, the notes define concepts like random variables, probability distributions, expected values, variance, and conditional probability. They also provide examples of applying these concepts to problems involving events, outcomes, and data modeling. The document concludes by noting that the next lecture will cover Bayesian decision theory.
CSC446: Pattern Recognition (LN3)
1. CSC446 : Pattern Recognition
Prof. Dr. Mostafa G. M. Mostafa
Faculty of Computer & Information Sciences
Computer Science Department
AIN SHAMS UNIVERSITY
Lecture Note 3:
Mathematical Foundations
Appendix, Pattern Classification and PRML
2. CSC446 : Pattern Recognition
Readings: Chapter 1 in Bishop’s PRML
Data Modeling (Regression)
3. Learning: Data Modeling
• Assume we have examples of pairs (x, y) and we want to learn the mapping F: X → Y to predict y for future values of x.
• Example target function: \( y(x) = \sin(2\pi x) \)
4. Polynomial Curve Fitting
• Problem: There exist many possible mapping functions F: X → Y. Which one should we choose?
• We could choose the one that minimizes the error:
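The error function itself is not reproduced in this transcript. A minimal sketch of the standard choice, the sum-of-squares error from Chapter 1 of Bishop's PRML, written in this document's (x, y) notation:

\( E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( F(x_n; \mathbf{w}) - y_n \right)^2 \)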
5. Polynomial Curve Fitting
• Fitting different polynomials (models) to the data:
\( y(x) = w_0 \qquad\qquad y(x) = w_0 + w_1 x \)
6. Polynomial Curve Fitting
• Fitting different polynomials (models) to the data:
\( y(x) = w_0 + w_1 x + w_2 x^2 \qquad\qquad y(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_8 x^8 \)
7. Overfitting
• At M = 9, we get zero training error, but the highest testing error.
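A minimal Python sketch of this experiment (the noise level, sample sizes, and random seed are illustrative assumptions, not from the slides): fitting polynomials of increasing order M to noisy samples of y(x) = sin(2πx) and comparing training error against error on held-out data reproduces the overfitting behaviour described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.2):
    """Noisy samples of the target function y(x) = sin(2*pi*x)."""
    x = np.linspace(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise, size=n)
    return x, y

x_train, y_train = make_data(10)    # small training set, as in the slides
x_test, y_test = make_data(100)     # held-out data exposes overfitting

for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, y_train, deg=M)   # least-squares polynomial fit
    rmse_train = np.sqrt(np.mean((np.polyval(w, x_train) - y_train) ** 2))
    rmse_test = np.sqrt(np.mean((np.polyval(w, x_test) - y_test) ** 2))
    print(f"M={M}: train RMSE = {rmse_train:.3f}, test RMSE = {rmse_test:.3f}")
```

With 10 training points, the M = 9 fit passes through nearly every sample (training error close to zero) while its test error blows up, matching the slide.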
8. Effect of Data Size
• As the number of data samples N increases, we get closer to the real data model, even with a higher-order polynomial (here M = 9 in both cases).
9. Performance Evaluation
• Generalization error is the true error over the population of examples we would like to optimize.
– The sample mean only approximates it.
• Two ways to assess the generalization error:
• Theoretical: the Law of Large Numbers
– gives statistical bounds on the difference between the true and sample mean errors.
• Practical: use a separate data set with m data samples to test the model:
(Mean) test error \( = \frac{1}{m} \sum_{i=1}^{m} \left( F(x_i; \mathbf{w}) - y_i \right)^2 \)
10. Assignment 1
1. Derive an equation for estimating the parameters w from the sample data for the cases M = 1 and M = 2.
2. Use such equations to draw a relation between w and E(w) for each M. Use the estimated values of w as the middle values of the w range.
11. CSC446 : Pattern Recognition
Readings: Appendix A
Probability & Statistics
12. 1- Probability Theory
• Randomness:
– We call a phenomenon random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions.
• Probability:
– The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.
– Probability is the long-term relative frequency.
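As a quick illustration of probability as long-term relative frequency, here is a minimal simulation sketch (the seed and sample sizes are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.integers(0, 2, size=100_000)   # fair coin: 0 = tails, 1 = heads

# The relative frequency of heads settles near 0.5 as the number of
# repetitions grows.
for n in (10, 100, 1_000, 100_000):
    print(f"n = {n:>6}: relative frequency of heads = {flips[:n].mean():.4f}")
```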
13. 1- Probability Theory
• Discrete random variables:
– Let \( x \in X \), where the sample space \( X = \{v_1, v_2, \ldots, v_m\} \).
– We denote by \( p_i \) the probability that \( x = v_i \):
\( p_i = \Pr\{ x = v_i \}, \quad i = 1, \ldots, m, \)
• where the \( p_i \) must satisfy the following two conditions:
\( p_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{m} p_i = 1 \)
14. 1- Probability Theory
• Equally likely outcomes:
“Equally likely outcomes are outcomes that
have the same probability of occurring.”
• Examples:
– Rolling a fair die
– Tossing a fair coin
• P(x) is a “Uniform Distribution”
15. 1- Probability Theory
• Equally likely outcomes:
• If we have ten identical balls numbered from 0 to 9 in a box, find the probability of randomly drawing a ball with a number divisible by 3:
– The event space (desired outcomes): A = {3, 6, 9}.
– The sample space (possible outcomes): S = {0, 1, 2, . . . , 9}.
• Since the drawing is at random, each outcome is equally likely to occur, i.e.: P(0) = P(1) = P(2) = … = P(9) = 1/10.
• P(A) = {number of outcomes in A} / {number of outcomes in S} = 3/10 = 0.3
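A one-line check of this example by enumeration (illustrative only; the event set is copied from the slide):

```python
S = range(10)           # sample space: balls numbered 0..9
A = [3, 6, 9]           # event: number divisible by 3, as on the slide
print(len(A) / len(S))  # equally likely outcomes -> P(A) = 0.3
```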
16. 1- Probability Theory
• Biased outcomes (non-uniform dist.):
“Biased outcomes are outcomes that have different probabilities of occurring.”
• Examples:
– Rolling an unfair die
– Tossing an unfair coin
• P(x) is a “Non-uniform Distribution”
17. 1- Probability Theory
• Biased outcomes (non-uniform dist.):
• A biased coin, twice as likely to come up tails as heads, is tossed twice:
– What is the probability that at least one head occurs?
• Solution:
– Sample space = {HH, HT, TH, TT}
– P(H = head) = 1/3, P(T = tail) = 2/3
– Sample-point probabilities for the event:
• P(HT) = 1/3 × 2/3 = 2/9    P(HH) = 1/3 × 1/3 = 1/9
• P(TH) = 2/3 × 1/3 = 2/9    P(TT) = 2/3 × 2/3 = 4/9
– Answer: P(HH) + P(HT) + P(TH) = 1/9 + 2/9 + 2/9 = 5/9 ≈ 0.56
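A minimal sketch verifying this result by enumerating the four outcomes (the variable names are illustrative):

```python
from itertools import product

p = {"H": 1/3, "T": 2/3}                  # biased coin: tails twice as likely
total = 0.0
for toss in product("HT", repeat=2):      # HH, HT, TH, TT
    if "H" in toss:                       # event: at least one head
        total += p[toss[0]] * p[toss[1]]  # the two tosses are independent
print(total)                              # 0.555... = 5/9
```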
18. 1- Probability Theory
• Probability and Language
• What's the probability of a random word (from a random dictionary page) being a verb?
\( P(\text{drawing a verb}) = \frac{\#\,\text{of ways to get a verb}}{\text{all words}} \)
• Solution:
• All words: just count all the words in the dictionary.
• # of ways to get a verb: the number of words which are verbs!
• If a dictionary has 50,000 entries, and 10,000 are verbs, then:
• P(Verb) = 10,000/50,000 = 1/5 = 0.20
19. 1- Probability Theory
• Conditional Probability
– A way to reason about the outcome of an experiment based on partial information:
• In a word-guessing game, the first letter of the word is a “t”. How likely is it that the second letter is an “h”?
• How likely is it that a person has a disease, given that a medical test was negative?
• A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
• I saw your friend; how likely is it that I will see you?
20. 1- Probability Theory
• Conditional Probability
• Let A and B be events.
• p(B|A) = the probability of event B occurring given event A occurs.
• Definition:
\( P(A \mid B) = \frac{P(A, B)}{P(B)} \)
Note: \( P(A, B) = P(A \mid B) \cdot P(B) \)
Also: \( P(A, B) = P(B, A) \)
21. 1- Probability Theory
• Conditional Probability
• One of the following 30 items is chosen at random.
• What is P(X), the probability that it is an X?
• What is P(X|red), the probability that it is an X given that it
is red?
22. 1- Probability Theory
• Statistically Independent events
– Variables x and y are said to be statistically independent if and only if:
\( P(x, y) = P(x)\, P(y) \)
– That is, knowing the value of x does not give us any additional knowledge about the possible value of y.
23. 1- Probability Theory
• Marginal Probability
• Conditional Probability
• Joint Probability
25. 1- Probability Theory
• The Rules of Probability:
• Sum Rule: \( p(X) = \sum_{Y} p(X, Y) \)
• Product Rule: \( p(X, Y) = p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y) \)
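A minimal numeric sketch of these rules (the joint table below is made up purely for illustration):

```python
import numpy as np

# Made-up joint distribution P(X, Y) with X in {0, 1, 2} (rows), Y in {0, 1} (cols).
P_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])
assert np.isclose(P_xy.sum(), 1.0)    # a valid joint distribution

P_x = P_xy.sum(axis=1)                # sum rule: P(X) = sum_Y P(X, Y)
P_y_given_x = P_xy / P_x[:, None]     # conditional: P(Y|X) = P(X, Y) / P(X)

# Product rule: P(X, Y) = P(Y|X) P(X), recovered exactly.
assert np.allclose(P_y_given_x * P_x[:, None], P_xy)
print("P(X) =", P_x)
```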
26. 1- Probability Theory
• Bayes' Theorem:
\( p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} \)
where
\( p(X) = \sum_{Y} p(X \mid Y)\, p(Y) \)
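A small worked example applying Bayes' theorem to the medical-test question raised on the conditional-probability slide (all numbers below are assumed purely for illustration):

```python
# Worked Bayes'-rule example for the earlier medical-test question.
p_disease = 0.01              # assumed prior P(D): 1% prevalence
p_neg_given_disease = 0.05    # assumed P(neg | D): the test misses 5% of cases
p_neg_given_healthy = 0.90    # assumed P(neg | not D)

# Denominator via the sum rule: p(neg) summed over the two health states.
p_neg = (p_neg_given_disease * p_disease
         + p_neg_given_healthy * (1.0 - p_disease))

# Posterior via Bayes' theorem.
p_disease_given_neg = p_neg_given_disease * p_disease / p_neg
print(f"P(disease | negative test) = {p_disease_given_neg:.5f}")  # ~0.00056
```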
27. 1- Probability Theory
• Probability mass function, P(x):
\( P(x) \ge 0 \quad \text{and} \quad \sum_{x \in X} P(x) = 1 \)
– The cumulative distribution of a density p(x) is:
\( P(x \le z) = \int_{-\infty}^{z} p(x)\, dx \)
28. 2- Statistics
• Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data.
• The best way of looking at data is to draw its histogram (frequency distribution).
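A minimal sketch of drawing a histogram in Python (the data here is simulated, purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated measurements

plt.hist(data, bins=30, edgecolor="black")        # frequency distribution
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```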
29. 2- Statistics
• Univariate Gaussian/Normal Density:
– A density that is analytically tractable
– Continuous density
– A lot of processes are asymptotically Gaussian
\( p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right], \qquad \int_{-\infty}^{\infty} p(x)\, dx = 1 \)
where:
μ = mean (or expected value) of x
σ² = squared deviation, or variance
30. 2- Statistics
• Univariate Gaussian/Normal Density
p(u) ~ N(0,1)
31. 2- Statistics
• Multivariate Normal Density
– The multivariate normal density in d dimensions is:
\( p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t\, \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] \)
where:
x = (x1, x2, …, xd)t = the multivariate random variable
μ = (μ1, μ2, …, μd)t = the mean vector
Σ = the d×d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse, respectively.
32. 2- Statistics
• Multivariate Density: Statistically Independent
– If xi and xj are statistically independent, then σij = 0.
– In this case, p(x) reduces to the product of the univariate normal densities for the components of x. That is, if p(xi) ~ N(xi | µi, σi):
\( p(\mathbf{x}) = p(x_1, x_2, \ldots, x_d) = p(x_1)\, p(x_2) \cdots p(x_d) = \prod_{i=1}^{d} p(x_i) \)
33. 2- Statistics
• Multivariate Normal Density
– From the multivariate normal density, the loci of points of constant density are hyperellipsoids for which the quadratic form (x−µ)t Σ⁻¹(x−µ) is constant.
– The quantity
\( r^2 = (\mathbf{x} - \boldsymbol{\mu})^t\, \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \)
is sometimes called the squared Mahalanobis distance from x to µ.
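A minimal sketch evaluating the multivariate normal density and the squared Mahalanobis distance in d = 2 dimensions (the mean, covariance, and query point are made up for illustration):

```python
import numpy as np

mu = np.array([1.0, 2.0])                  # assumed mean vector
Sigma = np.array([[2.0, 0.5],              # assumed covariance matrix
                  [0.5, 1.0]])
x = np.array([2.0, 1.0])                   # query point

Sigma_inv = np.linalg.inv(Sigma)
diff = x - mu
r2 = diff @ Sigma_inv @ diff               # squared Mahalanobis distance

d = len(mu)
norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
p_x = norm_const * np.exp(-0.5 * r2)       # multivariate normal density

print(f"r^2 = {r2:.4f}, p(x) = {p_x:.6f}")
```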
35. 2- Statistics
Expected values:
• The expected value, mean, or average of the random variable x is defined by:
\( \mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx \)
• If f(x) is any function of x, the expected value of f is defined by:
\( E[f(x)] = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx \)
36. 2- Statistics
Expected values:
• The second moment of x is defined by:
\( E[x^2] = \int_{-\infty}^{\infty} x^2\, p(x)\, dx \)
• The variance of x is defined by:
\( \sigma^2 = E[(x - \mu)^2] = E[x^2] - \mu^2 \)
where σ is the standard deviation of x.
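A minimal closing sketch: sample estimates of the expected value and variance converge to the true parameters of the generating Gaussian (the parameters and sample size below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=100_000)

mean_est = x.mean()                      # estimates E[x] = mu
second_moment = np.mean(x ** 2)          # estimates E[x^2]
var_est = second_moment - mean_est ** 2  # Var(x) = E[x^2] - mu^2

print(f"mean ~ {mean_est:.3f} (true {mu}), var ~ {var_est:.3f} (true {sigma**2})")
```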