2. CSEC-4142 & PHDCS-4 @2022
Introduction
• Analytics is the systematic computational analysis of data or statistics.
• It is used for the discovery, interpretation, and communication of meaningful patterns in data.
• It also entails applying data patterns towards effective decision-making.
• It can be valuable in areas rich with recorded information.
• Analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance.
Data Analysis vs Data Analytics
• Data analysis focuses on the process of examining past data through collection, inspection,
modelling and questioning.
• It is a subset of data analytics, which takes multiple data analysis processes to focus on why
an event happened and what may happen in the future based on the previous data.
• Data analytics is used to formulate larger organizational decisions.
• Data analytics uses extensive computer skills, mathematics, statistics, descriptive techniques and predictive models to gain valuable knowledge from data.
Techniques and Tools of Analytics
• Analytics is a collection of techniques and tools for creating value from data.
• The techniques include concepts such as Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) algorithms.
• Artificial Intelligence refers to algorithms and systems that exhibit human-like intelligence.
• Machine learning is a subset of AI that can learn to perform a task from extracted data and/or models.
• Deep learning is a subset of machine learning that imitates the functioning of the human brain to solve problems.
Introduction to Machine Learning
• Machine learning (ML) is an area of computer science in which computers are taught to perform tasks by learning from experience.
• Machine Learning deals with systems that are trained from data rather than being explicitly
programmed.
• It is centred on the creation of algorithms that can learn from past experience and data, as
opposed to a computer being pre-programmed to carry out a task in a specific manner.
• Put simply, if a computer program improves with experience then we can say that it has
learned.
Definition
• Tom Mitchell, one of the pioneers of machine learning, defines machine learning as follows: “A computer program is said to learn from experience ‘E’, with respect to some class of tasks ‘T’ and performance measure ‘P’, if its performance at tasks in ‘T’, as measured by ‘P’, improves with experience ‘E’.”
• Stanford University defines it as: “Machine learning is the science of getting computers to act without being explicitly programmed.”
• McKinsey et al. state that “Machine learning is based on algorithms that can learn from data without relying on rules-based programming.”
Example-1: Handwriting Recognition
Task T : Recognising and classifying handwritten words within images
Performance P : Percent of words correctly classified
Training experience E: A dataset of handwritten words with given classifications
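The performance measure P above can be computed directly. A minimal Python sketch (the word labels here are made up purely for illustration):

```python
def accuracy(predicted, actual):
    """Performance measure P: percent of labels correctly classified."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

# Hypothetical labels for five handwritten words (experience E would
# supply the actual classifications).
predicted = ["cat", "dog", "cat", "bird", "dog"]
actual    = ["cat", "dog", "bird", "bird", "dog"]
print(accuracy(predicted, actual))  # → 80.0
```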
Example-2: Driving Robot Training
Task T : Driving on highways using vision sensors
Performance measure P : Average distance traveled before an error
Training experience E: A sequence of images and steering commands recorded
while observing a human driver
Example-3: Chess Learning
Task T : Playing chess
Performance measure P : Percent of games won against opponents
Training experience E: Playing practice games against itself
Basic components of learning process
• The learning process can be divided into four components:
• data storage
• abstraction
• generalization
• evaluation
Fig 1. Components of Learning Process
Data Storage
• Facilities for storing and retrieving huge amounts of data are an important
component of the learning process. Humans and computers alike utilize data
storage as a foundation for advanced reasoning.
• In human beings, the data is stored in the brain and retrieved using electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to store and retrieve data.
Abstraction
• The second component of the learning process is known as abstraction.
• Abstraction is the process of extracting knowledge about stored data. This involves
creating general concepts about the data as a whole.
• The creation of knowledge involves application of known models and creation of
new models.
• The process of fitting a model to a dataset is known as training.
• When the model has been trained, the data is transformed into an abstract form that
summarizes the original information.
Generalization
• The third component of the learning process is known as generalization.
• The term generalization describes the process of turning the knowledge about
stored data into a form that can be utilized for future action.
• These actions are to be carried out on tasks that are similar, but not identical, to those that have been seen before.
• In generalization, the goal is to discover those properties of the data that will be
most relevant to future tasks.
Evaluation
• Evaluation is the last component of the learning process.
• It is the process of giving feedback to the user in order to measure the utility of the learned knowledge.
• This feedback is then utilized to improve the whole learning process.
Applications of machine learning
• In retail business, machine learning is used to study consumer behaviour.
• In finance, banks analyze their past data to build models to use in credit applications, fraud detection and the stock market.
• In manufacturing, learning models are used for optimization, control, and
troubleshooting.
• In medicine, learning programs are used for medical diagnosis.
Applications of machine learning (cont.)
• In telecommunications, call patterns are analyzed for network optimization and
maximizing the quality of service.
• It is used to find solutions to many problems in vision, speech recognition, and
robotics.
• Machine learning methods are applied in the design of computer-controlled vehicles.
• Machine learning methods have been used to develop programmes for playing games.
Understanding data
• Unit of observation
Unit of observation means the smallest entity with measurable properties of interest for a
study.
Examples:
o A person, an object or a thing
o A time point (second, minute, day, month, year etc.)
o A geographic region
o A measurement (Height, weight etc.)
• Examples
An “example” is an instance of the unit of observation for which properties have
been recorded. An “example” is also referred to as an “instance”, or “case” or
“record.”
Understanding data (cont.)
• Features
A “feature” is a recorded property or a characteristic of examples. It is also referred to as an “attribute” or a “variable.”
• Examples and features are generally collected in a “matrix format”.
Table 1. Examples and Features
Understanding data (cont.)
• Consider the problem of developing an algorithm for detecting cancer. In this study we note the
following.
The units of observation are the patients.
The examples are patients investigated for cancer (both positive and negative).
The following attributes of the patients may be chosen as the features:
o gender
o age
o blood pressure
o the findings of the pathology report after a biopsy
Understanding data (cont.)
• Spam e-mail detection
The unit of observation could be an e-mail message.
The examples would be specific messages (both spam and genuine email).
The features might consist of the words used in the messages.
Different forms of features (data)
• Numeric
If a feature represents a characteristic measured in numbers, it is called a numeric feature.
o Example: “year”, “price” and “mileage” (ref. to Table-1)
• Categorical or nominal
A categorical feature is an attribute that can take on one of a limited, and usually fixed,
number of possible values on the basis of some qualitative property. A categorical feature is
also called a nominal feature.
o Example: “model”, “color” and “transmission” (ref. to Table-1)
• Ordinal data
This denotes a nominal variable with categories falling in an ordered list.
o Examples include clothing sizes such as small, medium, and large
o measurement of customer satisfaction on a scale from “not at all happy” to “very happy.”
Mathematical Basics: Probability Theory
• Broadly speaking, probability theory is the mathematical study of uncertainty.
• It plays a central role in machine learning and pattern recognition, as the design of learning algorithms often relies on probabilistic assumptions about the data.
Probability Space
• When we speak about probability, we often refer to the probability of an event of
uncertain nature taking place
• Therefore, in order to discuss probability theory formally, we must first clarify what the possible events are to which we would like to attach probability.
• Formally, a probability space is defined by the triple (Ω, ℱ, P), where
 Ω is the space of possible outcomes, or outcome space;
 ℱ ⊆ 2^Ω is the space of measurable events, or event space;
 P is the probability measure (probability distribution) that maps an event E ∈ ℱ to a real value between 0 and 1.
Given the outcome space Ω, there are some restrictions on which subsets of 2^Ω can be considered an event space ℱ:
(i) The trivial event Ω and the empty event ∅ are in ℱ.
(ii) ℱ is closed under union: if α, β ∈ ℱ, then α ∪ β ∈ ℱ.
(iii) ℱ is closed under complement: if α ∈ ℱ, then Ω − α ∈ ℱ.
Probability Space (cont.)
• Given the event space ℱ, the probability measure P must satisfy certain axioms:
 i) Non-negativity: for all α ∈ ℱ, p(α) ≥ 0
 ii) Trivial event: p(Ω) = 1
 iii) Additivity: for all α, β ∈ ℱ with α ∩ β = ∅, p(α ∪ β) = p(α) + p(β)
Random variable
• Random variables play an important role in probability theory.
• The most important fact is that random variables are not variables but functions that map outcomes (of the outcome space) to real values.
• Random variables are denoted by capital letters.
Example:
• Consider the process of throwing a dice. Let X be a random variable that depends on the outcome of the throw.
• A natural choice for X would be to map the outcome i to the value i, that is, mapping the event of throwing a ‘one’ to 1.
• Suppose we have another random variable Y that maps all outcomes to 0.
• Suppose we have yet another random variable Z that maps each even outcome i to 2i and each odd outcome i to −i.
Random variable (cont.)
• In another sense, random variables allow us to abstract away the formal notation of event space, as we can define random variables that capture the appropriate events.
• For example, consider the event space of odd and even dice throws. We could have a random variable that takes on the value 1 if outcome i is even and 0 otherwise.
• These types of binary random variables are very common and are known as indicator variables, as they indicate whether a certain event happened or not.
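Since random variables are just functions on the outcome space, the dice example above can be written out directly. An illustrative sketch of the X, Y, Z and indicator mappings:

```python
# Random variables as functions on the outcome space of a dice throw.
outcomes = [1, 2, 3, 4, 5, 6]               # outcome space Ω

X = lambda i: i                             # identity mapping: outcome i → i
Y = lambda i: 0                             # maps every outcome to 0
Z = lambda i: 2 * i if i % 2 == 0 else -i   # even i → 2i, odd i → −i

# Indicator variable for the event "outcome is even".
is_even = lambda i: 1 if i % 2 == 0 else 0

print([X(i) for i in outcomes])        # → [1, 2, 3, 4, 5, 6]
print([Z(i) for i in outcomes])        # → [-1, 4, -3, 8, -5, 12]
print([is_even(i) for i in outcomes])  # → [0, 1, 0, 1, 0, 1]
```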
Joint Probability
• Joint probability is a statistical measure that calculates the likelihood of two events
occurring together at the same point of time
• We denote the probability of X taking a and Y taking b by (p(X=a, Y=b) or 𝑝𝑝𝑥𝑥,𝑦𝑦 𝑎𝑎, 𝑏𝑏
• Example. Let random variable X be defined on the outcome space Ω of a dice throw
𝑝𝑝𝑥𝑥 1 = 𝑝𝑝𝑥𝑥 2 = 𝑝𝑝𝑥𝑥 3 … . = 𝑝𝑝𝑥𝑥 6 =
1
6
Let Y be an indicator variable that takes a value 1 if a coin flip turns head and 0 if tail
Assuming both dice and coin are fair, the probability distribution of X and Y is given by
 P        X=1    X=2    X=3    X=4    X=5    X=6    p_Y(j)
 Y=0      1/12   1/12   1/12   1/12   1/12   1/12   1/2
 Y=1      1/12   1/12   1/12   1/12   1/12   1/12   1/2
 p_X(i)   1/6    1/6    1/6    1/6    1/6    1/6

Here, p(X=1, Y=0) = (1/6) × (1/2) = 1/12.
p_X(i) and p_Y(j) are called marginal probabilities.
Definition: Marginal Probability Distribution
• The marginal distribution refers to the probability distribution of a random variable on its own.
• Given a joint distribution, say over random variables X and Y, we can find the marginal distribution of X or that of Y.
Joint Probability (cont.)
• If X and Y are discrete random variables and p(x, y) is the value of their joint probability distribution at (x, y), then the functions given by
 p(x) = Σ_{b ∈ Val(Y)} p(x, Y = b)
and
 p(y) = Σ_{a ∈ Val(X)} p(X = a, y)
are the marginal distributions of X and Y respectively.
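A minimal sketch of these marginalisation formulas, using the dice-and-coin joint distribution from the table above:

```python
from fractions import Fraction

# Joint distribution of a fair dice throw X and a fair coin flip Y:
# p(X=i, Y=j) = 1/12 for every pair (i, j).
joint = {(i, j): Fraction(1, 12) for i in range(1, 7) for j in (0, 1)}

# Marginals: sum the joint distribution over the other variable.
p_X = {i: sum(p for (a, b), p in joint.items() if a == i) for i in range(1, 7)}
p_Y = {j: sum(p for (a, b), p in joint.items() if b == j) for j in (0, 1)}

print(p_X[1])           # → 1/6
print(p_Y[0])           # → 1/2
print(sum(joint.values()))  # → 1  (total probability)
```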
Conditional Distribution
• Conditional distributions specify the distribution of a random variable when the value of another random variable is known, or more generally, when some events are known to be true.
• Formally, the conditional probability of X=a given Y=b is defined as:
 p(X = a | Y = b) = p(X = a, Y = b) / p(Y = b)
• This is not defined when p(Y = b) = 0.
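The defining formula can be sketched directly; the numbers below come from the earlier dice-and-coin joint distribution:

```python
from fractions import Fraction

def conditional(p_joint, p_b):
    """p(X=a | Y=b) = p(X=a, Y=b) / p(Y=b); undefined when p(Y=b) = 0."""
    if p_b == 0:
        raise ValueError("conditional probability undefined: p(Y=b) = 0")
    return p_joint / p_b

# p(X=1, Y=0) = 1/12 and p(Y=0) = 1/2, so p(X=1 | Y=0) = 1/6:
# knowing the coin flip tells us nothing about the dice throw.
print(conditional(Fraction(1, 12), Fraction(1, 2)))  # → 1/6
```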
Some Important Theorems on Probability
Theorem-1: If A and B are mutually exclusive events, then p(A ∪ B) = p(A) + p(B)
Theorem-2: If two events A and B are exhaustive and mutually exclusive, then p(A) + p(B) = 1
Theorem-3: For any event A, p(A′) = 1 − p(A)
Theorem-4: If event A implies event B (A → B, i.e. A ⊂ B), then
(i) p(A) ≤ p(B)
(ii) p(B − A) = p(B) − p(A)
Theorem-5: Suppose A and B are events that are not necessarily mutually exclusive; then
 p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
Independence & Chain Rule
Independence
• In probability theory, independence means that the distribution of a random variable does not change on learning the value of another variable.
• Mathematically, if
 p(A) = p(A | B)
then A and B are independent.
Example: The results of successive tosses of a coin are independent.
Chain Rule
The chain rule is often used to evaluate the joint probability of some random variables, and is especially useful when there is conditional independence across variables:
 P(x₁, x₂, x₃, …, xₙ) = p(x₁) · p(x₂ | x₁) ⋯ p(xₙ | x₁, x₂, …, xₙ₋₁)
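A tiny worked instance of the chain rule for two successive fair coin flips (illustrative numbers only):

```python
from fractions import Fraction

# Chain rule for n = 2: P(x1, x2) = p(x1) * p(x2 | x1).
# Successive flips are independent, so p(x2 | x1) = p(x2) = 1/2.
p_x1 = Fraction(1, 2)
p_x2_given_x1 = Fraction(1, 2)   # independence: conditioning changes nothing

p_joint = p_x1 * p_x2_given_x1
print(p_joint)  # → 1/4
```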
Bayes Rule
• The Bayes rule allows us to compute the conditional probability p(x|y) from p(y|x), in a sense inverting the condition:
 p(x|y) = p(y|x) · p(x) / p(y)
• p(x|y) is called the posterior; this is what we are trying to estimate. For example, the “probability of having cancer given that the person is a smoker”.
• p(y|x) is called the likelihood; this is the probability of observing the new evidence given our initial hypothesis. In the above example, this would be the “probability of being a smoker given that the person has cancer”.
• p(x) is called the prior; this is the probability of our hypothesis without any additional information. In the above example, this would be the “probability of having cancer”.
• p(y) is called the marginal likelihood; this is the total probability of observing the evidence. In the above example, this would be the “probability of being a smoker”. In many applications of the Bayes rule, this is ignored, as it mainly serves as normalization.
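A worked sketch of the Bayes rule with made-up illustrative numbers (these are not real cancer or smoking statistics):

```python
# Bayes rule: posterior = likelihood * prior / marginal likelihood.
p_cancer = 0.01                  # prior p(x), assumed for illustration
p_smoker_given_cancer = 0.30     # likelihood p(y|x), assumed
p_smoker = 0.20                  # marginal likelihood p(y), assumed

p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
print(round(p_cancer_given_smoker, 3))  # → 0.015
```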
Probability Distribution
• A probability distribution describes how the total probability is distributed over the possible values of a random variable.
• It tells us what the possible values of a random variable X are, and how probabilities are assigned to those values.
• The set of possible values of a random variable, together with their respective probabilities, is called the probability distribution of the random variable.
• A probability distribution can be continuous or discrete.
• The probability distribution of a continuous random variable is called a continuous probability distribution, and the probability distribution of a discrete random variable is called a discrete probability distribution.
PMF & CDF
Probability Mass Function (PMF)
The probability mass function (pmf) of a discrete random variable X is a function, denoted by p(x) or f(x), that satisfies:
i) p(x) ≥ 0 for all values of X, and
ii) Σᵢ p(xᵢ) = 1
where p(x) = p(X = x) is the probability that the random variable X takes the value x.
Cumulative Distribution Function (CDF)
For any real number x, the function F(x) = p(X ≤ x) is known as the cumulative distribution function (cdf), or simply the distribution function.
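Both definitions can be checked on a fair dice throw; a minimal sketch using exact fractions:

```python
from fractions import Fraction

# PMF of a fair dice throw, and its CDF F(x) = p(X <= x).
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

def cdf(x):
    return sum(p for value, p in pmf.items() if value <= x)

assert sum(pmf.values()) == 1   # pmf axiom (ii): probabilities sum to 1
print(cdf(3))   # → 1/2
print(cdf(6))   # → 1
```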
Mean and Variance of a Random Variable
• The mean of a random variable X, denoted by μ, describes where the population distribution of X is centered.
• The mean, or expectation, of a discrete random variable X having p.m.f. p(x) is defined as
 μ = E(X) = Σₓ x · p(x)
provided the sum exists, i.e. is absolutely convergent (Σₓ |x| · p(x) < ∞); otherwise we say that the mean does not exist.
• The variance (σ²) of a discrete random variable X is defined as
 σ² = V(X) = E[(X − E(X))²] = E[(X − μ)²] = Σₓ (x − μ)² p(x)
or
 σ² = V(X) = E(X²) − [E(X)]² = Σₓ x² p(x) − μ²
• The standard deviation (σ) of a random variable is the positive square root of its variance, i.e. σ = +√σ²
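The mean and both variance formulas can be verified for a fair dice throw; a minimal sketch using exact fractions:

```python
from fractions import Fraction

# Mean and variance of a fair dice throw, computed from its pmf.
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                      # E(X)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())         # E[(X - mu)^2]
var_alt = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2  # E(X^2) - mu^2

print(mean)            # → 7/2
print(var)             # → 35/12
assert var == var_alt  # the two variance formulas agree
```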
Median and Mode of a Random Variable
• The median of a random variable X is the value below (or above) which half of the total probability lies.
• Thus, if F(x) is the cumulative distribution function of a random variable X and μ̄ is the median, then
 F(μ̄) = 1/2
or we may write
 p(X ≤ μ̄) ≥ 1/2 and p(X ≥ μ̄) ≥ 1/2
• The mode of a random variable X is the value which occurs with the highest probability, i.e. the value of X at which the pmf f(x) is maximum.
• Thus, if f(x) is the pmf of a random variable X and μ̂ is the mode, then
 f(μ̂) ≥ f(x), ∀x
• The mode need not be unique.
Conditional Expectation
• The conditional expectation of X, given that the event A has happened, is defined as
 E(X | A) = Σ_{x|A} x · p(X = x | A)
where (X = x | A) indicates the values of X that are favourable to A.
Probability Density Function
• To define a continuous distribution, we make use of a probability density function (PDF).
• A probability density function f is a non-negative, integrable function such that
 ∫_{Val(X)} f(x) dx = 1
The probability of a random variable X distributed according to a PDF f is computed as follows:
 P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
This implies that the probability of a continuously distributed random variable taking on any single given value is zero.
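The integral identities above can be checked numerically; a sketch using a midpoint Riemann sum and the uniform density on [0, 1] as an assumed example distribution:

```python
# Numerically check P(a <= X <= b) = integral of f over [a, b]
# using a simple midpoint Riemann sum.
def integrate(f, a, b, n=100000):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 1.0  # uniform density on [0, 1]

print(round(integrate(f, 0.0, 1.0), 6))  # → 1.0  (total probability)
print(round(integrate(f, 0.2, 0.5), 6))  # → 0.3  (P(0.2 <= X <= 0.5))
print(integrate(f, 0.4, 0.4))            # → 0.0  (a single value has probability 0)
```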
Skewness and Kurtosis
• Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution,
or data set, is symmetric if it looks the same to the left and right of the center point.
• Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution would be the extreme light-tailed case.
• A heavy tailed distribution has a tail that’s heavier than an exponential
distribution (Bryson, 1974).
• In other words, a distribution that is heavy tailed goes to zero slower than one with lighter
tails.
• Heavy tailed distributions tend to have many outliers. An outlier is a data point that differs
significantly from other observations
• The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.
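Sample skewness and kurtosis can be estimated from standardized moments; a minimal sketch (this is the plain moment estimator, without the bias corrections some statistics packages apply):

```python
# Sample skewness and kurtosis from standardized central moments:
# skewness = m3 / m2^(3/2), kurtosis = m4 / m2^2.
def moments(data):
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A symmetric sample has zero skewness.
skew, kurt = moments([1, 2, 3, 4, 5])
print(round(skew, 6))  # → 0.0
print(round(kurt, 6))  # → 1.7
```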
Symmetrical and Asymmetrical Distribution
• A symmetric distribution is a type of distribution where the left side of the distribution mirrors
the right side.
• The normal distribution is symmetric. It is also a unimodal distribution (it has one peak).
• Distributions don’t have to be unimodal to be symmetric. They can be bimodal (two peaks) or multimodal (many peaks). A bimodal distribution is symmetric if its two halves are mirror images of each other.
• In a symmetric distribution, the mean, mode and median all fall at the same point. The mode is
the most common number and it matches with the highest peak
• An exception is the bimodal distribution. The mean and median are still in the center, but there
are two modes: one on each peak.
Symmetrical and Asymmetrical Distribution (cont.)
• If one tail is longer than another, the distribution is skewed. These distributions are sometimes
called asymmetric or asymmetrical distributions as they don’t show any kind of symmetry.
• A left-skewed distribution has a long left tail. Left-skewed distributions are also
called negatively-skewed distributions. That’s because there is a long tail in the negative
direction on the number line. The mean is also to the left of the peak.
• A right-skewed distribution has a long right tail. Right-skewed distributions are also called positively-skewed distributions. That’s because there is a long tail in the positive direction on the number line. The mean is also to the right of the peak.
Symmetrical and Asymmetrical Distribution (cont.)
• A left-skewed, negative distribution will have the mean to the left of the median.
• A right-skewed distribution will have the mean to the right of the median.
• Real-life distributions are usually skewed.
• If the skewness is high, many statistical techniques don’t work.
• As a result, transformations such as logarithms are used.
Some important Distribution: Binomial
• A binomial experiment is one that possesses the following properties:
1) The experiment consists of n repeated trials.
2) Each trial results in an outcome that may be classified as a success or a failure (hence the name, binomial).
3) The probability of a success, denoted by p, remains constant from trial to trial, and repeated trials are independent.
• The number of successes X in n trials of a binomial experiment is called a binomial random variable.
• The probability distribution of the random variable X is called a binomial distribution, and is given by the formula:
 P(X = x) = C(n, x) p^x (1 − p)^(n−x), where x = 0, 1, 2, …, n
and C(n, x) = n! / (x!(n − x)!) is the binomial coefficient.
• P(X = x) gives the probability of x successes in n binomial trials.
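The binomial formula translates directly into code with `math.comb`; the coin-flip numbers below are illustrative:

```python
from math import comb

# Binomial pmf: P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)
def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 2 heads in 4 fair coin flips: C(4, 2) / 16 = 6/16.
print(binomial_pmf(2, 4, 0.5))  # → 0.375

# The probabilities over all possible x sum to 1.
assert abs(sum(binomial_pmf(x, 4, 0.5) for x in range(5)) - 1) < 1e-12
```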
Some important Distribution: Bernoulli
• The Bernoulli distribution is a discrete distribution having two possible outcomes, labelled n=0 and n=1, in which n=1 ("success") occurs with probability p and n=0 ("failure") occurs with probability q = 1 − p, where 0 < p < 1.
• It therefore has probability mass function
 p(n) = 1 − p for n = 0
 p(n) = p for n = 1
which can also be written as
 p(n) = p^n (1 − p)^(1−n)
• The Bernoulli distribution is a special case of the binomial distribution with n = 1.
Some important Distribution: Poisson
• The Poisson distribution is a very useful distribution that deals with the arrival of events.
• It is a discrete probability distribution. It measures the probability of a number of events happening over a fixed period of time, given a fixed average rate of occurrence and assuming the events take place independently of the time since the last event.
• It is parameterized by the average arrival rate λ.
• The probability mass function (the probability of observing k events over the time period) is given by
 P(X = k) = exp(−λ) λ^k / k!
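The Poisson pmf translates directly into code; a minimal sketch with an assumed rate λ = 2:

```python
from math import exp, factorial

# Poisson pmf: P(X = k) = exp(-lam) * lam**k / k!
def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

# With an average arrival rate of 2 events per period, the probability
# of observing exactly 2 events is exp(-2) * 2**2 / 2! = 2 * exp(-2).
print(round(poisson_pmf(2, 2.0), 4))  # → 0.2707
```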
Some important Distribution: Gaussian
• A random variable X whose distribution has the shape of a normal curve is called a normal random variable.
• This random variable X is said to be normally distributed with mean μ and standard deviation σ if its probability density function is given by
 f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))
• The normal distribution is also called the Gaussian distribution.
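The density formula can be sketched directly; standard normal parameters (μ = 0, σ = 1) are assumed in the example call:

```python
from math import exp, pi, sqrt

# Gaussian density: f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)**2 / (2*sigma**2))
def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# The standard normal peaks at its mean: f(0) = 1/sqrt(2*pi) ≈ 0.3989.
print(round(gaussian_pdf(0.0), 4))  # → 0.3989
```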
Some important Distribution: Gaussian (cont.)
• The Gaussian distribution is the most versatile distribution in probability theory and appears in a wide range of contexts.
• Whenever you measure things like people's heights, weights, salaries, opinions or votes, the probability distribution is very often Gaussian.
• It can be used to approximate the binomial distribution when the number of experiments is large, and the Poisson distribution when the average arrival rate is high.
• In many cases we deal with multivariate Gaussian distributions.
• A k-dimensional multivariate Gaussian distribution is parameterized by (μ, Σ), where μ is a vector of means in ℝ^k and Σ is a covariance matrix in ℝ^(k×k). In other words, Σᵢᵢ = var(Xᵢ) and Σᵢⱼ = cov(Xᵢ, Xⱼ).
• The probability density function is defined by
 f(x) = (1 / √((2π)^k |Σ|)) exp(−(1/2) (x − μ)^T Σ⁻¹ (x − μ))
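For the bivariate case (k = 2), the density can be written out with an explicit 2×2 inverse; a minimal sketch, assuming the covariance matrix is invertible:

```python
from math import exp, pi, sqrt

# 2-D multivariate Gaussian density:
# f(x) = exp(-0.5 * (x-mu)^T Sigma^{-1} (x-mu)) / sqrt((2*pi)^k * |Sigma|)
def mvn_pdf_2d(x, mu, cov):
    (a, b), (c, d) = cov
    det = a * d - b * c                       # |Sigma|
    inv = [[d / det, -b / det], [-c / det, a / det]]  # Sigma^{-1} for 2x2
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return exp(-0.5 * q) / sqrt((2 * pi) ** 2 * det)

# Standard bivariate normal (zero mean, identity covariance) at the mean:
# f(0, 0) = 1 / (2*pi) ≈ 0.1592.
print(round(mvn_pdf_2d([0, 0], [0, 0], [[1, 0], [0, 1]]), 4))  # → 0.1592
```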