Basic statistics 1

Data Science
Statistical Analysis: Estimation and Testing
By
Kumar P

Managerial Decisions
• How many programmers should I staff for?
• What is the right level of inventory for our new product manufacturing?
• Where should we open our new retail store?
• What will next year's revenue be?
• Are we on the right or the wrong track?
• How much should I invest in advertising?

Flow Diagram
1. Acknowledge uncertainty
2. Characterize uncertainty
3. Make inferences under uncertainty
4. Make predictions under uncertainty
5. Make optimal decisions under uncertainty

Types of Statistics
Statistics
Descriptive
Inferential

Descriptive statistics
Descriptive statistics uses numerical and graphical methods to look for patterns
in a data set, to summarize the information it reveals, and to present
that information in a convenient form.
• Average
• Spread
• Range
• Frequency
• Histogram
• Mode
• Scatter Plot
• Interquartile Range

Inferential statistics
• Hypothesis Test
• Z-score
• ANOVA
• Confidence interval
• Margin of error
• Ordinary least squares
• t-test
• F-test

Types of Data
Type of data | Definition | Example
Nominal | Categories have no logical order and no particular relationship | Your previous degree
Ordinal | Can be ranked/ordered but not measured | College rankings
Interval scale | Set of numerical measurements in which the distance between numbers is of a known, constant size | Temperature in Celsius
Ratio scale | Ratios are meaningful | Sales of a new product

Source of data | Definition | Example
Observational | Analyst does not control the data generation process | Stock returns on BSE
Experimental | Analyst has good control over data generation | Clinical trials for drug efficacy

A Few Examples
Classify each of the following by type of data:
1. The length of time until a pain reliever begins to work.
2. Ranking of racers in MotoGP.
3. The number of colors used in a statistics textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer’s hard disk.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.

Population & Sample
Population: A collection, or set, of individuals or objects or events whose
properties are to be analyzed.
Typically, there are too many experimental units in a population to consider
every one.
Sample: A subset of the population.

Measure of Central Tendency
Mode: The value in the data that occurs most frequently.
Mean: The average of a given set of numbers.
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Percentiles: The pth percentile of a group of numbers is the value below which
p% of the numbers in the group lie.
Position of the pth percentile in the sorted data = (n+1)p/100, where n is the number of data points.
Median: The 50th percentile.
Quartiles: Percentiles that divide the distribution of the data into quarters:
1st quartile (25th percentile) and 3rd quartile (75th percentile).
Interquartile range (IQR): Difference between the 3rd and 1st quartiles.
Value | Frequency
18 | 4
19 | 1
20 | 3
21 | 1
22 | 2
23 | 2
24 | 1
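
The measures above can be checked on the frequency table with a short Python sketch. The `percentile` helper is a hypothetical name, and linear interpolation between neighbouring values at a fractional position is an assumption; the slide only gives the (n+1)p/100 position rule.

```python
from collections import Counter

# Frequency table from the slide: value -> frequency
freq = {18: 4, 19: 1, 20: 3, 21: 1, 22: 2, 23: 2, 24: 1}

# Expand the table into the sorted list of 14 raw observations
data = sorted(v for v, f in freq.items() for _ in range(f))

def percentile(values, p):
    """p-th percentile via the (n + 1) * p / 100 position rule.

    Assumption: fractional positions are resolved by linear
    interpolation between the two neighbouring sorted values.
    """
    xs = sorted(values)
    n = len(xs)
    pos = (n + 1) * p / 100      # 1-based position in the sorted data
    if pos <= 1:
        return xs[0]
    if pos >= n:
        return xs[-1]
    lower = int(pos)             # whole part of the position
    frac = pos - lower           # fractional part used to interpolate
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

mean = sum(data) / len(data)
mode = Counter(data).most_common(1)[0][0]   # most frequent value
median = percentile(data, 50)
q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1

print(f"mean={mean:.3f} mode={mode} median={median} Q1={q1} Q3={q3} IQR={iqr}")
```

Other software may use a different percentile convention, so quartiles can differ slightly between tools.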

Quick Exercise
Data: 33, 26, 24, 21, 18, 52, 19
Mean?
Mode?
Median?
IQR?

Measure of Variability
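Range: Difference between the largest and smallest numbers in a given data set.
Variance: The average squared deviation of the data points from their mean.
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean
Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$, where $\mu$ is the population mean
Standard deviation: Square root of the variance of the data set.
Sample SD: $s = \sqrt{s^2}$
Population SD: $\sigma = \sqrt{\sigma^2}$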
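
A minimal Python sketch of these measures, using the standard library's `statistics` module. The data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical data set (not from the slides); its mean is 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)      # largest minus smallest

pop_var = statistics.pvariance(data)     # population variance, divides by N
sample_var = statistics.variance(data)   # sample variance, divides by n - 1

pop_sd = statistics.pstdev(data)         # square root of population variance
sample_sd = statistics.stdev(data)       # square root of sample variance

print(value_range, pop_var, sample_var, pop_sd, sample_sd)
```

The sample variance comes out larger than the population variance for the same numbers because it divides the same sum of squared deviations by n - 1 instead of N.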

Spare some thoughts
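Why do we need both standard deviation and variance?
Why do the sample and population formulas use different denominators?
Why not use the absolute value (mod) of the deviations instead of squaring them?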

Histogram
• A histogram is a chart made of bars, where the height of each bar represents the
frequency of values
• Frequencies can be absolute frequencies (counts) or relative frequencies
• The relative frequency of a value is its count divided by the total
number of data points
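
As an illustration, absolute and relative frequencies can be tabulated in a few lines of Python. This is a sketch only: the data reuse the values from the frequency table shown earlier, and the text bars stand in for a real chart.

```python
from collections import Counter

# Observations matching the earlier frequency table (18 x4, 19 x1, ...)
data = [18, 18, 18, 18, 19, 20, 20, 20, 21, 22, 22, 23, 23, 24]

absolute = Counter(data)                 # value -> count
total = len(data)
relative = {v: c / total for v, c in sorted(absolute.items())}

# Crude text histogram: one '#' per observation
for value, rel in relative.items():
    bar = "#" * absolute[value]
    print(f"{value:>3} {bar:<5} count={absolute[value]} relative={rel:.3f}")
```

The relative frequencies always sum to 1, which is what lets a histogram of relative frequencies be read as an empirical probability distribution.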

Boxplot
A boxplot is a graphical display of the five-number summary of the distribution of
the data: minimum, 1st quartile, median, 3rd quartile, and maximum

Skewness
Skewness is a measure of the degree of asymmetry of a frequency
distribution

Kurtosis
Kurtosis is commonly described as a measure of the peakedness of a distribution; it also reflects how heavy the tails are
The kurtosis of a normal distribution is 3
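
The slides give no formulas for skewness or kurtosis. The sketch below assumes the common moment-based definitions (third and fourth central moments divided by powers of the second central moment), under which a normal distribution has kurtosis 3. The data and the `central_moment` helper are hypothetical.

```python
# Hypothetical data set (not from the slides); its mean is 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

def central_moment(k):
    """k-th central moment: average of (x - mean) ** k over the data."""
    return sum((x - mean) ** k for x in data) / n

m2 = central_moment(2)                    # population variance

# Assumed moment-based definitions
skewness = central_moment(3) / m2 ** 1.5  # 0 for a symmetric distribution
kurtosis = central_moment(4) / m2 ** 2    # 3 for a normal distribution

print(f"skewness={skewness:.5f} kurtosis={kurtosis:.5f}")
```

A positive skewness here reflects the longer right tail of the data. Statistical packages often report excess kurtosis (kurtosis minus 3) or apply small-sample corrections, so their numbers may differ.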

Random Variable
• What is a random variable?
• How do we summarize a random variable?
• How do we pictorially represent a probability distribution?

Random Variable
A random variable describes the probabilities for an uncertain future numerical
outcome of a random process
It is a variable that can take on several possible values
It is random because there is some chance associated with each possible value
Random variables are of 2 types
• Discrete
• Continuous

Probability Distribution
• Probability
o Long-run relative frequency with which a random event occurs
o Different from subjective beliefs
• A probability distribution is a rule that identifies the possible outcomes of a
random variable and assigns a probability to each
• A discrete distribution has a finite or countable number of values
o E.g. face value of a card, number of students in a class
• A continuous distribution can take all possible values in some range
o E.g. salaries per month, temperature in a month

PDF & CDF of Random Variable
The PDF (probability distribution function) of a discrete random variable x is the
relative frequency distribution of x. It is a graph, table, or formula that gives
the possible values of x and the probability p(x) associated with each value.
For all x_i, the PDF must satisfy
$0 \le p(x_i) \le 1$ and $\sum_i p(x_i) = 1$
The CDF (cumulative distribution function) F(x) of a discrete random variable is
$F(x) = P(X \le x) = \sum_{i \le x} p(i)$
x | p(X=x) | F(x)
0 | 0.1 | 0.1
1 | 0.2 | 0.3
2 | 0.3 | 0.6
3 | 0.2 | 0.8
4 | 0.1 | 0.9
5 | 0.1 | 1.00
Total | 1.00 |
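
The table can be reproduced with a short Python sketch that checks the two conditions on p(x) and then accumulates the probabilities into F(x). The variable names are illustrative.

```python
from itertools import accumulate

# Probabilities p(X = x) from the table above
pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.2, 4: 0.1, 5: 0.1}

# Conditions every discrete distribution must satisfy
valid_range = all(0 <= p <= 1 for p in pmf.values())
sums_to_one = abs(sum(pmf.values()) - 1.0) < 1e-9

# F(x) = P(X <= x): running total of the probabilities in order of x
xs = sorted(pmf)
cdf = dict(zip(xs, accumulate(pmf[x] for x in xs)))

for x in xs:
    print(f"{x}  p={pmf[x]:.2f}  F={cdf[x]:.2f}")
```

Probabilities are compared with a small tolerance because decimal fractions such as 0.1 are not stored exactly in floating point.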

Example
Toss a fair coin three times and define
x = number of heads.
Outcome | Probability | x
HHH | 1/8 | 3
HHT | 1/8 | 2
HTH | 1/8 | 2
THH | 1/8 | 2
HTT | 1/8 | 1
THT | 1/8 | 1
TTH | 1/8 | 1
TTT | 1/8 | 0
P(x = 0) = 1/8
P(x = 1) = 3/8
P(x = 2) = 3/8
P(x = 3) = 1/8
x | p(x)
0 | 1/8
1 | 3/8
2 | 3/8
3 | 1/8
[Figure: Probability histogram for x]
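
The same distribution can be obtained by enumerating the eight equally likely outcomes in Python. This is a sketch of the counting argument, using exact fractions so the results match the slide's 1/8 and 3/8.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 equally likely outcomes of three tosses, e.g. "HHT"
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# Count how many outcomes give each number of heads
counts = Counter(outcome.count("H") for outcome in outcomes)

# p(x) = (number of outcomes with x heads) / (total number of outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(x = {x}) = {p}")
```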

Quick exercise
A card is chosen at random from a deck of cards.
What is the probability of getting an ace?
What is the probability of getting a card with a value less than 3?
What is the probability of getting 1 head if I toss 2 unbiased coins?
What is the probability of getting 2 heads if I toss 3 unbiased coins?

An Example
Daily sales of TVs at a store:
x | p(X=x)
0 | 0.4
1 | 0.25
2 | 0.2
3 | 0.05
4 | 0.1
• What is the probability of a sale?
• What is the probability of selling at least three TVs?

Expected Value or Mean
• The expected value or mean (µ) of a random variable is
the weighted average of its values
‒ The probabilities serve as weights
‒ $E(X) = \sum_{i=1}^{n} x_i \, p(X = x_i)$
• What is the mean number of TVs sold per day?
• What does this imply?
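
One way to check the TV question is to apply the weighted-average formula directly. The sketch below uses the daily TV sales distribution from the earlier example; the variable names are illustrative.

```python
# Daily TV sales distribution from the earlier example: x -> p(X = x)
tv_sales = {0: 0.4, 1: 0.25, 2: 0.2, 3: 0.05, 4: 0.1}

# E(X) = sum of x_i * p(X = x_i); the probabilities act as weights
expected_sales = sum(x * p for x, p in tv_sales.items())

print(f"Mean number of TVs sold per day: {expected_sales:.2f}")
```

The mean need not be a value the variable can actually take: it is a long-run average over many days, not a prediction for any single day.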

Variance and Standard Deviation
• Both are measures of variation or uncertainty in a random variable
• Variance (σ²): The weighted average of the squared deviations from the
mean
‒ Probabilities serve as weights
‒ $\sigma^2(X) = \sum_{i=1}^{n} (x_i - \mu)^2 \, p(X = x_i) = E[(X - \mu)^2]$
‒ Units are the square of the units of the variable
‒ Another way: $Var(X) = E(X^2) - [E(X)]^2$
• Standard deviation (σ): Square root of the variance
‒ Has the same units as the variable
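
As a rough check that the two variance formulas agree, the sketch below evaluates both on the daily TV sales distribution from the earlier example. Variable names are illustrative.

```python
import math

# Daily TV sales distribution from the earlier example: x -> p(X = x)
tv_sales = {0: 0.4, 1: 0.25, 2: 0.2, 3: 0.05, 4: 0.1}

mu = sum(x * p for x, p in tv_sales.items())               # E(X)

# Definition: weighted average of squared deviations from the mean
var_definition = sum((x - mu) ** 2 * p for x, p in tv_sales.items())

# Shortcut: E(X^2) - [E(X)]^2
e_x2 = sum(x ** 2 * p for x, p in tv_sales.items())
var_shortcut = e_x2 - mu ** 2

sd = math.sqrt(var_definition)   # same units as X (TVs per day)

print(f"variance={var_definition:.4f} shortcut={var_shortcut:.4f} sd={sd:.4f}")
```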

Sum of Random Variables
Let X1 and X2 be 2 random variables with means µ1 and µ2 and standard
deviations σ1 and σ2. Suppose Y = aX1 + bX2.
‒ What is the mean of Y?
E[Y] = aE[X1] + bE[X2]
‒ What is the standard deviation of Y?
If X1 and X2 are independent, Var(Y) = a²Var(X1) + b²Var(X2), and SD(Y) = √Var(Y)
• Independent: When the value taken by one random variable does not affect
the value taken by the other random variable
‒ E.g. Rolls of 2 dice
• Dependent: When the value of one random variable gives us more
information about the other random variable
‒ E.g. Height and weight of students

Example
Let X1 and X2 be the outcomes of tossing a pair of dice
E(X1) = E(X2) = 3.5
SD(X1) = SD(X2) = 1.708
Compute the following:
E(X1+X2) =
SD(X1+X2) =
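
One way to check an answer to this exercise is to enumerate all 36 equally likely outcomes of two fair, independent dice and compare the result with the rule Var(X1 + X2) = Var(X1) + Var(X2). This is a sketch with illustrative variable names; it assumes fair dice.

```python
import math
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Mean and variance of a single fair die
mean_one = Fraction(sum(faces), 6)
var_one = sum((f - mean_one) ** 2 for f in faces) / 6

# Brute force over all 36 equally likely (X1, X2) pairs
p = Fraction(1, 36)
mean_sum = sum((a + b) * p for a, b in product(faces, repeat=2))
var_sum = sum((a + b - mean_sum) ** 2 * p for a, b in product(faces, repeat=2))
sd_sum = math.sqrt(var_sum)

# Independence rule with a = b = 1: the variances add, the SDs do not
var_by_rule = var_one + var_one

print(f"E(X1+X2)={float(mean_sum)} SD(X1+X2)={sd_sum:.3f}")
print("Variances add:", var_sum == var_by_rule)
```

The standard deviation of the sum is √2 times the single-die value, not twice it, because it is the variances that add.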

The Empirical Rule
For data that are approximately bell-shaped (normal):
• Approximately 68% of the data points will be within 1 standard deviation of
the mean
• Approximately 95% of the data points will be within 2 standard
deviations of the mean
• A vast majority (almost all, about 99.7%) will lie within 3 standard deviations of the
mean
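
These percentages come from the normal distribution and can be recovered with Python's standard-library `statistics.NormalDist`. The sketch uses the standard normal (mean 0, standard deviation 1); the same proportions hold for any normal distribution.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} SD of the mean: {prob:.4f}")
```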

Normal distribution
• The graph of the PDF is a bell-shaped curve
• The normal random variable takes values from -∞ to +∞
• It is symmetric and centered around the mean (which is also the median and the
mode)
• Any normal distribution can be specified with just 2 parameters: the mean (µ)
and the standard deviation (σ)
• We write this as X ~ N(µ, σ²)

Comparing multiple normal distributions

Probability Calculation for Continuous Distributions
• The probability associated with any single value of the random variable is always
zero
• Probability of values being in a range = Area under the PDF curve in that range
• Area under the entire curve always equals 1

Z-scores, Standard Normal Distribution
For every value (x) of the random variable X, we calculate its z-score:
$z = \frac{x - \mu}{\sigma}$
Interpretation: How many standard deviations away is the value from the
mean?
If X ~ N(µ, σ²) then
‒ Z-scores have a normal distribution with µ=0 and σ=1
‒ i.e. Z ~ N(0,1)
‒ Standard normal distribution
• Inverse transformation
‒ X = µ + zσ
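
A minimal Python sketch of the transformation and its inverse. The function names and the numbers (mean 100, standard deviation 15) are hypothetical, chosen only for illustration.

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Inverse transformation: recover x from its z-score."""
    return mu + z * sigma

# Hypothetical parameters, not from the slides
mu, sigma = 100.0, 15.0

z = z_score(130.0, mu, sigma)     # 130 is 2 standard deviations above the mean
x_back = from_z(z, mu, sigma)     # maps back to the original value

print(z, x_back)
```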

Probability Calculation for Normal Distribution
• Consider a normal distribution X ~ N(µ, σ²)
• Methods to calculate P(X ≤ x)
‒ Use R: pnorm(x, µ, σ)
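
For readers working outside R, a comparable calculation can be sketched in Python with the standard-library `statistics.NormalDist`. The wrapper below is named `pnorm` only to mirror the R call; it is a hypothetical helper, not a Python built-in, and the example numbers are illustrative.

```python
from statistics import NormalDist

def pnorm(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), mirroring R's pnorm(x, mu, sigma)."""
    return NormalDist(mu, sigma).cdf(x)

# Hypothetical example: X ~ N(100, 15^2)
p_below_130 = pnorm(130, 100, 15)          # P(X <= 130)
p_between = pnorm(115, 100, 15) - pnorm(85, 100, 15)   # P(85 <= X <= 115)

print(f"P(X <= 130) = {p_below_130:.4f}")
print(f"P(85 <= X <= 115) = {p_between:.4f}")
```

Probabilities for a range are differences of two cumulative probabilities, which corresponds to the area under the PDF curve between the two endpoints.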
