SlideShare a Scribd company logo
1 of 122
Download to read offline
Fundamentals to Biostatistics
Prof. Chandan Chakraborty
Associate Professor
School of Medical Science & Technology
IIT Kharagpur
Statistics
Math. Statistics
Applied Statistics
Biostatistics
collection, analysis, interpretation of data
development of
new statistical
theory &
inference
application of the
methods derived from
math. statistics to
subject specific areas
like psychology,
economics and public
health
statistical methods are applied to
medical, health and biological data
Areas of application of Biostatistics: Environmental Health, Genetics, Pharmaceutical research,
Nutrition, Epidemiology and Health surveys etc
Some Statistical Tools for Medical Data
Analysis
 Data collection and Variables under study
 Descriptive Statistics & Sampling Distribution
Statistical Inference – Estimation, Hypothesis Testing, Conf. Interval
 Association
Continuous: Correlation and Regression
Categorical: Chi-square test
 Multivariate Analysis
PCA, Clustering Techniques, Discriminantion & Classification
 Time Series Analysis
AR, MA, ARMA, ARIMA
Population vs. Sample
Parameter vs. Statistics
Variable
 Definition: characteristic of interest in a study that has different
values for different individuals.
 Two types of variable
Continuous: values form continuum
Discrete: assume discrete set of values
 Examples
Continuous: blood pressure
Discrete: case/control, drug/placebo
6
Univariate Data
• Measurements on a single variable X
• Consider a continuous (numerical) variable
• Summarizing X
– Numerically
• Center
• Spread
– Graphically
• Boxplot
• Histogram
7
Measures of center: Mean
• The mean value of a variable is obtained by
computing the total of the values divided by the
number of values
• Appropriate for distributions that are fairly symmetrical
• It is sensitive to presence of outliers, since all values
contribute equally
8
Measures of center: Median
• The median value of a variable is the number
having 50% (half) of the values smaller than it (and
the other half bigger)
• It is NOT sensitive to presence of outliers, since it
‘ignores’ almost all of the data values
• The median is thus usually a more appropriate
summary for skewed distributions
9
Measures of spread: SD
• The standard deviation (SD) of a variable is the
square root of the average* of squared deviations
from the mean (*for uninteresting technical reasons,
instead of dividing by the number of values n, you
usually divide by n-1)
• The SD is an appropriate measure of spread when
center is measured with the mean
10
Quantiles
• The pth quantile is the number that has the proportion
p of the data values smaller than it
30%
5.53 = 30th percentile
11
Measures of spread: IQR
• The 25th (Q1), 50th (median), and 75th (Q3)
percentiles divide the data into 4 equal parts; these
special percentiles are called quartiles
• The interquartile range (IQR) of a variable is the
distance between Q1 and Q3:
IQR = Q3 – Q1
• The IQR is one way to measure spread when center
is measured with the median
12
Five-number summary and boxplot
• An overall summary of the distribution of variable
values is given by the five values:
Min, Q1, Median, Q3, and Max
• A boxplot provides a visual summary of this five-
number summary
• Display boxplots side-by-side to compare
distributions of different data sets
13
Boxplot
median
Q3
Q1
suspected
outliers
`whiskers
14
Histogram
• A histogram is a special kind of bar plot
• It allows you to visualize the distribution of values for
a numerical variable
• When drawn with a density scale:
– the AREA (NOT height) of each bar is the proportion of
observations in the interval
– the TOTAL AREA is 100% (or 1)
15
Bivariate Data
• Bivariate data are just what they sound
like – data with measurements on two
variables; let’s call them X and Y
• Here, we are looking at two continuous
variables
• Want to explore the relationship between
the two variables
16
Scatter plot
• We can graphically summarize a bivariate data set
with a scatter plot (also sometimes called a scatter
diagram)
• Plots values of one variable on the horizontal axis
and values of the other on the vertical axis
• Can be used to see how values of 2 variables tend
to move with each other (i.e. how the variables are
associated)
17
Scatter plot: positive association
18
Scatter plot: negative association
19
Scatter plot: real data example
20
Correlation Coefficient
• r is a unitless quantity
• -1  r  1
• r is a measure of LINEAR ASSOCIATION
• When r = 0, the points are not LINEARLY
ASSOCIATED – this does NOT mean there is NO
ASSOCIATION
Breast cancer example
 Study on whether age at first child birth is an important risk
factor for breast cancer (BC)
 How does taking Oral Conceptive (OC) affect Blood Pressure (BP) in women
 paired samples
Blood Pressure Example
Birthweight Example
 Determine the effectiveness of drug
A on preventing premature birth.
 Independent samples.
Parameter
Parameter
Statistic
Statistic
Mean:
Standard
deviation:
Proportion:
s
X ____
____
____
estimates
estimates
estimates
from sample
from entire
population
p
Mean, , is
unknown
Population Point estimate
I am 95%
confident that
 is between
40 & 60
Mean
X = 50
Sampl
e
Interval estimate
Parameter
= Statistic ± Its Error
Sampling Distribution
X or P
X or P X or P
Standard Error
SE (Mean) =
S
n
SE (p) =
p(1-p)
n
Quantitative Variable
Qualitative Variable
95% Samples
Confidence Interval
X
_
X - 1.96 SE X + 1.96 SE
 SE
SE Z-axis
1 - α
α/2
α/2
95% Samples
Confidence Interval
SE
SE  p
p + 1.96 SE
p - 1.96 SE
Z-axis
1 - α
α/2
α/2
Interpretation of
CI
Probabilistic Practical
We are 100(1-)%
single
confident that the
computed CI contains 
In repeated sampling 100(1-
around
all intervals
)% of

sample means will in the
long run include 
Example (Sample size≥30)
An epidemiologist studied the blood
glucose level of a random sample of 100
patients. The mean was 170, with a SD of
10.
SE = 10/10 = 1
Then CI:
 = 170 + 1.96  1 168.04    171.96
95%
 = X + Z SE
In a survey of 140 asthmatics, 35%
had allergy to house dust. Construct the
95% CI for the population proportion.
 = p + Z
0.35 – 1.96  0.04   ≥ 0.35 + 1.96 
0.04
0.27   ≥ 0.43
27%   ≥ 43%
Example (Proportion)
In a survey of 140 asthmatics, 35%
had allergy to house dust. Construct the
95% CI for the population proportion.
 = p + Z
0.35 – 1.96  0.04   ≥ 0.35 + 1.96 
0.04
0.27   ≥ 0.43
27%   ≥ 43%
P(1-p)
n 140
0.35(1-
0.35)
= 0.04
SE =
Hypothesis testing
A statistical method that uses
sample data to evaluate a
hypothesis about a population
parameter. It is intended to help
researchers differentiate
between real and random
patterns in the data.
• H0 Null Hypothesis states the
Assumption to be tested e.g. SBP of
participants = 120 (H0:  = 120).
• H1 Alternative Hypothesis is the
opposite of the null hypothesis (SBP of
participants ≠ 120 (H1:  ≠ 120). It may or
may not be accepted and it is the
hypothesis that is believed to be true
by the researcher
Null & Alternative Hypotheses
• Defines unlikely values of sample
statistic if null hypothesis is true.
Called rejection region of sampling
distribution
• Typical values are 0.01, 0.05
• Selected by the Researcher at the
Start
• Provides the Critical Value(s) of the
Test
Level of Significance, a
Level of Significance, a and the Rejection
Region
0
a
Critical
Value(s)
Rejection
Regions
H0: Innocent
Jury Trial Hypothesis Test
Actual Situation Actual Situation
Verdict Innocent Guilty Decision H 0
True H 0
False
Innocent Correct Error
Accept
H
0
1 - 
Type II
Error ( b )
Guilty Error Correct
H
0
Type I
Error
( )
Power
(1 - b )
Result Possibilities
False
Negative
False
Positive
Reject
• True Value of Population Parameter
– Increases When Difference Between
Hypothesized Parameter & True Value
Decreases
• Significance Level 
– Increases When  Decreases
• Population Standard Deviation 
– Increases When  Increases
• Sample Size n
– Increases When n Decreases
Factors Increasing
Type II Error

b
b

b
n
β
b d
• Probability of Obtaining a Test
Statistic More Extreme  or ) than
Actual Sample Value Given H0 Is True
• Called Observed Level of Significance
• Used to Make Rejection Decision
– If p value   Do Not Reject H0
– If p value < , Reject H0
p Value Test
State H0 H0 : m = 120
State H1 H1 : m  120
Choose   = 0.05
Choose n n = 100
Choose Test: Z, t, X2 Test (or p Value)
Hypothesis Testing: Steps
Test the Assumption that the true mean SBP of
participants is 120 mmHg.
Compute Test Statistic (or compute P value)
Search for Critical Value
Make Statistical Decision rule
Express Decision
Hypothesis Testing: Steps
• Assumptions
– Population is normally
distributed
• t test statistic
One sample-mean Test
n
s
x
t 0
error
standard
value
null
mean
sample 




Example Normal Body Temperature
What is normal body temperature? Is it
actually 37.6oC (on average)?
State the null and alternative hypotheses
H0:  = 37.6oC
Ha:   37.6oC
Example Normal Body Temp (cont)
Data: random sample of n = 18 normal body
temps
37.2 36.8 38.0 37.6 37.2 36.8 37.4 38.7 37.2
36.4 36.6 37.4 37.0 38.2 37.6 36.1 36.2 37.5
Variable n Mean SD SE t P
Temperature 18 37.22 0.68 0.161 2.38 0.029
n
s
x
t 0
error
standard
value
null
mean
sample 




Summarize data with a test statistic
STUDENT’S t DISTRIBUTION TABLE
Degrees of
freedom
Probability (p value)
0.10 0.05 0.01
1 6.314 12.706 63.657
5 2.015 2.571 4.032
10 1.813 2.228 3.169
17 1.740 2.110 2.898
20 1.725 2.086 2.845
24 1.711 2.064 2.797
25 1.708 2.060 2.787
 1.645 1.960 2.576
Example Normal Body Temp (cont)
Find the p-value
Df = n – 1 = 18 – 1 = 17
From SPSS: p-value = 0.029
From t Table: p-value is
between 0.05 and 0.01.
Area to left of t = -2.11 equals
area to right of t = +2.11.
The value t = 2.38 is between
column headings 2.110& 2.898 in
table, and for df =17, the p-
values are 0.05 and 0.01.
-2.11 +2.11 t
Example Normal Body Temp (cont)
Decide whether or not the result is
statistically significant based on the p-
value
Using a = 0.05 as the level of significance
criterion, the results are statistically significant
because 0.029 is less than 0.05. In other words,
we can reject the null hypothesis.
Report the Conclusion
We can conclude, based on these data, that the
mean temperature in the human population
does not equal 37.6.
• Involves categorical variables
• Fraction or % of population in a category
• Sample proportion (p)
One-sample test for proportion
size
sample
successes
of
number
n
X
p 

n
p
Z
)
1
( 





 Test is called Z test
where:
 Z is computed value
 π is proportion in
population
(null hypothesis
value) Critical Values: 1.96 at α=0.05
2.58 at α=0.01
Example
• In a survey of diabetics in a large city, it
was found that 100 out of 400 have diabetic
foot. Can we conclude that 20 percent of
diabetics in the sampled population have
diabetic foot.
• Test at the  =0.05 significance level.
Solution
Critical Value: 1.96
Decision:
We have sufficient evidence to reject the
Ho value of 20%
We conclude that in the population of
diabetic the proportion who have
diabetic foot does not equal 0.20
Z
0
Reject Reject
.025
.025
=
2.50
Ho: π = 0.20
H1: π  0.20 Z =
0.25 – 0.20
0.20 (1- 0.20)
400
+1.96
-1.96
Example
3. It is known that 1% of population suffers from a
particular disease. A blood test has a 97% chance to
identify the disease for a diseased individual, by also
has a 6% chance of falsely indicating that a healthy
person has a disease.
a. What is the probability that a random person has a
positive blood test.
b. If a blood test is positive, what’s the probability that
the person has the disease?
c. If a blood test is negative, what’s the probability that
the person does not have the disease?
• A is the event that a person has a disease. P(A) =
0.01; P(A’) = 0.99.
• B is the event that the test result is positive.
– P(B|A) = 0.97; P(B’|A) = 0.03;
– P(B|A’) = 0.06; P(B’|A’) = 0.94;
• (a) P(B) = P(A) P(B|A) + P(A’)P(B|A’) = 0.01*0.97
+0.99 * 0.06 = 0.0691
• (b) P(A|B)=P(B|A)*P(A)/P(B) = 0.97* 0.01/0.0691 =
0.1403
• (c) P(A’|B’) = P(B’|A’)P(A’)/P(B’)= P(B’|A’)P(A’)/(1-
P(B))= 0.94*0.99/(1-.0691)=0.9997
Normal Distributions
• Gaussian distribution
• Mean
• Variance
• Central Limit Theorem says sums of random variables tend
toward a Normal distribution.
• Mahalanobis Distance:
x
x
E 

)
(
2
2
/
2
)
(
2
1
)
,
(
)
( x
x
x
x
e
N
x
p x
x










2
2]
)
[( x
x
x
E 
 

x
x
x
r




Multivariate Normal Density
• x is a vector of d Gaussian variables
• Mahalanobis Distance
• All conditionals and marginals are also Gaussian


























dx
x
p
x
x
x
x
E
dx
x
xp
x
E
x
T
x
e
d
N
x
p
T
T
)
(
)
)(
(
]
)
)(
[(
)
(
]
[
)
(
1
)
(
2
1
2
/
1
|
|
2
/
2
1
)
,
(
)
(









)
(
)
( 1
2 
 


  x
x
r T
104
Bayesian Decision Making
Classification problem in probabilistic terms
Create models for how features are
distributed for objects of different classes
We will use probability calculus to make
classification decisions
105
X
Y
Lets Look at Just One Feature
• Each object can
be associated
with multiple
features
• We will look at
the case of just
one feature for
now
RBC
We are going to define two key concepts….
106
The First Key Concept
Features for each class drawn from class-conditional probability distributions
(CCPD)
P(X|Class1) P(X|Class2)
Our first goal will be to model these distributions
X
107
The Second Key Concept
We model prior probabilities to quantify the expected a priori chance of seeing
a class
P(Class2) & P(Class1)
108
But How Do We Classify?
• So we have priors defining the a priori probability
of a class
• We also have models for the probability of a
feature given each class
But we want the probability of the class given a feature
How do we get P(Class1|X)?
P(Class1), P(Class2)
P(X|Class1), P(X|Class2)
109
Bayes Rule
( | ) ( )
( | )
( )
P Feature Class P Class
P Class Feature
P Feature

Prior
Likelihood
Posterior
Evidence
Belief before
evidence
Evaluate
evidence
Belief after
evidence
Bayes, Thomas (1763) An essay
towards solving a problem in the
doctrine of chances. Philosophical
Transactions of the Royal Society of
London, 53:370-418
Copyright@SMST 110
Bayes Decision Rule
If we observe an object with feature X, how do decide if the object is
from Class 1?
The Bayes Decision Rule is simply choose Class1 if:
( 1| ) ( 2 | )
( | 1) ( 1) ( | 2) ( 2)
(
( | 1) ( 1) ( | 2) (
) )
)
(
2
P Class X P Class X
P X Class P L P X Class
P X Class P Class P X
P L
P X P X
Class P Class



This is the same number on both sides!
111
Discriminant Function
We can create a convenient representation of the
Bayes Decision Rule
( | 1) ( 1) ( | 2) ( 2)
( | 1) ( 1)
1
( | 2) ( 2)
( | 1) ( 1)
( ) log 0
( | 2) ( 2)
P X Class P Class P X Class P Class
P X Class P Class
P X Class P Class
P X Class P Class
G X
P X Class P Class


 
If G(X) > 0, we classify as Class 1
112
Stepping back
( | ) ( )
1 1
( ) log 0
( | ) ( )
2 2
Cla
Class Class
ss Clas
P X P
G X
P s
X P
 
Given a new feature, X, we plug
it into this equation…
…and if G(X)> 0 we classify as Class1
What do we have so far?
P(X|Class1), P(X|Class2) P(Class1), P(Class2)
We have defined the two components, class-conditional distributions and
priors
We have used Bayes Rule to create a discriminant function for
classification from these components
113
Getting P(X|Class) from Training Set
P(X|Class1)
One Simple Approach
Divide X values into bins
And then we simply count
frequencies
<1 1-3 3-5 5-7 >7
2/13
0
7/13
3/13
1/13
There are 13 data
points
X
Class conditional from Univariate Normal Distribution
114
2
2
/
2
)
(
2
1
)
,
(
)
( x
x
x
x
e
N
x
p x
x










x
x
E 

)
(
2
2]
)
[( x
x
x
E 
 

x
x
x
r




Mean :
Variance :
Mahalanobis Distance :
115
We Are Just About There….
We have created the class-conditional distributions and priors
( | ) ( )
1 1
( ) log 0
( | ) ( )
2 2
Cla
Class Class
ss Clas
P X P
G X
P s
X P
 
P(X|Class1), P(X|Class2) P(Class1), P(Class2)
And we are ready to plug these into our discriminant function
But there is one more little complication…..
116
Multidimensional feature space ?
So P(X|Class) become P(X1,X2,X3,…,X8|Class)
and our discriminant function becomes
1 2 7
1 2 7
( , ,..., | ) ( )
( ) log 0
( , ,. 2
.., | ) ( )
1 1
2
Class
Class Clas
P X X X
Cla
s
s
P
G X
P X X s
X P
 
117
Naïve Bayes Classifier
We are going to make the following assumption:
All features are independent given the class
1 2 1 2
1
( , ,..., | ) ( | ) ( | )... ( | )
( | )
n n
n
i
i
P X X X Class P X Class P X Class P X Class
P X Class


 
We can thus estimate individual distributions for each
feature and just multiply them together!
118
Naïve Bayes Discriminant Function
1 2 7
1 7
1 2 7
( , ,..., | ) ( )
1
( ,..., ) log 0
( , ,..., |
1
(
2 2
) )
P X X X P
G X X
P X
Class
Class Clas
X s
X
ss
P
Cla
 
1 7
( | ) ( )
( ,..., ) log 0
( | ) (
2 2)
1 1
i
i
P X P
G X X
P X
Class
Class Clas
P
Class
s
 


Thus, with the Naïve Bayes assumption, we can now rewrite, this:
As this:
119
Classifying Parasitic RBC
1 7
( | ) ( )
( ,..., ) log 0
( | )
~ )
~
(
i
i
Mito Mito
P X P
G X X
P Mito P o
X Mit
 


Plug these and priors into the discriminant function
IF G>0, we predict that the parasite is from class Malaria
Intensity
Hu’s moments
Entropy
Fractal dimension
Homogeneity
Correlation
Chromatin dots
P(Xi|Malaria)
P(Xi|~Malaria)
Xi
120
How Good is the Classifier?
The Rule
We must test our classifier on a different set
from the training set: the labeled test set
The Task
We will classify each object in the test set
and count the number of each type of error
121
Binary Classification Errors
• Sensitivity
– Fraction of all Class1 (True) that we correctly predicted at Class 1
– How good are we at finding what we are looking for
• Specificity
– Fraction of all Class 2 (False) called Class 2
– How many of the Class 2 do we filter out of our Class 1 predictions
True (Mito) False (~Mito)
Predicted True TP FP
Predicted False FN TN
Sensitivity = TP/(TP+FN) Specificity = TN/(TN+FP)
In both cases, the higher the better
Thank you

More Related Content

Similar to ChandanChakrabarty_1.pdf

Basics in Epidemiology & Biostatistics 2 RSS6 2014
Basics in Epidemiology & Biostatistics 2 RSS6 2014Basics in Epidemiology & Biostatistics 2 RSS6 2014
Basics in Epidemiology & Biostatistics 2 RSS6 2014RSS6
 
Test of significance in Statistics
Test of significance in StatisticsTest of significance in Statistics
Test of significance in StatisticsVikash Keshri
 
inferentialstatistics-210411214248.pdf
inferentialstatistics-210411214248.pdfinferentialstatistics-210411214248.pdf
inferentialstatistics-210411214248.pdfChenPalaruan
 
Statistical analysis.pptx
Statistical analysis.pptxStatistical analysis.pptx
Statistical analysis.pptxChinna Chadayan
 
Parametric tests seminar
Parametric tests seminarParametric tests seminar
Parametric tests seminardrdeepika87
 
Statistics basics for oncologist kiran
Statistics basics for oncologist kiranStatistics basics for oncologist kiran
Statistics basics for oncologist kiranKiran Ramakrishna
 
Review & Hypothesis Testing
Review & Hypothesis TestingReview & Hypothesis Testing
Review & Hypothesis TestingSr Edith Bogue
 
Basics of Hypothesis testing for Pharmacy
Basics of Hypothesis testing for PharmacyBasics of Hypothesis testing for Pharmacy
Basics of Hypothesis testing for PharmacyParag Shah
 
Estimation and hypothesis
Estimation and hypothesisEstimation and hypothesis
Estimation and hypothesisJunaid Ijaz
 
bio statistics for clinical research
bio statistics for clinical researchbio statistics for clinical research
bio statistics for clinical researchRanjith Paravannoor
 
Introduction to Statistical Methods
Introduction to Statistical MethodsIntroduction to Statistical Methods
Introduction to Statistical MethodsMichael770443
 

Similar to ChandanChakrabarty_1.pdf (20)

Basics in Epidemiology & Biostatistics 2 RSS6 2014
Basics in Epidemiology & Biostatistics 2 RSS6 2014Basics in Epidemiology & Biostatistics 2 RSS6 2014
Basics in Epidemiology & Biostatistics 2 RSS6 2014
 
Basics of Statistics.pptx
Basics of Statistics.pptxBasics of Statistics.pptx
Basics of Statistics.pptx
 
Test of significance in Statistics
Test of significance in StatisticsTest of significance in Statistics
Test of significance in Statistics
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
inferentialstatistics-210411214248.pdf
inferentialstatistics-210411214248.pdfinferentialstatistics-210411214248.pdf
inferentialstatistics-210411214248.pdf
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
Inferential statistics
Inferential statisticsInferential statistics
Inferential statistics
 
Statistical analysis.pptx
Statistical analysis.pptxStatistical analysis.pptx
Statistical analysis.pptx
 
Test of significance
Test of significanceTest of significance
Test of significance
 
Parametric tests seminar
Parametric tests seminarParametric tests seminar
Parametric tests seminar
 
3b. Introductory Statistics - Julia Saperia
3b. Introductory Statistics - Julia Saperia3b. Introductory Statistics - Julia Saperia
3b. Introductory Statistics - Julia Saperia
 
Statistics basics for oncologist kiran
Statistics basics for oncologist kiranStatistics basics for oncologist kiran
Statistics basics for oncologist kiran
 
Review & Hypothesis Testing
Review & Hypothesis TestingReview & Hypothesis Testing
Review & Hypothesis Testing
 
Basics of Hypothesis testing for Pharmacy
Basics of Hypothesis testing for PharmacyBasics of Hypothesis testing for Pharmacy
Basics of Hypothesis testing for Pharmacy
 
Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)
 
Estimation and hypothesis
Estimation and hypothesisEstimation and hypothesis
Estimation and hypothesis
 
statistics in nursing
 statistics in nursing statistics in nursing
statistics in nursing
 
Sampling theory
Sampling theorySampling theory
Sampling theory
 
bio statistics for clinical research
bio statistics for clinical researchbio statistics for clinical research
bio statistics for clinical research
 
Introduction to Statistical Methods
Introduction to Statistical MethodsIntroduction to Statistical Methods
Introduction to Statistical Methods
 

Recently uploaded

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 

Recently uploaded (20)

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

ChandanChakrabarty_1.pdf

  • 1. Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
  • 2. Statistics Math. Statistics Applied Statistics Biostatistics collection, analysis, interpretation of data development of new statistical theory & inference application of the methods derived from math. statistics to subject specific areas like psychology, economics and public health statistical methods are applied to medical, health and biological data Areas of application of Biostatistics: Environmental Health, Genetics, Pharmaceutical research, Nutrition, Epidemiology and Health surveys etc
  • 3. Some Statistical Tools for Medical Data Analysis  Data collection and Variables under study  Descriptive Statistics & Sampling Distribution Statistical Inference – Estimation, Hypothesis Testing, Conf. Interval  Association Continuous: Correlation and Regression Categorical: Chi-square test  Multivariate Analysis PCA, Clustering Techniques, Discriminantion & Classification  Time Series Analysis AR, MA, ARMA, ARIMA
  • 5. Variable  Definition: characteristic of interest in a study that has different values for different individuals.  Two types of variable Continuous: values form continuum Discrete: assume discrete set of values  Examples Continuous: blood pressure Discrete: case/control, drug/placebo
  • 6. 6 Univariate Data • Measurements on a single variable X • Consider a continuous (numerical) variable • Summarizing X – Numerically • Center • Spread – Graphically • Boxplot • Histogram
  • 7. 7 Measures of center: Mean • The mean value of a variable is obtained by computing the total of the values divided by the number of values • Appropriate for distributions that are fairly symmetrical • It is sensitive to presence of outliers, since all values contribute equally
  • 8. 8 Measures of center: Median • The median value of a variable is the number having 50% (half) of the values smaller than it (and the other half bigger) • It is NOT sensitive to presence of outliers, since it ‘ignores’ almost all of the data values • The median is thus usually a more appropriate summary for skewed distributions
  • 9. 9 Measures of spread: SD • The standard deviation (SD) of a variable is the square root of the average* of squared deviations from the mean (*for uninteresting technical reasons, instead of dividing by the number of values n, you usually divide by n-1) • The SD is an appropriate measure of spread when center is measured with the mean
  • 10. 10 Quantiles • The pth quantile is the number that has the proportion p of the data values smaller than it 30% 5.53 = 30th percentile
  • 11. 11 Measures of spread: IQR • The 25th (Q1), 50th (median), and 75th (Q3) percentiles divide the data into 4 equal parts; these special percentiles are called quartiles • The interquartile range (IQR) of a variable is the distance between Q1 and Q3: IQR = Q3 – Q1 • The IQR is one way to measure spread when center is measured with the median
  • 12. 12 Five-number summary and boxplot • An overall summary of the distribution of variable values is given by the five values: Min, Q1, Median, Q3, and Max • A boxplot provides a visual summary of this five- number summary • Display boxplots side-by-side to compare distributions of different data sets
  • 14. 14 Histogram • A histogram is a special kind of bar plot • It allows you to visualize the distribution of values for a numerical variable • When drawn with a density scale: – the AREA (NOT height) of each bar is the proportion of observations in the interval – the TOTAL AREA is 100% (or 1)
  • 15. 15 Bivariate Data • Bivariate data are just what they sound like – data with measurements on two variables; let’s call them X and Y • Here, we are looking at two continuous variables • Want to explore the relationship between the two variables
  • 16. 16 Scatter plot • We can graphically summarize a bivariate data set with a scatter plot (also sometimes called a scatter diagram) • Plots values of one variable on the horizontal axis and values of the other on the vertical axis • Can be used to see how values of 2 variables tend to move with each other (i.e. how the variables are associated)
  • 19. 19 Scatter plot: real data example
  • 20. 20 Correlation Coefficient • r is a unitless quantity • -1  r  1 • r is a measure of LINEAR ASSOCIATION • When r = 0, the points are not LINEARLY ASSOCIATED – this does NOT mean there is NO ASSOCIATION
  • 21. Breast cancer example  Study on whether age at first child birth is an important risk factor for breast cancer (BC)
  • 22.  How does taking Oral Conceptive (OC) affect Blood Pressure (BP) in women  paired samples Blood Pressure Example
  • 23. Birthweight Example  Determine the effectiveness of drug A on preventing premature birth.  Independent samples.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 35. Mean, , is unknown Population Point estimate I am 95% confident that  is between 40 & 60 Mean X = 50 Sampl e Interval estimate
  • 37. Sampling Distribution X or P X or P X or P
  • 38. Standard Error SE (Mean) = S n SE (p) = p(1-p) n Quantitative Variable Qualitative Variable
  • 39. 95% Samples Confidence Interval X _ X - 1.96 SE X + 1.96 SE  SE SE Z-axis 1 - α α/2 α/2
  • 40. 95% Samples Confidence Interval SE SE  p p + 1.96 SE p - 1.96 SE Z-axis 1 - α α/2 α/2
  • 41. Interpretation of CI Probabilistic Practical We are 100(1-)% single confident that the computed CI contains  In repeated sampling 100(1- around all intervals )% of  sample means will in the long run include 
  • 42. Example (Sample size≥30) An epidemiologist studied the blood glucose level of a random sample of 100 patients. The mean was 170, with a SD of 10. SE = 10/10 = 1 Then CI:  = 170 + 1.96  1 168.04    171.96 95%  = X + Z SE
  • 43. In a survey of 140 asthmatics, 35% had allergy to house dust. Construct the 95% CI for the population proportion.  = p + Z 0.35 – 1.96  0.04   ≥ 0.35 + 1.96  0.04 0.27   ≥ 0.43 27%   ≥ 43% Example (Proportion) In a survey of 140 asthmatics, 35% had allergy to house dust. Construct the 95% CI for the population proportion.  = p + Z 0.35 – 1.96  0.04   ≥ 0.35 + 1.96  0.04 0.27   ≥ 0.43 27%   ≥ 43% P(1-p) n 140 0.35(1- 0.35) = 0.04 SE =
  • 44. Hypothesis testing A statistical method that uses sample data to evaluate a hypothesis about a population parameter. It is intended to help researchers differentiate between real and random patterns in the data.
  • 45. • H0 Null Hypothesis states the Assumption to be tested e.g. SBP of participants = 120 (H0:  = 120). • H1 Alternative Hypothesis is the opposite of the null hypothesis (SBP of participants ≠ 120 (H1:  ≠ 120). It may or may not be accepted and it is the hypothesis that is believed to be true by the researcher Null & Alternative Hypotheses
  • 46. • Defines unlikely values of sample statistic if null hypothesis is true. Called rejection region of sampling distribution • Typical values are 0.01, 0.05 • Selected by the Researcher at the Start • Provides the Critical Value(s) of the Test Level of Significance, a
  • 47. Level of Significance, a and the Rejection Region 0 a Critical Value(s) Rejection Regions
  • 48. H0: Innocent Jury Trial Hypothesis Test Actual Situation Actual Situation Verdict Innocent Guilty Decision H 0 True H 0 False Innocent Correct Error Accept H 0 1 -  Type II Error ( b ) Guilty Error Correct H 0 Type I Error ( ) Power (1 - b ) Result Possibilities False Negative False Positive Reject
  • 49. • True Value of Population Parameter – Increases When Difference Between Hypothesized Parameter & True Value Decreases • Significance Level  – Increases When  Decreases • Population Standard Deviation  – Increases When  Increases • Sample Size n – Increases When n Decreases Factors Increasing Type II Error  b b  b n β b d
  • 50. • Probability of Obtaining a Test Statistic More Extreme  or ) than Actual Sample Value Given H0 Is True • Called Observed Level of Significance • Used to Make Rejection Decision – If p value   Do Not Reject H0 – If p value < , Reject H0 p Value Test
  • 51. State H0 H0 : m = 120 State H1 H1 : m  120 Choose   = 0.05 Choose n n = 100 Choose Test: Z, t, X2 Test (or p Value) Hypothesis Testing: Steps Test the Assumption that the true mean SBP of participants is 120 mmHg.
  • 52. Compute Test Statistic (or compute P value) Search for Critical Value Make Statistical Decision rule Express Decision Hypothesis Testing: Steps
  • 53. • Assumptions – Population is normally distributed • t test statistic One sample-mean Test n s x t 0 error standard value null mean sample     
  • 54. Example Normal Body Temperature What is normal body temperature? Is it actually 37.6oC (on average)? State the null and alternative hypotheses H0:  = 37.6oC Ha:   37.6oC
  • 55. Example Normal Body Temp (cont) Data: random sample of n = 18 normal body temps 37.2 36.8 38.0 37.6 37.2 36.8 37.4 38.7 37.2 36.4 36.6 37.4 37.0 38.2 37.6 36.1 36.2 37.5 Variable n Mean SD SE t P Temperature 18 37.22 0.68 0.161 2.38 0.029 n s x t 0 error standard value null mean sample      Summarize data with a test statistic
  • 56. STUDENT’S t DISTRIBUTION TABLE Degrees of freedom Probability (p value) 0.10 0.05 0.01 1 6.314 12.706 63.657 5 2.015 2.571 4.032 10 1.813 2.228 3.169 17 1.740 2.110 2.898 20 1.725 2.086 2.845 24 1.711 2.064 2.797 25 1.708 2.060 2.787  1.645 1.960 2.576
  • 57. Example Normal Body Temp (cont) Find the p-value Df = n – 1 = 18 – 1 = 17 From SPSS: p-value = 0.029 From t Table: p-value is between 0.05 and 0.01. Area to left of t = -2.11 equals area to right of t = +2.11. The value t = 2.38 is between column headings 2.110& 2.898 in table, and for df =17, the p- values are 0.05 and 0.01. -2.11 +2.11 t
  • 58. Example Normal Body Temp (cont) Decide whether or not the result is statistically significant based on the p- value Using a = 0.05 as the level of significance criterion, the results are statistically significant because 0.029 is less than 0.05. In other words, we can reject the null hypothesis. Report the Conclusion We can conclude, based on these data, that the mean temperature in the human population does not equal 37.6.
  • 59. • Involves categorical variables • Fraction or % of population in a category • Sample proportion (p) One-sample test for proportion size sample successes of number n X p   n p Z ) 1 (        Test is called Z test where:  Z is computed value  π is proportion in population (null hypothesis value) Critical Values: 1.96 at α=0.05 2.58 at α=0.01
  • 60. Example • In a survey of diabetics in a large city, it was found that 100 out of 400 have diabetic foot. Can we conclude that 20 percent of diabetics in the sampled population have diabetic foot. • Test at the  =0.05 significance level.
  • 61. Solution Critical Value: 1.96 Decision: We have sufficient evidence to reject the Ho value of 20% We conclude that in the population of diabetic the proportion who have diabetic foot does not equal 0.20 Z 0 Reject Reject .025 .025 = 2.50 Ho: π = 0.20 H1: π  0.20 Z = 0.25 – 0.20 0.20 (1- 0.20) 400 +1.96 -1.96
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.
  • 82.
  • 83.
  • 84.
  • 85.
  • 86.
  • 87.
  • 88.
  • 89.
  • 90.
  • 91.
  • 92.
  • 93.
  • 94.
  • 95.
  • 96.
  • 97. Example 3. It is known that 1% of population suffers from a particular disease. A blood test has a 97% chance to identify the disease for a diseased individual, by also has a 6% chance of falsely indicating that a healthy person has a disease. a. What is the probability that a random person has a positive blood test. b. If a blood test is positive, what’s the probability that the person has the disease? c. If a blood test is negative, what’s the probability that the person does not have the disease?
  • 98. • A is the event that a person has a disease. P(A) = 0.01; P(A’) = 0.99. • B is the event that the test result is positive. – P(B|A) = 0.97; P(B’|A) = 0.03; – P(B|A’) = 0.06; P(B’|A’) = 0.94; • (a) P(B) = P(A) P(B|A) + P(A’)P(B|A’) = 0.01*0.97 +0.99 * 0.06 = 0.0691 • (b) P(A|B)=P(B|A)*P(A)/P(B) = 0.97* 0.01/0.0691 = 0.1403 • (c) P(A’|B’) = P(B’|A’)P(A’)/P(B’)= P(B’|A’)P(A’)/(1- P(B))= 0.94*0.99/(1-.0691)=0.9997
  • 99.
  • 100. Normal Distributions • Gaussian distribution • Mean • Variance • Central Limit Theorem says sums of random variables tend toward a Normal distribution. • Mahalanobis Distance: x x E   ) ( 2 2 / 2 ) ( 2 1 ) , ( ) ( x x x x e N x p x x           2 2] ) [( x x x E     x x x r    
  • 101.
  • 102. Multivariate Normal Density • x is a vector of d Gaussian variables • Mahalanobis Distance • All conditionals and marginals are also Gaussian                           dx x p x x x x E dx x xp x E x T x e d N x p T T ) ( ) )( ( ] ) )( [( ) ( ] [ ) ( 1 ) ( 2 1 2 / 1 | | 2 / 2 1 ) , ( ) (          ) ( ) ( 1 2        x x r T
  • 103.
  • 104. 104 Bayesian Decision Making Classification problem in probabilistic terms Create models for how features are distributed for objects of different classes We will use probability calculus to make classification decisions
  • 105. 105 X Y Lets Look at Just One Feature • Each object can be associated with multiple features • We will look at the case of just one feature for now RBC We are going to define two key concepts….
  • 106. 106 The First Key Concept Features for each class drawn from class-conditional probability distributions (CCPD) P(X|Class1) P(X|Class2) Our first goal will be to model these distributions X
  • 107. 107 The Second Key Concept We model prior probabilities to quantify the expected a priori chance of seeing a class P(Class2) & P(Class1)
  • 108. 108 But How Do We Classify? • So we have priors defining the a priori probability of a class • We also have models for the probability of a feature given each class But we want the probability of the class given a feature How do we get P(Class1|X)? P(Class1), P(Class2) P(X|Class1), P(X|Class2)
  • 109. 109 Bayes Rule ( | ) ( ) ( | ) ( ) P Feature Class P Class P Class Feature P Feature  Prior Likelihood Posterior Evidence Belief before evidence Evaluate evidence Belief after evidence Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
  • 110. Copyright@SMST 110 Bayes Decision Rule If we observe an object with feature X, how do decide if the object is from Class 1? The Bayes Decision Rule is simply choose Class1 if: ( 1| ) ( 2 | ) ( | 1) ( 1) ( | 2) ( 2) ( ( | 1) ( 1) ( | 2) ( ) ) ) ( 2 P Class X P Class X P X Class P L P X Class P X Class P Class P X P L P X P X Class P Class    This is the same number on both sides!
  • 111. 111 Discriminant Function We can create a convenient representation of the Bayes Decision Rule ( | 1) ( 1) ( | 2) ( 2) ( | 1) ( 1) 1 ( | 2) ( 2) ( | 1) ( 1) ( ) log 0 ( | 2) ( 2) P X Class P Class P X Class P Class P X Class P Class P X Class P Class P X Class P Class G X P X Class P Class     If G(X) > 0, we classify as Class 1
  • 112. 112 Stepping back ( | ) ( ) 1 1 ( ) log 0 ( | ) ( ) 2 2 Cla Class Class ss Clas P X P G X P s X P   Given a new feature, X, we plug it into this equation… …and if G(X)> 0 we classify as Class1 What do we have so far? P(X|Class1), P(X|Class2) P(Class1), P(Class2) We have defined the two components, class-conditional distributions and priors We have used Bayes Rule to create a discriminant function for classification from these components
  • 113. 113 Getting P(X|Class) from Training Set P(X|Class1) One Simple Approach Divide X values into bins And then we simply count frequencies <1 1-3 3-5 5-7 >7 2/13 0 7/13 3/13 1/13 There are 13 data points X
  • 114. Class conditional from Univariate Normal Distribution 114 2 2 / 2 ) ( 2 1 ) , ( ) ( x x x x e N x p x x           x x E   ) ( 2 2] ) [( x x x E     x x x r     Mean : Variance : Mahalanobis Distance :
  • 115. 115 We Are Just About There…. We have created the class-conditional distributions and priors ( | ) ( ) 1 1 ( ) log 0 ( | ) ( ) 2 2 Cla Class Class ss Clas P X P G X P s X P   P(X|Class1), P(X|Class2) P(Class1), P(Class2) And we are ready to plug these into our discriminant function But there is one more little complication…..
  • 116. 116 Multidimensional feature space ? So P(X|Class) become P(X1,X2,X3,…,X8|Class) and our discriminant function becomes 1 2 7 1 2 7 ( , ,..., | ) ( ) ( ) log 0 ( , ,. 2 .., | ) ( ) 1 1 2 Class Class Clas P X X X Cla s s P G X P X X s X P  
  • 117. 117 Naïve Bayes Classifier We are going to make the following assumption: All features are independent given the class 1 2 1 2 1 ( , ,..., | ) ( | ) ( | )... ( | ) ( | ) n n n i i P X X X Class P X Class P X Class P X Class P X Class     We can thus estimate individual distributions for each feature and just multiply them together!
  • 118. 118 Naïve Bayes Discriminant Function 1 2 7 1 7 1 2 7 ( , ,..., | ) ( ) 1 ( ,..., ) log 0 ( , ,..., | 1 ( 2 2 ) ) P X X X P G X X P X Class Class Clas X s X ss P Cla   1 7 ( | ) ( ) ( ,..., ) log 0 ( | ) ( 2 2) 1 1 i i P X P G X X P X Class Class Clas P Class s     Thus, with the Naïve Bayes assumption, we can now rewrite, this: As this:
  • 119. 119 Classifying Parasitic RBC 1 7 ( | ) ( ) ( ,..., ) log 0 ( | ) ~ ) ~ ( i i Mito Mito P X P G X X P Mito P o X Mit     Plug these and priors into the discriminant function IF G>0, we predict that the parasite is from class Malaria Intensity Hu’s moments Entropy Fractal dimension Homogeneity Correlation Chromatin dots P(Xi|Malaria) P(Xi|~Malaria) Xi
  • 120. 120 How Good is the Classifier? The Rule We must test our classifier on a different set from the training set: the labeled test set The Task We will classify each object in the test set and count the number of each type of error
  • 121. 121 Binary Classification Errors • Sensitivity – Fraction of all Class1 (True) that we correctly predicted at Class 1 – How good are we at finding what we are looking for • Specificity – Fraction of all Class 2 (False) called Class 2 – How many of the Class 2 do we filter out of our Class 1 predictions True (Mito) False (~Mito) Predicted True TP FP Predicted False FN TN Sensitivity = TP/(TP+FN) Specificity = TN/(TN+FP) In both cases, the higher the better