Unit 3: Statistical Methods in
Data Science
Probability, Statistics, Correlation,
and Regression
Your name, date, course, institution
Introduction
• Role of statistics in data science
• Helps in making data-driven decisions
• Key areas: probability, inference, correlation,
regression
Probability Theory Basics
• Definition: Likelihood of an event occurring
• Types of probability:
– Classical, Empirical, Subjective
• Events: independent vs dependent, mutually
exclusive
Probability Rules
• Addition Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Multiplication Rule: P(A ∩ B) = P(A) · P(B | A)
• Complement Rule: P(A′) = 1 − P(A)
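The three rules above can be checked numerically with a toy example. The die, the events A and B, and the helper `p` below are illustrative choices, not part of the slides; this is a minimal sketch using classical (counting) probability.

```python
# Checking the probability rules on one roll of a fair six-sided die
# (hypothetical example for illustration).

omega = {1, 2, 3, 4, 5, 6}   # sample space
A = {2, 4, 6}                # event: roll is even
B = {4, 5, 6}                # event: roll is greater than 3

def p(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return len(event) / len(omega)

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_union = p(A) + p(B) - p(A & B)

# Multiplication rule: P(A ∩ B) = P(A) · P(B | A)
p_b_given_a = len(A & B) / len(A)   # conditional probability within A
p_inter = p(A) * p_b_given_a

# Complement rule: P(A') = 1 - P(A)
p_complement = 1 - p(A)

print(p_union, p_inter, p_complement)
```

Working with sets makes the correspondence explicit: `A & B` is the intersection, `A | B` the union, and `omega - A` the complement.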
Random Variables and Distributions
• Random variable: a numerical outcome of a
random experiment
• Types: Discrete (e.g., dice), Continuous (e.g.,
height)
• Common distributions: Normal, Binomial,
Poisson
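The three named distributions can be written out from their textbook formulas. The parameter values below (a die, a call rate, a height distribution) are invented for illustration; only the standard library is used.

```python
import math

# Probability mass/density functions for the three distributions
# named on the slide (parameter values chosen for illustration).

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): k events at average rate lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def normal_pdf(x, mu, sigma):
    """Density of X ~ Normal(mu, sigma) at x (continuous)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Discrete: exactly 2 sixes in 10 dice rolls
print(binomial_pmf(2, 10, 1/6))   # ≈ 0.291
# Discrete: exactly 3 calls per minute at rate 3
print(poisson_pmf(3, 3.0))        # ≈ 0.224
# Continuous: density at the mean of heights ~ Normal(170, 10)
print(normal_pdf(170, 170, 10))   # ≈ 0.0399
```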
Inferential Statistics
• Use sample data to draw conclusions about a
population
• Two key concepts:
– Estimation (point and interval)
– Hypothesis testing
Hypothesis Testing
• Steps:
– Null hypothesis (H₀) vs alternative (H₁)
– Choose significance level (α)
– Compute test statistic
– Compare with critical value / p-value
• Types: t-test, chi-square, ANOVA
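The steps above can be sketched as a one-sample t-test. The sample values, the hypothesized mean, and the hard-coded critical value (t for df = 9, α = 0.05, two-sided) are assumptions made for this example.

```python
import math
import statistics

# One-sample t-test following the slide's steps
# (sample data invented for illustration).

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1, 5.0, 5.2]
mu0 = 5.0                       # H0: population mean equals 5.0
alpha = 0.05                    # chosen significance level

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)    # sample standard deviation

# Test statistic: t = (x̄ - μ0) / (s / √n)
t = (xbar - mu0) / (s / math.sqrt(n))

# Two-sided critical value for df = 9 at α = 0.05
t_crit = 2.262

print(f"t = {t:.3f}")
if abs(t) > t_crit:
    print("Reject H0")
else:
    print("Fail to reject H0")
```

Here |t| ≈ 1.73 < 2.262, so the test fails to reject H₀: the sample mean of 5.1 is not significantly different from 5.0 at this sample size.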
p-values
• Definition: probability of observing results at
least as extreme as the sample, assuming H₀ is true
• Interpretation (at α = 0.05):
– p < 0.05 → reject H₀
– p ≥ 0.05 → fail to reject H₀
Confidence Intervals
• Range of plausible values for a population
parameter
• Example: 95% CI → we are 95% confident the
true mean lies within this interval
• Formula (for the mean): x̄ ± Z_{α/2} · σ/√n
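The formula maps directly to code when σ is known. The sample mean, σ, and n below are invented numbers; `NormalDist().inv_cdf` from the standard library supplies the z critical value.

```python
import math
from statistics import NormalDist

# 95% confidence interval for a mean with known σ, using the
# slide's formula x̄ ± z_{α/2} · σ/√n (numbers are illustrative).

xbar, sigma, n = 170.0, 10.0, 100
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{α/2} ≈ 1.96
margin = z * sigma / math.sqrt(n)

lo, hi = xbar - margin, xbar + margin
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```

With these numbers the interval is roughly (168.04, 171.96): we are 95% confident the true mean lies in that range.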
Correlation
• Measures the strength and direction of a linear
relationship between two variables
• Correlation coefficient (r):
– r = 1 → perfect positive
– r = -1 → perfect negative
– r = 0 → no linear correlation
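Pearson's r can be computed directly from its definition. The two lists below are toy data chosen so the relationship is positive but not perfect.

```python
import math

# Pearson correlation coefficient from its definition
# (toy data for illustration).

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Numerator: sum of products of deviations from the means
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
# Denominator: product of the deviation norms
sx = math.sqrt(sum((a - mx) ** 2 for a in x))
sy = math.sqrt(sum((b - my) ** 2 for b in y))

r = cov / (sx * sy)
print(f"r = {r:.3f}")
```

Here r ≈ 0.85: a strong positive linear relationship, but short of the perfect r = 1.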
Causality vs Correlation
• Correlation ≠ Causation
• Causal inference requires experimental design
or advanced methods (e.g., regression, A/B
testing)
• Important to avoid misleading conclusions
Introduction to Linear Regression
• Predicts a dependent variable (Y) from one
or more independent variables (X)
• Simple linear regression formula:
Y = β₀ + β₁X + ε
Statistical Foundations of Regression
• Assumptions: Linearity, independence,
homoscedasticity, normality of errors
• Interpretation of coefficients:
– β₀ → intercept
– β₁ → change in Y for a one-unit change in X
• Goodness of fit: R²
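The least-squares estimates and R² can be computed from the standard closed-form formulas β₁ = Sxy/Sxx and β₀ = ȳ − β₁x̄. The data below are invented for illustration.

```python
# Simple linear regression by least squares, plus R²
# (toy data for illustration).

x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 6.2, 7.9, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)

b1 = sxy / sxx            # slope β1
b0 = ybar - b1 * xbar     # intercept β0
print(f"Y ≈ {b0:.2f} + {b1:.2f}·X")

# Goodness of fit: R² = 1 - SS_res / SS_tot
y_hat = [b0 + b1 * a for a in x]
ss_res = sum((b - f) ** 2 for b, f in zip(y, y_hat))
ss_tot = sum((b - ybar) ** 2 for b in y)
r2 = 1 - ss_res / ss_tot
print(f"R² = {r2:.3f}")
```

An R² near 1 means X explains nearly all of the variance in Y; an R² near 0 means the line explains almost none of it.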
Hypothesis Testing in Regression
• Test if predictor variables significantly affect
Y
• Null hypothesis: βᵢ = 0
• p-value from t-test → determines significance
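The slope test can be sketched by computing t = β̂₁ / SE(β̂₁) and comparing it to a critical value. The data, the fit, and the hard-coded critical value (t for df = n − 2 = 3, α = 0.05, two-sided) are all illustrative assumptions.

```python
import math

# t-test for the regression slope, H0: β1 = 0
# (toy data for illustration).

x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 6.2, 7.9, 10.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residual sum of squares and standard error of the slope:
# SE(β1) = sqrt( SS_res / (n - 2) / Sxx )
ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
se_b1 = math.sqrt(ss_res / (n - 2) / sxx)

t = b1 / se_b1
t_crit = 3.182          # two-sided critical value, df = 3, α = 0.05

print(f"t = {t:.1f}")
if abs(t) > t_crit:
    print("Reject H0: the slope is significant")
```

Here |t| is far above the critical value, so the predictor has a statistically significant effect on Y in this toy dataset.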
Summary
• Probability theory underpins statistical
inference
• Inferential statistics allows decision-making
with uncertainty
• Correlation identifies relationships; regression
quantifies and predicts
• Proper statistical methods ensure reliable data-driven insights
Thank You
