Intro to data science

Data Science – Hands On Introduction
Topics                                    Duration (min)
Data Science Overview and Intro to R      90
Exploratory Data Analysis                 60
Logistic Regression Model                 30

Data Science
Collecting and analysing data to find insights that help in decision making
• In Business Intelligence (ETL), we discover the what and why of a (recurring) business problem.
• In Data Science, we can proactively predict the problem before it occurs again.
• Analytics are used to improve operations, performance and innovation across domains.
• Example: we can predict which customers may default on a loan before approving it.
• Data speaks: give it a canvas of methods to communicate with you!
• Data > Information > Knowledge > Insight > Action
• Data > Dashboards > Statistics > Predictive Models > Prescriptive
• What happened? Why? What's likely to happen?

What is Machine Learning?
• Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
• Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.

Steps in Machine Learning
[Figure: Training: input data > features, combined with training labels > training > learned model. Prediction: test input > features > learned model > prediction.]
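The training and prediction steps above can be sketched in a few lines. This is only an illustrative sketch: the nearest-centroid rule, the function names `train` and `predict`, and the toy feature values are all assumptions chosen for brevity, not something the slides prescribe.

```python
from statistics import mean


def train(features, labels):
    """Training step: learn one centroid (mean feature vector) per label."""
    model = {}
    for label in set(labels):
        rows = [f for f, y in zip(features, labels) if y == label]
        model[label] = [mean(column) for column in zip(*rows)]
    return model


def predict(model, feature):
    """Prediction step: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feature, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical training data: two features per example, plus training labels.
train_x = [[1.0, 1.2], [0.8, 1.0], [4.0, 4.2], [4.2, 3.9]]
train_y = ["small", "small", "large", "large"]

model = train(train_x, train_y)           # learned model
prediction = predict(model, [3.8, 4.0])   # features of an unseen test input
print(prediction)
```

The learned model here is just a dictionary of centroids; any other learner follows the same train-then-predict shape.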

Training ML Model for Face Recognition
[Figure: training examples of a person, followed by test images]

Contrast: Traditional Machine Learning and Data Science
Data Science
• Explore many models, build and tune hybrids
• Understand empirical properties of models
• Develop/use tools that can handle massive datasets
• Take action!
Machine Learning
• Develop new (individual) models
• Prove mathematical properties of models
• Improve/validate on a few, relatively clean, small datasets
• Publish a paper

Parameters and Statistics
             Parameters    Statistics
Source       Population    Sample
Notation     Greek (μ)     Roman (x̄)
Vary         No            Yes
Calculated   No            Yes
• Population: all possible values
• Sample: a portion of the population
• Statistical inference: generalizing from a sample to a population with a calculated degree of certainty
• Parameter: a characteristic of the population, e.g., the population mean μ
• Statistic: calculated from the data in the sample, e.g., the sample mean x̄
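The difference between a parameter and a statistic can be illustrated with simulated data. The sketch below assumes a hypothetical population of 10,000 heights (the values 170 and 10 are invented for illustration); μ is fixed for that population, while x̄ changes with every sample drawn.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]
mu = mean(population)          # parameter: does not vary

# Two different samples give two different statistics.
sample_a = random.sample(population, 100)
sample_b = random.sample(population, 100)
x_bar_a = mean(sample_a)       # statistic: varies from sample to sample
x_bar_b = mean(sample_b)

print(f"population mean = {mu:.2f}")
print(f"sample means    = {x_bar_a:.2f}, {x_bar_b:.2f}")
```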

Stats Notations
• Sample vs population (population notation = Greek letters)
• Individual value = x (lower case)
• Sample mean = x̄ or M
• Population mean = μ
• Summation sign = Σ
• Sample size = n
• Population size = N

• Sample variance = s²
• Population variance = σ²
• Sample standard deviation = s or SD
• Population standard deviation = σ
• Interquartile range = IQR

Calculating Variance and Standard Deviation
• The standard deviation is a unit of distance that is useful for comparing scores. A standard deviation cannot be negative, but it can be used to measure distance in both the positive and negative directions from the mean.
• Definition (easier to understand conceptually): $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
• The numerator is also called the Sum of Squares (of the squared differences): $SS = \sum (X - \bar{X})^2$
• Computation formula (easier to use, especially with large data sets): $SS = \sum X^2 - \frac{(\sum X)^2}{n}$, where n is the number of values summed
• Use n-1 in the denominator when using s or s² to estimate σ or σ² for a population.
• Variance for Sample: $s^2 = \frac{SS}{n-1} = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1}$
• Variance for Population: $\sigma^2 = \frac{SS}{N}$

An example of Variance using height of dogs
The green line shows the mean. Subtract the mean from each dog’s height. Because
some dogs are taller and others are shorter, some of the differences will be positive
and some negative numbers. These differences will cancel each other out because the
mean is the balance point in the distribution of dog heights.
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm

Square the differences and take the mean.
σ² = (206² + 76² + (-224)² + 36² + (-94)²) / 5 = 108,520 / 5 = 21,704

Take the square root to return to the original units of measure.
• σ = √21,704 ≈ 147 mm
• Most of the dogs (3 of 5) are within one standard deviation (147 mm) of the mean. Rottweilers are unusually tall dogs, and Dachshunds are a bit short.
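The dog-height arithmetic can be reproduced with Python's standard `statistics` module. This is a minimal sketch: the variable names are illustrative, and the sample variance (n-1 denominator) is shown only for comparison with the population figure used in the example.

```python
from math import sqrt
from statistics import mean, pvariance, variance

heights_mm = [600, 470, 170, 430, 300]

m = mean(heights_mm)                          # 394
deviations = [h - m for h in heights_mm]      # 206, 76, -224, 36, -94
ss = sum(d ** 2 for d in deviations)          # sum of squares = 108,520

# Computation formula: SS = sum(X^2) - (sum X)^2 / n
n = len(heights_mm)
ss_computed = sum(h ** 2 for h in heights_mm) - sum(heights_mm) ** 2 / n

pop_var = pvariance(heights_mm)               # SS / N     = 21,704
sample_var = variance(heights_mm)             # SS / (n-1) = 27,130
sigma = sqrt(pop_var)                         # about 147.3 mm

within_one_sd = [h for h in heights_mm if abs(h - m) <= sigma]
print(m, ss, pop_var, sample_var, round(sigma, 1), within_one_sd)
```

Note that the deviations sum to zero, which is why they are squared before averaging.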

What Does Variance Describe?
• The range doesn't tell us how scores are distributed between the high and low values.
• Because the mean is the balance point, the mean of the unsquared deviations is always zero.
• Variance and standard deviation describe the amount by which actual observations differ from the mean. How spread out are the scores?
• The variance is expressed in squared units (e.g., squared lbs), which are hard to interpret.
• Taking the square root of the variance expresses the typical deviation in the original units.
• For roughly bell-shaped (normal) distributions, about 68% of observations fall within one standard deviation of the mean, about 95% within two, and about 99% within 2.6 standard deviations.
• A weaker version of this generalization holds no matter what the shape of the distribution (skewed distributions included): at least 75% of observations always fall within two standard deviations of the mean (Chebyshev's inequality).
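The coverage figures above can be checked empirically. The sketch below is an assumption-laden illustration: the simulated normal and exponential (skewed) data sets and the helper name `share_within` are made up for this purpose.

```python
import random
from statistics import mean, pstdev


def share_within(data, k):
    """Fraction of observations within k standard deviations of the mean."""
    m, sd = mean(data), pstdev(data)
    return sum(abs(x - m) <= k * sd for x in data) / len(data)


random.seed(1)  # fixed seed so the run is repeatable
normal = [random.gauss(0, 1) for _ in range(20_000)]
skewed = [random.expovariate(1.0) for _ in range(20_000)]

for name, data in (("normal", normal), ("skewed", skewed)):
    print(name, round(share_within(data, 1), 3), round(share_within(data, 2), 3))
```

For the normal data the shares should land near 68% and 95%; for the skewed data they differ, but the two-standard-deviation share cannot drop below 75%.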

Data Types and Plots
Rows are the first variable, columns the second variable (NA = no second variable).
                                                  NA                     Continuous Variable        Categorical / Factor
Continuous Variable                               Histogram, Box Plot    Scatter Plot, Line Plot    Box Plot
Categorical / Factor (Ordinal, Discrete, Nominal)  Bar Charts             Box Plot                   Bar Charts

T Test, ANOVA, Chi Square
Which statistical test to use depends on the data, the sample and the purpose of the analysis:
• The T Test is used to check the difference between the means of two groups.
• ANOVA (Analysis of Variance) is the test used to check the difference between the means of three or more groups.
• The Chi Square test is used to check the difference between two or more percentages or proportions of categorical variables. It does not use the mean values.
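To make the first and last of these tests concrete, the sketch below computes the two test statistics by hand. The choices here are assumptions, not part of the slides: Welch's variant of the t statistic (which does not assume equal variances), the function names, and the toy numbers. Turning a statistic into a p-value additionally needs the t or chi-square distribution, which is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical data: two groups of scores, and a 2x2 table of counts.
t_value = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
chi2_value = chi_square_statistic([[10, 20], [20, 10]])
print(t_value, chi2_value)
```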

Hypothesis Test & Confidence Interval Intro
Analogy: Legal trial
• Null Hypothesis (H0): Suspect is innocent
• Alternative Hypothesis (Ha): Suspect is guilty
• The judge must collect all the evidence (positive and negative).
• The judge cannot pass sentence unless the evidence supports guilt beyond reasonable doubt.
Confidence level (99%; p < 0.01): the judge would rather let 100 criminals go free under the benefit of the doubt than punish one innocent person. Punishing an innocent suspect corresponds to a Type I error.

Significance Level (Critical Value) and Confidence Level / Interval
• Significance level (α, alpha): the probability of making a Type I error that we are willing to accept. It is typically 0.01, 0.05, or 0.10. The critical value is the value of the known distribution of the test statistic beyond which the probability of a Type I error equals α.
• The null hypothesis is rejected if the p-value is less than a predetermined level, α.
• α is called the significance level, and is the probability of rejecting the null hypothesis given that it is true (a Type I error). It is usually set at or below 5%.
• If p-value ≤ significance level α (most common: 0.05), then the result is statistically significant (we reject H0).
• If the significance level is 0.05, the corresponding confidence level is 95%.
• If the confidence interval does not contain the null hypothesis value, the results are statistically significant.

P Value
• The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true.
• If p-value ≤ significance level α (most common: 0.05), then we reject the original idea, H0.
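A small worked case shows how the definition turns into a number. The coin-flip scenario, the one-sided ("at least this many heads") reading of "more extreme", and the function name are assumptions made for illustration.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(at least `heads` heads) if H0 (fair coin) is true."""
    return sum(
        comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


alpha = 0.05
p_nine = binomial_p_value(9, 10)    # 11/1024, about 0.011
p_seven = binomial_p_value(7, 10)   # 176/1024, about 0.172

print(p_nine, p_nine <= alpha)      # small p-value: reject H0
print(p_seven, p_seven <= alpha)    # large p-value: do not reject H0
```

Nine heads in ten flips is unlikely enough under a fair coin to reject H0 at α = 0.05, while seven heads is not.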

The Confusion Matrix
                    Predicted Negative    Predicted Positive
Reality Negative    True Negative         False Positive
Reality Positive    False Negative        True Positive

Calculating Accuracy, Precision, Recall & F1 Score from Confusion Matrix
Confusion matrices (rows: reality Negative, Positive; columns: predicted Negative, Positive)
Model-1    3000    600
            400   6500
Model-2    6000    100
            900   3500
Model-3    4000    200
            800   5500
Model-4    4500    800
            200   5000

                                                        Mod1      Mod2      Mod3      Mod4
Accuracy: (TP + TN) / All                               90%       90%       90%       90%
Precision: TP / (TP + FP)                               92%       97%       96%       86%
Recall: TP / (TP + FN)                                  94%       80%       87%       96%
F1 Score: 2 * Precision * Recall / (Precision + Recall) 92.86%    87.50%    91.67%    90.91%

• Examples for the marketing industry: more precision is required if the budget is low. If the product is expensive (high-end cars), high recall is important (I don't mind spending an additional 10% of my budget to avoid losing a potential customer).
• Law & order model: precision must be high. Health care model: recall must be high.
• For hiring: if you urgently need only a few candidates, then go for a high-precision model. If you need a lot of candidates and have enough time to interview, then build a high-recall model.
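The four formulas can be applied directly to the counts of a confusion matrix. The sketch below assumes the matrices are read as TN, FP on the first row and FN, TP on the second (as in the confusion matrix layout earlier); the function name `metrics` is illustrative.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


model_1 = metrics(tn=3000, fp=600, fn=400, tp=6500)
model_2 = metrics(tn=6000, fp=100, fn=900, tp=3500)

for name, m in (("Model-1", model_1), ("Model-2", model_2)):
    print(name, {k: f"{v:.2%}" for k, v in m.items()})
```

Both models have the same accuracy, yet Model-1 has the higher recall and Model-2 the higher precision, which is why accuracy alone is not enough to choose between them.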

Important Model Metrics: Gain & Lift vs ROC/AUC
Gain charts, lift charts and ROC/AUC are all visual aids for evaluating the performance of classification models. However, there are subtle differences, as explained below:
• Gain/lift measures the effectiveness of a classification model as the ratio between the results obtained with and without the model, whereas ROC/AUC measures discrimination, that is, the ability of the model to classify correctly, plotted using the true positive and false positive rates across all predicted-probability thresholds.
• A gain/lift chart evaluates the model on its performance on a portion of the population (i.e. the first few attempts), whereas ROC/AUC evaluates the model on the whole population.
• Gain/lift charts can be the primary metrics for evaluating campaign-targeting models, whereas ROC/AUC are the primary metrics for evaluating anti-fraud/liability models.
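The contrast between the two families of metrics can be seen on a toy example. This is a sketch under stated assumptions: the labels and scores are invented, AUC is computed through its pairwise-ranking interpretation (the chance that a random positive outscores a random negative), and lift is computed for the top-scoring fraction of the population only.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(labels, scores, fraction):
    """Positive rate in the top-scoring fraction divided by the overall rate."""
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    k = max(1, round(len(ranked) * fraction))
    return (sum(ranked[:k]) / k) / (sum(ranked) / len(ranked))


# Hypothetical outcomes (1 = responded) and model scores for 10 customers.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(auc(labels, scores))            # whole-population discrimination
print(lift_at(labels, scores, 0.2))   # gain from targeting the top 20% only
```

Lift at 100% of the population is always 1, since targeting everyone is the same as using no model.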
