Data Science – Hands On Introduction
Topics                                    Duration (min)
Data Science Overview and Intro to R      90
Exploratory Data Analysis                 60
Logistic Regression Model                 30
Data Science
Collecting and analysing data to find insights that help in decision making
▪ In Business Intelligence (ETL), we discover the what and why of a (recurring) business problem
▪ In Data Science, we can proactively predict the problem before it occurs again
▪ Analytics is used to improve operations, performance and innovation across domains
▪ For example, we can predict which customers may default on a loan before approving it
• Data speaks; give it a canvas of methods to communicate with you!
▪ Data > Information > Knowledge > Insight > Action
▪ Data > Dashboards > Statistics > Predictive Models > Prescriptive
▪ What happened? Why? What's likely to happen?
What is Machine Learning?
• Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
• Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
Steps in Machine Learning
• Training: training input data → input features, plus training labels → training → learned model
• Testing: test input → input features → learned model → prediction
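The training and prediction steps in the diagram can be sketched in a few lines of Python. This is only an illustration, not the method used in the deck: the nearest-centroid classifier, the function names `train` and `predict`, and the toy height/weight features are all assumptions chosen to keep the example self-contained.

```python
from math import dist
from statistics import mean

def train(features, labels):
    """Training step: learn one centroid (mean feature vector) per label."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    # The "learned model" is simply a dict mapping label -> centroid.
    return {y: tuple(mean(col) for col in zip(*rows)) for y, rows in grouped.items()}

def predict(model, x):
    """Prediction step: return the label whose centroid is closest to x."""
    return min(model, key=lambda y: dist(model[y], x))

# Hypothetical training data: (height_cm, weight_kg) features with labels.
train_features = [(30, 8), (35, 10), (65, 45), (70, 50)]
train_labels = ["small", "small", "large", "large"]

model = train(train_features, train_labels)   # training
prediction = predict(model, (60, 40))         # testing on an unseen input
print(prediction)
```

The same split applies to any supervised model: labels are needed only in `train`, while `predict` sees features alone.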
Training ML Model for Face Recognition
Training examples of a person
Test images
Contrast: Traditional Machine Learning and Data Science
Data Science
• Explore many models, build and tune hybrids
• Understand empirical properties of models
• Develop/use tools that can handle massive datasets
• Take action!
Machine Learning
• Develop new (individual) models
• Prove mathematical properties of models
• Improve/validate on a few, relatively clean, small datasets
• Publish a paper
Supervised Machine Learning
Unsupervised Machine Learning
Machine Learning Models Intro
Parameters and Statistics
             Parameters      Statistics
Source       Population      Sample
Notation     Greek (μ)       Roman (x̄)
Vary         No              Yes
Calculated   No              Yes
• Population → all possible values
• Sample → a portion of the population
• Statistical inference → generalizing from a sample to a population with a calculated degree of certainty
• Parameter → a characteristic of the population, e.g., population mean μ
• Statistic → calculated from data in the sample, e.g., sample mean (x̄)
Stats Notations
• Sample vs population (population notation = Greek letters)
• Individual value = x (lower case)
• Sample mean = x̄ or M
• Population mean = μ
• Summation sign = Σ
• Sample size = n
• Population size = N
• Sample variance = s²
• Population variance = σ²
• Sample standard deviation = s or SD
• Population standard deviation = σ
• Interquartile range = IQR
Calculating Variance and Standard Deviation
• The standard deviation is a unit of distance that is useful for comparing scores. A standard deviation cannot be negative, but it can measure distance in both the positive and negative directions from the mean.
• Definition – easier to understand conceptually: $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
• The numerator is also called the Sum of Squares (squared differences): $SS = \sum (X - \bar{X})^2$
• Computation formula – easier to use, especially with large data sets: $SS = \sum X^2 - \frac{(\sum X)^2}{n}$, where n is the number of scores
• Use n − 1 in the denominator when using s or s² to estimate σ or σ² for a population: $s^2 = \frac{SS}{n - 1}$ and $s = \sqrt{\frac{SS}{n - 1}}$
• Variance for Sample: $s^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$
• Variance for Population: $\sigma^2 = \frac{SS}{N}$, so $\sigma = \sqrt{\frac{SS}{N}}$
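The definition and computation formulas for the sum of squares give the same number, which a short Python sketch can confirm. The function names and the `scores` list are illustrative assumptions, not taken from the slides.

```python
def sum_of_squares_definition(xs):
    """SS as the sum of squared differences from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def sum_of_squares_computational(xs):
    """SS as sum(X^2) - (sum X)^2 / n, with no mean computed up front."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

ss = sum_of_squares_definition(scores)
population_variance = ss / len(scores)        # divide by N
sample_variance = ss / (len(scores) - 1)      # divide by n - 1

print(ss, sum_of_squares_computational(scores))
print(population_variance, sample_variance)
```

Dividing by n − 1 always yields a slightly larger value than dividing by N, which compensates for a sample tending to underestimate the spread of its population.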
An example of Variance using height of dogs
The green line shows the mean. Subtract the mean from each dog’s height. Because
some dogs are taller and others are shorter, some of the differences will be positive
and some negative numbers. These differences will cancel each other out because the
mean is the balance point in the distribution of dog heights.
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
Square the differences and take the mean.
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704
Take the square root to return to the original units of measure.
• σ = √21,704 ≈ 147 mm
• Most dogs are within one standard deviation of the mean
Rottweilers are unusually tall dogs, and Dachshunds are a bit short.
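As a quick check of the dog-height arithmetic, the same figures can be reproduced with Python's standard `statistics` module. The variable names are illustrative; `pvariance` and `pstdev` are used because the example divides by all 5 dogs, treating them as the whole population.

```python
from statistics import mean, pstdev, pvariance

heights_mm = [600, 470, 170, 430, 300]

mu = mean(heights_mm)             # 394
variance = pvariance(heights_mm)  # 21704 (divides by N = 5)
sigma = pstdev(heights_mm)        # square root of the variance, about 147

# Dogs whose height lies within one standard deviation of the mean
within_one_sd = [h for h in heights_mm if abs(h - mu) <= sigma]
print(mu, variance, round(sigma), within_one_sd)
```

Three of the five heights fall inside the one-standard-deviation band; the tallest and shortest dogs fall outside it.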
What Does Variance Describe?
• The range doesn’t tell us how scores are distributed
between the high and low values.
• Because the mean is the balance point, the mean of the
unsquared deviations is always zero.
• Variance and standard deviation describe the amount that
actual observations differ from the Mean. How spread out
are the scores?
• The variance is expressed in squared units (e.g., squared
lbs) which are hard to interpret.
• Taking the square root of the variance expresses the
average deviation in the original units.
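• For most distributions, the majority of observations fall within one standard deviation of the mean. A very small minority fall outside two standard deviations.
• The exact percentages depend on the shape of the distribution: for a normal distribution, about 95% of observations fall within two standard deviations of the mean and about 99.7% within three. For any shape, including skewed distributions, Chebyshev's inequality guarantees that at least 75% fall within two standard deviations.
[Figure: distribution curve with the 95% and 99% regions marked]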
R Intro
Installing R
Installing RStudio
R Workspace
R Functions and Help
R Functions and Packages
Data Structures in R
Importing Data in R
Exploring Data in R
EDA – Example
EDA – Example (Contd.)
Data Types and Plots
                                            NA                     Continuous Variable        Categorical / Factor
Continuous Variable                         Histogram, Box Plot    Scatter Plot, Line Plot    Box Plot
Categorical / Factor                        Bar Charts             Box Plot                   Bar Charts
(Ordinal, Discrete, Nominal)
Correlation
Z Test, T Test and ANOVA
Which statistical test to use depends on the data, the sample and the purpose of the analysis:
▪ T Test is used to check the difference between the means of two groups.
▪ ANOVA (Analysis of Variance) is used to check the difference between the means of three or more groups.
▪ Chi Square test is used to check the difference between two or more percentages or proportions of categorical variables. It does not use the mean values.
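To make the distinction concrete, the sketch below computes two of the test statistics by hand in Python: a two-sample t statistic (in Welch's form, which is an assumption here since the slides do not say which variant is meant) and a Pearson chi-square statistic. The function names and data are hypothetical, and only the statistics are computed, not the p-values.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t statistic comparing the means of groups a and b."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def chi_square(observed):
    """Pearson chi-square statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

# Hypothetical continuous measurements for two groups (t test compares means)
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 4.0, 4.8, 4.2]
t_stat = welch_t(group_a, group_b)

# Hypothetical counts of a categorical outcome in two groups (chi-square compares proportions)
counts = [[30, 10], [20, 40]]
chi2_stat = chi_square(counts)
print(t_stat, chi2_stat)
```

Note that the t statistic is built from group means and variances, while the chi-square statistic uses only observed and expected counts.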
T Test, ANOVA, Chi Square ..
Analogy: Legal trial
• Null Hypothesis (H0): Suspect is innocent
• Alternative Hypothesis (Ha): Suspect is guilty
▪ The judge must collect all the evidence (for and against)
▪ The judge cannot pass sentence until the evidence overwhelmingly supports that the suspect is guilty
Confidence level (99%; p < 0.01): a judge may let 100 criminals go free under the benefit of the doubt, but an innocent person should never be punished on that judge's watch.
Hypothesis Test & Confidence Interval Intro
Significance Level (Critical Value) and Confidence Level / Interval
• Significance level (α, alpha): the probability of making a Type I error, that is, of rejecting the null hypothesis when it is actually true. It is typically 0.01, 0.05, or 0.10, and is usually set at or below 5%.
• Critical value: the value of the known distribution of the test statistic beyond which the null hypothesis is rejected; it is chosen so that the probability of a Type I error equals α.
• If the p-value ≤ the significance level α (most commonly 0.05), then the hypothesis test is statistically significant (we reject H0).
• If the significance level is 0.05, the corresponding confidence level is 95%.
• If the confidence interval does not contain the null hypothesis value, the results are statistically significant.
P Value
• The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true.
• If the p-value ≤ the significance level α (most commonly 0.05), then we reject the null hypothesis H0.
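The decision rule can be illustrated with a one-sample z test, where the p-value comes straight from the standard normal distribution. This is a sketch under stated assumptions: the function name `z_test_p_value`, the two-sided alternative, the known population standard deviation and all the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for H0: population mean equals mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a result at least as extreme as |z| in either tail
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

alpha = 0.05  # significance level; the matching confidence level is 95%
z, p = z_test_p_value(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
reject_h0 = p <= alpha
print(z, p, reject_h0)
```

With these numbers the result is significant at α = 0.05 but would not be at α = 0.01, which shows why the significance level has to be fixed before looking at the data.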
The Confusion Matrix
                    Predicted Negative    Predicted Positive
Reality Negative    True Negative         False Positive
Reality Positive    False Negative        True Positive
Calculating Accuracy, Precision, Recall & F1 Score from Confusion Matrix
Model      TN      FP     FN     TP
Model-1    3000    600    400    6500
Model-2    6000    100    900    3500
Model-3    4000    200    800    5500
Model-4    4500    800    200    5000

Metric                                                      Mod1      Mod2      Mod3      Mod4
Accuracy: (TP + TN) / All                                   90%       90%       90%       90%
Precision: TP / (TP + FP)                                   92%       97%       96%       86%
Recall: TP / (TP + FN)                                      94%       80%       87%       96%
F1 Score: 2 × Precision × Recall / (Precision + Recall)     92.86%    87.50%    91.67%    90.91%
▪ Examples for the marketing industry: higher precision is required if the budget is low; if the product is expensive (high-end cars), high recall is important (I don't mind spending an additional 10% of my budget so as not to lose a potential customer)
▪ Law and order model: precision must be high; health care model: recall must be high
▪ For hiring: if you urgently need only a few candidates, then go for a high-precision model; if you need a lot of candidates and have enough time to interview, then build a high-recall model
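The four metrics follow directly from the cells of a confusion matrix. The Python sketch below recomputes them for the first two models, reading each 2×2 matrix as TN, FP on the first row and FN, TP on the second, as in the confusion matrix layout above; the `metrics` function name is an assumption for illustration.

```python
def metrics(tn, fp, fn, tp):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts as (TN, FP, FN, TP), taken from the Model-1 and Model-2 matrices
models = {
    "Model-1": (3000, 600, 400, 6500),
    "Model-2": (6000, 100, 900, 3500),
}

results = {name: metrics(*cells) for name, cells in models.items()}
for name, (acc, prec, rec, f1) in results.items():
    print(f"{name}: accuracy={acc:.0%} precision={prec:.0%} recall={rec:.0%} F1={f1:.2%}")
```

Both models reach the same accuracy, yet Model-1 favours recall and Model-2 favours precision, which is why accuracy alone is not enough to choose between them.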
Important Model Metrics: Gain & Lift vs ROC/AUC
Gain and lift charts and ROC/AUC are all visual aids for evaluating the performance of classification models; however, there are subtle differences, as explained below:
▪ Gain/lift measures the effectiveness of a classification model as the ratio between the results obtained with and without the model, whereas ROC/AUC measures discrimination, that is, the ability of the model to classify correctly, plotted using the true positive and false positive rates across all predicted-probability thresholds
▪ A gain/lift chart evaluates a model on its performance over a portion of the population (i.e. the first few attempts), whereas ROC/AUC evaluates the model on the whole population
▪ Gain/lift charts can be the primary metrics for evaluating campaign-targeting models, whereas ROC/AUC are the primary metrics for evaluating anti-fraud/liability models
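The difference between the two families of metrics shows up clearly on a small scored data set. In the Python sketch below, gain and lift look only at the top-ranked portion of the population, while AUC compares every positive with every negative. The function names, labels and scores are hypothetical, and AUC is computed with the pairwise-ranking definition rather than by drawing the ROC curve.

```python
def cumulative_gain(y_true, scores, fraction):
    """Share of all positives captured in the top `fraction` of cases ranked by score."""
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)]
    k = max(1, round(fraction * len(ranked)))
    return sum(ranked[:k]) / sum(ranked)

def lift(y_true, scores, fraction):
    """Gain relative to what random targeting of the same fraction would capture."""
    k = max(1, round(fraction * len(y_true)))
    return cumulative_gain(y_true, scores, fraction) / (k / len(y_true))

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = responder) and model scores for ten customers
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]

gain_top30 = cumulative_gain(y_true, scores, 0.3)
lift_top30 = lift(y_true, scores, 0.3)
auc = roc_auc(y_true, scores)
print(gain_top30, lift_top30, auc)
```

A campaign that can only contact 30% of customers cares about `gain_top30` and `lift_top30`; a model that must rank every case well is better summarised by `auc`.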
Thank You!
71
72

More Related Content

What's hot

Business Statistics Chapter 8
Business Statistics Chapter 8Business Statistics Chapter 8
Business Statistics Chapter 8
Lux PP
 
Hypo
HypoHypo
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
Smarten Augmented Analytics
 
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
Smarten Augmented Analytics
 
Normal Curve and Standard Scores
Normal Curve and Standard ScoresNormal Curve and Standard Scores
Normal Curve and Standard Scores
Jenewel Azuelo
 
What is the Multinomial-Logistic Regression Classification Algorithm and How ...
What is the Multinomial-Logistic Regression Classification Algorithm and How ...What is the Multinomial-Logistic Regression Classification Algorithm and How ...
What is the Multinomial-Logistic Regression Classification Algorithm and How ...
Smarten Augmented Analytics
 
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
Smarten Augmented Analytics
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
Smarten Augmented Analytics
 
STATISTIC ESTIMATION
STATISTIC ESTIMATIONSTATISTIC ESTIMATION
STATISTIC ESTIMATION
Smruti Ranjan Parida
 
Confidence Interval Estimation
Confidence Interval EstimationConfidence Interval Estimation
Confidence Interval Estimation
Yesica Adicondro
 
Coefficient of Variance
Coefficient of VarianceCoefficient of Variance
Coefficient of Variance
Dr. Amjad Ali Arain
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?
Smarten Augmented Analytics
 
Data analysis
Data analysisData analysis
Data analysis
metalkid132
 
statistical estimation
statistical estimationstatistical estimation
statistical estimation
Amish Akbar
 
What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...
Smarten Augmented Analytics
 
Dispersion stati
Dispersion statiDispersion stati
Dispersion stati
Lanka Praneeth
 
Aed1222 lesson 2
Aed1222 lesson 2Aed1222 lesson 2
Aed1222 lesson 2
nurun2010
 
Statistical Estimation
Statistical Estimation Statistical Estimation
Statistical Estimation
Remyagharishs
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
Smarten Augmented Analytics
 
The Normal Distribution and Other Continuous Distributions
The Normal Distribution and Other Continuous DistributionsThe Normal Distribution and Other Continuous Distributions
The Normal Distribution and Other Continuous Distributions
Yesica Adicondro
 

What's hot (20)

Business Statistics Chapter 8
Business Statistics Chapter 8Business Statistics Chapter 8
Business Statistics Chapter 8
 
Hypo
HypoHypo
Hypo
 
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
 
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
 
Normal Curve and Standard Scores
Normal Curve and Standard ScoresNormal Curve and Standard Scores
Normal Curve and Standard Scores
 
What is the Multinomial-Logistic Regression Classification Algorithm and How ...
What is the Multinomial-Logistic Regression Classification Algorithm and How ...What is the Multinomial-Logistic Regression Classification Algorithm and How ...
What is the Multinomial-Logistic Regression Classification Algorithm and How ...
 
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
 
STATISTIC ESTIMATION
STATISTIC ESTIMATIONSTATISTIC ESTIMATION
STATISTIC ESTIMATION
 
Confidence Interval Estimation
Confidence Interval EstimationConfidence Interval Estimation
Confidence Interval Estimation
 
Coefficient of Variance
Coefficient of VarianceCoefficient of Variance
Coefficient of Variance
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?
 
Data analysis
Data analysisData analysis
Data analysis
 
statistical estimation
statistical estimationstatistical estimation
statistical estimation
 
What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...
 
Dispersion stati
Dispersion statiDispersion stati
Dispersion stati
 
Aed1222 lesson 2
Aed1222 lesson 2Aed1222 lesson 2
Aed1222 lesson 2
 
Statistical Estimation
Statistical Estimation Statistical Estimation
Statistical Estimation
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
 
The Normal Distribution and Other Continuous Distributions
The Normal Distribution and Other Continuous DistributionsThe Normal Distribution and Other Continuous Distributions
The Normal Distribution and Other Continuous Distributions
 

Similar to Intro to data science

Estimation and hypothesis
Estimation and hypothesisEstimation and hypothesis
Estimation and hypothesis
Junaid Ijaz
 
Review of Chapters 1-5.ppt
Review of Chapters 1-5.pptReview of Chapters 1-5.ppt
Review of Chapters 1-5.ppt
NobelFFarrar
 
How to compute for sample size.pptx
How to compute for sample size.pptxHow to compute for sample size.pptx
How to compute for sample size.pptx
noelmartinez003
 
Standard deviation
Standard deviationStandard deviation
Standard deviation
M K
 
Introduction to the t test
Introduction to the t testIntroduction to the t test
Introduction to the t test
Sr Edith Bogue
 
Chapter 11
Chapter 11Chapter 11
Chapter 11
Cheryl Lawson
 
Measure of Dispersion in statistics
Measure of Dispersion in statisticsMeasure of Dispersion in statistics
Measure of Dispersion in statistics
Md. Mehadi Hassan Bappy
 
R - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsR - what do the numbers mean? #RStats
R - what do the numbers mean? #RStats
Jen Stirrup
 
Application of Machine Learning in Agriculture
Application of Machine  Learning in AgricultureApplication of Machine  Learning in Agriculture
Application of Machine Learning in Agriculture
Aman Vasisht
 
QT1 - 07 - Estimation
QT1 - 07 - EstimationQT1 - 07 - Estimation
QT1 - 07 - Estimation
Prithwis Mukerjee
 
Hypothsis testing
Hypothsis testingHypothsis testing
Hypothsis testing
University of Balochistan
 
Basic stat analysis using excel
Basic stat analysis using excelBasic stat analysis using excel
Basic stat analysis using excel
Parag Shah
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docx
anhlodge
 
Statistics for UX Professionals
Statistics for UX ProfessionalsStatistics for UX Professionals
Statistics for UX Professionals
Jessica Cameron
 
statical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.pptstatical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.ppt
NazarudinManik1
 
best for normal distribution.ppt
best for normal distribution.pptbest for normal distribution.ppt
best for normal distribution.ppt
DejeneDay
 
BIIntro.ppt
BIIntro.pptBIIntro.ppt
BIIntro.ppt
PerumalPitchandi
 
Statistics for UX Professionals - Jessica Cameron
Statistics for UX Professionals - Jessica CameronStatistics for UX Professionals - Jessica Cameron
Statistics for UX Professionals - Jessica Cameron
User Vision
 
Statistics
StatisticsStatistics
Statistics
Eran Earland
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
hktripathy
 

Similar to Intro to data science (20)

Estimation and hypothesis
Estimation and hypothesisEstimation and hypothesis
Estimation and hypothesis
 
Review of Chapters 1-5.ppt
Review of Chapters 1-5.pptReview of Chapters 1-5.ppt
Review of Chapters 1-5.ppt
 
How to compute for sample size.pptx
How to compute for sample size.pptxHow to compute for sample size.pptx
How to compute for sample size.pptx
 
Standard deviation
Standard deviationStandard deviation
Standard deviation
 
Introduction to the t test
Introduction to the t testIntroduction to the t test
Introduction to the t test
 
Chapter 11
Chapter 11Chapter 11
Chapter 11
 
Measure of Dispersion in statistics
Measure of Dispersion in statisticsMeasure of Dispersion in statistics
Measure of Dispersion in statistics
 
R - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsR - what do the numbers mean? #RStats
R - what do the numbers mean? #RStats
 
Application of Machine Learning in Agriculture
Application of Machine  Learning in AgricultureApplication of Machine  Learning in Agriculture
Application of Machine Learning in Agriculture
 
QT1 - 07 - Estimation
QT1 - 07 - EstimationQT1 - 07 - Estimation
QT1 - 07 - Estimation
 
Hypothsis testing
Hypothsis testingHypothsis testing
Hypothsis testing
 
Basic stat analysis using excel
Basic stat analysis using excelBasic stat analysis using excel
Basic stat analysis using excel
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docx
 
Statistics for UX Professionals
Statistics for UX ProfessionalsStatistics for UX Professionals
Statistics for UX Professionals
 
statical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.pptstatical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.ppt
 
best for normal distribution.ppt
best for normal distribution.pptbest for normal distribution.ppt
best for normal distribution.ppt
 
BIIntro.ppt
BIIntro.pptBIIntro.ppt
BIIntro.ppt
 
Statistics for UX Professionals - Jessica Cameron
Statistics for UX Professionals - Jessica CameronStatistics for UX Professionals - Jessica Cameron
Statistics for UX Professionals - Jessica Cameron
 
Statistics
StatisticsStatistics
Statistics
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 

More from ANURAG SINGH

Microsoft Azure Cloud fundamentals
Microsoft Azure Cloud fundamentalsMicrosoft Azure Cloud fundamentals
Microsoft Azure Cloud fundamentals
ANURAG SINGH
 
Design thinkinga primer
Design thinkinga primerDesign thinkinga primer
Design thinkinga primer
ANURAG SINGH
 
Procurement Workflow in terms of SAP Tables Changes
Procurement Workflow in terms of SAP Tables ChangesProcurement Workflow in terms of SAP Tables Changes
Procurement Workflow in terms of SAP Tables Changes
ANURAG SINGH
 
Unit testing Behaviour Driven Development
Unit testing Behaviour Driven DevelopmentUnit testing Behaviour Driven Development
Unit testing Behaviour Driven Development
ANURAG SINGH
 
Introduction To Data Science Using R
Introduction To Data Science Using RIntroduction To Data Science Using R
Introduction To Data Science Using R
ANURAG SINGH
 
Intro todatascience casestudyapproach
Intro todatascience casestudyapproachIntro todatascience casestudyapproach
Intro todatascience casestudyapproach
ANURAG SINGH
 
Oops concept in c#
Oops concept in c#Oops concept in c#
Oops concept in c#
ANURAG SINGH
 
Introduction to ,NET Framework
Introduction to ,NET FrameworkIntroduction to ,NET Framework
Introduction to ,NET Framework
ANURAG SINGH
 
Introduction to C#
Introduction to C#Introduction to C#
Introduction to C#
ANURAG SINGH
 

More from ANURAG SINGH (9)

Microsoft Azure Cloud fundamentals
Microsoft Azure Cloud fundamentalsMicrosoft Azure Cloud fundamentals
Microsoft Azure Cloud fundamentals
 
Design thinkinga primer
Design thinkinga primerDesign thinkinga primer
Design thinkinga primer
 
Procurement Workflow in terms of SAP Tables Changes
Procurement Workflow in terms of SAP Tables ChangesProcurement Workflow in terms of SAP Tables Changes
Procurement Workflow in terms of SAP Tables Changes
 
Unit testing Behaviour Driven Development
Unit testing Behaviour Driven DevelopmentUnit testing Behaviour Driven Development
Unit testing Behaviour Driven Development
 
Introduction To Data Science Using R
Introduction To Data Science Using RIntroduction To Data Science Using R
Introduction To Data Science Using R
 
Intro todatascience casestudyapproach
Intro todatascience casestudyapproachIntro todatascience casestudyapproach
Intro todatascience casestudyapproach
 
Oops concept in c#
Oops concept in c#Oops concept in c#
Oops concept in c#
 
Introduction to ,NET Framework
Introduction to ,NET FrameworkIntroduction to ,NET Framework
Introduction to ,NET Framework
 
Introduction to C#
Introduction to C#Introduction to C#
Introduction to C#
 

Recently uploaded

Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 

Recently uploaded (20)

Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Intro to data science

  • 1. Data Science – Hands On Introduction. Topics and duration (min): Data Science Overview and Intro to R (90); Exploratory Data Analysis (60); Logistic Regression Model (30)
  • 9. Data Science: collecting and analysing data to find insights that help in decision making
      • In Business Intelligence (ETL), we discover the what and why of a (recurring) business problem
      • In Data Science, we can proactively predict the problem before it occurs again
      • Analytics are used to improve operations, performance and innovation across domains
      • Example: we can predict which customers may default on a loan before approving it
      • Data speaks; give it a canvas of methods to communicate with you!
      • Data > Information > Knowledge > Insight > Action
      • Data > Dashboards > Statistics > Predictive Models > Prescriptive
      • What happened? Why? What's likely to happen?
  • 18. What is Machine Learning?
      • Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
      • Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
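  • 19. Steps in Machine Learning (diagram)
      • Training: input features taken from the input data, together with the training labels, go through training to produce a learned model
      • Testing: input features taken from the test input are passed to the learned model, which outputs a prediction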
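The training and testing steps on this slide can be sketched in a few lines. This is only an illustration: the slide names no algorithm, so a nearest-centroid classifier is assumed here, and the feature values, the labels and the `train` and `predict` helpers are all hypothetical.

```python
from statistics import mean

def train(features, labels):
    """Training step: learn one centroid (mean feature vector) per label."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    # The "learned model" is just a mapping from label to centroid.
    return {y: [mean(col) for col in zip(*rows)] for y, rows in grouped.items()}

def predict(model, x):
    """Testing step: return the label whose centroid is closest to x."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical training data: input features plus training labels.
train_features = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.2], [3.1, 2.9]]
train_labels = ["small", "small", "large", "large"]

model = train(train_features, train_labels)   # training
test_input = [2.9, 3.0]                        # unseen test input
prediction = predict(model, test_input)        # prediction from the learned model
print(prediction)
```

The learned model is kept separate from the training data on purpose: once `train` has run, only `model` is needed to score new inputs.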
  • 20. Training ML Model for Face Recognition (images: training examples of a person; test images)
  • 21. Contrast: Traditional Machine Learning and Data Science
      • Machine Learning: develop new (individual) models; prove mathematical properties of models; improve/validate on a few, relatively clean, small datasets; publish a paper
      • Data Science: explore many models, build and tune hybrids; understand empirical properties of models; develop/use tools that can handle massive datasets; take action!
  • 31. Parameters and Statistics
      • Source: parameters describe the population; statistics describe a sample
      • Notation: Greek for parameters (e.g. $\mu$); Roman for statistics (e.g. $\bar{x}$)
      • Vary: parameters no; statistics yes
      • Calculated: parameters no; statistics yes
      • Population: all possible values
      • Sample: a portion of the population
      • Statistical inference: generalizing from a sample to a population with a calculated degree of certainty
      • Parameter: a characteristic of the population, e.g., population mean $\mu$
      • Statistic: calculated from data in the sample, e.g., sample mean $\bar{x}$
  • 32. Stats Notations
      • Sample vs population (population notation = Greek letters)
      • Individual value = $x$ (lower case)
      • Sample mean = $\bar{x}$ or $M$
      • Population mean = $\mu$
      • Summation sign = $\Sigma$
      • Sample size = $n$
      • Population size = $N$
      • Sample variance = $S^2$
      • Population variance = $\sigma^2$
      • Sample standard deviation = $S$ or SD
      • Population standard deviation = $\sigma$
      • Interquartile range = IQR
  • 33. Calculating Variance and Standard Deviation
      • The standard deviation is a unit of distance that is useful for comparing scores. Standard deviations cannot have a negative value. They can measure in both positive and negative directions from the mean.
      • Definition (easier to understand conceptually): $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
      • The numerator is also called the Sum of Squares (squared differences): $SS = \sum (X - \bar{X})^2$
      • Computation formula (easier to use, especially with large data sets): $SS = \sum X^2 - \frac{(\sum X)^2}{n}$
      • Use $n - 1$ in the denominator when using $s$ or $s^2$ to estimate $\sigma$ or $\sigma^2$ for a population.
      • Variance for sample: $s^2 = \frac{SS}{n - 1} = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$
      • Variance for population: $\sigma^2 = \frac{SS}{N}$, so $\sigma = \sqrt{SS / N}$
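As a quick check on the formulas in this slide, the sketch below computes the sum of squares both by the definition and by the computation formula, then divides by N for the population variance and by n - 1 for the sample variance. The `scores` list and the function names are made up for illustration.

```python
from math import isclose, sqrt

def ss_definition(xs):
    """Sum of squares by definition: squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def ss_computational(xs):
    """Sum of squares by the computation formula: sum(x^2) - (sum(x))^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

def population_variance(xs):
    """Divide SS by N when the data are the whole population."""
    return ss_definition(xs) / len(xs)

def sample_variance(xs):
    """Divide SS by n - 1 when estimating the population variance from a sample."""
    return ss_definition(xs) / (len(xs) - 1)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, not from the slides

# Both routes give the same sum of squares.
print(ss_definition(scores), ss_computational(scores))
print(population_variance(scores), sqrt(population_variance(scores)))
print(sample_variance(scores))
```

The sample variance is always a little larger than the population variance computed on the same numbers, because the same SS is divided by a smaller denominator.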
  • 35. An example of Variance using height of dogs. The green line shows the mean. Subtract the mean from each dog’s height. Because some dogs are taller and others are shorter, some of the differences will be positive and some negative numbers. These differences will cancel each other out because the mean is the balance point in the distribution of dog heights. Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
  • 36. Square the differences and take the mean. $\sigma^2 = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{108{,}520}{5} = 21{,}704$
  • 37. Take the square root to return to the original units of measure. • $\sigma = \sqrt{21{,}704} \approx 147$ mm • As the chart shows, most dogs are within one standard deviation of the mean. Rottweilers are unusually tall dogs, and Dachshunds are a bit short.
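The dog-height arithmetic can be reproduced with Python's standard `statistics` module. This is a sketch using the five heights from the slides; `pvariance` and `pstdev` divide by N, matching the population formulas used above.

```python
from statistics import mean, pstdev, pvariance

heights_mm = [600, 470, 170, 430, 300]  # the five dog heights from the slides

mu = mean(heights_mm)            # balance point of the distribution
variance = pvariance(heights_mm) # mean of the squared differences (divides by N)
sigma = pstdev(heights_mm)       # square root of the variance, back in mm

# Heights that lie within one standard deviation of the mean.
within_one_sd = [h for h in heights_mm if abs(h - mu) <= sigma]

print(mu, variance, round(sigma))
print(within_one_sd)
```

Three of the five heights fall inside the one-standard-deviation band; the tallest (600 mm) and shortest (170 mm) dogs fall outside it.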
  • 38. What Does Variance Describe?
      • The range doesn’t tell us how scores are distributed between the high and low values.
      • Because the mean is the balance point, the mean of the unsquared deviations is always zero.
      • Variance and standard deviation describe the amount by which actual observations differ from the mean. How spread out are the scores?
      • The variance is expressed in squared units (e.g., squared lbs), which are hard to interpret.
      • Taking the square root of the variance expresses the typical deviation in the original units.
      • For roughly bell-shaped distributions, the majority of observations (about 68%) fall within one standard deviation of the mean, about 95% within two and about 99.7% within three.
      • These percentages are exact only for the normal distribution. For any distribution, including skewed ones, Chebyshev's inequality still guarantees that at least 75% of observations fall within two standard deviations of the mean.
  • 44. R Functions and Help
  • 45. R Functions and Packages
  • 52. EDA – Example (Contd.)
  • 53. EDA – Example (Contd.)
  • 57. Data Types and Plots
      • Continuous variable alone: histogram, box plot
      • Continuous vs continuous: scatter plot, line plot
      • Continuous vs categorical / factor (ordinal, discrete, nominal): box plot
      • Categorical / factor alone: bar charts
      • Categorical / factor vs continuous: box plot
      • Categorical / factor vs categorical / factor: bar charts
  • 59. Z Test, T Test and ANOVA
  • 60. T Test, ANOVA, Chi Square: which statistical test to use depends on the data, the sample and the purpose of the analysis
      • The t-test is used to check the difference between the means of two groups.
      • ANOVA (Analysis of Variance) is used to check the difference between the means of three or more groups.
      • The chi-square test is used to check the difference between two or more percentages or proportions of categorical variables. It does not use mean values.
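To make the two-group comparison concrete, here is a minimal sketch of a t statistic for the difference between two group means. It assumes the Welch (unequal-variance) form, which the slide does not specify, and the two groups are invented data. Turning the statistic into a p-value needs the t distribution, which is left to a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def welch_t_statistic(a, b):
    """t = (mean(a) - mean(b)) / sqrt(s_a^2 / n_a + s_b^2 / n_b)."""
    # statistics.variance is the sample variance (n - 1 in the denominator).
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

group_a = [1, 2, 3, 4, 5]  # hypothetical measurements for group A
group_b = [2, 3, 4, 5, 6]  # hypothetical measurements for group B

t = welch_t_statistic(group_a, group_b)
print(t)
```

A t statistic near zero means the group means differ by little relative to the noise in the data; large absolute values point toward a real difference.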
  • 61. Hypothesis Test & Confidence Interval Intro. Analogy: legal trial
      • Null Hypothesis (H0): suspect is innocent
      • Alternative Hypothesis (Ha): suspect is guilty
      • The judge must collect all the evidence (positive and negative)
      • The judge cannot pass sentence unless the evidence supports guilt beyond reasonable doubt
      • Confidence level (99%; P < 0.01): a judge may let 100 criminals go free under the benefit of the doubt, but an innocent person should never be punished.
  • 64. Significance Level (Critical Value) and Confidence Level / Interval
      • Significance level (α, alpha): the probability of making a Type I error. It is typically 0.01, 0.05 or 0.10.
      • Critical value: the value of the known distribution of the test statistic beyond which the null hypothesis is rejected, chosen so that the probability of a Type I error equals α.
      • The null hypothesis is rejected if the p-value is less than a predetermined level, α.
      • α is the probability of rejecting the null hypothesis given that it is true (a Type I error). It is usually set at or below 5%.
      • If p-value ≤ significance level α (most common: 0.05), then the hypothesis test is statistically significant (we reject H0).
      • If the significance level is 0.05, the corresponding confidence level is 95%.
      • If the confidence interval does not contain the null hypothesis value, the results are statistically significant.
  • 65. P Value
      • The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true.
      • If p-value ≤ significance level α (most common: 0.05), then we reject the original idea H0.
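The decision rule can be illustrated with a two-sided z test, the simplest case because it only needs the normal distribution. The numbers below (null mean 100, known standard deviation 15, sample of 36 with mean 105) are hypothetical, and `two_sided_z_test` is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return z, the two-sided p-value, the (1 - alpha) confidence interval
    for the mean, and whether H0 (mean == mu0) is rejected at level alpha."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Probability of a result at least this extreme in either tail under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Critical value that leaves alpha / 2 in each tail.
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    interval = (sample_mean - z_critical * standard_error,
                sample_mean + z_critical * standard_error)
    return z, p_value, interval, p_value <= alpha

z, p_value, interval, reject_h0 = two_sided_z_test(105, 100, 15, 36)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
print(f"95% CI: ({interval[0]:.1f}, {interval[1]:.1f})")
```

With these numbers the p-value is below 0.05 and the 95% confidence interval excludes the null value of 100, so both criteria lead to the same decision.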
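  • 66. The Confusion Matrix (rows: reality; columns: predicted)
      • Reality negative, predicted negative: True Negative (TN)
      • Reality negative, predicted positive: False Positive (FP)
      • Reality positive, predicted negative: False Negative (FN)
      • Reality positive, predicted positive: True Positive (TP)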
  • 67. Calculating Accuracy, Precision, Recall & F1 Score from Confusion Matrix
      • Counts (TN, FP, FN, TP): Model-1: 3000, 600, 400, 6500; Model-2: 6000, 100, 900, 3500; Model-3: 4000, 200, 800, 5500; Model-4: 4500, 800, 200, 5000
      • Accuracy = (TP + TN) / All: Mod1 90%, Mod2 90%, Mod3 90%, Mod4 90%
      • Precision = TP / (TP + FP): Mod1 92%, Mod2 97%, Mod3 96%, Mod4 86%
      • Recall = TP / (TP + FN): Mod1 94%, Mod2 80%, Mod3 87%, Mod4 96%
      • F1 Score = 2 × Precision × Recall / (Precision + Recall): Mod1 92.86%, Mod2 87.50%, Mod3 91.67%, Mod4 90.91%
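A small helper makes the four formulas explicit. This is a sketch: `classification_metrics` is an illustrative name, and the Model-1 counts are read as TN, FP, FN, TP, following the layout of the confusion matrix slide.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total    # share of all predictions that are correct
    precision = tp / (tp + fp)      # share of predicted positives that are real
    recall = tp / (tp + fn)         # share of real positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Model-1 counts from the slide, assumed to be ordered TN, FP, FN, TP.
model_1 = classification_metrics(tn=3000, fp=600, fn=400, tp=6500)
for name, value in model_1.items():
    print(f"{name}: {value:.2%}")
```

Two models can have the same accuracy and still differ in precision and recall, which is why the slide compares all four metrics side by side.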
  • 68. Calculating Accuracy, Precision, Recall & F1 Score from Confusion Matrix: choosing between precision and recall
      • Marketing: higher precision is required if the budget is low. If the product is expensive (high-end cars), high recall is important (I don’t mind spending an additional 10% of my budget so as not to lose a potential customer).
      • Law and order model: precision must be high. Health care model: recall must be high.
      • Hiring: if you urgently need only a few candidates, then go for a high-precision model. If you need a lot of candidates and have enough time to interview, then build a high-recall model.
  • 69. Important Model Metrics: Gain & Lift vs ROC/AUC. Gain charts, lift charts and ROC/AUC are all visual aids for evaluating the performance of classification models, but there are subtle differences:
      • Gain/lift measures the effectiveness of a classification model as the ratio between the results obtained with and without the model, whereas ROC/AUC measures discrimination, that is, the ability of the model to classify correctly, plotted using the true positive and false positive rates across all predicted probabilities.
      • A gain/lift chart evaluates a model on its performance on a portion of the population (i.e. the first few attempts), whereas ROC/AUC evaluates the model on the whole population.
      • Gain/lift charts can be the primary metrics for evaluating campaign targeting models, whereas ROC/AUC are the primary metrics for evaluating anti-fraud / liability models.
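The difference between the two families of metrics shows up clearly in code. The sketch below is an assumption-laden illustration: the labels and scores are invented, `roc_auc` uses the pairwise-ranking definition of AUC rather than plotting a curve, and `lift_at` looks only at the top-scored fraction of the population.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative
    (ties count as half). Uses every positive/negative pair in the data."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def lift_at(labels, scores, fraction):
    """Positive rate in the top-scored fraction divided by the overall rate."""
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate

# Hypothetical outcomes (1 = responded) and model scores for ten customers.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
top_20_lift = lift_at(labels, scores, 0.2)
print(auc, top_20_lift)
```

Here the lift in the top 20% is high because the two best-scored customers both responded, while the AUC is pulled down by a responder ranked near the bottom. A campaign that only contacts the top slice cares about the first number; a model that must rank everyone cares about the second.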