Data Cleaning
(Missing value, Outlier)
Exploratory Data Analysis
(Descriptive Statistics, Visualization)
Feature Engineering
(Data Transformation (Encoding, Skew, Scale), Feature Selection)
“Data is the fuel for
ML algorithms”
Case Study: A classification model for diagnosing breast cancer in women.
A sample of 1000 women was studied in a given population: 100 of them had breast cancer and the remaining 900 did not. The dataset was split into a 70/30 train/test set.
The accuracy was 90%: excellent.
A couple of months after deployment, some of the women whom the model had diagnosed as having “no breast cancer” started showing symptoms of breast cancer.
Confusion matrix on the test set (300 women). Null hypothesis H0: X has breast cancer.

                                           Actual: Breast Cancer   Actual: No Breast Cancer   Total
                                           (H0 valid)              (H0 invalid)
Predicted: Accept H0 (X has disease)       TP = 0                  FP = 0                     0
Predicted: Reject H0 (X has no disease)    FN = 30                 TN = 270                   300
Total                                      30                      270                        300

FP: X might feel she will die soon.
FN: X thinks she is healthy when suffering from the disease.
The model has conveniently classified all the test data as “No Breast Cancer”.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 270) / 300 = 90%
Precision (share of disease predictions that are correct) = TP / (TP + FP) = 0 / 0, which is undefined and is conventionally reported as 0%
Recall = TP / (TP + FN) = 0 / 30 = 0%
Isn’t it better to think you have breast cancer and not have it than to think you don’t have breast cancer when you do?
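The figures on this slide can be reproduced in a few lines of plain Python. This is a minimal sketch: the function name `classification_metrics` is an illustrative assumption, and reporting an undefined precision (0/0) as 0.0 is a convention chosen here, not something the slide prescribes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # With no positive predictions TP + FP = 0, so precision is undefined;
    # this sketch reports 0.0 in that case (assumed convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Test-set counts from the case study: everyone is predicted "no breast cancer".
accuracy, precision, recall = classification_metrics(tp=0, fp=0, fn=30, tn=270)
print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%}")
```

High accuracy alongside zero recall is the signature of a model that only ever predicts the majority class.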
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://towardsdatascience.com/fraud-detection-with-cost-sensitive-machine-learning-24b8760d35d9
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Observed accuracy = (TP + TN) / (TP + TN + FP + FN) = (10 + 8) / (10 + 7 + 5 + 8) = 18/30 = 0.6
Expected accuracy = [ (TP + FN)(TP + FP)/N + (FP + TN)(FN + TN)/N ] / N, where N = TP + TN + FP + FN
= [ (10 + 5)(10 + 7)/30 + (7 + 8)(8 + 5)/30 ] / 30 = [ (15 × 17)/30 + (15 × 13)/30 ] / 30
= (8.5 + 6.5) / 30 = 0.5
Kappa = (observed accuracy − expected accuracy) / (1 − expected accuracy)
= (0.6 − 0.5) / (1 − 0.5) = 0.20
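The Kappa arithmetic above can be checked with a short function. This is a sketch in plain Python; `cohen_kappa` is a name chosen for illustration, and the counts are the cats/dogs values used in the calculation (TP = 10, FP = 7, FN = 5, TN = 8).

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) / n + (fp + tn) * (fn + tn) / n) / n
    return (observed - expected) / (1 - expected)


kappa_cats_dogs = cohen_kappa(tp=10, fp=7, fn=5, tn=8)
print(round(kappa_cats_dogs, 2))
```

Calling the same function with the imbalanced matrix from the deck (60, 125, 5, 5000) gives a Kappa of roughly 0.47, even though the plain accuracy of that matrix is above 97%.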
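Confusion matrix for the Kappa example (rows: model classification, columns: actual class)

                 Actual Cats   Actual Dogs   Total
Model: Cats      10            7             17
Model: Dogs      5             8             13
Total            15            15            30

Precision = TP / (TP + FP)     Recall = TP / (TP + FN)

TASK: repeat the calculation for the matrix below (answer shown on the slide: 0.47, which matches Cohen's Kappa for this matrix)

                 Actual Cats   Actual Dogs
Model: Cats      60            125
Model: Dogs      5             5000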
https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a
TNR=1-FPR
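The Matthews correlation coefficient (MCC) named in the link above, and the identity TNR = 1 − FPR, can be sketched as follows. The function names are illustrative assumptions, and returning 0.0 when the MCC denominator is zero is a convention assumed here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a whole row or column is empty.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def true_negative_rate(fp, tn):
    return tn / (tn + fp)


def false_positive_rate(fp, tn):
    return fp / (fp + tn)


mcc_cats_dogs = mcc(tp=10, fp=7, fn=5, tn=8)
mcc_case_study = mcc(tp=0, fp=0, fn=30, tn=270)  # all-negative model
tnr = true_negative_rate(fp=7, tn=8)
fpr = false_positive_rate(fp=7, tn=8)
```

Unlike accuracy, MCC stays at zero for the all-negative breast-cancer model, because it uses all four cells of the matrix.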
“No one size fits all”
https://machinelearningmastery.com/handle-missing-data-python/
Simple Imputer https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
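Statistical imputation replaces each missing value with a summary statistic of the observed values in the same column. The sketch below shows the idea in plain Python rather than with scikit-learn's `SimpleImputer`; the function name `impute` and the sample data are assumptions made for illustration.

```python
import math
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace missing entries (None or NaN) with the mean or median of the rest."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if is_missing(v) else v for v in values]


ages = [25, None, 35, float("nan"), 30]  # made-up sample with two missing values
imputed_ages = impute(ages, strategy="mean")
print(imputed_ages)
```

The median strategy is usually the safer choice when the column contains outliers, since the mean is pulled towards them.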
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
Pearson and ANOVA (parametric)
Spearman and Kendall’s rank (non-parametric)
Chi-squared test, Mutual Information
I(X; Y) = H(X) − H(X | Y)
χ² = ∑ (O − E)² / E
F = MST / MSE
MST = SST / (p − 1)
MSE = SSE / (N − p)
SSE = ∑ (n − 1) s²
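The mutual information formula I(X; Y) = H(X) − H(X | Y) can be estimated directly from paired categorical samples. This is a minimal sketch in plain Python, measured in bits; the function names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a list of categorical labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y), estimated from paired samples."""
    n = len(xs)
    h_x_given_y = 0.0
    for y_value, count in Counter(ys).items():
        xs_for_y = [x for x, y in zip(xs, ys) if y == y_value]
        h_x_given_y += (count / n) * entropy(xs_for_y)
    return entropy(xs) - h_x_given_y


feature = ["a", "a", "b", "b"]       # made-up feature values
dependent = ["p", "p", "q", "q"]     # fully determined by the feature
unrelated = ["p", "q", "p", "q"]     # carries no information about the feature
mi_dependent = mutual_information(feature, dependent)
mi_unrelated = mutual_information(feature, unrelated)
```

A feature that fully determines the target scores the full entropy of the target (1 bit here), while an unrelated one scores 0, which is why mutual information works as a filter criterion.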
REVERSE CORRELATION
Pearson correlation, worked example (dx = X − mean of X, dy = Y − mean of Y)

X      Y      dx     dy     dx²    dy²    dx·dy
3      6      1      2      1      4      2
2      3      0      -1     0      1      0
2      5      0      1      0      1      0
1      2      -1     -2     1      4      2
Mean:  2 (X), 4 (Y)         Sum:   2      10     4

r = ∑ dx·dy / √(∑ dx² × ∑ dy²) = 4 / √(2 × 10) = 4/√20 = 0.8944 > 0, a high positive correlation
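The worked Pearson example can be verified with a short function. This is a sketch in plain Python using the four (X, Y) pairs from the table; `pearson_r` is a name chosen for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)


r = pearson_r([3, 2, 2, 1], [6, 3, 5, 2])
print(round(r, 4))
```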
ANOVA worked example

Independent variable   # of animals   Average   S.D.   S.D.²
Dog                    5              12        2      4
Cat                    5              16        1      1
Hamster                5              20        4      16

• Different groups must have equal sample size
• No relationship between subjects in each sample
• Used to test more than 2 levels within an independent variable

p = 3 (number of populations, i.e. groups)
n = 5 (number of samples per group)
N = 15 (total number of observations)
SST = 5 × [(12 − 16)² + (16 − 16)² + (20 − 16)²] = 160
MST = SST / (p − 1) = 160 / (3 − 1) = 80
SSE = (4 + 1 + 16) × (n − 1) = 84
MSE = SSE / (N − p) = 84 / (15 − 3) = 7
F = MST / MSE = 80 / 7 = 11.429
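The F statistic above depends only on the group means, the group variances and the common group size, so it can be recomputed from the summary table. This is a sketch under the slide's equal-sample-size assumption; the function name is illustrative.

```python
def one_way_anova_f(means, variances, n):
    """F statistic for p groups of equal size n, from group means and variances."""
    p = len(means)
    total_observations = p * n
    grand_mean = sum(means) / p
    sst = n * sum((m - grand_mean) ** 2 for m in means)  # between-group sum of squares
    sse = (n - 1) * sum(variances)                       # within-group sum of squares
    mst = sst / (p - 1)
    mse = sse / (total_observations - p)
    return mst / mse


# Dog, cat and hamster groups from the table: means 12, 16, 20; variances 4, 1, 16.
f_stat = one_way_anova_f(means=[12, 16, 20], variances=[4, 1, 16], n=5)
print(round(f_stat, 3))
```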
Kendall’s τ = (concordant pairs − discordant pairs) / total pairs = (15 − 6) / 21 = 0.4286
Interpretation: agreement between 2 experts
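Kendall's τ counts the pairs of items that two rankings order the same way (concordant) and the opposite way (discordant). The slide does not show the underlying rankings, so the two expert rankings below are hypothetical, chosen only so that they reproduce the counts 15 and 6 over 21 pairs.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Return (concordant, discordant, tau) for two rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        direction = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return concordant, discordant, (concordant - discordant) / total_pairs


# Hypothetical rankings of 7 items by two experts (not taken from the slide).
expert_1 = [1, 2, 3, 4, 5, 6, 7]
expert_2 = [4, 3, 2, 1, 5, 6, 7]
concordant, discordant, tau = kendall_tau(expert_1, expert_2)
print(concordant, discordant, round(tau, 4))
```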
Observed values
           Cat    Dog    Total
Men        207    282    489
Women      231    242    473
Total      438    524    962

Expected values (row total × column total / grand total)
           Cat                         Dog                         Total
Men        489 × 438 / 962 = 222.64    489 × 524 / 962 = 266.36    489
Women      473 × 438 / 962 = 215.36    473 × 524 / 962 = 257.64    473
Total      438                         524                         962

(O − E)² / E
           Cat                                 Dog
Men        (207 − 222.64)² / 222.64 = 1.099    (282 − 266.36)² / 266.36 = 0.918
Women      (231 − 215.36)² / 215.36 = 1.136    (242 − 257.64)² / 257.64 = 0.949

χ² = 1.099 + 0.918 + 1.136 + 0.949 = 4.102
Degrees of freedom = (rows − 1) × (columns − 1) = (2 − 1) × (2 − 1) = 1
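The chi-squared calculation generalises to any contingency table: compute each expected count from the row and column totals, then sum (O − E)² / E. This is a plain-Python sketch with an illustrative function name. Because it does not round the expected counts, it returns about 4.10 rather than exactly the slide's hand-rounded 4.102.

```python
def chi_square(observed):
    """Chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            chi2 += (o - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Rows: men, women. Columns: cat, dog.
chi2, dof = chi_square([[207, 282], [231, 242]])
print(round(chi2, 2), dof)
```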
https://machinelearningmastery.com/calculate-feature-importance-with-python/
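from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Forward selection: start with no features and add the best one each round (cv=0 disables cross-validation)
sfs = SFS(LinearRegression(), k_features=11, forward=True, floating=False, scoring='r2', cv=0)
# Backward selection: start with all features and drop the weakest one each round
sbs = SFS(LinearRegression(), k_features=11, forward=False, floating=False, cv=0)
sbs.fit(X, y)  # X, y: feature matrix and target, assumed to be defined earlier
sbs.k_feature_names_  # names of the selected features
# Recursive feature elimination, ranking features with a decision tree
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)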
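from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet, LogisticRegression
# L1-penalised logistic regression shrinks weak coefficients to zero; the 'liblinear' solver supports penalty='l1'
sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
# scaler: a scaler already fitted on X_train (assumed to be defined earlier)
sel_.fit(scaler.transform(X_train.fillna(0)), y_train)
# Elastic net combines the L1 and L2 penalties
regr = ElasticNet(random_state=0)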
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
df_dummies = pd.get_dummies(df, columns=['sex'])
https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/
Assumptions by models:
1. Linear relationship between predictors and target variable
2. No noise i.e. there are no outliers in the data
3. No collinearity
4. Normal distribution of predictors and the target variable
5. Scale if it’s a distance-based algorithm
Solutions
1. Log transform (log(x))
2. Square root transform (a special case of the power transform)
3. Power transform, e.g. Box-Cox (stabilizes variance)
Apply the reverse transformation when making predictions
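A skewed variable can be log-transformed before modelling and mapped back to the original scale afterwards. The sketch below uses log(1 + x) rather than the plain log(x) named on the slide, so that zeros are handled; that substitution, the function names and the sample values are all assumptions for illustration.

```python
import math


def log_transform(values):
    # log1p(x) = log(1 + x); assumes non-negative inputs
    return [math.log1p(v) for v in values]


def inverse_log_transform(values):
    # expm1(y) = exp(y) - 1 undoes log1p
    return [math.expm1(v) for v in values]


incomes = [0, 10, 100, 1000, 100000]  # made-up, strongly right-skewed values
transformed = log_transform(incomes)
restored = inverse_log_transform(transformed)
```

If a model is trained on the transformed target, its predictions come out on the log scale and have to pass through `inverse_log_transform` before they are reported.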
https://towardsdatascience.com/data-visualization-for-machine-learning-and-data-science-a45178970be7
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57
Line plot
• displays information as a series of data points connected by straight line segments
• used to visualize the directional movement of one or more variables over time, i.e. time series data
• the X axis would be datetime and the Y axis the measured quantity, such as monthly sales
• E.g. simple, multiple, time series analysis
Source: https://www.machinelearningplus.com/plots/matplotlib-line-plot/
Bar plot
• shows categorical data as rectangular bars whose heights are proportional to the values they represent
• example: data on the height of persons grouped as ‘Tall’, ‘Medium’, ‘Short’, etc.
• used to compare values across the different categories in the data
• categorical data is simply data grouped into distinct logical groups
• Types include: simple, horizontal, grouped and stacked
https://www.machinelearningplus.com/plots/bar-plot-in-python/
Histogram
• visualizes the frequency distribution of a numeric array by splitting it into small equal-sized bins
• usually drawn for large arrays: it computes the frequency distribution of the array and plots it
• Types include basic, grouped, density curve, facets
https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
To obtain the Winsorized mean, sort the data and replace the smallest k values with the (k+1)st smallest value. Do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of the modified data is the Winsorized mean.
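The Winsorizing procedure described above can be sketched in plain Python. The function names and the sample data are assumptions for illustration; k is the number of values replaced at each end.

```python
def winsorize(values, k):
    """Replace the k smallest and k largest values with their nearest kept neighbours."""
    data = sorted(values)
    if 2 * k >= len(data):
        raise ValueError("k is too large for this sample")
    low = data[k]         # the (k+1)st smallest value
    high = data[-k - 1]   # the (k+1)st largest value
    return [low] * k + data[k:len(data) - k] + [high] * k


def winsorized_mean(values, k):
    winsorized = winsorize(values, k)
    return sum(winsorized) / len(winsorized)


sample = [1, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value at each end
winsorized_sample = winsorize(sample, k=1)
w_mean = winsorized_mean(sample, k=1)
```

Unlike trimming, Winsorizing keeps the sample size unchanged: the extreme values are pulled in rather than dropped.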
Isolation Forest: a normal point (on the left) requires more partitions to be isolated than an abnormal point (on the right)
https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
Box plot
• visualizes how a given variable is distributed, using quartiles
• shows the minimum, maximum, median, first quartile and third quartile of the data set
• a graphical way to show the spread of a numerical variable through its quartiles
• the box covers the middle 50% of all data points: IQR = Q3 − Q1
• the upper and lower whiskers extend up to 1.5 times the IQR from the top and bottom of the box
• points that lie outside the whiskers, i.e. more than 1.5 × IQR beyond the box in either direction, are generally considered outliers (< Q1 − 1.5 × IQR or > Q3 + 1.5 × IQR)
• Types include basic, notched, violin plot
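The 1.5 × IQR rule from the box plot can be applied without drawing anything. This sketch uses `statistics.quantiles` from the Python standard library with the "inclusive" method; quartile conventions differ between tools, so the fences may vary slightly elsewhere. The function name and the data are assumptions for illustration.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return (outliers, (lower_fence, upper_fence)) using the 1.5 x IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return outliers, (lower_fence, upper_fence)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value
outliers, fences = iqr_outliers(data)
print(outliers, fences)
```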
https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review
TASK
Scatter plot
• the values of two variables are plotted along two axes
• used to visualize the relationship between two variables
• Types include basic, correlation, linear fit plot, bubble plot
https://www.machinelearningplus.com/plots/python-scatter-plot/
Correlation heatmap
• Correlation between variables indicates how the variables are inter-related
• Correlation is not causation
1. Each cell in the grid represents the value of the correlation coefficient between two variables.
2. It is a square and symmetric matrix.
3. All diagonal elements are 1.
4. The axis ticks denote the feature each of them represents.
5. A large positive value (near 1.0) indicates a strong positive correlation.
6. A large negative value (near -1.0) indicates a strong negative correlation.
7. A value near 0 (positive or negative) indicates the absence of any linear correlation between the two variables. This alone does not prove that the variables are independent, since a non-linear relationship may still exist.
8. Each cell in the matrix is also represented by a shade of a color. Here darker shades indicate smaller values while brighter shades correspond to larger values (near 1).
9. This scale is shown by the color bar on the right side of the plot.
Multicollinearity
• E.g. a person’s height and weight, age and sales price of a car, or years of education and annual income
• Does not affect decision trees (DT)
• Affects kNN
• Causes
  • Insufficient data
  • Dummy variables
  • Including a variable in the regression that is actually a combination of two other variables
• Identify: a pairwise correlation > 0.4 or a Variance Inflation Factor (VIF) score > 5 indicates high correlation
• Solutions
  • Feature selection
  • PCA
  • More data
  • Ridge regression, which reduces the magnitude of model coefficients
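The two detection rules above can be sketched for the simplest case of two predictors, where the VIF reduces to 1 / (1 − r²) with r the Pearson correlation between them. The height and weight values are made up, and the helper names are assumptions for illustration; with more than two predictors each VIF needs a regression of one predictor on all the others.

```python
import math


def pearson_r(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def vif_two_predictors(r):
    """VIF when there are exactly two predictors with correlation r."""
    return 1.0 / (1.0 - r ** 2)


height_cm = [150, 160, 170, 180, 190]  # made-up data
weight_kg = [52, 58, 69, 77, 90]       # made-up data
r_hw = pearson_r(height_cm, weight_kg)
vif_hw = vif_two_predictors(r_hw)
is_collinear = abs(r_hw) > 0.4 or vif_hw > 5  # thresholds taken from the slide
```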
                   Actual Cats   Actual Dogs
Predicted: Cats    60            125
Predicted: Dogs    5             5000
1. Explain the essential Python libraries numpy, pandas, scipy, scikit-learn and statsmodels.
2. Find Accuracy, Precision, Recall, Kappa Score, MCC, F1 score, ROC AUC on.
3. How is a missing value represented? What are the types of missing values and the ways of dealing with them?
4. Discuss data transformation methods for categorical data and numerical data.
5. Explain Python visualization tools: matplotlib, pandas, seaborn, bokeh, plotly.
6. Discuss imbalanced data handling mechanisms and the problems that arise if imbalance is not handled.
7. How can you determine which features are most important in your model? Which feature selection algorithm should be used when? State with an example.
8. Discuss wrapper-based feature selection methods with an example diagram.
9. Describe the various categories of filter-based feature selection methods, based on the type of features, with mathematical equations.
10. Compute the Karl Pearson and Spearman coefficients of correlation.
11. Find Kendall’s rank correlation coefficient tau.
12. Indicate the different types of transformations that data has to be subjected to before dimensionality reduction techniques can be applied.

More Related Content

Similar to ML MODULE 2.pdf

Measures of Relative Standing and Boxplots
Measures of Relative Standing and BoxplotsMeasures of Relative Standing and Boxplots
Measures of Relative Standing and BoxplotsLong Beach City College
 
Linear regression by Kodebay
Linear regression by KodebayLinear regression by Kodebay
Linear regression by KodebayKodebay
 
3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplotsLong Beach City College
 
韩国会议
韩国会议韩国会议
韩国会议YAO YUAN
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningBenjamin Bengfort
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control StudySatish Gupta
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionDario Panada
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Simplilearn
 
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...ChemAxon
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxSivam Chinna
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171Yaxin Liu
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...IJAEMSJORNAL
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better MathBrent Schneeman
 
Mimo system-order-reduction-using-real-coded-genetic-algorithm
Mimo system-order-reduction-using-real-coded-genetic-algorithmMimo system-order-reduction-using-real-coded-genetic-algorithm
Mimo system-order-reduction-using-real-coded-genetic-algorithmCemal Ardil
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning IntroductionKuppusamy P
 
Dynamic Kohonen Network for Representing Changes in Inputs
Dynamic Kohonen Network for Representing Changes in InputsDynamic Kohonen Network for Representing Changes in Inputs
Dynamic Kohonen Network for Representing Changes in InputsJean Fecteau
 
1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docx1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docxpaynetawnya
 

Similar to ML MODULE 2.pdf (20)

Measures of Relative Standing and Boxplots
Measures of Relative Standing and BoxplotsMeasures of Relative Standing and Boxplots
Measures of Relative Standing and Boxplots
 
Linear regression by Kodebay
Linear regression by KodebayLinear regression by Kodebay
Linear regression by Kodebay
 
Practice test1 solution
Practice test1 solutionPractice test1 solution
Practice test1 solution
 
3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots
 
Regression
RegressionRegression
Regression
 
韩国会议
韩国会议韩国会议
韩国会议
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learning
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control Study
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
 
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptx
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better Math
 
Mimo system-order-reduction-using-real-coded-genetic-algorithm
Mimo system-order-reduction-using-real-coded-genetic-algorithmMimo system-order-reduction-using-real-coded-genetic-algorithm
Mimo system-order-reduction-using-real-coded-genetic-algorithm
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 
Dynamic Kohonen Network for Representing Changes in Inputs
Dynamic Kohonen Network for Representing Changes in InputsDynamic Kohonen Network for Representing Changes in Inputs
Dynamic Kohonen Network for Representing Changes in Inputs
 
1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docx1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docx
 

More from Shiwani Gupta

module6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdfmodule6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdfShiwani Gupta
 
module5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdfmodule5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdfShiwani Gupta
 
module4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdfmodule4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdfShiwani Gupta
 
module3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdfmodule3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdfShiwani Gupta
 
module2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdfmodule2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdfShiwani Gupta
 
module1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdfmodule1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdfShiwani Gupta
 
ML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdfML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdfShiwani Gupta
 
Functionsandpigeonholeprinciple
FunctionsandpigeonholeprincipleFunctionsandpigeonholeprinciple
FunctionsandpigeonholeprincipleShiwani Gupta
 
Uncertain knowledge and reasoning
Uncertain knowledge and reasoningUncertain knowledge and reasoning
Uncertain knowledge and reasoningShiwani Gupta
 

More from Shiwani Gupta (20)

ML MODULE 6.pdf
ML MODULE 6.pdfML MODULE 6.pdf
ML MODULE 6.pdf
 
ML MODULE 5.pdf
ML MODULE 5.pdfML MODULE 5.pdf
ML MODULE 5.pdf
 
ML MODULE 4.pdf
ML MODULE 4.pdfML MODULE 4.pdf
ML MODULE 4.pdf
 
module6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdfmodule6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdf
 
module5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdfmodule5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdf
 
module4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdfmodule4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdf
 
module3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdfmodule3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdf
 
module2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdfmodule2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdf
 
module1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdfmodule1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdf
 
ML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdfML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdf
 
ML Module 3.pdf
ML Module 3.pdfML Module 3.pdf
ML Module 3.pdf
 
Problem formulation
Problem formulationProblem formulation
Problem formulation
 
Simplex method
Simplex methodSimplex method
Simplex method
 
Functionsandpigeonholeprinciple
FunctionsandpigeonholeprincipleFunctionsandpigeonholeprinciple
Functionsandpigeonholeprinciple
 
Relations
RelationsRelations
Relations
 
Logic
LogicLogic
Logic
 
Set theory
Set theorySet theory
Set theory
 
Uncertain knowledge and reasoning
Uncertain knowledge and reasoningUncertain knowledge and reasoning
Uncertain knowledge and reasoning
 
Introduction to ai
Introduction to aiIntroduction to ai
Introduction to ai
 
Planning Agent
Planning AgentPlanning Agent
Planning Agent
 

Recently uploaded

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 

Recently uploaded (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 

ML MODULE 2.pdf

  • 1. Data Cleaning (Missing value, Outlier) Exploratory Data Analysis (Descriptive Statistics, Visualization) Feature Engineering (Data Transformation (Encoding, Skew, Scale) Feature Selection) “Data is the fuel for ML algorithms”
  • 2. 2
  • 3. 3 Case Study: A classification model for diagnosing Breast Cancer in women. A sample of 1000 women were studied in a given population, 100 of them with Breast Cancer while remaining 900 were without it. Split dataset into 70/30 train/test set. The accuracy was 90% excellent. A couple of months after deployment, some of the women who were diagnosed by the model as having “no breast cancer” started showing symptoms of Breast Cancer.
  • 4. 4 Actual Predi cted Null Hypothesis (H0) valid: Breast Cancer Null Hypothesis (H0) invalid: No Breast Cancer Accept H0 (X has disease) TP = 0 FP (X might feel she will die soon) = 0 0 Reject H0 (X does not have disease) FN (X thinks she is healthy when suffering form disease) = 30 TN = 270 300 30 270 300 Model has conveniently classified all the test data as “NO Breast Cancer” Accuracy = (TP + TN) / (TP + TN + FP + FN) = 90% Precision (predict disease correctly) = TP / (TP + FP) = 0% Recall = TP / (TP + FN) = 0% Isn’t it better to think you have Breast Cancer and not have it than to think you don’t have Breast Cancer but you’ve got it.
  • 6. Kappa score, worked example (cats are the positive class):
        Model classification \ Actual class | Cats    | Dogs   | Total
        Cats                                | TP = 10 | FP = 7 | 17
        Dogs                                | FN = 5  | TN = 8 | 13
        Total                               | 15      | 15     | 30
      Observed accuracy = (TP+TN)/(TP+TN+FP+FN) = (10+8)/(10+8+7+5) = 0.6
      Expected accuracy = [ (TP+FN)*(TP+FP)/(TP+TN+FP+FN) + (FP+TN)*(FN+TN)/(TP+TN+FP+FN) ] / (TP+TN+FP+FN) = [ (15*17)/30 + (15*13)/30 ] / 30 = (8.5+6.5)/30 = 0.5
      Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy) = (0.6-0.5)/(1-0.5) = 0.20
      Precision = TP / (TP + FP); Recall = TP / (TP + FN)
      TASK: for TP = 60, FP = 125, FN = 5, TN = 5000, Kappa = 0.47
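As a check on the arithmetic, the kappa formula from this slide can be written as a short function. This is a sketch under the slide's own definitions; the function name `kappa_score` is hypothetical.

```python
def kappa_score(tp, fp, fn, tn):
    """Kappa = (observed - expected) / (1 - expected), as defined on the slide."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement: product of the marginal totals for each class,
    # summed over both classes and divided by total squared.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    return (observed - expected) / (1 - expected)


worked = kappa_score(tp=10, fp=7, fn=5, tn=8)       # cats/dogs example
task = kappa_score(tp=60, fp=125, fn=5, tn=5000)    # TASK matrix
print(round(worked, 2), round(task, 2))
```

The TASK matrix has about 97% accuracy but a kappa of only about 0.47, because most of the agreement is already expected by chance on such imbalanced data.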
  • 8. “No one size fits all”
  • 15. Pearson and ANOVA (parametric); Spearman and Kendall’s rank (non-parametric); Chi² test and Mutual Information.
      I(X; Y) = H(X) - H(X | Y)
      χ² = ∑ (O - E)² / E
      F = MST/MSE, where MST = SST/(p-1), MSE = SSE/(N-p), SSE = ∑ (n-1)s²
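  • 17. Pearson correlation, worked example, with dx = X - XMEAN and dy = Y - YMEAN:
        X      | Y      | dx | dy | dx²   | dy²    | dx*dy | dx²*dy²
        3      | 6      |  1 |  2 | 1     | 4      | 2     | 4
        2      | 3      |  0 | -1 | 0     | 1      | 0     | 0
        2      | 5      |  0 |  1 | 0     | 1      | 0     | 0
        1      | 2      | -1 | -2 | 1     | 4      | 2     | 4
        Mean 2 | Mean 4 |    |    | Sum 2 | Sum 10 | Sum 4 |
      r = ∑ dx*dy / √(∑ dx² * ∑ dy²) = 4/√(2*10) = 4/√20 = 0.8944 > 0, i.e. a high positive correlation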
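The same calculation can be reproduced with NumPy. This is a minimal sketch using the four (X, Y) pairs from the slide; the variable names are illustrative.

```python
import numpy as np

# The four (X, Y) pairs from the slide.
x = np.array([3, 2, 2, 1], dtype=float)
y = np.array([6, 3, 5, 2], dtype=float)

dx = x - x.mean()   # deviations from the mean of X
dy = y - y.mean()   # deviations from the mean of Y

# Pearson r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(round(r, 4))
```

`np.corrcoef(x, y)[0, 1]` gives the same value and is the usual shortcut.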
  • 18. One-way ANOVA, worked example. Independent variable: domestic animal.
        Group   | # of animals | Av. | S.D. | S.D.²
        Dog     | 5            | 12  | 2    | 4
        Cat     | 5            | 16  | 1    | 1
        Hamster | 5            | 20  | 4    | 16
      Different groups must have equal sample size. There is no relationship between the subjects in each sample. Used to test more than 2 levels within an independent variable.
      p = 3 populations (groups); n = 5 observations per sample; N = 15 total observations; grand mean = 16
      SST = 5*[(12-16)² + (16-16)² + (20-16)²] = 160
      MST = SST/(p-1) = 160/(3-1) = 80
      SSE = (4+1+16)*(n-1) = 84
      MSE = SSE/(N-p) = 84/(15-3) = 7
      F = MST/MSE = 80/7 = 11.429
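The F statistic above depends only on the group means, group variances and the common sample size, so it can be sketched without the raw data. The function name `anova_f_from_summary` is hypothetical, and the sketch assumes equal group sizes, as the slide does.

```python
def anova_f_from_summary(means, variances, n):
    """One-way ANOVA F statistic from per-group means and variances.

    Assumes every group has the same sample size n.
    """
    p = len(means)                       # number of groups
    total_n = p * n                      # total number of observations
    grand_mean = sum(means) / p          # valid because group sizes are equal
    sst = n * sum((m - grand_mean) ** 2 for m in means)   # between groups
    sse = (n - 1) * sum(variances)                         # within groups
    mst = sst / (p - 1)
    mse = sse / (total_n - p)
    return mst / mse


# Dog, cat, hamster groups from the slide.
f_stat = anova_f_from_summary(means=[12, 16, 20], variances=[4, 1, 16], n=5)
print(round(f_stat, 3))
```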
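  • 19. Kendall’s τ = (C - D)/(C + D) = (15-6)/21 = 0.4286, where C and D are the numbers of concordant and discordant pairs. Interpretation: the degree of agreement between the rankings given by 2 experts.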
  • 20. Chi-square test, worked example.
      Observed values:
              | Cat | Dog | Total
        Men   | 207 | 282 | 489
        Women | 231 | 242 | 473
        Total | 438 | 524 | 962
      Expected values (row total * column total / grand total):
              | Cat                  | Dog                  | Total
        Men   | 489*438/962 = 222.64 | 489*524/962 = 266.36 | 489
        Women | 473*438/962 = 215.36 | 473*524/962 = 257.64 | 473
        Total | 438                  | 524                  | 962
      (O-E)²/E:
              | Cat                          | Dog
        Men   | (207-222.64)²/222.64 = 1.099 | (282-266.36)²/266.36 = 0.918
        Women | (231-215.36)²/215.36 = 1.136 | (242-257.64)²/257.64 = 0.949
      χ² = 1.099 + 0.918 + 1.136 + 0.949 = 4.102
      Degrees of freedom = (rows-1)*(cols-1) = (2-1)*(2-1) = 1
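The expected counts and the χ² statistic can be recomputed directly from the observed table. This is a minimal NumPy sketch; without intermediate rounding the statistic comes out at approximately 4.10, in line with the hand calculation.

```python
import numpy as np

# Observed counts: rows = men, women; columns = cat, dog.
observed = np.array([[207, 282],
                     [231, 242]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(round(chi2, 2), dof)
```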
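  • 22. Wrapper methods with mlxtend and scikit-learn:
      from mlxtend.feature_selection import SequentialFeatureSelector as SFS
      from sklearn.linear_model import LinearRegression
      from sklearn.feature_selection import RFE
      from sklearn.tree import DecisionTreeClassifier
      # Forward selection: add features one at a time until 11 are selected (cv=0 turns off cross-validation)
      sfs = SFS(LinearRegression(), k_features=11, forward=True, floating=False, scoring='r2', cv=0)
      # Backward selection: start from all features and drop them until 11 remain
      sbs = SFS(LinearRegression(), k_features=11, forward=False, floating=False, cv=0)
      sbs.fit(X, y)          # X, y: feature table and target, defined elsewhere
      sbs.k_feature_names_   # names of the selected features
      # Recursive feature elimination down to 5 features
      rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)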
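To see recursive feature elimination run end to end, here is a small sketch on synthetic data. The dataset from `make_classification` and the fixed `random_state` values are assumptions added for reproducibility; they are not part of the slide.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 10 features, 4 informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Repeatedly fit the tree and drop the least important feature until 5 remain.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=5)
rfe.fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features.
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)
```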
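  • 23. Embedded methods with scikit-learn:
      from sklearn.feature_selection import SelectFromModel
      from sklearn.linear_model import LogisticRegression, ElasticNet
      # The L1 penalty needs a solver that supports it, e.g. liblinear
      sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
      # scaler: a scaler already fitted on X_train, defined elsewhere
      sel_.fit(scaler.transform(X_train.fillna(0)), y_train)
      regr = ElasticNet(random_state=0)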
  • 27. One-hot encoding with pandas: df_dummies = pd.get_dummies(df, columns=['sex']) Sources: https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/ and https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/
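A small runnable version of the `get_dummies` call may help. The toy DataFrame below, with its `age` and `sex` columns, is invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one numeric and one categorical column.
df = pd.DataFrame({
    "age": [34, 51, 29, 45],
    "sex": ["female", "male", "male", "female"],
})

# Replace the 'sex' column with one indicator column per category.
df_dummies = pd.get_dummies(df, columns=["sex"])
print(list(df_dummies.columns))
```

For linear models, passing `drop_first=True` keeps one fewer indicator column per variable and avoids perfectly collinear dummy variables.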
  • 29. Assumptions made by models:
        1. Linear relationship between predictors and target variable
        2. No noise, i.e. there are no outliers in the data
        3. No collinearity
        4. Normal distribution of predictors and the target variable
        5. Scaled features, if it is a distance-based algorithm
      Solutions:
        1. Log transform (log(x))
        2. Square root (special case)
        3. Power transform: Box-Cox (stabilizes variance)
      Apply the reverse transformation when making predictions.
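The log transform and its reversal can be sketched with NumPy. The data below is synthetic (a geometric sequence chosen so that its logarithm is perfectly symmetric), and the helper `skewness` is a hypothetical name for the standard moment-based skewness.

```python
import numpy as np


def skewness(values):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    d = values - values.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5


# Synthetic right-skewed target (an assumption for illustration).
y = np.exp(np.linspace(0.0, 4.0, 9))

y_log = np.log(y)        # transform before fitting a model
y_back = np.exp(y_log)   # reverse transformation applied to predictions

print(round(skewness(y), 2), round(abs(skewness(y_log)), 2))
```

The log requires strictly positive values; for data containing zeros, `np.log1p` with `np.expm1` as its inverse is a common alternative.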
  • 31. Line plot
        • displays information as a series of data points connected by straight line segments
        • used to visualize the directional movement of one or more variables over time, i.e. time series data
        • the X axis is the datetime and the Y axis contains the measured quantity, such as monthly sales
        • types include simple, multiple and time series line plots
      Source: https://www.machinelearningplus.com/plots/matplotlib-line-plot/
  • 32. Bar plot
        • shows categorical data as rectangular bars with the height of each bar proportional to the value it represents
        • for example, data on the height of persons grouped as ‘Tall’, ‘Medium’, ‘Short’ etc.
        • used to compare values of different categories in the data
        • categorical data is a grouping of data into different logical groups
        • types include simple, horizontal, grouped and stacked
      Source: https://www.machinelearningplus.com/plots/bar-plot-in-python/
  • 33. Histogram
        • visualizes the frequency distribution of a numeric array by splitting it into small equal-sized bins
        • usually drawn on large arrays: it computes the frequency distribution of the array and plots it
        • types include basic, grouped, density curve and facets
      Source: https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/
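The binning step behind a histogram can be shown without any plotting. This sketch uses `np.histogram` on a small made-up array; a plotting call such as matplotlib's `hist` performs the same counting before drawing the bars.

```python
import numpy as np

# Small made-up sample; real histograms are usually drawn on larger arrays.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range [min, max] into 4 equal-width bins and count per bin.
# Every bin is half-open [left, right) except the last, which includes 9.
counts, edges = np.histogram(data, bins=4)
print(counts.tolist(), edges.tolist())
```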
  • 35. Winsorized mean: sort the data and replace the smallest k values with the (k+1)st smallest value. Do the same for the largest values, replacing the k largest values with the (k+1)st largest value, then take the mean. Isolation forest: a normal point (on the left of the figure) requires more partitions to be isolated than an abnormal point (on the right). Source: https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
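The Winsorized mean described above can be sketched in a few lines of NumPy. The function name `winsorized_mean` and the sample values are illustrative only.

```python
import numpy as np


def winsorized_mean(values, k):
    """Mean after clamping the k smallest and k largest values.

    The k smallest values are replaced by the (k+1)st smallest value and
    the k largest by the (k+1)st largest value.
    """
    x = np.sort(np.asarray(values, dtype=float))
    if k < 0 or 2 * k >= x.size:
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    if k > 0:
        x[:k] = x[k]           # (k+1)st smallest value
        x[-k:] = x[-k - 1]     # (k+1)st largest value
    return x.mean()


sample = [1, 5, 6, 7, 8, 9, 100]
print(winsorized_mean(sample, k=1), np.mean(sample))
```

With k = 1 the sample becomes 5, 5, 6, 7, 8, 9, 9, so the extreme values 1 and 100 no longer dominate the mean.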
  • 36. Box plot
        • visualizes how a given variable is distributed using quartiles
        • shows the minimum, maximum, median, first quartile and third quartile of the data set
        • a way to graphically show the spread of a numerical variable through quartiles
        • middle 50% of all data points: IQR = Q3 - Q1
        • the upper and lower whiskers mark 1.5 times the IQR from the top (and bottom) of the box
        • points that lie outside the whiskers, i.e. beyond 1.5 x IQR in either direction, are generally considered outliers (< Q1 - 1.5*IQR or > Q3 + 1.5*IQR)
        • types include basic, notched and violin plot
      Source: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review
      TASK
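The 1.5 x IQR rule can be applied directly with NumPy. This is a minimal sketch on made-up data; it relies on `np.percentile` with its default linear interpolation, and other quartile conventions can give slightly different fences.

```python
import numpy as np

# Made-up sample with one obviously extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers.
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q3, iqr, outliers.tolist())
```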
  • 37. Scatter plot
        • the values of two variables are plotted along two axes
        • used to visualize the relationship between two variables
        • types include basic, correlation, linear fit plot and bubble plot
      Source: https://www.machinelearningplus.com/plots/python-scatter-plot/
  • 38. Correlation heatmap
        • Correlation between the variables indicates how the variables are inter-related
        • Correlation is not causation
        1. Each cell in the grid represents the value of the correlation coefficient between two variables.
        2. It is a square and symmetric matrix.
        3. All diagonal elements are 1.
        4. The axis ticks denote the feature each of them represents.
        5. A large positive value (near 1.0) indicates a strong positive correlation.
        6. A large negative value (near -1.0) indicates a strong negative correlation.
        7. A value near 0 (positive or negative) indicates the absence of any linear correlation between the two variables. This does not by itself mean the variables are independent, because they can still be related non-linearly.
        8. Each cell in the matrix is also represented by a shade of a color. Here darker shades indicate smaller values while brighter shades correspond to larger values (near 1).
        9. This scale is given by the color bar on the right side of the plot.
  • 39. Multicollinearity
        • E.g. a person’s height and weight, age and sales price of a car, or years of education and annual income
        • Does not affect decision trees (DT); kNN is affected
        • Causes: insufficient data; dummy variables; including a variable in the regression that is actually a combination of two other variables
        • Identify: corr > 0.4 or a Variance Inflation Factor (VIF) score > 5 indicates high correlation
        • Solutions: feature selection; PCA; more data; ridge regression, which reduces the magnitude of model coefficients
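The VIF check can be sketched with NumPy alone. The sketch relies on the identity VIF_j = 1/(1 - R_j²), which equals the j-th diagonal element of the inverse of the predictors' correlation matrix. The three synthetic predictors and the random seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (assumption): x2 is nearly a copy of x1, x3 is unrelated.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the columns: square, symmetric, ones on the diagonal.
corr = np.corrcoef(X, rowvar=False)

# VIF of each predictor = diagonal of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))
print(np.round(corr, 2))
print(np.round(vif, 1))
```

Here x1 and x2 exceed the VIF > 5 threshold from the slide while x3 stays close to 1, so dropping one of the collinear pair (or using ridge regression) would be the indicated fix.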
  • 40. Tasks
        Predicted \ Actual | Cats | Dogs
        Cats               | 60   | 125
        Dogs               | 5    | 5000
      1. Explain the essential Python libraries numpy, pandas, scipy, scikit-learn and statsmodels.
      2. Find Accuracy, Precision, Recall, Kappa Score, MCC, F1 score and ROC AUC for the confusion matrix above.
      3. How is a missing value represented? What are the types of missing values and the ways of dealing with them?
      4. Discuss data transformation methods for categorical data and numerical data.
      5. Explain the Python visualization tools matplotlib, pandas, seaborn, bokeh and plotly.
      6. Discuss mechanisms for handling imbalanced data and the problems that arise if the imbalance is not handled.
      7. How can you determine which features are most important in your model? Which feature selection algorithm should be used when? State with an example.
      8. Discuss wrapper-based feature selection methods with an example diagram.
      9. Describe the various categories of filter-based feature selection methods by type of feature, with their mathematical equations.
      10. Compute the Karl Pearson and Spearman coefficients of correlation.
      11. Find Kendall’s rank correlation coefficient tau.
      12. Indicate the different types of transformations the data has to be subjected to before dimensionality reduction techniques can be applied.