Data Science
Basics
Data Types
Has a meaningful zero (ratio data). Ex: Height
No meaningful zero (interval data). Ex: Temperature
Standard deviation measures how close the values in the data set are to the mean: on average, the data points differ from the mean by roughly the standard deviation.
Statistical inference is the process of drawing conclusions about an underlying population based on a
sample or subset of the data. In most cases, it is not practical to obtain all the measurements in a given
population.
Population and Sample Point Estimators
Z-Score
A measure that indicates how many standard deviations a data point is from the mean of a dataset.
Application
Cross-Group Comparisons: Z-scores allow for the comparison of scores from different groups that may have different means, standard deviations, and distributions. For example, comparing test scores from students in different schools or different countries.
Outlier detection in data sets: observations with Z-scores that are significantly higher or lower than the typical range (usually considered to be Z-scores less than -3 or greater than 3) are often regarded as outliers.
Normalization of Data: Z-scores are used in
statistical analysis to normalize data, ensuring that
every datum has a comparable scale. This is useful
in multivariate statistics where data on different
scales are combined.
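A minimal NumPy sketch of the standardization and outlier-flagging uses described above; the scores array is made-up illustrative data.

```python
import numpy as np

# Hypothetical exam scores (illustrative values only).
scores = np.array([52, 61, 58, 70, 65, 63, 59, 95, 57, 60])

mean = scores.mean()
std = scores.std(ddof=1)           # sample standard deviation

# Z-score: how many standard deviations each value lies from the mean.
z = (scores - mean) / std
print(np.round(z, 2))

# Flag potential outliers using the |z| > 3 rule of thumb mentioned above.
print("Potential outliers:", scores[np.abs(z) > 3])
```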
The total area under the curve for any PDF is always equal to 1; it represents probability.
Confidence Interval
The degrees of freedom
A range of values such that, with X% probability, the range will contain the true unknown value of the parameter.
Z-Statistics
T-Statistics
Data Types:
o Continuous Data: Numerical data that can take on any value within a range. Examples include height, weight, and temperature.
o Discrete Data: Numerical data that can take on a limited number of values. For example, the number of students
in a class.
o Nominal Data (Categorical)
• Gender (Male, Female, Other)
• Blood type (A, B, AB, O)
• Colors (Red, Blue, Green)
o Ordinal Data (Categorical): There is an order or ranking among the categories, but the differences between the ranks are not necessarily equal.
• Education level (High School, Bachelor's, Master's, PhD) — While you can say a PhD is higher than a Master's, the difference between the
levels is not measured.
• Satisfaction rating (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied)
• Economic status (Low Income, Middle Income, High Income)
o Interval Data (Numerical): Interval data are numerical data that have meaningful differences between values, and the data have a specific order.
• Calendar years — The year 2000 is as long as 1990, and the difference between years is consistent. But year zero does not mean "no year."
o Temporal Data: Also known as time-series data; refers to a sequence of data points collected or recorded at time intervals, which can be regular or irregular.
• Has a timestamp; it is sequential and cannot be shifted; used for identifying patterns and trends.
Relationships Among Data:
1.Linear Relationship: As described earlier, a linear relationship is one where the change in one variable
is proportional to the change in another variable, resulting in a straight line when plotted on a graph.
2.Exponential Relationship: In an exponential relationship, one variable grows or decays at a rate that
depends on an exponent of another variable. This relationship often appears as a curve that rises or falls
rapidly.
3.Logarithmic Relationship: A logarithmic relationship involves one variable being the logarithm of another variable. This relationship may appear as a curve that rises or falls rapidly at first but then levels off.
4.Polynomial Relationship: Polynomial relationships involve one variable being a polynomial function of another variable. Depending on the degree of the polynomial, the relationship may exhibit different degrees of curvature.
5.Periodic Relationship: Periodic relationships involve the values of one variable repeating at regular intervals as the values of another variable change. This type of relationship is common in cyclic phenomena and periodic functions.
6.Monotonic Relationship: A monotonic relationship is one where the values of one variable consistently increase or decrease as the values of another variable increase. There are two types of monotonic relationships:
1. Strictly Monotonic: Every value of one variable corresponds to a unique value of another variable, and the relationship never reverses direction.
2. Non-strictly Monotonic: Similar to strictly monotonic relationships, but some values of one variable may correspond to the same value of another variable.
7.Nonlinear Relationship: Nonlinear relationships include any relationship that cannot be adequately represented by a straight line. This category encompasses all relationships mentioned above except for the linear relationship.
Non-Monotonic vs. Monotonic Relationships
While correlation measures strength and direction of the linear relationship, monotonicity captures any systematic
change in the relationship, whether linear or not.
Therefore, monotonicity can be present even if correlation is close to 0, indicating a weak linear relationship.
Correlation
Correlation measures the strength and direction of the linear relationship between two variables.
It ranges from -1 to 1, where:
• 1 indicates a perfect positive linear relationship,
• -1 indicates a perfect negative linear relationship,
and
• 0 indicates no linear relationship.
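A small sketch contrasting the two ideas: Pearson correlation (linear) versus Spearman rank correlation (monotonic). The data are synthetic; SciPy is assumed to be available.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Monotonic but strongly non-linear relationship: y = exp(x).
x = np.linspace(0, 10, 100)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)    # captures only the linear part of the relationship
spearman_r, _ = spearmanr(x, y)  # rank-based, captures any monotonic relationship

print(f"Pearson:  {pearson_r:.3f}")   # noticeably below 1
print(f"Spearman: {spearman_r:.3f}")  # exactly 1 for a strictly increasing function
```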
Concave, Convex Relationships:
Parametric vs Non Parametric Machine
Learning
Parametric examples: linear regression, logistic regression, linear discriminant analysis (LDA), and some neural networks (when they have a fixed number of layers and nodes).
Non-parametric examples: k-nearest neighbors (KNN), decision trees, random forests, support vector machines (SVM) with non-linear kernels, and some types of neural networks (such as deep learning).
Histogram:
A histogram is a graphical representation of the distribution of numerical data
Common data distributions:
Central tendency and variation are two measures used in statistics to summarize data. Measure of central
tendency shows where the center or middle of the data set is located, whereas measure of variation shows
the dispersion among data values
Central tendency
Dispersion
Dispersion in statistics is a way of describing how spread out a set of data is.
Reducible Error
Bias
Bias refers to the difference between the expected predictions of a model and the true values of the
target variable. A model with high bias is not complex enough to capture the underlying patterns in
the data, resulting in underfitting. This means that the model is too simple and cannot capture the
complexity of the data, leading to poor performance on both the training and test data.
Variance
Variance, on the other hand, refers to the variability of the model’s predictions for different training sets. A
model with high variance is too complex and captures noise in the training data, resulting in overfitting.
This means that the model is too complex and fits the training data too closely, leading to poor
performance on new, unseen data.
Noise
noise refers to the random variations and irrelevant information within a dataset that cannot be
attributed directly to the underlying relationships being modeled. Noise can come from various sources
and significantly impacts the quality of the predictions made by a model
Minimizing irreducible error
Machine learning
Collection and Data Exploration (EDA – Exploratory Data Analysis):
Data Cleaning:
o Handle missing values: impute or drop them based on context.
o Detect and handle duplicates.
o Identify and handle outliers.
o Standardize data formats and units.
o Resolve inconsistencies and errors.
o Validate data against predefined rules or constraints.
Feature Engineering:
o Create new features based on domain knowledge.
o Feature scaling
o Generate interaction features (e.g., product, division).
o Extract time-based features (e.g., day of week, hour of day).
o Perform dimensionality reduction (e.g., PCA, t-SNE).
o Engineer features from raw data (e.g., text, images).
o Select relevant features for modeling.
Data Transformation:
o Normalize numeric features.
o Scale features to a consistent range.
o Encode categorical variables (one-hot encoding, label encoding, etc.).
o Extract features from text or datetime data.
o Aggregate data at different levels (e.g., group by, pivot tables).
o Apply mathematical transformations (log, square root, etc.).
Modeling
o Model Selection: Choose appropriate machine learning algorithms for the task.
o Model Training: Train models using the processed and engineered features.
o Model Evaluation: Evaluate model performance using appropriate metrics and validation
techniques.
Deployment
o Model Deployment: Deploy the model to a production environment where it can make predictions on
new data.
o Monitoring and Maintenance: Continuously monitor the model's performance and update it as
necessary when new data becomes available or when model performance degrades.
Feedback Loop
o Iterative Improvement: Use feedback from the model's performance and any new data collected to
refine the feature engineering and modeling steps, continuously improving the model over time.
Business Problem Understanding
Collection and Data Exploration (EDA – Exploratory Data Analysis)
Data Collection
o Gather data from various sources such as databases, APIs, files, etc.
o Extract data using appropriate tools and techniques.
o Ensure data integrity during extraction.
Data Exploration – Univariate Analysis – Exploratory Data Analysis
o Review data documentation and metadata.
o Understand the general information of the dataset
o Types / Count / Number of unique values / Missing values
o Numerical features understanding
o Min / Max / Mean / Mode / Quartiles / Missing Values / Coefficient of Variation
o Normality and spread
o Distribution / STD / Skewness / Kurtosis
o Categorical feature understanding
o Distribution / Frequency / Relationship / Credibility / Missing values
o Outliers identification
o Correlation analysis (dependent and independent variables )
o Multicollinearity testing
Coefficient of Variation
It is a measure of relative variability and is often used to compare the variability of different datasets or
variables, especially when their means are different.
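The usual definition, added here for reference (σ is the standard deviation, μ is the mean):

\text{CV} = \frac{\sigma}{\mu} \times 100\%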
Features Scaling and Transformation
Technique used to standardize or normalize the range of independent variables or features in a dataset. The
goal of feature scaling is to bring all features to a similar scale, which can be beneficial for various machine
learning algorithms.
Important in:
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
• Principal Component Analysis (PCA)
• Linear Regression, Logistic Regression, and Regularized Regression (Ridge and Lasso)
• Neural Networks
• K-Means Clustering
Not important in:
• Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
• Tree-Based Algorithms
• Rule-Based Algorithms
• Sparse Data: if the dataset is sparse, meaning most feature values are zero or close to zero, feature scaling may not be necessary.
• Non-Numerical Features: categorical variables represented as one-hot encoded vectors, ordinal variables, or binary features typically do not require feature scaling.
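A minimal scikit-learn sketch of the two most common scaling techniques for the scale-sensitive algorithms listed above; the small age/income matrix is made-up illustrative data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age (years) and income (dollars).
X = np.array([[25,  40_000],
              [32,  85_000],
              [47, 120_000],
              [51,  62_000]], dtype=float)

# Standardization: zero mean and unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: squeeze each feature into the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.round(2))
print(X_minmax.round(2))
```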
Numerical features:
Identify the distribution of each continuous variable. Most of the time it will align with one of the known distributions, as follows.
Based on the ML type, we need to transform the feature to the appropriate distribution for better performance of the model.
Ex: transforming a skewed distribution to a normal distribution using transformation techniques.
Log Transformation for X or Y
Label encoding: This works if the categorical variable has only two categories.
Categorical features transformation:
One-Hot Encoding:
First check the frequency of each category and identify the most frequently used values; the remaining values can be grouped as an "Other" type.
Each category value is converted into a new column and assigned a binary (1/0) value; such a column is called a dummy variable.
Disadvantages:
Dimensionality increase
Sparse matrix: most values are zero
Loss of Information: ordinality (order) is lost.
Dummy Variable Encoding:
Dummy encoding uses N-1 features to represent N labels/categories.
The Dummy Variable Trap occurs when different input variables perfectly predict each other – leading to multicollinearity.
Multicollinearity is a scenario in which two or more input variables are highly correlated with each other. We attempt to avoid this scenario: although it won't necessarily affect the overall predictive accuracy of the model, it makes the individual coefficient estimates unstable and hard to interpret.
To avoid this issue we drop one of the newly created columns produced by one-hot encoding
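A short pandas sketch of one-hot versus dummy encoding; the color column is a made-up example, and drop_first=True is what drops one of the newly created columns to avoid the trap.

```python
import pandas as pd

# Made-up categorical column for illustration.
df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue", "Red"]})

# One-hot encoding: N binary columns for N categories.
one_hot = pd.get_dummies(df, columns=["color"])

# Dummy encoding: drop_first=True keeps N-1 columns and avoids the dummy variable trap.
dummies = pd.get_dummies(df, columns=["color"], drop_first=True)

print(one_hot)
print(dummies)
```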
Frequency Encoding or Count Encoding:
Encodes categorical features based on the frequency of each category in the dataset.
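A minimal pandas sketch of frequency encoding; the city column is made-up illustrative data.

```python
import pandas as pd

# Made-up high-cardinality categorical feature.
df = pd.DataFrame({"city": ["Colombo", "Kandy", "Colombo", "Galle", "Colombo", "Kandy"]})

# Replace each category with its relative frequency in the dataset.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

print(df)
```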
Features Engineering (Dimensionality Reduction)
To reduce the number of features (dimensions) in a dataset while preserving the most important information.
Feature Selection Main Technique
Feature selection is the process of identifying and selecting a subset of relevant features (variables, predictors)
from the original dataset, which are most useful for building a predictive model.
This process helps to improve the model's performance by removing redundant, irrelevant, or noisy data, leading
to better generalization, reduced overfitting, and often shorter training times.
Feature selection methods are categorized into three types: filter methods, wrapper methods, and embedded methods.
Filter Methods
These methods typically use statistical techniques to assess the relationship between each feature and the target
variable.
Correlation Coefficient: Measures the linear relationship between each feature and the target. Features with high
correlation with the target and low correlation with other features are preferred.
Useful for both regression and classification tasks, especially for linear relationships.
Chi-Square Test: Evaluates the independence between categorical features and the target variable.
Typically used for classification tasks with categorical features.
ANOVA (Analysis of Variance): Used to assess the significance of features in relation to the target, especially for
categorical features with continuous target variables.
Useful when dealing with categorical features and continuous targets, primarily for classification tasks.
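A minimal filter-method sketch using scikit-learn's SelectKBest with the ANOVA F-test (f_classif), one of the statistical criteria listed above, applied to the built-in iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the class label with the ANOVA F-test; keep the 2 best.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("F-scores:", selector.scores_.round(1))
print("Kept feature indices:", selector.get_support(indices=True))
```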
https://medium.com/analytics-vidhya/feature-selection-extended-overview-b58f1d524c1c
Wrapper Methods
Wrapper methods evaluate feature subsets by training and evaluating a machine learning model. They search for
the best subset of features by considering the interaction between them and their combined impact on model
performance.
Forward Selection: Starts with an empty set of features and iteratively adds the feature that improves model
performance the most.
Backward Elimination: Starts with all features and iteratively removes the least significant feature.
These can be used with any type of machine learning model (e.g., linear regression, decision trees, SVMs) and are
applicable to both regression and classification tasks. However, they can be computationally expensive for models
with a large number of features.
Recursive Feature Elimination (RFE): Trains a model and removes the least important feature(s) based on model
weights, recursively until the desired number of features is reached.
Mutual Information: Measures the amount of information obtained about one variable through another variable,
capturing non-linear relationships.
Applicable to both regression and classification tasks, capturing non-linear relationships.
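A minimal wrapper-method sketch using scikit-learn's RFE with a logistic regression estimator on the built-in breast cancer dataset; the choice of estimator and number of features is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursive Feature Elimination: repeatedly fit the model and drop the weakest feature.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = kept):", rfe.ranking_)
```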
Features in the model can be selected using the following evaluation methods.
An interaction term is a product of two or more predictors.
It provides a more accurate model when such interactions are present in the data.
Note: Goodness of fit refers to how well a statistical model describes the observed data.
What if we do the model selection only on the p-values of the predictors?
Calculate the p-value.
Now Age is not significant.
When we consider all the evaluation criteria, it is easier to decide on the better model.
Forward stepwise selection
1.Start with No Predictors:
1. Begin with the simplest model, which includes no predictors (just the intercept).
2.Add Predictors One by One:
1. At each step, evaluate all predictors that are not already in the model.
2. For each predictor not in the model, fit a model that includes all the predictors currently in the
model plus this new predictor.
3. Calculate a criterion for model performance, such as Residual Sum of Squares (RSS), Akaike
Information Criterion (AIC), or Bayesian Information Criterion (BIC), for each of these models.
3.Select the Best Predictor:
1. Identify the predictor whose inclusion in the model results in the best performance according to the
chosen criterion (e.g., the predictor that reduces the RSS the most, or has the lowest AIC/BIC).
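A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector. Note that it greedily adds features based on a cross-validated score rather than RSS/AIC/BIC, so it approximates the procedure above rather than implementing it exactly.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start with no predictors and greedily add the one that improves the
# cross-validated score the most, stopping once 4 features have been chosen.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=4,
                                direction="forward",
                                cv=5)
sfs.fit(X, y)

print("Chosen feature indices:", sfs.get_support(indices=True))
```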
Embedded Methods
These methods are specific to particular learning algorithms and incorporate feature selection as a part of the
model building phase.
Lasso Regression (L1 Regularization): Penalizes the absolute size of the regression coefficients, effectively
shrinking some coefficients to zero, thus performing feature selection.
Primarily used in linear regression and logistic regression for feature selection.
Ridge Regression (L2 Regularization): Penalizes the square of the coefficient magnitudes but does not perform
feature selection by shrinking coefficients exactly to zero.
Used in linear models but does not perform feature selection (included here for comparison).
Elastic Net: Combines both L1 and L2 regularization to balance between feature selection and regularization.
Combines L1 and L2 regularization, used in linear and logistic regression.
Tree-based methods (e.g., Random Forest, Gradient Boosting): Use feature importance scores derived from the tree-
building process to select the most important features.
Applicable to both regression and classification tasks. These methods provide feature importance scores, which
can be used for feature selection.
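A minimal embedded-method sketch: fit a Lasso and keep the features whose coefficients were not shrunk to zero, using scikit-learn's SelectFromModel; the alpha value is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # Lasso is scale-sensitive

# The L1 penalty shrinks some coefficients exactly to zero;
# SelectFromModel keeps the features whose coefficients survive.
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X, y)

print("Lasso coefficients:", np.round(selector.estimator_.coef_, 1))
print("Kept feature indices:", selector.get_support(indices=True))
```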
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is indeed a powerful technique for dimensionality reduction and can be applied to many types of machine learning tasks. The primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data.
PCA vs. Feature Selection
PCA: Aims to reduce dimensionality by transforming the original features into a new set of orthogonal
features (principal components) that capture the maximum variance in the data. It creates new features
rather than selecting a subset of existing ones.
Produces new composite features (principal components) that are linear combinations of the original
features. These components are not directly interpretable in terms of the original features.
The new features (principal components) are abstract and not easily interpretable. This can be a
disadvantage when model interpretability is crucial.
Feature Selection: Seeks to identify and retain the most relevant and informative subset of the original
features, improving model interpretability and performance by eliminating irrelevant, redundant, or noisy
features.
Retains a subset of the original features, making the model easier to interpret and understand, as it directly
works with the original features.
Selected features are part of the original feature set, maintaining their interpretability and relevance to the problem.
Use Case:
PCA: Often used when the goal is to reduce dimensionality for visualization, to combat the curse of
dimensionality, or to preprocess data for other machine learning algorithms that may struggle with high-
dimensional data.
Feature Selection: Used when the goal is to improve model performance and interpretability by focusing on
the most relevant features.
PCA is widely used in clustering due to several key advantages it offers:
PCA reduces the number of features while preserving as much variability as possible. This simplification
helps clustering algorithms (like K-means or hierarchical clustering) perform better by reducing noise and
focusing on the most informative components.
By reducing dimensions, PCA enhances the performance and speed of clustering algorithms, making it
easier to identify distinct clusters.
How PCA Works
The primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data.
The number of principal components created for a given dataset is equal to the number of features in the
original dataset. However, not all principal components capture the same amount of variance in the data.
Typically, only a subset of the principal components is retained for dimensionality reduction, usually those
corresponding to the largest eigenvalues.
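A minimal scikit-learn PCA sketch on the built-in iris data, keeping only the two components with the largest eigenvalues and inspecting how much variance they capture.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # PCA is variance-based, so scale first

pca = PCA(n_components=2)               # keep the 2 largest-variance components
X_reduced = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Reduced shape:", X_reduced.shape)   # (150, 2) instead of (150, 4)
```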
Regression
Model Engineering
Model Fitting and Model
Evaluation
Regularization
L1 Regularization
L2 Regularization
Unknown coefficients β0 and β1 in linear regression define the population regression line. We seek to estimate these unknown coefficients using sample data.
Sample Mean –
Population Mean -
For linear regression, the 95% confidence interval for β1 approximately takes the form β̂1 ± 2·SE(β̂1); there is approximately a 95% chance that the interval will contain the true value of β1.
Model Fitting Technique
The goal of these techniques is to find the best parameters that allow the model to predict or classify new data accurately.
KNN Regression
https://medium.com/analytics-vidhya/k-neighbors-regression-analysis-in-python-61532d56d8e4
Low K (e.g., K=1):
Bias: With a low K value, the model tends to have lower bias because it captures more detailed patterns in
the training data. Each prediction is influenced by only a single data point, leading to more complex decision
boundaries.
Variance: However, with low K, the model tends to have higher variance because it is more sensitive to noise
in the training data. The predictions can be highly influenced by the specific training instances, leading to
overfitting.
High K (e.g., K=N, where N is the number of training instances):
Bias: With a high K value, the model tends to have higher bias because it averages over more data points,
potentially leading to oversimplified decision boundaries. It might miss subtle patterns in the data.
Variance: On the other hand, with high K, the model tends to have lower variance because it smooths out the
predictions by averaging over a larger number of neighbors. This can reduce the impact of individual noisy data
points, leading to more stable predictions.
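A small sketch of the bias-variance effect of K using scikit-learn's KNeighborsRegressor on the built-in diabetes dataset; the specific K values are arbitrary.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 50):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    # k=1: near-zero training error but high test error (low bias, high variance);
    # large k: smoother predictions, higher bias, lower variance.
    print(f"k={k:>2}  train MSE={mean_squared_error(y_tr, model.predict(X_tr)):8.1f}"
          f"  test MSE={mean_squared_error(y_te, model.predict(X_te)):8.1f}")
```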
Ordinary Least Squares (OLS)
– Model Fitting
Residual Sum of squares (RSS)
OLS
Regularization
Regularization is a technique used in machine learning and statistical modeling to prevent overfitting and improve
the generalization performance of models. Overfitting occurs when a model learns the training data too well,
capturing noise or random fluctuations in the data, which leads to poor performance on unseen data.
Regularization
Gradient Descent
Cost Functions
Learning Rate
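A minimal NumPy sketch of batch gradient descent fitting a simple linear regression by minimizing the MSE cost; the data are synthetic and the learning rate and iteration count are arbitrary.

```python
import numpy as np

# Synthetic data roughly following y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)

b0, b1 = 0.0, 0.0      # initial parameters
lr = 0.01              # learning rate
n = len(x)

for _ in range(2000):
    y_hat = b0 + b1 * x
    # Gradients of the MSE cost with respect to b0 and b1.
    grad_b0 = (-2 / n) * np.sum(y - y_hat)
    grad_b1 = (-2 / n) * np.sum((y - y_hat) * x)
    b0 -= lr * grad_b0  # step in the direction opposite to the gradient
    b1 -= lr * grad_b1

print(f"Estimated intercept {b0:.2f}, slope {b1:.2f}")   # should land near 2 and 3
```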
Validation Set Approach
Cross validation techniques
•Resubstitution
•Hold-out
•K-fold cross-validation
•LOOCV
•Random subsampling
•Bootstrapping
Validation techniques in machine learning are used to get the error rate of the ML model, which can be
considered as close to the true error rate of the population
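A minimal K-fold cross-validation sketch with scikit-learn, estimating the test MSE of a linear regression on the built-in diabetes data.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold CV: each fold is held out once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")

print("Per-fold MSE:", (-scores).round(1))
print("Estimated test MSE:", round((-scores).mean(), 1))
```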
Ensemble Technique
Combining multiple models to improve the predictive performance over any single
model.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
B bootstrapped training sets
In Regression
OR
Majority Vote In Classification
Another approach for improving the predictions resulting from a decision tree.
Trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
The number of trees B: unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.
To find the best split, this always considers one feature at a time and iterates through all the features – this uses the Gini index.
Stump: a tree with a single split (one root node and two leaf nodes).
Generate a random number between 0 and 1 and pick a record from the corresponding bin to create the second sample list, then repeat the same process.
KNN
Regression
SVM
Classification
Logistic Regression
Logistic regression is a type of statistical model used for binary classification tasks. It
predicts the probability of a binary outcome (i.e., an event with two possible values,
such as 0 and 1, true and false, yes and no).
Probability Output:
Unlike linear regression, logistic regression provides probabilities for class
membership, which can be useful for decision-making processes.
The core of logistic regression is the logistic function (also called the sigmoid function), which maps any
real-valued number into the range (0, 1):
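For reference, the logistic (sigmoid) function is

\sigma(z) = \frac{1}{1 + e^{-z}},

and with z = \beta_0 + \beta_1 X the predicted probability becomes

p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.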
In statistics and probability theory, odds represent the ratio of the probability of success to the probability
of failure in a given event. The odds of an event can be expressed in different ways: as odds in favor,
odds against, or simply as odds.
Odds
Log-odds
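In terms of the predicted probability p, these quantities are

\text{odds} = \frac{p}{1 - p}, \qquad \text{log-odds} = \log\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X,

so logistic regression models the log-odds as a linear function of the predictors.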
Likelihood Calculation
Maximum Likelihood Calculation
A Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes' theorem, with a "naive" assumption of conditional independence between features.
Decision Tree
Non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It
has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes
Decision Tree Regression
Decision tree regression is a type of supervised learning algorithm used in machine learning, primarily for
regression tasks. In decision tree regression, the algorithm builds a tree-like structure by recursively splitting
the data into subsets based on the features that best separate the target variable (continuous in regression)
into homogeneous groups.
An impurity measure, also known as a splitting criterion or splitting rule, is a metric used in decision tree
algorithms to evaluate the homogeneity of a set of data points with respect to the target variable
The impurity measure serves as a criterion for selecting the best feature and split point at each node of the
tree. The goal is to find the feature and split point that result in the most homogeneous child nodes, leading
to better predictions and a more accurate decision tree model.
1.Leaf Node Prediction: Once a leaf node is reached, the prediction is made based on
the majority class (for classification) or the mean (for regression) of the target variable in
that leaf node. This prediction becomes the output of the decision tree model for the given
instance.
Mean squared error (MSE) is used as the impurity measure in decision tree regression.
By minimizing the MSE at each split, decision tree regression effectively partitions the feature space into
regions that are more homogeneous with respect to the target variable, leading to accurate predictions
for unseen data points.
Xo Will be selected
three-region partition
Random Forest
Bagging vs Random Forest
1.Feature Selection:
1. Bagging: Uses all features available for each split in the decision trees.
2. Random Forest: Randomly selects a subset of features for each split in the decision trees, which introduces additional randomness and reduces the correlation between the trees.
2.Bias-Variance Tradeoff:
1. Bagging: Bagging will not lead to a substantial reduction in variance over a single tree in this setting, but by averaging multiple models it reduces the variance. However, it does not inherently reduce correlation between the models.
2. Random Forest: Reduces both variance and correlation between models by introducing randomness in feature selection, leading to lower overall variance and improved model performance.
Random forests overcome the model correlation problem by forcing each split to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.
3.Performance:
1. Bagging: Can be applied to any base model and improves performance by reducing overfitting through model averaging.
2. Random Forest: Specifically designed for decision trees; typically performs better than bagging with decision trees due to the reduced correlation between trees.
The k-nearest neighbors algorithm (k-NN) is a non-parametric, lazy learning method used for classification
and regression. The output based on the majority vote (for classification) or mean (or median, for regression) of
the k-nearest neighbors in the feature space.
SVM
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and
regression tasks. It's particularly effective for binary classification problems, where the goal is to classify data
points into one of two categories.
The separating hyperplane depends on the dimension of the feature space:
In one dimension it is a point.
In two dimensions it is a line.
In three dimensions it is a plane (surface).
Regression
Model Evaluation
Residual Sum of Squares (RSS) = Sum of Squared Errors (SSE)
Total Sum of Squares (TSS) = SST
Mean Squared Error (MSE)
MSE measures the average squared error, with higher values indicating
more significant discrepancies between predicted and actual values.
MSE penalizes more significant errors due to squaring, making it
sensitive to outliers. It is commonly used due to its mathematical
properties but may be less interpretable than other metrics.
It is widely used in optimization and model training because it is
differentiable, which is important for gradient-based methods.
Importance:
•MSE penalizes larger errors more than smaller ones due to squaring the errors.
•It is widely used in optimization and model training because it is differentiable, which is important for
gradient-based methods.
What It Tells About the Model:
•A lower MSE indicates a model with fewer large errors.
•It provides a sense of the average error squared, which can emphasize the impact of larger errors.
The common shape of the Mean Squared Error
(MSE) graph, when plotted as a function of the model
parameters, is typically a convex curve.
Mean Absolute Error (MAE)
MAE measures the average magnitude of errors in a set of predictions, without considering their
direction. It’s the average over the test sample of the absolute differences between prediction and
actual observation where all individual differences have equal weight.
Importance:
•MAE is a straightforward measure of error magnitude.
•It is less sensitive to outliers compared to MSE and RMSE because it doesn’t square the errors.
What It Tells About the Model:
•A lower MAE indicates a model that makes smaller errors on average.
•Since it uses absolute differences, it provides a clear indication of the typical size of the errors in the same
units as the target variable.
Root Mean Squared Error (RMSE)
RMSE is the square root of the average of squared differences between prediction and actual observation. It
represents the standard deviation of the prediction errors.
Importance:
•RMSE is the square root of MSE, bringing the metric back to the same units as the target variable.
•It is more sensitive to outliers than MAE due to the squaring of errors before averaging.
What It Tells About the Model:
•A lower RMSE indicates better fit, similar to MSE but more interpretable in the context of the target
variable's scale.
•It provides an idea of how large the errors are in absolute terms.
Why RMSE is Considered as Standard Deviation of Prediction Errors
•If we assume that the prediction errors (residuals) are normally distributed with a mean of zero, then the RMSE
provides an estimate of the standard deviation of this normal distribution.
•This is because, under the normal distribution, the standard deviation is a measure of the average distance of
the data points from the mean, which in this case is zero.
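A short sketch computing the three error metrics above with scikit-learn and NumPy; the actual and predicted values are made-up illustrative numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual vs. predicted values.
y_true = np.array([200, 150, 320, 275, 410])
y_pred = np.array([210, 140, 300, 290, 380])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)   # back in the same units as the target

print(f"MSE:  {mse:.1f}")
print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
```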
Residual Standard Error:
The Residual Standard Error (RSE) is a measure used in regression analysis to quantify the typical size of
the residuals (prediction errors) from a linear regression model. It provides an estimate of the standard
deviation of the residuals, which helps in understanding how well the model fits the data.
RSE is in the same units as the dependent variable, making it straightforward to interpret.
Adjustment for Predictors: Unlike simple measures like RMSE, RSE accounts for the number of predictors in the model. This adjustment (using n − p − 1 in the denominator) helps prevent overfitting by penalizing models with more predictors.
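For reference, the standard formula (n observations, p predictors):

\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p - 1}}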
Model Comparison:
1. Comparison Tool: RSE allows for the comparison of different models. When comparing models
with the same dependent variable, a lower RSE indicates a better fit.
2. Relative Measure: While RSE itself doesn't provide an absolute goodness-of-fit measure, it is
useful when comparing models to determine which one better explains the variability in the data.
Large RSE values may indicate a poor fit, suggesting that the model is not capturing all the relevant
information in the data.
Model Assessment: RSE helps assess the accuracy of a regression model. A lower RSE value indicates
a model that better captures the data's variability.
Predictive Accuracy: RSE provides insights into the model’s predictive accuracy, indicating how close the
predicted values are to the actual values on average.
Identification of Outliers or Influential Points: Large residuals can indicate outliers or influential points
that may unduly affect the model's performance. By examining these cases closely, researchers can
decide whether to include, exclude, or transform them to improve model fit.
Detection of Heteroscedasticity: Heteroscedasticity occurs when the variability of the residuals is not
constant across all levels of the predictor variables. RSE can help identify this issue, prompting
researchers to explore transformations or alternative modeling techniques to address it.
Residual plot
A residual is a measure of how far away a point is vertically from the regression line. Simply, it is the error
between a predicted value and the observed actual value.
A typical residual plot has the residual values on the Y-axis and the
independent variable on the x-axis
"Heterogeneity in residuals" refers to the situation where the variability of the residuals is not consistent across
all levels of the predictor variables. In other words, the spread or dispersion of residuals varies systematically
with the values of one or more predictor variables.
characteristics of a good residual plot:
Identifying whether the Error is high or low
Scale of the Target Variable: If the target variable has a large range (e.g., house prices ranging from
$100,000 to $1,000,000), an RMSE of $10,000 might be considered low. Conversely, for smaller ranges,
such as predicting daily temperature, an RMSE of 10 degrees might be high.
•Industry Standards: Different fields have established benchmarks for acceptable error rates. For
instance, in some financial models, an RMSE of a few dollars might be acceptable, while in other
domains, such as temperature prediction, an RMSE of a few degrees could be too high.
•Historical Data: Compare the error values to those of previous models or known standards within the
same domain. This helps in understanding the expected range of errors.
•Impact of Errors: Consider the practical implications of the error. For instance, in medical diagnostics,
even small errors can be critical, whereas, in movie recommendation systems, higher errors might be
more tolerable.
•Business Goals: Align the acceptable error rates with business goals and requirements. Sometimes, a
slightly higher error might be acceptable if it results in significant cost savings or other benefits.
Residual Analysis
Coefficient Analysis
H0 : There is no relationship between X and Y
Mathematically, this corresponds to testing H0 : β1 = 0
If β1 = 0, the model reduces to Y = β0 + ε, and X is not associated with Y. To test the null hypothesis, we need to determine whether β̂1, our estimate for β1, is sufficiently far from zero that we can be confident that β1 is non-zero.
How far is far enough?
These coefficients represent the estimated change in the dependent variable (response variable)
for a one-unit change in the corresponding predictor variable, holding all other variables constant.
For example, if the estimate for a predictor variable X1 is 0.5, it means that, on average, for each one-unit
increase in X1, the dependent variable is estimated to increase by 0.5 units, assuming all other variables in
the model remain constant.
Coefficient Magnitude: Look at the magnitude of the coefficients. Larger coefficients imply a stronger
relationship between the predictor variable and the response variable.
For example, a coefficient of 2 means that a one-unit increase in the predictor variable is associated with a two-unit increase in the response variable.
Coefficient Direction: Determine the direction of the relationship between the predictor variable and the
response variable. A positive coefficient indicates a positive relationship, meaning that as the predictor
variable increases, the response variable also tends to increase. Conversely, a negative coefficient suggests
a negative relationship, where an increase in the predictor variable is associated with a decrease in the
response variable.
Confounding Variables: Be aware of confounding variables or multicollinearity issues. If coefficients change
substantially when adding or removing variables from the model, it could indicate that the variables are
correlated with each other, leading to potential issues in interpretation.
Standard Error
Understanding the standard error helps in assessing the stability and robustness of the model's parameter
estimates
The standard error provides an estimate
of how much we would expect the
coefficient estimates to vary from the true
population parameters across different
samples of the same size from the
population
T Value
also known as the t-statistic, is calculated as the ratio of the coefficient estimate to its standard error in
regression analysis.
the t-value represents the standardized deviation of the coefficient estimate from zero, expressed in terms of
standard errors
Why is it important?
Significance Testing: The t-value is used to conduct hypothesis tests on the coefficients, i.e., whether the corresponding predictor variable has a statistically significant effect on the response variable. This is essential for understanding which predictors are truly influential in the model.
Higher t-values indicate stronger evidence against the null hypothesis (that the coefficient is zero), suggesting that
the corresponding predictor is more likely to be important in explaining the variation in the response variable
Comparing t-values across different coefficients allows researchers to assess the relative importance of different
predictors in the model
Lower t-values across all coefficients may indicate that the model is not capturing important relationships between
the predictors and the response variable.
P Value
The p-value is a probability that measures the strength of evidence against the null hypothesis in statistical hypothesis testing.
If the p-value is less than the significance level, the coefficient is considered statistically significant
When interpreting p-values, it's essential to consider the chosen significance level (e.g., 0.05) and whether multiple
comparisons are being made (which may require adjustment of the significance level).
•A low p-value indicates strong evidence against the null hypothesis, suggesting that the coefficient estimate is
statistically significant.
•A high p-value suggests weak evidence against the null hypothesis, indicating that the coefficient estimate is not
statistically significant.
R-Squared (R²)
Importance:
•Indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
•It is a normalized metric, meaning it ranges from 0 to 1 (or can be negative if the model is worse than a
horizontal line).
What It Tells About the Model:
•A higher R² (closer to 1) means a better fit.
•It shows how well the independent variables explain the variance in the dependent variable. However, it doesn't
provide information on the size of the errors.
Linear Regression
Classification
Under Fitting and Over-Fitting
Residual Plot
Bias vs Variance trade off
Bias vs Variance trade off
Training Data Testing Data
https://www.youtube.com/watch?v=BGlEv2CTfeg
Multicollinearity Testing
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, which can lead to unstable estimates of the regression coefficients and inflated standard errors, for the following reasons:
1.Unreliable Estimates of Regression Coefficients: When predictor variables are highly correlated
with each other, it becomes difficult for the regression model to determine the individual effect of each
predictor on the outcome variable. As a result, the estimated regression coefficients may be unstable or
have high standard errors.
2.Uninterpretable Coefficients: In the presence of multicollinearity, the coefficients of the regression
model may have counterintuitive signs or magnitudes, making their interpretation challenging or
misleading.
3.Inflated Standard Errors: Multicollinearity inflates the standard errors of the regression coefficients,
which can lead to wider confidence intervals and less precise estimates of the coefficients' true values.
4.Reduced Statistical Power: High multicollinearity reduces the statistical power of the regression
model, making it less likely to detect significant relationships between predictor variables and the
outcome variable, even if those relationships truly exist.
The Variance Inflation Factor (VIF) is a measure used to quantify the severity of multicollinearity in regression
analysis
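A minimal sketch of computing VIFs with statsmodels; the predictors are synthetic, with x3 deliberately built as a near-copy of x1 so that both show a large VIF.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic predictors; x3 is almost a copy of x1 to create multicollinearity.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF per predictor; a common rule of thumb flags values above 5-10.
for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.1f}")
```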
Log Loss
Log loss, also known as logistic loss or cross-entropy loss, is a performance metric for classification models,
particularly those that output probabilities for each class
Log loss quantifies the difference between the predicted probabilities and the actual class labels. For a
classification problem, it is defined as:
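For the binary case the usual definition is

\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\Big],

where y_i is the actual label (0 or 1) and p_i is the predicted probability of class 1; scikit-learn's log_loss computes this directly.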
Interpretation
•Lower Log Loss: Indicates that the predicted probabilities are close to the actual class labels,
suggesting a better model.
•Higher Log Loss: Indicates that the predicted probabilities are far from the actual class labels,
suggesting a poorer model.
What Log Loss Tells About the Model
1.Probability Calibration: Log loss evaluates how well the predicted probabilities are calibrated with
respect to the true outcomes. It penalizes both overconfident wrong predictions and underconfident
correct predictions.
1.Model Performance: It provides a nuanced measure of model performance, beyond just accuracy.
While accuracy measures the fraction of correct predictions, log loss considers the confidence of those
predictions.
1.Handling Class Imbalance: Log loss can handle imbalanced classes better than accuracy because
it takes the predicted probabilities into account, rather than just the final classification.
Confusion Matrix
Evaluation of the performance of a classification model is based on the counts of test records correctly
and incorrectly predicted by the model. The confusion matrix provides a more insightful picture which is
not only the performance of a predictive model, but also which classes are being predicted correctly and
incorrectly, and what type of errors are being made. The matrix can be represented as
Precision and Recall should be calculated for each class
Precision is based on the prediction
Recall based on the ground truth
Accuracy gives an overall understanding of how well the model is performing, but it can be misleading if
classes are imbalanced.
Precision
Always based on the prediction
is important when the cost of false positives is high. It helps assess the quality of positive predictions.
Recall (Sensitivity)
Always based on the ground truth
Is crucial when capturing all actual positives is essential. It measures the model's ability to identify positive
instances.
•F1 Score provides a balance between precision and recall, especially when there's an uneven class
distribution. It's a better measure of a model's performance when there's a trade-off between false positives
and false negatives.
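In terms of the confusion-matrix counts (TP, FP, FN, TN), the standard definitions are

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}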
Classification report
Overall Metrics:
•Accuracy: 0.60
• This means that 60% of the total predictions were correct.
•Macro Average:
• computing the metric independently for each class and then taking the average of these metrics. It treats all
classes equally, without considering the class distribution
These macro average metrics provide an overall measure of model performance that treats all classes equally,
regardless of their frequency in the dataset.
Weighted average
performance metrics for each class are weighted by the number of instances in that class, giving more
importance to classes with more instances
The weighted average provides a more realistic measure of overall model performance by giving more importance
to the classes with more instances. This is particularly useful in datasets with imbalanced class distributions, as it
ensures that the performance metrics reflect the model's ability to correctly classify the more prevalent classes.
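A minimal sketch of producing the per-class metrics plus the macro and weighted averages discussed above with scikit-learn; the labels and predictions are made-up illustrative data.

```python
from sklearn.metrics import classification_report

# Made-up ground truth and predictions for a three-class problem.
y_true = ["cat", "dog", "dog", "bird", "cat", "dog", "bird", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "cat", "bird", "cat", "dog", "dog",  "cat", "dog", "bird"]

# Per-class precision, recall, F1, plus macro and weighted averages.
print(classification_report(y_true, y_pred))
```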
How to use these results for model improvements
Weighted averages might be significantly higher than macro averages,
• indicating that the model performs well on frequent classes but poorly on rare ones.
• Oversampling Minority Classes: Use techniques like SMOTE (Synthetic Minority Over-sampling Technique)
to generate more samples for underrepresented classes.
• Under sampling Majority Classes: Reduce the number of samples in the overrepresented classes to
balance the class distribution.
• Class Weights: Modify the loss function to give higher weights to minority classes during training,
encouraging the model to focus more on these classes.
Low precision, recall, and F1 scores for specific classes
•Class-Specific Data Augmentation: Create additional synthetic data or collect more real data for the poorly
performing classes.
•Feature Engineering: Develop new features that may be more informative for the difficult classes.
•Class-Specific Models: Train separate models for each class or use ensemble methods that can better handle
class-specific peculiarities.
High performance on training data but low performance on certain test classes.
•Regularization: Apply techniques like L1/L2 regularization to prevent overfitting.
•Pruning Decision Trees: If using decision trees or random forests, prune the trees to reduce complexity and
prevent overfitting.
•Cross-Validation: Use cross-validation to ensure that the model generalizes well across different subsets of the
data.
Consistent low recall or precision across multiple classes in both macro and weighted averages.
•Hyperparameter Tuning: Use grid search or random search to find the optimal hyperparameters for your
model.
•Ensemble Methods: Combine multiple models to leverage their strengths and mitigate individual weaknesses.
Methods like bagging, boosting, and stacking can improve overall performance.
•Regular Updates: Regularly update the model with new data to ensure it captures the most recent patterns
and trends.
If current improvements are insufficient, it might be indicative of the need for a different model architecture.
•Algorithm Choice: Experiment with different algorithms (e.g., switching from a decision tree to a gradient
boosting machine or neural network) to find one that better captures the data patterns.
•Neural Network Layers: For deep learning models, adjust the number and size of the layers.
Practical Steps:
1.Evaluate Metrics:
1. Carefully analyze the precision, recall, and F1-score for each class.
2. Compare macro and weighted averages to understand overall versus individual class performance.
2.Diagnose Issues:
1. Identify which classes are underperforming and why (e.g., lack of data, inherent difficulty).
3.Implement Improvements:
1. Choose and apply the appropriate techniques from the actions listed above based on your diagnosis.
2. Regularly monitor the impact of these changes on your model's performance metrics.
4.Iterate and Optimize:
1. Continuously iterate on the model, using new data and feedback to further refine performance.
2. Use tools like learning curves to understand the impact of more data or different algorithms.
Logistic regression model evaluation
deviance residuals in a logistic regression table provide detailed information about the fit of the model to
individual data points and help identify potential outliers or issues with the model.
high max value compared to the other values might suggest that there are outliers or poorly fitted
observations in the data.
The deviance is a measure of the difference between a fitted model and the perfect model (also called
the saturated model). The deviance for a logistic regression model can be divided into two parts:
1.Null Deviance: This is the deviance of a model with no predictors, only an intercept. It serves as a
baseline to compare with the fitted model.
2.Residual Deviance: This is the deviance of the fitted model with the predictors included.
Ridge
Lasso
There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.
What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which f̂ would change if we estimated it using a different training data set.
Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂. But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance.
population mean µ of a random variable Y
How far off will that single estimate µ̂ be?
standard error of µ̂
residual standard error
Standard errors can be used to compute confidence intervals
https://www.youtube.com/watch?v=7WPfuHLCn_k&t=427s
https://www.youtube.com/watch?v=-H5tcISshKg

DataScienceConcept_Kanchana_Weerasinghe.pptx

  • 1.
  • 3.
  • 4.
    Data Types Has aMeaningful Zero Ex: Height No Meaningful Zero Ex: Temperature
  • 7.
    Standard deviation ishow close the values in the data set are to the mean on average, the data points differ from the mean by
  • 9.
    Statistical inference isthe process of drawing conclusions about an underlying population based on a sample or subset of the data. In most cases, it is not practical to obtain all the measurements in a given population. Population and Sample Point Estimators
  • 11.
    Z-Score measure that indicateshow many standard deviations a data point is from the mean of a dataset Application Cross-Group Comparisons: Z-scores allow for the comparison of scores from different groups that may have different means , standard deviations and distributions. For example, comparing test scores from students in different schools or different countries. Outliers detection in data sets. Observations with Z- scores that are significantly higher or lower than the typical range (usually considered to be Z-scores less than -3 or greater than 3) are often regarded as outliers Normalization of Data: Z-scores are used in statistical analysis to normalize data, ensuring that every datum has a comparable scale. This is useful in multivariate statistics where data on different scales are combined.
  • 12.
    The total areaunder the curve for any pdf is always equal to 1 it shows the probability
  • 13.
    Confident Interval The degreesof freedom Range of values such that with X % probability, the range will contain the true unknown value of the parameter.
  • 14.
  • 15.
  • 17.
    Data Types: o ContinuousData: Numerical data that can take on any value within a range. Examples o Discrete Data: Numerical data that can take on a limited number of values. For example, the number of students in a class. o Nominal Data (Categorical) • Gender (Male, Female, Other) • Blood type (A, B, AB, O) • Colors (Red, Blue, Green) o Ordinal Data (Categorical): Order or ranking among them, but the differences between the ranks are not necessarily equal • Education level (High School, Bachelor's, Master's, PhD) — While you can say a PhD is higher than a Master's, the difference between the levels is not measured. • Satisfaction rating (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied) • Economic status (Low Income, Middle Income, High Income) o Interval Data (Numerical):Interval data are numerical data that have meaningful differences between values, and the data have a specific order • Calendar years — The year 2000 is as long as 1990, and the difference between years is consistent. But year zero does not mean "no year.“ o Temporal data :Also known as time-series data, refers to a sequence of data points collected or recorded at time intervals, which can be regular or irregular • Has Time stamp , It is sequential and cannot be shifted , Used for identifying the Pattern and Trend
  • 18.
    Relationships Among data: 1.Linear Relationship: As described earlier, a linear relationship is one where the change in one variable is proportional to the change in another variable, resulting in a straight line when plotted on a graph. 2.Exponential Relationship: In an exponential relationship, one variable grows or decays at a rate that depends on an exponent of another variable. This relationship often appears as a curve that rises or falls rapidly.
  • 19.
    1.Logarithmic Relationship: Alogarithmic relationship involves one variable being the logarithm of another variable. This relationship may appear as a curve that rises or falls rapidly at first but then levels off. 2.Polynomial Relationship: Polynomial relationships involve one variable being a polynomial function of another variable. Depending on the degree of the polynomial, the relationship may exhibit different degrees of curvature. 1.Periodic Relationship: Periodic relationships involve the values of one variable repeating at regular intervals as the values of another variable change. This type of relationship is common in cyclic phenomena
  • 20.
6. Monotonic Relationship: the values of one variable consistently increase or decrease as the values of another variable increase. There are two types of monotonic relationships:
   1. Strictly Monotonic: every value of one variable corresponds to a unique value of the other variable, and the relationship never reverses direction.
   2. Non-strictly Monotonic: similar to strictly monotonic relationships, but some values of one variable may correspond to the same value of the other variable.
7. Nonlinear Relationship: any relationship that cannot be adequately represented by a straight line. This category encompasses all of the relationships mentioned above except the linear relationship.
Monotonic vs. Non-Monotonic Relationships (chart): a monotonic relationship is one where the values of one variable consistently increase or decrease as the values of another variable increase; the strictly and non-strictly monotonic cases are defined above.
Correlation
Correlation measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where:
• 1 indicates a perfect positive linear relationship,
• -1 indicates a perfect negative linear relationship, and
• 0 indicates no linear relationship.
While correlation measures the strength and direction of a linear relationship, monotonicity captures any systematic change in the relationship, whether linear or not. Therefore, monotonicity can be present even if the correlation is close to 0, indicating a weak linear relationship.
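The distinction can be made concrete with a quick check. A minimal sketch (assuming NumPy and SciPy are available) comparing the Pearson coefficient (linear association) with the Spearman coefficient (monotonic association) on a monotonic but non-linear relationship:

```python
# Minimal sketch: Pearson vs. Spearman on a monotonic, non-linear relationship.
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 100)
y = np.exp(x)                      # monotonic, strongly non-linear

pearson_r, _ = pearsonr(x, y)      # measures linear association only
spearman_r, _ = spearmanr(x, y)    # measures monotonic association

print(f"Pearson r:  {pearson_r:.3f}")   # noticeably below 1
print(f"Spearman r: {spearman_r:.3f}")  # 1.0, perfectly monotonic
```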
Parametric vs. Non-Parametric Machine Learning
• Parametric examples: linear regression, logistic regression, linear discriminant analysis (LDA), and some neural networks (when they have a fixed number of layers and nodes).
• Non-parametric examples: k-nearest neighbors (KNN), decision trees, random forests, support vector machines (SVM) with non-linear kernels, and some types of neural networks (such as deep learning).
Histogram: a graphical representation of the distribution of numerical data.
Central tendency and variation are two measures used in statistics to summarize data. A measure of central tendency shows where the center or middle of the data set is located, whereas a measure of variation shows the dispersion among the data values.
Dispersion
Dispersion in statistics describes how spread out a set of data is.
Bias
Bias refers to the difference between the expected predictions of a model and the true values of the target variable. A model with high bias is not complex enough to capture the underlying patterns in the data, resulting in underfitting: the model is too simple to capture the complexity of the data, leading to poor performance on both the training and test data.
Variance
Variance refers to the variability of the model's predictions across different training sets. A model with high variance is too complex and captures noise in the training data, resulting in overfitting: the model fits the training data too closely, leading to poor performance on new, unseen data.
Noise
Noise refers to the random variations and irrelevant information within a dataset that cannot be attributed directly to the underlying relationships being modeled. Noise can come from various sources and significantly impacts the quality of the predictions made by a model.
Data Collection and Exploration (EDA – Exploratory Data Analysis):
Data Cleaning:
o Handle missing values: impute or drop them based on context.
o Detect and handle duplicates.
o Identify and handle outliers.
o Standardize data formats and units.
o Resolve inconsistencies and errors.
o Validate data against predefined rules or constraints.
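A minimal pandas sketch of these cleaning steps (assuming a DataFrame `df` with a hypothetical numeric column "price"):

```python
# Minimal sketch: duplicates, missing values, and IQR-based outlier handling.
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                                 # handle duplicates
    df["price"] = df["price"].fillna(df["price"].median())    # impute missing values
    # Flag outliers with the 1.5 * IQR rule and drop them.
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]
```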
Feature Engineering:
o Create new features based on domain knowledge.
o Feature scaling.
o Generate interaction features (e.g., product, division).
o Extract time-based features (e.g., day of week, hour of day).
o Perform dimensionality reduction (e.g., PCA, t-SNE).
o Engineer features from raw data (e.g., text, images).
o Select relevant features for modeling.
Data Transformation:
o Normalize numeric features.
o Scale features to a consistent range.
o Encode categorical variables (one-hot encoding, label encoding, etc.).
o Extract features from text or datetime data.
o Aggregate data at different levels (e.g., group by, pivot tables).
o Apply mathematical transformations (log, square root, etc.).
Modeling
o Model Selection: choose appropriate machine learning algorithms for the task.
o Model Training: train models using the processed and engineered features.
o Model Evaluation: evaluate model performance using appropriate metrics and validation techniques.
Deployment
o Model Deployment: deploy the model to a production environment where it can make predictions on new data.
o Monitoring and Maintenance: continuously monitor the model's performance and update it as necessary when new data becomes available or when performance degrades.
Feedback Loop
o Iterative Improvement: use feedback from the model's performance and any newly collected data to refine the feature engineering and modeling steps, continuously improving the model over time.
Data Collection and Exploration (EDA – Exploratory Data Analysis)
Data Collection
o Gather data from various sources such as databases, APIs, files, etc.
o Extract data using appropriate tools and techniques.
o Ensure data integrity during extraction.
Data Exploration – Univariate Analysis (Exploratory Data Analysis)
o Review data documentation and metadata.
o Understand the general structure of the dataset: types / counts / number of unique values / missing values.
o Numerical feature understanding: min / max / mean / mode / quartiles / missing values / coefficient of variation.
o Normality and spread: distribution / standard deviation / skewness / kurtosis.
o Categorical feature understanding: distribution / frequency / relationships / credibility / missing values.
o Outlier identification.
o Correlation analysis (dependent and independent variables).
o Multicollinearity testing.
Coefficient of Variation (Coefficient of Variability)
A measure of relative variability, often used to compare the variability of different datasets or variables, especially when their means differ.
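As a formula (standard definition, not shown on the slide), with σ the standard deviation and µ the mean:

```latex
\mathrm{CV} = \frac{\sigma}{\mu}
\qquad\text{(often reported as a percentage: } \mathrm{CV}\% = 100\cdot\sigma/\mu\text{)}
```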
Feature Scaling and Transformation
A technique used to standardize or normalize the range of independent variables (features) in a dataset. The goal of feature scaling is to bring all features to a similar scale, which can be beneficial for various machine learning algorithms.
Important in:
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
• Principal Component Analysis (PCA)
• Linear Regression, Logistic Regression, and Regularized Regression
• Ridge Regression and Lasso Regression
• Neural Networks
• K-Means Clustering
Not important in:
• Tree-Based Algorithms and Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
• Rule-Based Algorithms
• Sparse Data: if the dataset is sparse (most feature values are zero or close to zero), feature scaling may not be necessary.
• Non-Numerical Features: categorical variables represented as one-hot encoded vectors, ordinal variables, or binary features typically do not require feature scaling.
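A minimal sketch of the two most common scaling options (assuming a small feature matrix `X`):

```python
# Minimal sketch: standardization vs. min-max scaling with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_mm = MinMaxScaler().fit_transform(X)      # each column rescaled to [0, 1]

print(X_std)
print(X_mm)
```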
Numerical features: identify the distribution of each continuous variable. Most of the time it will align with one of the well-known distributions. Depending on the type of ML model, we may need to transform the feature to a more appropriate distribution for better model performance, e.g., converting a skewed distribution to an approximately normal one using transformation techniques (such as a log transform).
Categorical feature transformation:
Label Encoding: works well if the categorical variable has only two categories.
One-Hot Encoding: first check the frequency of each category and keep the most frequent values; the rest can be grouped into an "Other" category. Each category value is converted into a new column (a dummy variable) with a 0/1 value assigned.
Disadvantages:
• Dimensionality increase.
• Sparse matrix: most values are zero.
• Loss of information: ordinality (order) is lost.
Dummy Variable Encoding: dummy encoding uses N-1 features to represent N labels/categories.
The dummy variable trap occurs when the input variables perfectly predict each other, leading to multicollinearity. Multicollinearity is a scenario where two or more input variables are highly correlated with each other. We try to avoid this scenario even though it won't necessarily affect the overall predictive accuracy of the model. To avoid the issue, we drop one of the newly created columns produced by one-hot encoding.
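A minimal pandas sketch (using a hypothetical "color" column) of one-hot versus dummy encoding:

```python
# Minimal sketch: one-hot encoding (N columns) vs. dummy encoding (N-1 columns).
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

one_hot = pd.get_dummies(df["color"], prefix="color")                    # N columns
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)   # N-1 columns, avoids the dummy variable trap

print(one_hot)
print(dummies)
```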
Frequency (Count) Encoding: encodes categorical features based on the frequency of each category in the dataset.
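A minimal sketch (hypothetical "city" column) of frequency/count encoding with pandas:

```python
# Minimal sketch: replace each category with its frequency in the data.
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

counts = df["city"].value_counts()          # NY: 3, LA: 2, SF: 1
df["city_count"] = df["city"].map(counts)   # each row gets its category's frequency

print(df)
```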
Feature Engineering (Dimensionality Reduction)
The goal is to reduce the number of features (dimensions) in a dataset while preserving the most important information.
Feature Selection: Main Techniques
Feature selection is the process of identifying and selecting a subset of relevant features (variables, predictors) from the original dataset that are most useful for building a predictive model. This helps to improve the model's performance by removing redundant, irrelevant, or noisy data, leading to better generalization, reduced overfitting, and often shorter training times. The techniques are categorized into three types: filter methods, wrapper methods, and embedded methods.
Filter Methods
These methods typically use statistical techniques to assess the relationship between each feature and the target variable.
• Correlation Coefficient: measures the linear relationship between each feature and the target. Features with high correlation with the target and low correlation with other features are preferred. Useful for both regression and classification tasks, especially for linear relationships.
• Chi-Square Test: evaluates the independence between categorical features and the target variable. Typically used for classification tasks with categorical features.
• ANOVA (Analysis of Variance): assesses the significance of features in relation to the target, especially for categorical features with continuous target variables. Useful when dealing with categorical features and continuous targets, primarily for classification tasks.
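A minimal sketch of a filter method (synthetic data, ANOVA F-scores) that keeps the k features most associated with the target:

```python
# Minimal sketch: filter-style feature selection with SelectKBest.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the retained features
print(X_selected.shape)                    # (200, 4)
```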
Wrapper Methods (https://medium.com/analytics-vidhya/feature-selection-extended-overview-b58f1d524c1c)
Wrapper methods evaluate feature subsets by training and evaluating a machine learning model. They search for the best subset of features by considering the interaction between features and their combined impact on model performance.
• Forward Selection: starts with an empty set of features and iteratively adds the feature that improves model performance the most.
• Backward Elimination: starts with all features and iteratively removes the least significant feature.
These can be used with any type of machine learning model (e.g., linear regression, decision trees, SVMs) and are applicable to both regression and classification tasks. However, they can be computationally expensive when there are many features.
• Recursive Feature Elimination (RFE): trains a model and removes the least important feature(s) based on model weights, recursively, until the desired number of features is reached.
• Mutual Information: measures the amount of information obtained about one variable through another variable, capturing non-linear relationships. Applicable to both regression and classification tasks.
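A minimal sketch of Recursive Feature Elimination (synthetic data, logistic regression as the base model):

```python
# Minimal sketch: RFE repeatedly drops the least important feature by model weights.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger = eliminated earlier
```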
Features in the model can be selected using the following evaluation methods. An interaction term is a product of two or more predictors; including it provides a more accurate model when such interactions are present in the data.
Note: goodness of fit refers to how well a statistical model describes the observed data.
What if we do model selection based only on the p-values of the predictors? Calculate the p-values; in this example, Age is no longer significant.
When we consider all the evaluation criteria, it is easier to decide which model is better.
Forward Stepwise Selection
1. Start with no predictors: begin with the simplest model, which includes no predictors (just the intercept).
2. Add predictors one by one:
   • At each step, evaluate all predictors that are not already in the model.
   • For each candidate predictor, fit a model that includes all the predictors currently in the model plus this new predictor.
   • Calculate a model-performance criterion, such as the Residual Sum of Squares (RSS), Akaike Information Criterion (AIC), or Bayesian Information Criterion (BIC), for each of these models.
3. Select the best predictor: identify the predictor whose inclusion results in the best performance according to the chosen criterion (e.g., the predictor that reduces the RSS the most, or gives the lowest AIC/BIC).
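A minimal sketch of this loop (assuming a pandas DataFrame `X` of candidate predictors and a Series `y` as the response; AIC is used as the criterion, with statsmodels OLS):

```python
# Minimal sketch: greedy forward selection — add the predictor that lowers AIC the most.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y: pd.Series) -> list:
    selected, remaining = [], list(X.columns)
    current_aic = sm.OLS(y, np.ones(len(y))).fit().aic   # intercept-only baseline
    improved = True
    while improved and remaining:
        improved = False
        aic_by_feature = {}
        for col in remaining:
            design = sm.add_constant(X[selected + [col]])
            aic_by_feature[col] = sm.OLS(y, design).fit().aic
        best_col = min(aic_by_feature, key=aic_by_feature.get)
        if aic_by_feature[best_col] < current_aic:        # keep only if AIC improves
            selected.append(best_col)
            remaining.remove(best_col)
            current_aic = aic_by_feature[best_col]
            improved = True
    return selected
```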
Embedded Methods
These methods are specific to particular learning algorithms and incorporate feature selection as part of the model-building phase.
• Lasso Regression (L1 Regularization): penalizes the absolute size of the regression coefficients, effectively shrinking some coefficients to exactly zero and thus performing feature selection. Primarily used in linear and logistic regression.
• Ridge Regression (L2 Regularization): penalizes the square of the coefficient magnitudes but does not perform feature selection, since coefficients are not shrunk exactly to zero (included here for comparison).
• Elastic Net: combines L1 and L2 regularization to balance feature selection and regularization; used in linear and logistic regression.
• Tree-based methods (e.g., Random Forest, Gradient Boosting): use feature-importance scores derived from the tree-building process to select the most important features. Applicable to both regression and classification tasks.
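A minimal sketch (synthetic data) of Lasso as an embedded selector: features whose coefficients are shrunk to zero are effectively dropped.

```python
# Minimal sketch: L1 regularization zeroing out uninformative coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # scaling matters for regularized models

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)      # indices of features with non-zero coefficients
print(kept)
```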
Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction and can be applied to many types of machine learning tasks. The primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data.
PCA vs. Feature Selection
• PCA: reduces dimensionality by transforming the original features into a new set of orthogonal features (principal components) that capture the maximum variance in the data. It creates new features rather than selecting a subset of existing ones. The resulting composite features are linear combinations of the original features and are abstract, not directly interpretable in terms of the original features, which can be a disadvantage when model interpretability is crucial.
• Feature Selection: identifies and retains the most relevant and informative subset of the original features, improving model interpretability and performance by eliminating irrelevant, redundant, or noisy features. Because it works directly with the original features, the selected features keep their interpretability and relevance to the original problem.
Use Cases
• PCA: often used when the goal is to reduce dimensionality for visualization, to combat the curse of dimensionality, or to preprocess data for machine learning algorithms that struggle with high-dimensional data.
• Feature Selection: used when the goal is to improve model performance and interpretability by focusing on the most relevant features.
PCA is widely used in clustering because it offers several key advantages: it reduces the number of features while preserving as much variability as possible. This simplification helps clustering algorithms (such as K-means or hierarchical clustering) perform better by reducing noise and focusing on the most informative components. By reducing dimensions, PCA also improves the speed of clustering algorithms and makes it easier to identify distinct clusters.
How PCA Works
The primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data.
The number of principal components created for a given dataset is equal to the number of features in the original dataset. However, not all principal components capture the same amount of variance in the data. Typically, only a subset of the principal components is retained for dimensionality reduction, usually those corresponding to the largest eigenvalues.
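A minimal sketch (Iris data) of fitting PCA, inspecting the variance captured by each component, and keeping only the leading ones:

```python
# Minimal sketch: PCA keeps the components with the largest eigenvalues (most variance).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)   # 4 original features

pca = PCA(n_components=2)            # retain the 2 leading components
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_)   # share of total variance per component
print(X_reduced.shape)                 # (150, 2)
```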
The unknown coefficients β0 and β1 in linear regression define the population regression line. We seek to estimate these unknown coefficients from a sample, just as the sample mean is used to estimate the population mean.
For linear regression, the 95 % confidence interval for β1 approximately takes the form β̂1 ± 2 · SE(β̂1): there is approximately a 95 % chance that this interval will contain the true value of β1.
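Written out (standard form; the factor 2 approximates the relevant t/normal quantile, and the SE expression is the usual one for simple linear regression):

```latex
\hat{\beta}_1 \;\pm\; 2\cdot \mathrm{SE}(\hat{\beta}_1),
\qquad
\mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
```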
Model Fitting Techniques
The goal of these techniques is to find the parameters that allow the model to predict or classify new data accurately.
https://medium.com/analytics-vidhya/k-neighbors-regression-analysis-in-python-61532d56d8e4
Low K (e.g., K=1):
• Bias: with a low K value, the model tends to have lower bias because it captures more detailed patterns in the training data. Each prediction is influenced by only a single data point, leading to more complex decision boundaries.
• Variance: with a low K, the model tends to have higher variance because it is more sensitive to noise in the training data. Predictions can be heavily influenced by specific training instances, leading to overfitting.
High K (e.g., K=N, where N is the number of training instances):
• Bias: with a high K value, the model tends to have higher bias because it averages over more data points, potentially leading to oversimplified decision boundaries; it may miss subtle patterns in the data.
• Variance: with a high K, the model tends to have lower variance because it smooths out predictions by averaging over a larger number of neighbors. This reduces the impact of individual noisy data points, leading to more stable predictions.
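A minimal sketch (synthetic data) showing how the choice of k trades bias against variance in KNN regression, judged by cross-validated error:

```python
# Minimal sketch: cross-validated MSE for several values of k.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)

for k in (1, 5, 25, 100):
    model = KNeighborsRegressor(n_neighbors=k)
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"k={k:3d}  CV MSE={mse:.1f}")   # very small and very large k usually do worse
```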
Ordinary Least Squares (OLS) – Model Fitting
Residual Sum of Squares (RSS)
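For reference, the standard definition (with ŷᵢ the fitted value for observation i), which OLS minimizes:

```latex
\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```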
Regularization
Regularization is a technique used in machine learning and statistical modeling to prevent overfitting and improve the generalization performance of models. Overfitting occurs when a model learns the training data too well, capturing noise or random fluctuations, which leads to poor performance on unseen data.
Cross-Validation Techniques
• Resubstitution
• Hold-out
• K-fold cross-validation
• LOOCV (leave-one-out cross-validation)
• Random subsampling
• Bootstrapping
Validation techniques in machine learning are used to estimate the error rate of the model, which can be considered close to the true error rate on the population.
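A minimal sketch (synthetic data) of 5-fold cross-validation for estimating generalization error:

```python
# Minimal sketch: K-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)          # accuracy on each held-out fold
print(scores.mean())   # cross-validated estimate of accuracy
```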
Combining multiple models to improve predictive performance over any single model.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. The model is fit on B bootstrapped training sets, and the predictions are combined by averaging (in regression) or by majority vote (in classification).
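A minimal sketch (synthetic data) of bagging decision trees: train on B bootstrap samples and combine by majority vote.

```python
# Minimal sketch: bagged decision trees evaluated with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,   # B = 100 bootstrap samples
                        bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())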
Boosting: another approach for improving the predictions resulting from a decision tree.
Trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.
The number of trees B: unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.
To find the best split, the algorithm considers one feature at a time and iterates through all the features, using the Gini index; a tree with a single split is called a stump.
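A minimal sketch (synthetic data) of gradient boosting, comparing a few values of the number of trees B, which can be chosen by cross-validation:

```python
# Minimal sketch: boosting with depth-1 trees ("stumps"); B = n_estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for B in (50, 200, 800):
    gbm = GradientBoostingClassifier(n_estimators=B, learning_rate=0.1, max_depth=1)
    print(B, cross_val_score(gbm, X, y, cv=5).mean())
```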
Bootstrap sampling: generate a random number between 0 and 1 and pick a record from the corresponding bin; repeat the same process to create the second sample list, and so on.
Logistic regression is a type of statistical model used for binary classification tasks. It predicts the probability of a binary outcome (i.e., an event with two possible values, such as 0 and 1, true and false, yes and no).
Probability output: unlike linear regression, logistic regression provides probabilities of class membership, which can be useful for decision-making.
The core of logistic regression is the logistic function (also called the sigmoid function), which maps any real-valued number into the range (0, 1):
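The two equivalent forms the slide refers to (standard definitions), together with the model's probability for a predictor X:

```latex
\sigma(z) = \frac{1}{1+e^{-z}} = \frac{e^{z}}{1+e^{z}},
\qquad
p(X) = \frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}
```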
In statistics and probability theory, odds represent the ratio of the probability of success to the probability of failure for a given event. The odds of an event can be expressed in different ways: as odds in favor, odds against, or simply as odds.
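In symbols (standard relationship between probability, odds, and the log-odds used by logistic regression):

```latex
\text{odds} = \frac{p}{1-p},
\qquad
\log\!\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X
```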
A Bayes classifier, also known as a Naive Bayes classifier, is a probabilistic machine learning algorithm based on Bayes' theorem.
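A minimal sketch (synthetic data) of a Gaussian Naive Bayes classifier:

```python
# Minimal sketch: Naive Bayes classification with posterior class probabilities.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))          # accuracy
print(nb.predict_proba(X_test[:3]))      # posterior class probabilities
```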
A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.
Decision tree regression is a type of supervised learning algorithm used in machine learning, primarily for regression tasks. In decision tree regression, the algorithm builds a tree-like structure by recursively splitting the data into subsets based on the features that best separate the target variable (continuous in regression) into homogeneous groups.
An impurity measure, also known as a splitting criterion or splitting rule, is a metric used in decision tree algorithms to evaluate the homogeneity of a set of data points with respect to the target variable. The impurity measure serves as the criterion for selecting the best feature and split point at each node of the tree. The goal is to find the feature and split point that result in the most homogeneous child nodes, leading to better predictions and a more accurate decision tree model.
Leaf node prediction: once a leaf node is reached, the prediction is made based on the majority class (for classification) or the mean (for regression) of the target variable in that leaf node. This prediction becomes the output of the decision tree model for the given instance.
Mean squared error (MSE) is used as the impurity measure in decision tree regression. By minimizing the MSE at each split, decision tree regression effectively partitions the feature space into regions that are more homogeneous with respect to the target variable, leading to accurate predictions for unseen data points.
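A minimal sketch (synthetic data) of decision tree regression with the squared-error (MSE) criterion; leaf predictions are the mean of the training targets in each leaf. (The criterion name "squared_error" assumes scikit-learn ≥ 1.0; older versions call it "mse".)

```python
# Minimal sketch: regression tree splitting on MSE.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict(X[:5]))     # predictions = leaf means
print(tree.get_depth())        # 3
```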
X0 will be selected (the split point with the lowest impurity).
Bagging vs. Random Forest
1. Feature selection:
   • Bagging: uses all available features for each split in the decision trees.
   • Random Forest: randomly selects a subset of features for each split, which introduces additional randomness and reduces the correlation between the trees.
2. Bias-variance tradeoff:
   • Bagging: averaging multiple models reduces variance, but when there is one strong predictor the bagged trees remain highly correlated, so bagging will not lead to a substantial reduction in variance over a single tree in that setting; it does not inherently reduce the correlation between the models.
   • Random Forest: reduces both variance and correlation between models by introducing randomness in feature selection, leading to lower overall variance and improved performance. Random forests overcome the tree-correlation problem by forcing each split to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, so other predictors have more of a chance. We can think of this process as decorrelating the trees, making the average of the resulting trees less variable and hence more reliable.
3. Performance:
   • Bagging: can be applied to any base model and improves performance by reducing overfitting through model averaging.
   • Random Forest: specifically designed for decision trees; typically performs better than bagging with decision trees due to the reduced correlation between trees.
The k-nearest neighbors algorithm (k-NN) is a non-parametric, lazy learning method used for classification and regression. The output is based on the majority vote (for classification) or the mean (or median, for regression) of the k nearest neighbors in the feature space.
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It is particularly effective for binary classification problems, where the goal is to classify data points into one of two categories.
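A minimal sketch (synthetic data) of an SVM with an RBF kernel for binary classification; the pipeline includes scaling, since SVMs are scale-sensitive:

```python
# Minimal sketch: scaled SVM classifier evaluated with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(svm, X, y, cv=5).mean())
```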
The separating hyperplane depends on the dimension of the feature space: in one dimension it is a point, in two dimensions it is a line, and in three dimensions it is a plane (surface).
Residual Sum of Squares (RSS) = Sum of Squared Errors (SSE)
Total Sum of Squares (TSS) = SST
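Standard definitions (with ȳ the mean of the observed responses), including how they combine into R²:

```latex
\mathrm{RSS} = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2,
\qquad
\mathrm{TSS} = \sum_{i=1}^{n}(y_i-\bar{y})^2,
\qquad
R^2 = 1-\frac{\mathrm{RSS}}{\mathrm{TSS}}
```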
Mean Squared Error (MSE)
MSE measures the average squared error; higher values indicate larger discrepancies between predicted and actual values. Because the errors are squared, MSE penalizes large errors more heavily, making it sensitive to outliers. It is commonly used for its mathematical properties but may be less interpretable than other metrics.
Importance:
• MSE penalizes larger errors more than smaller ones due to squaring.
• It is widely used in optimization and model training because it is differentiable, which matters for gradient-based methods.
What it tells about the model:
• A lower MSE indicates a model with fewer large errors.
• It gives a sense of the average squared error, which emphasizes the impact of larger errors.
The common shape of the Mean Squared Error (MSE) graph, when plotted as a function of the model parameters, is typically a convex curve.
Mean Absolute Error (MAE)
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average, over the test sample, of the absolute differences between predictions and actual observations, with all individual differences weighted equally.
Importance:
• MAE is a straightforward measure of error magnitude.
• It is less sensitive to outliers than MSE and RMSE because it does not square the errors.
What it tells about the model:
• A lower MAE indicates a model that makes smaller errors on average.
• Because it uses absolute differences, it indicates the typical size of the errors in the same units as the target variable.
Root Mean Squared Error (RMSE)
RMSE is the square root of the average of the squared differences between predictions and actual observations. It represents the standard deviation of the prediction errors.
Importance:
• RMSE is the square root of MSE, bringing the metric back to the same units as the target variable.
• It is more sensitive to outliers than MAE due to the squaring of errors before averaging.
What it tells about the model:
• A lower RMSE indicates a better fit, similar to MSE but more interpretable on the scale of the target variable.
• It gives an idea of how large the errors are in absolute terms.
Why RMSE can be viewed as the standard deviation of the prediction errors:
• If we assume the prediction errors (residuals) are normally distributed with mean zero, the RMSE estimates the standard deviation of that distribution.
• Under a normal distribution, the standard deviation measures the average distance of the data points from the mean, which in this case is zero.
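A minimal sketch (toy predictions) computing MAE, MSE, and RMSE with scikit-learn:

```python
# Minimal sketch: the three regression error metrics side by side.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # same units as the target

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```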
Residual Standard Error (RSE)
The Residual Standard Error is a measure used in regression analysis to quantify the typical size of the residuals (prediction errors) from a linear regression model. It estimates the standard deviation of the residuals, which helps in understanding how well the model fits the data. RSE is in the same units as the dependent variable, making it straightforward to interpret.
Adjustment for predictors: unlike simpler measures such as RMSE, RSE accounts for the number of predictors in the model. This adjustment (using n − p − 1 in the denominator) helps prevent overfitting by penalizing models with more predictors.
Model comparison:
• Comparison tool: RSE allows comparison of different models. When comparing models with the same dependent variable, a lower RSE indicates a better fit.
• Relative measure: while RSE itself is not an absolute goodness-of-fit measure, it is useful when comparing models to determine which one better explains the variability in the data.
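The standard formula (with n observations and p predictors):

```latex
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-p-1}}
```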
Large RSE values may indicate a poor fit, suggesting that the model is not capturing all the relevant information in the data.
• Model assessment: RSE helps assess the accuracy of a regression model; a lower RSE indicates a model that better captures the data's variability.
• Predictive accuracy: RSE indicates how close the predicted values are to the actual values on average.
• Identification of outliers or influential points: large residuals can indicate outliers or influential points that may unduly affect the model's performance. By examining these cases closely, researchers can decide whether to include, exclude, or transform them to improve model fit.
• Detection of heteroscedasticity: heteroscedasticity occurs when the variability of the residuals is not constant across all levels of the predictor variables. RSE can help identify this issue, prompting researchers to explore transformations or alternative modeling techniques.
Residual Plot
A residual is a measure of how far a point is vertically from the regression line; simply, it is the error between a predicted value and the observed actual value. A typical residual plot has the residual values on the y-axis and the independent variable on the x-axis.
"Heterogeneity in residuals" refers to the situation where the variability of the residuals is not consistent across all levels of the predictor variables. In other words, the spread or dispersion of the residuals varies systematically with the values of one or more predictor variables.
Characteristics of a good residual plot: residuals scattered randomly around zero, with roughly constant spread and no obvious pattern.
Identifying whether the error is high or low:
• Scale of the target variable: if the target variable has a large range (e.g., house prices from $100,000 to $1,000,000), an RMSE of $10,000 might be considered low. Conversely, for smaller ranges, such as predicting daily temperature, an RMSE of 10 degrees might be high.
• Industry standards: different fields have established benchmarks for acceptable error rates. In some financial models an RMSE of a few dollars might be acceptable, while in other domains, such as temperature prediction, an RMSE of a few degrees could be too high.
• Historical data: compare the error values to those of previous models or known standards within the same domain to understand the expected range of errors.
• Impact of errors: consider the practical implications of the error. In medical diagnostics even small errors can be critical, whereas in movie recommendation systems higher errors might be more tolerable.
• Business goals: align acceptable error rates with business goals and requirements. Sometimes a slightly higher error is acceptable if it brings significant cost savings or other benefits.
Coefficient Analysis
H0: there is no relationship between X and Y. Mathematically, this corresponds to testing H0: β1 = 0, since if β1 = 0 then Y = β0 + ε and X is not associated with Y.
To test the null hypothesis, we need to determine whether β̂1, our estimate of β1, is sufficiently far from zero that we can be confident β1 is non-zero. How far is far enough?
These coefficients represent the estimated change in the dependent (response) variable for a one-unit change in the corresponding predictor variable, holding all other variables constant. For example, if the estimate for a predictor X1 is 0.5, then on average, for each one-unit increase in X1, the dependent variable is estimated to increase by 0.5 units, assuming all other variables in the model remain constant.
• Coefficient magnitude: look at the magnitude of the coefficients. Larger coefficients imply a stronger relationship between the predictor and the response variable; for example, a coefficient of 2 means that a one-unit increase in the predictor is associated with a two-unit change in the response.
• Coefficient direction: determine the direction of the relationship between the predictor and the response. A positive coefficient indicates a positive relationship: as the predictor increases, the response also tends to increase. Conversely, a negative coefficient indicates that an increase in the predictor is associated with a decrease in the response.
• Confounding variables: be aware of confounding variables and multicollinearity issues. If coefficients change substantially when adding or removing variables from the model, the variables may be correlated with each other, leading to potential issues in interpretation.
Standard Error
Understanding the standard error helps in assessing the stability and robustness of the model's parameter estimates. The standard error estimates how much the coefficient estimates would vary from the true population parameters across different samples of the same size drawn from the population.
t-Value
The t-value, also known as the t-statistic, is the ratio of the coefficient estimate to its standard error in regression analysis; it represents the standardized deviation of the coefficient estimate from zero, expressed in units of standard errors.
Why is it important?
• Significance testing: the t-value is used to conduct hypothesis tests on the coefficients, i.e., whether the corresponding predictor has a statistically significant effect on the response variable. This is essential for understanding which predictors are truly influential in the model.
• Higher t-values indicate stronger evidence against the null hypothesis (that the coefficient is zero), suggesting that the corresponding predictor is more likely to be important in explaining the variation in the response.
• Comparing t-values across coefficients allows researchers to assess the relative importance of different predictors in the model.
• Low t-values across all coefficients may indicate that the model is not capturing important relationships between the predictors and the response variable.
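In symbols (standard form for testing H0: β1 = 0):

```latex
t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}
```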
The p-value is a probability that measures the strength of evidence against the null hypothesis in statistical hypothesis testing. If the p-value is less than the significance level, the coefficient is considered statistically significant. When interpreting p-values, consider the chosen significance level (e.g., 0.05) and whether multiple comparisons are being made (which may require adjusting the significance level).
• A low p-value indicates strong evidence against the null hypothesis, suggesting that the coefficient estimate is statistically significant.
• A high p-value indicates weak evidence against the null hypothesis, suggesting that the coefficient estimate is not statistically significant.
R²
Importance:
• R² indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
• It is a normalized metric, ranging from 0 to 1 (it can be negative if the model is worse than a horizontal line).
What it tells about the model:
• A higher R² (closer to 1) means a better fit.
• It shows how well the independent variables explain the variance in the dependent variable; however, it does not provide information about the size of the errors.
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, which can lead to unstable estimates of the regression coefficients and inflated standard errors:
1. Unreliable estimates of regression coefficients: when predictor variables are highly correlated with each other, it becomes difficult for the regression model to determine the individual effect of each predictor on the outcome variable. As a result, the estimated coefficients may be unstable or have high standard errors.
2. Uninterpretable coefficients: in the presence of multicollinearity, coefficients may have counterintuitive signs or magnitudes, making their interpretation challenging or misleading.
3. Inflated standard errors: multicollinearity inflates the standard errors of the regression coefficients, which can lead to wider confidence intervals and less precise estimates of the coefficients' true values.
4. Reduced statistical power: high multicollinearity reduces the statistical power of the regression model, making it less likely to detect significant relationships between predictors and the outcome, even when those relationships truly exist.
The Variance Inflation Factor (VIF) is a measure used to quantify the severity of multicollinearity in regression analysis.
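A minimal sketch (assuming a pandas DataFrame `X` of predictors) that computes the VIF of each predictor via statsmodels; VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the others. A common rule of thumb is that VIF above roughly 5–10 signals problematic multicollinearity.

```python
# Minimal sketch: VIF per predictor.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    exog = sm.add_constant(X)   # include an intercept in the auxiliary regressions
    return pd.Series(
        [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
        index=X.columns,
    )
```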
Log Loss
Log loss, also known as logistic loss or cross-entropy loss, is a performance metric for classification models, particularly those that output probabilities for each class. Log loss quantifies the difference between the predicted probabilities and the actual class labels. For a classification problem, it is defined as:
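The standard binary form (the multiclass version sums over classes), with p̂ᵢ the predicted probability of the positive class for observation i:

```latex
\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]
```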
Interpretation
• Lower log loss: the predicted probabilities are close to the actual class labels, suggesting a better model.
• Higher log loss: the predicted probabilities are far from the actual class labels, suggesting a poorer model.
What log loss tells about the model:
1. Probability calibration: log loss evaluates how well the predicted probabilities are calibrated with respect to the true outcomes. It penalizes both overconfident wrong predictions and underconfident correct predictions.
2. Model performance: it provides a more nuanced measure of performance than accuracy alone. While accuracy measures the fraction of correct predictions, log loss also considers the confidence of those predictions.
3. Handling class imbalance: log loss can handle imbalanced classes better than accuracy because it takes the predicted probabilities into account rather than just the final classification.
Confusion Matrix
Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. The confusion matrix provides a more insightful picture: not only the overall performance of a predictive model, but also which classes are being predicted correctly or incorrectly and what types of errors are being made. The matrix can be represented as a table of actual versus predicted classes.
Precision and recall should be calculated for each class. Precision is based on the predictions; recall is based on the ground truth.
• Accuracy gives an overall understanding of how well the model is performing, but it can be misleading if classes are imbalanced.
• Precision (always based on the predictions) is important when the cost of false positives is high. It helps assess the quality of positive predictions.
• Recall (sensitivity; always based on the ground truth) is crucial when capturing all actual positives is essential. It measures the model's ability to identify positive instances.
• F1 score provides a balance between precision and recall, especially when there is an uneven class distribution. It is a better measure of performance when there is a trade-off between false positives and false negatives.
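A minimal sketch (toy labels) producing the confusion matrix together with per-class precision, recall, and F1:

```python
# Minimal sketch: confusion matrix and classification report.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```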
Classification Report
Overall metrics:
• Accuracy: 0.60 — 60% of the total predictions were correct.
• Macro average: computes each metric independently for each class and then takes the average of these per-class metrics. It treats all classes equally, without considering the class distribution.
These macro-average metrics provide an overall measure of model performance that treats all classes equally, regardless of their frequency in the dataset. In the weighted average, the performance metrics for each class are weighted by the number of instances in that class, giving more importance to classes with more instances.
The weighted average provides a more realistic measure of overall model performance by giving more importance to the classes with more instances. This is particularly useful in datasets with imbalanced class distributions, as it ensures that the performance metrics reflect the model's ability to correctly classify the more prevalent classes.
How to use these results for model improvement: if weighted averages are significantly higher than macro averages, the model performs well on frequent classes but poorly on rare ones. Possible remedies:
• Oversampling minority classes: use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to generate more samples for underrepresented classes.
• Undersampling majority classes: reduce the number of samples in the overrepresented classes to balance the class distribution.
• Class weights: modify the loss function to give higher weights to minority classes during training, encouraging the model to focus more on these classes.
Low precision, recall, and F1 scores for specific classes:
• Class-specific data augmentation: create additional synthetic data, or collect more real data, for the poorly performing classes.
• Feature engineering: develop new features that may be more informative for the difficult classes.
• Class-specific models: train separate models for each class, or use ensemble methods that can better handle class-specific peculiarities.
High performance on training data but low performance on certain test classes:
• Regularization: apply techniques like L1/L2 regularization to prevent overfitting.
• Pruning decision trees: if using decision trees or random forests, prune the trees to reduce complexity and prevent overfitting.
• Cross-validation: use cross-validation to ensure that the model generalizes well across different subsets of the data.
Consistently low recall or precision across multiple classes in both macro and weighted averages:
• Hyperparameter tuning: use grid search or random search to find the optimal hyperparameters for the model.
• Ensemble methods: combine multiple models to leverage their strengths and mitigate individual weaknesses; methods like bagging, boosting, and stacking can improve overall performance.
• Regular updates: regularly update the model with new data so it captures the most recent patterns and trends.
If the current improvements are insufficient, it may indicate the need for a different model architecture:
• Algorithm choice: experiment with different algorithms (e.g., switching from a decision tree to a gradient boosting machine or a neural network) to find one that better captures the data patterns.
• Neural network layers: for deep learning models, adjust the architecture (e.g., the number of layers and units).
Practical Steps:
1. Evaluate metrics: carefully analyze the precision, recall, and F1 score for each class; compare macro and weighted averages to understand overall versus per-class performance.
2. Diagnose issues: identify which classes are underperforming and why (e.g., lack of data, inherent difficulty).
3. Implement improvements: choose and apply the appropriate techniques from the actions listed above based on the diagnosis, and regularly monitor the impact of these changes on the model's performance metrics.
4. Iterate and optimize: continuously iterate on the model, using new data and feedback to further refine performance; use tools like learning curves to understand the impact of more data or different algorithms.
Deviance residuals in a logistic regression summary table provide detailed information about the fit of the model to individual data points and help identify potential outliers or issues with the model. A high maximum value compared to the other values may suggest that there are outliers or poorly fitted observations in the data.
The deviance is a measure of the difference between a fitted model and the perfect model (also called the saturated model). The deviance for a logistic regression model can be divided into two parts:
1. Null deviance: the deviance of a model with no predictors, only an intercept; it serves as a baseline to compare with the fitted model.
2. Residual deviance: the deviance of the fitted model with the predictors included.
There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.
What do we mean by the variance and bias of a statistical learning method?
Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂. Ideally, the estimate of f should not vary too much between training sets; however, if a method has high variance, then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance.
Estimating the population mean µ of a random variable Y: how far off will a single estimate µ̂ be? This is answered by the standard error of µ̂ (analogous to the residual standard error in regression). Standard errors can be used to compute confidence intervals.
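The standard formulas (with σ the standard deviation of Y and n the sample size):

```latex
\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n},
\qquad
\hat{\mu} \pm 2\cdot\mathrm{SE}(\hat{\mu})\ \text{ (approximate 95\% confidence interval)}
```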