DataScience-101

Data Science in a Day - Workshop
Karthikeyan VK
Cloud Native Architect & Microsoft MVP

Basics
Data Cleaning
Exploratory Data Analysis (EDA)
Feature Selection and Engineering
Data Pre-Processing
Model Selection/Training/Evaluation
Deployment

Types of Data
• Quantitative
• Qualitative

Attributes
An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.
E.g., the temperature of an object varies over time.

Properties of Attributes
• Distinctness: = and ≠
• Order: <, >, ≤, ≥
• Addition: + and -
• Multiplication: * and /

Types of Attributes
• Nominal
• Ordinal
• Interval
• Ratio
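Read together, the properties and the four attribute types form a ladder: each type supports the operations of the type before it plus one more. The sketch below assumes the usual mapping (nominal: distinctness; ordinal: adds order; interval: adds addition; ratio: adds multiplication). The names ATTRIBUTE_OPERATIONS and supports are illustrative and do not come from the workshop repository.

```python
# Operations that are meaningful for each attribute type.
# Each type inherits the operations of the previous one and adds its own.
ATTRIBUTE_OPERATIONS = {
    "nominal": ["distinctness"],
    "ordinal": ["distinctness", "order"],
    "interval": ["distinctness", "order", "addition"],
    "ratio": ["distinctness", "order", "addition", "multiplication"],
}


def supports(attribute_type, operation):
    """Return True if the operation is meaningful for the attribute type."""
    return operation in ATTRIBUTE_OPERATIONS[attribute_type]


# Zip codes are nominal: equality is meaningful, ordering is not.
print(supports("nominal", "order"))            # False
# Temperature in Celsius is interval: differences are meaningful, ratios are not.
print(supports("interval", "multiplication"))  # False
# Length is ratio: "twice as long" is meaningful.
print(supports("ratio", "multiplication"))     # True
```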

Prerequisites – Readme.md
https://bit.ly/predtemplate

Prerequisites
Open Visual Studio Code and make sure the code is cloned.

Data Quality
Accuracy
Completeness
Consistency
Interpretability
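Of these dimensions, completeness is the easiest to measure directly. A minimal sketch, assuming missing values are represented as None; the records and column names are hypothetical and are not taken from the workshop dataset.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"company": "buick", "price": 16500.0},
    {"company": "jaguar", "price": None},
    {"company": None, "price": 7000.0},
    {"company": "nissan", "price": 5500.0},
]


def completeness(rows, column):
    """Fraction of rows in which the column has a non-missing value."""
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)


for column in ("company", "price"):
    print(column, completeness(records, column))
```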

Exploratory Data Analysis (EDA)

Exploratory Data Analysis
Use visualization to draw insights from the data and identify any patterns, trends, or relationships.

Visualize Cars vs Price
Insights
• Jaguar and Buick seem to have the highest-priced cars.
• Car companies like Nissan, Renault, and Mercury have only one or two data points each.
• So we can't make any inference about the lowest-priced car companies.
Note:
There are too many categories in the car company feature, so we can derive a new feature, Company Price Range, which labels each company's price range as Low Range, Medium Range, or High Range.
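The derived Company Price Range feature can be sketched as follows. This is an illustration under stated assumptions, not the workshop's implementation: the (company, price) pairs are made up, and the cut-offs LOW_MAX and MEDIUM_MAX are hypothetical values that would normally be chosen from the actual price distribution.

```python
from statistics import mean

# Hypothetical (company, price) pairs standing in for the workshop dataset.
cars = [
    ("jaguar", 35000.0), ("jaguar", 36000.0),
    ("buick", 33000.0), ("buick", 31000.0),
    ("toyota", 8000.0), ("toyota", 9000.0),
    ("audi", 18000.0), ("audi", 17000.0),
]

# Assumed cut-offs; in practice they would come from the price distribution.
LOW_MAX = 10000.0
MEDIUM_MAX = 20000.0


def price_range(avg_price):
    """Map a company's average price to a coarse range label."""
    if avg_price < LOW_MAX:
        return "Low Range"
    if avg_price < MEDIUM_MAX:
        return "Medium Range"
    return "High Range"


# Group prices by company, then label each company by its average price.
prices_by_company = {}
for company, price in cars:
    prices_by_company.setdefault(company, []).append(price)

company_price_range = {
    company: price_range(mean(prices))
    for company, prices in prices_by_company.items()
}
print(company_price_range)
```

Replacing the many company categories with three range labels trades detail for a feature that has enough data points per category.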

Feature Engineering
Dimensionality reduction
Discretization and Binarization
Normalization or Standardization

Dimensionality Reduction
Feature Selection
Filter - Variance
Wrapper - Greedy approach
Embedded - L1 regularization (adds bias)
Feature Extraction
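As an illustration of the filter approach listed above, the sketch below drops features whose variance does not exceed a threshold. The feature columns and the variance_filter helper are hypothetical. The default threshold of 0.0 removes only constant columns.

```python
from statistics import pvariance

# Hypothetical feature columns; "doors" is constant and carries no information.
features = {
    "horsepower": [70.0, 110.0, 150.0, 200.0],
    "doors": [4.0, 4.0, 4.0, 4.0],
    "engine_size": [1.2, 1.6, 2.0, 3.0],
}


def variance_filter(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return [
        name for name, values in columns.items()
        if pvariance(values) > threshold
    ]


selected = variance_filter(features)
print(selected)  # ['horsepower', 'engine_size']
```

Variance depends on the unit of measurement, so a non-zero threshold is only comparable across features that share a scale.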

Discretization and Binarization

Discretization
Unsupervised
Equal-Width
Equal-Frequency
K-Means
Supervised
Decision Trees
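The two simplest unsupervised methods above can be sketched in a few lines. Equal-width binning splits the value range into intervals of the same size. Equal-frequency binning puts roughly the same number of values into each bin. The function names and the price list are hypothetical, and the equal-width version assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


prices = [5000, 6000, 7000, 8000, 9000, 40000]  # hypothetical, with one outlier
print(equal_width_bins(prices, 3))      # [0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(prices, 3))  # [0, 0, 1, 1, 2, 2]
```

The outlier shows the trade-off: equal-width bins leave the middle bin empty, while equal-frequency bins stay balanced but hide how far apart the values are.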

Binarization
Global Thresholding
Local Thresholding
Optimization-based binarization

Check the mean and use it as the threshold to bin the data
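One reading of the note above is global thresholding with the mean as the cut-off: values above the mean become 1, the rest become 0. A minimal sketch under that assumption; binarize_by_mean and the price list are hypothetical.

```python
from statistics import mean


def binarize_by_mean(values):
    """Global thresholding: 1 if the value is above the mean, else 0."""
    threshold = mean(values)
    return [1 if v > threshold else 0 for v in values]


prices = [5000, 7000, 9000, 30000]  # hypothetical prices
print(binarize_by_mean(prices))  # [0, 0, 0, 1]
```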

Data Pre-processing
Scaling
Encoding
Normalization
Standardization

Normalization or Standardization

Standardization & Normalization
• Decimal Scaling
• Min-Max Normalization
• Z-Score Normalization (Standardization)
Apply standardization or normalization when features of the input dataset have anomalies or simply when they are measured in different units (e.g., pounds, meters, miles).
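The three techniques listed above can be sketched with the standard library alone. The function names and sample values are illustrative. The sketch assumes the values are not all identical, since that would make the min-max range or the standard deviation zero.

```python
from statistics import mean, pstdev


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer so every |v| / 10^j < 1."""
    j = 0
    largest = max(abs(v) for v in values)
    while largest / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


weights = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical values
print(decimal_scaling(weights))  # [0.1, 0.2, 0.3, 0.4, 0.5]
print(min_max(weights))          # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score(weights))
```

After z-score scaling the values have mean 0 and standard deviation 1.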

Standardization
• Before Principal Component Analysis (PCA)
• Before Clustering
• Before K-Nearest Neighbors (KNN)
• Before Support Vector Machine (SVM)
• Before Measuring Variable Importance in Regression Models

Categorical to Numerical Transformation(Encoding)
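Two common ways to turn categories into numbers are label encoding (one integer per category) and one-hot encoding (one 0/1 column per category). The slide does not say which one the workshop uses, so both are sketched here with hypothetical helper names and data.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted category order."""
    mapping = {
        category: code
        for code, category in enumerate(sorted(set(values)))
    }
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per distinct category, in sorted category order."""
    categories = sorted(set(values))
    rows = [
        [1 if v == category else 0 for category in categories]
        for v in values
    ]
    return rows, categories


fuel = ["gas", "diesel", "gas", "gas"]  # hypothetical categorical feature
codes, mapping = label_encode(fuel)
print(codes, mapping)    # [1, 0, 1, 1] {'diesel': 0, 'gas': 1}
rows, categories = one_hot_encode(fuel)
print(rows, categories)  # [[0, 1], [1, 0], [0, 1], [0, 1]] ['diesel', 'gas']
```

Label encoding implies an order between categories, which suits ordinal attributes. One-hot encoding avoids that implied order for nominal ones.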

Fit & Transform to remove the anomalies

Model Selection/Training/Evaluation

Model Selection
• Select an ML algorithm.
• Select the one that performs best on the problem and data.
• Candidates include linear regression, logistic regression, decision trees, random forests, and neural networks.

Features for Training & Testing
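Splitting the data into a training set and a held-out test set can be sketched as below. The helper split_train_test is hand-rolled for illustration, and the 80/20 ratio and the seed are assumptions, not values taken from the workshop.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split off a test set."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for (features, target) rows
train, test = split_train_test(rows)
print(len(train), len(test))  # 8 2
```

Shuffling before the split keeps any ordering in the file, such as rows sorted by price, from leaking into the test set.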

Model Training
• Once the algorithm or model has been selected, the next step is to train the model on the training data.
• This involves fitting the model to the data by adjusting the parameters of the model to minimize the error or maximize the accuracy.
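To make "adjusting the parameters to minimize the error" concrete, the sketch below fits a one-feature linear regression by gradient descent on the mean squared error. The learning rate, epoch count, and data are assumptions chosen for this toy example. Libraries typically solve linear regression differently.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, epochs=5000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training point with the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```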

Model Evaluation – Regression
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Absolute Error (MAE)
• R-squared (R²)
• Adjusted R-squared
• Mean Absolute Percentage Error (MAPE)
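The regression metrics above can be computed directly from the true and predicted values. A minimal sketch using the standard definitions; the function name, the sample values, and the n_features parameter (the number of predictors, needed for adjusted R-squared) are illustrative. MAPE assumes no true value is zero.

```python
import math


def regression_metrics(y_true, y_pred, n_features=1):
    """Compute common regression metrics from true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": r2,
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "mape": 100 / n * sum(abs(e / t) for e, t in zip(errors, y_true)),
    }


y_true = [100.0, 200.0, 300.0, 400.0]  # hypothetical targets
y_pred = [110.0, 190.0, 310.0, 390.0]  # hypothetical predictions
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, round(value, 4))
```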

Model Evaluation - Clustering
• Silhouette Coefficient
• Calinski-Harabasz Index
• Davies-Bouldin Index
• Dunn Index
• Rand Index
• Adjusted Rand Index
• Normalized Mutual Information
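As one worked example from this list, the Rand Index is the fraction of point pairs on which two clusterings agree: both place the pair in the same cluster, or both place it in different clusters. The sketch below is a direct pairwise implementation with hypothetical labels. It needs reference labels, unlike the Silhouette Coefficient and the other internal indices.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
        pairs += 1
    return agree / pairs


# Cluster names do not matter, only which points are grouped together.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.5
```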

Model Evaluation - Classification
• Accuracy
• Precision
• Recall
• F1-score
• ROC curve
• AUC
• Confusion matrix
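Accuracy, precision, recall, and F1-score all derive from the four cells of the binary confusion matrix. A minimal sketch with hypothetical labels; it assumes at least one predicted positive and one actual positive, so no denominator is zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical predictions
print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
print(classification_metrics(y_true, y_pred))
```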

Model Evaluation – Anomaly Detection
• True Positive Rate (TPR)
• False Positive Rate (FPR)
• Precision
• F1-score
• Receiver Operating Characteristic (ROC) curve
• Area Under the ROC Curve (AUC)
• Confusion Matrix
• Precision-Recall (PR) Curve
• Average Precision (AP)

Model Evaluation – Reinforcement
• Rewards
• Average Reward per Episode
• Value Function
• Policy Gradient
• Exploration-Exploitation Trade-off
• Convergence Time
• Success Rate
