Module 2
Machine Learning Activities
Understand the type of data in the given input data set.
Explore the data to understand the nature and quality.
Explore the relationships amongst the data elements.
Find potential issues in data.
Apply the necessary remediation (impute missing data
values, etc.).
Activity cont...
Apply pre-processing steps.
The input data is first divided into two parts (the training data
and the test data).
Consider different models or learning algorithms for selection.
For a supervised learning problem, train the model on the
training data and apply it to unknown data.
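The split-train-apply sequence above can be sketched as follows. This is a minimal illustration using only the standard library; the toy data, the 25% test fraction, and the 1-nearest-neighbour rule are assumptions chosen for brevity, not something the slides prescribe.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, x):
    """Predict the label of x as the label of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Hypothetical labelled data: (feature value, class label)
data = [(1.0, "low"), (1.2, "low"), (0.8, "low"), (1.1, "low"),
        (5.0, "high"), (5.3, "high"), (4.9, "high"), (5.1, "high")]

train, test = train_test_split(data)

# Apply the "trained" model (here, simply the stored training set)
# to data it has not seen.
predictions = [predict_1nn(train, x) for x, _ in test]
print(list(zip(test, predictions)))
```

The test records play the role of unknown data: they are never consulted when a prediction is made.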
Activity cont...
For an unsupervised learning problem, apply the chosen
unsupervised model directly to the input data.
Basic Data Types
Data can be categorized into 4 basic
types from a Machine Learning
perspective: numerical data, categorical
data, time series data, and text.
Numerical and Categorical Data
Numerical Data
Numerical data is any data
where data points are exact
numbers. Statisticians may also
call numerical data
quantitative data.
Exploring Numerical Data
There are two major types of plot used to
explore numerical data:
•Box plot
•Histogram
Exploring Cont...
Understanding Central Tendency:
To understand the nature of the data (numeric variables), we
apply measures of central tendency.
Mean: The sum of all data values divided by the count of all
data elements.
Median: The middle value of the sorted data. The median splits the
data set in half.
Mode: The most frequently occurring value in the data set.
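The three measures above can be computed with Python's standard `statistics` module. The sample values below are made up for illustration.

```python
import statistics

# Hypothetical sample of a numeric attribute
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum (30) divided by count (6)
median = statistics.median(values)  # average of the two middle values, 3 and 5
mode = statistics.mode(values)      # 3 occurs twice, every other value once

print(mean, median, mode)
```

Note that the mean (5) and the median (4) differ here because the largest value pulls the mean upward.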
Exploring Cont...
Measuring the Dispersion of Data (Range, Quartiles, Interquartile
Range):
Let x1, x2, ..., xN be a set of observations for some numeric attribute X.
Range: The difference between the largest (max()) and
the smallest (min()) values of the set.
Quartiles: Points taken at regular intervals of the data distribution,
dividing it into four essentially equal-size consecutive sets.
Interquartile range (IQR): The distance between the first and third
quartiles, a measure of spread that gives the range covered by the
middle half of the data.
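A short sketch of these dispersion measures, using made-up values. Several conventions exist for interpolating quartiles; the `method="inclusive"` option of `statistics.quantiles` is just one of them and is an assumption here.

```python
import statistics

# Hypothetical observations x1, ..., xN
values = [1, 3, 5, 7, 9, 11, 13, 15, 17]

data_range = max(values) - min(values)

# n=4 asks for the three cut points that divide the data into quarters
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

iqr = q3 - q1  # spread of the middle half of the data

print(data_range, q1, q2, q3, iqr)
```

With other quartile conventions the cut points can differ slightly, especially for small samples.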
Variance and Standard Deviation
These are measures of data dispersion. They indicate
how spread out a data distribution is.
A low standard deviation means that the data
observations tend to be very close to the mean, while a
high standard deviation indicates that the data are
spread out over a large range of values.
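To make the contrast concrete, the sketch below compares two made-up samples that share the same mean but differ in spread. It uses the population forms (`pvariance`, `pstdev`), which is an assumption; the sample forms (`variance`, `stdev`) divide by N - 1 instead of N.

```python
import statistics

# Two hypothetical samples, both with mean 10
tight = [9, 10, 10, 11]
wide = [0, 5, 15, 20]

tight_var = statistics.pvariance(tight)  # mean squared deviation from the mean
wide_var = statistics.pvariance(wide)

tight_std = statistics.pstdev(tight)     # square root of the variance
wide_std = statistics.pstdev(wide)

print(tight_var, tight_std)
print(wide_var, wide_std)
```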
Categorical Data
Categorical data represents
characteristics, such as a hockey
player’s position, team, or hometown.
Time Series Data
Time series data is a
sequence of numbers
collected at regular
intervals over some
period of time.
Text Data
Text data is basically just words.
Relationship between variables
Scatter plots and two-way cross-tabulation can be
used effectively to explore relationships between variables.
Scatter plot: A graph in which the values of two variables are
plotted along two axes, the pattern of the resulting points
revealing any correlation present.
Relationship Cont...
Two-way cross-tabulation: Also known as a cross-tab, it is
used to understand the relationship between two categorical
attributes in a concise way.
It has a matrix format that presents a summarized view of the
bivariate frequency distribution. Much like a scatter plot,
it helps to understand how the data values of one attribute
change with the data values of another attribute.
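A cross-tab is a table of joint counts, so it can be sketched with `collections.Counter`. The hockey-player records and the `cross_tab` helper below are hypothetical, chosen to echo the categorical-data example earlier in the deck.

```python
from collections import Counter

# Hypothetical records: (position, team) for six hockey players
records = [("forward", "A"), ("forward", "B"), ("defense", "A"),
           ("goalie", "B"), ("forward", "A"), ("defense", "B")]

counts = Counter(records)                        # joint frequency of each pair
rows = sorted({pos for pos, _ in records})       # distinct positions
cols = sorted({team for _, team in records})     # distinct teams

def cross_tab(counts, rows, cols):
    """Return the frequency matrix with one row per position, one column per team."""
    return [[counts[(r, c)] for c in cols] for r in rows]

table = cross_tab(counts, rows, cols)

print(" " * 8, *cols)
for name, row in zip(rows, table):
    print(f"{name:<8}", *row)
```

Each cell holds the number of records that share that row and column category.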
Data Issues
Every day we generate a tremendous amount of
data. Dealing with data at this scale is much more complicated
than dealing with small data sets.
Real-world databases are highly susceptible to noisy,
missing, and inconsistent data due to their typically huge
size (often several gigabytes or more) and their likely origin
from multiple, heterogeneous sources.
Issues cont...
Inaccurate, incomplete, and inconsistent data are commonplace
properties of large real-world databases and warehouses.
Main reasons for inaccurate data (that is, data having
incorrect attribute values):
• The data collection instruments used may be faulty.
• There may have been human or computer errors
occurring at data entry.
Issues cont...
• Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit
personal information.
• Errors in data transmission can also occur.
• Inconsistent formats for input fields.
Remedies
Handling Outliers: Outliers are data elements with an
abnormally high or low value, which may impact prediction accuracy.
• Remove outliers: If only a few records contain outliers,
the simplest remedy is to remove those records.
• Imputation: Replace the outlying values with the mean, median, or mode.
• Capping: For values that lie outside the 1.5 × IQR limits (below
Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR), we can
cap them by replacing observations below the lower limit
with the value of the 5th percentile and those above the upper
limit with the value of the 95th percentile.
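The capping rule above can be sketched as follows. The `cap_outliers` helper and the sample values are hypothetical, and the quartile and percentile interpolation (`method="inclusive"`) is one convention among several.

```python
import statistics

def cap_outliers(values):
    """Cap values outside the 1.5 x IQR limits at the 5th / 95th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # n=20 yields 19 cut points in steps of 5%: first is the 5th, last the 95th
    percentiles = statistics.quantiles(values, n=20, method="inclusive")
    p5, p95 = percentiles[0], percentiles[-1]

    return [p5 if v < lower else p95 if v > upper else v for v in values]

# Hypothetical attribute: twenty ordinary values and one extreme value
values = list(range(10, 30)) + [200]
capped = cap_outliers(values)

print(capped)
```

On very small samples the 95th percentile can itself be pulled toward the outlier, so the rule works best when outliers are a small share of the data.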
Remedies Cont...
Handling Missing Values:
• Eliminate records that have missing values.
• Impute missing values using the mean, median, or mode.
• Fill in the missing value manually.
• Use a global constant to fill in the missing value.
• Use the most probable value to fill in the missing value.
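Two of the remedies listed above, elimination and mean/median/mode imputation, can be sketched like this. Representing a missing value as `None` and the `impute` helper are assumptions made for the example.

```python
import statistics

def drop_missing(values):
    """Eliminate entries whose value is missing (None)."""
    return [v for v in values if v is not None]

def impute(values, strategy="mean"):
    """Replace missing entries with the mean, median, or mode of the observed ones."""
    functions = {"mean": statistics.mean,
                 "median": statistics.median,
                 "mode": statistics.mode}
    fill = functions[strategy](drop_missing(values))
    return [fill if v is None else v for v in values]

# Hypothetical attribute with two missing entries
ages = [20, None, 30, 30, None, 60]

print(drop_missing(ages))
print(impute(ages, "mean"))
print(impute(ages, "median"))
```

The median and mode are less sensitive than the mean to extreme observed values, which matters when the attribute also contains outliers.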
Major tasks in pre-processing
Data cleaning: Routines that “clean” the data by filling
in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies.
Data integration: Integrating data from different sources.
Pre Processing Cont...
Data Transformation: It is the process of converting data
from one format to another.
Data reduction: obtains a reduced representation of the
data set that is much smaller in volume, yet produces the
same (or almost the same) analytical results. Data
reduction strategies include dimensionality reduction and
numerosity reduction.
Model
Abstraction is a significant step as it represents raw input
data in a summarized and structured format, such that a
meaningful insight is obtained from the data. This
structured representation of raw input data as a
meaningful pattern is called a model.
Model Selection
Models for supervised learning try to predict certain values
using the input data set.
Models for unsupervised learning are used to describe a data
set or gain insight from a data set.
Model Training
The process of choosing a model and fitting that specific
model to a data set is called model training.
Bias: If the outcome of a model is systematically incorrect,
the learning is said to have a bias.
Model Representation &
Interpretability
Fitness of a target function approximated by a learning
algorithm determines how correctly it is able to classify a
set of data it has never seen.
Underfitting:
If the target function is kept too simple, it may not be able to
capture the essential nuances and represent the underlying
data well. This is known as underfitting.
Model Representation &
Interpretability Cont...
Overfitting:
Overfitting occurs when the model has been designed in such a way
that it emulates the training data too closely. In such a case any
specific nuance in the training data, like noise or outliers,
gets embedded in the model. This adversely impacts the
performance of the model on the test data.
Model Representation &
Interpretability Cont...
Bias and Variance (supervised learning):
Errors due to bias arise from simplifying assumptions made
by the model, whereas errors due to variance arise from
over-aligning the model with the training data set.
Training a model.
Model evaluation aims to estimate the generalization
accuracy of a model on future data.
There are two methods for evaluating a model's
performance:
• Holdout
• Cross-validation
Training a model
Holdout: It tests a model on different data than it was
trained on. In this method the data set is divided into three
subsets:
• Training set: is a subset of the dataset used to build
predictive models.
• Validation set: is a subset of the dataset used to assess
the performance of the model built in the training phase.
Training a model cont...
• Test set(unseen data): is a subset of the dataset used to
assess the likely future performance of a model.
The holdout approach is useful because of its speed,
simplicity, and flexibility.
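A holdout split into the three subsets described above might look like the sketch below. The 60/20/20 proportions, the seed, and the `holdout_split` helper are assumptions; the slides do not fix any particular ratio.

```python
import random

def holdout_split(records, train=0.6, validation=0.2, seed=42):
    """Shuffle the records and cut them into train / validation / test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],                   # used to build the model
            shuffled[n_train:n_train + n_val],    # used to assess and tune it
            shuffled[n_train + n_val:])           # held back as unseen data

# Hypothetical data set of 100 record identifiers
records = list(range(100))
train_set, validation_set, test_set = holdout_split(records)

print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before cutting keeps any ordering in the original data from leaking into one subset.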
Training a Model cont...
Cross-Validation: It partitions the original observation
dataset into a training set, used to train the model, and an
independent set, used to evaluate the analysis.
The most common cross-validation technique is k-fold
cross-validation, in which the original dataset is partitioned into k
equal-size subsamples, called folds.
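The fold construction can be sketched as an index generator. In k-fold cross-validation each fold serves once as the evaluation set while the remaining k - 1 folds form the training set; the `k_fold_indices` helper below is a hypothetical illustration that omits shuffling for clarity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Hypothetical data set of 10 samples split into 5 folds
folds = list(k_fold_indices(10, 5))

for train, test in folds:
    print("train:", train, "test:", test)
```

Averaging the evaluation score over the k rounds gives a steadier estimate than a single holdout split.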
Training a Model cont...
Bootstrap sampling: It is a popular way to identify training
and test data sets from the input data set. It uses the
technique of Simple Random Sampling with
Replacement (SRSWR). Bootstrapping randomly picks data
instances from the input data set, so the
same data instance may be picked multiple times.
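Sampling with replacement can be sketched as below. Using the instances that were never picked (the "out-of-bag" instances) as the test set is a common convention and an assumption here; the slides do not say how the test set is formed.

```python
import random

def bootstrap_sample(records, seed=0):
    """Draw len(records) instances with replacement; return (train, out_of_bag)."""
    rng = random.Random(seed)
    n = len(records)
    picked = [rng.randrange(n) for _ in range(n)]   # indices may repeat
    picked_set = set(picked)
    train = [records[i] for i in picked]
    out_of_bag = [records[i] for i in range(n) if i not in picked_set]
    return train, out_of_bag

# Hypothetical data set of ten distinct instances
records = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"]
train, out_of_bag = bootstrap_sample(records)

print(train)
print(out_of_bag)
```

Because of the replacement, the training sample has the same size as the input but usually contains duplicates.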
Evaluating performance of a model.
Classification Accuracy: Accuracy is a common evaluation
metric for classification problems. It's the number of correct
predictions made as a ratio of all predictions made.
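The accuracy ratio described above is simple to compute directly. The labels below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical true labels and model predictions
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

score = accuracy(y_true, y_pred)  # 3 of the 5 predictions are correct
print(score)
```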
Cross-validation techniques can also be used to compare the
performance of different machine learning models on the same
data set, and to help select the values of a
model's parameters that maximize its accuracy,
a process also known as parameter tuning.
Evaluating performance of a model.
Confusion Matrix: It provides a more detailed breakdown of
correct and incorrect classifications for each class.
Logarithmic Loss (log loss): Measures the performance of a
classification model whose predictions are probability
values between 0 and 1.
Area Under Curve (AUC): A performance metric for
measuring the ability of a binary classifier to discriminate
between positive and negative classes.
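For a binary problem, log loss is the negative mean of y·log(p) + (1 - y)·log(1 - p), where y is the true label (0 or 1) and p the predicted probability of class 1. The sketch below implements that formula; the clipping constant `eps` and the sample predictions are assumptions.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0]

confident = log_loss(y_true, [0.9, 0.1])   # confident and correct
unsure = log_loss(y_true, [0.5, 0.5])      # no better than a coin flip

print(confident, unsure)
```

Lower is better: predictions that are both confident and correct are rewarded, while confident mistakes are penalized heavily.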
Evaluating performance of a model.
F-Measure: A measure of a test's accuracy that
considers both the precision and the recall of the test to
compute the score.
Precision is the number of correct positive results divided
by the total number of predicted positive observations.
Recall is the number of correct positive results divided by the
number of all relevant (actually positive) samples.
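These definitions can be sketched by first tallying the four cells of a binary confusion matrix. The balanced form of the F-measure (F1, the harmonic mean of precision and recall) is assumed here, and the labels are made up.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1          # correctly predicted positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif t == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # correctly predicted negative
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(tp, fp, fn, tn)
print(precision, recall, f1)
```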
Feature Engineering
A feature is an attribute of a data set that is used in
the machine learning process.
Feature engineering is an important pre-processing step
for machine learning. It has two major elements:
• Feature transformation
• Feature sub-set selection
Feature Engineering cont...
Feature Transformation: It transforms data into a new set of
features which can represent the underlying machine learning
problem.
• Feature Construction
• Feature Extraction
The feature construction process discovers missing information
about the relationships between features and augments
the feature space by creating additional features.
Feature Engineering cont...
Feature Extraction: The process of extracting or
creating a new set of features from the original set of
features using some functional mapping.
Examples: Principal Component Analysis (PCA),
Singular Value Decomposition (SVD),
Linear Discriminant Analysis (LDA).
Thank You

Machine learning module 2

  • 3. Machine Learning Activities Understand the type of data in the given input data set. Explore the data to understand its nature and quality. Explore the relationships among the data elements. Find potential issues in the data. Apply the necessary remediations (impute missing data values, etc.).
  • 4. Activity cont... Apply pre-processing steps. The input data is first divided into parts (the training data and the testing data). Consider different models or learning algorithms for selection. For a supervised learning problem, train the model on the training data and then apply it to unknown data.
  • 5. Activity cont... Directly apply the chosen unsupervised model on the input data for unsupervised learning problem.
  • 6. Basic Data Types Data can be categorized into 4 basic types from a Machine Learning perspective: numerical data, categorical data, time series data, and text.
  • 8. Numerical Data Numerical data is any data where data points are exact numbers. Statisticians may also call numerical data quantitative data.
  • 9. Exploring Numerical Data There are two major plotting methods for exploring numerical data: • Box plot • Histogram
  • 10. Exploring Cont... Understanding Central Tendency: To understand the nature of numeric variables, we apply measures of central tendency. Mean: the sum of all data values divided by the count of data elements. Median: the middle value of the sorted data; it splits the data set into two halves. Mode: the most frequently occurring value in the data set.
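
The three measures of central tendency can be illustrated with a minimal sketch using Python's standard `statistics` module. The `values` list is made-up illustrative data, not taken from the slides.

```python
import statistics

# Illustrative data (an assumption for this sketch).
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of all values divided by their count
median = statistics.median(values)  # middle value; average of the two middle values for an even count
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)
```

Note that for an even number of observations the median falls between the two middle values, so it need not be a value that appears in the data.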
  • 11. Exploring Cont... Measuring the Dispersion of Data (Range, Quartiles, Interquartile Range): Let x1, x2, ..., xN be a set of observations for some numeric attribute X. Range: the difference between the largest (max()) and the smallest (min()) values. Quartiles: points taken at regular intervals of the data distribution, dividing it into four essentially equal-sized consecutive sets. Interquartile range: the distance between the first and third quartiles; it is a measure of spread that gives the range covered by the middle half of the data.
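
A minimal sketch of range, quartiles, and interquartile range, using the standard `statistics.quantiles` function. The data and the choice of the `"inclusive"` interpolation method are assumptions; other quartile conventions give slightly different cut points.

```python
import statistics

# Illustrative data (an assumption for this sketch).
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

value_range = max(values) - min(values)

# n=4 returns the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The interquartile range covers the middle half of the data.
iqr = q3 - q1

print(value_range, q1, q2, q3, iqr)
```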
  • 12. Variance and Standard Deviation These are measures of data dispersion; they indicate how spread out a data distribution is. A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
  • 13. Categorical Data Categorical data represents characteristics, such as a hockey player’s position, team, or hometown.
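
To make the contrast concrete, here is a small sketch comparing two made-up data sets with the same mean but different spread. It uses the population variance and standard deviation (`pvariance`, `pstdev`); using the sample versions (`variance`, `stdev`) instead would be an equally valid choice.

```python
import statistics

# Two illustrative data sets with the same mean (assumed data).
tight = [9, 10, 10, 11]
wide = [1, 5, 15, 19]

for name, data in (("tight", tight), ("wide", wide)):
    var = statistics.pvariance(data)  # mean of squared deviations from the mean
    std = statistics.pstdev(data)     # square root of the variance
    print(name, statistics.mean(data), var, round(std, 3))
```

Both lists average 10, but the second has a far larger standard deviation because its values lie much farther from the mean.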
  • 14. Time Series Data Time series data is a sequence of numbers collected at regular intervals over some period of time.
  • 15. Text Data Text data is basically just words.
  • 16. Relationship between variables Scatter plots and two-way cross-tabulation can be used effectively. Scatter plot: a graph in which the values of two variables are plotted along two axes, the pattern of the resulting points revealing any correlation present.
  • 17. Relationship Cont... Two-way cross-tabulation: also known as a cross-tab, it is used to understand the relationship between two categorical attributes in a concise way. It has a matrix format that presents a summarized view of the bivariate frequency distribution. Much like a scatter plot, it helps show how the values of one attribute change with the values of another attribute.
  • 18. Data Issues Day by day we are generating a tremendous amount of data. Dealing with big data is much more complicated. Real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
  • 19. Issues cont... Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and warehouses. Main reasons for inaccurate data: • Having incorrect attribute values. • The data collection instruments used may be faulty. • There may have been human or computer errors occurring at data entry.
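
A two-way cross-tab can be sketched with nothing more than a counter over pairs of categories. The records below are hypothetical (position, team) pairs invented for illustration.

```python
from collections import Counter

# Hypothetical records: (position, team) for a handful of players.
records = [
    ("forward", "Red"),
    ("forward", "Red"),
    ("forward", "Blue"),
    ("defense", "Red"),
    ("defense", "Blue"),
    ("goalie", "Blue"),
]

counts = Counter(records)
positions = sorted({p for p, _ in records})
teams = sorted({t for _, t in records})

# crosstab[position][team] = number of records with that combination.
crosstab = {p: {t: counts[(p, t)] for t in teams} for p in positions}

# Print the matrix with one row per position and one column per team.
print("position".ljust(10) + "".join(t.rjust(6) for t in teams))
for p in positions:
    print(p.ljust(10) + "".join(str(crosstab[p][t]).rjust(6) for t in teams))
```

Each cell holds the joint frequency of one category pair, which is the bivariate frequency distribution the slide refers to.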
  • 20. Issues cont... • Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information. • Errors in data transmission can also occur. • Inconsistent formats for input fields.
  • 21. Remedies Handling Outliers: Outliers are data elements with an abnormally high or low value, which may impact prediction accuracy. • Remove outliers: if only a few records contain outliers, the simplest remedy is to remove them. • Imputation: replace the outlying values with the mean, median, or mode. • Capping: for values that lie outside the 1.5 × IQR limits, replace observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile.
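
The capping remedy can be sketched as follows. This assumes the limits are the conventional fences Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, and that percentiles are computed with the `"inclusive"` method of `statistics.quantiles`; the function name `cap_outliers` and the data are hypothetical.

```python
import statistics

def cap_outliers(values):
    """Cap values outside the 1.5 x IQR fences at the 5th/95th percentiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # n=20 yields cut points at 5%, 10%, ..., 95%.
    percentiles = statistics.quantiles(values, n=20, method="inclusive")
    p5, p95 = percentiles[0], percentiles[-1]

    capped = []
    for v in values:
        if v < lower:
            capped.append(p5)    # too low: replace with the 5th percentile
        elif v > upper:
            capped.append(p95)   # too high: replace with the 95th percentile
        else:
            capped.append(v)     # within the fences: keep as is
    return capped

# Illustrative data with one obvious high outlier (assumed).
values = [10, 12, 12, 13, 14, 15, 15, 16, 18, 95]
capped = cap_outliers(values)
print(capped)
```

On a data set this small the 95th percentile is itself pulled upward by the outlier, so capping softens the extreme value rather than removing its influence entirely.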
  • 22. Remedies Cont... Handling Missing Values: • Eliminate records having a missing value of data elements. • Imputing missing values using mean/median/mode. • Fill the missing value manually. • Use the global constant to fill the missing value. • Use the most probable value to fill in the missing value.
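
Mean, median, or mode imputation can be sketched in a few lines. Here missing values are represented by `None`; that representation, the function name `impute`, and the data are assumptions made for illustration.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    estimators = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = estimators[strategy](observed)
    return [fill if v is None else v for v in values]

# Illustrative column with two missing entries (assumed data).
ages = [20, None, 30, 30, None, 60]

print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

The fill value is computed from the observed entries only, so the missing markers never distort the statistic used to replace them.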
  • 23. Major tasks in pre-processing Data cleaning: routines that “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Data integration: integrating data from different sources.
  • 25. Pre Processing Cont... Data Transformation: It is the process of converting data from one format to another. Data reduction: obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.
  • 26. Model Abstraction is a significant step as it represents raw input data in a summarized and structured format, such that a meaningful insight is obtained from the data. This structured representation of raw input data to the meaningful pattern is called a Model.
  • 27. Model Selection Models for supervised learning try to predict certain values using the input data set. Models for unsupervised learning are used to describe a data set or gain insight from it.
  • 28. Model Training The process of choosing a model and fitting it to a data set is called model training. Bias: If the outcome of a model is systematically incorrect, the learning is said to have a bias.
  • 29. Model Representation & Interpretability The fitness of a target function approximated by a learning algorithm determines how correctly it is able to classify a set of data it has never seen. Underfitting: If the target function is kept too simple, it may not be able to capture the essential nuances and represent the underlying data well. This is known as underfitting.
  • 30. Model Representation & Interpretability Cont... Overfitting: This occurs when the model has been designed in such a way that it emulates the training data too closely. In such a case any specific nuance in the training data, like noise or outliers, gets embedded in the model. It adversely impacts the performance of the model on the test data.
  • 31. Model Representation & Interpretability Cont... Bias and Variance (supervised learning): Errors due to bias arise from simplifying assumptions made by the model, whereas errors due to variance arise from over-aligning the model with the training data sets.
  • 32. Training a model. Model evaluation aims to estimate the generalization accuracy of a model on future data. There are two methods for evaluating a model's performance: • Holdout • Cross-validation
  • 33. Training a model Holdout: It tests a model on different data than it was trained on. In this method the data set is divided into three subsets: • Training set: is a subset of the dataset used to build predictive models. • Validation set: is a subset of the dataset used to assess the performance of the model built in the training phase.
  • 34. Training a model con... • Test set (unseen data): is a subset of the dataset used to assess the likely future performance of a model. The holdout approach is useful because of its speed, simplicity, and flexibility.
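
A holdout split into training, validation, and test sets might look like the sketch below. The 60/20/20 proportions, the fixed seed, and the function name `holdout_split` are assumptions; the slides do not specify split sizes.

```python
import random

def holdout_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the data and cut it into train, validation, and test subsets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # shuffle a copy, reproducibly

    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)

    train = items[:n_train]
    validation = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # whatever remains is held out as unseen data
    return train, validation, test

# Illustrative data set of 20 record ids (assumed).
train, validation, test = holdout_split(range(20))
print(len(train), len(validation), len(test))
```

Every record lands in exactly one subset, so the model is never evaluated on data it was trained on.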
  • 35. Training a Model con.. Cross-Validation: It partitions the original observation dataset into a training set, used to train the model, and an independent set used to evaluate the analysis. The most common cross-validation technique is K-fold cross-validation, in which the original dataset is partitioned into k equal-sized subsamples, called folds.
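
The fold partitioning can be sketched as a generator of index lists. This version keeps the original order and does not shuffle, which is a simplifying assumption; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each observation is used for testing exactly once and for training in the other k − 1 rounds.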
  • 36. Training a Model con.. Bootstrap sampling: It is a popular way to identify training and test data sets from the input data set. It uses the technique of Simple Random Sampling with Replacement (SRSWR). Bootstrapping randomly picks data instances from the input data set, so the same data instance may be picked multiple times.
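
Sampling with replacement can be sketched as below. Using the instances that were never picked (the "out-of-bag" instances) as the test set is a common convention and an assumption here, as are the function name `bootstrap_split` and the seed.

```python
import random

def bootstrap_split(data, seed=0):
    """Draw len(data) instances with replacement; unpicked instances form the test set."""
    rng = random.Random(seed)
    n = len(data)

    # Simple random sampling with replacement: an index may repeat.
    picked = [rng.randrange(n) for _ in range(n)]
    train = [data[i] for i in picked]

    chosen = set(picked)
    test = [data[i] for i in range(n) if i not in chosen]
    return train, test

data = list(range(10))  # illustrative record ids (assumed)
train, test = bootstrap_split(data)
print("train:", train)
print("test: ", test)
```

The training sample has the same size as the input but usually contains duplicates, which is what distinguishes it from a plain holdout split.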
  • 37. Evaluating performance of a model. Classification Accuracy: Accuracy is a common evaluation metric for classification problems. It is the number of correct predictions made as a ratio of all predictions made. Cross-validation techniques can also be used to compare the performance of different machine learning models on the same data set, and can help in selecting the values of a model's parameters that maximize its accuracy, which is also known as parameter tuning.
  • 38. Evaluating performance of a model. Confusion Matrix: It provides a more detailed breakdown of correct and incorrect classifications for each class. Logarithmic Loss (logloss): measures the performance of a classification model where the prediction is a probability value between 0 and 1. Area Under Curve (AUC): a performance metric for measuring the ability of a binary classifier to discriminate between positive and negative classes.
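
Accuracy, the confusion matrix, and log loss can be sketched for a binary problem as follows. The labels and predictions are made up, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math
from collections import Counter

# Illustrative true labels and predicted labels (assumed data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: correct predictions as a ratio of all predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix as counts keyed by (true label, predicted label).
confusion = Counter(zip(y_true, y_pred))

def log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log-likelihood of the true labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print("accuracy:", accuracy)
print("TP, FN, FP, TN:", confusion[(1, 1)], confusion[(1, 0)], confusion[(0, 1)], confusion[(0, 0)])
print("log loss:", log_loss([1, 0], [0.9, 0.1]))
```

Log loss rewards confident correct probabilities and penalizes confident wrong ones, so it distinguishes between models that accuracy alone would rate the same.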
  • 39. Evaluating performance of a model. F-Measure: a measure of a test's accuracy that considers both the precision and the recall of the test to compute the score. Precision is the number of correct positive results divided by the total number of predicted positive observations. Recall is the number of correct positive results divided by the number of all relevant (actually positive) samples.
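
A minimal sketch of precision, recall, and the F-measure, assuming the balanced variant (F1, the harmonic mean of precision and recall). The function name and the data are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correct positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed positives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative labels: 4 actual positives, 3 predicted positives (assumed data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot score well on F1 by maximizing precision or recall alone.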
  • 40. Feature Engineering A feature is an attribute of a data set that is used in a machine learning process. Feature engineering is an important pre-processing step for machine learning, with two major elements: • Feature transformation • Feature sub-set selection
  • 41. Feature Engineering cont... Feature Transformation: It transforms data into a new set of features which can represent the underlying machine learning problem. • Feature Construction • Feature Extraction Feature construction process discovers missing information about the relationships between features and augments.
  • 42. Feature Engineering cont... Feature Extraction: the process of extracting or creating a new set of features from the original set of features using some functional mapping. Examples: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Linear Discriminant Analysis (LDA).