3. Drawback
The division of data from different classes into the training and test sets may not be proportionate. This problem can be solved by applying stratified random sampling: the whole data set is broken into several homogeneous groups, or strata, and a random sample is selected from each stratum. This ensures that the generated random partitions have equal proportions of each class.
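As a minimal sketch (assuming scikit-learn is available; the data set is an illustrative stand-in), stratification can be requested directly when creating a holdout split:

# A minimal sketch of a stratified holdout split, assuming scikit-learn;
# X, y are stand-in names for the features and class labels.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y draws the split so that each class appears in the training
# and test partitions in approximately the same proportion as in the
# full data set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)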
4. K-fold Cross-validation method
A special variant of the holdout method is called repeated holdout. In this technique, several random holdouts are used to measure the model performance, and in the end the average of all performances is taken. As multiple holdouts have been drawn, the training and test data are more likely to contain representative data from all classes and to resemble the original input data closely.
This process of repeated holdout is the basis of the k-fold cross-validation technique. In k-fold cross-validation, the data set is divided into k completely distinct, non-overlapping random partitions called folds.
6. The value of 'k' in k-fold cross-validation can be set to any number. However, two approaches are extremely popular:
1. 10-fold cross-validation (10-fold CV)
2. Leave-one-out cross-validation (LOOCV)
10-fold cross-validation approach:
In this approach, the data is divided into 10 folds, each comprising approximately 10% of the data. In each iteration, one of the folds is used as the test data for validating the model trained on the remaining 9 folds (90% of the data). This is repeated 10 times, once for each of the 10 folds being used as the test data while the remaining folds serve as the training data.
7. The average performance across all folds is reported.
The figure depicts the detailed approach of selecting the 'k' folds in k-fold cross-validation. As can be observed in the figure, each of the circles resembles a record in the input data set, whereas the different colours indicate the different classes the records belong to. The entire data set is broken into 'k' folds, out of which one fold is selected in each iteration as the test data set; the fold selected as the test data set is different in each of the 'k' iterations. Although the circles representing a fold appear contiguous, they are not consecutive records in the data set: the records in a fold are drawn using a random sampling technique.
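A minimal sketch of 10-fold cross-validation, assuming scikit-learn; the data set and classifier are illustrative stand-ins:

# A sketch of 10-fold cross-validation using scikit-learn (assumed
# available); the data set and classifier are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Each of the 10 folds serves once as the test set while the model is
# trained on the remaining 9 folds; shuffling draws the folds randomly.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("per-fold accuracy:", scores)
print("average accuracy: %.3f" % scores.mean())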
9. Bootstrap sampling: It is used to identify training and test data sets from the input data set. It uses the technique of Simple Random Sampling with Replacement (SRSWR). This technique randomly picks data instances from the input data set, with the possibility of the same data instance being picked multiple times. This means that from an input data set having 'n' data instances, bootstrapping can create one or more training data sets also having 'n' data instances, with some of the data instances repeated multiple times. This technique is useful in the case of input data sets of small size, i.e. having very few data instances.
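A small sketch of bootstrap sampling, assuming scikit-learn's resample helper; the toy data set is illustrative:

# A sketch of bootstrap sampling (simple random sampling with
# replacement), using scikit-learn's resample helper (assumed available).
import numpy as np
from sklearn.utils import resample

data = np.arange(10)          # a toy data set with n = 10 instances

# Draw a bootstrap sample of the same size n; some instances repeat.
boot = resample(data, replace=True, n_samples=len(data), random_state=0)
# Instances never drawn form a natural "out-of-bag" test set.
oob = np.setdiff1d(data, boot)
print("bootstrap sample:", boot)
print("out-of-bag instances:", oob)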
Leave-one-out cross-validation (LOOCV) is an extreme case of k-fold cross-validation, using one record or data instance at a time as the test data. This is done to maximize the amount of data used to train the model. It is obvious that the number of iterations for which it has to be run is equal to the total number of data instances in the input data set. Hence, it is computationally very expensive and not used much in practice.
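For completeness, a sketch of LOOCV with scikit-learn (assumed available); note that the number of iterations equals the number of data instances:

# A sketch of leave-one-out cross-validation, assuming scikit-learn;
# every data instance serves exactly once as the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print("iterations:", len(scores))   # equals the number of instances
print("LOOCV accuracy: %.3f" % scores.mean())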
12. Lazy vs. Eager learner
• Eager learning:
It tries to construct an input-independent target function during the model training phase, and comes up with a trained model at the end of the learning phase. Hence, when the test data comes in for classification, the eager learner is ready with the model and doesn't need to refer back to the training data. It takes more time in the learning phase than the lazy learners. Some of the algorithms which adopt the eager learning approach include Decision Tree, Support Vector Machine, Neural Network, etc.
13. Lazy learning:
It doesn't 'learn' anything. It uses the training data exactly as given, and uses that knowledge to classify the unlabelled test data. Since lazy learning uses the training data as-is, it is also known as rote learning (i.e. a memorization technique based on repetition). Due to its heavy dependency on the given training data instances, it is also known as instance learning, and such learners are also called non-parametric. Lazy learners take very little time in training because not much training actually happens. One of the most popular algorithms for lazy learning is k-nearest neighbour.
14. MODEL REPRESENTATION AND INTERPRETABILITY
• The goal of supervised machine learning is to learn or derive a target function which can best determine the target variable from the set of input variables.
• A key consideration in learning the target function from the training data is the extent of generalization. This is because the input data is just a limited, specific view, and the new, unknown data in the test data set may differ quite a bit from the training data.
• The fitness of a target function approximated by a learning algorithm determines how correctly it is able to classify a set of data it has never seen.
15. Underfitting
• If the target function is kept too simple, it may not be able to capture the essential nuances and represent the underlying data well.
• A typical case of underfitting may occur when trying to represent non-linear data with a linear model, as demonstrated by both cases of underfitting shown in the figure.
• Many times underfitting happens due to the unavailability of sufficient training data. Underfitting results in both poor performance on the training data and poor generalization to the test data. Underfitting can be avoided by:
1. using more training data
2. reducing features by effective feature selection
17. Overfitting
The model has been designed in such a way that it matches the training data too closely, which has an impact on the performance of the model on the test data. Overfitting occurs as a result of trying to fit an excessively complex model to closely match the training data. The target function tries to make sure all training data points are correctly partitioned by the decision boundary. This exact nature is not replicated in the unknown test data set; hence, the target function results in wrong classifications on the test data set. Overfitting results in good performance with the training data set, but poor performance with the test data set. Overfitting can be avoided by:
1. using resampling techniques like k-fold cross-validation
2. holding back a validation data set
3. removing the nodes which have little or no predictive power for the given machine learning problem.
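The contrast can be made concrete with a small sketch, assuming scikit-learn; the noisy data and the polynomial degrees 1, 4 and 15 are illustrative choices:

# A sketch contrasting underfitting and overfitting on noisy non-linear
# data, assuming scikit-learn; the degrees tried are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()
    # Typically degree 1 scores poorly on both (underfitting), while
    # degree 15 scores well on the training data but worse on the
    # held-out folds (overfitting).
    print("degree %2d  train R^2 %.2f  CV R^2 %.2f" % (degree, train_r2, cv_r2))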
18. Bias – variance trade-off
• In supervised learning, the class value assigned by the learning model built based on the training data may differ from the actual class value. This error in learning can be of two types – errors due to 'bias' and errors due to 'variance'.
Errors due to 'bias':
These arise from simplifying assumptions made by the model to make the target function less complex or easier to learn; in short, they are due to underfitting of the model. Parametric models generally have high bias, making them easier to understand/interpret and faster to learn. These algorithms have poor performance on data sets which are complex in nature and do not align with the simplifying assumptions made by the algorithm. Underfitting results in high bias.
19. Errors due to 'variance'
• Errors due to variance arise from differences in the training data sets used to train the model.
• Ideally, the difference between the training data sets should not be significant, and models trained using different training data sets should not be too different.
• However, in the case of overfitting, since the model closely matches the training data, even a small difference in the training data gets overstated in the model.
21. • So, the problems in training a model can happen because either (a) the model is too simple and hence interprets the data only at a gross level, or (b) the model is extremely complex and magnifies even small differences in the training data.
• Increasing the bias will decrease the variance, and increasing the variance will decrease the bias.
• Parametric algorithms demonstrate high bias but low variance.
• Non-parametric algorithms demonstrate low bias and high variance.
• The goal of supervised machine learning is to achieve a balance between bias and variance.
• For example, in k-Nearest Neighbours (k-NN), the user-configurable parameter 'k' can be used to trade off between bias and variance. When the value of 'k' is increased, the model becomes smoother and simpler, and bias increases; when 'k' is decreased, the model follows the training data more closely, and variance increases.
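A short sketch of this trade-off in k-NN, assuming scikit-learn; the data set and the candidate values of 'k' are illustrative:

# A sketch of the k-NN bias-variance trade-off, assuming scikit-learn;
# small k tracks the training data closely (high variance), large k
# smooths it out (high bias).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Typically k=1 gives near-perfect training accuracy but weaker
    # test accuracy, while a very large k underfits both.
    print("k=%3d  train acc %.3f  test acc %.3f"
          % (k, knn.score(X_tr, y_tr), knn.score(X_te, y_te)))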
22. EVALUATING PERFORMANCE OF A MODEL
Supervised learning – classification
In supervised learning, one major task is classification. The responsibility of the classification model is to assign a class label to the target feature based on the values of the predictor features. For example, in the problem of predicting the win/loss of a cricket match, the classifier will assign a class value win/loss to the target feature based on the values of other features, like whether the team won the toss, the number of spinners in the team, the number of wins the team had in the tournament, etc.
To evaluate the performance of the model, the number of correct classifications or predictions made by the model has to be recorded. A classification is said to be correct if, for example in the given problem, the model predicted that the team would win and it actually won.
23. The accuracy of the model is calculated based on the number of correct and incorrect classifications or predictions made by the model. If the model has classified correctly 99 out of 100 times, i.e. 99% accuracy, that may be reasonably good for a sports win predictor, but the same number, 99%, may not be acceptable if the problem is to predict a critical illness; in that case, even the 1% incorrect predictions may lead to the loss of many lives. So the model performance needs to be evaluated in light of the learning problem.
So, let's start by looking at model accuracy more closely. There are four possibilities with regard to the cricket match win/loss prediction:
1. the model predicted win and the team won
2. the model predicted win and the team lost
3. the model predicted loss and the team won
4. the model predicted loss and the team lost
In this problem, the obvious class of interest is 'win'.
The first case, i.e. the model predicted win and the team won, is a case where the model has correctly classified data instances as the class of interest. These cases are referred to as True Positive (TP) cases. The second case, i.e. the model predicted win and the team lost, is a case where the model has incorrectly classified data instances as the class of interest. These cases are referred to as False Positive (FP) cases.
24. The third case, i.e. the model predicted loss and the team won, is a case where the model has incorrectly classified instances as not the class of interest. These cases are referred to as False Negative (FN) cases. The fourth case, i.e. the model predicted loss and the team lost, is a case where the model has correctly classified instances as not the class of interest. These cases are referred to as True Negative (TN) cases. All four cases are depicted in Figure 3.7.
For any classification model, model accuracy is given by the total number of correct classifications (either as the class of interest, i.e. True Positive, or as not the class of interest, i.e. True Negative) divided by the total number of classifications done:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
25. A matrix containing correct and incorrect predictions in the form of TPs, FPs, FNs and TNs is known as a confusion matrix. The win/loss prediction of a cricket match has two classes of interest – win and loss – so it will generate a 2 × 2 confusion matrix. For a classification problem involving three classes, the confusion matrix would be 3 × 3, and so on.
Let's assume the confusion matrix of the win/loss prediction of the cricket match problem to be as below:

                  Actual: win   Actual: loss
Predicted: win      85 (TP)        4 (FP)
Predicted: loss      2 (FN)        9 (TN)

In the context of the above confusion matrix, the total count of TPs = 85, count of FPs = 4, count of FNs = 2 and count of TNs = 9. Hence, Accuracy = (85 + 9) / 100 = 0.94, i.e. 94%.
26. The percentage of misclassifications is indicated using the error rate, which is measured as:
Error rate = (FP + FN) / (TP + FP + FN + TN) = 1 − Accuracy
In the context of the above confusion matrix, error rate = (4 + 2) / 100 = 0.06, i.e. 6%.
Sometimes correct predictions, both TPs as well as TNs, may happen by mere coincidence. Since such chance agreement should not inflate the assessment of a model, the kappa value of a model indicates its chance-adjusted accuracy. It is calculated using the formula below:
κ = (P(a) − P(e)) / (1 − P(e))
where P(a) is the observed accuracy and P(e) is the accuracy expected purely by chance, computed from the row and column totals of the confusion matrix.
27. In the context of the above confusion matrix, the total count of TPs = 85, count of FPs = 4, count of FNs = 2 and count of TNs = 9. So P(a) = 0.94, P(e) = (0.89 × 0.87) + (0.11 × 0.13) ≈ 0.789, and hence κ ≈ (0.94 − 0.789) / (1 − 0.789) ≈ 0.72.
The kappa value can be 1 at the maximum, which represents perfect agreement between the model's predictions and the actual values.
28. In certain learning problems, it is critical to have an extremely low number of FN cases because the consequences of a false negative are severe – for example, cancer detection. In these problems there are measures of model performance which are more important than accuracy. Two such measures are the sensitivity and specificity of the model.
The sensitivity of a model measures the proportion of TP examples, or positive cases, which were correctly classified:
Sensitivity = TP / (TP + FN)
In the context of the above confusion matrix for the cricket match win prediction problem, sensitivity = 85 / (85 + 2) ≈ 0.977. In such problems, a high value of sensitivity is more desirable than a high value of accuracy.
Specificity of a model measures the proportion of negative examples which have been correctly classified:
Specificity = TN / (TN + FP)
In the context of malignancy prediction of tumours, specificity gives the proportion of benign tumours which have been correctly classified. In the context of the above confusion matrix, specificity = 9 / (9 + 4) ≈ 0.692. A higher value of specificity indicates better model performance.
29. There are two other performance measures of a supervised learning model which are similar to sensitivity and specificity: precision and recall. While precision gives the proportion of positive predictions which are truly positive, recall gives the proportion of TP cases over all actually positive cases.
Precision indicates the reliability of a model in predicting a class of interest:
Precision = TP / (TP + FP)
When the model is related to win/loss prediction of cricket, precision indicates how often it predicts a win correctly. In the context of the above confusion matrix, precision = 85 / (85 + 4) ≈ 0.955. A model with higher precision is perceived to be more reliable.
30. Recall indicates the proportion of correct predictions of positives to the total number of positives. In the case of win/loss prediction of cricket, recall indicates what proportion of the total wins were predicted correctly:
Recall = TP / (TP + FN)
In the context of the above confusion matrix, recall = 85 / (85 + 2) ≈ 0.977.
F-measure is another measure of model performance which combines precision and recall, taking their harmonic mean:
F-measure = (2 × Precision × Recall) / (Precision + Recall)
In the context of the above confusion matrix, F-measure ≈ (2 × 0.955 × 0.977) / (0.955 + 0.977) ≈ 0.966.
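All of the above quantities can be verified with a few lines of Python directly from the confusion-matrix counts (TP = 85, FP = 4, FN = 2, TN = 9):

# A sketch verifying the classification metrics above from the raw
# confusion-matrix counts; plain Python, no libraries needed.
TP, FP, FN, TN = 85, 4, 2, 9
n = TP + FP + FN + TN

accuracy    = (TP + TN) / n
error_rate  = (FP + FN) / n
sensitivity = TP / (TP + FN)            # also called recall
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)

# Cohen's kappa: chance agreement from the row/column totals.
p_a = accuracy
p_e = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n**2
kappa = (p_a - p_e) / (1 - p_e)

print(f"accuracy={accuracy:.3f} error={error_rate:.3f} kappa={kappa:.3f}")
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"precision={precision:.3f} F-measure={f_measure:.3f}")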
31. Supervised learning – regression
A regression model which ensures that the difference between the predicted and actual values is low can be considered a good model. The figure represents a very simple problem of real estate value prediction solved using a linear regression model. If 'area' is the predictor variable (input variable, say x) and 'value' is the target variable (output variable, say y), the linear regression model can be represented in the form:
y = a + b·x
where 'a' is the intercept and 'b' is the slope of the fitted line.
33. For a certain value of x, say x̂, the value of y is predicted as ŷ, whereas the actual value of y is Y (say). The distance between the actual value and the fitted or predicted value ŷ is known as the residual. The regression model can be considered well fitted if the residual, i.e. the difference between the actual and predicted values, is small.
R-squared is a good measure of model fitness. It is also known as the coefficient of determination or, for multiple regression, the coefficient of multiple determination. The R-squared value lies between 0 and 1 (0%–100%), with a larger value representing a better fit. It is calculated as:
R² = 1 − (SSE / SST)
34. Sum of Squares Total (SST) = squared differences of each observation from the overall mean:
SST = Σ (yi − y̅)², summed over i = 1 to n
where y̅ is the mean.
Sum of Squared Errors (SSE) (of prediction) = sum of the squared residuals:
SSE = Σ (yi − ŷi)², summed over i = 1 to n
where ŷi is the predicted value of yi.
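A sketch of fitting a simple linear regression and computing R-squared both manually and via scikit-learn (assumed available); the area/value numbers are made up for illustration:

# A sketch of simple linear regression and R-squared, assuming
# scikit-learn; the real-estate numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

area  = np.array([[1100], [1400], [1425], [1550], [1600], [1700], [1750]])
value = np.array([199, 245, 319, 240, 312, 279, 310])   # in thousands

model = LinearRegression().fit(area, value)
pred = model.predict(area)

sse = np.sum((value - pred) ** 2)           # sum of squared residuals
sst = np.sum((value - value.mean()) ** 2)   # total sum of squares
print("R^2 (manual):  %.3f" % (1 - sse / sst))
print("R^2 (sklearn): %.3f" % model.score(area, value))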
35. Unsupervised learning – clustering
Clustering algorithms try to prepare groupings amongst the data instances, but it is quite difficult to evaluate the performance of a clustering algorithm. Two challenges lie in the process of clustering:
• It is generally not known how many clusters can be formulated from a particular data set.
• Even if the number of clusters is given, the same number of clusters can be formed with different groups of data instances.
A clustering algorithm is successful if the clusters it identifies are able to achieve the right results in the overall problem domain.
36. There are two popular approaches adopted for cluster quality evaluation.
1. Internal evaluation
Internal evaluation methods generally measure cluster quality based on the homogeneity of data belonging to the same cluster and the heterogeneity of data belonging to different clusters. The homogeneity/heterogeneity is decided by some similarity measure. For example, the silhouette coefficient, one of the most popular internal evaluation methods, uses distance (Euclidean or Manhattan distance, most commonly) between data elements as the similarity measure. The value of silhouette width ranges between −1 and +1, with a high value indicating high intra-cluster homogeneity and inter-cluster heterogeneity.
For a data element 'i' in a data set clustered into 'k' clusters, the silhouette width is calculated as:
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
37. FIG. Silhouette width calculation
a(i) is the average distance between the i-th data instance and all other data instances belonging to the same cluster, and b(i) is the lowest average distance between the i-th data instance and the data instances of all the other clusters.
38. Let's try to understand this in the context of the example depicted in the figure. There are four clusters, namely clusters 1, 2, 3, and 4. Consider an arbitrary data element 'i' in cluster 1, represented by the asterisk. a(i) is the average of the distances ai1, ai2, ..., ain1 of the different data elements from the i-th data element in cluster 1, assuming there are n1 data elements in cluster 1. Mathematically,
a(i) = (ai1 + ai2 + ... + ain1) / n1
In the same way, let's calculate the distance of the arbitrary data element 'i' in cluster 1 to the different data elements of another cluster, say cluster 4, and take the average of all those distances to obtain b14(average).
39. where n4 is the total number of elements in cluster 4. In the same way, we can calculate the values of b12(average) and b13(average). b(i) is the minimum of all these values. Hence, we can say that
b(i) = minimum [b12(average), b13(average), b14(average)]
2. External evaluation
In this approach, the class labels are already known for the data set, but they are not given to the clustering algorithm. The clustering algorithm is assessed based on how close its results are to those known class labels. For example, purity, one of the most popular measures of cluster quality, evaluates the extent to which clusters contain a single class. For a data set having 'n' data instances and 'c' known class labels, on which a clustering algorithm generates 'k' clusters, purity is measured as:
purity = (1/n) × Σ over the k clusters of [the maximum, over the c classes, of the number of members of that cluster having that class label]
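Both evaluation styles can be sketched in a few lines, assuming scikit-learn; the data set and the choice of k-means with k = 3 are illustrative:

# A sketch of internal (silhouette) and external (purity) cluster
# evaluation, assuming scikit-learn; data set and k are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.metrics.cluster import contingency_matrix

X, y_true = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal: average silhouette width over all data instances (-1..+1).
print("silhouette: %.3f" % silhouette_score(X, labels))

# External: purity - each cluster is credited with its majority class.
cm = contingency_matrix(y_true, labels)
print("purity: %.3f" % (np.sum(np.amax(cm, axis=0)) / np.sum(cm)))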
40. IMPROVING PERFORMANCE OF A MODEL
We have already discussed that model selection is based on several aspects:
1. the type of learning task in hand, i.e. supervised or unsupervised
2. the type of the data, i.e. categorical or numeric
3. sometimes, the problem domain
4. above all, experience in working with different models to solve problems of diverse domains
What are the different ways to improve the performance of models? Model parameter tuning is the process of adjusting the model fitting options. For example, in the popular classification model k-Nearest Neighbour (kNN), the model can be tuned by using different values of 'k', the number of nearest neighbours to be considered. In the same way, the number of hidden layers can be adjusted to tune the performance of a neural network model. Most machine learning models have at least one parameter which can be tuned.
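A sketch of parameter tuning via grid search with cross-validation, assuming scikit-learn; the candidate values of 'k' are illustrative:

# A sketch of tuning k-NN's 'k' by grid search with cross-validation,
# assuming scikit-learn; the candidate values are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try each candidate k and keep the one with the best CV accuracy.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
                      cv=10)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print("best CV accuracy: %.3f" % search.best_score_)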
41. As an alternative approach to increasing the performance of one model, several models may be combined together: one model may learn one type of data set well while struggling with another type, and another model may perform well with the data set the first one struggled with. This approach of combining different models with diverse strengths is known as an ensemble (depicted in Figure 3.11). Ensembles help in averaging out the biases of the different underlying models and also in reducing the variance. Ensemble methods combine weaker learners to create stronger ones; a performance boost can be expected even if the models are built as usual and then ensembled.
FIG. Ensemble
42. The following are the typical steps in the ensemble process:
• Build a number of models based on the training data.
• To diversify the models generated, the training data subset can be varied using the allocation function. Sampling techniques like bootstrapping may be used to generate unique training data sets.
• Alternatively, the same training data may be used, but with quite different models being combined, e.g. SVM, neural network, kNN, etc.
• The outputs from the different models are combined using a combination function. A very simple combination strategy, say in the case of a prediction task using an ensemble, is majority voting of the different models combined. For example, if 3 out of 5 models predict 'win' and 2 predict 'loss', then the final outcome of the ensemble, using majority vote, would be 'win'. (A sketch of such a voting ensemble follows.)
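A sketch of such a majority-vote ensemble of heterogeneous models, assuming scikit-learn; the three base models are illustrative choices:

# A sketch of majority-vote ensembling of heterogeneous models,
# assuming scikit-learn; the base models and data are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
ensemble = VotingClassifier(
    estimators=[("svm", SVC()),
                ("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier(random_state=0))],
    voting="hard")  # hard = majority vote over predicted class labels

print("ensemble CV accuracy: %.3f"
      % cross_val_score(ensemble, X, y, cv=10).mean())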
43. One of the earliest and most popular ensemble models is bootstrap aggregating, or bagging. Bagging uses the bootstrap sampling method to generate multiple training data sets. These training data sets are used to generate (or train) a set of models, and the outcomes of the models are combined by majority voting (classification) or by averaging (regression).
Just like bagging, boosting is another key ensemble-based technique. In this type of ensemble, weaker learning models are trained on resampled data and the outcomes are combined using a weighted voting approach based on the performance of the different models. Adaptive boosting, or AdaBoost, is a special variant of the boosting algorithm. It is based on the idea of generating weak learners and learning slowly.
Random forest is another ensemble-based technique. It is an ensemble of decision trees – hence the name random forest, to indicate a forest of decision trees.
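A closing sketch contrasting a single decision tree with bagging and a random forest, assuming scikit-learn; the data set and model settings are illustrative:

# A sketch comparing a single decision tree with bagged trees and a
# random forest, assuming scikit-learn; all settings are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: each tree sees a bootstrap sample of the training data;
    # predictions are combined by majority vote.
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(),
                                      n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=10).mean()
    print("%-13s CV accuracy %.3f" % (name, score))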