Machine Learning with Python - Data Visualization.pdf (SHIBDAS DUTTA)
This document discusses various techniques for visualizing machine learning data using Python. It describes univariate visualization methods like histograms, density plots, and box plots to understand each attribute independently. It also covers multivariate visualization techniques like correlation matrix plots and scatter matrix plots to understand interactions between multiple attributes. Examples of generating histograms, density plots, box plots, correlation matrices, and scatter matrices on a diabetes dataset are provided to illustrate how to implement these techniques in Python.
How PROC SQL and SAS® Macro Programming Made My Statistical Analysis Easy? A ... (Venu Perla)
Life scientists collect similar types of data on a daily basis. Statistical analysis of this data is often performed using SAS programming techniques, but programming for each dataset is a time-consuming job. The objective of this paper is to show how SAS programs are created for systematic analysis of raw data to develop a linear regression model for prediction, then how PROC SQL can be used to replace several data steps in the code, and finally how SAS macros are created from these programs and used for routine analysis of similar data, without any further hard coding, in a short period of time.
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI... (ijcseit)
This document discusses various statistical analysis and feature engineering techniques that can be used for model building in machine learning algorithms. It describes how proper feature extraction through techniques like correlation analysis, principal component analysis, recursive feature elimination, and feature importance can help improve the accuracy of machine learning models. The document provides examples of applying different feature selection methods like univariate selection, recursive feature elimination, and principal component analysis on a diabetes dataset. It also explains the mathematics behind principal component analysis and how feature importance is estimated using an extra trees classifier. Overall, the document emphasizes how statistical analysis and feature engineering are important for effective model building in machine learning.
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI... (IJCSES Journal)
Prediction is at the heart of modern statistics, where accuracy matters most. Pairing algorithms with sound statistical implementation yields more accurate predictions from data sets, and the widespread use of such algorithms simplifies mathematical models and reduces manual calculation. Prediction is the essence of data science and machine learning applications, giving practitioners control over situations. Applying any model requires proper feature extraction, which supports sound model building and, in turn, precision. This paper focuses on statistical analyses, including correlation significance and proper categorical data distribution via feature engineering, that reveal the accuracy of different machine learning models.
The document describes PheWAS, a method for phenome-wide association studies using the PheWAS R package. It discusses importing various types of data, transforming the data for analysis, performing PheWAS to identify associations between phenotypes and genotypes, and plotting the results. The package can be used to conduct GWAS, phenotype-only studies, or meta-analyses combining multiple studies. An end-to-end example analysis is also provided to demonstrate the PheWAS method.
The workshop is an overview of creating predictive models using R. An example data set will be used to demonstrate a typical workflow: data splitting, pre-processing, model tuning and evaluation. Several R packages will be shown along with the caret package, which provides a unified interface to a large number of R’s modeling functions and enables parallel processing. Participants should have a basic understanding of R data structures and basic language elements (e.g., functions, classes).
Machine learning algorithms are categorized as either supervised or unsupervised. Supervised algorithms learn from labeled examples to predict future labels, while unsupervised algorithms find hidden patterns in unlabeled data. Specifically, supervised algorithms are presented with labeled training data and learn a model to predict the class labels of new test data. Common supervised algorithms include neural networks, decision trees, k-nearest neighbors, and Naive Bayes classifiers. Naive Bayes is an easy-to-implement algorithm that assumes independence between features. It has been successfully applied to problems like spam filtering.
Standardization of “Safety Drug” Reporting Applications (halleyzand)
This document proposes an Information Technology infrastructure model that provides drug providers' IT organizations with a strategic perspective on how to computerize their Safety Drug reporting activity. It introduces software development concepts, methods, techniques, and tools for collecting data from multiple platforms and generating reports from them by scripting queries.
Comparing EDA with classical and Bayesian analysis.pptx (PremaGanesh1)
This document provides an overview of exploratory data analysis (EDA) techniques and commonly used tools. It discusses classical and Bayesian statistical analysis approaches as well as EDA. Popular Python libraries for EDA include NumPy, Pandas, Matplotlib and Seaborn. NumPy allows working with multidimensional arrays and matrices while Pandas facilitates working with structured data. The document also provides examples of creating arrays and dataframes, loading data from files, and analyzing datasets using these tools.
The document discusses the Analysis Data Model (ADaM), which is used to standardize the organization of clinical trial data for statistical analysis. ADaM has two main data structures - the Subject-Level Analysis Dataset (ADSL), which contains one record per subject, and the Basic Data Structure (BDS), which can have multiple records per subject. BDS includes variables for subject identifiers, treatments, timings, analysis parameters, and other metadata. Using ADaM makes clinical trial data analysis-ready and traceable. It allows statisticians to perform various analyses like survival analysis and comparisons between treatment groups using standard SAS procedures without additional data manipulation.
A course work on R programming for basics to advance statistics and GIS.pdf (SEEMAB AKHTAR)
The document provides an overview of an upcoming course on R programming for basics to advanced statistics and GIS. The course will cover traditional and advanced statistics using R, including descriptive statistics, regression analysis, trend surface identification, and non-parametric tests. It will also cover applications of R in GIS, such as spatial plotting, raster operations, and extracting statistics from raster surfaces. The instructor has six years of experience in geostatistics, GIS, and groundwater resource management. The course materials and schedule are also provided.
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP... (cscpconf)
In search-based test data generation, the problem of test data generation is reduced to one of function minimization or maximization. Traditionally, for branch testing, the problem of test data generation has been formulated as a minimization problem. In this paper we define an alternate maximization formulation and experimentally compare it with the minimization formulation. We use a genetic algorithm as the search technique and, in addition to the usual genetic algorithm operators, we also employ the path prefix strategy as a branch ordering strategy, along with memory and elitism. Results indicate that there is no significant difference in the performance or the coverage obtained through the two approaches, and either could be used in test data generation when coupled with the path prefix strategy, memory, and elitism.
This document compares the performance of various machine learning and classification algorithms, including neural networks, support vector machines, Naive Bayes, decision trees, and decision stumps. It analyzes these algorithms using a dataset of annual and monthly temperature data from India over 1901-2012. The analysis is conducted in RapidMiner and finds that neural networks and support vector machines can effectively model complex nonlinear relationships to predict temperature. Neural networks achieved reasonably accurate predictions of annual temperature compared to the original data values. The document concludes by comparing the performance of the different algorithms.
More Stored Procedures and MUMPS for DivConq (eTimeline, LLC)
This document discusses DivConq's MUMPS API and provides examples of querying, updating, and defining stored procedures in MUMPS using DivConq's framework. It shows how to:
- Query a MUMPS global to retrieve test data and return the results.
- Define the schema for stored procedures, including their inputs, outputs, and descriptions.
- Map data between Java and MUMPS for calling procedures.
- Add a new record to a MUMPS global by calling an "Update" procedure from Java that handles auditing.
- Retrieve complex nested data structures by calling a procedure that returns a list of records.
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL (ijcsit)
Predicting student performance is a great concern to higher education managements. This prediction helps to identify and to improve students' performance, and several factors may improve this performance. In the present study, we employ data mining processes, particularly classification, to enhance the quality of the higher educational system. Recently, a new direction has been used to improve classification accuracy by combining classifiers. In this paper, we design and evaluate a fast learning algorithm using an AdaBoost ensemble with a simple genetic algorithm, called “Ada-GA”, where the genetic algorithm is demonstrated to successfully improve the accuracy of the combined classifier's performance. The Ada-GA algorithm proved to be of considerable usefulness in identifying at-risk students early, especially in very large classes. This early prediction allows the instructor to provide appropriate advising to those students. The Ada-GA algorithm was implemented and tested on the ASSISTments dataset; the results showed that this algorithm successfully improved detection accuracy as well as reducing the complexity of computation.
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ... (Seval Çapraz)
This document analyzes a dataset of diabetes records from 130 US hospitals from 1999-2008 using various statistical data analysis and machine learning techniques. It first performs dimensionality reduction using principal component analysis (PCA) and multidimensional scaling (MDS). It then clusters the data using hierarchical clustering and k-means clustering. Cluster validity is assessed using precision. Spectral clustering is also applied and validated using Dunn and Davies-Bouldin indexes, with complete linkage diameter performing best.
Can data analysis help predict the future of your heart health?
The Boston Institute of Analytics (BIA) presents a collection of student presentations on data analysis projects tackling the critical topic of heart attack prediction.
Join us as we delve into the world of healthcare analytics and explore how data can be harnessed to identify individuals at risk of heart attack. These presentations offer valuable insights for:
Medical professionals seeking to develop preventative healthcare strategies
Individuals interested in understanding their own heart health risks
Data analysts passionate about applying data analysis for social good
Here's what you'll learn by watching these presentations:
The power of data analysis in predicting heart attacks
Various data analysis techniques used for risk assessment
Real-world examples of heart attack prediction models
Insights and findings from the research of dedicated BIA students
Empower yourself and others with the knowledge of heart health prediction. Watch these presentations and unlock the potential of data analysis in saving lives!
Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This document discusses bringing OpenClinica clinical trial data into SAS. It describes developing a Java utility to convert OpenClinica export files into an XML format that can be read by SAS. The utility standardizes names, adds metadata like labels and formats, and structures the data into a tall, thin format suitable for SAS. It allows OpenClinica data to be easily imported into SAS for analysis while preserving metadata.
This document provides an overview of the machine learning workbench WEKA. It describes how WEKA can be used to import and preprocess data, build classifiers and clustering models, perform attribute selection and data visualization, and run experiments. Key capabilities mentioned include importing data from various formats, using filters for preprocessing, implementing various learning algorithms like decision trees and SVMs, clustering algorithms, association rule learning, attribute selection methods, and the experimenter for comparing models. The Knowledge Flow GUI is also introduced as a graphical interface in WEKA.
Data mining involves using algorithms to find patterns in large datasets. It is commonly used in market research to perform tasks like classification, prediction, and association rule mining. The document discusses several common data mining techniques like decision trees, naive Bayes classification, and regression trees. It also covers related topics like cross-validation, bagging, and boosting methods used for improving model performance.
This document provides an overview and examples of various statistical concepts and tools, including:
- Useful statistical measures such as mean, median, mode, range, variance, and standard deviation.
- The normal distribution and how to calculate proportions of values that fall within a certain range using normal distribution tables or Excel functions.
- Common values from the normal distribution such as what proportion of values fall within 1, 2, or 3 standard deviations of the mean.
- Six Sigma "sigma values" and how they correspond to defects per million opportunities.
- Visualization tools like histograms, Pareto charts, stem-and-leaf plots, scatter graphs, multi-vari charts, and box plots.
This document presents an intelligent visualization framework for multi-dimensional data sets. The framework includes pre-processing, feature selection, classification, rule refinement, and visualization phases. In the feature selection phase, principal component analysis and rough sets are used to select important features. Classification is done using rough set rules generation. The rules are then refined using entropy and genetic algorithms. Finally, the refined rules and reducts are visualized using nodes, edges, charts and grids to help experts understand the data. Experimental results on breast cancer and prostate cancer data sets demonstrate the performance of the approach.
Eli plots visualizing innumerable number of correlations (Leonardo Auslender)
This document discusses a method for visualizing direct and partial correlations using ELI (Exploratory Linear Information) plots. The method allows correlations between any number of variables to be plotted in an overlay fashion. The plots can show correlations against a single "with" variable, sorted by absolute value. Partial correlations can also be plotted. The method is implemented in a SAS macro. An example uses continuous variables from a dataset to demonstrate plotting correlations without a "with" variable.
4-Introduction to Machine Learning Lecture # 4.pdf
1. Dr. Ashfaq Ahmad, Associate Professor
Department of Electronics & Electrical Systems
The University of Lahore
Introduction to Machine Learning
Lecture # 4
2. Lecture Objectives
Data Understanding With Descriptive Statistics
Peek at Your Data
Dimensions of Your Data
Data Type For Each Attribute
Descriptive Statistics
Class Distribution (Classification Only)
Correlations Between Attributes
Skew of Univariate Distributions
3. Lecture Objectives (continued)
Overfitting and Underfitting
Generalization in Machine Learning
Statistical Fit
Overfitting in Machine Learning
Underfitting in Machine Learning
A Good Fit in Machine Learning
How To Limit Overfitting
4. Data Understanding With Descriptive Statistics
➢ In order to get the best results, you must understand your data.
➢ In this lecture you will learn about 7 ways that you can use in Python to better understand your machine learning data.
➢ All 7 ways are demonstrated by loading the Diabetes classification dataset.
5. Peek At Your Data
✓ There is no substitute for looking at the raw data.
✓ Analyzing the raw data can provide information that cannot be learned in any other way.
✓ Additionally, it can sow the seeds that may later grow into ideas on how to better prepare and manage data for machine learning tasks.
✓ You can review the first 20 rows of your data using the head() function on the Pandas DataFrame.
6. Example of Reviewing The First Few Rows of Data
# View first 20 rows
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
peek = data.head(20)
print(peek)
7. Output of Reviewing The First Few Rows of Data
You can see that the first column lists the row number, which is useful for referring to a particular observation.
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1
8. Dimensions of Data
❖ The quantity of your data, both in terms of rows and columns, must be very well understood.
✓ Too many rows and algorithms may take too long to train.
✓ Too few and perhaps you do not have enough data to train the algorithms.
✓ The curse of dimensionality can cause some algorithms to be distracted or to perform poorly when there are too many features.
9. An Illustration of Examining The Shape of The Data
❑ You can review the shape and size of your dataset by printing the shape property on the Pandas DataFrame.
# Dimensions of your data
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
shape = data.shape
print(shape)
The Results of Examining The Data’s Shape
(768, 9)
The results are listed as rows then columns. You can see that the dataset has 768 rows and 9 columns.
10. Type of Data For Each Attribute
❑ The type of each attribute is important.
❑ Strings may need to be converted to floating point values or integers to represent categorical or ordinal values.
❑ You can get an idea of the types of attributes by peeking at the raw data, as above.
❑ You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.
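Where an attribute does load as strings, a minimal conversion sketch (the 'price' and 'grade' columns here are hypothetical illustrations, not part of the diabetes dataset):
# Convert string-typed columns to numeric values and categorical codes.
# 'price' and 'grade' are made-up column names used only for illustration.
import pandas as pd
df = pd.DataFrame({'price': ['1.5', '2.0', '3.25'], 'grade': ['low', 'high', 'low']})
df['price'] = pd.to_numeric(df['price'])                # string -> float64
df['grade'] = df['grade'].astype('category').cat.codes  # string -> integer category codes
print(df.dtypes)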
11. Example of Reviewing The Data Types of The Data
# Data Types for Each Attribute
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
types = data.dtypes
print(types)
Output of Reviewing The Data Types of The Data
➢ You can see that most of the attributes are integers; only mass and pedi are floating point types.
preg int64
plas int64
pres int64
skin int64
test int64
mass float64
pedi float64
age int64
class int64
dtype: object
12. Descriptive Statistics
❑ Descriptive statistics can give you great insight into the shape of each attribute.
❑ Often you can create more summaries than you have time to review.
❑ The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute. They are:
➢ Count.
➢ Mean.
➢ Standard Deviation.
➢ Minimum Value.
➢ 25th Percentile.
➢ 50th Percentile (Median).
➢ 75th Percentile.
➢ Maximum Value.
13. Example of Reviewing A Statistical Summary of The Data
# Statistical Summary
from pandas import read_csv
from pandas import set_option
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('display.precision', 3)  # full option name; the short form 'precision' is rejected by recent pandas
description = data.describe()
print(description)
14. Output of Reviewing A Statistical Summary of The Data
✓ You can see that you do get a lot of data.
✓ Note that the precision of the numbers and the preferred width of the output can be changed using the pandas set_option() function.
✓ This is to make the output more readable for this example.
✓ When describing your data this way, it is worth taking some time to review observations from the results.
✓ This might include the presence of NA values for missing data or surprising distributions for attributes.
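Following up on that last point, a short sketch (not from the slides) for spotting missing data; note that in this dataset zeros often stand in for missing measurements (e.g., a blood pressure of 0) rather than explicit NA values:
# Check for explicit NA/NaN values, then for zeros used as implicit missing values.
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data.isnull().sum())                                          # explicit missing values per attribute
print((data[['plas', 'pres', 'skin', 'test', 'mass']] == 0).sum())  # zero counts where 0 is physiologically implausible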
15. Class Distribution (Classification Only)
❑ On classification problems you need to know how balanced the class values are.
❑ Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project.
❑ You can quickly get an idea of the distribution of the class attribute in Pandas.
16. Example of Reviewing A Class Breakdown of The Data
# Class Distribution
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
class_counts = data.groupby('class').size()
print(class_counts)
Output of Reviewing A Class Breakdown of The Data
❑ You can see that there are nearly double the number of observations with class 0 (no onset of diabetes) than there are with class 1 (onset of diabetes).
class
0 500
1 268
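As an aside, the same breakdown can be read as proportions using value_counts(), a standard Pandas method the slides do not use:
# Class proportions rather than raw counts: roughly 65% class 0 vs 35% class 1.
print(data['class'].value_counts(normalize=True))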
17. Correlations Between Attributes
➢ Correlation refers to the relationship between two variables and how they may or may not change together.
➢ The most common method for calculating correlation is Pearson's Correlation Coefficient, which assumes a normal distribution of the attributes involved.
➢ A correlation of -1 or 1 shows a full negative or positive correlation, respectively, whereas a value of 0 shows no correlation at all.
➢ Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset.
➢ As such, it is a good idea to review all the pairwise correlations of the attributes in your dataset.
➢ You can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.
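For reference, the standard formula behind this coefficient (not shown on the slide) for attributes x and y over n samples is:

\[
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

The corr(method='pearson') call on the next slide computes exactly this quantity for every pair of attributes.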
18. Example of Reviewing Correlations of Attributes In The Data
# Pairwise Pearson correlations
from pandas import read_csv
from pandas import set_option
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('display.precision', 3)  # full option name, as above
correlations = data.corr(method='pearson')
print(correlations)
19. Output of Reviewing Correlations of Attributes In The Data
❑ The matrix lists all attributes across the top and down the side, giving the correlation between all pairs of attributes (twice, because the matrix is symmetrical).
❑ You can see that the diagonal line through the matrix, from the top left to the bottom right corner, shows the perfect correlation of each attribute with itself.
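A minimal sketch (not from the slides) of turning that matrix into a list of strongly correlated attribute pairs; the 0.5 cutoff is an arbitrary illustration:
# List attribute pairs whose absolute Pearson correlation exceeds a threshold.
import numpy as np
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
correlations = data.corr(method='pearson')
# Keep the upper triangle only, so each pair is reported once and the all-1.0 diagonal is skipped.
mask = np.triu(np.ones(correlations.shape, dtype=bool), k=1)
pairs = correlations.where(mask).stack()  # MultiIndex (attribute, attribute) -> r
print(pairs[pairs.abs() > 0.5].sort_values(ascending=False))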
20. Skew of Univariate Distributions
➢ Skewness in statistics is a measure of asymmetry: the deviation of a given random variable's distribution from a symmetric distribution (like the normal distribution). In a normal distribution, Median = Mode = Mean.
➢ The formula given in most textbooks is Skew = 3 * (Mean - Median) / Standard Deviation.
➢ Skew refers to a distribution that is assumed Gaussian (normal or bell curve) being shifted or squashed in one direction or another.
➢ Many machine learning algorithms assume a Gaussian distribution.
➢ Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.
➢ You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame.
21. Example of Reviewing Skew of Attribute Distributions In The Data
# Skew for each attribute
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
skew = data.skew()
print(skew)
Output of Reviewing Skew of Attribute Distributions In The Data
The skew results show a positive (right) or negative (left) skew. Values closer to zero show less skew.
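One caveat worth noting: Pandas' skew() computes the moment-based sample skewness, not the textbook median formula quoted on slide 20. A quick sketch comparing the two, using only standard Pandas methods:
# Moment-based skewness vs. the textbook (Pearson's second) skewness coefficient.
from pandas import read_csv
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data.skew())                                     # moment-based sample skewness
print(3 * (data.mean() - data.median()) / data.std())  # textbook formula from slide 20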
22. Overfitting And Underfitting
The cause of poor performance in machine learning is either overfitting or underfitting the data.
In this lecture we will discover the concept of generalization in machine learning and the problems of overfitting and underfitting that go along with it.
23. Generalization in Machine Learning
➢ In machine learning we describe the learning of the target function from training data as inductive learning.
➢ Induction refers to learning general concepts from specific examples, which is exactly the problem that supervised machine learning aims to solve.
➢ This is different from deduction, which is the other way around and seeks to learn specific concepts from general rules.
24. Generalization in Machine Learning
➢ Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning.
➢ The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain.
➢ This allows us to make predictions in the future on data the model has never seen.
➢ There is a terminology used in machine learning when we talk about how well a machine learning model learns and generalizes to new data, namely overfitting and underfitting.
➢ Overfitting and underfitting are the two biggest causes of poor performance in machine learning algorithms.
25. Statistical Fit
❑ In statistics, a fit refers to how well you approximate a target function.
❑ This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables.
❑ Statistics often describes the goodness of fit, which refers to measures used to estimate how well the approximation of the function matches the target function.
❑ Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning.
❑ If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.
26. Overfitting in Machine Learning
❑ Overfitting refers to a model that models the training data too well.
❑ Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
❑ This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model.
❑ The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
27. Overfitting in Machine Learning
❑ Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function.
❑ As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.
❑ For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data.
❑ This problem can be addressed by pruning a tree after it has learned, in order to remove some of the detail it has picked up.
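A hedged sketch of that idea using scikit-learn (which these slides have not introduced); a depth limit stands in here for post-pruning, and scikit-learn also offers true cost-complexity pruning via its ccp_alpha parameter:
# Compare an unconstrained decision tree with a capacity-limited one on training accuracy.
from pandas import read_csv
from sklearn.tree import DecisionTreeClassifier
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
X, y = data.values[:, :-1], data.values[:, -1]
full = DecisionTreeClassifier(random_state=7).fit(X, y)                  # unconstrained: memorizes detail and noise
pruned = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X, y)   # constrained capacity
print(full.get_depth(), full.score(X, y))      # deep tree, near-perfect training accuracy
print(pruned.get_depth(), pruned.score(X, y))  # shallow tree, lower training accuracy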
28. Underfitting in Machine Learning
❖ Underfitting refers to a model that can neither model the training data nor generalize to new data.
❖ An underfit machine learning model is not a suitable model, and this will be obvious as it will have poor performance on the training data.
❖ Underfitting is often not discussed, as it is easy to detect given a good performance metric.
❖ The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.
29. A Good Fit in Machine Learning
❑ Ideally, you want to select a model at the sweet spot between underfitting and overfitting. This is the goal, but it is very difficult to do in practice.
❑ To understand this goal, we can look at the performance of a machine learning algorithm over time as it learns the training data.
❑ We can plot both the skill on the training data and the skill on a test dataset we have held back from the training process.
❑ Over time, as the algorithm learns, the error for the model on the training data goes down, and so does the error on the test dataset.
30. A Good Fit in Machine Learning
❑ If we train for too long, the error on the training dataset may continue to decrease while the model overfits, learning the irrelevant detail and noise in the training dataset.
❑ At the same time, the error for the test set starts to rise again as the model's ability to generalize decreases.
❑ The sweet spot is the point just before the error on the test dataset starts to increase, where the model has good skill on both the training dataset and the unseen test dataset.
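A small sketch of this experiment (again using scikit-learn; it sweeps model capacity via tree depth rather than training time, but traces the same U-shaped test error, and the file and split settings follow the earlier examples):
# Trace training error vs. test error as model capacity grows.
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
X, y = data.values[:, :-1], data.values[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
for depth in range(1, 12):
    model = DecisionTreeClassifier(max_depth=depth, random_state=7).fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)
    test_err = 1 - model.score(X_test, y_test)
    print(depth, round(train_err, 3), round(test_err, 3))  # test error typically bottoms out at a modest depth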
31. A Good Fit in Machine Learning
❑ You can perform this experiment with your favorite machine learning algorithms.
❑ This is often not a useful technique in practice, because choosing the stopping point for training using the skill on the test dataset means that the test set is no longer unseen or a standalone objective measure.
❑ Some knowledge (a lot of useful knowledge) about that data has leaked into the training procedure.
❑ There are two additional techniques you can use to help find the sweet spot in practice: resampling methods and a validation dataset.
32. How To Limit Overfitting
❑ Both overfitting and underfitting can lead to poor model performance.
❑ But by far the most common problem in applied machine learning is overfitting.
❑ Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care most about, namely how well the algorithm performs on unseen data.
❑ There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:
✓ Use a resampling technique to estimate model accuracy.
✓ Hold back a validation dataset.
33. How To Limit Overfitting
❑ The most popular resampling technique is k-fold cross validation.
❑ It allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of a machine learning model on unseen data.
❑ A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project.
❑ After you have selected and tuned your machine learning algorithms on your training dataset, you can evaluate the learned models on the validation dataset to get a final, objective idea of how the models might perform on unseen data.
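A sketch combining both techniques with scikit-learn (an assumption of this example, not something the slides specify; logistic regression is just a convenient model choice):
# 10-fold cross validation on the development data, plus a held-back validation set.
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
filename = "diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
X, y = data.values[:, :-1], data.values[:, -1]
# Hold back a validation set, untouched until the very end of the project.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.2, random_state=7)
# Estimate model accuracy on the development data with 10-fold cross validation.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_dev, y_dev, cv=KFold(n_splits=10, shuffle=True, random_state=7))
print(scores.mean(), scores.std())
# Only once the model is selected and tuned, score it on the held-back validation set.
print(model.fit(X_dev, y_dev).score(X_val, y_val))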
34. How To Limit Overfitting
❑ Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data.
❑ If you have the data, using a validation dataset is also an excellent practice.