Prediction is central to modern statistics, where accuracy matters most. Comparing algorithms together with their statistical implementations yields more accurate predictions from a given dataset. The widespread use of algorithms simplifies mathematical models and reduces manual calculation. Prediction is the essence of data science and machine learning applications, giving practitioners control over a situation. Implementing any model requires proper feature extraction, which supports sound model building and, in turn, precision. This paper is based primarily on statistical analyses, including correlation significance and categorical data distribution, using feature engineering techniques to reveal the accuracy of different machine learning models.
A Preference Model on Adaptive Affinity Propagation (IJECEIAES)
In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a data clustering technique that uses iterative message passing and considers all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter "preference" p. Although the original AP algorithm has shown much success in data clustering, it still suffers from one limitation: it is not easy to determine the value of the parameter p that results in an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter p, based on the similarity distribution. Given the SM and p, the Modified Adaptive AP (MAAP) procedure is run; MAAP means that we omit the adaptive p-scanning algorithm of the original Adaptive AP (AAP) procedure. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than the original AP for the random non-partition dataset, and (ii) for the random 4-partition dataset and the real datasets, the proposed algorithm succeeds in identifying clusters matching the number of the datasets' true labels, with execution times comparable to those of the original AP. Besides that, the MAAP-DDP algorithm proves more feasible and effective than the original AAP procedure.
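As a rough illustration of the preference input, here is a minimal sketch using scikit-learn's AffinityPropagation; deriving p from the similarity distribution (the median below) approximates the idea and is an assumption, not the paper's exact DDP formula.

```python
# Minimal sketch: preference-driven Affinity Propagation with scikit-learn.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Similarity matrix: negative squared Euclidean distance (AP's usual choice).
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Model the preference p from the similarity distribution (assumed: median).
p = np.median(S)

ap = AffinityPropagation(preference=p, random_state=0).fit(X)
print("clusters found:", len(ap.cluster_centers_indices_))
```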
Optimization is considered one of the pillars of statistical learning and plays a major role in the design and development of intelligent systems such as search engines, recommender systems, and speech and image recognition software. Machine learning is the study that gives computers the ability to learn without being explicitly programmed. A computer is said to learn from experience with respect to a specified task and a performance measure related to that task. Machine learning algorithms are applied to problems to reduce effort; they are used to manipulate data and predict outputs for new data with high precision and low uncertainty. Optimization algorithms are used to make rational decisions in an environment of uncertainty and imprecision. This paper presents a methodology for using an efficient optimization algorithm as an alternative to gradient descent in machine learning.
IRJET - Performance Evaluation of Various Classification Algorithms (IRJET Journal)
This document evaluates the performance of various classification algorithms (logistic regression, K-nearest neighbors, decision tree, random forest, support vector machine, naive Bayes) on a heart disease dataset. It provides details on each algorithm and evaluates their performance based on metrics like confusion matrix, precision, recall, F1-score and accuracy. The results show that naive Bayes had the best performance in correctly classifying samples with an accuracy of 80.21%, while SVM had the worst at 46.15%. In general, random forest and naive Bayes performed best according to the evaluation.
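For readers who want to reproduce this kind of comparison, here is a minimal sketch assuming scikit-learn; a bundled dataset stands in for the heart disease data, which is not reproduced here.

```python
# Minimal sketch: compare several classifiers on the same train/test split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "knn": KNeighborsClassifier(),
    "naive_bayes": GaussianNB(),
}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name, "accuracy:", round(accuracy_score(y_te, y_pred), 4))
    print(classification_report(y_te, y_pred))  # precision, recall, F1 per class
```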
Data Science - Part V - Decision Trees & Random Forests (Derek Kane)
This lecture provides an overview of decision tree machine learning algorithms and random forest ensemble techniques. Practical examples include diagnosing Type II diabetes and evaluating customer churn in the telecommunications industry.
A tour of the top 10 algorithms for machine learning newbies (Vimal Gupta)
The document summarizes the top 10 machine learning algorithms for newcomers. It discusses linear regression, logistic regression, linear discriminant analysis, classification and regression trees, naive Bayes, k-nearest neighbors, and learning vector quantization, among others. For each algorithm, it provides a brief overview of the model representation and how predictions are made. The document emphasizes that no single algorithm is best and recommends trying multiple algorithms to find the one best suited to the given problem and dataset.
Comparison of Cost Estimation Methods using Hybrid Artificial Intelligence on... (IJERA Editor)
Cost estimating at the schematic design stage, as the basis of project evaluation, engineering design, and cost management, plays an important role in project decisions made under a limited definition of scope, constraints on available information and time, and the presence of uncertainties. The purpose of this study is to compare the performance of cost estimation models built with two different hybrid artificial intelligence approaches: regression analysis with an adaptive neuro-fuzzy inference system (RANFIS) and case-based reasoning with a genetic algorithm (CBR-GA). The models were developed from the same 50 low-cost apartment project datasets in Indonesia. Tested on another five data points, the models proved to perform very well in terms of accuracy. The CBR-GA model was the best performer but suffered from the disadvantage of needing 15 cost drivers, compared to only 4 required by RANFIS for on-par performance.
This document discusses dimensionality reduction techniques for data mining. It begins with an introduction to dimensionality reduction and the reasons for using it, including high-dimensional data issues like the curse of dimensionality. It then covers the two major families of techniques: feature selection and feature extraction. Feature selection topics include search strategies, feature ranking, and evaluation measures; feature extraction maps the data to a lower-dimensional space. The document outlines applications of dimensionality reduction, such as text mining and gene expression analysis, and concludes with trends in the field.
COMPARISON OF WAVELET NETWORK AND LOGISTIC REGRESSION IN PREDICTING ENTERPRIS... (ijcsit)
Enterprise financial distress or failure prediction includes bankruptcy prediction, financial distress, corporate performance prediction, and credit risk estimation. The aim of this paper is to use wavelet networks in non-linear combination prediction to address a limitation of the ARMA (Auto-Regressive Moving Average) model: ARMA requires estimating the values of all parameters in the model, which demands a large amount of computation. To this end, the paper provides an extensive review of wavelet networks and logistic regression. It discusses the wavelet neural network structure, the wavelet network training algorithm, and accuracy and error rates (classification accuracy, Type I error, and Type II error). The main contribution is a proposed business failure prediction model (a wavelet network model and a logistic regression model). An empirical comparison of the wavelet network and logistic regression on training and forecasting samples shows that the wavelet network model is highly accurate; in overall prediction accuracy, Type I error, and Type II error, it outperforms the logistic regression model.
With these components in place, we present the Data Science Machine: an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features to be used for predictive modeling.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Dimensionality reduction is the process of converting a data set with many dimensions into one with fewer dimensions, while ensuring that it conveys similar information concisely.
Concept
R code
A Mathematical Programming Approach for Selection of Variables in Cluster Ana... (IJRES Journal)
The document presents a mathematical programming approach for selecting important variables in cluster analysis. It formulates a nonlinear binary model to minimize the distance between observations within clusters, using indicator variables to select important variables. The model is applied to a sample dataset of 30 observations across 5 variables, correctly identifying variables 3, 4 and 5 as most important for clustering the observations into two groups. The results are compared to an existing variable selection heuristic, with the mathematical programming approach achieving a 100% correct classification versus 97% for the other method.
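As an illustration of the underlying idea, the sketch below brute-forces variable subsets and scores each by silhouette; both the exhaustive search and the silhouette criterion are simpler stand-ins for the paper's nonlinear binary program, not its actual formulation.

```python
# Minimal sketch: pick the variable subset that best supports a 2-cluster split.
from itertools import combinations
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
noise = rng.normal(size=(30, 2))                  # variables 1-2: pure noise
signal = np.vstack([rng.normal(0, 0.3, (15, 3)),  # variables 3-5 carry a
                    rng.normal(3, 0.3, (15, 3))])  # two-group structure
X = StandardScaler().fit_transform(np.hstack([noise, signal]))

best_score, best_subset = -1.0, None
for k in range(1, 6):
    for subset in combinations(range(5), k):
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, subset])
        score = silhouette_score(X[:, subset], labels)
        if score > best_score:
            best_score, best_subset = score, subset
# Should recover (a subset of) the informative variables.
print("selected variables (0-indexed):", best_subset)
```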
Principal Component Analysis and Clustering (Usha Vijay)
Identifying the borrower segments from the given bank data set, which has 27,000 rows and 77 variables, using PROC PRINCOMP. With so many variables, it is important to reduce the data set to a smaller set of variables to derive a feasible conclusion. Owing to multicollinearity, two or more variables can share the same plane in these dimensions. Each row of the data can be envisioned as a point in a 77-dimensional space, and when the data are projected onto an orthonormal basis, certain characteristics of the data are expected to cluster together as principal components. To identify these principal components, PROC PRINCOMP is executed with all the variables except the constant ones (recoveries and collection fees), and a plot of the eigenvalues of all the principal components is derived.
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ... (Seval Çapraz)
This document analyzes a dataset of diabetes records from 130 US hospitals from 1999-2008 using various statistical data analysis and machine learning techniques. It first performs dimensionality reduction using principal component analysis (PCA) and multidimensional scaling (MDS). It then clusters the data using hierarchical clustering and k-means clustering. Cluster validity is assessed using precision. Spectral clustering is also applied and validated using Dunn and Davies-Bouldin indexes, with complete linkage diameter performing best.
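A minimal sketch of that pipeline (PCA for reduction, k-means for clustering, an internal validity index), assuming scikit-learn and synthetic data in place of the hospital records:

```python
# Minimal sketch: reduce, cluster, validate.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, n_features=20, centers=3, random_state=1)

X_reduced = PCA(n_components=2).fit_transform(X)  # dimensionality reduction
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_reduced)

# A lower Davies-Bouldin index means more compact, better-separated clusters.
print("Davies-Bouldin index:", davies_bouldin_score(X_reduced, labels))
```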
The document describes developing a model to predict house prices using deep learning techniques. It proposes using a dataset with house features without labels and applying regression algorithms like K-nearest neighbors, support vector machine, and artificial neural networks. The models are trained and tested on split data, with the artificial neural network achieving the lowest mean absolute percentage error of 18.3%, indicating it is the most accurate model for predicting house prices based on the data.
Bank - Loan Purchase Modeling
This case is about a bank with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying size. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio with a minimal budget. The department wants to build a model that will help them identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reducing the cost of the campaign. The dataset has data on 5,000 customers, including customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Among these 5,000 customers, only 480 (9.6%) accepted the personal loan offered to them in the earlier campaign.
Our job is to build the best model which can classify the right customers who have a higher probability of purchasing the loan. We are expected to do the following:
EDA of the available data. Showcase the results using appropriate graphs.
Apply appropriate clustering on the data and interpret the output.
Build appropriate models on both the test and train data (CART & Random Forest). Interpret all the model outputs and make the necessary modifications wherever applicable (such as pruning); a minimal sketch follows this list.
Check the performance of all the models that you have built (test and train). Use all the model performance measures you have learned so far. Share your remarks on which model performs best.
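The following sketch shows the modeling step under stated assumptions: scikit-learn, a hypothetical bank_loans.csv file, and a Personal Loan target column inferred from the case description rather than the actual dataset schema.

```python
# Minimal sketch: CART (with cost-complexity pruning) vs. Random Forest.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("bank_loans.csv")          # hypothetical file name
X = df.drop(columns=["Personal Loan"])      # assumed target column name
y = df["Personal Loan"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7
)

# ccp_alpha > 0 prunes the CART tree (cost-complexity pruning).
cart = DecisionTreeClassifier(ccp_alpha=0.001, random_state=7).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=300, random_state=7).fit(X_tr, y_tr)

for name, model in [("CART", cart), ("Random Forest", rf)]:
    print(name,
          "train:", accuracy_score(y_tr, model.predict(X_tr)),
          "test:", accuracy_score(y_te, model.predict(X_te)))
```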
Abdul Ahad Abro presented on data science, predictive analytics, machine learning algorithms, regression, classification, Microsoft Azure Machine Learning Studio, and academic publications. The presentation introduced key concepts in data science including machine learning, predictive analytics, regression, classification, and algorithms. It demonstrated regression analysis using Microsoft Azure Machine Learning Studio and Microsoft Excel. The methodology section described using a dataset from Azure for classification and linear regression in both Azure and Excel to compare results.
Standard Statistical Feature analysis of Image Features for Facial Images usi... (Bulbul Agrawal)
This document compares Principal Component Analysis (PCA) and Independent Component Analysis (ICA) and their application to facial image analysis. It provides an introduction to both PCA and ICA, including their processes and differences. The document then summarizes previous literature comparing PCA and ICA, describes implementations of PCA for facial recognition on Japanese, African, and Asian datasets in MATLAB, and calculates statistical metrics for the original and recognized images. It concludes that PCA is effective for pattern recognition and dimensionality reduction in facial analysis applications.
This document compares several supervised machine learning classification algorithms on a Titanic dataset: Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Support Vector Machine, and Naive Bayes. It finds that Random Forest achieves the highest accuracy. Evaluation metrics like precision, recall, F1-score, and accuracy are used to evaluate and compare model performance on test data.
Enhanced ID3 algorithm based on the weightage of the Attribute (AM Publications)
ID3, a decision tree classification algorithm, is very popular due to its speed and simplicity of construction, but it has its own snags: it tends to choose attributes with many values, and practical complexities arise as a result. To solve this problem, the proposed algorithm uses the importance (weightage) of the attributes and classifies accordingly to produce effective rules. It calculates a weighted information gain for few-valued attributes and performs considerably better than the classical ID3 algorithm. The proposed algorithm is applied to real data, namely the selection of employees in a firm for appraisal based on a few important attributes.
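Information gain is the quantity being reweighted here. Below is a minimal sketch of entropy and a weighted information gain, where the weight is an illustrative assumption standing in for the paper's attribute weightage scheme.

```python
# Minimal sketch: entropy and weighted information gain (the core of ID3).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, weight=1.0):
    """Gain from splitting `labels` by attribute `values`, scaled by `weight`."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return weight * (entropy(labels) - split_entropy)

# Toy data; the weight 0.8 is an assumed attribute importance.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(round(information_gain(outlook, play, weight=0.8), 4))
```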
This document discusses using Lagrange interpolation to estimate missing values in datasets. It begins with an introduction to missing data problems and common techniques for handling missing values like deletion, mean substitution, and more. It then explains Lagrange interpolation, which uses known data points to estimate values at unknown points. The algorithm for Lagrange interpolation is presented. An example using years of experience and salary data to estimate salary for 10 years of experience is shown. The document concludes that Lagrange interpolation can be used to estimate missing values in preprocessing if the relationship between attributes is uniform. Limitations are noted if the relationship is not uniform.
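A minimal sketch of Lagrange interpolation on the experience-vs-salary example; the sample numbers are illustrative assumptions, not the document's actual figures.

```python
# Minimal sketch: estimate a missing value via Lagrange interpolation.
def lagrange(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

years  = [2, 5, 8, 12]               # known years of experience (assumed)
salary = [40, 55, 72, 95]            # known salaries in thousands (assumed)
print(lagrange(years, salary, 10))   # estimate the salary at 10 years
```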
This document discusses a project that uses machine learning algorithms to predict potential heart diseases. The project uses a dataset with 13 features and applies algorithms like K-Nearest Neighbors Classifier and Support Vector Classifier, with and without PCA. The K-Nearest Neighbors Classifier achieved the best accuracy score of 87% at predicting heart disease based on the dataset.
The document discusses modelling and evaluation in machine learning. It defines what models are and how they are selected and trained for predictive and descriptive tasks. Specifically, it covers:
1) Models represent raw data in meaningful patterns and are selected based on the problem and data type, like regression for continuous numeric prediction.
2) Models are trained by assigning parameters to optimize an objective function and evaluate quality. Cross-validation is used to evaluate models.
3) Predictive models predict target values like classification to categorize data or regression for continuous targets. Descriptive models find patterns without targets for tasks like clustering.
4) Model performance can be affected by underfitting if the model is too simple or overfitting if it is too complex; techniques such as the cross-validation described above help diagnose both (see the sketch below).
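A minimal sketch of cross-validation as an evaluation tool, assuming scikit-learn; each fold's held-out accuracy estimates generalization and helps expose over- or underfitting.

```python
# Minimal sketch: 5-fold cross-validation of a classifier.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```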
1) Machine learning involves analyzing data to find patterns and make predictions. It uses mathematics, statistics, and programming.
2) Key aspects of machine learning include understanding the business problem, collecting and preparing data, building and evaluating models, and different types of machine learning algorithms like supervised, unsupervised, and reinforcement learning.
3) Common machine learning algorithms discussed include linear regression, logistic regression, KNN, K-means clustering, decision trees, and handling issues like missing values, outliers, and feature engineering.
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi... (IRJET Journal)
This document describes a comparative analysis of GUI-based machine learning approaches for predicting Parkinson's disease. It analyzes various machine learning algorithms including logistic regression, decision trees, support vector machines, random forests, k-nearest neighbors, and naive Bayes. The document discusses data preprocessing techniques like variable identification, data validation, cleaning and preparing. It also covers data visualization and evaluating model performance using accuracy calculations. The goal is to compare the performance of these machine learning algorithms and identify the approach that predicts Parkinson's disease with the highest accuracy based on a given hospital dataset.
This document summarizes a research project that aims to develop an application to predict airline ticket prices using machine learning techniques. The researchers collected over 10,000 records of flight data including features like source, destination, date, time, number of stops, and price. They preprocessed the data, selected important features, and applied machine learning algorithms like linear regression, decision trees, and random forests to build predictive models. The random forest model provided the most accurate predictions according to performance metrics like MAE, MSE, and RMSE. The researchers propose deploying the best model in a web application using Flask for the backend and Bootstrap for the frontend so users can input flight details and receive predicted price outputs.
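A minimal sketch of the evaluation step, assuming scikit-learn and synthetic regression data in place of the flight-fare records:

```python
# Minimal sketch: random forest regression scored by MAE, MSE, and RMSE.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)

model = RandomForestRegressor(n_estimators=200, random_state=3).fit(X_tr, y_tr)
pred = model.predict(X_te)
mse = mean_squared_error(y_te, pred)
print("MAE:", mean_absolute_error(y_te, pred), "MSE:", mse, "RMSE:", np.sqrt(mse))
```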
Performance Comparison of Machine Learning Algorithms (Dinusha Dilanka)
This paper compares the performance of two classification algorithms. It is useful to differentiate algorithms based on computational performance rather than classification accuracy alone: although classification accuracy between the algorithms is similar, computational performance can differ significantly and can affect the final results. The objective of this paper is therefore to perform a comparative analysis of two machine learning algorithms, namely K-nearest neighbor classification and logistic regression. The paper considers a large dataset of 7,981 data points and 112 features and examines the performance of the above-mentioned algorithms. The processing time and accuracy of the different machine learning techniques are estimated on the collected data set, using 60% for training and the remaining 40% for testing. The paper is organized as follows. Section I contains the introduction and background analysis of the research; Section II, the problem statement. Section III briefly describes the application, the data analysis process, the testing environment, and the methodology of the analysis. Section IV comprises the results of the two algorithms. Finally, the paper concludes with a discussion of future research directions that would eliminate the problems in the current research methodology.
IRJET - Prediction of Crime Rate Analysis using Supervised Classification Mach... (IRJET Journal)
This document presents a study that uses machine learning techniques to predict crime rates. Specifically, it aims to analyze crime data using supervised machine learning classification algorithms like decision trees, support vector machines, logistic regression, k-nearest neighbors, and random forests. The document outlines collecting and preprocessing crime data, selecting relevant features, training models on a portion of the data and testing them on the remaining data. It finds that random forest achieved the best prediction accuracy compared to other algorithms tested. The goal is to help law enforcement agencies better predict and reduce crime rates by analyzing historical crime data patterns.
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG... (IAEME Publication)
This paper presents an approach based on an aggregated predictor formed from multiple versions of a multilayer neural network with back-propagation optimization, to help the engineer obtain a list of the most appropriate well-test interpretation models for a given set of pressure/production data. The proposed method consists of three stages: (1) data decorrelation through principal component analysis, to reduce the covariance between the variables and the dimension of the input layer of the artificial neural network; (2) bootstrap replicates of the learning set, where the data are repeatedly sampled with random splits into training sets and these are used as new learning sets; and (3) automatic reservoir model identification through an aggregated predictor formed by a plurality vote when predicting a new class. The method is described in detail to ensure successful replication of the results. The required training and test datasets were generated using analytical solution models; 600 samples were used: 300 for training, 100 for cross-validation, and 200 for testing. Different network structures were tested during this study to arrive at an optimum network design. We notice that the single-net methodology always brings about confusion in selecting the correct model, even though the training results for the constructed networks are close to 1. We also notice that principal component analysis is an effective strategy for reducing the number of input features, simplifying the network structure, and lowering the ANN's training time. The results show that the proposed model performs better when predicting new data, with a coefficient of correlation of approximately 95% compared with 80% for a previous approach; the combination of PCA and ANN is more stable and yields more accurate results with less computational complexity than was previously feasible. Clearly, the aggregated predictor is more stable and shows fewer misclassified cases than the previous approach.
The objective of this investigation is to predict a customer's decision on a car model based on six given features: buying price, maintenance price, number of doors, seating capacity, luggage space, and safety.
This document summarizes a machine learning project to predict insurance claim severities for the Kaggle "Allstate Claims Severity" competition. It describes the dataset, preprocessing steps including one-hot encoding and outlier removal. A deep neural network model was trained using H2O. Hyperparameters were optimized in 4 phases: activation function, network architecture, outliers/epochs, and learning rate. 21 model submissions achieved test MAEs from 1,169 to 1,292, outperforming random forest benchmarks. Rectifier activation, 3 hidden layers of 1000 neurons total, removing outliers, 100 epochs, and a learning rate of 0.0001 produced strong results.
Post Graduate Admission Prediction System (IRJET Journal)
This document presents a post graduate admission prediction system built using machine learning algorithms. The system analyzes factors like GRE scores, TOEFL scores, undergraduate GPA, research experience etc. to predict the universities a student is likely to get admission in. Various machine learning models like multiple linear regression, random forest regression, support vector machine and logistic regression are implemented and evaluated on an admission prediction dataset. Logistic regression achieved the highest accuracy of 97%. A web application called PostPred is developed using the logistic regression model to help students predict suitable universities to apply to based on their profile.
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm (IRJET Journal)
This document discusses using machine learning algorithms to calculate the water quality index of the Ganga River in India. Specifically, it aims to analyze water quality data collected from various cities along the Ganga Riverbed in different seasons (summer, monsoon, winter) and assess whether the river water is potable or not. The researchers designed a machine learning model using the decision tree algorithm that calculates the water quality index based on 9 physicochemical parameters. It will be implemented as a Python-based web application using the Flask framework. The model is trained on collected datasets to predict water quality and determine if it is safe for drinking.
A new model for iris data set classification based on linear support vector m... (IJECEIAES)
1. The authors propose a new model for classifying the iris data set using a linear support vector machine (SVM) classifier with genetic algorithm optimization of the SVM's C and gamma parameters.
2. Principal component analysis was used to reduce the iris data set features from four to three before classification.
3. The genetic algorithm was shown to optimize the SVM parameters, achieving 98.7% accuracy on the iris data set classification, compared to 95.3% accuracy without parameter optimization; a simplified sketch of the pipeline follows.
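A simplified sketch, assuming scikit-learn: a grid search over C and gamma stands in for the paper's genetic algorithm, and an RBF kernel is used so that gamma has an effect (the paper pairs a linear SVM with gamma, so treat the kernel choice as an assumption).

```python
# Minimal sketch: PCA to 3 components, then tune the SVM's C and gamma.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
X3 = PCA(n_components=3).fit_transform(X)  # reduce four features to three

search = GridSearchCV(
    SVC(kernel="rbf"),                      # grid search replaces the GA here
    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, "scale"]},
    cv=5,
)
search.fit(X3, y)
print("best params:", search.best_params_,
      "cross-validated accuracy:", round(search.best_score_, 3))
```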
This document discusses five ways to attain optimal model complexity in machine learning: 1) feature engineering and selection to optimize variables, 2) data augmentation to expand datasets, 3) dimensionality reduction to reduce high-dimensional data, 4) active learning where algorithms query users to label data, and 5) ensemble models that combine multiple models to improve performance over single models. These techniques help improve model performance, efficiency, and ability to learn from data.
The Validity of CNN to Time-Series Forecasting Problem (Masaharu Kinoshita)
To confirm the validity of CNNs for time-series forecasting, RNN, LSTM, and CNN+LSTM models are built and compared using their MSE scores.
In this report, the Google stock datasets obtained from Kaggle are used.
https://github.com/kinopee0219/capstone
This document provides an overview of machine learning concepts including supervised learning, unsupervised learning, and reinforcement learning. It discusses common machine learning applications and challenges. Key topics covered include linear regression, classification, clustering, neural networks, bias-variance tradeoff, and model selection. Evaluation techniques like training error, validation error, and test error are also summarized.
Can data analysis help predict the future of your heart health?
The Boston Institute of Analytics (BIA) presents a collection of student presentations on data analysis projects tackling the critical topic of heart attack prediction.
Join us as we delve into the world of healthcare analytics and explore how data can be harnessed to identify individuals at risk of heart attack. These presentations offer valuable insights for:
Medical professionals seeking to develop preventative healthcare strategies
Individuals interested in understanding their own heart health risks
Data analysts passionate about applying data analysis for social good
Here's what you'll learn by watching these presentations:
The power of data analysis in predicting heart attacks
Various data analysis techniques used for risk assessment
Real-world examples of heart attack prediction models
Insights and findings from the research of dedicated BIA students
Empower yourself and others with the knowledge of heart health prediction. Watch these presentations and unlock the potential of data analysis in saving lives!
visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This document provides an overview of machine learning using Python. It introduces machine learning applications and key Python concepts for machine learning like data types, variables, strings, dates, conditional statements, loops, and common machine learning libraries like NumPy, Matplotlib, and Pandas. It also covers important machine learning topics like statistics, probability, algorithms like linear regression, logistic regression, KNN, Naive Bayes, and clustering. It distinguishes between supervised and unsupervised learning, and highlights algorithm types like regression, classification, decision trees, and dimensionality reduction techniques. Finally, it provides examples of potential machine learning projects.
Performance Comparison of Decision Tree Algorithms to Find Out the Reason for St... (ijcnes)
Educational data mining is used to study the data available in the educational field and bring out the hidden knowledge in it. Classification methods like decision trees and rule mining can be applied to educational data to predict students' behavior. This paper focuses on finding the suitable algorithm that yields the best result in identifying the reason behind students' absenteeism in an academic year. The first step in this process is to gather student data using a questionnaire. The data were collected from 123 undergraduate students at a private college situated in a semi-rural area. The second step is to clean the data so that it is appropriate for mining and to choose the relevant attributes. In the final step, three different decision tree induction algorithms, namely ID3 (Iterative Dichotomiser), C4.5, and CART (Classification and Regression Tree), were applied to the same questionnaire sample, and their results were compared to find the algorithm that best predicts the reason for students' absenteeism.
Top 20 Data Science Interview Questions and Answers in 2023.pdf (AnanthReddy38)
Here are the top 20 data science interview questions along with their answers:
What is data science?
Data science is an interdisciplinary field that involves extracting insights and knowledge from data using various scientific methods, algorithms, and tools.
What are the different steps involved in the data science process?
The data science process typically involves the following steps:
a. Problem formulation
b. Data collection
c. Data cleaning and preprocessing
d. Exploratory data analysis
e. Feature engineering
f. Model selection and training
g. Model evaluation and validation
h. Deployment and monitoring
What is the difference between supervised and unsupervised learning?
Supervised learning involves training a model on labeled data, where the target variable is known, to make predictions or classify new instances. Unsupervised learning, on the other hand, deals with unlabeled data and aims to discover patterns, relationships, or structures within the data.
What is overfitting, and how can it be prevented?
Overfitting occurs when a model learns the training data too well, resulting in poor generalization to new, unseen data. To prevent overfitting, techniques like cross-validation, regularization, and early stopping can be employed.
What is feature engineering?
Feature engineering involves creating new features from the existing data that can improve the performance of machine learning models. It includes techniques like feature extraction, transformation, scaling, and selection.
Explain the concept of cross-validation.
Cross-validation is a resampling technique used to assess the performance of a model on unseen data. It involves partitioning the available data into multiple subsets, training the model on some subsets, and evaluating it on the remaining subset. Common types of cross-validation include k-fold cross-validation and holdout validation.
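A minimal sketch of k-fold cross-validation by hand, assuming scikit-learn's KFold splitter; it mirrors the partition-train-evaluate loop described above.

```python
# Minimal sketch: 5-fold cross-validation written out explicitly.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
accs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))  # held-out accuracy
print("per-fold accuracy:", [round(a, 3) for a in accs])
```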
What is the purpose of regularization in machine learning?
Regularization is used to prevent overfitting by adding a penalty term to the loss function during model training. It discourages complex models and promotes simpler ones, ultimately improving generalization performance.
What is the difference between precision and recall?
Precision is the ratio of true positives to the total predicted positives, while recall is the ratio of true positives to the total actual positives. Precision measures the accuracy of positive predictions, whereas recall measures the coverage of positive instances.
Explain the term “bias-variance tradeoff.”
The bias-variance tradeoff refers to the relationship between a model’s bias (error due to oversimplification) and variance (error due to sensitivity to fluctuations in the training data). Increasing model complexity reduces bias but increases variance, and vice versa. The goal is to find the right balance that minimizes overall error.
Similar to THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUILDING USING MACHINE LEARNING ALGORITHMS (20)
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from discrete devices to today's integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry, and smart vehicles in particular, confronts design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI, and sensors give misleading values, which can prove fatal in the case of automobiles. In this paper, the authors present a non-exhaustive review of research work concerned with the investigation of EMI in ICs and the prediction of this EMI using various modelling methodologies and measurement setups.
ACEP Magazine 4th edition launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
Batteries: Introduction – types of batteries – discharging and charging of a battery – characteristics of a battery – battery rating – various tests on a battery – primary battery: silver button cell – secondary battery: Ni-Cd battery – modern battery: lithium-ion battery – maintenance of batteries – choice of batteries for electric vehicle applications.
Fuel Cells: Introduction – importance and classification of fuel cells – description, principle, components and applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell and direct methanol fuel cells.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3) represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities. Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) algorithms. We employed a recent intrusion detection dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to train and test our model. The results of our experiments show that our CNN-LSTM method is much better at finding smart grid intrusions than other deep learning algorithms used for classification. In addition, our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection accuracy rate of 99.50%.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSIJNSA Journal
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threat and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system. By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUILDING USING MACHINE LEARNING ALGORITHMS
International Journal of Computer Science & Engineering Survey (IJCSES) Vol.10, No.2/3, June 2019
DOI: 10.5121/ijcses.2019.10301
THE IMPLICATION OF STATISTICAL ANALYSIS AND
FEATURE ENGINEERING FOR MODEL BUILDING
USING MACHINE LEARNING ALGORITHMS
Swayanshu Shanti Pragnya and Shashwat Priyadarshi
Fellow of Computer Science Research, Global Journals
Sr. Python Developer, Accenture, Hyderabad
ABSTRACT
Scrutiny for presage is the era of advance statistics where accuracy matter the most. Commensurate
between algorithms with statistical implementation provides better consequence in terms of accurate
prediction by using data sets. Prolific usage of algorithms lead towards the simplification of mathematical
models, which provide less manual calculations. Presage is the essence of data science and machine
learning requisitions that impart control over situations. Implementation of any dogmas require proper
feature extraction which helps in the proper model building that assist in precision. This paper is
predominantly based on different statistical analysis which includes correlation significance and proper
categorical data distribution using feature engineering technique that unravel accuracy of different models
of machine learning algorithms.
KEYWORDS:
Correlation, Feature engineering, Feature selection, PCA, K-nearest neighbour, Logistic regression, RFE
1. INTRODUCTION
Statistical analysis on its own examines data using statistical conventions, but merely analysing data is not sufficient. At this point predictive analysis comes in, which is a part of inferential statistics: we try to infer an outcome for the next dataset by analysing patterns in previous data. When it comes to prediction, the first buzzword is machine learning, which is the way to train a machine to complete a required task.
Here machine learning is used to predict the survival of the passengers in the Titanic disaster. The quality of that prediction depends upon how effectively we can reform the dataset, and enhancing or reforming the data set requires feature extraction. By using the logistic regression technique [9], the prediction accuracy increased to 80.756%. In the actual Titanic disaster, the ship sank in the North Atlantic on 15 April 1912, and 1502 of the 2224 passengers and crew died [1]. Investigation of the reasons behind the sinking, and of which data matter most in the analysis of survival, is still continuing [2], [3]. A data set for analysing the disaster is available on the Kaggle website [4]; Kaggle provides a platform for data analysis and machine learning and awards cash prizes to those who predict most accurately, for encouragement [1]. This paper explains the importance and high usability of extracting features from data sets, and how accurate extraction helps in accurate prediction using machine learning algorithms.
Before going through the topic we need to understand data. Through a study we generally collect different types of information, which is known as data. Data can be numerical (discrete or continuous), categorical, or ordinal. Numerical data represents different types of measurements, such as a person's age or height, or the length of a train. Numerical data is also known as quantitative data.
Discrete data can be counted. For example, if we flip a coin n times, the possible outcomes can be described in a generalized manner as 2^n sequences, where n is the number of flips; for 100 flips that is 2^100. The number of outcomes is finite, so this data is discrete by nature.
Continuous data is not finite; as the name itself suggests, it keeps continuing. For example, the value of pi, 3.14159265358979323..., goes on forever. That is the reason we have to take an approximation when calculating with such continuous data.
Categorical data represents a quality of the data, such as a person's gender or a yes/no answer to a question. Because these are characteristics rather than numbers, we need to convert such data to a numeric format. For example, if the answer to a question is 'yes', we assign 'yes' the value 1 (or some other integer) so that the machine can understand it.
Ordinal data is an amalgamation of numeric and categorical data: the data fall into different categories, but the numbers placed on the categories carry meaning. For example, if we survey 1000 people and ask them to rate the hospitality they received from nurses at a hospital on a scale of 0 to 5, then the average of the 1000 responses has meaning. In this scenario the data would not be considered categorical.
This gives a brief idea of the different types of data and how to recognise them through examples. Since the reason for studying feature extraction is to apply it in the machine learning process, we also need to know the machine learning workflows for both training and test data, as given below.
The process to train data is given below:
Data collection → Data pre-processing → Feature extraction → Model building → Model evaluation → Deployment → Model
The machine learning workflow for the test data set is given below:
Data collection → Data Pre-processing → Feature Extraction → Model → Predictions
Training on data and then testing on data are the steps for implementing any model in machine learning, whether for prediction via regression or for classification, as these two are the main functions of machine learning algorithms.
2. DATA PREPARATION PIPELINE
Here the aim is to show a machine learning (ML) project workflow for building a data preparation pipeline which transforms a Pandas data frame into a NumPy array for training ML models with Scikit-Learn.
3. International Journal of Computer Science & Engineering Survey (IJCSES) Vol.10, No.2/3, June 2019
3
This process includes the following steps.
1. Splitting data into labels and predictors.
2. Mapping the data frame and selecting variables.
3. Encoding categorical variables.
4. Filling missing data.
5. Scaling numeric data.
6. Assembling the final pipeline.
7. Testing the final pipeline.
# Step 1: Splitting data into labels and predictors
import pandas as pd

train_data = pd.read_csv('data/housing_train.csv')
X = train_data.drop(['median_house_value'], axis=1)
y = train_data['median_house_value']
X.info()
X.head(5)

# Step 2: Imports for mapping the data frame
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.preprocessing import Imputer  # SimpleImputer in scikit-learn >= 0.22
from sklearn.pipeline import Pipeline, FeatureUnion

# Step 2 (cont.): Selecting variables - an adapter that pulls the chosen
# columns out of the data frame as a numpy array
class DataFrameAdapter(BaseEstimator, TransformerMixin):
    def __init__(self, col_names):
        self.col_names = list(col_names)
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.col_names].values

# Step 3: Categorical variable encoding (one-hot)
class CategoricalFeatureEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.encoder = OneHotEncoder(sparse=False)
        self.encoder.fit(X)
        return self
    def transform(self, X):
        return self.encoder.transform(X)

# Step 4: Filling missing data
num_data = X.drop(['ocean_proximity'], axis=1)
num_imputer = Imputer(strategy='median')
imputed_num_data = num_imputer.fit_transform(num_data)

# Step 5: Scaling numeric data
numeric_cols = X.select_dtypes(exclude=['object']).columns
numeric_pipeline = Pipeline([
    ('var_selector', DataFrameAdapter(numeric_cols)),
    ('imputer', Imputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

# Step 6: Assemble the final pipeline from the numeric and categorical branches
categorical_pipeline = Pipeline([
    ('var_selector', DataFrameAdapter(['ocean_proximity'])),
    ('encoder', CategoricalFeatureEncoder())
])
data_prep_pipeline = FeatureUnion([
    ('numeric', numeric_pipeline),
    ('categorical', categorical_pipeline)
])

# Step 7: Test the final pipeline
prepared_data = data_prep_pipeline.fit_transform(X)
print('prepared data has {} observations of {} features'.format(*prepared_data.shape))
Fig 1. Steps for data preparation
Data pre-processing includes different types of data modification, such as dummy-value replacement and replacing data values with numeric codes.
Dimensionality reduction is required when implementing machine learning algorithms, because space complexity, along with efficiency, is a factor in any computation. It comprises two parts: feature selection and feature extraction.
Feature selection comprises the wrapper, filter and embedded methods.
Example: to see how performance can be improved, take four different features a, b, c, d and create the equation
a + b + c + d = e
If ab = a + b (feature extraction), then
ab + c + d = e
If we take c = 0 (as a condition), then
ab + d = e (feature selection)
In this example we see how replacing a few values and adding conditions on the features changed and reduced the equation in terms of dimension: initially there were four features (a, b, c, d), and now only two (ab and d) remain.
3. METHODS OF FEATURE EXTRACTION
Any statistical model comprises an equation of the form
Y = β0 + β1X1 + β2X2 + .... + ε
where X1 through Xn are the different features.
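To make this equation concrete, here is a minimal sketch of fitting such a linear model by ordinary least squares with NumPy; the data and coefficient values are synthetic and purely illustrative, not from the paper:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                      # two features X1, X2
beta = np.array([3.0, 1.5, -2.0])             # illustrative beta0, beta1, beta2
eps = rng.normal(scale=0.1, size=100)         # noise term epsilon
y = beta[0] + X @ beta[1:] + eps              # Y = b0 + b1*X1 + b2*X2 + eps

# Estimate the coefficients by least squares, using a design matrix
# with a leading column of ones for the intercept
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)                               # close to [3.0, 1.5, -2.0]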
Need for Feature Extraction:
The need depends upon the number of features.
Fewer features:
1. Easy to interpret
2. Less likely to overfit
3. Lower prediction accuracy
More features:
1. Difficult to interpret, as the number of features is high
2. More likely to overfit
3. Higher prediction accuracy
Feature Selection
It is also known as attribute or variable selection: the process of selecting the attributes that are most relevant to the prediction. In other words, feature selection is the way of selecting a subset of important features for use in model construction.
Difference between dimensionality reduction and feature selection:
Feature selection and dimensionality reduction may seem hazy, but they are different. They share one similarity: both reduce the number of attributes in the given data set. But dimensionality reduction methods also create new combinations of attributes, whereas feature selection methods include or exclude the attributes present in the data set without changing them.
Examples of dimensionality reduction methods are singular value decomposition (SVD) and principal component analysis (PCA).
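To make the contrast concrete, the short sketch below (on toy data, illustrative only) uses scikit-learn classes corresponding to the two approaches: SelectKBest keeps a subset of the original columns unchanged, while PCA returns new linear combinations of all of them.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

np.random.seed(0)
X = np.abs(np.random.randn(50, 4))       # toy data: 50 samples, 4 features
y = (X[:, 0] > 0.5).astype(int)          # toy binary target

X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
X_pca = PCA(n_components=2).fit_transform(X)

# X_sel columns are two of the original features, unchanged;
# X_pca columns are new combinations of all four features.
print(X_sel.shape, X_pca.shape)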
Feature Selection:
It is the process of selecting the features in a data set that contribute most to the output column. Any data set consists of numerous types of data, and not all columns are vital for processing; this is the reason to find features through a selection method.
A further problem is that irrelevant features can decrease the accuracy of a model such as linear regression.
Benefits of Feature Selection:
1. Improvement in accuracy
2. Much less overfitting of the data
3. Lower time complexity (less data leads to faster execution)
Feature Selection for Machine Learning
There are different ways of selecting features in machine learning; they are discussed below.
1. Univariate Selection
Various statistical tests are performed to select the features most correlated with the dependent column.
The SelectKBest class from the scikit-learn library can perform statistical tests to select features. The example below uses the chi-squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset-of-diabetes data set.
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction: score all 8 features and keep the best 4
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.
O/p:
[[ 148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[ 183. 0. 23.3 32. ]
[ 89. 94. 28.1 21. ]
[ 137. 168. 43.1 33. ]]
Fig 2. Univariate selection
2. Recursive Feature Elimination
Recursive Feature Elimination (RFE) works recursively by removing attributes and building a model on those attributes that remain. Here the logistic regression algorithm is used to select the top 3 features.
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction: recursively eliminate features until 3 remain
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

O/p:
Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
Fig 3. Recursive feature elimination using the data set
3. Principal Component Analysis
PCA uses linear algebra to transform the data set into a compressed form. It differs from feature selection techniques: PCA is a dimensionality reduction technique, and the number of dimensions to reduce to can be chosen. The figure below shows an application of PCA.
# Principal Component Analysis
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction: project the 8 features onto 3 principal components
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
Mathematics behind the "import PCA" statement
The data set is represented as a matrix of rows and columns (a set of sample vectors). The steps involved in implementing PCA are as follows:
1. Compute the mean vector: assuming we have N sample vectors M1, ..., MN, the mean is
M = (M1 + M2 + ... + MN) / N
2. Form the mean-adjusted matrix: each vector Mp is replaced by Ȳp = Mp − M, giving Y = (Ȳ1, ..., ȲN); likewise Ȳq = Mq − M for vector q.
3. Compute the covariance matrix, whose (p, q) entry is
C(p, q) = Ȳp · Ȳq (the dot product of Ȳp and Ȳq)
4. Compute the eigenvalues and eigenvectors of the covariance matrix.
5. Represent each sample as a linear combination of the top eigenvectors (the principal components).
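A minimal NumPy sketch of these five steps, on synthetic data rather than the diabetes set, to make the linear algebra concrete:

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)              # N = 100 samples, 3 features
M = X.mean(axis=0)                       # step 1: mean vector
Y = X - M                                # step 2: mean-adjusted matrix
C = (Y.T @ Y) / len(X)                   # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # step 4: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]        # sort components by explained variance
components = eigvecs[:, order[:2]]       # keep the top 2 principal directions
X_reduced = Y @ components               # step 5: project the data onto them
print(X_reduced.shape)                   # (100, 2)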
4. Feature Importance
Bagged decision trees, for example random forests and extra trees, can be used to estimate the importance of features. In the example code below we build an ExtraTreesClassifier for the Pima diabetes data set.
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names for the data set
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]   # slice out columns 0 to 7 as predictors
Y = array[:,8]     # column 8 ('class') is the target
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)    # fit the classifier on the predictors and target
print(model.feature_importances_)
O/p:
[0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537 0.12654214 0.15415431]
Fig. 4 Feature importance scores from the extra trees classifier
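To read these scores more easily, each one can be paired with its column name from the code above; this small convenience sketch is not part of the original example:

# Pair each importance score with the corresponding feature name
for name, score in zip(names[:8], model.feature_importances_):
    print('{}: {:.3f}'.format(name, score))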
4. MODEL IMPLEMENTATION AND ACCURACY ANALYSIS
In Sections 2 and 3 we explained the process of feature extraction, creation and selection, along with fully executable code. In this section we discuss the change in accuracy produced by these techniques. The diabetes data consists of 768 data points with 9 features. Here we first implemented logistic regression without correlation analysis. To know the correlation between the columns we need to find the correlation factors in the data set.
The heat map (Fig. 6) shows that the correlation between plasma glucose concentration and onset of diabetes is high, i.e. 0.8.
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
print("Training set accuracy: {:.3f}".format(logreg001.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(logreg001.score(X_test, y_test)))

Training set accuracy: 0.700
Test set accuracy: 0.703
# Less accuracy (without correlation analysis)
Fig. 5. Logistic regression using diabetes data
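The snippets in this section assume that X_train, X_test, y_train and y_test already exist. Below is a minimal sketch of how such a split could be produced from the Pima arrays X and Y loaded in the earlier examples; the 75/25 ratio and the random seed are assumptions, not values stated in the paper:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed split of the diabetes arrays X and Y from the earlier examples;
# test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0)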
After filling the missing values and selecting the highly correlated columns, we can implement our algorithms to check the accuracy.
Fig 6. Correlation between features
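Fig. 6 itself is not reproduced here. A minimal sketch of how such a correlation matrix and heat map can be produced, assuming the diabetes data frame loaded in the earlier examples and using Matplotlib (the paper does not state which plotting tool was used):

import matplotlib.pyplot as plt

corr = dataframe.corr()                  # pairwise correlations of the 9 columns
plt.imshow(corr, cmap='coolwarm')
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.title('Correlation between features')
plt.show()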
After finding the correlation factors, we modified the train and test data to implement K-NN, as we have 9 features in our data set.
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)  # train on the training split
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test, y_test)))

Accuracy of K-NN classifier on training set: 0.79
Accuracy of K-NN classifier on test set: 0.78
# Improved
CONCLUSION
In the four sections of this paper we discussed the following: the types of data, the steps involved in finding correlation, feature engineering techniques, and the difference between feature extraction and dimensionality reduction. In the final section we implemented the logistic regression technique and obtained an accuracy of 0.70.
After using a simple correlation function and heat-map visualisation, we sorted the data set with its 9 features, and by using the K-nearest neighbour algorithm we succeeded in reaching a model accuracy of 0.78. This shows the importance of selecting features and their impact on improving model accuracy.
We can see the importance of selecting proper features by using statistical methods. Hence, before experimenting with any algorithm, we should carefully check the features, as they clearly impact accuracy.
The objective of this paper was to learn which factors are important for improving model accuracy and which techniques help achieve it. We conclude that selecting proper features, together with reducing their dimension, is correlated with enhanced model accuracy. But this is not the end: accuracy increased by only 11.42%, which is not a major change, meaning there are other factors still to find and fix. Our next work will therefore be on finding the other factors involved in the process of upgrading accuracy.
FUTURE WORK:
In this experiment, implementing a dimensionality reduction technique and a feature selection method proved helpful for increasing model accuracy, but the improvement is rather small. We therefore want to pursue another way of improving accuracy, using normalisation techniques such as min-max scaling, z-score standardisation and row normalisation. Along with these techniques, we will implement different deep learning algorithms for better results. Understanding the factors that help improve accuracy is important, since, as this paper showed, selecting particular features or reducing dimensions are not the only factors; more depth on each feature and development of the training method are vital for improvement.
Our next work will focus fully on normalisation, along with optimisation of particular machine learning algorithms, such as the matrix formulation of logistic regression and the random forest algorithm.
REFERENCES
[1] GE, "Flight Quest Challenge," Kaggle.com. [Online]. Available: https://www.kaggle.com/c/flight2-final. [Accessed: 2-Jun-2017].
[2] "Titanic: Machine Learning from Disaster," Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 2-Jun-2017].
[3] Wiki, "Titanic." [Online]. Available: http://en.wikipedia.org/wiki/Titanic. [Accessed: 2-Jun-2017].
[4] Kaggle, Data Science Community. [Online]. Available: http://www.kaggle.com/. [Accessed: 2-Jun-2017].
[5] Multiple Regression. [Online]. Available: https://statistics.laerd.com/spss-tutorials/multiple-regression-usingspss-statistics.php. [Accessed: 2-Jun-2017].
[6] Logistic Regression. [Online]. Available: https://en.wikipedia.org/wiki/Logistic_regression. [Accessed: 2-Jun-2017].
[7] Consumer Preferences to Specific Features in Mobile Phones: A Comparative Study. [Online]. Available: http://ermt.net/docs/papers/Volume_6/5_May2017/V6N5-107.pdf.
[8] Multiple Linear Regression. [Online]. Available: http://www.statisticssolutions.com/assumptions-of-multiplelinear-regression/. [Accessed: 3-Jun-2017].
[9] T. Chatterjee, "Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine Learning Algorithms," Department of Management Studies, NIT Trichy, Tiruchirappalli, Tamil Nadu, India.