Predicting wine quality using data analytics (Gautam Sawant)
This project develops predictive models using several machine learning algorithms to predict the quality of wines from their components. Winemakers can use this information to design good-quality new wines. I did this project as part of the course MIS 636, Knowledge Discovery in Databases, at Stevens Institute of Technology in Hoboken, New Jersey. I am uploading the slides that were submitted as part of the final presentation, along with the project itself.
Predicting Wine Quality Using Different Implementations of Decision Tree Algo... (Mohammed Al Hamadi)
Using three R packages, tree, rpart, and C50, we try to predict wine quality on a publicly available data set. We then evaluate each package's performance using misclassification error, sensitivity, fall-out, the ROC curve, and the area under the curve (AUC).
Wine Quality Analysis Using Machine Learning (Mahima)
Wine industries use product quality certification to promote their products, and quality is a concern for everyone who consumes the product. With such huge demand, it is not feasible to have experts certify every wine, as that would increase the cost. Machine learning makes it possible to build a model, with a user interface, that predicts wine quality from a selection of the important parameters.
What is pattern recognition, lecture 4 of 6 (Randa Elanwar)
In this series I intend to simplify a beautiful branch of computer science that we as humans use in everyday life without knowing it. Pattern recognition is a sub-branch of computer vision research and is tightly related to digital signal processing as well as machine learning and artificial intelligence.
This document discusses pattern recognition. It defines a pattern as a set of measurements describing a physical object and a pattern class as a set of patterns sharing common attributes. Pattern recognition involves relating perceived patterns to previously perceived patterns to classify them. The goals are to put patterns into categories and learn to distinguish patterns of interest. Examples of pattern recognition applications include optical character recognition, biometrics, medical diagnosis, and military target recognition. Common approaches to pattern recognition are statistical, neural networks, and structural. The process involves data acquisition, pre-processing, feature extraction, classification, and post-processing. An example of classifying fish into salmon and sea bass is provided.
This document provides an overview of classification in machine learning. It discusses supervised learning and the classification process. It describes several common classification algorithms including k-nearest neighbors, Naive Bayes, decision trees, and support vector machines. It also covers performance evaluation metrics like accuracy, precision and recall. The document uses examples to illustrate classification tasks and the training and testing process in supervised learning.
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. The slides cover:
DBSCAN concepts
DBSCAN parameters
DBSCAN connectivity and reachability
The DBSCAN algorithm, with a flowchart and an example
Advantages and disadvantages of DBSCAN
DBSCAN complexity
An outlier-related question and its solution
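The outline above can be made concrete with a minimal from-scratch sketch of the algorithm on toy 2-D points (an illustration under stated assumptions, not the slides' code). A point with at least min_pts neighbors within radius eps is a core point; points density-reachable from it join its cluster, and everything left over is noise, labeled -1.

```python
# Minimal DBSCAN sketch (illustrative only, not the slides' implementation).
from math import dist

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)              # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1                     # noise (may later become border)
            continue
        cluster += 1                           # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(points))
                           if dist(points[j], points[k]) <= eps]
            if len(j_neighbors) >= min_pts:    # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

pts = [(1.0, 1.0), (1.1, 1.0), (0.9, 1.1),     # dense blob 1
       (8.0, 8.0), (8.1, 8.1), (7.9, 8.0),     # dense blob 2
       (50.0, 50.0)]                           # isolated outlier
print(dbscan(pts, eps=0.5, min_pts=3))         # -> [0, 0, 0, 1, 1, 1, -1]
```

The isolated point never accumulates min_pts neighbors, so it keeps the -1 noise label, matching the outlier discussion in the slides.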
This document provides an overview of image processing and various image segmentation techniques. It begins with basics of image processing, types of images, and image formats. Then it discusses basics of image segmentation, including watershed transformation, point detection, and region-based segmentation. It also covers various edge detection methods like first order derivative, second order derivative, and optimal edge detection using Canny edge detection. Specific techniques discussed include Roberts operator, Sobel operator, Prewitt operator, Laplacian, watershed transformation, and region growing segmentation.
Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis (Derek Kane)
This lecture provides an overview of extending the regression concepts brought forth in previous lectures. We will start with a broad overview of the Multivariate Adaptive Regression Splines algorithm and logistic regression, and then explore survival analysis. The presentation culminates with a real-world example of how these techniques can be used in the US criminal justice system.
Introduction to the perceptron and how it is used in machine learning and artificial neural networks.
This presentation was prepared by Zaid Al-husseini as a lecture for third-stage undergraduate students in the Software department, Faculty of IT, University of Babylon, Iraq.
It is publicly available for beginners to learn, in theory and mathematically, how the perceptron works.
Notice: the slides are not detailed and need a teacher to explain them in depth.
Brain Tumor Segmentation using Enhanced U-Net Model with Empirical Analysis (MD Abdullah Al Nasim)
Brain cancer is deadly and requires careful surgical segmentation. The brain tumors were segmented with a U-Net, a convolutional neural network (CNN) architecture. When looking for overlaps of necrotic, edematous, enhancing, and healthy tissue, it can be hard to extract relevant information from the images. The 2D U-Net was improved and trained on the BraTS datasets to find these four regions. U-Net can set up many encoder and decoder routes that extract information from the images in different ways. To reduce computation time, we use image segmentation to exclude insignificant background detail. Experiments show that our proposed model for segmenting brain tumors from MRI scans works well. We demonstrate that the BraTS 2017, 2018, and 2020 datasets do not differ significantly from the BraTS 2019 dataset, on which we attained dice scores of 0.8717 (necrotic), 0.9506 (edema), and 0.9427 (enhancing).
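Since the summary quotes dice scores, here is a hedged sketch of how the Dice coefficient is typically computed on binary segmentation masks (toy arrays; the paper's actual evaluation code is not shown in the source):

```python
# Dice coefficient on toy binary masks: 2*|A∩B| / (|A| + |B|).
import numpy as np

def dice(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

pred  = np.array([[1, 1, 0], [0, 1, 0]])   # predicted tumor mask
truth = np.array([[1, 0, 0], [0, 1, 1]])   # ground-truth mask
print(dice(pred, truth))                    # 2*2 / (3+3) = 0.666...
```

A score of 1.0 means the predicted and ground-truth regions overlap perfectly, which is the scale on which the quoted 0.87 to 0.95 figures should be read.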
Expressing and recognizing human emotional behavior plays an important role in communication systems, and facial expression analysis is the most expressive way to display human emotion. Three types of facial emotion are recognized and classified: happy, sad, and angry. Depending on the emotion, the music player plays songs accordingly, which eliminates the time-consuming and tedious task of manually segregating songs into different lists and helps generate an appropriate playlist based on the individual's emotional features. This paper implements an efficient extraction of facial points using Bézier curves, which is well suited to mobile devices. To enhance audio volume and work around memory shortages, a group-play concept is also introduced: several Android phones communicate with each other over peer-to-peer connections using Wi-Fi Direct. The player also includes editing features such as audio trimming and precise voice recording, and the customized audio file can later be set as an alarm, ringtone, or notification tone. Since all the emotion recognition is done on real-time images, it outperforms existing face recognition algorithms and music player applications.
This document compares several graphical user interfaces (GUIs) for R, including BlueSky, Deducer, jamovi, JASP, R AnalyticFlow, Rattle, RKWard, and R-Instat. It ranks them based on features related to ease of use, general usability, graphics capabilities, and analytics. BlueSky and R-Instat score highest for their data wrangling features. JASP is strongest for Bayesian analysis and machine learning. Rattle focuses on machine learning/AI. RKWard provides an advanced integrated development environment. The document also notes strengths and limitations of each GUI.
ANOMALY DETECTION IN INTELLIGENT TRANSPORTATION SYSTEM using real-time video... (MrMoliya)
The document discusses anomaly detection in intelligent transportation systems using real-time video processing and deep learning. It aims to identify anomalies like improper driving, illegal road usage, overspeeding, and traffic light violations. The proposed method involves developing an architectural model that can automatically identify anomalies in real-time from any camera footage. Milestones and Gantt charts are provided to outline the research review process and project timelines from 2022 to 2023. The goal is to address current research gaps and lack of efficient systems for anomaly detection in the Indian context.
Keystroke dynamics, or typing dynamics, is the detailed timing information that describes exactly when each key was pressed and when it was released as a person is typing at a computer keyboard.
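This timing information is commonly summarized as dwell times (how long each key is held down) and flight times (the gap between releasing one key and pressing the next). A sketch on made-up event data:

```python
# Hypothetical (key, press_ms, release_ms) events for typing "cat".
events = [("c", 0, 80), ("a", 130, 200), ("t", 260, 330)]

# Dwell time: release minus press for each key.
dwell = [release - press for _, press, release in events]

# Flight time: next key's press minus this key's release.
flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]

print(dwell)   # [80, 70, 70]
print(flight)  # [50, 60]
```

Vectors like these, collected over many typing sessions, are what keystroke-dynamics systems compare to authenticate a user.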
A fingerprint, in the narrow sense, is an impression left by the friction ridges of a human finger. Human fingerprints are detailed, unique, difficult to alter, and durable over the life of an individual, making them suitable as long-term markers of human identity. They may be employed by police or other authorities to identify individuals who wish to conceal their identity, or to identify people who are incapacitated or deceased and thus unable to identify themselves, as in the aftermath of a natural disaster. When this is applied to authentication, it creates an excellent lock system for security purposes. Here we discuss what information security is, how it is enhanced with the help of fingerprints, and its advantages and disadvantages. You will get an overall end-to-end picture after going through all the slides. I presented this topic in the seminar for completion of my M.Tech in Computer Science & Engineering.
Recommender systems support the decision making processes of customers with personalized suggestions. These widely used systems influence the daily life of almost everyone across domains like ecommerce, social media, and entertainment. However, the efficient generation of relevant recommendations in large-scale systems is a very complex task. In order to provide personalization, engines and algorithms need to capture users’ varying tastes and find mostly nonlinear dependencies between them and a multitude of items. Enormous data sparsity and ambitious real-time requirements further complicate this challenge. At the same time, deep learning has been proven to solve complex tasks like object or speech recognition where traditional machine learning failed or showed mediocre performance.
Join Marcel Kurovski to explore a use case for vehicle recommendations at mobile.de, Germany’s biggest online vehicle market. Marcel shares a novel regularization technique for the optimization criterion and evaluates it against various baselines. To achieve high scalability, he combines this method with strategies for efficient candidate generation based on user and item embeddings—providing a holistic solution for candidate generation and ranking.
The proposed approach outperforms collaborative filtering and hybrid collaborative-content-based filtering by 73% and 143%, respectively, on MAP@5. It also scales to millions of items and users, returning recommendations in tens of milliseconds.
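For reference, a hedged sketch of the MAP@5 metric quoted above (definitions of AP@k vary; this version divides by min(|relevant|, k), and all names and data are illustrative, not from the talk):

```python
# Mean Average Precision at k over a set of users (toy data).

def average_precision_at_k(recommended, relevant, k=5):
    """Mean of precision@i over the ranks i where a relevant item appears."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k=5):
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Two users: ranked recommendations vs. items they actually interacted with.
recs = [["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]]
truth = [{"a", "c"}, {"y"}]
print(round(map_at_k(recs, truth, k=5), 4))   # (5/6 + 1/2) / 2 = 0.6667
```

Relative improvements like the quoted 73% and 143% are ratios of MAP@5 values computed this way across competing models.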
Event: O'Reilly Artificial Intelligence Conference, New York, 18.04.2019
Speaker: Marcel Kurovski, inovex GmbH
More tech talks: inovex.de/vortraege
More tech articles: inovex.de/blog
Introduction to the Bayesian classifier: it describes the basic algorithm and applications of Bayesian classification, explained with the help of numerical problems.
This document provides an overview of data mining and the Orange software tool for data mining. It defines data mining as the process of analyzing data from different perspectives to summarize it into useful information. It then discusses major data mining tasks like classification, clustering, deviation detection, and forecasting. It also introduces the concepts of data warehouses and decision trees. The document proceeds to describe Orange, an open-source software for visual data mining and analytics. Orange contains various widgets that can be used for data preprocessing, visualization, and machine learning algorithms. Finally, the document demonstrates some Orange widgets and provides references for further information.
This document discusses decision tree regression for predicting salary based on position level. It shows how to import data, build a decision tree regression model using scikit-learn in Python and rpart in R, make predictions, and plot the results. It notes that decision trees are discrete models, so the plots need to treat the x-axis as discrete rather than continuous to properly visualize the model's piecewise constant predictions.
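The scikit-learn half of what the document describes can be sketched on a toy position-level vs. salary table (the values below are made up, though this is a common course example, and 6.5 is the usual query level):

```python
# Decision tree regression on a tiny level-vs-salary table (made-up data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

level = np.arange(1, 11).reshape(-1, 1)              # position levels 1..10
salary = np.array([45, 50, 60, 80, 110, 150,
                   200, 300, 500, 1000]) * 1000.0

model = DecisionTreeRegressor(random_state=0).fit(level, salary)

# With no depth limit the tree fits each level exactly, so the prediction
# function is piecewise constant: 6.5 lands on the step belonging to level 6.
print(model.predict([[6.5]])[0])   # 150000.0
```

This piecewise-constant behavior is exactly why, as the document notes, the plot must treat the x-axis as a fine grid of discrete steps rather than a smooth curve.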
1) The 1R algorithm generates a one-level decision tree by considering each attribute individually and assigning the majority class to each branch. It chooses the attribute with the minimum classification error.
2) Naive Bayes classification assumes attributes are independent and calculates the probability of each class using Bayes' rule. It handles missing and numeric attributes.
3) Decision tree algorithms like ID3 use a divide-and-conquer approach, recursively splitting the data on attributes that maximize information gain or gain ratio at each node.
4) Rule-based algorithms like PRISM generate rules to cover instances of each class sequentially, maximizing the ratio of correctly covered to total covered instances at each step.
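Point 1, the 1R algorithm, can be sketched from scratch on a toy weather table (column names and data are illustrative, not from the document): each attribute value predicts its majority class, and the attribute with the fewest training errors wins.

```python
# From-scratch 1R sketch: one-level decision tree on one attribute.
from collections import Counter

def one_r(rows, attributes, target):
    best = None
    for attr in attributes:
        # Count target classes for each value of this attribute.
        by_value = {}
        for row in rows:
            by_value.setdefault(row[attr], Counter())[row[target]] += 1
        # Each value predicts its majority class; errors = minority counts.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

weather = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
]
attr, rule, errors = one_r(weather, ["outlook", "windy"], "play")
print(attr, rule, errors)   # outlook wins with 1 misclassified row
```

Here "outlook" misclassifies only one row while "windy" misclassifies two, so 1R keeps the outlook rule, the "minimum classification error" criterion from point 1.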
The document discusses recommender systems and describes several techniques used in collaborative filtering recommender systems including k-nearest neighbors (kNN), singular value decomposition (SVD), and similarity weights optimization (SWO). It provides examples of how these techniques work and compares kNN to SWO. The document aims to explain state-of-the-art recommender system methods.
The document discusses K-means clustering, an unsupervised machine learning algorithm that partitions observations into k clusters where each observation belongs to the cluster with the nearest mean. It describes how K-means aims to minimize intra-cluster similarity while maximizing inter-cluster similarity. The algorithm works by first selecting k random cluster centroids, then iteratively reassigning observations to the closest centroid and recalculating the centroids until convergence is reached. It also addresses computational complexity, extensions, tools for implementing K-means, and examples of applications like image compression, recommendation systems, and yield management.
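The loop described above, assignment of each observation to its nearest centroid followed by recomputing each centroid as its cluster's mean, can be sketched in one dimension (toy data; production code would use k-means++ seeding and a convergence test rather than a fixed iteration count):

```python
# Minimal 1-D K-means sketch of the assign/update loop.

def kmeans(xs, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for x in xs:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid becomes its cluster's mean
        # (empty clusters keep their old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centroids=[0.0, 5.0]))
# -> [2.0, 11.0]
```

After two passes the assignments stop changing, which is the convergence condition the document refers to; the surviving centroids are the means of the two obvious groups.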
The document discusses visual pattern recognition and the design and implementation of visual pattern classifiers. It describes the common steps in designing a statistical visual pattern classifier, which include defining the problem, extracting relevant features, selecting a classification method, selecting a dataset for training and testing, training the classifier on a subset of images, testing the classifier, and refining the solution. It also defines what patterns and pattern classes are in the context of pattern recognition.
Building a strong brand takes intelligence, creativity, hard work, and a bit of luck. The document provides information on brand development and initial design concepts for several brands, including an Italian restaurant called Viale inspired by New York City's Soho arts district, an upscale lounge called Seahorse Lounge taking design cues from champagne and seahorses, and a fine cigar shop for a Las Vegas casino called Humidor Fine Cigars, aiming to deliver the finest cigars in a believable brand.
Samantha enjoys rollerblading because she loves the feeling of speed and the sensation of flying. She also enjoys watching comedies because they make her laugh and relax. Additionally, she does hatha yoga occasionally with her mother as it is a gentle exercise that improves flexibility, concentration, and breath control while providing benefits to both body and mind. Samantha also has a passion for shopping and enjoys walking through stores and leaving with purchases.
Chapter 11 vocabular words and guided notes10cline101
This document provides vocabulary words and guided notes for a chapter on politics and reform from 1870-1900. It includes terms related to stalemate in Washington between Republicans and Democrats, the rise of populism among farmers seeking solutions to economic hardship, and the rise of segregation in the post-Reconstruction South. Questions prompt definitions of terms and short answers about key events and figures from the time period.
The document outlines a teaching plan for a 9th standard science class on minerals and water. The plan aims to develop students' factual, conceptual, and procedural knowledge through group discussions, explanations, and questioning. Students will learn about key terms like minerals and water, facts about their importance, and concepts like minerals providing nutrients for living things. The class will involve students observing mineral and water samples, communicating their roles, and explaining their interconnections through various activities and a follow up assignment. The goal is for students to understand minerals and water are essential resources and gain scientific attitudes.
Weka is a popular suite of machine learning software written in Java. It was developed at the University of Waikato in New Zealand and is free to use under the GNU license. Weka allows users to preprocess and analyze data, build models, and perform predictive analytics. It includes an easy to use graphical user interface that allows users to load datasets, run machine learning algorithms, and evaluate experimental results.
Empirical Study on Classification Algorithm For Evaluation of Students Academ...iosrjce
Data mining techniques (DMT) are extensively used in educational field to find new hidden patterns
from student’s data. In recent years, the greatest issues that educational institutions are facing the unstable
expansion of educational data and to utilize this information data to progress the quality of managerial
decisions. Educational institutions are playing a prominent role in the public and also playing an essential role
for enlargement and progress of nation. The idea is predicting the paths of students, thus identifying the student
achievement. The data mining methods are very useful in predicting the educational database. Educational data
mining is concerns with improving techniques for determining knowledge from data which comes from the
educational database. However it has issue with accuracy of classification algorithms. To overcome this
problem the higher accuracy of the classification J48 algorithm is used. This work takes consideration with the
locality and the performance of the student in education in order to analyse the student achievement is high over
schooling or in graduation
The document discusses a study that evaluates the performance of the Naive Bayes and J48 classification algorithms on Swahili tweets. The study collected 276 tweets from popular Tanzanian Twitter accounts and preprocessed the data to remove non-Swahili words. The preprocessed data was then analyzed using the Naive Bayes and J48 algorithms in WEKA. The algorithms were evaluated based on accuracy, precision, recall, and ROC curve. The results showed that the Naive Bayes algorithm performed better than J48, achieving higher accuracy (36.96% vs 34.78%) and higher values for the other evaluation metrics as well. Therefore, the study found that Naive Bayes classification works better than J48 for classifying Swahili tweets.
This document discusses the data cleaning process for a dataset combining Reddit news headlines and Dow Jones Industrial Average (DJIA) values. The dataset contained issues like special characters in headlines and multiple word forms referring to the same concept. To address this, the data was cleaned by removing special characters, stemming words to their root forms, and removing stopwords. Keywords were identified by creating a corpus from the cleaned headlines and calculating term frequencies. Two frequency ranges containing about 50 terms each were selected for further analysis against the DJIA rise/fall classification attribute. The goal of the data cleaning was to extract keyword counts that could help predict the DJIA classification.
The document provides details on developing a new mobile application for Metrolink transit. It discusses weaknesses in the current application and outlines features and justifications for the new application being developed. The key points are:
1) The new application aims to remove unnecessary features from the current Metrolink app and add useful new functionality like ticket purchasing that users want.
2) Features like real-time updates, multiple ticket purchasing, QR codes for tickets, and alarms for arrival times are described and their benefits outlined.
3) Making the app free, reducing time spent on tasks, and providing relevant location-based information are emphasized as ways to enhance usability and encourage more users.
Abstract - Various aspects of three proposed architectures for distributed software are examined. A Crucial need to
create an ideal model for optimal architecture which meets the needs of the organization for flexibility, extensibility
and integration, to fulfill exhaustive performance for potential talents processes and opportunities in the corporations
a permanent and ongoing need. The excellence of the proposed architecture is demonstrated by presenting a rigor scenario based proof of adaptively and compatibility of the architecture in cases of merging and varying organizations, where the whole structure of hierarchies is revised.
Keywords: ERP, Data-centric architecture, architecture Component-based, Plug in architecture, distributed systems
This document describes a data mining project to analyze bank marketing data using classification algorithms. The goal is to predict whether a client will subscribe to a term deposit based on their attributes. The dataset contains information on over 61,000 customers. Pre-processing steps like data cleaning, handling missing values, removing outliers, and scaling are performed. Algorithms like Naive Bayes, J48 decision trees, and PART are applied and their results compared to determine the best predictor of subscriptions.
The use of branches within a version control system is a risk management technique. They are commonly used to minimise the risk of unanticipated side-effects when releasing critical changes to production, or to minimise the disruption to developer productivity when making changes to the base product. But branching is not the only means of managing risk and that is what this talk addresses – the forces that drives the use of branches, what are their problems and what are the alternatives.
Classification and Clustering Analysis using Weka Ishan Awadhesh
This Term Paper demonstrates the classification and clustering analysis on Bank Data using Weka. Classification Analysis is used to determine whether a particular customer would purchase a Personal Equity PLan or not while Clustering Analysis is used to analyze the behavior of various customer segments.
The document is a group report for redesigning the Metrolink mobile app. It includes:
- An executive summary outlining flaws in the current app and the goals of redesigning it.
- Details of the methodology used including gathering user requirements, collaborating on ideas, creating navigation diagrams and prototypes.
- Usability evaluations were conducted on the prototypes including SUS surveys, Nielsen's heuristics analysis and heuristic statements.
- The final redesigned app is presented through storyboards and a high-level prototype, addressing user needs through new features like smart journeys, location-based notifications and a rewards system.
This document discusses developing classifiers for a census income dataset using the WEKA data mining tool. It summarizes preprocessing steps like handling missing values, removing outliers, and balancing class distributions. It then evaluates various classifiers like J48 decision trees and k-nearest neighbors (IBk) on the preprocessed data. The best performing model was an ensemble "vote" classifier that combined the predictions of J48, IBk, and logistic regression models, achieving 87.3% accuracy and an ROC area of 0.947.
This document discusses a project analyzing a mushroom dataset containing descriptions of 23 species of gilled mushrooms to determine if they are edible or poisonous. The dataset includes 8,124 instances with 22 attributes describing intrinsic mushroom characteristics and external factors. A decision tree model achieved 100% accuracy in classification by considering attributes like odor, habitat, population density, and cap/stalk features. Further analysis found odor and spore print color were the most statistically significant but not practical for identification in the field. Future work aims to develop easier visual classification methods using attributes like habitat, population density, and cap/stalk characteristics.
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...AbdullaAlAsif1
The pygmy halfbeak Dermogenys colletei, is known for its viviparous nature, this presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study delves into the examination of fecundity and the Gonadosomatic Index (GSI) in the Pygmy Halfbeak, D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that the Pygmy halfbeak, D. colletei, may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study utilizing 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study lends to a better understanding of viviparous fish in Borneo and contributes to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Exposé invité Journées Nationales du GDR GPL 2024
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Current Ms word generated power point presentation covers major details about the micronuclei test. It's significance and assays to conduct it. It is used to detect the micronuclei formation inside the cells of nearly every multicellular organism. It's formation takes place during chromosomal sepration at metaphase.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
Abstract
This study aims to highlight the differences between two classification algorithms. Both algorithms are applied to a dataset of white Portuguese wine and used to classify human wine taste preferences based on physicochemical properties. The algorithms studied here are J48 (C4.5) and AdaBoostM1(J48). AdaBoostM1(J48) starts from the basic J48 and attempts to improve it, which makes it interesting to see how it performs in a real-world application. Since AdaBoostM1(J48) adds complexity, a trade-off must be made between its added value and its added cost. This study provides clarity on its added value in classifying white Portuguese wine.
1. Introduction
The classification task is a form of supervised learning: a dataset labeled with the correct class is first specified, and a classifier is trained on this set with the goal of generalizing its model to other data. Different classifiers (learning algorithms) exist for this task, each with its advantages and disadvantages. Increasingly complex algorithms with increasing accuracy have been developed, and computational power has grown strongly over the years. Therefore, much attention typically goes to the accuracy of an algorithm. Currently, however, many smart mobile devices and services are being developed; for these less powerful mobile applications the context is different, and power and efficiency become important.

The algorithms under study here are J48 and AdaBoostM1(J48). J48 is a relatively simple tree-building algorithm, very popular thanks to its good performance and its understandable models. AdaBoost attempts to boost the performance of an underlying algorithm, in this case J48. This boosting comes at the cost of increased complexity. For the average dataset, AdaBoost has shown better performance at the cost of computational power (Quinlan, 2006). For it to be relevant, the gain should outweigh the cost, which of course depends on the context.

The study is organized as follows: section 2 discusses the dataset and its attributes, followed by the algorithms that are applied to it. Section 5.1 then discusses the preprocessing steps the dataset underwent in order to increase the performance of both algorithms. Once applied, the results are followed by a brief discussion in section 6.
2. Dataset
The dataset concerns wine: more precisely, it contains both physicochemical properties and sensory data of red and white Portuguese wine (vinho verde). It was collected between May 2004 and February 2007 to study three regression techniques: support vector machines, multiple regression and neural networks (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009). Since the sensory data of the two types of wine reflect completely different tastes, the authors decided to split the dataset in two: a red dataset and a white dataset. The experiments below are based solely on the white dataset, since the goal is not to compare the results of both types of wine but rather to compare different algorithms. For the white dataset, 4898 instances were collected with 12 attributes each.
2.1. Attributes
There are eleven physicochemical properties recorded: fixed acidity, volatile acidity, citric
acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates
and alcohol.
Table 1 : Attribute characteristics white wine dataset
The acidity of the wine protects it from bacteria during the fermentation process. A distinction should be made between the amount of acidity and the strength of the acidity. The amount is measured in g/l, whereas the strength is measured in pH. Most wines show pH values between 2.9 and 3.9; the higher the wine's acidity, the lower the pH value.

Three main acids can be found in wine grapes: tartaric, malic and citric acid. These are fixed acids that contribute to the quality of the wine, the distinct taste-shaping during winemaking and the aging of the wine. Tartaric acid is important for maintaining the wine's chemical stability and color; its concentration in the wine grape varies with the soil and grape type. Malic acid is not measured in the dataset. Citric acid is also present, but in much smaller concentrations (about 1/20 of tartaric acid). It can be found in many citrus fruits and gives a strong citric taste. Extra citric acid can be added, but with caution: certain bacteria
Attribute                              Min     Max      Mean     StdDev
Fixed acidity (g(tartaric acid)/l)     3.80    14.20    6.86     0.84
Volatile acidity (g(acetic acid)/l)    0.08    1.10     0.28     0.10
Citric acid (g/l)                      0.00    1.66     0.33     0.12
Residual sugar (g/l)                   0.60    65.80    6.39     5.07
Chlorides (g(sodium chloride)/l)       0.01    0.35     0.05     0.02
Free sulfur dioxide (mg/l)             2.00    289.00   35.31    17.01
Total sulfur dioxide (mg/l)            9.00    440.00   138.36   42.50
Density (g/ml)                         0.99    1.04     0.99     0.00
pH                                     2.72    3.82     3.19     0.15
Sulphates (g(potassium sulphate)/l)    0.22    1.08     0.49     0.11
Alcohol (vol.%)                        8.00    14.20    10.51    1.23
3. Algorithms
Three algorithms were chosen based on their interesting aspects. First of all, ZeroR is included as a baseline algorithm; any smart algorithm should perform better than it in order to be useful. J48 is included as a tree-building algorithm, because trees are easy to understand and to model. A disadvantage of trees is that they are prone to overfitting. To improve the J48 algorithm, AdaBoostM1 is also included: a meta-algorithm that must be combined with a base algorithm. Here it is combined with J48 to test its improvements.
3.1. ZeroR
Commonly referred to as the baseline algorithm, ZeroR is the simplest classification method. It is based on a frequency table: it looks at the majority class and predicts that class all the time. It thus relies solely on the class attribute and ignores all other attributes. ZeroR has no predictive power but is useful as a benchmark for other classification methods.
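As a sketch of the idea (not Weka's implementation), ZeroR can be written in a few lines of Python:

```python
from collections import Counter

def zero_r_train(labels):
    """Return the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]

def zero_r_predict(majority_class, instances):
    """Predict the majority class for every instance, ignoring all attributes."""
    return [majority_class] * len(instances)

# Toy example with hypothetical quality labels
train_labels = [5, 6, 6, 7, 6, 5]
majority = zero_r_train(train_labels)                     # 6
predictions = zero_r_predict(majority, [[0.1], [0.3]])    # [6, 6]
```

On an imbalanced dataset such as ours, this trivial predictor can already reach a deceptively high accuracy, which is exactly why it is a useful baseline.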
3.2. J48 (C4.5)
J48 is an open-source Java implementation of the C4.5 algorithm. C4.5 is a widely popular statistical classifier that generates trees using the concept of information gain. It is an improvement over the ID3 algorithm, developed earlier by Ross Quinlan (Quinlan, 1992).

Basically, it generates a decision tree in a top-down manner. Using a training set, at each stage it uses a greedy approach to look for the attribute that best splits the set into subsets. Let T be the set of instances associated with a stage. Each attribute is evaluated separately on T using a metric derived from entropy: information gain (more precisely, the gain ratio, which is the attribute's information gain divided by its split information, i.e. the entropy of the attribute's own value distribution). The attribute providing the highest gain ratio is selected as a node in the tree. For each subset, this process is repeated until the subset contains only samples from the same class or until the minimum number of leaf objects is reached. A simplified pseudo-code of the tree-construction algorithm from C4.5 is included below (Salvatore, 2000).
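The gain-ratio computation used for attribute selection can be sketched in Python for a nominal attribute (an illustrative sketch, not C4.5's actual code, which also handles numeric attributes and missing values):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of splitting `labels` by the nominal attribute
    `values`, divided by the split information (the entropy of the
    attribute's own value distribution)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Weighted entropy remaining after the split
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0
```

A perfectly informative split yields a gain ratio of 1.0, while an attribute that tells nothing about the class yields 0.0.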
AdaBoost boosts the performance of the original learning algorithm on the training data. To do this, it iteratively uses a weighted training set (Freund & Schapire, 1996; Quinlan, 2006; Schapire, 2013). Each instance weight wj reflects the importance of the respective instance and starts at the same value (1/N) for all instances. The first hypothesis is built from this set. An error is calculated as the sum of the weights of all misclassified instances. All correctly classified instances are then given lower weights, and a new hypothesis is generated using these new weights. This process is repeated until there are T hypotheses, with T being an input to AdaBoost. All hypotheses can be seen as committee members whose vote weights z are a function of their accuracy on the training set. The final hypothesis is based on the majority of the weighted votes: the algorithm sums the votes, taking their weights into account. Pseudo-code for this algorithm can be seen below.

For clarity, the normalization steps and the steps to calculate the boosted classifier are only sketched. Also, the edge cases where the error equals 0 or exceeds 0.5 are not shown in the code, although they require different steps. When there is no error, obviously no extra trials should be performed and T should be set to t. The error rate of the boosted algorithm approaches 0 as T increases, but only if the error rate of the individual trials stays below 0.5. Therefore, when error > 0.5, the trials should be ended and T replaced by t-1. AdaBoost thus assumes that the simple classifiers perform better than pure guessing; this is known as the weak learning condition.
-------------------------------------------------------------------------------------------------
function AdaBoost(examples, L, T) returns a weighted-majority hypothesis
  inputs: examples, set of N labeled examples (x1, y1), ..., (xN, yN)
          L, a 'simple' learning algorithm
          T, the number of hypotheses (trials / iterations) in the ensemble
  local variables: w, a vector of N example weights
                   h, a vector of T hypotheses
                   z, a vector of T hypothesis weights

  for n = 1 to N do
      w[n] ← 1/N
  for t = 1 to T do
      h[t] ← L(examples, w)
      error ← 0
      for n = 1 to N do
          if h[t](xn) ≠ yn then error ← error + w[n]
      for n = 1 to N do
          if h[t](xn) = yn then w[n] ← w[n] · error / (1 - error)
      w ← NORMALIZE(w) (so that the sum of the weights equals 1)
      z[t] ← log((1 - error) / error)
  return WEIGHTED-MAJORITY(h, z)
-------------------------------------------------------------------------------------------------
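A minimal runnable Python version of this pseudo-code is sketched below. It is an illustration under simplifying assumptions, not Weka's AdaBoostM1: a hypothetical one-dimensional decision stump stands in for J48 as the weak learner, and labels are assumed to be in {-1, +1}.

```python
import math

def stump_learner(examples, weights):
    """Weak learner: pick the threshold/sign stump with the lowest
    weighted error on (x, y) examples, y in {-1, +1}."""
    best = None
    for thr in sorted({x for x, _ in examples}):
        for sign in (1, -1):
            err = sum(w for (x, y), w in zip(examples, weights)
                      if (sign if x >= thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: (sign if x >= thr else -sign)

def adaboost_m1(examples, learner, T):
    """AdaBoostM1 following the pseudo-code above: reweight instances,
    collect up to T hypotheses, vote with weight log((1-err)/err)."""
    n = len(examples)
    w = [1.0 / n] * n
    hypotheses, z = [], []
    for _ in range(T):
        h = learner(examples, w)
        err = sum(wi for (x, y), wi in zip(examples, w) if h(x) != y)
        if err == 0 or err >= 0.5:      # edge cases discussed in the text
            if err == 0:
                hypotheses.append(h)
                z.append(1.0)
            break
        for i, (x, y) in enumerate(examples):
            if h(x) == y:               # shrink weights of correct instances
                w[i] *= err / (1 - err)
        total = sum(w)
        w = [wi / total for wi in w]    # NORMALIZE step
        hypotheses.append(h)
        z.append(math.log((1 - err) / err))

    def classify(x):                    # weighted-majority vote
        s = sum(zi * h(x) for h, zi in zip(hypotheses, z))
        return 1 if s >= 0 else -1
    return classify
```

For example, on a linearly separable toy set such as [(0, -1), (1, -1), (2, 1), (3, 1)] a single stump already reaches zero error and the loop stops early, as the pseudo-code prescribes.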
The accuracy improvements are tested using a paired T-test, which assumes that the results from both datasets are independent and normally distributed. These assumptions are fulfilled: the results on the dataset without outliers are independent of the results before deleting them, and by setting the Weka experimenter to perform 30 iterations the distribution of results can be approximated by a normal distribution. This is done for all experiments in this study. The algorithms in this experiment, and in the experiments during further preprocessing steps, are applied using Weka's default values. The models are trained and evaluated with 10-fold stratified cross-validation. Compared to normal cross-validation, stratified cross-validation has the benefit that every fold is a good representation of the dataset: the folds are selected so that each fold has approximately the same class distribution, which has been shown to reduce the variance of the estimated accuracy. The results are given below.
Figure 6 : Performance difference deleting outliers
No significant improvements are found for either J48 or AdaBoost at a 95% confidence level (two-tailed). Except for the baseline algorithm ZeroR, deleting outliers has no visible effect on the accuracy or the standard deviation of the different algorithms.
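The stratified fold assignment used above can be sketched in a few lines of Python (an illustrative sketch; Weka handles stratification internally, and a real implementation would shuffle the instances first):

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each instance index to one of k folds so that every fold
    has roughly the same class proportions as the whole dataset."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):   # deal each class round-robin
            folds[pos % k].append(idx)
    return folds
```

For instance, with 20 instances of class 0 and 10 of class 1, every one of the 10 folds receives exactly two instances of class 0 and one of class 1, mirroring the 2:1 ratio of the full dataset.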
5.1.3. Problem: Imbalanced dataset
When the separate classes are not equally represented, the dataset is imbalanced. An imbalanced dataset can lead to overfitting and underperforming algorithms. Our dataset is severely imbalanced, with the number of instances ranging from 5 in the minority class up to 2188 in the majority class; extreme quality scores are rare compared to the mediocre classes. This problem can be solved by resampling, either by deleting instances from the over-represented class (under-sampling) or by adding copies of instances from the under-represented class or synthetically creating such instances (over-sampling). Generally, it is better to over-sample unless plenty of data is available. Over-sampling has some disadvantages, however. It enlarges the dataset, increasing the processing time needed to build a model, and since the class boundaries are not taken into account it may cause overgeneralization. Taken to extremes, oversampling can lead to overfitting (Drummond & Holte, n.d.; Rahman & Davis, 2013).

Another option would be to keep the imbalanced dataset but wrap the learning algorithms in a penalization scheme, which adds an extra cost to misclassifying a minority class. This, however, changes the algorithms that are to be compared, making comparisons less intuitive. Therefore, sampling is preferred.
In Weka, sampling can be achieved by applying the supervised SMOTE filter (Chawla, 2005), which resamples the dataset using the Synthetic Minority Oversampling Technique. It does not simply copy instances from the minority class; rather, it iteratively looks at a number of neighbours and creates an instance with randomly distorted attributes within the boundaries of those neighbours.

We changed the percentage parameter to correspond to the number of extra instances to be created. Since the over-sampling uses extreme percentages, we expect a certain bias in the results due to overgeneralization; however, this does not affect the differences between J48 and AdaBoost. Remarks on this method can be found in the limitations paragraph. After balancing, our training set consists of 15311 instances, which means that 10445 instances were created. Weka appends these extra instances at the bottom of the dataset. With 10-fold cross-validation this might lead to folds containing many instances from the same class, and thus eventually to overfitting. To avoid this issue, we apply an extra filter that randomizes the order of the instances.
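The core of SMOTE's neighbour-based interpolation can be sketched as follows (a simplified illustration, not Weka's filter; instances are assumed to be tuples of numeric attribute values):

```python
import random

def smote_like(minority, n_new, k=5, seed=42):
    """Create n_new synthetic minority instances, SMOTE-style: pick a
    random minority instance, pick one of its k nearest neighbours, and
    generate a new point somewhere on the segment between the two."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p != base),
                            key=lambda p: sq_dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()          # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, nb)))
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority instances, all new points stay inside the convex hull of the minority class, which is what distinguishes SMOTE from naive copying.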
Class   Number of instances   % to add   Amount added
1       14                    15528      2173
2       161                   1259       2026
3       1443                  51.6       744
4       2188                  0          0
5       880                   148.6      1307
6       175                   1150       2012
7       5                     43660      2183

Table 3 : Balancing dataset using SMOTE filter
Figure 7 : Effect of balancing dataset
Figure 11 : Weighted average F-measure
Results from the experiments are shown above. With a two-tailed confidence level of 95%, the performance of the J48 and AdaBoostM1(J48) algorithms improved significantly (v) by balancing the dataset, as found by running the Weka experimenter. Only the baseline algorithm deteriorated significantly (*). The standard deviations also decreased, meaning more stable results. This supports the broad conclusion that balancing truly improves the performance of the mentioned algorithms. Here the default values of the algorithms were used; the parameters of the different algorithms will be adjusted at a later stage, when we compare them to one another.
5.1.4. Normalization
When there are big differences among the variable ranges, normalization can be beneficial.
For this dataset, the scales differ considerably among the attributes. The values, measured on
different scales, are adjusted to fit a common scale. It is important that normalization is
applied after checking for outliers; outliers have already been processed above. The default
values for the scale (1) and translation (0,0) are used, meaning that every numeric attribute is
scaled to the interval [0,1]. The class values are ignored since they are nominal. At the 95%
confidence level, there is no significant difference in the accuracy of the three algorithms, so
normalization has no effect here. Therefore, we continue with the dataset without
normalizing the numeric attributes.
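The scaling described above is plain min-max normalization; a minimal sketch (our own helper, mirroring the default scale/translation behaviour of Weka's unsupervised Normalize filter as we understand it):

```python
def min_max_normalize(values, scale=1.0, translation=0.0):
    """Rescale a numeric attribute onto [translation, translation + scale].

    With the defaults (scale 1, translation 0) every value is mapped
    onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: map to translation
        return [translation] * len(values)
    return [translation + scale * (v - lo) / (hi - lo) for v in values]

# min and max of the attribute become 0 and 1 respectively
print(min_max_normalize([0.2, 0.5, 1.1]))
```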
5.1.5. Feature Selection
In the real world, more attributes can provide higher discriminative power. However, most
machine learning algorithms have difficulty handling irrelevant or redundant information.
Some attributes can be completely irrelevant to the class; they still consume
processing power and can even bias the result. Therefore, feature subset selection is a great
way to improve classification results, lower processing time and improve the readability of
the model (Guyon, 2003). This is done by identifying and neglecting or removing the
irrelevant information. Feature selection is successful if the number of dimensions can be
reduced without lowering (or while improving) the accuracy of the induction algorithm.
Figure 12 : Effect of normalising numeric attributes
The search direction can have a serious effect on the attributes selected. One can start by
selecting all attributes and iteratively deleting attributes from that selection until some
termination point. This method is called backward elimination. On the other hand, the
forward selection method starts with zero attributes and gradually builds up a selection until
some termination point. Combining these two methods leads to bi-directional search, where
one starts with a subset of attributes and either deletes or adds attributes depending on
some characteristic such as merit.
By setting a termination point, you avoid processing over the entire search space. Typically,
a termination point could be a fixed number of attributes to select or a merit threshold.
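The forward selection method described above can be sketched generically; the merit function is left abstract (any subset score, such as CFS merit or cross-validated accuracy, could be plugged in), and the function name and parameters are illustrative:

```python
def forward_selection(attributes, merit, max_features=None, min_gain=0.0):
    """Greedy forward selection: start from the empty set and repeatedly add
    the attribute that most improves merit(subset).

    Terminates when no addition improves the merit by more than min_gain
    (a merit threshold) or when max_features attributes are selected
    (a fixed number of attributes) -- the two termination points above.
    """
    selected, best = [], merit([])
    remaining = list(attributes)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(merit(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(scored)
        if top_score - best <= min_gain:   # termination: merit threshold reached
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the merit least.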
(i) Feature subset selection in Weka
Based on the scatter plots, we suspect some attributes to be irrelevant because of their
seemingly high correlation with each other. An example can be seen in the figure below,
which shows the relation of ‘residual sugar’ to ‘density’. Based on the theory above, one can
see that the higher the amount of residual sugar, the higher the density will be. This relation,
combined with the low correlation of SO2 with ‘quality’, will probably lead to the exclusion of
one of the two attributes.
Figure 14 : Relation of ‘residual sugar’ to ‘density’ shows high correlation
Weka offers many methods to apply feature subset selection, either permanently or
temporarily during the execution of a learning algorithm. Since processing power is limited, all
irrelevant attributes were first discarded before using the adapted dataset to train the
models. This method is very fast and leads to similar performance as the slower wrapper
method. Although Weka's AttributeSelection filter combines an
evaluation strategy with a search method to automatically select the relevant attributes, it
does not apply cross-validation. Therefore, attribute selection is run first, and its
results are applied manually afterwards. The evaluation strategy used is CfsSubsetEval (CFS =
Correlation-based Feature Selection), which looks at the correlation matrix of all attributes.
This leads to a metric called “symmetric uncertainty” (Hall, 1999). It considers the predictive
value of each attribute, together with the degree of inter-redundancy.
Attributes with a high correlation with the class attribute and low inter-correlations are
preferred. CFS assumes that the attributes are independent and can fail to select the
relevant attributes when they depend strongly on other attributes given a class. The
components of CFS are listed in the figure below (Hall, 1999).
Figure 15 : Components of CFS
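For discrete attributes, symmetric uncertainty is straightforward to compute as SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), with I(X; Y) = H(X) + H(Y) − H(X, Y). A minimal sketch (function names are ours, not Weka's):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy of a sequence of discrete values, in bits."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalised to [0, 1].

    x and y are equal-length sequences of discrete values."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0                       # both attributes are constant
    hxy = entropy(list(zip(x, y)))       # joint entropy H(X, Y)
    return 2 * (hx + hy - hxy) / (hx + hy)
```

SU is 1 for perfectly dependent attributes and 0 for independent ones, which is exactly the trade-off CFS scores: high SU with the class, low SU between attributes.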
Multiple search methods were used to compare the results; all lead to the same result.
Ultimately, exhaustive search was used because it is more than feasible with only twelve
attributes. With 10-fold cross-validation, it shows a clear exclusion of ‘residual sugar’.
Experimenting with the dataset before and after deleting the attribute ‘residual sugar’ shows
that deleting it does not significantly deteriorate the performance of the different algorithms
(at the 95% confidence level). More importantly, however, the CPU time needed to
build the model decreases significantly, from 0.96 to 0.87 seconds for J48 and from 8.84 to
8.35 seconds for AdaBoost (cf. figures below). This shows that removing the attribute reduces
processing time without lowering the performance of the algorithms. It is therefore beneficial
to remove the attribute, especially when processing power is limited.
Figure 16 : Results of Attribute Selection with 10-fold cross-validation
By increasing the minimum number of objects in a leaf, the size of the tree can be limited.
This allows for a model that is easier to understand and can reduce overfitting. However,
accuracy is expected to decrease as leaves grow. Therefore, a trade-off needs to be made
between tree size and accuracy. The graph below shows the impact of adjusting the
minNumObj parameter. The Experimenter was used with 30 iterations.
Figure 19 : Adjusting minNumObj parameter for J48
By default, minNumObj is set to 2. This was increased up to 500. The graph shows a gradual
decline in both tree size and accuracy as a result. The standard deviation is not included in
this graph but gradually increases for both metrics. The tree size shrinks much faster than the
accuracy. Changing the parameter to 5 halves the tree size while the accuracy
decreases only from 69,8 to 68,12. Increasing the minimum further to 10 again halves
the tree size, with a slight decrease in accuracy to 66,5. Beyond that point, the accuracy
drops slightly faster. Therefore, minNumObj is set to 10. Remarks on this decision can be found
in the limitations section.
Adjusting the confidenceFactor does not change accuracy significantly; values from 0,05
up to 0,5 have been tested, with 0,25 being the default value. The default value is therefore
used.
5.3.2. Optimizing AdaBoost
Since AdaBoost(J48) uses J48 as its base learner, it is important to use the same J48 parameters
here as those used for the standalone J48 algorithm. AdaBoost itself also allows some tuning;
namely, the number of iterations can be adjusted (cf. T in the theoretical section on
AdaBoost). Naturally, the accuracy increases as more iterations make up the committee. The
graph below shows the accuracies corresponding to different numbers of iterations. The
accuracy improvements gradually decline. By setting an improvement threshold of 1%, the
number of iterations is set to 15, which leads to an accuracy of 78,28%.
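The stopping rule can be sketched as follows. The accuracy values below, apart from the final 78.28% reported in the text, are hypothetical placeholders for the curve in the graph:

```python
def pick_iterations(accuracies, threshold=1.0):
    """Pick the smallest committee size whose gain over the previous size
    drops below `threshold` percentage points.

    accuracies: dict mapping number of boosting iterations -> accuracy (%).
    """
    sizes = sorted(accuracies)
    for prev, cur in zip(sizes, sizes[1:]):
        if accuracies[cur] - accuracies[prev] < threshold:
            return prev        # further iterations gain less than the threshold
    return sizes[-1]

# Illustrative curve: only the 78.28 figure comes from the experiments
curve = {5: 74.1, 10: 76.9, 15: 78.28, 20: 78.9}
print(pick_iterations(curve))  # -> 15: growth from 15 to 20 is below 1%
```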
6. Results & Interpretation
The table below shows all relevant performance metrics of the different algorithms.
Metric                        ZeroR     J48     AdaBoost
Accuracy                     12,93%   65,48%     78,33%
Kappa                          0,00     0,60       0,75
Mean absolute error            0,24     0,12       0,07
Root mean squared error        0,35     0,26       0,22
Relative absolute error        100%      48%        27%
Root relative squared error    100%      75%        63%
TP rate                        0,13     0,66       0,78
FP rate                        0,13     0,06       0,04
Precision                      0,02     0,65       0,78
Recall                         0,13     0,66       0,78
F-measure                      0,03     0,65       0,78
ROC area                       0,50     0,89       0,96

Table 4 : Experimental results
From the table it is clear that AdaBoost outperforms the other algorithms in every respect,
with J48 being the runner-up. Although our dataset is balanced, small differences in class sizes
exist due to the random split. In the training set, the class with a score of 9 is slightly bigger
than the other classes, so ZeroR adopts the rule to always predict that class. This
leads to an accuracy of merely 12,9% on the test set. J48 performs much better, with an
accuracy of 65,48%, which equals an error reduction of 60%. AdaBoost in turn
reduces the error of J48 by a further 37%. Since the classes are still quite balanced, the weighted
average of the F-measure approximates the accuracy. In the case of J48 and AdaBoost,
the (weighted avg.) precision and (weighted avg.) recall also approximate the accuracy. ZeroR,
on the other hand, shows a much lower value for the weighted average of precision since in
all but one class, the TP and FP rates are zero.
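The quoted error reductions follow directly from the accuracies in Table 4; the arithmetic, for verification:

```python
def error_reduction(acc_base, acc_new):
    """Relative error reduction (%) when moving from a baseline accuracy
    to a better one, with accuracies given in percent."""
    err_base, err_new = 100 - acc_base, 100 - acc_new
    return (err_base - err_new) / err_base * 100

print(round(error_reduction(12.93, 65.48)))  # ZeroR -> J48: 60
print(round(error_reduction(65.48, 78.33)))  # J48 -> AdaBoost: 37
```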
Although the ROC area is especially useful for unbalanced datasets, here it confirms the other
metrics, with the area for AdaBoost coming close to 1, indicating an excellent prediction.
J48, with an ROC area just below 0,9, indicates a good prediction. The confusion matrices can be
found in appendix B. They show that for the class with score 9, very few errors are made by
J48 and AdaBoost. J48 misclassifies no instances from this class as having lower
scores, and AdaBoost misclassifies only 3. Also, only instances with scores
from 6 to 8 are occasionally misclassified as belonging to the top class. This shows that the
models perform better on good wines.