This document summarizes a student assignment to predict red wine quality using classification models. It describes using the wine quality dataset from UCI, preprocessing the data, exploring it visually, and training KNN and decision tree classifiers to predict wine quality. Evaluation shows the decision tree model achieved slightly higher accuracy than KNN, particularly when standard scaling was applied during modeling.
Practical Data Science: Data Modelling and Presentation
1. COSC2670 Practical Data Science Assignment 2
Predicting the Quality of Red Wine
Names: Junaid Ahmed Syed & Harini Mylanahally Sannaveeranna
Student IDs: s3731300 & s3755660
May 29, 2019
3. Chapter 1
Abstract
The main objective of this assignment is data modelling, a core step in the data science
process. The dataset used here is 'Red Wine Quality', with 'wine quality' as the target feature. The task
is framed as classification, and the chosen models are K-Nearest-Neighbour and Decision Tree. The rest of
this report is organized as follows. Chapter 2 gives an introduction and describes the dataset and its
attributes. Chapter 3 presents the methodology, covering data pre-processing, data exploration and data
modelling. Chapter 4 reports the results obtained in Chapter 3, and Chapter 5 discusses them. The last
chapter presents a summary.
4. Chapter 2
Introduction
2.1 DataSet Information
This dataset is sourced from the UCI Machine Learning Repository at
https://archive.ics.uci.edu/ml/datasets/Wine+Quality [1]. The repository provides two wine-quality
datasets, but only winequality-red.csv is used for this assignment. The dataset has 1599 observations
and 12 variables.
2.2 Target Feature
The classification goal is to predict whether the quality of the wine is good or bad.

Wine[quality] = { bad   if value = 0
                { good  if value = 1
2.3 Descriptive Features
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
5. Chapter 3
Methodology
3.1 Data Preprocessing
In 3.1, we checked that the feature types matched the description outlined in the documentation
using the pandas dtypes attribute.
3.1.1 Missing values
After verifying the feature types, missing values were checked with isnull().sum(); there are no missing
values in this dataset, at least at the surface level.
3.1.2 pd.cut() and LabelEncoder()
The target feature, quality, takes values from 2 to 8. With the help of pd.cut() we can bin these values
into discrete intervals. The mean quality is 5.6, so we treat the interval from 2 to 5.6 as bad quality and
from 5.6 to 8 as good quality. LabelEncoder() was then used to encode the labels as 0 and 1.
3.1.3 Outliers
We keep outliers for our predictive analysis, as outliers can be a great source of information.
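The binning and encoding steps can be sketched as follows. This is a minimal example assuming pandas and scikit-learn, with a small stand-in series in place of the real quality column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the quality column of winequality-red.csv
df = pd.DataFrame({"quality": [3, 5, 6, 7, 8, 4]})

# Bin quality into two intervals: (2, 5.6] -> "bad", (5.6, 8] -> "good"
df["quality"] = pd.cut(df["quality"], bins=(2, 5.6, 8), labels=["bad", "good"])

# Encode the string labels as integers (sorted alphabetically: bad -> 0, good -> 1)
le = LabelEncoder()
df["quality"] = le.fit_transform(df["quality"])
print(df["quality"].tolist())  # -> [0, 0, 1, 1, 1, 0]
```

Note that LabelEncoder() sorts classes alphabetically, so "bad" happens to map to 0 and "good" to 1, matching the target definition above.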
3.2 Data Exploration
3.2.1 Univariate visualisation
BoxHistogramPlot(x) is a function defined for numerical features, for the sake of simplicity. For a given
numeric input column, BoxHistogramPlot(x) draws a box plot and a histogram. A histogram is useful to
visualize the shape of the underlying distribution, whereas a box plot shows the range of the attribute and
helps detect any outliers. The following code chunks show how this function was defined using the numpy
library and the matplotlib library.
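A minimal sketch of such a helper is shown below; the function name comes from the report, but the exact two-panel layout is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def BoxHistogramPlot(x, name="feature"):
    """Draw a box plot above a histogram for one numeric column."""
    fig, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={"height_ratios": (0.2, 0.8)}
    )
    ax_box.boxplot(x, vert=False)  # range of the attribute and any outliers
    ax_hist.hist(x, bins=20)       # shape of the underlying distribution
    ax_hist.set_xlabel(name)
    return fig

# Example call on synthetic data standing in for a wine attribute
fig = BoxHistogramPlot(np.random.default_rng(0).normal(size=200), "alcohol")
```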
From the plots, we can see that the majority of the column histograms are unimodal. Among these,
fixed acidity, density, pH, sulphates and residual sugar appeared normally distributed, whereas
free sulphur dioxide, total sulphur dioxide and chlorides are left-skewed. We can also describe volatile
acidity as a bimodal attribute, because most of its values lie between 0.4 and 0.5 or between 0.6 and 0.7,
and the citric acid column as a plateau distribution, since it has more than three modes.
3.2.2 Multivariate Visualisation
• Histogram of numeric features segregated by Wine Quality
From the histograms, we can see that if the volatile acidity of the wine is above 0.6, the quality of
the wine is good. Likewise, higher citric acid levels are not so good for wine, and alcohol in excess
quantity, i.e., above 10%, may make the quality of the wine bad.
• Pairwise scatter plot between numeric features by Wine quality
A function named scatterplotByCategory(c, x, y, D) is designed to draw a scatter plot between two
numeric attributes y and x, labelled by a categorical attribute c, given an input data set D. In this case,
D is the dataset itself and c is the wine quality.
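A minimal sketch of such a helper (the signature follows the report; the colouring-by-group approach is an assumption):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def scatterplotByCategory(c, x, y, D):
    """Scatter y against x, with one colour per level of the categorical column c."""
    fig, ax = plt.subplots()
    for level, group in D.groupby(c):
        ax.scatter(group[x], group[y], label=str(level))
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=c)
    return fig

# Toy frame standing in for the wine data
df = pd.DataFrame({
    "volatile acidity": [0.3, 0.7, 0.4, 0.8],
    "alcohol": [9.5, 11.0, 10.2, 12.0],
    "quality": ["bad", "good", "bad", "good"],
})
fig = scatterplotByCategory("quality", "volatile acidity", "alcohol", df)
```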
We plotted scatter plots for volatile acidity, citric acid and alcohol, segregated by the target feature.
However, the graphs show no clear correlation between any two numeric variables; therefore, the numeric
features are likely to be independent of each other.
Chapter 4
Data Modelling
4.0.1 Train and Test data split
In order to perform predictive analysis, the dataset was divided into two parts: one containing all the
descriptive features and the other containing the target feature. These were named X and y, respectively.
4.0.2 Knn Classification Training
Data Slicing
We split the data randomly into training and test sets in a 50:50 ratio using train_test_split(), which is
provided by Scikit-learn. Later on, we fit/train a classifier on the training set and make predictions on
the test set. StandardScaler() is applied to the features, which helps improve the model's performance by
keeping the feature values from varying too widely.
KnnClassifier()
There are two important parameters for KnnClassifier(): n_neighbors and the distance metric. The default
metric is the Minkowski distance, and we have used the default.
- Predicting the optimal number of neighbours (K value):
The most common heuristic chooses k as the square root of the number of observations in the test set.
We then define the Knn classifier with the optimal value of K and fit the training data to the
model. We use predict() to obtain the test results, and finally evaluate the model using a confusion matrix
and classification report. We repeat this process two more times with train and test ratios of 60:40 and
80:20, respectively. The results are discussed in the next chapter.
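The KNN steps above can be sketched as follows. This is a minimal sketch assuming scikit-learn; a synthetic dataset with 11 features stands in for winequality-red.csv, and the square-root/odd-k heuristic follows the report:

```python
import math
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 descriptive features and binary target
X, y = make_classification(n_samples=400, n_features=11, random_state=42)

# 50:50 train/test split, as in the first experiment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Standardise so no feature dominates the distance metric
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# k = sqrt(test-set size), incremented to the next odd number if even
k = int(math.sqrt(len(X_test)))
if k % 2 == 0:
    k += 1

knn = KNeighborsClassifier(n_neighbors=k)  # default Minkowski metric
knn.fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test))
print(cm)  # rows: true class, columns: predicted class
```

With 200 test observations, this heuristic gives k = 15 (sqrt(200) ≈ 14.1, truncated to 14, then incremented because it is even).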
4.0.3 Decision Tree Training
We used a similar approach for decision tree classification as we did for KNN classification, although the
parameters differ between the two. An advantage of the Decision Tree over KNN is that minimal effort is
required for data preparation; in particular, no scaling of the feature variables is needed.
DecisionTreeClassifier()
The important parameters of DecisionTreeClassifier() are:
- criterion: the function used to measure the quality of a split. We used the default, which is the Gini
index.
- max_depth: an integer value denoting the maximum depth of the tree. When not specified, it takes the
default value of None.
- min_samples_leaf: used to restrict the decision tree by specifying the minimum number of samples
required at a leaf node.
After defining the parameters, we fitted the training data, made predictions, and finally evaluated the
models using a confusion matrix and classification report with train and test ratios of 50:50, 60:40 and
80:20.
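A minimal sketch of this decision tree workflow, assuming scikit-learn; the synthetic dataset is a stand-in, and max_depth=4 follows the tuning described later in the report:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 11 descriptive features and binary target
X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)  # 50:50 split; no scaling needed

# criterion="gini" is the scikit-learn default; max_depth=4 follows the
# report's tuning in the discussion chapter
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```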
Plotting decision tree
We plotted the decision tree to see how it looks internally. The plot shows the Gini index at each split.
The value row in each node tells us how many of the observations that were sorted into that node fall
into each category. As expected, the maximum depth of the decision tree is 4, and we got 16 leaf nodes
because we did not specify a value for min_samples_leaf.
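Such a plot can be produced with scikit-learn's plot_tree; a synthetic dataset stands in for the wine data, and a headless matplotlib backend is assumed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the wine features and binary target
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X, y)

# Each node box shows the gini impurity, the sample count and the "value" row,
# i.e. how many observations in that node fall into each class
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(tree, filled=True, class_names=["bad", "good"], ax=ax)
```

A depth-4 binary tree has at most 2^4 = 16 leaves, which matches the leaf count reported above.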
Chapter 5
Results
The confusion matrices and classification reports for both classification algorithms are as follows:
• A table of the confusion matrix for KnnClassifier with train and test ratios of 50:50, 60:40 and 80:20.
confusion matrix 50:50 60:40 80:20
True negative 679 534 267
False positive 14 16 10
False negative 80 67 33
True positive 27 23 10
• A table of the confusion matrix for the Decision Tree with train and test ratios of 50:50, 60:40 and 80:20.
confusion matrix 50:50 60:40 80:20
True negative 651 538 270
False positive 42 12 7
False negative 61 70 27
True positive 46 20 16
• A table of the accuracy percentage for both KNN and Decision Tree with train and test ratios of
50:50, 60:40 and 80:20.
accuracy (%) KNN Decision Tree
50:50 88.25 89.37
60:40 87.03 89.37
80:20 86.56 86.56
From the tables, we can see that the KNN and Decision Tree models achieve similar accuracy. However,
if we do not apply the StandardScaler step and otherwise use the same training process, the result drops
by around 7%. Since the Decision Tree scores are equal or slightly higher at every split ratio, we conclude
that decision tree classification suits this particular dataset better than KNN classification.
Chapter 6
Discussion
• The functions used for the visualizations are taken from MATH2319 [2].
• For finding the optimal k value, we came across many functions online; all of them gave similar
results, but we chose the one from [3]. If the square root of the test-set size is an odd number, it is
taken as the k value; if it is even, it is incremented by 1.
• To find the max_depth value, we took a range from 2 to 8 and fitted models for each value. Of all
the values, we obtained the best precision with a depth of 4.
Chapter 7
Conclusion
In this assignment, we reduced the cardinality of the 'quality' column to a binary integer feature. From
the visualizations, we learned that all of the variables are potentially useful for predicting wine quality.
Finally, after fitting the binary classifiers and evaluating the models, we found that decision tree
classification performs better on this dataset.
Bibliography
[1] P. Cortez, S. Moro and P. Rita. UCI Machine Learning Repository: Wine Data Set.
[2] MATH2319, Machine Learning course, RMIT.
[3] Sklearn. URL: http://www.simplilean.com.