This document discusses classification and clustering techniques using the Weka data mining tool. It begins with an introduction to Weka and its capabilities for classification, clustering, and other data mining functions. It then provides an example of using Weka's J48 decision tree algorithm to classify iris flower samples based on sepal and petal attributes. Finally, it demonstrates k-means clustering on customer purchase data from a BMW dealership to group customers into five clusters based on their buying behaviors.
Data Mining Techniques Using Weka
ITB TERM PAPER
Classification and Clustering
Nitin Kumar Rathore
10BM60055
Introduction
Weka stands for Waikato Environment for Knowledge Analysis. It is free software for machine learning, written in Java and developed by the University of Waikato, New Zealand. The Weka workbench includes a set of visualization tools and algorithms that support better decision making through data analysis and predictive modeling. It also has a GUI (graphical user interface) for ease of use, and because it is written in Java it is portable across platforms. Weka has many applications and is widely used for research and educational purposes. The data mining functions Weka provides include classification, clustering, feature selection, data preprocessing, regression and visualization.
Weka starts with the Weka GUI Chooser, which offers four interfaces to work with:
Explorer: used for exploring data with Weka, providing access to all of Weka's facilities through menus and forms.
Experimenter: allows you to create, run, modify and analyse large-scale experiments. It can be used to answer questions such as which of several schemes performs better (if any does).
Knowledge Flow: offers the same functions as the Explorer but also supports incremental learning; it handles data on an incremental basis using incremental algorithms.
Simple CLI: a command line interface that provides all of Weka's functionality through typed commands.
Data Mining Techniques
Of the data mining techniques provided by Weka (classification, clustering, feature selection, data preprocessing, regression and visualization), this paper demonstrates the use of classification and clustering.
Classification
Classification creates a model with which a new instance can be assigned to one of the existing or predetermined classes. For example, by building a decision tree from past sales data we can determine how likely a person is to buy a product given attributes such as disposable income, family size and state/country.
To start with classification you must use or create a file in ARFF, CSV or another supported format. An ARFF file is essentially a table. To create an ARFF file from Excel, follow these steps:
Open the Excel file and remove the headings.
Save it as a CSV (comma delimited) file.
Open the CSV file in a text editor.
Write the relation name at the top of the file as: @relation <relation_name>. The text inside the angle brackets, < and >, represents text to be entered according to your requirement.
Leave a blank line and enter all the attributes (the column heads) in the format: @attribute <attribute_name> {<attribute_values>}. For example: @attribute outlook {sunny, overcast, rainy}
After entering all the attributes, leave a blank line and write: @data. This line appears just above the comma-separated data values of the file.
Save it as <file_name>.arff
A sample ARFF file is shown below.
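Following the steps above, a minimal ARFF file (hypothetical weather data, not the paper's iris or dealership data) could look like:

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
```

Note that nominal attribute values are listed in curly braces, while numeric attributes are declared with the keyword numeric.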
Classification example:
Our goal is to create a decision tree using Weka so that we can classify new or unknown iris flower samples. There are three kinds of iris: Iris setosa, Iris versicolor and Iris virginica.
Data file: we have a data file containing attribute values for 150 iris samples in ARFF format at this link: http://code.google.com/p/pwr-apw/downloads/detail?name=iris.arff.
The idea behind the classification is that the sepal and petal lengths and widths help us identify an unknown iris. The data file contains all four of these attributes. The algorithm we are going to use is Weka's J48 decision tree learner, an implementation of the C4.5 algorithm.
Follow these steps to classify:
Open Weka and choose Explorer, then open the downloaded ARFF file.
Go to the Classify tab.
Click "Choose" and select the J48 algorithm under the trees section.
Click on the chosen J48 algorithm (the text box next to the Choose button) to open the Weka GenericObjectEditor.
Change the option saveInstanceData to true and click OK. This lets you inspect how each sample was classified after the decision tree has been built.
Select the "Percentage split" option in the "Test options" section. Weka trains on the percentage entered in the box and tests on the rest of the data; the default value is 66%.
Click "Start" to run the classification. The box named "Classifier output" shows the results.
Now we will view the tree: right-click the entry in the "Result list" and click "Visualize tree". The decision tree will appear in a new window.
The tree gives the decision structure, i.e. the flow followed during classification. For example, if petal width > 0.6, petal width <= 1.7, petal length > 4.9 and petal width <= 1.5, the tree implies the iris is virginica.
Now look at the classifier output box, where the rules describing the decision tree are listed.
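The rule path quoted above can be transcribed by hand as a small function, purely for illustration:

```python
def is_virginica_by_rule(petal_length, petal_width):
    """One branch of the J48 iris tree described above, transcribed by hand.

    The petal_width <= 1.7 test is redundant given petal_width <= 1.5,
    but it is kept to mirror the tree path as printed by Weka.
    """
    return (petal_width > 0.6 and petal_width <= 1.7
            and petal_length > 4.9 and petal_width <= 1.5)
```

For example, a sample with petal length 5.0 and petal width 1.4 satisfies every condition on this branch, while a setosa-like sample with petal width 0.2 fails at the first test.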
As we can see from the decision tree, sepal length and width are not required for classification; only petal length and width are used.
Go to the "Classifier output" box and scroll to the section "Evaluation on test split". We split the data in two: 66% for training and the remaining 34% for testing the model.
Weka took 51 samples (34%) as the test set, of which 49 are classified correctly and 2 incorrectly.
If you look at the confusion matrix below in the classifier output box, you will see that all setosa (15) and all versicolor (19) samples are classified correctly, but 2 out of 17 virginica are classified as versicolor.
To find more information, or to visualize how the decision tree did on the test samples, right-click the entry in the "Result list" and select "Visualize classifier errors".
A new window will open. Since our tree used only petal width and petal length to classify, we select Petal Length for the X axis and Petal Width for the Y axis.
Here an "x" (cross) represents a correctly classified sample and a square represents an incorrectly classified sample. The three classes setosa, versicolor and virginica are shown in different colors: blue, red and green.
As we can see, the two samples are classified incorrectly as virginica because, considering only petal length and width, they fall into the versicolor group.
Left-clicking on one of the squared (incorrectly classified) instances gives you information about that instance.
As we can see, 2 out of the 50 virginica samples (train + test) are classified incorrectly, while all setosa and versicolor samples are classified correctly. There can be many reasons for this; a few are mentioned below:
Attribute measurement error: arises from incorrect measurement of petal and sepal lengths and widths.
Sample class identification error: may arise because some samples were labeled incorrectly, say some versicolor labeled as virginica.
Outlier samples: some infected or abnormal flowers were sampled.
Inappropriate classification algorithm: the chosen algorithm is not suitable for this classification task.
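The paper runs this experiment through the Weka GUI. As a rough programmatic cross-check, here is a sketch of the same 66/34 percentage-split experiment using scikit-learn's DecisionTreeClassifier in place of Weka's J48 (an assumption for illustration; the exact counts will differ from the J48 output above):

```python
# Sketch: decision-tree classification of the iris data with a 66/34
# percentage split, mirroring the Weka J48 workflow described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

iris = load_iris()
# 66% of the 150 samples for training, the remaining 51 for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.66, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

print("test accuracy:", accuracy_score(y_test, pred))
# rows = true class, columns = predicted class
print(confusion_matrix(y_test, pred))
```

The printed confusion matrix plays the same role as the one in Weka's classifier output: off-diagonal entries are the misclassified test samples.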
Clustering
Clustering is the formation of groups of instances on the basis of their attributes and is used to find patterns in the data. An advantage of clustering over classification is that every attribute is used to define the groups; a disadvantage is that the user must know beforehand how many groups to form.
There are two types of clustering:
Hierarchical clustering: this approach uses a distance measure (generally squared Euclidean distance) between objects to form clusters. The process starts with every object as a separate cluster. The two closest clusters are then joined to form a new cluster, which replaces them as a single object. The process continues until only one cluster remains or the required number of clusters is reached.
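The agglomerative process just described can be sketched outside Weka as well; here is a minimal example on a few hypothetical 2-D points (using SciPy rather than Weka, as an assumption for illustration):

```python
# Sketch: agglomerative (hierarchical) clustering on a few 2-D points,
# merging the two closest clusters at each step as described above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2],
                   [5.0, 5.0], [5.1, 4.9],
                   [9.0, 0.0]])

# 'ward' merges the pair of clusters with the smallest increase in
# within-cluster squared Euclidean distance
merges = linkage(points, method="ward")

# cut the merge tree to obtain exactly 2 clusters
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)
```

The two nearby pairs of points end up in the same clusters, illustrating how the shortest-distance merges build the hierarchy bottom-up.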
Non-hierarchical clustering: a method in which n observations are partitioned into k clusters. Each observation is assigned to the nearest cluster and the cluster mean is then recalculated. In this paper we will study a K-means clustering example.
Applications of clustering include:
Market segmentation
Computer vision
Geostatistics
Understanding buyer behavior
Data file: the data file describes a BMW dealership. It contains data about how often customers make a purchase, what cars they look at, and how they walk through the showroom and dealership. It contains 100 rows of data, where every attribute/column represents a step that the customer reached in the buying process: "1" means they reached that step, "0" means they did not. Download the data file from the
To create clusters, click on the Cluster tab. Click the "Choose" button and select "SimpleKMeans".
Click on the text box next to the Choose button, which displays the k-means algorithm. This opens the Weka GenericObjectEditor.
Change "numClusters" from 2 to 5; it defines the number of clusters to be formed. Click OK.
Click Start to begin clustering.
An entry will appear in the "Result list" box and the clusterer output will display the results of the clustering.
Cluster Results:
Now we have the clusters defined. You can view the cluster data in a separate window by right-clicking the entry in the "Result list" box. Five clusters are formed, named "0" to "4". If an attribute value for a cluster is "1", it means all the instances in that cluster have the value "1" for that attribute; likewise, a "0" value means all instances in the cluster have "0" for that attribute. As a reminder, "0" means the customer did not enter that step of the buying process and "1" means the customer did.
"Clustered Instances", a heading in the cluster output, shows how many instances belong to each cluster. For example, cluster "0" has 26 instances, or 26% of the instances (as there are 100 rows, the number of instances equals the percentage).
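The SimpleKMeans setup above can also be sketched outside Weka. Here is a minimal stand-in using scikit-learn's KMeans on synthetic 0/1 step data (the real dealership file is not reproduced here, so the data below is an assumption for illustration):

```python
# Sketch: k-means with 5 clusters on synthetic binary "buying step" data,
# standing in for the Weka SimpleKMeans run described above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 100 hypothetical customers x 5 buying steps, each 0 (not reached) or 1 (reached)
steps = rng.integers(0, 2, size=(100, 5))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(steps)

# cluster sizes, analogous to Weka's "Clustered Instances" section
sizes = np.bincount(labels, minlength=5)
print(sizes, sizes.sum())
```

With 100 rows, each cluster size also reads directly as a percentage, just as in the Weka output.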
Interpreting the clusters
Cluster 0: this represents the group of non-purchasers. They may look for a dealership and look at cars in the showroom, but when it comes to purchasing a car they do nothing. This group only adds to cost and brings in no revenue.
Cluster 1: this group is attracted to the M5, as they visibly go straight to the M5s, ignoring the 3 Series and paying no heed at all to the Z4. They do not even do a computer search. But as we can see, this high footfall does not bring corresponding sales, and the reason for the mediocre sales should be unearthed. If customer service is the problem, we should improve service quality in the M5 section by training the sales executives better; if a shortage of sales personnel to attend to every customer is the problem, we can provide more staff for the M5 section.
Cluster 2: this group contains just 5 instances out of 100. It can be called an "insignificant group": it is not statistically important, and we should not draw any conclusions from it. It suggests we may want to reduce the number of clusters.
Cluster 3: this is the group of customers we can call "sure-shot buyers", because they always buy a car. One thing to note is that we should take care of their financing, as they always opt for it. They look around the showroom for available cars and also do a computer search of the dealership's stock. They generally do not look at the 3 Series. This suggests we should make the M5 and Z4 more visible and attractive in computer search results.
Cluster 4: this group makes the fewest purchases after the non-purchasers. They are new to the category: they do not look at expensive cars like the M5, instead looking at the 3 Series, and they walk into showrooms but do not do a computer search. As we can see, 50 percent of them reach the financing stage but only 32 percent end up buying a car. This suggests they are buying their first BMW and know exactly what they need, hence the entry-level 3 Series, and they generally need financing to afford the car. To increase sales we should therefore improve the conversion ratio from the financing stage to the purchase stage: identify the problem there and take appropriate steps, for example making financing easier by collaborating with a bank, or relaxing the terms that repel customers.
REFERENCES
[1] Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten, Eibe Frank and Mark A. Hall, 3rd edition, Morgan Kaufmann Publishers.
[2] Weka tutorial provided by the University of Waikato, www.cs.waikato.ac.nz/ml/weka/
[3] Weka: Classification Using Decision Trees, based on Dr. Polczynski's lecture, written by Prof. Andrzej Kochanski and Prof. Marcin Perzyk, Faculty of Production Engineering, Warsaw University of Technology, Warsaw, Poland, http://referensi.dosen.narotama.ac.id/files/2011/12/weka-tutorial-2.pdf
[4] Classification via Decision Trees in WEKA, School of Computer Science, Telecommunications and Information Systems, DePaul University, http://maya.cs.depaul.edu/classes/ect584/weka/classify.html
[5] Data Mining with WEKA, Part 2: Classification and Clustering, Michael Abernethy, IBM developerWorks, http://www.ibm.com/developerworks/opensource/library/os-weka2/index.html?ca=drs-