This document provides an introduction to WEKA, an open-source data mining software. It describes the key interfaces of WEKA including Explorer, Experimenter, KnowledgeFlow and SimpleCLI. It demonstrates how to use these interfaces to perform common data mining tasks like preprocessing, classification, clustering and association rule mining on sample datasets. Examples using iris data and weather data are provided to illustrate how to analyze and evaluate models in WEKA.
TAO Fayan_ Introduction to WEKA
An Introduction to WEKA

Fayan TAO∗
Computer and Information Science System
Macau University of Science and Technology
fytao2015@gmail.com
ABSTRACT
WEKA is a data mining software workbench, which has wide applications in machine learning technology. It has an active community and enjoys widespread acceptance in both academia and business. This report provides an introduction to WEKA and demonstrates how to use each of its applications, including Explorer, Experimenter, KnowledgeFlow and Simple CLI, based on version 3.6.13. Two kinds of classical data, the iris data and the weather data, are used for the experiments, and the related output results are shown and analysed in this report.
Keywords
WEKA; data mining; machine learning
1. INTRODUCTION
The WEKA [10][5] workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It was developed at the University of Waikato in New Zealand; the name stands for Waikato Environment for Knowledge Analysis. (Outside the university, the weka, pronounced to rhyme with Mecca, is a flightless bird with an inquisitive nature found only on the islands of New Zealand [8].) The system is written in Java and distributed under the terms of the GNU General Public License. It runs on almost any platform and has been tested under Linux, Windows, and Macintosh operating systems, and even on a personal digital assistant.
It provides a uniform interface to many different learning
algorithms, along with methods for pre- and post-processing
and for evaluating the result of learning schemes on any
given dataset[6].
It contains several standard data mining techniques, including data preprocessing, classification, regression, clustering, and association.

∗Stu ID: 1509853F-II20-0019
1 This report follows ACM TEX format.

Figure 1: WEKA interface
2. BACKGROUND AND HISTORY
The WEKA project has been funded by the New Zealand Government since 1993 [1][7]. The initial goal at that time was as follows: "The programme aims to build a state-of-the-art facility for developing techniques of machine learning and investigating their application in key areas of the New Zealand economy. Specifically we will create a workbench for machine learning, determine the factors that contribute towards its successful application in the agriculture industries, and develop new methods of machine learning and ways of assessing their effectiveness."
In 1996, a mostly C version of WEKA was released, while
in 1999 it was redeveloped and released in Java to support
platform independence. Today, there are several versions of
WEKA available to the public. The GUI version (6.0) is the
most recent release. The developer version (3.5.8) allows
users to obtain and modify source code to add content or fix
bugs. The book version (3.4.14) is as described in the data
mining book released by Witten and Frank[9].
3. INTERFACES
WEKA has four interfaces, which are started from the main GUI Chooser window, as shown in figure 1 (this discussion is based on WEKA version 3.6.13). They can handle data preprocessing, classification, regression, clustering, and association [1][2]. The following are the specific application interfaces:

• Explorer: An environment for exploring data with WEKA, which contains six main functions, namely Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.
• Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.

Figure 2: Explorer interface
Figure 3: Preprocess interface
• KnowledgeFlow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• SimpleCLI: An environment providing a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.
4. EXPLORER
Explorer is the main graphical user interface in WEKA. It is shown in figure 2. It has six different panels, accessed by the tabs at the top, that correspond to the various data mining tasks supported ([8][2]). We will discuss these six functions one by one in the following sections.
• Preprocess: Choose and modify the data being acted on.
• Classify: Train and test learning schemes that classify or perform regression.
• Cluster: Learn clusters for the data.
• Associate: Learn association rules for the data.
• Select attributes: Select the most relevant attributes in the data.
• Visualize: View an interactive 2D plot of the data.
4.1 Preprocess
The interface of Preprocess can be seen in figure 3. The first four buttons at the top of the Preprocess section are used to load data into WEKA [2]:

• Open file... Brings up a dialog box allowing us to browse for the data file on the local file system.
Figure 4: Left: Weather data (.csv); Right: Iris data (.arff)
Figure 5: Iris data after preprocessing
• Open URL... Asks for a Uniform Resource Locator address for where the data is stored.
• Open DB... Reads data from a database. (Note that to make this work we might have to edit the file weka/experiment/DatabaseUtils.props.)
• Generate... Enables us to generate artificial data from a variety of DataGenerators.
Using the Open file... button, we can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
Note: This list of formats can be extended by adding custom file converters to the weka.core.converters package.
Here, we take the weather data and the iris data as examples. Figure 4 shows the weather data in the .csv format and the iris data in the .arff format. Figure 2 and figure 5 show the weather data and the iris data after preprocessing, respectively. We can see that the Preprocess interface displays data information, such as the data's Relation, Instances and Attributes. It also shows statistics, including the Minimum, Maximum, Mean and StdDev values. Additionally, we can discretize the data. As figure 6 shows, the weather data contains 5 sunny days, 4 overcast days and 5 rainy days, where all overcast days have no bad influence on playing, while not all sunny or rainy days can be regarded as suitable days to play.
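The Min/Max/Mean/StdDev columns in the Preprocess panel are ordinary descriptive statistics. A minimal sketch of how they are computed (the values below are illustrative, not the actual iris measurements; WEKA reports the sample standard deviation):

```python
import math

def summary(values):
    """Compute the statistics the Preprocess panel shows for a numeric attribute."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance: divide by n - 1, as WEKA's StdDev column does.
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    return {"Minimum": min(values), "Maximum": max(values),
            "Mean": mean, "StdDev": math.sqrt(var)}

sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]  # illustrative values only
print(summary(sepal_length))
```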
4.2 Classification
The Classify interface is shown in figure 7. To analyse the weather data, we have to choose a classifier and test options. In our case, we choose trees.J48 as the classifier. As to the Test options box, there are four test modes [2]:
• Use training set: The classifier is evaluated on how well it predicts the class of the instances it was trained on.
Figure 6: Discretize weather data
Figure 7: Classify interface
• Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing us to choose the file to test on.
• Cross-validation: The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
• Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
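For the weather data's 14 instances, a 66% split rounds to 9 training instances and leaves 5 held out for testing, which matches the "Total Number of Instances 5" line in the evaluation output below. A minimal sketch of the split arithmetic (the rounding rule here is an assumption, but it reproduces the counts in this report):

```python
def percentage_split(instances, percent):
    """Split a dataset as a percentage-split evaluation does:
    the first percent% (rounded) trains the model, the rest tests it."""
    train_size = round(len(instances) * percent / 100)
    return instances[:train_size], instances[train_size:]

weather = list(range(14))  # stand-ins for the 14 weather instances
train, held_out = percentage_split(weather, 66)
print(len(train), len(held_out))  # 9 train, 5 test
```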
Here, we set Percentage split to 66%, which means that the classifier is trained on 66% of the weather data and evaluated on the remaining 34%. Figure 8 shows the visualized tree, and table 1 shows the classifier output. As we can see, the text in the Classifier output area is split into several sections:
• Run information: A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.
• Classifier model (full training set): A textual representation of the classification model that was produced on the full training data.
• Summary: A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
Figure 8: Classification visualize tree
Figure 9: Visualize cluster assignments (SimpleKMeans)
Figure 10: Visualize cluster assignments (HierarchicalClusterer)
• Detailed Accuracy By Class: A more detailed per-class breakdown of the classifier's prediction accuracy. Here, the true positives (TP) [8] and true negatives (TN) are correct classifications. A false positive (FP) is when the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). A false negative (FN) is when the outcome is incorrectly predicted as negative when it is actually positive. The true positive rate is TP divided by the total number of positives, i.e. TP / (TP + FN); the false positive rate is FP divided by the total number of negatives, i.e. FP / (FP + TN). The overall success rate is the number of correct classifications divided by the total number of classifications, i.e. (TP + TN) / (TP + TN + FP + FN). Finally, the error rate is 1 minus this.
• Confusion Matrix: Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column. In table 1, we can see that 2 no instances are classified as yes, while 1 yes instance is mistaken as no.
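The per-class figures in table 1 follow directly from its confusion matrix. A quick sketch that recomputes them for the yes class, reading TP = 2, FN = 1, FP = 2, TN = 0 off the matrix:

```python
def rates(tp, fn, fp, tn):
    """Recompute the Detailed Accuracy By Class figures from raw counts."""
    return {
        "tp_rate": tp / (tp + fn),           # recall for the class
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Counts for the "yes" class, read off table 1's confusion matrix.
yes = rates(tp=2, fn=1, fp=2, tn=0)
print(yes)  # tp_rate 0.667, fp_rate 1.0, precision 0.5, accuracy 0.4
```

The accuracy of 0.4 matches the "Correctly Classified Instances 2 (40%)" line of the Summary section.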
4.3 Clustering
WEKA contains clusterers for finding groups of similar instances in a dataset. There are different kinds of implemented schemes, such as k-Means, EM, Cobweb, X-means and FarthestFirst. Besides, clusters can be visualized and compared to "true" clusters (if given). If a clustering scheme produces a probability distribution, the evaluation will be based on log-likelihood.
In this report, we choose three different clusterers: SimpleKMeans, HierarchicalClusterer and EM to analyse the weather data.
Figure 9, figure 10 and figure 11 show the visualized cluster assignments produced by SimpleKMeans, HierarchicalClusterer and EM, respectively.
Table 2 shows the cluster output produced by KMeans. We can see that there are 14 instances in total. They are clustered into
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weatherData
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===

J48 pruned tree
------------------

outlook = Rainy
|   humidity = High: No (3.0)
|   humidity = Normal: Yes (2.0)
outlook = Overcast: Yes (4.0)
outlook = Sunny
|   windy = False: Yes (3.0)
|   windy = True: No (2.0)

Number of Leaves:  5
Size of the tree:  8

Time taken to build model: 0.03 seconds

=== Evaluation on test split ===
=== Summary ===

Correctly Classified Instances           2               40      %
Incorrectly Classified Instances         3               60      %
Kappa statistic                         -0.3636
Mean absolute error                      0.6
Root mean squared error                  0.7746
Relative absolute error                126.9231 %
Root relative squared error            157.6801 %
Total Number of Instances                5

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0         0.333     0           0        0           0.333      No
               0.667     1         0.5         0.667    0.571       0.333      Yes
Weighted Avg.  0.4       0.733     0.3         0.4      0.343       0.333

=== Confusion Matrix ===

  a b   <-- classified as
  0 2 | a = No
  1 2 | b = Yes

Table 1: Classifier output (J48)
Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     weather.symbolic
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
Ignored:
              play
Test mode:    Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 21.000000000000004
Missing values globally replaced with mean/mode

Cluster centroids:
                          Cluster#
Attribute     Full Data          0          1
                   (14)       (10)        (4)
==============================================
outlook           sunny      sunny   overcast
temperature        mild       mild       cool
humidity           high       high     normal
windy             FALSE      FALSE       TRUE

Time taken to build model (full training data): 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      10 ( 71%)
1       4 ( 29%)

Class attribute: play
Classes to Clusters:

  0 1  <-- assigned to cluster
  6 3 | yes
  4 1 | no

Cluster 0 <-- yes
Cluster 1 <-- no

Incorrectly clustered instances: 7.0  50 %

Table 2: Cluster output (KMeans)
=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     weatherData
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:

 1. outlook=Overcast 4 ==> play=Yes 4    conf:(1)
 2. temperature=Cool 4 ==> humidity=Normal 4    conf:(1)
 3. humidity=Normal windy=False 4 ==> play=Yes 4    conf:(1)
 4. outlook=Rainy play=No 3 ==> humidity=High 3    conf:(1)
 5. outlook=Rainy humidity=High 3 ==> play=No 3    conf:(1)
 6. outlook=Sunny play=Yes 3 ==> windy=False 3    conf:(1)
 7. outlook=Sunny windy=False 3 ==> play=Yes 3    conf:(1)
 8. temperature=Cool play=Yes 3 ==> humidity=Normal 3    conf:(1)
 9. outlook=Rainy temperature=Hot 2 ==> humidity=High 2    conf:(1)
10. temperature=Hot play=No 2 ==> outlook=Rainy 2    conf:(1)

Table 3: Associator output (Apriori)
Figure 11: Visualize cluster assignments (EM)
Figure 12: Associate interface
two groups, 0 and 1. Group 0 represents play, while group
1 represents non-play; they contain 10 and 4 instances,
respectively. According to the Classes to Clusters section
of the output, 3 play instances are mistaken for non-play
and 4 non-play instances are misclassified as play. So, in
fact, there are 9 play instances and 5 non-play instances
in the real data.
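The classes-to-clusters figure can be checked by hand: WEKA assigns each class to a distinct cluster so that the number of correctly clustered instances is maximized, and everything outside that assignment counts as an error. A minimal sketch of that bookkeeping in Python, using only the counts from Table 2 (illustrative arithmetic, not WEKA code):

```python
from itertools import permutations

# Classes-to-clusters counts from Table 2 (rows: clusters 0 and 1; columns: yes, no).
counts = [[6, 4],   # cluster 0 holds 6 "yes" and 4 "no" instances
          [3, 1]]   # cluster 1 holds 3 "yes" and 1 "no" instance
classes = ["yes", "no"]
total = sum(map(sum, counts))

# Try every one-to-one mapping of clusters to classes and keep the best one.
best = max(permutations(range(len(classes))),
           key=lambda p: sum(counts[c][p[c]] for c in range(len(counts))))
correct = sum(counts[c][best[c]] for c in range(len(counts)))

print({c: classes[best[c]] for c in range(len(counts))})  # {0: 'yes', 1: 'no'}
print(total - correct, "incorrectly clustered")           # 7 incorrectly clustered
```

On these counts both candidate assignments happen to score 7 correct, so either way 7 of 14 instances (50 %) are incorrectly clustered, matching Table 2.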
4.4 Association
WEKA contains an implementation of the Apriori algo-
rithm for learning association rules. Apriori can compute
all rules that have a given minimum support and exceed a
given confidence, but it works only with discrete data. It
can identify statistical dependencies between groups of at-
tributes[3].
Figure 12 shows the interface of the Associate application,
and Table 3 shows the output of Apriori. The Best rules
found section lists the strongest rules. For example, when
the outlook is overcast, play is always Yes; and when the
day is rainy and play is No, the humidity is high.
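The two numbers behind every line of Table 3 are support (how many instances satisfy both sides of the rule) and confidence (support of the whole rule divided by support of the left-hand side; Apriori keeps rules whose confidence reaches the 0.9 threshold). A toy sketch of that arithmetic in Python, over a hypothetical four-instance dataset invented for illustration (not the paper's weatherData):

```python
# Hypothetical (outlook, play) instances, invented for illustration only.
rows = [("overcast", "yes"), ("overcast", "yes"),
        ("rainy", "no"), ("sunny", "yes")]

def support(pred):
    """Number of instances satisfying the predicate."""
    return sum(1 for r in rows if pred(r))

# Rule: outlook=overcast ==> play=yes
lhs = support(lambda r: r[0] == "overcast")                      # 2
both = support(lambda r: r[0] == "overcast" and r[1] == "yes")   # 2
confidence = both / lhs
print(f"outlook=overcast {lhs} ==> play=yes {both}  conf:({confidence:g})")
```

Rule 1 of Table 3 is the same computation on the real data: the left-hand side matches 4 instances, all 4 of which also satisfy play=Yes, hence conf:(1).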
4.5 Select attributes
Attribute selection[2] involves searching through all pos-
sible combinations of attributes in the data to find which
subset of attributes works best for prediction. To do this,
two objects must be set up: an attribute evaluator and a
search method. Search methods include best-first, forward
selection, random, exhaustive, genetic algorithm and rank-
ing. Evaluation methods include correlation-based, wrap-
per, information gain, chi-squared and so on.
The evaluator determines what method is used to assign
a worth to each subset of attributes. The search method
determines what style of search is performed. WEKA allows
(almost) arbitrary combinations of these two, so it is very
flexible.
Figure 13 and Figure 14 demonstrate attribute selection
outputs using (CfsSubsetEval + Best first) and (ChiSquare-
dAttributeEval + Ranker), respectively. The former selects
outlook and humidity as the best subset for prediction. The
latter method ranks the attributes in descending order: out-
look, humidity, windy and temperature. Therefore, we can
see that outlook and humidity play an important role in
determining whether it is good for playing or not.
Figure 13: Attribute select output (CfsSubsetEval + Best first)
Figure 14: Attribute select output (ChiSquaredAttributeEval + Ranker)
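ChiSquaredAttributeEval scores each attribute by the chi-squared statistic of its contingency table against the class, and Ranker then sorts the attributes by that score. A sketch of the statistic in Python, over a hypothetical outlook-vs-play contingency table (14 instances with 9 yes / 5 no as in the weather data; the per-outlook split below is made up for illustration):

```python
# Hypothetical contingency table: outlook value -> (yes count, no count).
observed = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

n = sum(y + no for y, no in observed.values())            # 14 instances
class_totals = (sum(y for y, _ in observed.values()),     # 9 yes
                sum(no for _, no in observed.values()))   # 5 no

chi2 = 0.0
for y, no in observed.values():
    row_total = y + no
    for obs, col_total in zip((y, no), class_totals):
        expected = row_total * col_total / n  # count expected under independence
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 3))   # 3.547
```

Computing this score for every attribute and sorting in descending order is, in essence, the (ChiSquaredAttributeEval + Ranker) combination behind Figure 14.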
4.6 Visualization
Visualization is very useful in practice; for example, it
helps to determine the difficulty of a learning problem.
WEKA can visualize single attributes and pairs of at-
tributes, as shown in Figure 15.
5. EXPERIMENTER
The Experimenter[8] enables us to set up large-scale ex-
periments, start them running, leave them and come back
when they have finished, and then analyze the performance
statistics that have been collected. It automates the ex-
perimental process and makes it easy to compare the per-
formance of different learning schemes. The statistics can
be stored in ARFF format, and can themselves be the
subject of further data mining.
Figure 16, Figure 17 and Figure 19 show the three main
parts of the Experimenter: we first set up the experiment,
then run it, and finally analyze the results.
To analyze the experiment that has been performed in
this section, click the Experiment button at the top right;
otherwise, supply a file that contains the results of another
experiment. Then click Perform test (near the bottom left).
Figure 15: Visualization of weather data
Figure 16: An experiment: setting it up
Figure 17: An experiment: run
Figure 18: Experiment output result
Figure 19: Statistical test results for the experiment
Figure 20: KnowledgeFlow interface
The result of a statistical significance test of the performance
of the first learning scheme (J48) versus the other two (OneR
and ZeroR) is displayed in the large panel on the right, as
Figure 18 shows.
We are comparing the percent correct statistic: This is se-
lected by default as the comparison field shown toward the
left in figure 19. The three methods are displayed horizon-
tally, numbered (1), (2) and (3), as the heading of a little ta-
ble. The labels for the columns are repeated at the bottom–
trees.J48, rules.OneR, and rules.ZeroR–in case there is in-
sufficient space for them in the heading. The inscrutable in-
tegers beside the scheme names identify which version of the
scheme is being used. They are present by default to avoid
confusion among results generated using different versions
of the algorithms. The value in brackets at the beginning
of the iris row (100) is the number of experimental runs: 10
times tenfold cross-validation. The percentage correct for
the three schemes is shown in figure 19: 94.73% for method
1, 92.53% for method 2, and 33.33% for method 3. The
symbol placed beside a result indicates that it is statistically
better (v) or worse (*) than the baseline scheme (in this case
J48) at the specified significance level (0.05, or 5%). The
corrected resampled t-test[8] is used here. As shown, method 3
is significantly worse than method 1 because its success rate
is followed by an asterisk. At the bottom of columns 2 and
3 are counts (x/y/z) of the number of times the scheme was
better than (x), the same as (y), or worse than (z) the base-
line scheme on the datasets used in the experiment. In this
case there is only one dataset; method 2 was equivalent to
method 1 (the baseline) once, and method 3 was worse than
it once. (The annotation (v/ /*) is placed at the bottom of
column 1 to help you remember the meanings of the three
counts (x/y/z).)
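The corrected resampled t-test adjusts the ordinary paired t-statistic for the fact that cross-validation runs reuse overlapping training data, which makes their score differences correlated. A sketch of the statistic in Python (following the Nadeau-Bengio correction as described in [8]; the sample differences below are made up for illustration):

```python
import math

def corrected_resampled_t(diffs, test_train_ratio=1/9):
    """Corrected resampled t-statistic for per-run score differences.

    diffs: one difference per run (e.g. 100 values for 10x10-fold CV).
    test_train_ratio: n2/n1; 1/9 for tenfold cross-validation, since each
    run tests on 1/10 of the data and trains on the other 9/10.
    """
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    # A plain resampled t-test would divide by sqrt(var / k); the correction
    # inflates the variance term by the train/test overlap ratio.
    return mean / math.sqrt((1 / k + test_train_ratio) * var)

# Hypothetical per-run accuracy differences between two schemes:
t = corrected_resampled_t([0.0, 2.0] * 50)
print(round(t, 3))   # 2.859
```

The resulting statistic is compared against the t distribution with k-1 degrees of freedom at the chosen significance level (5% here) to decide the v and * markers.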
6. KNOWLEDGE FLOW
The KnowledgeFlow[2] provides an alternative to the Ex-
plorer as a graphical front end to WEKA’s core algorithms.
The interface of the KnowledgeFlow is shown in Figure 20.
The KnowledgeFlow is a work in progress, so some of the
functionality from the Explorer is not yet available; on the
other hand, there are things that can be done in the Knowl-
edgeFlow but not in the Explorer.
The KnowledgeFlow presents a data-flow inspired inter-
face to WEKA. The user can select WEKA components
from a tool bar, place them on a layout canvas and con-
nect them together in order to form a knowledge flow for
processing and analyzing data. At present, all of WEKA’s
classifiers, filters, clusterers, loaders and savers are available
in the KnowledgeFlow along with some extra tools.
Figure 21 shows a J48 flow in the KnowledgeFlow appli-
cation, and Figure 22 shows the corresponding result, which
is the same as Table 1.
Figure 21: KnowledgeFlow (J48)
Figure 22: J48 result of KnowledgeFlow
7. SIMPLE CLI
Lurking behind WEKA's interactive interfaces (the Ex-
plorer, the KnowledgeFlow, and the Experimenter) lies its
basic functionality, which can be accessed more directly
through a command-line interface: the Simple CLI. Its in-
terface is shown in Figure 23. It has a plain textual panel
with a line at the bottom on which we enter commands.
For example, when we type "java weka.associations.Apriori
-t data/weather.nominal.arff" into the command line, the
result shown in Figure 24 appears, which is the same as
Table 3.
8. SUMMARY
WEKA has proved itself to be a useful and even essential
tool in the analysis of real world data sets. It reduces the
level of complexity involved in getting real world data into a
variety of machine learning schemes and evaluating the out-
put of those schemes. It has also provided a flexible aid for
machine learning research and a tool for introducing people
to machine learning in an educational environment[4].
9. ACKNOWLEDGMENT
I wish to thank Prof. Yong LIANG for his patient teaching
in class and vital suggestions on this report.
Figure 23: Simple CLI interface
Figure 24: Apriori result shown in Simple CLI interface
10. REFERENCES
[1] D. Baumgartner and G. Serpen (2009). Large
Experiment and Evaluation Tool for WEKA
Classifiers. DMIN, pp. 340-346.
[2] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P.
Reutemann, A. Seewald and D. Scuse (2015). WEKA
Manual for Version 3-6-13. University of Waikato,
Hamilton, New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[3] E. Frank. Machine Learning with WEKA
[PowerPoint slides]. University of Waikato, Hamilton,
New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[4] S. R. Garner (1995). WEKA: The Waikato
Environment for Knowledge Analysis. Proceedings of
the New Zealand Computer Science Research
Students Conference, pp. 57-64.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P.
Reutemann and I. H. Witten (2009). The WEKA data
mining software: an update. ACM SIGKDD
Explorations Newsletter, 11(1), pp. 10-18.
[6] O. Maimon and L. Rokach (2005). Data Mining and
Knowledge Discovery Handbook (Vol. 2). New York:
Springer.
[7] B. Pfahringer (2007). WEKA: A tool for exploratory
data mining [PowerPoint slides]. University of
Waikato, New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/index.html.
[8] I. H. Witten, E. Frank and M. A. Hall (2011). Data
Mining: Practical Machine Learning Tools and
Techniques, 3rd ed. Morgan Kaufmann, San
Francisco.
[9] I. H. Witten and E. Frank (2005). Data Mining:
Practical Machine Learning Tools and Techniques,
2nd ed. Morgan Kaufmann, San Francisco.
[10] WEKA: The University of Waikato. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/index.html.