1. An Introduction to WEKA
Fayan TAO∗
Computer and information science system
Macau University of Science and Technology
fytao2015@gmail.com
ABSTRACT
WEKA is a data mining software workbench, which has
wide applications in machine learning technology. It has an
active community and enjoys widespread acceptance in both
academia and business. This report provides an introduction to WEKA and demonstrates how to use each of its applications, namely the Explorer, Experimenter, KnowledgeFlow and Simple CLI, based on version 3.6.13. Two classical datasets, the iris data and the weather data, are used in the experiments, and the corresponding output results are shown and analysed in this report.
Keywords
WEKA; data mining; machine learning
1. INTRODUCTION
The WEKA[10][5] workbench is a collection of state-of-
the-art machine learning algorithms and data preprocess-
ing tools. It was developed at the University of Waikato in
New Zealand; the name stands for Waikato Environment for
Knowledge Analysis. (Outside the university, the weka,
pronounced to rhyme with Mecca, is a flightless bird with an
inquisitive nature found only on the islands of New Zealand
[8].) The system is written in Java and distributed under the
terms of the GNU General Public License. It runs on almost
any platform and has been tested under Linux, Windows,
and Macintosh operating systems, and even on a personal
digital assistant.
It provides a uniform interface to many different learning
algorithms, along with methods for pre- and post-processing
and for evaluating the results of learning schemes on any
given dataset[6].
It contains several standard data mining techniques, including data preprocessing, classification, regression, clustering, and association.
∗Stu ID: 1509853F-II20-0019
1This report follows ACM TEX format.
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235
Figure 1: WEKA interface
2. BACKGROUND AND HISTORY
The WEKA project has been funded by the New Zealand
government since 1993[1][7]. The initial goal was stated as follows:
The programme aims to build a state-of-the-art facility for
developing techniques of machine learning and investigating
their application in key areas of the New Zealand economy.
Specifically we will create a workbench for machine learning,
determine the factors that contribute towards its successful
application in the agriculture industries, and develop new
methods of machine learning and ways of assessing their ef-
fectiveness.
In 1996, a mostly C version of WEKA was released, while
in 1999 it was redeveloped and released in Java to support
platform independence. Today, there are several versions of
WEKA available to the public. The GUI version (6.0) is the
most recent release. The developer version (3.5.8) allows
users to obtain and modify the source code to add content or fix
bugs. The book version (3.4.14) is the one described in the data
mining book by Witten and Frank[9].
3. INTERFACES
WEKA has four interfaces, which are started from the
main GUI Chooser window, as shown in figure 1 (this discussion is based on WEKA version 3.6.13). They can handle
data preprocessing, classification, regression, clustering, and
association [1][2]. The four application interfaces are as follows:
• Explorer: An environment for exploring data with WEKA,
which contains six main functions, namely Preprocess,
Classify, Cluster, Associate, Select attributes and Visualize.
• Experimenter: An environment for performing exper-
iments and conducting statistical tests between learn-
Figure 2: Explorer interface
Figure 3: Preprocess interface
ing schemes.
• KnowledgeFlow: This environment supports essentially
the same functions as the Explorer but with a drag-
and-drop interface. One advantage is that it supports
incremental learning.
• SimpleCLI: An environment providing a simple command-
line interface that allows direct execution of WEKA
commands for operating systems that do not provide
their own command line interface.
4. EXPLORER
Explorer is the main graphical user interface in WEKA.
It is shown in figure 2.
It has six different panels, accessed by the tabs at the top,
that correspond to the various data mining tasks supported
([8][2]). We discuss these six functions one by one in the
following sections.
• Preprocess: Choose and modify the data being acted
on.
• Classify: Train and test learning schemes that classify
or perform regression.
• Cluster: Learn clusters for the data.
• Associate: Learn association rules for the data.
• Select attributes: Select the most relevant attributes in
the data.
• Visualize: View an interactive 2D plot of the data.
4.1 Preprocess
The Preprocess interface is shown in figure 3.
The first four buttons at the top of the Preprocess panel
are used to load data into WEKA[2]:
• Open file. . . Brings up a dialog box allowing us to
browse for the data file on the local file system.
Figure 4: Left: Weather data (.CSV); Right: Iris
data (.arff)
Figure 5: Iris data after preprocessing
• Open URL. . . Asks for a Uniform Resource Locator
address for where the data is stored.
• Open DB. . . Reads data from a database. (Note that
to make this work we might have to edit the file in
WEKA/experiment/DatabaseUtils.props.)
• Generate. . . Enables us to generate artificial data
from a variety of DataGenerators.
Using the Open file. . . button, we can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5
format, or serialized Instances format. ARFF files typically
have a .arff extension, CSV files a .csv extension, C4.5 files a
.data and .names extension, and serialized Instances objects
have a .bsi extension.
Note: This list of formats can be extended by adding custom file converters to the weka.core.converters package.
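To make the ARFF format concrete, the following is a minimal sketch of the weather data in ARFF form. The attribute names and nominal values are taken from the report's own tables; the first data line is illustrative and the remaining 13 instance lines are elided:

```
% Weather data sketched in ARFF (names and values from the report's tables)
@relation weatherData

@attribute outlook     {Sunny, Overcast, Rainy}
@attribute temperature {Hot, Mild, Cool}
@attribute humidity    {High, Normal}
@attribute windy       {True, False}
@attribute play        {Yes, No}

@data
Overcast,Hot,High,False,Yes
% ... 13 more instances, one comma-separated line each
```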
Here, we take weather data and iris data as examples. Fig-
ure 4 shows the weather data in the .CSV format and the
iris data in the .arff format. Figure 2 and figure 5 show the
weather data and the iris data after preprocessing, respectively.
We can see that the Preprocess interface displays dataset information, such as the relation name, the instances and the attributes. It
also shows statistics, including the Minimum, Maximum,
Mean and StdDev values. Additionally, we can discretize the data.
As figure 6 shows, the weather data contains 5 sunny
days, 4 overcast days and 5 rainy days; all overcast
days are suitable for playing, while only some of the sunny
or rainy days are.
4.2 Classification
The Classify interface is shown in figure 7. To analyse
the weather data, we have to choose a classifier and test
options. In our case, we choose trees.J48 as the classifier.
In the Test options box, there are four test modes[2]:
• Use training set: The classifier is evaluated on how
well it predicts the class of the instances it was trained
on.
Figure 6: Discretize weather data
Figure 7: Classify interface
• Supplied test set: The classifier is evaluated on how
well it predicts the class of a set of instances loaded
from a file. Clicking the Set. . . button brings up a
dialog allowing us to choose the file to test on.
• Cross-validation: The classifier is evaluated by cross-
validation, using the number of folds that are entered
in the Folds text field.
• Percentage split: The classifier is evaluated on how
well it predicts a certain percentage of the data which
is held out for testing. The amount of data held out
depends on the value entered in the % field.
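The Percentage split mode can be sketched in a few lines. This is a minimal illustration rather than WEKA's exact implementation; the fixed shuffling seed and the rounding rule are assumptions:

```python
import random

def percentage_split(instances, train_pct=66.0, seed=1):
    """Shuffle with a fixed seed, then hold out the last
    (100 - train_pct)% of the instances for testing."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    n_train = round(len(data) * train_pct / 100.0)
    return data[:n_train], data[n_train:]

# 14 weather instances split at 66%: 9 for training, 5 for testing
train, test = percentage_split(range(14))
print(len(train), len(test))  # 9 5
```

This matches the report's summary output below, where the evaluation covers 5 test instances.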
Here, we set the Percentage split to 66%, which means that the
classifier is trained on 66% of the weather data and evaluated on the remaining 34%.
Figure 8 shows the resulting tree, and table 1 shows the
classifier output. As we can see, the text in the Classifier
output area is split into several sections:
• Run information: A list of information giving the learn-
ing scheme options, relation name, instances, attributes
and test mode that were involved in the process.
• Classifier model (full training set): A textual represen-
tation of the classification model that was produced on
the full training data.
• Summary: A list of statistics summarizing how accu-
rately the classifier was able to predict the true class
of the instances under the chosen test mode.
Figure 8: Classification visualize tree
Figure 9: Visualize cluster assignments (SimpleK-
Means)
Figure 10: Visualize cluster assignments (Hierarchi-
calClusterer)
• Detailed Accuracy By Class: A more detailed per-class
breakdown of the classifier's prediction accuracy. Here,
the true positives (TP)[8] and true negatives (TN) are
correct classifications. A false positive (FP) is when
the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). A false negative
(FN) is when the outcome is incorrectly predicted as
negative when it is actually positive. The true positive
rate is TP divided by the total number of positives,
i.e. TP/(TP + FN); the false positive rate
is FP divided by the total number of negatives,
i.e. FP/(FP + TN). The overall success rate
is the number of correct classifications divided by the
total number of classifications, i.e. (TP + TN)/(TP + TN + FP + FN).
Finally, the error rate is 1 minus this.
• Confusion Matrix: Shows how many instances have
been assigned to each class. Each element shows the number of test examples whose actual class is the row and
whose predicted class is the column. In table 1, we
can see that 2 no instances are classified as
yes, while 1 yes instance is mistaken as no.
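The per-class rates defined above can be checked directly against the confusion matrix in table 1. A minimal sketch (the function and variable names here are mine, not WEKA's API):

```python
def class_metrics(cm, pos):
    """TP rate and FP rate for class index pos, given a confusion
    matrix cm[actual][predicted] with rows/columns in class order."""
    n = len(cm)
    tp = cm[pos][pos]
    fn = sum(cm[pos][j] for j in range(n) if j != pos)
    fp = sum(cm[i][pos] for i in range(n) if i != pos)
    tn = sum(cm[i][j] for i in range(n) for j in range(n)
             if i != pos and j != pos)
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return tp_rate, fp_rate

# Confusion matrix from Table 1: classes (No, Yes), rows = actual class
cm = [[0, 2],
      [1, 2]]
yes_tp, yes_fp = class_metrics(cm, 1)
print(round(yes_tp, 3), round(yes_fp, 3))  # 0.667 1.0
```

These values agree with the Yes row of the Detailed Accuracy By Class section in table 1.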
4.3 Clustering
WEKA contains clusterers for finding groups of similar instances in a dataset. Several schemes are implemented,
such as k-Means, EM, Cobweb, X-means and FarthestFirst. Besides, clusters can be visualized and compared
to "true" clusters (if given). If the clustering scheme produces
a probability distribution, the evaluation is based on
log-likelihood.
In this report, we choose three different clusterers, SimpleKMeans, HierarchicalClusterer and EM, to analyse the weather
data.
Figure 9, figure 10 and figure 11 show the visualized cluster
assignments produced by SimpleKMeans, HierarchicalClusterer and EM, respectively.
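On numeric attributes, the core of SimpleKMeans is Lloyd's algorithm. The following is a bare-bones sketch under stated assumptions (random initial centroids drawn from the data, squared Euclidean distance, numeric attributes only); WEKA's version additionally handles nominal values using modes:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=500, seed=10):
    """Lloyd's algorithm: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    centroids = random.Random(seed).sample(points, k)
    assign = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: dist2(p, centroids[c]))
               for p in points]
        if new == assign:      # assignments unchanged: converged
            break
        assign = new
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return assign, centroids

# Two obvious groups of 2-D points end up in two clusters
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
assign, _ = kmeans(pts, 2)
```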
Table 2 shows the cluster output by Kmeans. We can see
that, there are 14 instances in total. They are clustered into
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weatherData
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===

J48 pruned tree
------------------

outlook = Rainy
|   humidity = High: No (3.0)
|   humidity = Normal: Yes (2.0)
outlook = Overcast: Yes (4.0)
outlook = Sunny
|   windy = False: Yes (3.0)
|   windy = True: No (2.0)

Number of Leaves:  5
Size of the tree:  8

Time taken to build model: 0.03 seconds

=== Evaluation on test split ===
=== Summary ===

Correctly Classified Instances        2     40 %
Incorrectly Classified Instances      3     60 %
Kappa statistic                      -0.3636
Mean absolute error                   0.6
Root mean squared error               0.7746
Relative absolute error             126.9231 %
Root relative squared error         157.6801 %
Total Number of Instances             5

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0        0.333    0          0       0          0.333     No
               0.667    1        0.5        0.667   0.571      0.333     Yes
Weighted Avg.  0.4      0.733    0.3        0.4     0.343      0.333

=== Confusion Matrix ===

 a b  <-- classified as
 0 2 | a = No
 1 2 | b = Yes

Table 1: Classifier output (J48)
Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     weather.symbolic
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
Ignored:
              play
Test mode:    Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 21.000000000000004
Missing values globally replaced with mean/mode

Cluster centroids:
                          Cluster#
Attribute    Full Data         0         1
                  (14)      (10)       (4)
==============================================
outlook          sunny     sunny  overcast
temperature       mild      mild      cool
humidity          high      high    normal
windy            FALSE     FALSE      TRUE

Time taken to build model (full training data): 0 seconds

=== Model and evaluation on training set ===

Clustered Instances
0      10 ( 71%)
1       4 ( 29%)

Class attribute: play
Classes to Clusters:

 0 1  <-- assigned to cluster
 6 3 | yes
 4 1 | no

Cluster 0 <-- yes
Cluster 1 <-- no

Incorrectly clustered instances: 7.0  50 %

Table 2: Cluster output (Kmeans)
=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     weatherData
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:

 1. outlook=Overcast 4 ==> play=Yes 4    conf:(1)
 2. temperature=Cool 4 ==> humidity=Normal 4    conf:(1)
 3. humidity=Normal windy=False 4 ==> play=Yes 4    conf:(1)
 4. outlook=Rainy play=No 3 ==> humidity=High 3    conf:(1)
 5. outlook=Rainy humidity=High 3 ==> play=No 3    conf:(1)
 6. outlook=Sunny play=Yes 3 ==> windy=False 3    conf:(1)
 7. outlook=Sunny windy=False 3 ==> play=Yes 3    conf:(1)
 8. temperature=Cool play=Yes 3 ==> humidity=Normal 3    conf:(1)
 9. outlook=Rainy temperature=Hot 2 ==> humidity=High 2    conf:(1)
10. temperature=Hot play=No 2 ==> outlook=Rainy 2    conf:(1)

Table 3: Associator output (Apriori)
Figure 11: Visualize cluster assignments (EM)
Figure 12: Associate interface
two groups, 0 and 1. Group 0 represents play, while group 1
means non-play. They contain 10 instances and 4 instances,
respectively. According to the Classes to Clusters section of the output,
3 play instances are mistaken as non-play
and 4 non-play instances are misclassified as play. So,
in fact, there are 9 play instances and 5 non-play instances
in the real data.
4.4 Association
WEKA contains an implementation of the Apriori algorithm for learning association rules. Apriori can compute
all rules that have a given minimum support and exceed a
given confidence, but it works only with discrete data. It
can identify statistical dependencies between groups of attributes[3].
Figure 12 shows the interface of the Associate application. Table 3 shows the output of Apriori. The Best rules found section lists the strongest rules. For example,
rule 1 says that when the outlook is overcast, it is always suitable for playing; rule 4 says that on rainy days when there is no play, the humidity
is always high.
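The support and confidence counts in table 3 can be reproduced by direct counting over the 14 weather instances. The instance list below is reconstructed to match the counts in tables 1 and 3, and the helper functions are mine, not WEKA's API:

```python
ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
RAW = """Rainy,Hot,High,False,No
Rainy,Hot,High,True,No
Overcast,Hot,High,False,Yes
Sunny,Mild,High,False,Yes
Sunny,Cool,Normal,False,Yes
Sunny,Cool,Normal,True,No
Overcast,Cool,Normal,True,Yes
Rainy,Mild,High,False,No
Rainy,Cool,Normal,False,Yes
Sunny,Mild,Normal,False,Yes
Rainy,Mild,Normal,True,Yes
Overcast,Mild,High,True,Yes
Overcast,Hot,Normal,False,Yes
Sunny,Mild,High,True,No"""
rows = [dict(zip(ATTRS, ln.split(","))) for ln in RAW.splitlines()]

def support(rows, items):
    """Count rows containing every (attribute, value) pair in items."""
    return sum(all(r[a] == v for a, v in items) for r in rows)

def confidence(rows, lhs, rhs):
    """conf(lhs ==> rhs) = support(lhs and rhs) / support(lhs)."""
    return support(rows, lhs + rhs) / support(rows, lhs)

# Rule 1 from Table 3: outlook=Overcast 4 ==> play=Yes 4, conf:(1)
print(support(rows, [("outlook", "Overcast")]),
      confidence(rows, [("outlook", "Overcast")], [("play", "Yes")]))
# 4 1.0
```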
4.5 Select attributes
Attribute selection[2] involves searching through all pos-
sible combinations of attributes in the data to find which
subset of attributes works best for prediction. To do this,
two objects must be set up: an attribute evaluator and a
search method. A search method contains best-first, for-
ward selection, random, exhaustive, genetic algorithm and
ranking. An evaluation method includes correlation-based,
wrapper, information gain,chi-squared and so on.
The evaluator determines what method is used to assign
a worth to each subset of attributes. The search method
determines what style of search is performed. WEKA allows
(almost) arbitrary combinations of these two, so it is very
flexible.
Figure 13 and figure 14 show the attribute selection outputs of (CfsSubsetEval + Best first) and (ChiSquaredAttributeEval + Ranker), respectively. The former selects
outlook and humidity as the best attribute subset for prediction. The
latter ranks the attributes in descending
Figure 13: Attribute select output (CfsSubsetEval
+ Best first)
Figure 14: Attribute select output (ChiSquaredAt-
tributeEval + Ranker)
order: outlook, humidity, windy and temperature. Therefore,
we can see that outlook and humidity play an important role
in determining whether the day is suitable for playing or not.
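A ranking of this kind can be reproduced by hand with an information-gain style evaluator. The following sketch uses the same 14 weather instances (reconstructed to match the report's tables; the helper names are mine, and the report's Ranker uses chi-squared, which here yields the same ordering as information gain):

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
RAW = """Rainy,Hot,High,False,No
Rainy,Hot,High,True,No
Overcast,Hot,High,False,Yes
Sunny,Mild,High,False,Yes
Sunny,Cool,Normal,False,Yes
Sunny,Cool,Normal,True,No
Overcast,Cool,Normal,True,Yes
Rainy,Mild,High,False,No
Rainy,Cool,Normal,False,Yes
Sunny,Mild,Normal,False,Yes
Rainy,Mild,Normal,True,Yes
Overcast,Mild,High,True,Yes
Overcast,Hot,Normal,False,Yes
Sunny,Mild,High,True,No"""
rows = [dict(zip(ATTRS, ln.split(","))) for ln in RAW.splitlines()]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target="play"):
    """H(target) minus the value-weighted entropy after splitting on attr."""
    split = 0.0
    for v in {r[attr] for r in rows}:
        sub = [r[target] for r in rows if r[attr] == v]
        split += len(sub) / len(rows) * entropy(sub)
    return entropy([r[target] for r in rows]) - split

gains = {a: info_gain(rows, a) for a in ATTRS[:4]}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked)  # ['outlook', 'humidity', 'windy', 'temperature']
```

The ordering matches figure 14: outlook first, temperature last.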
4.6 Visualization
Visualization is very useful in practice; for example, it
helps to identify where the difficulty lies in a learning problem. WEKA can visualize single attributes and pairs of
attributes, as shown in figure 15.
5. EXPERIMENTER
The Experimenter[8] enables us to set up large-scale experiments, start them running, leave them and come back
when they have finished, and then analyze the performance
statistics that have been collected. It automates the experimental process and makes it easy to compare
the performance of different learning schemes. The statistics
can be stored in ARFF format, and can themselves be the
subject of further data mining.
Figure 16, figure 17 and figure 19 show the three main
parts of the Experimenter: we first set up the experiment, then
run it, and finally analyse the results.
To analyze the experiment that has been performed in
this section, click the Experiment button at the top right;
otherwise, supply a file that contains the results of another
experiment. Then click Perform test (near the bottom left).
Figure 15: Visualization of weather data
Figure 16: An experiment: setting it up
Figure 17: An experiment: run
Figure 18: Experiment output result
Figure 19: Statistical test results for the experiment
Figure 20: KnowledgeFlow interface
The result of a statistical significance test of the performance
of the first learning scheme (J48) versus the other two (OneR
and ZeroR) is displayed in the large panel on the right,
as figure 18 shows.
We are comparing the percent correct statistic: This is se-
lected by default as the comparison field shown toward the
left in figure 19. The three methods are displayed horizon-
tally, numbered (1), (2) and (3), as the heading of a little ta-
ble. The labels for the columns are repeated at the bottom–
trees.J48, rules.OneR, and rules.ZeroR–in case there is in-
sufficient space for them in the heading. The inscrutable in-
tegers beside the scheme names identify which version of the
scheme is being used. They are present by default to avoid
confusion among results generated using different versions
of the algorithms. The value in brackets at the beginning
of the iris row (100) is the number of experimental runs: 10
times tenfold cross-validation. The percentage correct for
the three schemes is shown in figure 19: 94.73% for method
1, 92.53% for method 2, and 33.33% for method 3. The
symbol placed beside a result indicates that it is statistically
better (v) or worse (*) than the baseline scheme, in this case
J48, at the specified significance level (0.05, or 5%). The corrected resampled t-test[8] is used here. As shown, method 3
is significantly worse than method 1 because its success rate
is followed by an asterisk. At the bottom of columns 2 and
3 are counts (x/y/z) of the number of times the scheme was
better than (x), the same as (y), or worse than (z) the baseline scheme on the datasets used in the experiment. In this
case there is only one dataset; method 2 was equivalent to
method 1 (the baseline) once, and method 3 was worse than
it once. (The annotation (v/ /*) is placed at the bottom of
column 1 to help you remember the meanings of the three
counts (x/y/z).)
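The corrected resampled t-test mentioned above inflates the variance term to compensate for the overlapping training sets across cross-validation runs[8]. A minimal sketch, with synthetic per-run difference values (this is not WEKA's implementation):

```python
from math import sqrt
from statistics import mean, variance

def corrected_resampled_t(diffs, test_frac):
    """t-statistic over per-run accuracy differences between two schemes.
    The usual 1/k factor on the variance is augmented by
    test_frac / train_frac (the Nadeau-Bengio correction)."""
    k = len(diffs)
    corr = 1 / k + test_frac / (1 - test_frac)
    return mean(diffs) / sqrt(corr * variance(diffs))

# 10 x 10-fold cross-validation gives k = 100 runs, 10% test fraction
diffs = [1.0] * 50 + [2.0] * 50   # synthetic per-run differences
t = corrected_resampled_t(diffs, test_frac=0.1)
```

When |t| exceeds the critical value at the 5% level, the result is flagged with v or * as in figure 19.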
6. KNOWLEDGE FLOW
The KnowledgeFlow[2] provides an alternative to the Ex-
plorer as a graphical front end to WEKA’s core algorithms.
The interface of the KnowledgeFlow is shown in figure 20. The
KnowledgeFlow is a work in progress so some of the func-
tionality from the Explorer is not yet available. On the other
hand, there are things that can be done in the Knowledge-
Flow but not in the Explorer.
The KnowledgeFlow presents a data-flow inspired inter-
face to WEKA. The user can select WEKA components
from a tool bar, place them on a layout canvas and con-
nect them together in order to form a knowledge flow for
processing and analyzing data. At present, all of WEKA’s
classifiers, filters, clusterers, loaders and savers are available
in the KnowledgeFlow along with some extra tools.
Figure 21 shows the J48 operational mechanism in the KnowledgeFlow application. Figure 22 shows the corresponding result,
Figure 21: KnowledgeFlow (J48)
Figure 22: J48 Result of KnowledgeFlow
which is the same as that shown in table 1.
7. SIMPLE CLI
Lurking behind WEKA's interactive interfaces (the Explorer, the KnowledgeFlow, and the Experimenter) lies its
basic functionality, which can be accessed more directly through
a command-line interface: the Simple CLI. Its interface
is shown in figure 23. It has a plain textual panel with a
line at the bottom on which we enter commands.
For example, when we type "java weka.associations.Apriori
-t data/weather.nominal.arff" into the text panel,
the result shown in figure 24 is produced, which is the same as
table 3 shows.
8. SUMMARY
WEKA has proved itself to be a useful and even essential
tool in the analysis of real-world datasets. It reduces the
level of complexity involved in getting real-world data into a
variety of machine learning schemes and evaluating the output of those schemes. It has also provided a flexible aid for
machine learning research and a tool for introducing people
to machine learning in an educational environment[4].
9. ACKNOWLEDGMENT
I wish to thank Prof. Yong LIANG for his patient teaching
in class and his vital suggestions on this report.
10. REFERENCES
Figure 23: Simple CLI interface
Figure 24: Apriori result shown in Simple CLI in-
terface
[1] D. Baumgartner and G. Serpen. (2009). Large
Experiment and Evaluation Tool for WEKA
Classifiers. DMIN. pp: 340-346.
[2] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P.
Reutemann, A. Seewald & D. Scuse, (2015). WEKA
Manual for Version 3-6-13. University of Waikato,
Hamilton, New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[3] E. Frank. Machine Learning with WEKA,
[Power-Point slides].University of Waikato, Hamilton,
New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[4] S. R. Garner. (1995). WEKA: The waikato
environment for knowledge analysis Proceedings of the
New Zealand computer science research students
conference. pp: 57-64.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P.
Reutemann & I. H. Witten, (2009). The WEKA data
mining software: an update[J]. ACM SIGKDD
explorations newsletter, 11(1). pp:10-18.
[6] O. Maimon and L. Rokach,(2005). Data mining and
Knowledge discovery handbook. (Vol. 2). New York:
Springer.
[7] B. Pfahringer. (2007). WEKA: A tool for exploratory
data mining [Power-Point slides]. University of
Waikato, New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/index.html.
[8] I. H. Witten, E. Frank and M. A. Hall, (2011). Data
Mining: Practical Machine Learning Tools and
Techniques. 3rd ed., Morgan Kaufmann, San
Francisco.
[9] I. H. Witten and E. Frank. (2005). Data Mining:
Practical Machine Learning Tools and Techniques. 2nd
ed., Morgan Kaufmann, San Francisco.
[10] WEKA: The University of Waikato. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/index.html