An Introduction to WEKA
Fayan TAO
∗
Computer and information science system
Macau University of Science and Technology
fytao2015@gmail.com
ABSTRACT
WEKA is a data mining software workbench, which has
wide applications in machine learning technology. It has an
active community and enjoys widespread acceptance in both
academia and business. This report provides an introduction to WEKA and demonstrates how to use each of its applications, namely the Explorer, Experimenter, KnowledgeFlow and Simple CLI, based on version 3.6.13. Two kinds of classical datasets, the iris data and the weather data, are used in the experiments, and the related outputs are shown and analysed in this report.
Keywords
WEKA; data mining; machine learning
1. INTRODUCTION
The WEKA[10][5] workbench is a collection of state-of-
the-art machine learning algorithms and data preprocess-
ing tools. It was developed at the University of Waikato in
New Zealand; the name stands for Waikato Environment for
Knowledge Analysis. (Outside the university, the weka,
pronounced to rhyme with Mecca, is a flightless bird with an
inquisitive nature found only on the islands of New Zealand
[8].) The system is written in Java and distributed under the
terms of the GNU General Public License. It runs on almost
any platform and has been tested under Linux, Windows,
and Macintosh operating systems, and even on a personal
digital assistant.
It provides a uniform interface to many different learning
algorithms, along with methods for pre- and post-processing
and for evaluating the result of learning schemes on any
given dataset[6].
It contains several standard data mining techniques, including data preprocessing, classification, regression, clustering, and association.
∗Stu ID: 1509853F-II20-0019
1This report follows ACM TEX format.
Figure 1: WEKA interface
2. BACKGROUND AND HISTORY
The WEKA project has been funded by the New Zealand
government since 1993[1][7]. The initial goal at that time was as follows:
The programme aims to build a state-of-the-art facility for
developing techniques of machine learning and investigating
their application in key areas of the New Zealand economy.
Specifically we will create a workbench for machine learning,
determine the factors that contribute towards its successful
application in the agriculture industries, and develop new
methods of machine learning and ways of assessing their effectiveness.
In 1996, a mostly C version of WEKA was released, while
in 1999 it was redeveloped and released in Java to support
platform independence. Today, there are several versions of
WEKA available to the public. The GUI version (6.0) is the
most recent release. The developer version (3.5.8) allows
users to obtain and modify source code to add content or fix
bugs. The book version (3.4.14) is as described in the data
mining book released by Witten and Frank[9].
3. INTERFACES
WEKA has four interfaces, which are started from the
main GUI Chooser window, as shown in figure 1 (this discussion is based on WEKA version 3.6.13). They can handle
data preprocessing, classification, regression, clustering, and
association [1][2]. The specific application interfaces are as follows:
• Explorer: An environment for exploring data with WEKA,
which contains six main functions: Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.
• Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
Figure 2: Explorer interface
Figure 3: Preprocess interface
• KnowledgeFlow: This environment supports essentially
the same functions as the Explorer but with a drag-
and-drop interface. One advantage is that it supports
incremental learning.
• SimpleCLI: An environment providing a simple command-
line interface that allows direct execution of WEKA
commands for operating systems that do not provide
their own command line interface.
4. EXPLORER
Explorer is the main graphical user interface in WEKA.
It is shown in figure 2.
It has six different panels, accessed by the tabs at the top,
that correspond to the various Data Mining tasks supported
([8][2]). We will discuss these six functions one by one in the
following sections.
• Preprocess: Choose and modify the data being acted
on.
• Classify: Train and test learning schemes that classify
or perform regression.
• Cluster: Learn clusters for the data.
• Associate: Learn association rules for the data.
• Select attributes: Select the most relevant attributes in
the data.
• Visualize: View an interactive 2D plot of the data.
4.1 Preprocess
The interface of Preprocess can be seen in figure 3.
The first four buttons at the top of the preprocess section
are used to load data into WEKA[2]:
• Open file. . . Brings up a dialog box allowing us to
browse for the data file on the local file system.
Figure 4: Left: Weather data (.csv); Right: Iris data (.arff)
Figure 5: Iris data after preprocessing
• Open URL. . . Asks for a Uniform Resource Locator
address for where the data is stored.
• Open DB. . . Reads data from a database. (Note that
to make this work we might have to edit the file in
WEKA/experiment/DatabaseUtils.props.)
• Generate. . . Enables us to generate artificial data
from a variety of DataGenerators.
Using the Open file. . . button, we can read files in a vari-
ety of formats: WEKA’s ARFF format, CSV format, C4.5
format, or serialized Instances format. ARFF files typically
have a .arff extension, CSV files a .csv extension, C4.5 files a
.data and .names extension, and serialized Instances objects
have a .bsi extension.
Note: This list of formats can be extended by adding cus-
tom file converters to the WEKA.core.converters package.
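To illustrate the ARFF syntax, a minimal hand-written sketch of the weather data follows (only three of the 14 instances are shown; the attribute values match those used throughout this report):

```
@relation weatherData

@attribute outlook {Sunny, Overcast, Rainy}
@attribute temperature {Hot, Mild, Cool}
@attribute humidity {High, Normal}
@attribute windy {True, False}
@attribute play {Yes, No}

@data
Overcast,Hot,High,False,Yes
Rainy,Cool,Normal,False,Yes
Sunny,Mild,High,True,No
% ... (remaining 11 instances omitted)
```

Lines beginning with % are comments; each @data row is one instance, with values in the order the attributes are declared.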
Here, we take weather data and iris data as examples. Fig-
ure 4 shows the weather data in the .CSV format and the
iris data in the .arff format. Figure 2 and figure 5 show the
weather data and iris data after preprocessing, respectively.
We can see that the Preprocess interface displays data information, such as the data's Relation, Instances and Attributes. It
also shows statistics, including the Minimum, Maximum,
Mean and StdDev values. Additionally, we can discretize data.
As figure 6 shows, the weather data contains 5 sunny
days, 4 overcast days and 5 rainy days. All overcast
days are suitable for playing, while only some of the sunny
or rainy days can be regarded as suitable days to play.
4.2 Classification
The Classify interface is shown in figure 7. To analyse
the weather data, we have to choose a classifier and test
options. In our case, we choose trees.J48 as the classifier.
As to the Test options box, there are four test modes[2]:
• Use training set: The classifier is evaluated on how
well it predicts the class of the instances it was trained
on.
Figure 6: Discretize weather data
Figure 7: Classify interface
• Supplied test set: The classifier is evaluated on how
well it predicts the class of a set of instances loaded
from a file. Clicking the Set. . . button brings up a
dialog allowing us to choose the file to test on.
• Cross-validation: The classifier is evaluated by cross-
validation, using the number of folds that are entered
in the Folds text field.
• Percentage split: The classifier is evaluated on how
well it predicts a certain percentage of the data which
is held out for testing. The amount of data held out
depends on the value entered in the % field.
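The percentage-split mode can be sketched conceptually as follows (plain Python, not WEKA's implementation; the function name and seed handling are illustrative):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle a dataset and hold out (100 - train_pct)% for testing."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = round(len(data) * train_pct / 100)
    return data[:cut], data[cut:]

# 14 weather instances at a 66% split -> 9 for training, 5 for testing,
# matching "Total Number of Instances 5" in the evaluation of Table 1.
train, test = percentage_split(range(14), train_pct=66)
print(len(train), len(test))  # 9 5
```
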
Here, we set Percentage split to 66%, which means that the
classifier is trained on 66 percent of the weather data and tested on the remainder.
Figure 8 shows the visualized tree, and table 1 shows the
classifier output. As we can see, the text in the Classifier
output area is split into several sections:
• Run information: A list of information giving the learn-
ing scheme options, relation name, instances, attributes
and test mode that were involved in the process.
• Classifier model (full training set): A textual represen-
tation of the classification model that was produced on
the full training data.
• Summary: A list of statistics summarizing how accu-
rately the classifier was able to predict the true class
of the instances under the chosen test mode.
Figure 8: Classification visualize tree
Figure 9: Visualize cluster assignments (SimpleK-
Means)
Figure 10: Visualize cluster assignments (Hierarchi-
calClusterer)
• Detailed Accuracy By Class: A more detailed per-class
breakdown of the classifier's prediction accuracy. Here,
the true positives (TP)[8] and true negatives (TN) are
correct classifications. A false positive (FP) is when
the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). A false negative
(FN) is when the outcome is incorrectly predicted as
negative when it is actually positive. The true positive
rate is TP divided by the total number of positives,
which is TP + FN (i.e. TP/(TP + FN)); the false positive rate
is FP divided by the total number of negatives, which
is FP + TN (i.e. FP/(FP + TN)). The overall success rate
is the number of correct classifications divided by the
total number of classifications (i.e. (TP + TN)/(TP + TN + FP + FN)).
Finally, the error rate is 1 minus this.
• Confusion Matrix: Shows how many instances have
been assigned to each class. Elements show the num-
ber of test examples whose actual class is the row and
whose predicted class is the column. In table 1, we
can see that 2 no instances are classified as
yes, while 1 yes instance is mistaken as no.
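These definitions can be checked against the numbers in table 1. A short sketch (plain Python, not WEKA code), using the Yes class of the confusion matrix (TP = 2, FN = 1, FP = 2, TN = 0):

```python
def class_rates(tp, fn, fp, tn):
    """Per-class rates as defined above."""
    tp_rate = tp / (tp + fn)                    # recall for this class
    fp_rate = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall success rate
    return tp_rate, fp_rate, accuracy

# Yes class from the confusion matrix in Table 1:
tp_rate, fp_rate, accuracy = class_rates(tp=2, fn=1, fp=2, tn=0)
print(round(tp_rate, 3), fp_rate, accuracy)  # 0.667 1.0 0.4 -- as in Table 1
```
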
4.3 Clustering
WEKA contains clusterers for finding groups of similar instances in a dataset. Several schemes are implemented,
such as k-means, EM, Cobweb, X-means and FarthestFirst. Clusters can be visualized and compared
to "true" clusters (if given). If the clustering scheme produces
a probability distribution, the evaluation is based on
log-likelihood.
In this report, we choose three different clusterers, SimpleKMeans, HierarchicalClusterer and EM, to analyse the weather
data. Figure 9, figure 10 and figure 11 show the visualized cluster
assignments produced by SimpleKMeans, HierarchicalClusterer and EM, respectively.
Table 2 shows the cluster output of SimpleKMeans. We can see
that there are 14 instances in total. They are clustered into
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weatherData
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: split 66.0% train, remainder test

=== Classifier model (full training set) ===

J48 pruned tree
------------------

outlook = Rainy
|   humidity = High: No (3.0)
|   humidity = Normal: Yes (2.0)
outlook = Overcast: Yes (4.0)
outlook = Sunny
|   windy = False: Yes (3.0)
|   windy = True: No (2.0)

Number of Leaves: 5
Size of the tree: 8

Time taken to build model: 0.03 seconds

=== Evaluation on test split ===
=== Summary ===

Correctly Classified Instances      2      40 %
Incorrectly Classified Instances    3      60 %
Kappa statistic                    -0.3636
Mean absolute error                 0.6
Root mean squared error             0.7746
Relative absolute error           126.9231 %
Root relative squared error       157.6801 %
Total Number of Instances           5

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0        0.333    0          0       0          0.333     No
               0.667    1        0.5        0.667   0.571      0.333     Yes
Weighted Avg.  0.4      0.733    0.3        0.4     0.343      0.333

=== Confusion Matrix ===

 a b   <-- classified as
 0 2 | a = No
 1 2 | b = Yes

Table 1: Classifier output (J48)
Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: weather.symbolic
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
Ignored:
  play
Test mode: Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 21.000000000000004
Missing values globally replaced with mean/mode

Cluster centroids:
                            Cluster#
Attribute      Full Data        0        1
                    (14)      (10)      (4)
==============================================
outlook            sunny     sunny overcast
temperature         mild      mild     cool
humidity            high      high   normal
windy              FALSE     FALSE     TRUE

Time taken to build model (full training data): 0 seconds

=== Model and evaluation on training set ===

Clustered Instances
0   10 (71%)
1    4 (29%)

Class attribute: play
Classes to Clusters:
  0 1  <-- assigned to cluster
  6 3 | yes
  4 1 | no

Cluster 0 <-- yes
Cluster 1 <-- no

Incorrectly clustered instances: 7.0  50 %

Table 2: Cluster output (Kmeans)
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weatherData
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play

=== Associator model (full training set) ===

Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:
 1. outlook=Overcast 4 ==> play=Yes 4    conf:(1)
 2. temperature=Cool 4 ==> humidity=Normal 4    conf:(1)
 3. humidity=Normal windy=False 4 ==> play=Yes 4    conf:(1)
 4. outlook=Rainy play=No 3 ==> humidity=High 3    conf:(1)
 5. outlook=Rainy humidity=High 3 ==> play=No 3    conf:(1)
 6. outlook=Sunny play=Yes 3 ==> windy=False 3    conf:(1)
 7. outlook=Sunny windy=False 3 ==> play=Yes 3    conf:(1)
 8. temperature=Cool play=Yes 3 ==> humidity=Normal 3    conf:(1)
 9. outlook=Rainy temperature=Hot 2 ==> humidity=High 2    conf:(1)
10. temperature=Hot play=No 2 ==> outlook=Rainy 2    conf:(1)

Table 3: Associator output (Apriori)
Figure 11: Visualize cluster assignments (EM)
Figure 12: Associate interface
two groups, 0 and 1. Group 0 represents play, while group 1
represents non-play. They contain 10 instances and 4 instances,
respectively. According to the Classes to Clusters section of the
output, 3 play instances are mistaken as non-play and 4 non-play
instances are misclassified as play, so in fact there are 9 play
instances and 5 non-play instances in the real data.
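The Classes to Clusters evaluation can be reproduced with a small sketch (illustrative Python, not WEKA's code): it tries every class-to-cluster assignment and counts the errors of the best one.

```python
from itertools import permutations

def classes_to_clusters_error(matrix):
    """matrix[i][j] = number of instances of true class i put in cluster j.
    Try every class-to-cluster mapping and keep the one with fewest errors."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    best_correct = max(
        sum(matrix[cls][clu] for cls, clu in enumerate(perm))
        for perm in permutations(range(n))
    )
    return total - best_correct

# Classes-to-clusters counts from Table 2: rows = {yes, no}, cols = {0, 1}
errors = classes_to_clusters_error([[6, 3], [4, 1]])
print(errors)  # 7 incorrectly clustered instances (50% of 14), as in Table 2
```
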
4.4 Association
WEKA contains an implementation of the Apriori algo-
rithm for learning association rules. Apriori can compute
all rules that have a given minimum support and exceed a
given confidence. But it works only with discrete data. It
can identify statistical dependencies between groups of at-
tributes[3].
Figure 12 shows the interface of the Associate application. Table 3 shows the output of Apriori. We can read off the
best rules from the Best rules found section. For example,
when the outlook is overcast, it is good for playing; and when
it is a rainy day with no play, the humidity is high.
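Rule 1 in table 3 can be verified by computing support and confidence directly. The sketch below (a hypothetical helper, not WEKA's Apriori) rebuilds only the outlook and play columns from the counts in the J48 tree of table 1:

```python
def support_confidence(transactions, antecedent, consequent):
    """Support of the rule and confidence = P(consequent | antecedent)."""
    ant, both = set(antecedent), set(antecedent) | set(consequent)
    n_ant = sum(1 for t in transactions if ant <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / len(transactions), n_both / n_ant

# outlook/play pairs reconstructed from the tree counts in Table 1:
transactions = ([["outlook=Overcast", "play=Yes"]] * 4
                + [["outlook=Rainy", "play=Yes"]] * 2
                + [["outlook=Rainy", "play=No"]] * 3
                + [["outlook=Sunny", "play=Yes"]] * 3
                + [["outlook=Sunny", "play=No"]] * 2)
sup, conf = support_confidence(transactions,
                               ["outlook=Overcast"], ["play=Yes"])
print(round(sup, 2), conf)  # 0.29 1.0 -- rule 1: confidence 1 on 4 instances
```
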
4.5 Select attributes
Attribute selection[2] involves searching through all pos-
sible combinations of attributes in the data to find which
subset of attributes works best for prediction. To do this,
two objects must be set up: an attribute evaluator and a
search method. Search methods include best-first, forward selection, random, exhaustive, genetic algorithm and
ranking; evaluation methods include correlation-based,
wrapper, information gain, chi-squared and so on.
The evaluator determines what method is used to assign
a worth to each subset of attributes. The search method
determines what style of search is performed. WEKA allows
(almost) arbitrary combinations of these two, so it is very
flexible.
Figure 13 and figure 14 demonstrate attribute select out-
puts by methods of (CfsSubsetEval + Best first) and (ChiSquare-
dAttributeEval + Ranker), respectively. The former selects
outlook and humidity as the best subset for prediction. The
latter method ranks the attributes in descending order: outlook, humidity, windy and temperature.
Figure 13: Attribute select output (CfsSubsetEval + Best first)
Figure 14: Attribute select output (ChiSquaredAttributeEval + Ranker)
Therefore, we can see that outlook and humidity play an important role
in determining whether it is good for playing or not.
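As an illustration of one common evaluation criterion, information gain (a conceptual sketch, not WEKA's evaluator), the gain of outlook can be computed from the class counts in the J48 tree of table 1:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain(class_counts, split_counts):
    """Gain = entropy(whole set) - weighted entropy after splitting."""
    total = sum(class_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in split_counts)
    return entropy(class_counts) - remainder

# Weather data: 9 play / 5 non-play; outlook splits it into
# rainy (2/3), overcast (4/0) and sunny (3/2), per the tree in Table 1.
gain = info_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain, 3))  # 0.247 -- a high gain, so outlook ranks near the top
```
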
4.6 Visualization
Visualization is very useful in practice; for example, it
helps to determine the difficulty of a learning problem. WEKA can visualize single attributes and pairs of
attributes, as shown in figure 15.
5. EXPERIMENTER
The Experimenter[8] enables us to set up large-scale experiments, start them running, leave them, come back
when they have finished, and then analyze the performance
statistics that have been collected. It automates the experimental process and makes it easy to compare
the performance of different learning schemes. The statistics
can be stored in ARFF format, and can themselves be the
subject of further data mining.
Figure 16, figure 17 and figure 19 show the three main
parts of the Experimenter: we have to set up the experiment first, then
run it, and finally analyse the results.
To analyze the experiment that has been performed in
this section, click the Experiment button at the top right;
otherwise, supply a file that contains the results of another
experiment. Then click Perform test (near the bottom left).
Figure 15: Visualization of weather data
Figure 16: An experiment: setting it up
Figure 17: An experiment: run
Figure 18: Experiment output result
Figure 19: Statistical test results for the experiment
Figure 20: KnowledgeFlow interface
The results of a statistical significance test of the performance
of the first learning scheme (J48) versus the other two (OneR
and ZeroR) are displayed in the large panel on the right,
as figure 18 shows.
We are comparing the percent correct statistic: This is se-
lected by default as the comparison field shown toward the
left in figure 19. The three methods are displayed horizon-
tally, numbered (1), (2) and (3), as the heading of a little ta-
ble. The labels for the columns are repeated at the bottom–
trees.J48, rules.OneR, and rules.ZeroR–in case there is in-
sufficient space for them in the heading. The inscrutable in-
tegers beside the scheme names identify which version of the
scheme is being used. They are present by default to avoid
confusion among results generated using different versions
of the algorithms. The value in brackets at the beginning
of the iris row (100) is the number of experimental runs: 10
times tenfold cross-validation. The percentage correct for
the three schemes is shown in figure 19: 94.73% for method
1, 92.53% for method 2, and 33.33% for method 3. The
symbol placed beside a result indicates that it is statistically
better (v) or worse (*) than the baseline scheme-in this case
J48-at the specified significance level (0.05, or 5%). The cor-
rected resampled t-test[8] is used here. As shown, method 3
is significantly worse than method 1 because its success rate
is followed by an asterisk. At the bottom of columns 2 and
3 are counts (x/y/z) of the number of times the scheme was
better than (x), the same as (y), or worse than (z) the base-
line scheme on the datasets used in the experiment. In this
case there is only one dataset; method 2 was equivalent to
method 1 (the baseline) once, and method 3 was worse than
it once. (The annotation (v/ /*) is placed at the bottom of
column 1 to help you remember the meanings of the three
counts (x/y/z).)
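The corrected resampled t-test used here inflates the variance to account for the overlap between training sets across repeated runs. A sketch of the statistic follows (the run counts and difference values below are synthetic, not from this experiment):

```python
from math import sqrt
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Nadeau-Bengio corrected t statistic for k paired performance
    differences between two schemes; n_test/n_train inflates the variance."""
    k = len(diffs)
    return mean(diffs) / sqrt((1 / k + n_test / n_train) * variance(diffs))

# Synthetic per-run accuracy differences from a 90/10 split:
diffs = [0.02, 0.03, 0.01, 0.04, 0.02]
t_corr = corrected_resampled_t(diffs, n_train=90, n_test=10)
t_naive = mean(diffs) / sqrt(variance(diffs) / len(diffs))
print(t_corr < t_naive)  # True: the correction makes the test more conservative
```
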
6. KNOWLEDGE FLOW
The KnowledgeFlow[2] provides an alternative to the Ex-
plorer as a graphical front end to WEKA’s core algorithms.
The interface of KnowledgeFlow is shown in figure 20. The
KnowledgeFlow is a work in progress, so some of the functionality from the Explorer is not yet available. On the other
hand, there are things that can be done in the KnowledgeFlow but not in the Explorer.
The KnowledgeFlow presents a data-flow inspired inter-
face to WEKA. The user can select WEKA components
from a tool bar, place them on a layout canvas and con-
nect them together in order to form a knowledge flow for
processing and analyzing data. At present, all of WEKA’s
classifiers, filters, clusterers, loaders and savers are available
in the KnowledgeFlow along with some extra tools.
Figure 21 shows a J48 flow in the KnowledgeFlow application, and figure 22 shows the corresponding result,
which is the same as in table 1.
Figure 21: KnowledgeFlow (J48)
Figure 22: J48 Result of KnowledgeFlow
7. SIMPLE CLI
Lurking behind WEKA's interactive interfaces (the Explorer, the KnowledgeFlow, and the Experimenter) lies its
basic functionality, which can be accessed more directly through
a command-line interface: the Simple CLI. Its interface
is shown in figure 23. It has a plain textual panel with a
line at the bottom on which we enter commands.
For example, when we type "java weka.associations.Apriori
-t data/weather.nominal.arff" into the command line,
the result appears as shown in figure 24, which is the same as
table 3 shows.
8. SUMMARY
WEKA has proved itself to be a useful and even essential
tool in the analysis of real world data sets. It reduces the
level of complexity involved in getting real world data into a
variety of machine learning schemes and evaluating the out-
put of those schemes. It has also provided a flexible aid for
machine learning research and a tool for introducing people
to machine learning in an educational environment[4].
9. ACKNOWLEDGMENT
I wish to thank Prof. Yong LIANG for his patient teaching
in class and his vital suggestions on this report.
10. REFERENCES
Figure 23: Simple CLI interface
Figure 24: Apriori result shown in Simple CLI in-
terface
[1] D. Baumgartner and G. Serpen, (2009).Large
Experiment and Evaluation Tool for WEKA
Classifiers. DMIN. pp: 340-346.
[2] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P.
Reutemann, A. Seewald & D. Scuse, (2015). WEKA
Manual for Version 3-6-13. University of Waikato,
Hamilton, New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[3] E. Frank. Machine Learning with WEKA,
[Power-Point slides].University of Waikato, Hamilton,
New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[4] S. R. Garner. (1995). WEKA: The waikato
environment for knowledge analysis Proceedings of the
New Zealand computer science research students
conference. pp: 57-64.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P.
Reutemann & I. H. Witten, (2009). The WEKA data
mining software: an update[J]. ACM SIGKDD
explorations newsletter, 11(1). pp:10-18.
[6] O. Maimon and L. Rokach,(2005). Data mining and
Knowledge discovery handbook. (Vol. 2). New York:
Springer.
[7] B. Pfahringer. (2007). WEKA: A tool for exploratory
data mining [Power-Point slides]. University of
Waikato, New Zealand. Retrieved from
http://www.cs.waikato.ac.nz/ ml/weka/ index.html.
[8] I. H. Witten, E. Frank and M. A. Hall, (2011). Data
Mining: Practical Machine Learning Tools and
Techniques. 3rd ed., Morgan Kaufmann, San
Francisco.
[9] I. H. Witten and E. Frank. (2005). Data Mining:
Practical Machine Learning Tools and Techniques. 2nd
ed., Morgan Kaufmann, San Francisco.
[10] WEKA: The University of Waikato. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/index.html

More Related Content

What's hot

Phd coursestatalez2datamanagement
Phd coursestatalez2datamanagementPhd coursestatalez2datamanagement
Phd coursestatalez2datamanagementMarco Delogu
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisYabebal Ayalew
 
Pranavi verma-class-9-spreadsheet
Pranavi verma-class-9-spreadsheetPranavi verma-class-9-spreadsheet
Pranavi verma-class-9-spreadsheetPranaviVerma
 
Sap abap modularization interview questions
Sap abap modularization interview questionsSap abap modularization interview questions
Sap abap modularization interview questionsPradipta Mohanty
 
SAP ABAP Practice exam
SAP ABAP Practice examSAP ABAP Practice exam
SAP ABAP Practice examIT LearnMore
 
Oracle Certification 1Z0-1041 Questions and Answers
Oracle Certification 1Z0-1041 Questions and AnswersOracle Certification 1Z0-1041 Questions and Answers
Oracle Certification 1Z0-1041 Questions and Answersdouglascarnicelli
 
Linkage mapping and QTL analysis_Lab
Linkage mapping and QTL analysis_LabLinkage mapping and QTL analysis_Lab
Linkage mapping and QTL analysis_LabSameer Khanal
 
Dervy bis-155-week-1-quiz-new
Dervy bis-155-week-1-quiz-newDervy bis-155-week-1-quiz-new
Dervy bis-155-week-1-quiz-newindividual484
 

What's hot (14)

Phd coursestatalez2datamanagement
Phd coursestatalez2datamanagementPhd coursestatalez2datamanagement
Phd coursestatalez2datamanagement
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Pranavi verma-class-9-spreadsheet
Pranavi verma-class-9-spreadsheetPranavi verma-class-9-spreadsheet
Pranavi verma-class-9-spreadsheet
 
Sap abap modularization interview questions
Sap abap modularization interview questionsSap abap modularization interview questions
Sap abap modularization interview questions
 
224-2009
224-2009224-2009
224-2009
 
SAP ABAP Practice exam
SAP ABAP Practice examSAP ABAP Practice exam
SAP ABAP Practice exam
 
Sap abap material
Sap abap materialSap abap material
Sap abap material
 
I stata
I stataI stata
I stata
 
Introduction to Stata
Introduction to Stata Introduction to Stata
Introduction to Stata
 
Ds 3
Ds 3Ds 3
Ds 3
 
Oracle Certification 1Z0-1041 Questions and Answers
Oracle Certification 1Z0-1041 Questions and AnswersOracle Certification 1Z0-1041 Questions and Answers
Oracle Certification 1Z0-1041 Questions and Answers
 
ADVANCE ITT BY PRASAD
ADVANCE ITT BY PRASADADVANCE ITT BY PRASAD
ADVANCE ITT BY PRASAD
 
Linkage mapping and QTL analysis_Lab
Linkage mapping and QTL analysis_LabLinkage mapping and QTL analysis_Lab
Linkage mapping and QTL analysis_Lab
 
Dervy bis-155-week-1-quiz-new
Dervy bis-155-week-1-quiz-newDervy bis-155-week-1-quiz-new
Dervy bis-155-week-1-quiz-new
 

Similar to TAO Fayan_ Introduction to WEKA

Introduction to weka
Introduction to wekaIntroduction to weka
Introduction to wekaJK Knowledge
 
Weka term paper(siddharth 10 bm60086)
Weka term paper(siddharth 10 bm60086)Weka term paper(siddharth 10 bm60086)
Weka term paper(siddharth 10 bm60086)Siddharth Verma
 
Data Mining Techniques using WEKA_Saurabh Singh_10BM60082
Data Mining Techniques using WEKA_Saurabh Singh_10BM60082Data Mining Techniques using WEKA_Saurabh Singh_10BM60082
Data Mining Techniques using WEKA_Saurabh Singh_10BM60082Saurabh Singh
 
Programming Without Coding Technology (PWCT) Environment
Programming Without Coding Technology (PWCT) EnvironmentProgramming Without Coding Technology (PWCT) Environment
Programming Without Coding Technology (PWCT) EnvironmentMahmoud Samir Fayed
 
Report: Test49 Geant4 Monte-Carlo Models Testing Tools
Report: Test49 Geant4 Monte-Carlo Models Testing ToolsReport: Test49 Geant4 Monte-Carlo Models Testing Tools
Report: Test49 Geant4 Monte-Carlo Models Testing ToolsRoman Atachiants
 
Test Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful ToolsTest Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful Toolsmcthedog
 
Major AssignmentDue5pm Friday, Week 11. If you unable to submit on.docx
Major AssignmentDue5pm Friday, Week 11. If you unable to submit on.docxMajor AssignmentDue5pm Friday, Week 11. If you unable to submit on.docx
Major AssignmentDue5pm Friday, Week 11. If you unable to submit on.docxinfantsuk
 
Comp 220 ilab 6 of 7
Comp 220 ilab 6 of 7Comp 220 ilab 6 of 7
Comp 220 ilab 6 of 7ashhadiqbal
 
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMSPREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMSSamsung Electronics
 
Predicting performance of classification algorithms
Predicting performance of classification algorithmsPredicting performance of classification algorithms
Predicting performance of classification algorithmsIAEME Publication
 
Data Analysis – Technical learnings
Data Analysis – Technical learningsData Analysis – Technical learnings
Data Analysis – Technical learningsInvenkLearn
 
Tableau Basic Questions
Tableau Basic QuestionsTableau Basic Questions
Tableau Basic QuestionsSooraj Vinodan
 
EN1320 Module 2 Lab 2.1Capturing the Reader’s InterestSelec.docx
EN1320 Module 2 Lab 2.1Capturing the Reader’s InterestSelec.docxEN1320 Module 2 Lab 2.1Capturing the Reader’s InterestSelec.docx
EN1320 Module 2 Lab 2.1Capturing the Reader’s InterestSelec.docxYASHU40
 
Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_SagarSagar Kumar
 
Week 2 Project - STAT 3001Student Name Type your name here.docx
Week 2 Project - STAT 3001Student Name Type your name here.docxWeek 2 Project - STAT 3001Student Name Type your name here.docx
Week 2 Project - STAT 3001Student Name Type your name here.docxcockekeshia
 
lab #6
lab #6lab #6
lab #6butest
 
Cmis 102 hands on/tutorialoutlet
Cmis 102 hands on/tutorialoutletCmis 102 hands on/tutorialoutlet
Cmis 102 hands on/tutorialoutletPoppinss
 
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdfReaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdffashionbigchennai
 

Similar to TAO Fayan_ Introduction to WEKA (20)

Introduction to weka
Introduction to wekaIntroduction to weka
Introduction to weka
 
Itb weka nikhil
Itb weka nikhilItb weka nikhil
Itb weka nikhil
 
Weka term paper(siddharth 10 bm60086)
TAO Fayan_ Introduction to WEKA
  • 1. An Introduction to WEKA
Fayan TAO∗, Computer and Information Science System, Macau University of Science and Technology, fytao2015@gmail.com

ABSTRACT
1 WEKA is a data mining software workbench with wide applications in machine learning. It has an active community and enjoys widespread acceptance in both academia and business. This report provides an introduction to WEKA and demonstrates how to use each of its applications, namely the Explorer, Experimenter, KnowledgeFlow and Simple CLI, based on version 3.6.13. Two classical datasets, the iris data and the weather data, are used in the experiments, and the related output results are shown and analysed in this report.

Keywords
WEKA; data mining; machine learning

1. INTRODUCTION
The WEKA [10][5] workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It was developed at the University of Waikato in New Zealand; the name stands for Waikato Environment for Knowledge Analysis. (Outside the university, the weka, pronounced to rhyme with Mecca, is a flightless bird with an inquisitive nature found only on the islands of New Zealand [8].) The system is written in Java and distributed under the terms of the GNU General Public License. It runs on almost any platform and has been tested under Linux, Windows, and Macintosh operating systems, and even on a personal digital assistant.
It provides a uniform interface to many different learning algorithms, along with methods for pre- and post-processing and for evaluating the results of learning schemes on any given dataset [6].
It contains several standard data mining techniques, including data preprocessing, classification, regression, clustering, and association.

∗Stu ID: 1509853F-II20-0019
1 This report follows ACM TeX format. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235

Figure 1: WEKA interface

2. BACKGROUND AND HISTORY
The WEKA project has been funded by the New Zealand government since 1993 [1][7].
The initial goal at this time was as follows: The programme aims to build a state-of-the-art facility for developing techniques of machine learning and investigating their application in key areas of the New Zealand economy. Specifically we will create a workbench for machine learning, determine the factors that contribute towards its successful application in the agricultural industries, and develop new methods of machine learning and ways of assessing their effectiveness.
In 1996, a mostly C version of WEKA was released, while in 1999 it was redeveloped and released in Java to support platform independence. Today, there are several versions of WEKA available to the public. The GUI version (6.0) is the most recent release. The developer version (3.5.8) allows users to obtain and modify the source code to add content or fix bugs. The book version (3.4.14) is as described in the data mining book by Witten and Frank[9].

3. INTERFACES
WEKA has four interfaces, which are started from the main GUI Chooser window, as shown in figure 1 (this discussion is based on WEKA version 3.6.13). They can handle data preprocessing, classification, regression, clustering, and association [1][2]. The application interfaces are as follows:
• Explorer: An environment for exploring data with WEKA, which contains six main functions, namely Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.
• Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.

  • 2. Figure 2: Explorer interface
Figure 3: Preprocess interface
• KnowledgeFlow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• Simple CLI: An environment providing a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command-line interface.

4. EXPLORER
The Explorer is the main graphical user interface in WEKA; it is shown in figure 2. It has six different panels, accessed by the tabs at the top, which correspond to the various data mining tasks supported ([8][2]). We will discuss these six functions one by one in the following sections.
• Preprocess: Choose and modify the data being acted on.
• Classify: Train and test learning schemes that classify or perform regression.
• Cluster: Learn clusters for the data.
• Associate: Learn association rules for the data.
• Select attributes: Select the most relevant attributes in the data.
• Visualize: View an interactive 2D plot of the data.

4.1 Preprocess
The Preprocess interface can be seen in figure 3. The first four buttons at the top of the Preprocess section are used to load data into WEKA[2]:
• Open file...: Brings up a dialog box allowing us to browse for the data file on the local file system.

Figure 4: Left: Weather data (.csv); Right: Iris data (.arff)
Figure 5: Iris data after preprocessing
• Open URL...: Asks for a Uniform Resource Locator address at which the data is stored.
• Open DB...: Reads data from a database. (Note that to make this work we might have to edit the file weka/experiment/DatabaseUtils.props.)
• Generate...: Enables us to generate artificial data from a variety of DataGenerators.
Using the Open file... button, we can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format.
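As a concrete illustration of the ARFF format, the nominal weather data used throughout this report can be written out as follows. This is a minimal sketch; the attribute declarations and the first few rows follow the standard weather.nominal dataset distributed with WEKA, with the remaining instances omitted for brevity:

```
% weather.nominal.arff: a minimal sketch of the nominal weather data
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
% ... remaining 10 instances omitted ...
```

A CSV file carrying the same data would simply have a header row of attribute names followed by the comma-separated values, with no @relation or @attribute declarations.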
ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
Note: This list of formats can be extended by adding custom file converters to the weka.core.converters package.
Here, we take the weather data and the iris data as examples. Figure 4 shows the weather data in the .csv format and the iris data in the .arff format. Figure 2 and figure 5 show the weather data and the iris data after preprocessing, respectively. We can see that the Preprocess interface displays data information, such as the data's Relation, Instances and Attributes. It also shows statistics, including the Minimum, Maximum, Mean and StdDev values. Additionally, we can discretize the data. As figure 6 shows, the weather data contains 5 sunny days, 4 overcast days and 5 rainy days: all overcast days are good for playing, while not all sunny or rainy days can be regarded as suitable days to play.

4.2 Classification
The Classify interface is shown in figure 7. To analyse the weather data, we have to choose a classifier and test options. In our case, we choose trees.J48 as the classifier. As for the Test options box, there are four test modes[2]:
• Use training set: The classifier is evaluated on how well it predicts the class of the instances it was trained on.
  • 3. Figure 6: Discretize weather data
Figure 7: Classify interface
• Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing us to choose the file to test on.
• Cross-validation: The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field.
• Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
Here, we set Percentage split to 66%, which means that the classifier is trained on 66% of the weather data and evaluated on the remaining 34%. Figure 8 shows the visualized tree, and table 1 shows the classifier output. As we can see, the text in the Classifier output area is split into several sections:
• Run information: A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.
• Classifier model (full training set): A textual representation of the classification model that was produced on the full training data.
• Summary: A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.

Figure 8: Classification visualize tree
Figure 9: Visualize cluster assignments (SimpleKMeans)
Figure 10: Visualize cluster assignments (HierarchicalClusterer)
• Detailed Accuracy By Class: A more detailed per-class breakdown of the classifier's prediction accuracy. Here, the true positives (TP)[8] and true negatives (TN) are correct classifications. A false positive (FP) is when the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). A false negative (FN) is when the outcome is incorrectly predicted as negative when it is actually positive.
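These four counts determine all of the summary metrics WEKA reports. The following small self-contained Python sketch (not WEKA itself) computes the rates from binary confusion-matrix counts; the example counts are those of the class Yes in the 5-instance test split of table 1:

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute standard metrics from binary confusion-matrix counts."""
    tp_rate = tp / (tp + fn)                     # true positive rate (recall)
    fp_rate = fp / (fp + tn)                     # false positive rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall success rate
    return tp_rate, fp_rate, accuracy

# Class "Yes" in the 5-instance test split of table 1:
# 2 true positives, 1 false negative, 2 false positives, 0 true negatives.
tp_rate, fp_rate, accuracy = binary_metrics(tp=2, fn=1, fp=2, tn=0)
print(round(tp_rate, 3), round(fp_rate, 3), round(accuracy, 3))  # → 0.667 1.0 0.4
```

These values match the TP Rate (0.667), FP Rate (1) and overall 40% success rate reported in the classifier output; the error rate is simply 1 - 0.4 = 0.6.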
The true positive rate is TP divided by the total number of positives, which is TP + FN (i.e. TP/(TP + FN)); the false positive rate is FP divided by the total number of negatives, which is FP + TN (i.e. FP/(FP + TN)). The overall success rate is the number of correct classifications divided by the total number of classifications (i.e. (TP + TN)/(TP + TN + FP + FN)). Finally, the error rate is 1 minus this.
• Confusion Matrix: Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.
In table 1, we can see that two no instances are classified as yes, while one yes instance is mistaken for no.

4.3 Clustering
WEKA contains clusterers for finding groups of similar instances in a dataset. There are different kinds of implemented schemes, such as k-means, EM, Cobweb, X-means and FarthestFirst. Besides, clusters can be visualized and compared to "true" clusters (if given). If the clustering scheme produces a probability distribution, the evaluation will be based on log-likelihood.
In this report, we choose three different clusterers, SimpleKMeans, HierarchicalClusterer and EM, to analyse the weather data. Figure 9, figure 10 and figure 11 show the visualized cluster assignments produced by SimpleKMeans, HierarchicalClusterer and EM, respectively.
Table 2 shows the cluster output of SimpleKMeans. We can see that there are 14 instances in total. They are clustered into
  • 4. === Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weatherData
Instances: 14
Attributes: 5 (outlook, temperature, humidity, windy, play)
Test mode: split 66.0% train, remainder test

=== Classifier model (full training set) ===
J48 pruned tree
------------------
outlook = Rainy
|   humidity = High: No (3.0)
|   humidity = Normal: Yes (2.0)
outlook = Overcast: Yes (4.0)
outlook = Sunny
|   windy = False: Yes (3.0)
|   windy = True: No (2.0)

Number of Leaves: 5
Size of the tree: 8
Time taken to build model: 0.03 seconds

=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances      2    40%
Incorrectly Classified Instances    3    60%
Kappa statistic                    -0.3636
Mean absolute error                 0.6
Root mean squared error             0.7746
Relative absolute error           126.9231%
Root relative squared error       157.6801%
Total Number of Instances           5

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0        0.333    0          0       0          0.333     No
0.667    1        0.5        0.667   0.571      0.333     Yes
Weighted Avg.  0.4  0.733  0.3  0.4  0.343  0.333

=== Confusion Matrix ===
a b  <-- classified as
0 2 | a = No
1 2 | b = Yes

Table 1: Classifier output (J48)
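The pruned tree printed in table 1 is small enough to transcribe directly. The following Python sketch encodes the same decision rules by hand, which makes it easy to check how any weather instance would be classified (note that temperature does not appear: J48 pruned it away):

```python
def j48_weather(outlook, humidity, windy):
    """Hand-transcription of the J48 pruned tree from table 1 (a sketch)."""
    if outlook == "Rainy":
        # Rainy days split on humidity.
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        # All overcast days are classified as play.
        return "Yes"
    # Remaining branch: outlook == "Sunny", split on windy.
    return "Yes" if windy == "False" else "No"

print(j48_weather("Overcast", "High", "True"))  # → Yes
print(j48_weather("Rainy", "High", "False"))    # → No
```

Transcribing a learned tree like this is only practical for toy models, but it illustrates that the textual model in the Classifier output area is a complete, executable description of the classifier.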
  • 5. Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: weather.symbolic
Instances: 14
Attributes: 5 (outlook, temperature, humidity, windy; Ignored: play)
Test mode: Classes to clusters evaluation on training data

=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 21.000000000000004
Missing values globally replaced with mean/mode

Cluster centroids:
Attribute    Full Data (14)  Cluster 0 (10)  Cluster 1 (4)
outlook      sunny           sunny           overcast
temperature  mild            mild            cool
humidity     high            high            normal
windy        FALSE           FALSE           TRUE

Time taken to build model (full training data): 0 seconds

=== Model and evaluation on training set ===
Clustered Instances
0   10 (71%)
1    4 (29%)

Class attribute: play
Classes to Clusters:
 0  1  <-- assigned to cluster
 6  3 | yes
 4  1 | no
Cluster 0 <-- yes
Cluster 1 <-- no
Incorrectly clustered instances: 7.0   50%

Table 2: Cluster output (Kmeans)
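SimpleKMeans follows the standard k-means loop: assign each instance to its nearest centroid, recompute each centroid as the mean of its assigned instances, and repeat until the centroids stop moving. A minimal one-dimensional Python sketch of that loop, on toy numeric data with fixed initial centroids (this is an illustration of the algorithm, not WEKA's implementation, which also handles nominal attributes and missing values):

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain k-means on 1-D data: alternate assignment and update steps."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps a divisor of 1 to avoid division by zero).
        new_centroids = [
            sum(p for p, l in zip(points, labels) if l == c)
            / max(1, sum(1 for l in labels if l == c))
            for c in range(len(centroids))
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans_1d([1.0, 2.0, 10.0, 11.0], centroids=[1.0, 11.0])
print(labels, centroids)  # → [0, 0, 1, 1] [1.5, 10.5]
```

The "Within cluster sum of squared errors" figure in table 2 is the quantity this loop monotonically decreases, and "Number of iterations: 4" counts the assignment/update rounds until convergence.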
  • 6. === Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weatherData
Instances: 14
Attributes: 5 (outlook, temperature, humidity, windy, play)

=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:
 1. outlook=Overcast 4 ==> play=Yes 4                 conf:(1)
 2. temperature=Cool 4 ==> humidity=Normal 4          conf:(1)
 3. humidity=Normal windy=False 4 ==> play=Yes 4      conf:(1)
 4. outlook=Rainy play=No 3 ==> humidity=High 3       conf:(1)
 5. outlook=Rainy humidity=High 3 ==> play=No 3       conf:(1)
 6. outlook=Sunny play=Yes 3 ==> windy=False 3        conf:(1)
 7. outlook=Sunny windy=False 3 ==> play=Yes 3        conf:(1)
 8. temperature=Cool play=Yes 3 ==> humidity=Normal 3 conf:(1)
 9. outlook=Rainy temperature=Hot 2 ==> humidity=High 2 conf:(1)
10. temperature=Hot play=No 2 ==> outlook=Rainy 2     conf:(1)

Table 3: Associator output (Apriori)
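The two measures behind this output are easy to state: the support of a rule X ==> Y is the number (or fraction) of instances containing all items of X and Y together, and its confidence is support(X and Y) divided by support(X). A small self-contained Python sketch over a hypothetical list of attribute=value transactions (toy data for illustration, not the 14 weather instances themselves):

```python
def support(transactions, itemset):
    """Count transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs ==> rhs: support(lhs and rhs) / support(lhs)."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

# Hypothetical transactions: one set of attribute=value items per instance.
data = [
    {"outlook=overcast", "play=yes"},
    {"outlook=overcast", "play=yes"},
    {"outlook=sunny", "play=no"},
    {"outlook=sunny", "play=yes"},
]
print(confidence(data, {"outlook=overcast"}, {"play=yes"}))  # → 1.0
print(confidence(data, {"outlook=sunny"}, {"play=yes"}))     # → 0.5
```

Read against table 3: rule 1 there has support 4 (the four overcast instances) and confidence 1, since all four of them also have play=Yes; Apriori simply enumerates all rules whose support and confidence clear the configured minimums (0.15 and 0.9 here).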
  • 7. Figure 11: Visualize cluster assignments (EM)
Figure 12: Associate interface
two groups, 0 and 1. Group 0 represents play, while group 1 means non-play. They contain 10 instances and 4 instances, respectively. According to the Classes to Clusters section of the output, 3 play instances are mistaken as non-play and 4 non-play instances are misclassified as play. So, in fact, there are 9 play instances and 5 non-play instances in the real data.

4.4 Association
WEKA contains an implementation of the Apriori algorithm for learning association rules. Apriori can compute all rules that have a given minimum support and exceed a given confidence, but it works only with discrete data. It can identify statistical dependencies between groups of attributes[3].
Figure 12 shows the interface of the Associate application. Table 3 shows the output produced by Apriori. We can read off the best rules from the Best rules found section. For example, when the outlook is overcast, it is good for playing; while on a rainy day on which there is no play, the humidity is high.

4.5 Select attributes
Attribute selection[2] involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. Search methods include best-first, forward selection, random, exhaustive, genetic algorithm and ranking. Evaluation methods include correlation-based, wrapper, information gain, chi-squared and so on.
The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed. WEKA allows (almost) arbitrary combinations of these two, so it is very flexible.
Figure 13 and figure 14 demonstrate attribute selection outputs obtained with (CfsSubsetEval + Best first) and (ChiSquaredAttributeEval + Ranker), respectively.

Figure 13: Attribute select output (CfsSubsetEval + Best first)
Figure 14: Attribute select output (ChiSquaredAttributeEval + Ranker)
The former selects outlook and humidity as the best subset for prediction. The latter ranks the attributes in descending order: outlook, humidity, windy and temperature. Therefore, we can see that outlook and humidity play an important role in determining whether it is good for playing or not.

4.6 Visualization
Visualization is very useful in practice; for example, it helps to determine where the difficulty lies in a learning problem. WEKA can visualize single attributes and pairs of attributes, as shown in figure 15.

5. EXPERIMENTER
The Experimenter[8] enables us to set up large-scale experiments, start them running, leave them, come back when they have finished, and then analyze the performance statistics that have been collected, automating the experimental process. The Experimenter makes it easy to compare the performance of different learning schemes. The statistics can be stored in ARFF format, and can themselves be the subject of further data mining. Figure 16, figure 17 and figure 19 show the three main parts of the Experimenter: we have to set up the experiment first, then run it, and finally analyse the results.
To analyze the experiment that has been performed in this section, click the Experiment button at the top right; otherwise, supply a file that contains the results of another experiment. Then click Perform test (near the bottom left).

Figure 15: Visualization of weather data
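Single-attribute rankers such as the ones in section 4.5 score each attribute independently; information gain, for example, measures the reduction in class entropy obtained by splitting on the attribute. A self-contained Python sketch of that score on toy data (an illustration of the measure, not WEKA's InfoGainAttributeEval):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on attribute values."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

# Toy data: the attribute perfectly predicts the class, so the gain
# equals the full class entropy (1 bit for a balanced binary class).
print(info_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # → 1.0
```

Ranking the weather attributes by such a score is what produces the ordered list (outlook first, temperature last) shown by the Ranker output in figure 14.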
  • 8. Figure 16: An experiment: setting it up
Figure 17: An experiment: run
Figure 18: Experiment output result
Figure 19: Statistical test results for the experiment
Figure 20: KnowledgeFlow interface
The result of a statistical significance test of the performance of the first learning scheme (J48) versus the other two (OneR and ZeroR) is displayed in the large panel on the right, just as figure 18 shows.
We are comparing the percent correct statistic: this is selected by default as the comparison field, shown toward the left in figure 19. The three methods are displayed horizontally, numbered (1), (2) and (3), as the heading of a little table. The labels for the columns are repeated at the bottom (trees.J48, rules.OneR, and rules.ZeroR) in case there is insufficient space for them in the heading. The inscrutable integers beside the scheme names identify which version of the scheme is being used. They are present by default to avoid confusion among results generated using different versions of the algorithms. The value in brackets at the beginning of the iris row (100) is the number of experimental runs: 10 times tenfold cross-validation. The percentage correct for the three schemes is shown in figure 19: 94.73% for method 1, 92.53% for method 2, and 33.33% for method 3. The symbol placed beside a result indicates that it is statistically better (v) or worse (*) than the baseline scheme, in this case J48, at the specified significance level (0.05, or 5%). The corrected resampled t-test[8] is used here. As shown, method 3 is significantly worse than method 1 because its success rate is followed by an asterisk. At the bottom of columns 2 and 3 are counts (x/y/z) of the number of times the scheme was better than (x), the same as (y), or worse than (z) the baseline scheme on the datasets used in the experiment. In this case there is only one dataset; method 2 was equivalent to method 1 (the baseline) once, and method 3 was worse than it once.
(The annotation (v/ /*) is placed at the bottom of column 1 to help you remember the meanings of the three counts (x/y/z).)

6. KNOWLEDGE FLOW
The KnowledgeFlow[2] provides an alternative to the Explorer as a graphical front end to WEKA's core algorithms. The interface of the KnowledgeFlow is shown in figure 20. The KnowledgeFlow is a work in progress, so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the KnowledgeFlow but not in the Explorer.
The KnowledgeFlow presents a data-flow inspired interface to WEKA. The user can select WEKA components from a tool bar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing data. At present, all of WEKA's classifiers, filters, clusterers, loaders and savers are available in the KnowledgeFlow, along with some extra tools.
Figure 21 shows the J48 operational mechanism in the KnowledgeFlow application. Figure 22 reveals the related result,
  • 9. Figure 21: KnowledgeFlow (J48)
Figure 22: J48 result of KnowledgeFlow
which is the same as table 1 shows.

7. SIMPLE CLI
Lurking behind WEKA's interactive interfaces (the Explorer, the KnowledgeFlow, and the Experimenter) lies its basic functionality, which can be accessed more directly through a command-line interface: the Simple CLI. Its interface is shown in figure 23. It has a plain textual panel with a line at the bottom on which we enter commands. For example, when we type "java weka.associations.Apriori -t data/weather.nominal.arff" into the textual panel, the result shown in figure 24 is produced, which is the same as that of table 3.

8. SUMMARY
WEKA has proved itself to be a useful and even essential tool in the analysis of real-world data sets. It reduces the level of complexity involved in getting real-world data into a variety of machine learning schemes and evaluating the output of those schemes. It has also provided a flexible aid for machine learning research and a tool for introducing people to machine learning in an educational environment[4].

9. ACKNOWLEDGMENT
I wish to thank Prof. Yong LIANG for his patient teaching in class and vital suggestions on this report.

10. REFERENCES
Figure 23: Simple CLI interface
Figure 24: Apriori result shown in Simple CLI interface
[1] D. Baumgartner and G. Serpen. (2009). Large Experiment and Evaluation Tool for WEKA Classifiers. DMIN. pp: 340-346.
[2] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald and D. Scuse. (2015). WEKA Manual for Version 3-6-13. University of Waikato, Hamilton, New Zealand. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[3] E. Frank. Machine Learning with WEKA [PowerPoint slides]. University of Waikato, Hamilton, New Zealand. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/documentation.html.
[4] S. R. Garner. (1995). WEKA: The Waikato Environment for Knowledge Analysis. Proceedings of the New Zealand Computer Science Research Students Conference. pp: 57-64.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1). pp: 10-18.
[6] O. Maimon and L. Rokach. (2005). Data Mining and Knowledge Discovery Handbook. (Vol. 2). New York: Springer.
[7] B. Pfahringer. (2007). WEKA: A tool for exploratory data mining [PowerPoint slides]. University of Waikato, New Zealand. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/index.html.
[8] I. H. Witten, E. Frank and M. A. Hall. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed., Morgan Kaufmann, San Francisco.
[9] I. H. Witten and E. Frank. (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed., Morgan Kaufmann, San Francisco.
[10] WEKA: The University of Waikato. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/index.html.