Data Mining using WEKA tool
Dr. S KANNIMUTHU,
Professor,
Head-Center of Excellence in Algorithms,
Department of CSE,
Karpagam College of Engineering,
Coimbatore.
1 10/30/2022
WORKSHOP ON DATA MINING
USING WEKA AND PYTHON
AGENDA
 Introduction to Data Mining and Tools
 Introduction to WEKA Tool
 WEKA-Explorer
 WEKA-Experimenter
 WEKA-Knowledgeflow
 Classification Analysis
 Predictive Analysis
 Cluster Analysis
 Hands-on
10/30/2022
2
Data Mining
 Data mining is the exploration and analysis of
large quantities of data in order to discover
valid, novel, potentially useful, and ultimately
understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and
comprehend the patterns.
10/30/2022
3
What are data mining tools?
 Data mining tools are software components and
theories that allow users to extract information
from data.
 The tools provide individuals and companies with
the ability to gather large amounts of data and
use it to make determinations about a particular
user or groups of users.
 Some of the most common uses of data mining
tools are in the fields of marketing, healthcare
system, fraud protections and surveillance.
10/30/2022
4
WEKA: the software
5
 Machine learning/data mining software written in
Java (distributed under the GNU Public License)
 Used for research, education, and applications
 Main features:
 Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
 Graphical user interfaces (incl. data visualization)
 Environment for comparing learning algorithms
10/30/2022
WEKA only deals with “flat” files
6
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
10/30/2022
WEKA only deals with “flat” files
7
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
10/30/2022
8
java weka.gui.GUIChooser
10/30/2022
9
java -jar weka.jar
10/30/2022
Explorer: pre-processing the data
10
 Data can be imported from a file in various
formats: ARFF, CSV, C4.5, binary
 Pre-processing tools in WEKA are called “filters”
 WEKA contains filters for:
 Discretization, normalization, resampling, attribute
selection, transforming and combining attributes, …
10/30/2022
11 10/30/2022
12 10/30/2022
10/30/2022
13
14 10/30/2022
15 10/30/2022
16 10/30/2022
10/30/2022
17
18 10/30/2022
Explorer: building “classifiers”
19
 Classifiers in WEKA are models for predicting
nominal or numeric quantities
 Implemented learning schemes include:
 Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …
 “Meta”-classifiers include:
 Bagging, boosting, stacking, error-correcting output
codes, locally weighted learning, …
10/30/2022
10/30/2022
20
10/30/2022
21
10/30/2022
22
10/30/2022
23
10/30/2022
24
10/30/2022
25
10/30/2022
26
10/30/2022
27
10/30/2022
28
10/30/2022
29
10/30/2022
30
10/30/2022
31
10/30/2022
32
10/30/2022
33
10/30/2022
34
10/30/2022
35
10/30/2022
36
10/30/2022
37
10/30/2022
38
10/30/2022
39
40 10/30/2022
10/30/2022
41
10/30/2022
42
10/30/2022
43
10/30/2022
44
10/30/2022
45
10/30/2022
46
10/30/2022
47
10/30/2022
48
10/30/2022
49
10/30/2022
50
10/30/2022
51
10/30/2022
52
10/30/2022
53
10/30/2022
54
10/30/2022
55
10/30/2022
56
QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture.
10/30/2022
57
QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture.
10/30/2022
58
10/30/2022
59
60 10/30/2022
10/30/2022
61
10/30/2022
62
Quick T
ime™an
d aT
IFF(L
ZW)de
c o
mpre
s sorar
e n
ee
de
dtos e
e t
his p
ic t
ure
.
10/30/2022
63
10/30/2022
64
10/30/2022
65
10/30/2022
66
10/30/2022
67
10/30/2022
68
10/30/2022
69
10/30/2022
70
10/30/2022
71
10/30/2022
72
10/30/2022
73
74 10/30/2022
75 10/30/2022
10/30/2022
76
Explorer: clustering data
77
 WEKA contains “clusterers” for finding groups of
similar instances in a dataset
 Some implemented schemes are:
 k-Means, EM, Cobweb, X-means, FarthestFirst
 Clusters can be visualized and compared to “true”
clusters (if given)
10/30/2022
10/30/2022
78
10/30/2022
79
10/30/2022
80
10/30/2022
81
10/30/2022
82
10/30/2022
83
10/30/2022
84
10/30/2022
85
10/30/2022
86
Explorer: finding associations
87
 WEKA contains the Apriori algorithm (among
others) for learning association rules
 Works only with discrete data
 Can identify statistical dependencies between
groups of attributes:
 milk, butter  bread, eggs (with confidence 0.9 and
support 2000)
 Apriori can compute all rules that have a given
minimum support and exceed a given confidence
10/30/2022
10/30/2022
88
10/30/2022
89
10/30/2022
90
Explorer: attribute selection
91
 Panel that can be used to investigate which
(subsets of) attributes are the most predictive
ones
 Attribute selection methods contain two parts:
 A search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
 An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
 Very flexible: WEKA allows (almost) arbitrary
combinations of these two
10/30/2022
10/30/2022
92
10/30/2022
93
10/30/2022
94
10/30/2022
95
10/30/2022
96
10/30/2022
97
Explorer: data visualization
98
 Visualization very useful in practice: e.g. helps to
determine difficulty of the learning problem
 WEKA can visualize single attributes (1-d) and
pairs of attributes (2-d)
 Color-coded class values
 “Jitter” option to deal with nominal attributes (and
to detect “hidden” data points)
 “Zoom-in” function
10/30/2022
10/30/2022
99
10/30/2022
100
10/30/2022
101
Performing experiments
102
 Experimenter makes it easy to compare the
performance of different learning schemes
 For classification and regression problems
 Results can be written into file or database
10/30/2022
10/30/2022
103
10/30/2022
104
10/30/2022
105
106 10/30/2022
10/30/2022
107
10/30/2022
108
10/30/2022
109
10/30/2022
110
The Knowledge Flow GUI
111
 Java-Beans-based interface for setting up and
running machine learning experiments
 Data sources, classifiers, etc. are beans and can
be connected graphically
 Data “flows” through components: e.g.,
“data source” -> “filter” -> “classifier” ->
“evaluator”
 Layouts can be saved and loaded again later
10/30/2022
112 10/30/2022
113 10/30/2022
Conclusion:
114
 WEKA is available at
http://www.cs.waikato.ac.nz/ml/weka
 Also has a list of projects based on WEKA
 (probably incomplete list of) WEKA contributors:
Abdelaziz Mahoui, Alexander K. Seewald, Ashraf M. Kibriya,
Bernhard Pfahringer, Brent Martin, Peter Flach, Eibe Frank, Gabi
Schmidberger, Ian H. Witten, J. Lindgren, Janice Boughton,
Jason Wells, Len Trigg, Lucio de Souza Coelho, Malcolm Ware,
Mark Hall, Remco Bouckaert, Richard Kirkby, Shane Butler,
Shane Legg, Stuart Inglis, Sylvain Roy, Tony Voyle, Xin Xu, Yong
Wang, Zhihai Wang
10/30/2022
Reference
 Ian H.Witten & Eibe Frank, “Data mining-Practical
Machine Learning Tools and Techniques”,Elsevier
publication, Second Edition
 Weka Manual
10/30/2022
115
Contact
 WhatsApp: 8838941811
 Email: drkannimuthu@gmail.com
 Linkedin: https://www.linkedin.com/in/kannimuthu-
subramanian-292924a/
 Google Scholar:
https://scholar.google.co.in/citations?user=eSdX5
S0AAAAJ&hl=en
 Publon:
https://publons.com/researcher/1686169/kannimu
thu-subramanian/
 Blog: www.skannimuthu.blogspot.in
 Github: www.github.com/kannimuthu 10/30/2022
116
THANK YOU
10/30/2022
117

Weka-Presentation.ppt