a) Installation of the WEKA tool; understanding the features of
the WEKA toolkit; exploring the available data sets.
b) Demonstration of preprocessing on an .arff file, using the
student data .arff file.
An Introduction to WEKA
Content
 What is WEKA?
 The Explorer:
 Preprocess data
 Classification
 Clustering
 Association Rules
 Attribute Selection
 Data Visualization
 References and Resources
What is WEKA?
 Waikato Environment for Knowledge Analysis
 A data mining/machine learning tool developed by the
Department of Computer Science, University of
Waikato, New Zealand.
 Data mining software written in Java
 Distributed under the GNU General Public License
 Used for research, education, and applications
 Main features:
 Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
 Multiple interfaces that do not require programming
 The Experimenter can compare learning algorithms
 The weka is also a bird found only on the islands of New
Zealand
Download and Install WEKA
 Website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
 Supports multiple platforms (written in Java):
 Windows, Mac OS X and Linux
Why Weka?
 There are many data mining suites and toolkits to
choose from
 Weka has the following advantages:
 Easy to learn
 Free and open-source
 Easy to download, and runs on many platforms
 Does not require programming knowledge
 This is important, since some students may not have much
programming experience
Main Features
 49 data preprocessing tools
 76 classification/regression algorithms
 8 clustering algorithms
 3 algorithms for finding association rules
 15 attribute/subset evaluators + 10 search
algorithms for feature selection
The Weka Interfaces
 Explorer:
 Covered here in detail
 Lets you apply a sequence of actions, but does not explicitly
model the workflow
 Experimenter:
 Runs many experiments in a controlled manner and compares the
results
 KnowledgeFlow:
 Like the Explorer, but each step is represented as a node in a
graph, so the flow is explicitly represented
 Workbench:
 Combines all the GUI interfaces into one
 Simple CLI (command line interface):
 No GUI, so shell scripts can be used to control experiments
(see the example below)
 Similar to using Python, where each command is a function call
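
As a rough illustration of command-line usage (a sketch, not from the original slides: it assumes weka.jar is on the classpath and that student.arff is a dataset of your own), training a C4.5-style J48 decision tree from the shell looks like this; with only a training file given, Weka should report evaluation statistics, by default from 10-fold cross-validation:

    java -cp weka.jar weka.classifiers.trees.J48 -t student.arff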
Main GUI
 Three graphical user interfaces
 “The Explorer” (exploratory data
analysis)
 “The Experimenter” (experimental
environment)
 “The KnowledgeFlow” (new process
model inspired interface)
Content
 What is WEKA?
 The Explorer:
 Preprocess data
 Classification
 Clustering
 Association Rules
 Attribute Selection
 Data Visualization
 References and Resources
Explorer: pre-processing the data
 Data can be imported from a file in various
formats: ARFF, CSV, C4.5, binary
 Data can also be read from a URL or from an
SQL database (using JDBC)
 Pre-processing tools in WEKA are called “filters”
 WEKA contains filters for:
 Discretization, normalization, resampling, attribute
selection, transforming and combining attributes, …
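
The same loading step can be done programmatically through the Weka Java API. A minimal sketch (the file name student.arff is a placeholder; DataSource also accepts CSV files and URLs):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadData {
        public static void main(String[] args) throws Exception {
            // DataSource picks the right converter from the file extension
            Instances data = DataSource.read("student.arff");
            // By convention the last attribute is used as the class
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println("Loaded " + data.numInstances() + " instances with "
                    + data.numAttributes() + " attributes");
        }
    }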
Weka ARFF File Format
 ARFF = Attribute-Relation File Format
 Weka ARFF files include two main parts:
 The specification of the features
 The actual data
 Files are "flat files", usually comma separated
 Other tools use a separate file for each part
 The C4.5 decision tree tool, which was once very
popular, had a "names" file and a "data" file
 I find the single-file format odd, since the specification is
short but the actual data can be millions of lines
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
WEKA only deals with “flat” files
@relation heart-disease-simplified            % this line just defines the dataset name
@attribute age numeric
@attribute sex { female, male}                % a categorical feature must list its values
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
WEKA only deals with “flat” files
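
For part (b) of the lab, a small student-data ARFF file is needed. A hypothetical student.arff (all attribute names and values below are illustrative only, not a standard dataset) could look like this:

    @relation student

    @attribute rollno numeric
    @attribute name string
    @attribute marks numeric
    @attribute grade { A, B, C, D }

    @data
    1, 'Asha', 85, A
    2, 'Ravi', 72, B
    3, 'John', ?, C

Note the '?' marking a missing value, just as in the heart-disease example above.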
Preprocessing Filters
 Pre-processing tools in WEKA are called "filters"
 Many preprocessing filters are available:
 Discretization
 Normalization
 Resampling
 Feature selection
 Feature transformation
 Many more
 Know what is available so you can use it when
needed
 We will focus on Discretization
(Screenshot caption) Clicking the "Choose" button brings up a list
of all of the filters, organized into a hierarchy. Click on each
folder to expand the list. There are dozens of choices.
Discretization
 Discretization: numerical feature → categorical
 In the Preprocess tab select "Choose", expand
"unsupervised" then "attribute"
 Select the "Discretize" filter
 Note: Discretize applies to attributes, not instances
 The following few slides are from an earlier version and may
look just a little bit different
 Note: there is a version under the "supervised" folder, but it
works differently and has different options
Discretization
Age (before discretization):
1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization (attribute Age split into four bins):
Child:  1, 5, 4, 9, 7
Young:  11, 14, 17, 13, 18, 19
Mature: 31, 33, 36, 42, 44, 46
Old:    70, 74, 77, 78
(Screenshot caption) The text field next to "Choose" shows the
equivalent command line with its options. Click within this region
to bring up a window with the options and more information.
(Screenshot caption) In the filter's options window: 1) toggle
useEqualFrequency to "True"; together with bins = 10 this creates
10 equal-frequency bins; then 2) click OK. (Another control,
indicated in the screenshot, opens the documentation.) The filter
will apply to all numerical features covered by the
"attributeIndices" value.
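
The same discretization can be scripted with the Weka Java API. A minimal sketch mirroring the GUI steps above (student.arff is a placeholder file name; "first-last" covers every attribute, and Discretize only touches the numeric ones):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class DiscretizeDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("student.arff");  // placeholder dataset
            Discretize filter = new Discretize();
            filter.setBins(10);                  // 10 bins, as in the screenshot
            filter.setUseEqualFrequency(true);   // equal-frequency instead of equal-width
            filter.setAttributeIndices("first-last"); // consider every attribute
            filter.setInputFormat(data);         // must be called before filtering
            Instances discretized = Filter.useFilter(data, filter);
            System.out.println(discretized.toSummaryString());
        }
    }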
Explorer: building “classifiers”
 Classifiers in WEKA are models for predicting
nominal or numeric quantities
 Implemented learning schemes include:
 Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …
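
As a hedged sketch of building one such classifier through the Java API (J48 is Weka's C4.5-style decision tree learner; student.arff is a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Demo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("student.arff"); // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);     // last attribute = class
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);                         // prints the learned tree
            // Estimate accuracy with 10-fold cross-validation
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }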
Decision Tree Induction: Training Dataset
(This follows Quinlan's ID3 "Playing Tennis" example.)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

                    age?
         /           |            \
      <=30        31..40          >40
        |            |              |
    student?        yes      credit rating?
     /     \                  /          \
    no     yes          excellent       fair
    |       |                |            |
    no     yes               no          yes
Explorer: finding associations
 WEKA contains an implementation of the Apriori
algorithm for learning association rules
 Works only with discrete data
 Can identify statistical dependencies between
groups of attributes:
 milk, butter → bread, eggs (with confidence 0.9 and
support 2000)
 Apriori can compute all rules that have a given
minimum support and exceed a given confidence
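
A minimal sketch of running Apriori via the Java API (market.arff is a hypothetical file of nominal attributes; as noted above, Apriori works only with discrete data, so numeric attributes must be discretized first):

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("market.arff"); // placeholder dataset
            Apriori apriori = new Apriori();
            apriori.setLowerBoundMinSupport(0.5); // minimum support = 50%
            apriori.setMinMetric(0.5);            // minimum confidence = 50%
            apriori.buildAssociations(data);
            System.out.println(apriori);          // prints the discovered rules
        }
    }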
Basic Concepts: Frequent Patterns
 Itemset: a set of one or more items
 k-itemset: X = {x1, …, xk}
 (Absolute) support, or support count, of X: the frequency of
occurrence of itemset X
 (Relative) support, s: the fraction of transactions that
contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X's support is no less than a
minsup threshold

(Figure: Venn diagram of customers who buy beer, customers who
buy diapers, and customers who buy both)

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk
Basic Concepts: Association Rules
 Find all the rules X → Y with minimum support and
confidence
 Support, s: the probability that a transaction
contains X ∪ Y
 Confidence, c: the conditional probability that a
transaction containing X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3,
{Beer, Diaper}:3

(Figure: Venn diagram of customers who buy beer, customers who
buy diapers, and customers who buy both)

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

 Association rules (many more exist!):
 Beer → Diaper (support 60%, confidence 100%): {Beer, Diaper}
appears in 3 of 5 transactions, and all 3 transactions that
contain Beer also contain Diaper (3/3)
 Diaper → Beer (support 60%, confidence 75%): 3 of the 4
transactions that contain Diaper also contain Beer (3/4)
Explorer: attribute selection
 Panel that can be used to investigate which
(subsets of) attributes are the most predictive
ones
 Attribute selection methods contain two parts:
 A search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
 An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
 Very flexible: WEKA allows (almost) arbitrary
combinations of these two
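
A minimal sketch of one such combination via the Java API, pairing a correlation-based evaluator with best-first search (student.arff is a placeholder):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttSelDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("student.arff"); // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);
            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new CfsSubsetEval()); // correlation-based evaluation method
            sel.setSearch(new BestFirst());        // best-first search method
            sel.SelectAttributes(data);            // capital S: Weka's API naming
            int[] chosen = sel.selectedAttributes();
            System.out.println("Selected attribute indices (0-based, class included): "
                    + java.util.Arrays.toString(chosen));
        }
    }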
Explorer: data visualization
 Visualization very useful in practice: e.g. helps to
determine difficulty of the learning problem
 WEKA can visualize single attributes (1-d) and
pairs of attributes (2-d)
 To do: rotating 3-d visualizations (Xgobi-style)
 Color-coded class values
 “Jitter” option to deal with nominal attributes (and
to detect “hidden” data points)
 “Zoom-in” function
References and Resources
 References:
 WEKA website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
 WEKA tutorials:
 Machine Learning with WEKA: a presentation demonstrating all
graphical user interfaces (GUIs) in Weka
 A presentation which explains how to use Weka for exploratory
data mining
 WEKA data mining book:
 Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques (Second Edition)
 WEKA Wiki:
http://weka.sourceforge.net/wiki/index.php/Main_Page
 Others:
 Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, 2nd ed.