A lot of people talk about Data Mining, Machine Learning and Big Data. It clearly must be important, right?
A lot of people are also trying to sell you snake oil - sometimes half-arsed and overpriced products or solutions promising a world of insight into your customers or users if you hand over your data to them. Instead, trying to understand your own data and what you could do with it should be the first thing you look at.
In this talk, we'll introduce some basic terminology around Data and Text Mining as well as Machine Learning, and we'll have a look at what you can do on your own to understand more about your data and discover patterns in it.
2. Me
Web/Mobile Developer since the late 1990s
Interested in: Java & JVM, CFML, Functional Programming, Go, Android, Data Science
And this is my view of the world…
4. Agenda
1. What is Data Mining?
2. Concepts and Terminology
3. Weka
4. Algorithms
5. Dealing with Text
6. Java integration
9. Fundamentals
Why do we have SO MUCH data nowadays?
Reasons include:
- Cheap storage and better processing power
- Legal & Business requirements
- Digital hoarding
10. Fundamentals
Data Mining is all about going from data to useful and meaningful information.
- Recommendations in online shops
- Finding an “optimal” partner
- Weather prediction
- Judgement decisions (credit applications)
12. A better definition
“Data Mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, often an economic one.”
(Prof. Dr. Ian Witten)
15. Finding and applying rules
Age == young && Astigmatism == no → soft
16. A Result: Decision lists
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
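A decision list is read top to bottom, and the first matching rule wins. As a minimal plain-Java sketch (representing attribute values as strings is an assumption made for illustration), the list above is just an ordered if/else chain:

// Hypothetical translation of the decision list above;
// rules are tried in order, the first match wins.
static String play(String outlook, String humidity, boolean windy) {
    if (outlook.equals("sunny") && humidity.equals("high")) return "no";
    if (outlook.equals("rainy") && windy) return "no";
    if (outlook.equals("overcast")) return "yes";
    if (humidity.equals("normal")) return "yes";
    return "yes"; // the default rule: "if none of the above"
}

Order matters here: moving the humidity rule above the sunny rule would change predictions, which is exactly what separates a decision list from an unordered rule set.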
17. Not all rules are equal
Classification rules: predict an outcome
Association rules: rules that strongly associate
different attribute values
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
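Jumping ahead a little to Weka (introduced later in this deck), association rules like the ones above can be mined with Apriori. A minimal sketch, assuming the weather.nominal.arff sample file that ships with Weka (the path is an assumption):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        Apriori apriori = new Apriori();
        apriori.setNumRules(10);         // report the ten strongest associations
        apriori.buildAssociations(data); // note: no class attribute needed
        System.out.println(apriori);
    }
}

Unlike classification, association mining has no designated outcome attribute, which is why no class index is set.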
19. Learning
What is Learning? And what is Machine Learning?
A good working definition is:
“Things learn when they change their
behaviour in a way that makes them perform
better in the future”
21. Some basic terminology
The thing to be learned is the concept.
The output of a learning scheme is the
concept description.
Classification learning is sometimes called
supervised learning. The outcome is the
class.
Examples are called instances.
24. Some more basic terminology
Discrete attribute values are usually called nominal values; continuous attribute values are simply called numeric values.
Algorithms used to process data and find patterns are often called classifiers. There are lots of them, and all of them can be heavily configured.
27. What is Weka?
Waikato Environment for Knowledge Analysis
Developed by a group in the Dept. of Computer
Science at the University of Waikato in New
Zealand.
Also, a weka is a bird found only in New Zealand.
28. What is Weka?
Download for Mac OS X, Linux and Windows:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Weka is written in Java, comes either as a native application or as an executable .jar file, and is licensed under GPL v3.
29. Getting data into Weka
Easiest and most common for experimenting: .arff
Also supported: CSV, JSON, XML, JDBC
connections etc.
Filters in Weka can then be used to preprocess data.
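As a minimal sketch of preprocessing with a filter (the Remove filter and the sample file ship with Weka; the path is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        Remove remove = new Remove();
        remove.setAttributeIndices("1"); // drop the first attribute (1-based index)
        remove.setInputFormat(data);     // must be called before filtering
        Instances filtered = Filter.useFilter(data, remove);
        System.out.println(filtered.numAttributes() + " attributes remain");
    }
}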
30. Features
50+ Preprocessing tools
75+ Classification/Regression algorithms
~10 clustering algorithms
… and a package manager to load and install more if you want.
32. Classifiers
There are literally hundreds with lots of tuning
options.
Main Categories:
- Rule-based (ZeroR, OneR, PART etc.)
- Tree-based (J48, J48graft, CART etc.)
- Bayes-based (NaiveBayes etc.)
- Functions-based (LR, Logistic etc.)
- Lazy (IB1, IBk etc.)
33. OneR
A very simple classifier, based on a single attribute.
For each attribute,
  For each value of that attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute value.
  Calculate the error rate of the rules.
Choose the rules with the smallest error rate.
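A minimal sketch of running OneR from Java (the contact-lenses sample file ships with Weka; the path is an assumption):

import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/contact-lenses.arff");
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute
        OneR oneR = new OneR();
        oneR.buildClassifier(data);
        System.out.println(oneR); // prints the single-attribute rule it settled on
    }
}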
34. C4.5 (J48)
Produces a decision tree, derived from divide-and-conquer tree building techniques.
Decision trees are often verbose and need to be pruned - J48 uses post-pruning, which can in some instances be costly.
J48 usually provides a good balance between quality and cost (execution time etc.).
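Pruning is configurable. A minimal sketch of the relevant J48 options (values shown are Weka's defaults):

import weka.classifiers.trees.J48;

J48 j48 = new J48();
j48.setConfidenceFactor(0.25f); // lower values prune more aggressively
j48.setMinNumObj(2);            // minimum number of instances per leaf
// j48.setUnpruned(true);       // or switch pruning off entirely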
35. NaiveBayes
Very good and popular for document (text) classification.
Based on statistical modelling (Bayes' formula of conditional probability).
In document classification we treat the existence
or absence of a word as a Boolean attribute.
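For reference, Bayes' formula, and its naive application to a document:

\[
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)},
\qquad
P(\text{class} \mid \text{doc}) \propto P(\text{class}) \prod_i P(w_i \mid \text{class})
\]

The "naive" part is assuming the words \(w_i\) are independent given the class - wrong for real text, but it works surprisingly well in practice.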
37. Training and Testing
We implicitly trained and tested our classifiers in
the previous examples using Cross-Validation.
38. Training and Testing
Test data and Training data NEED to be different.
If you have only one dataset, split it up.
n-fold Cross-Validation:
- Divides your dataset into n parts, holds out
each part in turn
- Trains with n-1 parts, tests with the held out
part
- Stratified CV is even better
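A sketch of what this looks like with Weka's Instances API ('data' is assumed to be a loaded dataset with its class index set):

import java.util.Random;
import weka.core.Instances;

int folds = 10;
Instances randomized = new Instances(data); // copy before shuffling
randomized.randomize(new Random(1));
randomized.stratify(folds); // stratified CV: keep class proportions per fold
for (int i = 0; i < folds; i++) {
    Instances train = randomized.trainCV(folds, i); // n-1 parts
    Instances test  = randomized.testCV(folds, i);  // the held-out part
    // build a classifier on train, evaluate it on test …
}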
41. Bag of Words
Generally, for document classification we treat a document as a bag of words, where the existence or absence of a word is a Boolean attribute.
This results in problems with a very large number of attributes, each having only two values.
This is quite a bit different from the usual classification problem.
43. Filtered Classifiers
First step: use FilteredClassifier with J48 and the StringToWordVector filter.
Example: Reuters Corn datasets (train/test)
We get 97% accuracy, but there's still an issue here -> investigate the confusion matrix
Is accuracy the best way to evaluate quality?
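A minimal sketch of this setup (the Reuters corn train/test .arff files ship with Weka's sample data; the paths are assumptions):

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CornDemo {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("data/ReutersCorn-train.arff");
        Instances test  = DataSource.read("data/ReutersCorn-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector()); // document string -> word attributes
        fc.setClassifier(new J48());
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toMatrixString()); // the confusion matrix
    }
}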
44. Better approaches to evaluation
Accuracy: (a+d)/(a+b+c+d)
Recall: R = d/(c+d)
Precision: P = d/(b+d)
F-Measure: 2PR/(P+R)
False positive rate FP: b/(a+b)
True negative rate TN: a/(a+b)
False negative rate FN: c/(c+d)
              predicted
              –    +
  true   –    a    b
         +    c    d
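Weka's Evaluation object exposes all of these directly. A sketch, assuming 'eval' was built as in the example above and class index 1 is the "+" (minority) class:

System.out.println("Precision: " + eval.precision(1));
System.out.println("Recall:    " + eval.recall(1));
System.out.println("F-Measure: " + eval.fMeasure(1));
System.out.println("FP rate:   " + eval.falsePositiveRate(1));
System.out.println("ROC area:  " + eval.areaUnderROC(1));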
47. NaiveBayesMultinomial
Often the best classifier for document classification. In particular:
- good ROC
- good results on the minority class (often what we want)
48. NaiveBayesMultinomial
J48: 96% accuracy, 38/57 on grain docs, 544/547 on non-grain docs, ROC 0.91
NaiveBayes: 80% accuracy, 46/57 on grain docs, 439/547 on non-grain docs, ROC 0.885
NaiveBayesMultinomial: 91% accuracy, 52/57 on grain docs, 496/547 on non-grain docs, ROC 0.973
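As a check on the formulas from earlier, and assuming grain is the "+" class, J48's counts above give \(d = 38\), \(c = 19\), \(a = 544\), \(b = 3\):

\[
\text{Accuracy} = \frac{544 + 38}{604} \approx 0.96,\quad
\text{Recall} = \frac{38}{57} \approx 0.67,\quad
\text{Precision} = \frac{38}{41} \approx 0.93,\quad
F \approx 0.78
\]

The headline 96% accuracy hides a recall of only 0.67 on the minority class - exactly why accuracy alone is a poor quality measure here.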
50. NaiveBayesMultinomial
NaiveBayesMultinomial with stoplist, lowerCaseTokens and outputWordCounts: 94% accuracy, 56/57 on grain docs, 504/547 on non-grain docs, ROC 0.978
Why? NBM is designed for text:
- based solely on word appearance
- can deal with multiple repetitions of a word
- faster than NB
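A sketch of that configuration (method names are from Weka 3.8 and are an assumption here - older releases expose setUseStoplist(boolean) instead of a stopwords handler):

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.stopwords.Rainbow;
import weka.filters.unsupervised.attribute.StringToWordVector;

StringToWordVector s2wv = new StringToWordVector();
s2wv.setLowerCaseTokens(true);           // fold everything to lower case
s2wv.setOutputWordCounts(true);          // counts instead of mere presence/absence
s2wv.setStopwordsHandler(new Rainbow()); // built-in stoplist

FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(s2wv);
fc.setClassifier(new NaiveBayesMultinomial());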
52. Weka is written in Java
The UI is essentially a front end to a vast underlying data mining and machine learning API.
Obviously, this invites us to use the API directly :)
53. Setting up a project (IntelliJ IDEA)
Create new Java project in IntelliJ
Import weka.jar
Import weka-src.jar
Off you go!
54. The main classes/packages you need…
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
55. Getting stuff done
// Load an .arff dataset (file name is an example); exception handling omitted
BufferedReader bReader = new BufferedReader(new FileReader("weather.arff"));
Instances train = new Instances(bReader);
bReader.close();
// By convention the class is the last attribute
train.setClassIndex(train.numAttributes() - 1);

// Build a C4.5-style decision tree
J48 j48 = new J48();
j48.buildClassifier(train);

// 10-fold cross-validation with a fixed random seed
Evaluation eval = new Evaluation(train);
eval.crossValidateModel(j48, train, 10, new Random(1));
System.out.println(eval.toSummaryString());