Introduction to
Data Mining
Kai Koenig
@AgentK
Web/Mobile Developer since the late 1990s
Interested in: Java & JVM, CFML, Functional
Programming, Go, Android, Data Science
And this is my view of the world…
Me
1. What is Data Mining?

2. Concepts and Terminology

3. Weka

4. Algorithms

5. Dealing with Text

6. Java integration
Agenda
We are overwhelmed
with data.
1. What is Data Mining?
Fundamentals
Why do we nowadays have SO MUCH data?
Reasons include:
- Cheap storage and better processing power
- Legal & Business requirements
- Digital hoarding
Fundamentals
Data Mining is all about going from data to useful
and meaningful information.
- Recommendation in online shops
- Finding an “optimal” partner
- Weather prediction
- Judgement decisions (credit applications)
Fundamentals
A better definition
“Data Mining is defined as the process of
discovering patterns in data. The process must be
automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful in that
they lead to some advantage, often an economic
one.”
(Prof. Dr. Ian Witten)
How can you express patterns?
Finding and applying rules
Tear Production Rate == reduced → none
Age == young && Astigmatism == no → soft
A Result: Decision lists
If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes
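A decision list like the one above is applied top to bottom, and the first rule that matches decides the outcome. The following is a minimal Java sketch of that idea; the class and method names are made up for illustration and are not part of Weka.

```java
public class DecisionListSketch {
    /** Applies the rules in order; the first rule whose condition holds decides. */
    static String play(String outlook, String humidity, boolean windy) {
        if (outlook.equals("sunny") && humidity.equals("high")) return "no";
        if (outlook.equals("rainy") && windy) return "no";
        if (outlook.equals("overcast")) return "yes";
        if (humidity.equals("normal")) return "yes";
        return "yes"; // default rule: none of the above matched
    }

    public static void main(String[] args) {
        System.out.println(play("sunny", "high", false));
        System.out.println(play("rainy", "normal", false));
    }
}
```

Because order matters, moving a rule up or down the list can change the prediction for the same instance.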
Not all rules are equal
Classification rules: predict an outcome
Association rules: rules that strongly associate
different attribute values
If temperature = cool then humidity = normal

If humidity = normal and windy = false then play = yes 

If outlook = sunny and play = no then humidity = high

2. Concepts and
Terminology
Learning
What is Learning? And what is Machine Learning?
A good approach is:
“Things learn when they change their
behaviour in a way that makes them perform
better in the future”
Learning types
Classification learning
Association learning
Clustering
Numerical Prediction
Some basic terminology
The thing to be learned is the concept.
The output of a learning scheme is the
concept description.
Classification learning is sometimes called
supervised learning. The outcome is the
class.
Examples are called instances.
Some more basic terminology
Discrete attribute values are usually called
nominal values, continuous attribute values are
called just numeric values.
Algorithms used to process data and find
patterns are often called classifiers. There are
lots of them and all of them can be heavily
configured.
3. Weka
What is Weka?
Waikato Environment for Knowledge Analysis
Developed by a group in the Dept. of Computer
Science at the University of Waikato in New
Zealand.


Also, the weka is a bird found only in New Zealand.
What is Weka?
Download for Mac OS X, Linux and Windows:
http://www.cs.waikato.ac.nz/~ml/weka/index.html

Weka is written in Java, comes either as a native
application or as an executable .jar file, and is
licensed under GPL v3.
Getting data into Weka
Easiest and most common for experimenting: .arff
Also supported: CSV, JSON, XML, JDBC
connections etc.
Filters in Weka can then be used to preprocess
data.
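An .arff file is plain text: a relation name, one line per attribute, then the data rows. As a rough sketch of that layout, the Java snippet below renders a few nominal rows as ARFF text. The helper name toArff and the sample rows are assumptions made for this example, and Weka ships its own loaders and savers that you would normally use instead.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ArffSketch {
    /** Renders rows of nominal values as ARFF text: header first, then the data section. */
    static String toArff(String relation, String[] names, String[][] rows) {
        StringBuilder sb = new StringBuilder("@relation " + relation + "\n\n");
        for (int a = 0; a < names.length; a++) {
            // A nominal attribute lists its allowed values in curly braces
            Set<String> values = new LinkedHashSet<>();
            for (String[] row : rows) values.add(row[a]);
            sb.append("@attribute ").append(names[a])
              .append(" {").append(String.join(",", values)).append("}\n");
        }
        sb.append("\n@data\n");
        for (String[] row : rows) sb.append(String.join(",", row)).append("\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample rows: outlook, humidity, play
        String[][] rows = {
            {"sunny", "high", "no"},
            {"overcast", "high", "yes"},
            {"rainy", "normal", "yes"}
        };
        System.out.print(toArff("weather", new String[] {"outlook", "humidity", "play"}, rows));
    }
}
```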
Features
50+ Preprocessing tools
75+ Classification/Regression algorithms
~10 clustering algorithms
… and a package manager to load and install
more if you want.
4. Algorithms
Classifiers
There are literally hundreds with lots of tuning
options.
Main Categories:
- Rule-based (ZeroR, OneR, PART etc.)
- Tree-based (J48, J48graft, CART etc.)
- Bayes-based (NaiveBayes etc.)
- Functions-based (LR, Logistic etc.)
- Lazy (IB1, IBk etc.)
OneR
A very simple classifier that bases its rules on a
single attribute.
For each attribute,
  For each value of that attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute value.
  Calculate the error rate of the rules.
Choose the rules with the smallest error rate.
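The OneR steps above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Weka's own OneR implementation: it handles nominal attributes only, breaks ties in favour of the first candidate seen, and the TOY dataset is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OneRSketch {
    /** The winning attribute, its value -> class rules, and their training errors. */
    static final class Result {
        final int attribute;
        final Map<String, String> rules;
        final int errors;

        Result(int attribute, Map<String, String> rules, int errors) {
            this.attribute = attribute;
            this.rules = rules;
            this.errors = errors;
        }
    }

    // Made-up toy data: columns are outlook, windy, play (the class)
    static final String[][] TOY = {
        {"sunny", "false", "no"},
        {"sunny", "true", "no"},
        {"overcast", "false", "yes"},
        {"rainy", "false", "yes"},
        {"rainy", "true", "no"},
        {"overcast", "true", "yes"}
    };

    static Result oneR(String[][] rows, int classIndex) {
        Result best = null;
        for (int a = 0; a < rows[0].length; a++) {
            if (a == classIndex) continue;
            // Count how often each class appears for each value of attribute a
            Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
            for (String[] row : rows) {
                counts.computeIfAbsent(row[a], k -> new LinkedHashMap<>())
                      .merge(row[classIndex], 1, Integer::sum);
            }
            Map<String, String> rules = new LinkedHashMap<>();
            int errors = 0;
            for (Map.Entry<String, Map<String, Integer>> value : counts.entrySet()) {
                String bestClass = null;
                int bestCount = -1;
                int total = 0;
                for (Map.Entry<String, Integer> c : value.getValue().entrySet()) {
                    total += c.getValue();
                    if (c.getValue() > bestCount) {
                        bestCount = c.getValue();
                        bestClass = c.getKey();
                    }
                }
                // The rule assigns the most frequent class to this attribute value
                rules.put(value.getKey(), bestClass);
                errors += total - bestCount;
            }
            // Keep the attribute whose rules make the fewest training errors
            if (best == null || errors < best.errors) best = new Result(a, rules, errors);
        }
        return best;
    }

    public static void main(String[] args) {
        Result r = oneR(TOY, 2);
        System.out.println("attribute " + r.attribute + ", rules " + r.rules + ", errors " + r.errors);
    }
}
```

On the toy data the rules built on outlook make one training error and those built on windy make two, so outlook is chosen.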
C4.5 (J48)
Produces a decision tree, derived from divide-and-conquer
tree building techniques.
Decision trees are often verbose and need to be
pruned. J48 uses post-pruning, which can in
some instances be costly.
J48 usually provides a good balance between quality
and cost (execution time etc.)
NaiveBayes
Very good and popular for document (text)
classification.
Based on statistical modelling (Bayes' formula for
conditional probability)
In document classification we treat the existence
or absence of a word as a Boolean attribute.
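Written out, the Boolean-attribute view of a document leads to the following form of Bayes' formula under the naive independence assumption. The symbols are assumptions introduced for this sketch: c is a class, w_i is the i-th word of the vocabulary, and x_i is 1 if w_i occurs in the document and 0 otherwise.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)^{x_i} \bigl(1 - P(w_i \mid c)\bigr)^{1 - x_i}
```

The predicted class is the one for which the right-hand side is largest.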
Training and Testing
We implicitly trained and tested our classifiers in
the previous examples using Cross-Validation.
Training and Testing
Test data and Training data NEED to be different.
If you have only one dataset, split it up.
n-fold Cross-Validation:
- Divides your dataset into n parts, holds out
each part in turn
- Trains with n-1 parts, tests with the held out
part
- Stratified CV is even better
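The splitting step of n-fold cross-validation can be sketched in Java as below. The method name split and the seed are assumptions for the example, and the sketch is not stratified: it shuffles instance indices and deals them into folds without looking at class labels.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrossValidationSketch {
    /** Shuffles the indices 0..size-1 and deals them into n folds of near-equal size. */
    static List<List<Integer>> split(int size, int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < size; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < n; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < size; i++) folds.get(i % n).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = split(14, 5, 1L);
        for (int held = 0; held < folds.size(); held++) {
            List<Integer> test = folds.get(held);
            List<Integer> train = new ArrayList<>();
            for (int f = 0; f < folds.size(); f++) {
                if (f != held) train.addAll(folds.get(f));
            }
            // A real run would train on 'train' and evaluate on 'test' here
            System.out.println("fold " + held + ": train=" + train.size() + " test=" + test.size());
        }
    }
}
```

Every instance lands in exactly one test fold, so each one is used for testing once and for training n-1 times.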
5. Dealing with Text
Bag of Words
Generally for document classification we treat a
document as a bag of words and the existence
or absence of a word is a Boolean attribute.
This results in problems with very many
attributes having 2 values each.
This is quite a bit different from the usual
classification problem.
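To make the bag-of-words idea concrete, here is a small Java sketch that turns documents into Boolean word-presence vectors. It only illustrates the representation: the tokenizer is deliberately naive, the class and method names are made up, and in Weka the StringToWordVector filter does this job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class BagOfWordsSketch {
    /** Lower-cases the text and splits on anything that is not a letter a-z. */
    static Set<String> words(String doc) {
        Set<String> out = new HashSet<>();
        for (String w : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    /** Sorted vocabulary over all documents: one Boolean attribute per word. */
    static List<String> vocabulary(List<String> docs) {
        Set<String> vocab = new TreeSet<>();
        for (String d : docs) vocab.addAll(words(d));
        return new ArrayList<>(vocab);
    }

    /** Position i is true if the i-th vocabulary word occurs in the document. */
    static boolean[] vectorize(String doc, List<String> vocab) {
        Set<String> present = words(doc);
        boolean[] v = new boolean[vocab.size()];
        for (int i = 0; i < v.length; i++) v[i] = present.contains(vocab.get(i));
        return v;
    }

    public static void main(String[] args) {
        // Made-up sample documents
        List<String> docs = Arrays.asList("Corn prices rise", "Oil prices fall");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String d : docs) System.out.println(Arrays.toString(vectorize(d, vocab)));
    }
}
```

Even two short documents produce five attributes here, which hints at why real text collections end up with thousands of them.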
Filtered Classifiers
First step: use Filtered classifier with J48 and
StringToWordVector filter.
Example: Reuters Corn datasets (train/test)
We get 97% accuracy, but there’s still an issue
here -> investigate the confusion matrix
Is accuracy the best way to evaluate quality?
Better approaches to evaluation
Accuracy: (a+d)/(a+b+c+d)
Recall: R = d/(c+d)
Precision: P = d/(b+d)
F-Measure: 2PR/(P+R)
False positive rate FP: b/(a+b)
True negative rate TN: a/(a+b)
False negative rate FN: c/(c+d)
Confusion matrix:
              predicted
               –    +
true    –      a    b
        +      c    d
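The measures above translate directly into code. The Java sketch below uses the same a, b, c, d layout. The sample counts are taken from the J48 result reported later in the deck (38/57 grain documents and 544/547 non-grain documents classified correctly), with grain treated as the positive class, which is an assumption of this example.

```java
public class ConfusionMatrixSketch {
    // a = true negatives, b = false positives, c = false negatives, d = true positives
    final double a, b, c, d;

    ConfusionMatrixSketch(double a, double b, double c, double d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    double accuracy() { return (a + d) / (a + b + c + d); }

    double recall() { return d / (c + d); }

    double precision() { return d / (b + d); }

    double fMeasure() {
        double p = precision();
        double r = recall();
        return 2 * p * r / (p + r);
    }

    double falsePositiveRate() { return b / (a + b); }

    public static void main(String[] args) {
        // Assumed mapping: 544 non-grain correct, 3 non-grain wrong, 19 grain missed, 38 grain found
        ConfusionMatrixSketch m = new ConfusionMatrixSketch(544, 3, 19, 38);
        System.out.printf("accuracy=%.3f recall=%.3f precision=%.3f F=%.3f%n",
                m.accuracy(), m.recall(), m.precision(), m.fMeasure());
    }
}
```

With these counts accuracy is about 0.96 while recall on the grain class is only about 0.67, which is the kind of gap that accuracy alone hides.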
ROC (threshold) curves
The area under the ROC (threshold) curve summarises
the overall quality of a classifier across all thresholds.
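One way to compute the area under the ROC curve without drawing it is to look at every (positive, negative) pair of instances and count how often the positive one receives the higher score, with ties counting as half. The Java sketch below takes that pairwise approach; it is quadratic in the number of instances, the names are made up for illustration, and it is not how Weka computes its ROC area.

```java
public class AucSketch {
    /** Fraction of (positive, negative) pairs ranked correctly; ties count as half. */
    static double auc(double[] scores, boolean[] positive) {
        double correct = 0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) correct += 1.0;
                else if (scores[i] == scores[j]) correct += 0.5;
            }
        }
        return correct / pairs;
    }

    public static void main(String[] args) {
        // Made-up classifier scores for two positive and two negative instances
        double[] scores = {0.9, 0.4, 0.6, 0.1};
        boolean[] positive = {true, true, false, false};
        System.out.println(auc(scores, positive));
    }
}
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 is what random scoring would achieve on average.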
NaiveBayesMultinomial
Often the best classifier for document
classification. In particular:
- good ROC
- good results on minority class (often what we
want)
NaiveBayesMultinomial
J48: 96% accuracy, 38/57 on grain docs, 544/547
on non-grain docs, ROC 0.91
NaiveBayes: 80% accuracy, 46/57 on grain docs,
439/547 on non-grain docs, ROC 0.885
NaiveBayesMultinomial: 91% accuracy, 52/57 on
grain docs, 496/547 on non-grain docs, ROC
0.973
NaiveBayesMultinomial
NaiveBayesMultinomial with stoplist, lowerCase
and outputWords: 94% accuracy, 56/57 on grain
docs, 504/547 on non-grain docs, ROC 0.978
Why? NBM is designed for text:
- based solely on word appearance
- can deal with multiple repetitions of a word
- faster than NB
6. Java integration
Weka is written in Java
The UI is essentially making use of a vast
underlying data mining and machine learning
API.
Obviously this fact
invites us to use the
API directly :)
Setting up a project (IntelliJ IDEA)
Create new Java project in IntelliJ
Import weka.jar
Import weka-src.jar
Off you go!
The main classes/packages you need…
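import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
Getting stuff done
// Load the data; "weather.arff" is a placeholder path
BufferedReader bReader = new BufferedReader(new FileReader("weather.arff"));
Instances train = new Instances(bReader);
bReader.close();
// Use the last attribute as the class
train.setClassIndex(train.numAttributes() - 1);
// Build a J48 (C4.5) decision tree on the full dataset
J48 j48 = new J48();
j48.buildClassifier(train);
// Estimate its quality with 10-fold cross-validation
Evaluation eval = new Evaluation(train);
eval.crossValidateModel(j48, train, 10, new Random(1));
System.out.println(eval.toSummaryString());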
You can also grab Java code from the Weka UI
Photo Credits
https://www.flickr.com/photos/johnnystiletto/3339808858/
https://www.flickr.com/photos/theequinest/5056055144/
https://www.flickr.com/photos/flyingkiwigirl/17385243168
https://www.flickr.com/photos/x6e38/3440973490/
https://www.flickr.com/photos/42931449@N07/5418402840/
https://www.flickr.com/photos/gerardstolk/12194108005/
https://www.flickr.com/photos/zzpza/3269784239/in/
https://www.flickr.com/photos/internationaltransportforum/14258907973/


Get in touch
Kai Koenig
Email: kai@ventego-creative.co.nz
www.ventego-creative.co.nz
Blog: www.bloginblack.de
Twitter: @AgentK

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 

Introduction to Data Mining

  • 2. Me: Web/Mobile Developer since the late 1990s. Interested in: Java & JVM, CFML, Functional Programming, Go, Android, Data Science. And this is my view of the world…
  • 4. Agenda
 1. What is Data Mining?
 2. Concepts and Terminology
 3. Weka
 4. Algorithms
 5. Dealing with Text
 6. Java integration
  • 8. 1. What is Data Mining?
  • 9. Fundamentals: Why do we have SO MUCH data nowadays? Reasons include:
 - Cheap storage and better processing power
 - Legal & business requirements
 - Digital hoarding
  • 10. Fundamentals: Data Mining is all about going from data to useful and meaningful information.
 - Recommendations in online shops
 - Finding an “optimal” partner
 - Weather prediction
 - Judgement decisions (credit applications)
  • 12. A better definition: “Data Mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, often an economic one.” (Prof. Dr. Ian Witten)
  • 13. How can you express patterns?
  • 14. Finding and applying rules: Tear Production Rate == reduced → none
  • 15. Finding and applying rules: Age == young && Astigmatism == no → soft
  • 16. A Result: Decision lists
 If outlook = sunny and humidity = high then play = no
 If outlook = rainy and windy = true then play = no
 If outlook = overcast then play = yes
 If humidity = normal then play = yes
 If none of the above then play = yes
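A decision list like the one above is applied top to bottom, and the first matching rule wins. A minimal Java sketch of that idea follows; the class and method names are illustrative assumptions and not part of the slides or of Weka.

```java
import java.util.Map;

/** Sketch of the weather decision list: rules are tried in order, first match wins. */
final class DecisionListSketch {

    /** Returns "yes" or "no" for one instance given as attribute -> value pairs. */
    static String play(Map<String, String> instance) {
        String outlook = instance.get("outlook");
        String humidity = instance.get("humidity");
        String windy = instance.get("windy");

        if ("sunny".equals(outlook) && "high".equals(humidity)) return "no";
        if ("rainy".equals(outlook) && "true".equals(windy)) return "no";
        if ("overcast".equals(outlook)) return "yes";
        if ("normal".equals(humidity)) return "yes";
        return "yes"; // default rule: none of the above matched
    }

    public static void main(String[] args) {
        System.out.println(play(Map.of("outlook", "sunny", "humidity", "high", "windy", "false")));
        System.out.println(play(Map.of("outlook", "rainy", "humidity", "high", "windy", "false")));
    }
}
```

The order of the rules matters: moving the "humidity = normal" rule to the top would change the answer for a rainy, windy day with normal humidity.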
  • 17. Not all rules are equal. Classification rules: predict an outcome. Association rules: rules that strongly associate different attribute values.
 If temperature = cool then humidity = normal
 If humidity = normal and windy = false then play = yes 
 If outlook = sunny and play = no then humidity = high

  • 19. Learning: What is Learning? And what is Machine Learning? A good approach is: “Things learn when they change their behaviour in a way that makes them perform better in the future.”
  • 20. Learning types:
 - Classification learning
 - Association learning
 - Clustering
 - Numerical prediction
  • 21. Some basic terminology: The thing to be learned is the concept. The output of a learning scheme is the concept description. Classification learning is sometimes called supervised learning. The outcome is the class. Examples are called instances.
  • 24. Some more basic terminology: Discrete attribute values are usually called nominal values; continuous attribute values are simply called numeric values. Algorithms used to process data and find patterns are often called classifiers. There are lots of them, and all of them can be heavily configured.
  • 27. What is Weka? Waikato Environment for Knowledge Analysis. Developed by a group in the Dept. of Computer Science at the University of Waikato in New Zealand.
 Also, the weka is a bird found only in New Zealand.
  • 28. What is Weka? Download for Mac OS X, Linux and Windows: http://www.cs.waikato.ac.nz/~ml/weka/index.html
 Weka is written in Java, comes either as a native application or as an executable .jar file, and is licensed under GPL v3.
  • 29. Getting data into Weka: The easiest and most common format for experimenting is .arff. Also supported: CSV, JSON, XML, JDBC connections etc. Filters in Weka can then be used to preprocess data.
  • 30. Features:
 - 50+ preprocessing tools
 - 75+ classification/regression algorithms
 - ~10 clustering algorithms
 - … and a package manager to load and install more if you want.
  • 32. Classifiers: There are literally hundreds, with lots of tuning options. Main categories:
 - Rule-based (ZeroR, OneR, PART etc.)
 - Tree-based (J48, J48graft, CART etc.)
 - Bayes-based (NaiveBayes etc.)
 - Functions-based (LR, Logistic etc.)
 - Lazy (IB1, IBk etc.)
  • 33. OneR: A very simple classifier based on a single attribute.
 For each attribute,
   For each value of that attribute, make a rule as follows:
     count how often each class appears
     find the most frequent class
     make the rule assign that class to this attribute value.
   Calculate the error rate of the rules.
 Choose the rules with the smallest error rate.
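The OneR procedure can be sketched in plain Java without Weka. The code below is an illustrative assumption, not Weka's own OneR implementation: it handles nominal attributes only, breaks ties by keeping the first candidate, and uses the 14-instance nominal weather dataset (reproduced here for illustration) with the class in the last column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of OneR for nominal data; the class value is the last column of each row. */
final class OneRSketch {

    /** Chosen attribute index, its value -> class rules, and the number of training errors. */
    static final class Rule {
        final int attribute;
        final Map<String, String> valueToClass;
        final int errors;

        Rule(int attribute, Map<String, String> valueToClass, int errors) {
            this.attribute = attribute;
            this.valueToClass = valueToClass;
            this.errors = errors;
        }
    }

    // Columns: outlook, temperature, humidity, windy, play (illustrative copy of the weather data).
    static final String[][] WEATHER = {
        {"sunny", "hot", "high", "false", "no"},       {"sunny", "hot", "high", "true", "no"},
        {"overcast", "hot", "high", "false", "yes"},   {"rainy", "mild", "high", "false", "yes"},
        {"rainy", "cool", "normal", "false", "yes"},   {"rainy", "cool", "normal", "true", "no"},
        {"overcast", "cool", "normal", "true", "yes"}, {"sunny", "mild", "high", "false", "no"},
        {"sunny", "cool", "normal", "false", "yes"},   {"rainy", "mild", "normal", "false", "yes"},
        {"sunny", "mild", "normal", "true", "yes"},    {"overcast", "mild", "high", "true", "yes"},
        {"overcast", "hot", "normal", "false", "yes"}, {"rainy", "mild", "high", "true", "no"}
    };

    static Rule oneR(String[][] rows) {
        int classIndex = rows[0].length - 1;
        Rule best = null;
        for (int a = 0; a < classIndex; a++) {
            // For each value of attribute a, count how often each class appears.
            Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
            for (String[] row : rows) {
                counts.computeIfAbsent(row[a], v -> new LinkedHashMap<>())
                      .merge(row[classIndex], 1, Integer::sum);
            }
            Map<String, String> rules = new LinkedHashMap<>();
            int errors = 0;
            for (Map.Entry<String, Map<String, Integer>> byValue : counts.entrySet()) {
                String majority = null;
                int majorityCount = 0;
                int total = 0;
                for (Map.Entry<String, Integer> byClass : byValue.getValue().entrySet()) {
                    total += byClass.getValue();
                    if (byClass.getValue() > majorityCount) {
                        majority = byClass.getKey();
                        majorityCount = byClass.getValue();
                    }
                }
                rules.put(byValue.getKey(), majority); // rule: value -> most frequent class
                errors += total - majorityCount;       // instances the rule gets wrong
            }
            // Keep the attribute whose rules make the fewest errors (first one wins a tie).
            if (best == null || errors < best.errors) {
                best = new Rule(a, rules, errors);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Rule rule = oneR(WEATHER);
        System.out.println("attribute index " + rule.attribute + ": " + rule.valueToClass
                + ", errors " + rule.errors + "/" + WEATHER.length);
    }
}
```

On this data the outlook and humidity attributes make the same number of errors, so the tie-breaking choice decides which one is reported.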
  • 34. C4.5 (J48): Produces a decision tree, derived from divide-and-conquer tree building techniques. Decision trees are often verbose and need to be pruned. J48 uses post-pruning, which can in some instances be costly. J48 usually provides a good balance of quality vs. cost (execution times etc.).
  • 35. NaiveBayes: Very good and popular for document (text) classification. Based on statistical modelling (Bayes' formula of conditional probability). In document classification we treat the existence or absence of a word as a Boolean attribute.
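For reference, the Bayes formula mentioned above can be written out as follows. The symbols are assumptions for illustration: H is the class (hypothesis) and E_1, …, E_n are the attribute values (for text, the presence or absence of each word). The "naive" part is the assumption that the attributes are independent given the class.

```latex
\[
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}
\qquad\Longrightarrow\qquad
P(H \mid E_1,\dots,E_n) \;\propto\; P(H)\prod_{i=1}^{n} P(E_i \mid H)
\]
```

The class with the largest value of the right-hand side is the prediction; P(E) is the same for every class and can be ignored when comparing them.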
  • 37. Training and Testing: We implicitly trained and tested our classifiers in the previous examples using Cross-Validation.
  • 38. Training and Testing: Test data and training data NEED to be different. If you have only one dataset, split it up. n-fold Cross-Validation:
 - Divides your dataset into n parts and holds out each part in turn
 - Trains with n-1 parts, tests with the held-out part
 - Stratified CV is even better
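The splitting step of n-fold cross-validation can be sketched as below. This is a simplified assumption for illustration: it works on instance indices only, shuffles with a fixed seed, and does not stratify by class, so it is not how Weka's own cross-validation is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of plain (non-stratified) n-fold cross-validation on instance indices. */
final class CrossValidationSketch {

    /** Shuffles the indices 0..numInstances-1 and deals them into n folds. */
    static List<List<Integer>> folds(int numInstances, int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < n; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < numInstances; i++) {
            folds.get(i % n).add(indices.get(i));
        }
        return folds;
    }

    /** Training set for one round: every fold except the held-out one. */
    static List<Integer> trainingIndices(List<List<Integer>> folds, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != heldOut) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(14, 5, 1L);
        for (int heldOut = 0; heldOut < folds.size(); heldOut++) {
            System.out.println("test " + folds.get(heldOut)
                    + " / train " + trainingIndices(folds, heldOut));
        }
    }
}
```

A stratified variant would additionally make sure each fold contains roughly the same proportion of each class as the full dataset.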
  • 41. Bag of Words: Generally, for document classification we treat a document as a bag of words, and the existence or absence of a word is a Boolean attribute. This results in problems with very many attributes, each having 2 values. This is quite a bit different from the usual classification problem.
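To make the bag-of-words idea concrete, here is a small Java sketch that turns documents into Boolean word-presence vectors. The tokenisation rule (lower-casing and splitting on non-alphanumeric characters) and the example sentences are assumptions for illustration; Weka's StringToWordVector filter has many more options.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: documents become Boolean vectors, one attribute per vocabulary word. */
final class BagOfWordsSketch {

    /** Distinct lower-cased words of a document; word order and counts are discarded. */
    static Set<String> tokens(String document) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : document.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    /** Sorted list of every word that occurs in at least one document. */
    static List<String> vocabulary(List<String> documents) {
        Set<String> vocab = new TreeSet<>();
        for (String document : documents) {
            vocab.addAll(tokens(document));
        }
        return new ArrayList<>(vocab);
    }

    /** One Boolean attribute per vocabulary word: true if the word occurs in the document. */
    static boolean[] vectorize(String document, List<String> vocabulary) {
        Set<String> words = tokens(document);
        boolean[] vector = new boolean[vocabulary.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = words.contains(vocabulary.get(i));
        }
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical example documents.
        List<String> docs = List.of("Corn prices rise", "Wheat and corn exports");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String doc : docs) {
            System.out.println(java.util.Arrays.toString(vectorize(doc, vocab)));
        }
    }
}
```

Even two short sentences already produce six attributes, which hints at why real document collections end up with thousands of them.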
  • 43. Filtered Classifiers: First step: use FilteredClassifier with J48 and the StringToWordVector filter. Example: Reuters Corn datasets (train/test). We get 97% accuracy, but there’s still an issue here -> investigate the confusion matrix. Is accuracy the best way to evaluate quality?
  • 44. Better approaches to evaluation, based on the confusion matrix:
            predicted –   predicted +
   true –        a             b
   true +        c             d
 Accuracy: (a+d)/(a+b+c+d)
 Recall: R = d/(c+d)
 Precision: P = d/(b+d)
 F-Measure: 2PR/(P+R)
 False positive rate FP: b/(a+b)
 True negative rate TN: a/(a+b)
 False negative rate FN: c/(c+d)
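These formulas translate directly into code. The Java sketch below uses the same cell names a, b, c, d as the confusion matrix. The example numbers are taken from the J48 result reported later in the deck (38/57 grain docs and 544/547 non-grain docs correct), under the assumption that "grain" is the positive class; the class name is illustrative.

```java
/** Sketch: evaluation measures from a 2x2 confusion matrix with cells a, b, c, d. */
final class ConfusionMatrixSketch {
    private final double a; // true –, predicted – (true negatives)
    private final double b; // true –, predicted + (false positives)
    private final double c; // true +, predicted – (false negatives)
    private final double d; // true +, predicted + (true positives)

    ConfusionMatrixSketch(double a, double b, double c, double d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    double accuracy()          { return (a + d) / (a + b + c + d); }
    double recall()            { return d / (c + d); }
    double precision()         { return d / (b + d); }
    double falsePositiveRate() { return b / (a + b); }
    double trueNegativeRate()  { return a / (a + b); }
    double falseNegativeRate() { return c / (c + d); }

    double fMeasure() {
        double p = precision();
        double r = recall();
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Assumed mapping: grain = "+", so d = 38, c = 57 - 38, a = 544, b = 547 - 544.
        ConfusionMatrixSketch j48 = new ConfusionMatrixSketch(544, 3, 19, 38);
        System.out.printf("accuracy=%.3f recall=%.3f precision=%.3f F=%.3f%n",
                j48.accuracy(), j48.recall(), j48.precision(), j48.fMeasure());
    }
}
```

With these numbers the accuracy is high while the recall on the grain class is much lower, which is exactly the issue the confusion matrix is meant to expose.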
  • 45. ROC (threshold) curves: The area under the threshold curve summarises the overall quality of a classifier.
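One way to read the area under the ROC curve: it equals the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one. The Java sketch below computes it that way by comparing all pairs, counting ties as half. The scores are made-up illustration values, and this pairwise method is an assumption chosen for clarity, not the way Weka computes the area.

```java
/** Sketch: area under the ROC curve via pairwise comparison of classifier scores. */
final class AucSketch {

    /**
     * Fraction of (positive, negative) pairs in which the positive instance
     * has the higher score; ties count as half a win.
     */
    static double auc(double[] positiveScores, double[] negativeScores) {
        double wins = 0.0;
        for (double p : positiveScores) {
            for (double n : negativeScores) {
                if (p > n) {
                    wins += 1.0;
                } else if (p == n) {
                    wins += 0.5;
                }
            }
        }
        return wins / ((double) positiveScores.length * negativeScores.length);
    }

    public static void main(String[] args) {
        // Hypothetical scores, e.g. the predicted probability of the positive class.
        double[] positives = {0.9, 0.8, 0.4};
        double[] negatives = {0.7, 0.3, 0.2};
        System.out.println(auc(positives, negatives));
    }
}
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 is what random scoring would give.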
  • 47. NaiveBayesMultinomial: Often the best classifier for document classification. In particular:
 - good ROC
 - good results on the minority class (often what we want)
  • 48. NaiveBayesMultinomial:
 - J48: 96% accuracy, 38/57 on grain docs, 544/547 on non-grain docs, ROC 0.91
 - NaiveBayes: 80% accuracy, 46/57 on grain docs, 439/547 on non-grain docs, ROC 0.885
 - NaiveBayesMultinomial: 91% accuracy, 52/57 on grain docs, 496/547 on non-grain docs, ROC 0.973
  • 50. NaiveBayesMultinomial with stoplist, lowerCase and outputWords: 94% accuracy, 56/57 on grain docs, 504/547 on non-grain docs, ROC 0.978. Why? NBM is designed for text:
 - based solely on word appearance
 - can deal with multiple repetitions of a word
 - faster than NB
  • 52. Weka is written in Java: The UI essentially makes use of a vast underlying data mining and machine learning API. Obviously this fact invites us to use the API directly :)
  • 53. Setting up a project (IntelliJ IDEA):
 - Create a new Java project in IntelliJ
 - Import weka.jar
 - Import weka-src.jar
 - Off you go!
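  • 54. The main classes/packages you need…
 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.util.Random;
 import weka.classifiers.Evaluation;
 import weka.classifiers.trees.J48;
 import weka.core.Instances;
  • 55. Getting stuff done
 // bReader reads an .arff file; "data.arff" is a placeholder path, not from the slides
 BufferedReader bReader = new BufferedReader(new FileReader("data.arff"));
 Instances train = new Instances(bReader);
 // use the last attribute as the class
 train.setClassIndex(train.numAttributes() - 1);
 J48 j48 = new J48();
 j48.buildClassifier(train);
 // 10-fold cross-validation with a fixed random seed
 Evaluation eval = new Evaluation(train);
 eval.crossValidateModel(j48, train, 10, new Random(1));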
  • 56. You can also grab Java code off Weka UI
  • 58. Get in touch Kai Koenig Email: kai@ventego-creative.co.nz www.ventego-creative.co.nz Blog: www.bloginblack.de Twitter: @AgentK