SlideShare a Scribd company logo
1 of 70
Gary M. Weiss, CIS Dept, Fordham University 1
Data Mining
Last updated 9/20/2020
Module:
Data Mining with Weka
Overview of DM with Weka Module
 Weka-1: Weka Basics
 Weka-2: Preprocessing
 Weka-3: DecisionTrees
 Weka-4: Other Classifiers & Feature Selection
 Weka-5: KnowledgeFlow Interface
Gary M. Weiss, CIS Dept, Fordham University 2
Weka-1: Weka Basics
 What isWeka?
 Why useWeka (and why not)
 Weka versions and downloading the software
 Weka ARFF file format
 StartingWeka and the 5Weka Interfaces
Gary M. Weiss, CIS Dept, Fordham University 3
What is Weka?
 Weka is a data mining suite developed at
University ofWaikato
 Weka stands forWaikato Environment for
Knowledge Analysis
 Weka includes everything necessary to generate
and apply data mining models
 Covers all major data mining tasks
 Includes tools to preprocess and visualize data
 Includes multiple (5) interfaces
 We will focus on the explorer interface
 Briefly discuss the knowledgeflow
 Does not require any programming
Gary M. Weiss, CIS Dept, Fordham University 4
WEKA is also a bird
5
Copyright: Martin Kramer (mkramer@wxs.nl)
Gary M. Weiss, CIS Dept, Fordham University
Why Weka?
 There are many options on what data mining
suite or toolkit one can use
 Weka has the following advantages:
 Easy to learn
 Free and open-source
 Easy to download and runs on many platforms
 Does not require programming knowledge
 This is important since a few students may not have
much programming experience
Gary M. Weiss, CIS Dept, Fordham University 6
Weka Disadvantages
 Weka is not as flexible as other tools that are
programming based
 If you are programming anyway you can combine
capabilities more flexibly and write code to do
things not supported byWeka
 Tools based on Python or R have large ecosystems
Gary M. Weiss, CIS Dept, Fordham University 7
DM Tools for This Course
 For your project you may useWeka, Python,
or even another data mining toolkit
 I do provide snippets of Python code and
Weka examples in some lectures
 But everyone should learnWEKA and complete
the tutorial/exercises in this module
 We will not cover the details of how to use
Python in this course
Gary M. Weiss, CIS Dept, Fordham University 8
WEKA: the software
 Data mining software written in Java
 distributed under the GNU Public License
 Used for research, education, and applications
 Main features:
 Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
 Multiple interfaces that do not require programming
 Experimenter can compare learning algorithms
Gary M. Weiss, CIS Dept, Fordham University 9
Weka: Versions
 There are several versions ofWEKA:
 As of Feb 2020 the stable version is 3.8.4 and that is
the one you should be using
 Many of these slides are based on older version
and will look different from what you see
 Dr.Weiss has added notes for significant differences
 Class WEKA page provides relevant info and links
 https://storm.cis.fordham.edu/~gweiss/data-mining/weka.html
 Can also use following link to download software and
documentation
 https://www.cs.waikato.ac.nz/ml/weka/
10
Gary M. Weiss, CIS Dept, Fordham University
Downloading Weka
 Weka runs on all major platforms
 It is free and easy to download
 Download Weka from here:
 https://waikato.github.io/weka-wiki/downloading_weka/
 Use the latest stable version (3.8.4 as of Aug 2020)
 There is plenty of documentation
 https://waikato.github.io/weka-wiki/documentation/
11
Gary M. Weiss, CIS Dept, Fordham University
Weka ARFF File Format
 ARFF= Attribute Relation File Format
 Weka ARFF files include two main parts:
 Specification of the features
 The actual data
 Files are “flat files” usually comma separated
 Other tools use a separate file for each part
 C4.5 decision tree tool, which was once very
popular, had a “names” and “data” file
 I find the single file format odd, since specification
is short but actual data can be millions of lines
Gary M. Weiss, CIS Dept, Fordham University 12
Weka ARFF Example
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
Gary M. Weiss, CIS Dept, Fordham University 13
Categorical feature must list values
This just defines dataset name
Starting Weka
 RunWeka 3.8.4 (with console)
 Different version if a new stable version released
 May want to create a shortcut on your desktop
Gary M. Weiss, CIS Dept, Fordham University 14
This is the
GUI Chooser
Window
The Weka Interfaces
 Explorer:
 We cover in detail
 Enables you to apply multiple actions but does not explicitly
model them
 Experimenter
 Run many experiments in controlled manner and compare results
 KnowledgeFlow:
 Like Explorer but each step is represented as a node in a graph, so
flow explicitly represented
 Workbench
 Combines all GUI interfaces into one
 Simple CLI (command line interface)
 No GUI so can use shell scripts to control experiments
 Similar to using Python where each command is a function
Gary M. Weiss, CIS Dept, Fordham University 15
Weka-1 Review: Weka Basics
 You should know the following:
 WhatWeka is
 The benefits & drawbacks ofWEKA
 How to downloadWEKA
 Ideally you should have installed it
 The ARFF format and understand it
 The 5 WEKA interfaces and what they are used for
Gary M. Weiss, CIS Dept, Fordham University 16
Weka-2: Preprocessing
 How to import data
 Visualizing feature value distribution and
relationships to class values
 Preprocessing filters
 Example using “Discretize” filter
 It is strongly recommended that you follow
along on your own installation ofWeka
 Reproduce each step on your computer
Gary M. Weiss, CIS Dept, Fordham University 17
Start the Explorer
Gary M. Weiss, CIS Dept, Fordham University 18
Click on “Explorer “ Button in the GUI Chooser Window
Importing Data
 Data can be imported from various formats:
 ARFF, CSV, C4.5
 You will most often import data in csv if not already prepared
forWEKA
 Weka steps
 Make sure Explorer “PreProcess” tab is selected
 Click “Open File…” or “Open URL…”
 Set the extension appropriately (default is .arff) and navigate
to the file and select it
 If first line has feature names they are used
 If no feature names then generic names are generated
 Add feature names if missing or models won’t be interpretable
Gary M. Weiss, CIS Dept, Fordham University 19
Reading in the Iris Dataset
 We start with the well-known Iris data set
 Weka probably installed it so search for “iris”.
 Otherwise download it from my copy on internet:
 https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/iris.arff
 Open it using “Open File” or “Open URL”
 Weka immediately shows stats about the features and
how they are distributed
 Other data sets it you want to play around:
 https://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html
 Hundreds of others available from UCI Machine Learning Repository
 https://archive.ics.uci.edu/ml/
20
Gary M. Weiss, CIS Dept, Fordham University
21
Gary M. Weiss, CIS Dept, Fordham University
22
Distribution of sepallength with
colors representing 3 class values
Gary M. Weiss, CIS Dept, Fordham University
Distribution of the 3 classes,
which are evenly distribute
23
Gary M. Weiss, CIS Dept, Fordham University
Gary M. Weiss, CIS Dept, Fordham University 24
25
Visualization of all
feature values with
class information. This
makes it clear that most
features are useful for
distinguishing classes.
Gary M. Weiss, CIS Dept, Fordham University
Preprocessing Filters
 Pre-processing tools inWEKA called “filters”
 Many preprocessing filters available:
 Discretization
 Normalization
 Resampling
 Feature selection
 Feature transformation
 Many more
 Know what available so can use it when needed
 We will focus on Discretization
26
Gary M. Weiss, CIS Dept, Fordham University
Gary M. Weiss, CIS Dept, Fordham University 27
Clicking on this will bring up a list of all of the
filters, organized into a hierarchy. Click on
each folder to expand the list.There are
dozens of choices
Discretization
 Discretization: numerical feature  categorical
 In preprocess tab select “Choose”, expand
“unsupervised” then “attribute”
 Select “Discretize” filter
 Note discretize applies to attributes not instances
 The following few slides are from an earlier version
and may look just a little bit different
 Note: there is a version under the “supervised” folder
but it works differently and has different options
Gary M. Weiss, CIS Dept, Fordham University 28
29
Gary M. Weiss, CIS Dept, Fordham University
30
Gary M. Weiss, CIS Dept, Fordham University
31
Gary M. Weiss, CIS Dept, Fordham University
This shows the command line
with options. Click within this
region to bring up a window
with the options and more info
32
Gary M. Weiss, CIS Dept, Fordham University
1)Toggle to “True”
Click for documentation
2) Click OK
Create 10 equal frequency bins
Will do for all numerical features given “attributeIndices” value
33
Gary M. Weiss, CIS Dept, Fordham University
34
Gary M. Weiss, CIS Dept, Fordham University
Note there are 10 bins
Frequencies not equal but
as close as possible
35
Can use undo to undo the last command
Gary M. Weiss, CIS Dept, Fordham University
Weka-2 Review: Preprocessing
 You should now be able to:
 Import data intoWeka
 Visualize the features and how they relate to class
 Have an idea of the preprocessing facilities (filters)
inWeka and be able to apply them
Gary M. Weiss, CIS Dept, Fordham University 36
Weka-3: Decision Trees
 Weka supports many classification and
regression algorithms
 We will go over a DT example is some detail
 Build a decision tree to classify Iris data set
 Examine the results
 Visualize the decision tree
 Visualize the errors
Gary M. Weiss, CIS Dept, Fordham University 37
Weka-3: Prediction Algorithms
 Weka supports all major classification and
regression methods
 DecisionTrees, Rule learners, Nearest Neighbor,
Naïve Bayes, Neural Networks, etc.
 Also support ensemble classifiers:
 Will learn about these later, but combine classifiers
 We first focus on decision trees
 Let’s start fresh without the discretization
 Undo last step or close window and reload “Iris”
Gary M. Weiss, CIS Dept, Fordham University 38
Gary M. Weiss, CIS Dept, Fordham University 39
Click on ClassifyTab
This is just the first classifier that shows up
Again, we have reverted to slides with old
version that looks a bit different
40
Gary M. Weiss, CIS Dept, Fordham University
Lots of choices, organized
under several categories.We
will use J48 that can be found
under “Trees”.
Note that the version of Weka
we are using has a somewhat
different organization
41
Gary M. Weiss, CIS Dept, Fordham University
Click in this area
42
Gary M. Weiss, CIS Dept, Fordham University
Accept defaults. Just click OK.
43
Gary M. Weiss, CIS Dept, Fordham University
Build and evaluate model using a
percentage split instead of 10-fold cross
validation.The 66% is for the training set
so 34% is for testing.
44
Gary M. Weiss, CIS Dept, Fordham University
Click on “More Options” to
open this window
45
Gary M. Weiss, CIS Dept, Fordham University
Just clickOK
46
Gary M. Weiss, CIS Dept, Fordham University
Click Start
47
Records command line
with options
Textual tree representation
Note: Do not worry if your results are slightly different
Gary M. Weiss, CIS Dept, Fordham University
Scroll down to
see more output
48
Gary M. Weiss, CIS Dept, Fordham University
Right click on model
Window pops up
then select visualize tree
49
Gary M. Weiss, CIS Dept, Fordham University
Gary M. Weiss, CIS Dept, Fordham University 50
Right click then select
visualize errors
51
Gary M. Weiss, CIS Dept, Fordham University
Gary M. Weiss, CIS Dept, Fordham University 52
Set X to petallength andY to petalwidth
Here are the two errors
Virginica classified asVersicolor
Not surprisingly near the decision boundary
If points to close modify jitter
Weka-3 Review: Decision Trees
 Covered an example in detail
 You should be able to build a decision tree
classifier and understand the results
 You should be able to find the necessary
documentation to understand and set the
various parameters
Gary M. Weiss, CIS Dept, Fordham University 53
Weka-4: Other Classifiers and
Feature Selection
 Briefly explore two more classifiers
 Artificial Neural Networks
 Nearest Neighbor
 Demonstrate that withWeka it is easy to run
any classifier and can treat like “Black Box”
 Briefly introduceWEKA feature selection
Gary M. Weiss, CIS Dept, Fordham University 54
Artificial Neural Networks (ANNs)
 ANNs covered later in course
 Inspired by the brain
 Simple processing nodes connected together
 Each node has a weight and learning occurs by
adjusting the weights to get desired output
 ANNs also called Multilayer Perceptrons
 Examples of universal function approximators
Gary M. Weiss, CIS Dept, Fordham University 55
Creating ANN in WEKA for Iris
 While processing Iris data set, choose ANN:
 In Classify tab click “Choose” button
 Navigate to “Classifiers” and then “functions”
 Select “Multilayer Perceptron”
 Make sure the training/test is still set 66% training
 Click “Start” to run ANN (use default options)
Gary M. Weiss, CIS Dept, Fordham University 56
Gary M. Weiss, CIS Dept, Fordham University 57
Good Performance
Only 1 Error
Nearest Neighbor Classification
 Nearest Neighbor classifies example based on
most similar instance(s)
 Relators use this method to price your home
 Also called instance-based learning
 k used to specify number of nearest neighbors
 Common values 1, 3, 5
 Method sometimes referred to as kNN
 Weka uses the IBk algorithm (IB=Instance Based)
 Also called “Lazy Learning”
 No work is done up front to build model
 All work done at classification time
Gary M. Weiss, CIS Dept, Fordham University 58
Creating IBk Model with Weka
 While processing Iris data set, choose IBk:
 In Classify tab, click “Choose” button
 Navigate to “Classifiers” and then “Lazy”
 Select “IBk”
 Keep all of the default options
 Make sure the training/test is still set 66% training
 Click “Start” to run method
 Results on next slide (uses default k=1)
 Results for k=5 on slide after that: try it by modifying
the option
Gary M. Weiss, CIS Dept, Fordham University 59
Gary M. Weiss, CIS Dept, Fordham University 60
Note 2 errors total
for 1NN
Gary M. Weiss, CIS Dept, Fordham University 61
Note k=5
Note 1 error total
for 5NN
Run a Few Classifiers on Own
 Run Random Forest
 A tree ensemble method that is underTrees
 It builds lots of trees using slightly different training
examples and features and then combines them
 We will study them and other ensembles later in course
 The numIterations parameter controls # of trees
 Run Bagging
 An ensemble (meta learning) method, so find it under
classifiersmeta
 Bagging runs base classifier repeatedly with different training data
 Run with default classifier and then change classifier to J48
Gary M. Weiss, CIS Dept, Fordham University 62
Weka Feature Selection
 Explore simple feature selection methods
 UseVote Dataset that has more features
 Go to Preprocess tab and load using url:
 https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/vote.arff
 Feature selection is needed for methods that cannot
handle irrelevant or redundant features
 Experiments:
 Best first forward search
 Pick best feature and then add more, but with
backtracking to explore more of the space
 Info gain with ranking
 Rank the features based on their information gain
Gary M. Weiss, CIS Dept, Fordham University 63
Gary M. Weiss, CIS Dept, Fordham University 64
1) Click on “Select Attributes”
2) Hit start using the defaults
Results window may only have
room for experiment info so
scroll down to see actual results
(displayed next slide)
Best First Forward Search for feature subset
Gary M. Weiss, CIS Dept, Fordham University 65
4 attributes
selected of the 17
Best First Search Subset Selection
Gary M. Weiss, CIS Dept, Fordham University 66
1) Modify both defaults as shown
2) Hit Start
3) Scroll to end of results window
Rank Features by Info Gain
Review Weka-4: Other Classifiers
and Feature Selection
 Built and evaluated neural network and
nearest neighbor models
 You should be able to useWeka to run any
classification model
 You should understand how to run feature
selection usingWeka
 It is okay if you do not understand the details of
each method
 Can look them up in the documentation
Gary M. Weiss, CIS Dept, Fordham University 67
Weka-5: KnowledgeFlow Interface
 Provides a drag and drop interface
 Each step is a node that you drop onto workspace
 You connect nodes to create a process flow
 Advantage: supports multiple process flows
 Can have a path with steps to preprocess data etc.
 Then branch to build different classifiers
 Other DM tools provide similar interfaces
 My early DM tools in industry had this interface
 SAS Enterprise Miner
 Clementine (now IBM SPSS Modeler)
Gary M. Weiss, CIS Dept, Fordham University 68
Video Demonstrating KnowledgeFlow
 You are not required to try KnowledgeFlow
 Watch first 4 minutes of 7-minute video
 It does not show how to use mouse to connect
nodes, which can be a bit tricky
 It conveys the basic ideas and power of interface
 Link to IanWitten’s Youtube video:
 https://youtu.be/sHSgoVX9z-8
 Also listed under myYoutube playlist:
 https://www.youtube.com/playlist?list=PLg2TXNAfR8_ukJY1fEPDMuXV43ZrBBpUh
 Select “More Data Mining withWeka (1.4 ..”
Gary M. Weiss, CIS Dept, Fordham University 69
Review of DM with Weka Module
 You should have installed Weka
 Understand the different interfaces
 Using Explorer interface, be able to:
 Load any data set
 Visualize the features
 Build any classifier and know how to learn about
parameters and change them
 Examine and interpret the results
 Know how to perform feature selection
 Understand basics of the knowledgeFlow
Gary M. Weiss, CIS Dept, Fordham University 70

More Related Content

Similar to Introduction to Weka- beginner tutorial.pptx

Client Side Exploits Using Pdf
Client Side Exploits Using PdfClient Side Exploits Using Pdf
Client Side Exploits Using Pdf
titanlambda
 
ARTAK_SAMUEL_HAKOBYAN_RESUME
ARTAK_SAMUEL_HAKOBYAN_RESUMEARTAK_SAMUEL_HAKOBYAN_RESUME
ARTAK_SAMUEL_HAKOBYAN_RESUME
Artak Hakobyan
 
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
evonnehoggarth79783
 
Windows installerbasics
Windows installerbasicsWindows installerbasics
Windows installerbasics
seethalux
 

Similar to Introduction to Weka- beginner tutorial.pptx (20)

Client Side Exploits Using Pdf
Client Side Exploits Using PdfClient Side Exploits Using Pdf
Client Side Exploits Using Pdf
 
Windows Kernel & Driver Development
Windows Kernel & Driver DevelopmentWindows Kernel & Driver Development
Windows Kernel & Driver Development
 
01 spring-intro
01 spring-intro01 spring-intro
01 spring-intro
 
ARTAK_SAMUEL_HAKOBYAN_RESUME
ARTAK_SAMUEL_HAKOBYAN_RESUMEARTAK_SAMUEL_HAKOBYAN_RESUME
ARTAK_SAMUEL_HAKOBYAN_RESUME
 
1 Introduction to SPSS.ppt
1 Introduction to SPSS.ppt1 Introduction to SPSS.ppt
1 Introduction to SPSS.ppt
 
Design the implementation of NMEA Get GPS Data from Record
Design the implementation of NMEA Get GPS Data from RecordDesign the implementation of NMEA Get GPS Data from Record
Design the implementation of NMEA Get GPS Data from Record
 
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
 
Client Side Exploits using PDF
Client Side Exploits using PDFClient Side Exploits using PDF
Client Side Exploits using PDF
 
nullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexitiesnullcon 2011 - Fuzzing with Complexities
nullcon 2011 - Fuzzing with Complexities
 
德國 Maxqda 12 質性分析軟體 入門及完整參考手冊
德國 Maxqda 12 質性分析軟體 入門及完整參考手冊德國 Maxqda 12 質性分析軟體 入門及完整參考手冊
德國 Maxqda 12 質性分析軟體 入門及完整參考手冊
 
Wek1
Wek1Wek1
Wek1
 
Programmability in spss 14
Programmability in spss 14Programmability in spss 14
Programmability in spss 14
 
Sector Cloudcom Tutorial
Sector Cloudcom TutorialSector Cloudcom Tutorial
Sector Cloudcom Tutorial
 
Windows installerbasics
Windows installerbasicsWindows installerbasics
Windows installerbasics
 
Windows installerbasics
Windows installerbasicsWindows installerbasics
Windows installerbasics
 
Exploiting and analyzing Microsoft Surface Applications
Exploiting and analyzing Microsoft Surface ApplicationsExploiting and analyzing Microsoft Surface Applications
Exploiting and analyzing Microsoft Surface Applications
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programs
 
Research software and Dataverse
Research software and DataverseResearch software and Dataverse
Research software and Dataverse
 
PowerApps
PowerAppsPowerApps
PowerApps
 
Programming Without Coding Technology (PWCT) Environment
Programming Without Coding Technology (PWCT) EnvironmentProgramming Without Coding Technology (PWCT) Environment
Programming Without Coding Technology (PWCT) Environment
 

Recently uploaded

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
Tonystark477637
 

Recently uploaded (20)

CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 

Introduction to Weka- beginner tutorial.pptx

  • 1. Gary M. Weiss, CIS Dept, Fordham University 1 Data Mining Last updated 9/20/2020 Module: Data Mining with Weka
  • 2. Overview of DM with Weka Module  Weka-1: Weka Basics  Weka-2: Preprocessing  Weka-3: DecisionTrees  Weka-4: Other Classifiers & Feature Selection  Weka-5: KnowledgeFlow Interface Gary M. Weiss, CIS Dept, Fordham University 2
  • 3. Weka-1: Weka Basics  What isWeka?  Why useWeka (and why not)  Weka versions and downloading the software  Weka ARFF file format  StartingWeka and the 5Weka Interfaces Gary M. Weiss, CIS Dept, Fordham University 3
  • 4. What is Weka?  Weka is a data mining suite developed at University ofWaikato  Weka stands forWaikato Environment for Knowledge Analysis  Weka includes everything necessary to generate and apply data mining models  Covers all major data mining tasks  Includes tools to preprocess and visualize data  Includes multiple (5) interfaces  We will focus on the explorer interface  Briefly discuss the knowledgeflow  Does not require any programming Gary M. Weiss, CIS Dept, Fordham University 4
  • 5. WEKA is also a bird 5 Copyright: Martin Kramer (mkramer@wxs.nl) Gary M. Weiss, CIS Dept, Fordham University
  • 6. Why Weka?  There are many options on what data mining suite or toolkit one can use  Weka has the following advantages:  Easy to learn  Free and open-source  Easy to download and runs on many platforms  Does not require programming knowledge  This is important since a few students may not have much programming experience Gary M. Weiss, CIS Dept, Fordham University 6
  • 7. Weka Disadvantages  Weka is not as flexible as other tools that are programming based  If you are programming anyway you can combine capabilities more flexibly and write code to do things not supported byWeka  Tools based on Python or R have large ecosystems Gary M. Weiss, CIS Dept, Fordham University 7
  • 8. DM Tools for This Course  For your project you may useWeka, Python, or even another data mining toolkit  I do provide snippets of Python code and Weka examples in some lectures  But everyone should learnWEKA and complete the tutorial/exercises in this module  We will not cover the details of how to use Python in this course Gary M. Weiss, CIS Dept, Fordham University 8
  • 9. WEKA: the software  Data mining software written in Java  distributed under the GNU Public License  Used for research, education, and applications  Main features:  Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods  Multiple interfaces that do not require programming  Experimenter can compare learning algorithms Gary M. Weiss, CIS Dept, Fordham University 9
  • 10. Weka: Versions  There are several versions ofWEKA:  As of Feb 2020 the stable version is 3.8.4 and that is the one you should be using  Many of these slides are based on older version and will look different from what you see  Dr.Weiss has added notes for significant differences  Class WEKA page provides relevant info and links  https://storm.cis.fordham.edu/~gweiss/data-mining/weka.html  Can also use following link to download software and documentation  https://www.cs.waikato.ac.nz/ml/weka/ 10 Gary M. Weiss, CIS Dept, Fordham University
  • 11. Downloading Weka  Weka runs on all major platforms  It is free and easy to download  Download Weka from here:  https://waikato.github.io/weka-wiki/downloading_weka/  Use the latest stable version (3.8.4 as of Aug 2020)  There is plenty of documentation  https://waikato.github.io/weka-wiki/documentation/ 11 Gary M. Weiss, CIS Dept, Fordham University
  • 12. Weka ARFF File Format  ARFF= Attribute Relation File Format  Weka ARFF files include two main parts:  Specification of the features  The actual data  Files are “flat files” usually comma separated  Other tools use a separate file for each part  C4.5 decision tree tool, which was once very popular, had a “names” and “data” file  I find the single file format odd, since specification is short but actual data can be millions of lines Gary M. Weiss, CIS Dept, Fordham University 12
  • 13. Weka ARFF Example @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ... Gary M. Weiss, CIS Dept, Fordham University 13 Categorical feature must list values This just defines dataset name
  • 14. Starting Weka  RunWeka 3.8.4 (with console)  Different version if a new stable version released  May want to create a shortcut on your desktop Gary M. Weiss, CIS Dept, Fordham University 14 This is the GUI Chooser Window
  • 15. The Weka Interfaces  Explorer:  We cover in detail  Enables you to apply multiple actions but does not explicitly model them  Experimenter  Run many experiments in controlled manner and compare results  KnowledgeFlow:  Like Explorer but each step is represented as a node in a graph, so flow explicitly represented  Workbench  Combines all GUI interfaces into one  Simple CLI (command line interface)  No GUI so can use shell scripts to control experiments  Similar to using Python where each command is a function Gary M. Weiss, CIS Dept, Fordham University 15
  • 16. Weka-1 Review: Weka Basics  You should know the following:  WhatWeka is  The benefits & drawbacks ofWEKA  How to downloadWEKA  Ideally you should have installed it  The ARFF format and understand it  The 5 WEKA interfaces and what they are used for Gary M. Weiss, CIS Dept, Fordham University 16
  • 17. Weka-2: Preprocessing  How to import data  Visualizing feature value distribution and relationships to class values  Preprocessing filters  Example using “Discretize” filter  It is strongly recommended that you follow along on your own installation ofWeka  Reproduce each step on your computer Gary M. Weiss, CIS Dept, Fordham University 17
  • 18. Start the Explorer Gary M. Weiss, CIS Dept, Fordham University 18 Click on “Explorer “ Button in the GUI Chooser Window
  • 19. Importing Data  Data can be imported from various formats:  ARFF, CSV, C4.5  You will most often import data in csv if not already prepared forWEKA  Weka steps  Make sure Explorer “PreProcess” tab is selected  Click “Open File…” or “Open URL…”  Set the extension appropriately (default is .arff) and navigate to the file and select it  If first line has feature names they are used  If no feature names then generic names are generated  Add feature names if missing or models won’t be interpretable Gary M. Weiss, CIS Dept, Fordham University 19
  • 20. Reading in the Iris Dataset  We start with the well-known Iris data set  Weka probably installed it so search for “iris”.  Otherwise download it from my copy on internet:  https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/iris.arff  Open it using “Open File” or “Open URL”  Weka immediately shows stats about the features and how they are distributed  Other data sets it you want to play around:  https://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html  Hundreds of others available from UCI Machine Learning Repository  https://archive.ics.uci.edu/ml/ 20 Gary M. Weiss, CIS Dept, Fordham University
  • 21. 21 Gary M. Weiss, CIS Dept, Fordham University
  • 22. 22 Distribution of sepallength with colors representing 3 class values Gary M. Weiss, CIS Dept, Fordham University
  • 23. Distribution of the 3 classes, which are evenly distribute 23 Gary M. Weiss, CIS Dept, Fordham University
  • 24. Gary M. Weiss, CIS Dept, Fordham University 24
  • 25. 25 Visualization of all feature values with class information. This makes it clear that most features are useful for distinguishing classes. Gary M. Weiss, CIS Dept, Fordham University
  • 26. Preprocessing Filters  Pre-processing tools inWEKA called “filters”  Many preprocessing filters available:  Discretization  Normalization  Resampling  Feature selection  Feature transformation  Many more  Know what available so can use it when needed  We will focus on Discretization 26 Gary M. Weiss, CIS Dept, Fordham University
  • 27. Gary M. Weiss, CIS Dept, Fordham University 27 Clicking on this will bring up a list of all of the filters, organized into a hierarchy. Click on each folder to expand the list.There are dozens of choices
  • 28. Discretization  Discretization: numerical feature  categorical  In preprocess tab select “Choose”, expand “unsupervised” then “attribute”  Select “Discretize” filter  Note discretize applies to attributes not instances  The following few slides are from an earlier version and may look just a little bit different  Note: there is a version under the “supervised” folder but it works differently and has different options Gary M. Weiss, CIS Dept, Fordham University 28
  • 29. 29 Gary M. Weiss, CIS Dept, Fordham University
  • 30. 30 Gary M. Weiss, CIS Dept, Fordham University
  • 31. 31 Gary M. Weiss, CIS Dept, Fordham University
  • 32. This shows the command line with options. Click within this region to bring up a window with the options and more info 32 Gary M. Weiss, CIS Dept, Fordham University
  • 33. 1)Toggle to “True” Click for documentation 2) Click OK Create 10 equal frequency bins Will do for all numerical features given “attributeIndices” value 33 Gary M. Weiss, CIS Dept, Fordham University
  • 34. 34 Gary M. Weiss, CIS Dept, Fordham University
  • 35. Note there are 10 bins Frequencies not equal but as close as possible 35 Can use undo to undo the last command Gary M. Weiss, CIS Dept, Fordham University
  • 36. Weka-2 Review: Preprocessing  You should now be able to:  Import data intoWeka  Visualize the features and how they relate to class  Have an idea of the preprocessing facilities (filters) inWeka and be able to apply them Gary M. Weiss, CIS Dept, Fordham University 36
  • 37. Weka-3: Decision Trees  Weka supports many classification and regression algorithms  We will go over a DT example is some detail  Build a decision tree to classify Iris data set  Examine the results  Visualize the decision tree  Visualize the errors Gary M. Weiss, CIS Dept, Fordham University 37
  • 38. Weka-3: Prediction Algorithms  Weka supports all major classification and regression methods  DecisionTrees, Rule learners, Nearest Neighbor, Naïve Bayes, Neural Networks, etc.  Also support ensemble classifiers:  Will learn about these later, but combine classifiers  We first focus on decision trees  Let’s start fresh without the discretization  Undo last step or close window and reload “Iris” Gary M. Weiss, CIS Dept, Fordham University 38
  • 39. Gary M. Weiss, CIS Dept, Fordham University 39
  • 40. Click on ClassifyTab This is just the first classifier that shows up Again, we have reverted to slides with old version that looks a bit different 40 Gary M. Weiss, CIS Dept, Fordham University
  • 41. Lots of choices, organized under several categories.We will use J48 that can be found under “Trees”. Note that the version of Weka we are using has a somewhat different organization 41 Gary M. Weiss, CIS Dept, Fordham University
  • 42. Click in this area 42 Gary M. Weiss, CIS Dept, Fordham University
  • 43. Accept defaults. Just click OK. 43 Gary M. Weiss, CIS Dept, Fordham University
  • 44. Build and evaluate model using a percentage split instead of 10-fold cross validation.The 66% is for the training set so 34% is for testing. 44 Gary M. Weiss, CIS Dept, Fordham University
  • 45. Click on “More Options” to open this window 45 Gary M. Weiss, CIS Dept, Fordham University
  • 46. Just clickOK 46 Gary M. Weiss, CIS Dept, Fordham University
  • 47. Click Start 47 Records command line with options Textual tree representation Note: Do not worry if your results are slightly different Gary M. Weiss, CIS Dept, Fordham University
  • 48. Scroll down to see more output 48 Gary M. Weiss, CIS Dept, Fordham University
  • 49. Right click on model Window pops up then select visualize tree 49 Gary M. Weiss, CIS Dept, Fordham University
  • 50. Gary M. Weiss, CIS Dept, Fordham University 50
  • 51. Right click then select visualize errors 51 Gary M. Weiss, CIS Dept, Fordham University
  • 52. Gary M. Weiss, CIS Dept, Fordham University 52 Set X to petallength andY to petalwidth Here are the two errors Virginica classified asVersicolor Not surprisingly near the decision boundary If points to close modify jitter
  • 53. Weka-3 Review: Decision Trees  Covered an example in detail  You should be able to build a decision tree classifier and understand the results  You should be able to find the necessary documentation to understand and set the various parameters Gary M. Weiss, CIS Dept, Fordham University 53
  • 54. Weka-4: Other Classifiers and Feature Selection  Briefly explore two more classifiers  Artificial Neural Networks  Nearest Neighbor  Demonstrate that withWeka it is easy to run any classifier and can treat like “Black Box”  Briefly introduceWEKA feature selection Gary M. Weiss, CIS Dept, Fordham University 54
  • 55. Artificial Neural Networks (ANNs)  ANNs covered later in course  Inspired by the brain  Simple processing nodes connected together  Each node has a weight and learning occurs by adjusting the weights to get desired output  ANNs also called Multilayer Perceptrons  Examples of universal function approximators Gary M. Weiss, CIS Dept, Fordham University 55
  • 56. Creating ANN in WEKA for Iris  While processing Iris data set, choose ANN:  In Classify tab click “Choose” button  Navigate to “Classifiers” and then “functions”  Select “Multilayer Perceptron”  Make sure the training/test is still set 66% training  Click “Start” to run ANN (use default options) Gary M. Weiss, CIS Dept, Fordham University 56
  • 57. Gary M. Weiss, CIS Dept, Fordham University 57 Good Performance Only 1 Error
  • 58. Nearest Neighbor Classification  Nearest Neighbor classifies example based on most similar instance(s)  Relators use this method to price your home  Also called instance-based learning  k used to specify number of nearest neighbors  Common values 1, 3, 5  Method sometimes referred to as kNN  Weka uses the IBk algorithm (IB=Instance Based)  Also called “Lazy Learning”  No work is done up front to build model  All work done at classification time Gary M. Weiss, CIS Dept, Fordham University 58
  • 59. Creating IBk Model with Weka  While processing Iris data set, choose IBk:  In Classify tab, click “Choose” button  Navigate to “Classifiers” and then “Lazy”  Select “IBk”  Keep all of the default options  Make sure the training/test is still set 66% training  Click “Start” to run method  Results on next slide (uses default k=1)  Results for k=5 on slide after that: try it by modifying the option Gary M. Weiss, CIS Dept, Fordham University 59
  • 60. Gary M. Weiss, CIS Dept, Fordham University 60 Note 2 errors total for 1NN
  • 61. Gary M. Weiss, CIS Dept, Fordham University 61 Note k=5 Note 1 error total for 5NN
  • 62. Run a Few Classifiers on Own  Run Random Forest  A tree ensemble method that is underTrees  It builds lots of trees using slightly different training examples and features and then combines them  We will study them and other ensembles later in course  The numIterations parameter controls # of trees  Run Bagging  An ensemble (meta learning) method, so find it under classifiersmeta  Bagging runs base classifier repeatedly with different training data  Run with default classifier and then change classifier to J48 Gary M. Weiss, CIS Dept, Fordham University 62
  • 63. Weka Feature Selection  Explore simple feature selection methods  UseVote Dataset that has more features  Go to Preprocess tab and load using url:  https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/vote.arff  Feature selection is needed for methods that cannot handle irrelevant or redundant features  Experiments:  Best first forward search  Pick best feature and then add more, but with backtracking to explore more of the space  Info gain with ranking  Rank the features based on their information gain Gary M. Weiss, CIS Dept, Fordham University 63
  • 64. Gary M. Weiss, CIS Dept, Fordham University 64 1) Click on “Select Attributes” 2) Hit start using the defaults Results window may only have room for experiment info so scroll down to see actual results (displayed next slide) Best First Forward Search for feature subset
  • 65. Gary M. Weiss, CIS Dept, Fordham University 65 4 attributes selected of the 17 Best First Search Subset Selection
  • 66. Gary M. Weiss, CIS Dept, Fordham University 66 1) Modify both defaults as shown 2) Hit Start 3) Scroll to end of results window Rank Features by Info Gain
  • 67. Review Weka-4: Other Classifiers and Feature Selection  Built and evaluated neural network and nearest neighbor models  You should be able to useWeka to run any classification model  You should understand how to run feature selection usingWeka  It is okay if you do not understand the details of each method  Can look them up in the documentation Gary M. Weiss, CIS Dept, Fordham University 67
  • 68. Weka-5: KnowledgeFlow Interface  Provides a drag and drop interface  Each step is a node that you drop onto workspace  You connect nodes to create a process flow  Advantage: supports multiple process flows  Can have a path with steps to preprocess data etc.  Then branch to build different classifiers  Other DM tools provide similar interfaces  My early DM tools in industry had this interface  SAS Enterprise Miner  Clementine (now IBM SPSS Modeler) Gary M. Weiss, CIS Dept, Fordham University 68
  • 69. Video Demonstrating KnowledgeFlow  You are not required to try KnowledgeFlow  Watch first 4 minutes of 7-minute video  It does not show how to use mouse to connect nodes, which can be a bit tricky  It conveys the basic ideas and power of interface  Link to IanWitten’s Youtube video:  https://youtu.be/sHSgoVX9z-8  Also listed under myYoutube playlist:  https://www.youtube.com/playlist?list=PLg2TXNAfR8_ukJY1fEPDMuXV43ZrBBpUh  Select “More Data Mining withWeka (1.4 ..” Gary M. Weiss, CIS Dept, Fordham University 69
  • 70. Review of DM with Weka Module  You should have installed Weka  Understand the different interfaces  Using Explorer interface, be able to:  Load any data set  Visualize the features  Build any classifier and know how to learn about parameters and change them  Examine and interpret the results  Know how to perform feature selection  Understand basics of the knowledgeFlow Gary M. Weiss, CIS Dept, Fordham University 70