Brief Weka Introduction
Shuang Wu
Guided by Dr. Thanh Tran
Weka
• The software: Waikato Environment for
Knowledge Analysis
– Machine learning/data mining software written in
Java (distributed under the GNU General Public License)
• The bird: a flightless bird endemic to New Zealand
Outline
• ARFF format and loading files to Weka
• Basic preprocessing and classifier demo
• Attribute selection & Demo
• Filtering datasets & Demo
ARFF format and loading files to Weka
Attribute-Relation File Format (ARFF)
• Two distinct sections
– Header & Data
• Four data types supported
– numeric
– <nominal-specification>
– string
– date [<date-format>]
• E.g.: DATE "yyyy-MM-dd HH:mm:ss"
(http://www.cs.waikato.ac.nz/ml/weka/arff.html)
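To make the two sections and the four attribute types concrete, here is a minimal sketch of an ARFF file and a naive reader for its header. The relation name, attribute names, and data values are invented for illustration, and the parser is not Weka's loader: it ignores quoted attribute names, sparse data, and other parts of the format.

```python
# A hypothetical ARFF file: header (relation + attributes) followed by data.
ARFF_TEXT = """\
% Illustrative file only; names and values are made up.
@relation sensor_log

@attribute reading numeric
@attribute status {ok, warn, fail}
@attribute note string
@attribute taken date "yyyy-MM-dd HH:mm:ss"

@data
21.5, ok, 'first sample', "2014-04-05 10:00:00"
19.0, warn, 'second sample', "2014-04-05 10:05:00"
"""


def split_header_and_data(text):
    """Split ARFF text into header lines and data lines, skipping comments."""
    header, data, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue
        if line.lower() == "@data":
            in_data = True
        elif in_data:
            data.append(line)
        else:
            header.append(line)
    return header, data


def attribute_types(header):
    """Map each attribute name to numeric, nominal, string, or date."""
    types = {}
    for line in header:
        if line.lower().startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            # A nominal specification is a brace-enclosed list of values.
            kind = "nominal" if spec.startswith("{") else spec.split()[0].lower()
            types[name] = kind
    return types


header, data = split_header_and_data(ARFF_TEXT)
print(attribute_types(header))
print(len(data), "data rows")
```

Each `@attribute` line in the header declares one column, and every row after `@data` supplies values in the same order.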
Converting Files to ARFF
• Weka has converters for the following file
formats:
– Spreadsheet files with extension .csv.
– C4.5’s native file format with extensions .names
and .data.
– Serialized instances with extension .bsi.
– LIBSVM format files with extension .libsvm.
– SVM-Light format files with extension .dat.
– XML-based ARFF format files with extension .xrff.
(Witten, Frank & Hall, 2011)
Basic preprocessing and classifier demo
More information can be found here (link not preserved in this transcript).
Attribute selection
Why Feature Selection
• Not all the features contained in the datasets
of a classification problem are useful
• Redundant or irrelevant features may even
reduce the classification performance
• Eliminating noisy and unnecessary features can
– Improve classification performance
– Make learning and executing processes faster
– Simplify the structure of the learned models
Feature Selection
• Two categories of feature selection
– Wrapper approaches:
• Conduct a search for the best feature subset using the learning
algorithm itself as part of the evaluation function
• A feature selection algorithm exists as a wrapper around a learning
algorithm
– Filter approaches:
• Independent of a learning algorithm
• Argued to be computationally less expensive and more general
• By considering the performance of the selected feature
subset on a particular learning algorithm, wrappers can
usually achieve better results than filter approaches
Wrapper vs. Filter
(Kohavi & John, 1997)
Filter: one example
• One algorithm that falls into the filter approach: the
FOCUS algorithm
– Exhaustively examines all subsets of features, selecting the
minimal subset of features that is sufficient to determine
the label value for all instances in the training set.
– May introduce the MIN-FEATURES bias.
– For example, in a medical diagnosis task, a set of features
describing a patient might include the patient’s social
security number (SSN). When FOCUS searches for the
minimum set of features, it will pick the SSN as the only
feature needed to uniquely determine the label. Given
only the SSN, any induction algorithm is expected to
generalize very poorly.
(Kohavi & John, 1997)
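The SSN problem can be reproduced with a small sketch of a FOCUS-style search. This is a simplified reading of the description above, not the published algorithm's exact implementation; the patient records, column order, and function names are hypothetical.

```python
from itertools import combinations


def is_consistent(rows, labels, subset):
    """True if the chosen feature columns determine the label for every row."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True


def focus(rows, labels):
    """Return the smallest feature subset that is consistent with the labels.

    Subsets are tried in order of increasing size, so the first hit is minimal.
    Returns None if even the full feature set cannot separate the labels.
    """
    n_features = len(rows[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(rows, labels, subset):
                return subset
    return None


# Hypothetical patients: (id number, fever, cough); sick only if both symptoms.
ROWS = [
    (101, "yes", "yes"),
    (102, "yes", "no"),
    (103, "no", "yes"),
    (104, "no", "no"),
]
LABELS = ["sick", "well", "well", "well"]

print(focus(ROWS, LABELS))                        # the id column alone suffices
print(focus([r[1:] for r in ROWS], LABELS))       # without it: fever and cough
```

Because the identifier is unique per patient, it trivially "determines" every label in the training set, so the minimal-subset criterion prefers it over the two features that actually carry the signal.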
Searching Attribute Space
• The size of the search space for n features is 2^n, so it is
impractical to search the whole space exhaustively in
most situations
• Single Feature Ranking
– A relaxed version of feature selection that only requires
the computation of the relative importance of the features
and subsequently sorting them
– Computationally cheap, but the combination of the top-
ranked features may be a redundant subset
• Feature Subset Ranking, using search strategies such as
– Greedy algorithms
– Genetic algorithms (GA)
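The gap between exhaustive and greedy search can be sketched as follows. The scoring function is a made-up stand-in for whatever merit measure a real evaluator would compute: features 0 and 1 are assumed to be redundant copies of each other, feature 2 adds new information, feature 3 adds none, and every feature costs a small penalty.

```python
def subset_count(n_features):
    """Number of candidate subsets: each feature is either in or out."""
    return 2 ** n_features


def greedy_forward_selection(n_features, score):
    """Add one feature at a time, keeping the addition that helps most.

    Stops as soon as no single addition improves the score.
    """
    selected = []
    best = score(frozenset())
    while True:
        remaining = [f for f in range(n_features) if f not in selected]
        scored = [(score(frozenset(selected + [f])), f) for f in remaining]
        if not scored:
            break
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break
        selected.append(top_feature)
        best = top_score
    return sorted(selected), best


# Hypothetical merit: information groups (integer points) minus a size penalty.
GROUP = {0: "a", 1: "a", 2: "b"}  # feature 3 belongs to no group
VALUE = {"a": 40, "b": 30}


def toy_score(subset):
    groups = {GROUP[f] for f in subset if f in GROUP}
    return sum(VALUE[g] for g in groups) - 5 * len(subset)


print(subset_count(20))                        # subsets for only 20 features
print(greedy_forward_selection(4, toy_score))  # picks features 0 and 2
```

Greedy search evaluates on the order of n^2 subsets instead of 2^n, at the price of possibly missing the best subset. In this toy case it also skips the redundant feature 1, which ranking features one by one would not do.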
WEKA Attribute Selection Function
• Two ways to do attribute selection:
– Normally done by searching the space of attribute
subsets, evaluating each one (Feature Subset Ranking)
• By combining one attribute subset evaluator and one search
method
– A potentially faster but less accurate approach is to
evaluate the attributes individually and sort them,
discarding attributes that fall below a chosen cutoff
point (Single Feature Ranking)
• By using one single-attribute evaluator and the ranking
method
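The second approach, scoring attributes one at a time and discarding those below a cutoff, can be sketched as below. Information gain is used here as one possible single-attribute score; the data, attribute names, and cutoff value are hypothetical, and the code is a plain illustration rather than Weka's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(column, labels):
    """Reduction in class entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        part = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(part) / total * entropy(part)
    return entropy(labels) - remainder


def rank_attributes(columns, labels, cutoff):
    """Score each attribute on its own, sort, and drop those below the cutoff."""
    scored = sorted(
        ((info_gain(col, labels), name) for name, col in columns.items()),
        reverse=True,
    )
    return [(name, gain) for gain, name in scored if gain >= cutoff]


# Hypothetical data: "a" predicts the class perfectly, "b" is uninformative.
LABELS = ["yes", "yes", "no", "no"]
COLUMNS = {
    "a": ["x", "x", "y", "y"],
    "b": ["p", "q", "p", "q"],
}

print(rank_attributes(COLUMNS, LABELS, cutoff=0.5))
```

Each attribute is scored independently of the others, which is why this approach is cheap and also why it cannot detect that two highly ranked attributes are redundant.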
Two Wrapper Methods in Weka
• ClassifierSubsetEval
– Uses a classifier, specified in the object editor as a
parameter, to evaluate sets of attributes on the
training data or on a separate holdout set.
• WrapperSubsetEval
– Also uses a classifier to evaluate attribute sets, but
employs cross-validation to estimate the accuracy
of the learning scheme for each set
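The idea of scoring an attribute set by the cross-validated accuracy of a classifier can be sketched as follows. This is an illustration of the wrapper principle only: it assumes a 1-nearest-neighbour classifier and leave-one-out cross-validation for brevity, and the data and function names are hypothetical. It does not reproduce either Weka evaluator.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of 1-nearest-neighbour using only `subset` columns."""
    correct = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        best_j, best_dist = None, None
        for j, other in enumerate(rows):
            if j == i:
                continue  # the held-out instance may not vote for itself
            dist = sum((row[k] - other[k]) ** 2 for k in subset)
            if best_dist is None or dist < best_dist:
                best_j, best_dist = j, dist
        correct += labels[best_j] == label
    return correct / len(rows)


# Hypothetical data: column 0 separates the classes, column 1 is large-scale noise.
ROWS = [
    (0.0, 5.0),
    (0.1, 90.0),
    (1.0, 6.0),
    (1.1, 91.0),
]
LABELS = ["a", "a", "b", "b"]

CANDIDATES = [(0,), (1,), (0, 1)]
for subset in CANDIDATES:
    print(subset, loo_accuracy(ROWS, LABELS, subset))

# The wrapper keeps whichever subset the classifier itself handles best.
best_subset = max(CANDIDATES, key=lambda s: loo_accuracy(ROWS, LABELS, s))
print("selected:", best_subset)
```

Because every candidate subset requires training and testing the classifier several times, this style of evaluation is more expensive than a filter score, which matches the trade-off described earlier.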
Attribute Subset Evaluators
(Witten, Frank & Hall, 2011)
[Table not reproduced; one evaluator is marked for use in the demo.]
Search Methods
(Witten, Frank & Hall, 2011)
[Table not reproduced; one search method is marked for use in the demo.]
Single-Attribute Evaluators
(Witten, Frank & Hall, 2011)
Ranking Method
(Witten, Frank & Hall, 2011)
Attribute selection Demo
Filtering datasets
Filtering Algorithms
• There are two kinds of filters
– Supervised: take advantage of the class
information. A class must be assigned. The default
behavior uses the last attribute as the class.
– Unsupervised: the class is not taken into
consideration.
• Both unsupervised and supervised filters have
– Attribute filters, which work on the attributes in the
datasets, and
– Instance filters, which work on the instances
Unsupervised Attribute Filters
• Operations include
– Adding and Removing Attributes
– Changing Values
– Converting attributes from one form to another
– Converting multi-instance data into single-
instance format
– Working with time series data
– Randomizing
(Witten, Frank & Hall, 2011)
[Filter tables not reproduced; one filter is marked for use in the demo.]
Unsupervised Instance Filters
(Witten, Frank & Hall, 2011)
[Table not reproduced; one filter is marked for use in the demo.]
Supervised Attribute and Instance Filters
(Witten, Frank & Hall, 2011)
Filtering datasets Demo
Note that the data type of
the attribute “temperature”
is numeric.
First, let’s filter the attributes.
Set “attributeIndices” to 2
(the “temperature” attribute)
and “bins” to 5 (which
discretizes the attribute's values
into 5 bins).
Note the discretization result.
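The discretization step can be approximated outside Weka with equal-width binning, which is assumed here to be what the filter in the demo does with its default settings. The temperature values below are assumed to be those of Weka's bundled weather sample data; the slides do not list them. The label format only imitates the style of the labels shown in the demo.

```python
from bisect import bisect_left


def equal_width_cuts(values, bins):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_labels(cuts):
    """Interval labels, open on the left and closed on the right."""
    edges = [f"{c:.1f}" for c in cuts]
    labels = [f"(-inf-{edges[0]}]"]
    labels += [f"({a}-{b}]" for a, b in zip(edges, edges[1:])]
    labels.append(f"({edges[-1]}-inf)")
    return labels


def discretize(values, bins):
    """Replace each numeric value with the label of the bin it falls into."""
    cuts = equal_width_cuts(values, bins)
    labels = bin_labels(cuts)
    # bisect_left sends a value equal to a cut point into the lower bin.
    return [labels[bisect_left(cuts, v)] for v in values]


# Assumed temperature column (14 instances); not taken from the slides.
TEMPERATURE = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

nominal = discretize(TEMPERATURE, bins=5)
for label in bin_labels(equal_width_cuts(TEMPERATURE, 5)):
    print(label, nominal.count(label))
```

With these assumed values the range 64 to 85 is split into bins of width 4.2, so the first cut point falls at 68.2.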
We can also filter the instances.
Note here that
there are 3
instances that have the
label (-inf-68.2].
Set “attributeIndex” to 2 (the
“temperature” attribute) and
“nominalIndices” to 1 (which
removes all the instances
with the label (-inf-68.2]).
All the instances labeled
as (-inf-68.2] have been
removed.
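The instance-removal step can be sketched in the same spirit. The function below mirrors the two parameters used in the demo, both treated as 1-based as in the demo settings; the instances themselves are hypothetical and the code is an illustration, not Weka's filter.

```python
def remove_with_values(instances, attribute_index, nominal_indices, value_order):
    """Drop instances whose nominal value at `attribute_index` is selected.

    attribute_index and nominal_indices are 1-based, matching the demo settings;
    value_order lists the attribute's nominal labels in declaration order.
    """
    column = attribute_index - 1
    unwanted = {value_order[i - 1] for i in nominal_indices}
    return [inst for inst in instances if inst[column] not in unwanted]


# Hypothetical instances: (outlook, discretized temperature, play).
TEMPERATURE_LABELS = ["(-inf-68.2]", "(68.2-72.4]", "(72.4-76.6]",
                      "(76.6-80.8]", "(80.8-inf)"]
INSTANCES = [
    ("sunny", "(80.8-inf)", "no"),
    ("rainy", "(-inf-68.2]", "yes"),
    ("overcast", "(-inf-68.2]", "yes"),
    ("rainy", "(68.2-72.4]", "no"),
]

kept = remove_with_values(INSTANCES, attribute_index=2,
                          nominal_indices=[1], value_order=TEMPERATURE_LABELS)
print(len(INSTANCES), "->", len(kept))
```

Selecting the label by its position (1) rather than by its text is why the order of the nominal values matters.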
Then, when you run the
classification, it will be based
on the filtered dataset, as
shown here.
Resources
• Weka official website:
http://www.cs.waikato.ac.nz/ml/weka/
• Two Weka tutorials on YouTube:
– https://www.youtube.com/user/WekaMOOC
– https://www.youtube.com/user/rushdishams/videos
• Book: Data Mining:
Practical Machine Learning Tools and Techniques.
Please refer to
http://www.cs.waikato.ac.nz/ml/weka/book.html
for more details.
References
• Frank, E. Machine Learning with WEKA. Retrieved April 05, 2014,
from http://www.cs.waikato.ac.nz/ml/weka/documentation.html
• Kohavi, R., & John, G. H. (1997). Wrappers for feature subset
selection. Artificial Intelligence, 97(1-2), 273–324.
• Reservoir sampling. Retrieved April 05, 2014, from
http://en.wikipedia.org/wiki/Reservoir_sampling
• Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical
Machine Learning Tools and Techniques (Third Edition). Morgan
Kaufmann.
• Xue, B., Zhang, M., & Browne, W. N. (2012). Single feature ranking
and binary particle swarm optimisation based feature subset
ranking for feature selection. Paper presented at the Proceedings of
the Thirty-fifth Australasian Computer Science Conference - Volume
122, Melbourne, Australia.

More Related Content

What's hot

Weka tutorial
Weka tutorialWeka tutorial
Weka tutorial
Mohammed Aejazuddin
 
WEKA: Introduction To Weka
WEKA: Introduction To WekaWEKA: Introduction To Weka
WEKA: Introduction To Weka
DataminingTools Inc
 
Machine Learning with WEKA
Machine Learning with WEKAMachine Learning with WEKA
Machine Learning with WEKAbutest
 
Analytics machine learning in weka
Analytics machine learning in wekaAnalytics machine learning in weka
Analytics machine learning in weka
Sudhakar Chavan
 
WEKA Tutorial
WEKA TutorialWEKA Tutorial
WEKA Tutorialbutest
 
Weka presentation
Weka presentationWeka presentation
Weka presentationSaeed Iqbal
 
weka data mining
weka data mining weka data mining
weka data mining
kalthoom almaqbali
 
Weka a tool_for_exploratory_data_mining
Weka a tool_for_exploratory_data_miningWeka a tool_for_exploratory_data_mining
Weka a tool_for_exploratory_data_miningTony Frame
 
Wek1
Wek1Wek1
Weka library, JAVA
Weka library, JAVAWeka library, JAVA
Weka library, JAVA
Kamthorn Puntumapon
 
A simple introduction to weka
A simple introduction to wekaA simple introduction to weka
A simple introduction to weka
Pamoda Vajiramali
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using wekarathorenitin87
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
Pratik Doshi
 
Weka presentation
Weka presentationWeka presentation
Weka presentation
Abrar ali
 
Data mining with Weka
Data mining with WekaData mining with Weka
Data mining with Weka
AlbanLevy
 
WEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And AttributesWEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And Attributes
DataminingTools Inc
 
WEKA: The Explorer
WEKA: The ExplorerWEKA: The Explorer
WEKA: The Explorer
DataminingTools Inc
 

What's hot (17)

Weka tutorial
Weka tutorialWeka tutorial
Weka tutorial
 
WEKA: Introduction To Weka
WEKA: Introduction To WekaWEKA: Introduction To Weka
WEKA: Introduction To Weka
 
Machine Learning with WEKA
Machine Learning with WEKAMachine Learning with WEKA
Machine Learning with WEKA
 
Analytics machine learning in weka
Analytics machine learning in wekaAnalytics machine learning in weka
Analytics machine learning in weka
 
WEKA Tutorial
WEKA TutorialWEKA Tutorial
WEKA Tutorial
 
Weka presentation
Weka presentationWeka presentation
Weka presentation
 
weka data mining
weka data mining weka data mining
weka data mining
 
Weka a tool_for_exploratory_data_mining
Weka a tool_for_exploratory_data_miningWeka a tool_for_exploratory_data_mining
Weka a tool_for_exploratory_data_mining
 
Wek1
Wek1Wek1
Wek1
 
Weka library, JAVA
Weka library, JAVAWeka library, JAVA
Weka library, JAVA
 
A simple introduction to weka
A simple introduction to wekaA simple introduction to weka
A simple introduction to weka
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using weka
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
 
Weka presentation
Weka presentationWeka presentation
Weka presentation
 
Data mining with Weka
Data mining with WekaData mining with Weka
Data mining with Weka
 
WEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And AttributesWEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And Attributes
 
WEKA: The Explorer
WEKA: The ExplorerWEKA: The Explorer
WEKA: The Explorer
 

Viewers also liked

Hinf6210 Project Classification Of Breast Cancer Dataset
Hinf6210 Project Classification Of Breast Cancer Dataset Hinf6210 Project Classification Of Breast Cancer Dataset
Hinf6210 Project Classification Of Breast Cancer Dataset
Abel Gebreyesus
 
L1 l2 l3 introduction to machine translation
L1 l2 l3  introduction to machine translationL1 l2 l3  introduction to machine translation
L1 l2 l3 introduction to machine translationRushdi Shams
 
Probabilistic logic
Probabilistic logicProbabilistic logic
Probabilistic logicRushdi Shams
 
L13 why software fails
L13  why software failsL13  why software fails
L13 why software failsRushdi Shams
 
Knowledge representation
Knowledge representationKnowledge representation
Knowledge representationRushdi Shams
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchkevinlan
 
Lecture 5, 6 and 7 cpu scheduling
Lecture 5, 6 and 7  cpu schedulingLecture 5, 6 and 7  cpu scheduling
Lecture 5, 6 and 7 cpu schedulingRushdi Shams
 
Sharing economy-2
Sharing economy-2Sharing economy-2
Sharing economy-2
Daniyar Mukhanov
 
Amazon marketplace
Amazon marketplaceAmazon marketplace
Amazon marketplace
Daniyar Mukhanov
 
Weka.arff
Weka.arffWeka.arff
Weka.arff
Daniyar Mukhanov
 
Fighting spam using social gate keepers
Fighting spam using social gate keepersFighting spam using social gate keepers
Fighting spam using social gate keepers
Clement Robert Habimana
 
Semi-supervised classification for natural language processing
Semi-supervised classification for natural language processingSemi-supervised classification for natural language processing
Semi-supervised classification for natural language processing
Rushdi Shams
 
Amazon mp
Amazon mpAmazon mp
Real time classification of malicious urls.pptx 2
Real time classification of malicious urls.pptx 2Real time classification of malicious urls.pptx 2
Real time classification of malicious urls.pptx 2
Daniyar Mukhanov
 
Twitter r t under crisis
Twitter r t under crisisTwitter r t under crisis
Twitter r t under crisis
Clement Robert Habimana
 
Lecture 7, 8, 9 and 10 Inter Process Communication (IPC) in Operating Systems
Lecture 7, 8, 9 and 10  Inter Process Communication (IPC) in Operating SystemsLecture 7, 8, 9 and 10  Inter Process Communication (IPC) in Operating Systems
Lecture 7, 8, 9 and 10 Inter Process Communication (IPC) in Operating SystemsRushdi Shams
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine Learning
Upekha Vandebona
 

Viewers also liked (20)

Hinf6210 Project Classification Of Breast Cancer Dataset
Hinf6210 Project Classification Of Breast Cancer Dataset Hinf6210 Project Classification Of Breast Cancer Dataset
Hinf6210 Project Classification Of Breast Cancer Dataset
 
L1 l2 l3 introduction to machine translation
L1 l2 l3  introduction to machine translationL1 l2 l3  introduction to machine translation
L1 l2 l3 introduction to machine translation
 
Probabilistic logic
Probabilistic logicProbabilistic logic
Probabilistic logic
 
L13 why software fails
L13  why software failsL13  why software fails
L13 why software fails
 
L15 fuzzy logic
L15  fuzzy logicL15  fuzzy logic
L15 fuzzy logic
 
Knowledge representation
Knowledge representationKnowledge representation
Knowledge representation
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Lecture 5, 6 and 7 cpu scheduling
Lecture 5, 6 and 7  cpu schedulingLecture 5, 6 and 7  cpu scheduling
Lecture 5, 6 and 7 cpu scheduling
 
Sharing economy-2
Sharing economy-2Sharing economy-2
Sharing economy-2
 
Amazon marketplace
Amazon marketplaceAmazon marketplace
Amazon marketplace
 
Weka
WekaWeka
Weka
 
Weka.arff
Weka.arffWeka.arff
Weka.arff
 
Fighting spam using social gate keepers
Fighting spam using social gate keepersFighting spam using social gate keepers
Fighting spam using social gate keepers
 
Semi-supervised classification for natural language processing
Semi-supervised classification for natural language processingSemi-supervised classification for natural language processing
Semi-supervised classification for natural language processing
 
Amazon mp
Amazon mpAmazon mp
Amazon mp
 
Real time classification of malicious urls.pptx 2
Real time classification of malicious urls.pptx 2Real time classification of malicious urls.pptx 2
Real time classification of malicious urls.pptx 2
 
Twitter r t under crisis
Twitter r t under crisisTwitter r t under crisis
Twitter r t under crisis
 
Lecture 7, 8, 9 and 10 Inter Process Communication (IPC) in Operating Systems
Lecture 7, 8, 9 and 10  Inter Process Communication (IPC) in Operating SystemsLecture 7, 8, 9 and 10  Inter Process Communication (IPC) in Operating Systems
Lecture 7, 8, 9 and 10 Inter Process Communication (IPC) in Operating Systems
 
Weka
WekaWeka
Weka
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine Learning
 

Similar to Weka

Name IDPractical Data MiningCOMP-321BTutorial 5.docx
Name IDPractical Data MiningCOMP-321BTutorial 5.docxName IDPractical Data MiningCOMP-321BTutorial 5.docx
Name IDPractical Data MiningCOMP-321BTutorial 5.docx
rosemarybdodson23141
 
Nbvtalkonfeatureselection
NbvtalkonfeatureselectionNbvtalkonfeatureselection
Nbvtalkonfeatureselection
Nagasuri Bala Venkateswarlu
 
Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...
IJTET Journal
 
An introduction to variable and feature selection
An introduction to variable and feature selectionAn introduction to variable and feature selection
An introduction to variable and feature selection
Marco Meoni
 
Data Structure & Algorithms - Operations
Data Structure & Algorithms - OperationsData Structure & Algorithms - Operations
Data Structure & Algorithms - Operations
babuk110
 
Machine Learning in GATE Valentin Tablan
Machine Learning in GATE Valentin TablanMachine Learning in GATE Valentin Tablan
Machine Learning in GATE Valentin Tablanbutest
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
Sease
 
Search Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source ApproachSearch Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source Approach
Alessandro Benedetti
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Alessandro Benedetti
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
dhabalia
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
Cataldo Musto
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document Classification
Alessandro Benedetti
 
Query-porcessing-& Query optimization
Query-porcessing-& Query optimizationQuery-porcessing-& Query optimization
Query-porcessing-& Query optimization
Saranya Natarajan
 
QTP Tutorial
QTP TutorialQTP Tutorial
QTP Tutorialpingkapil
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
Cataldo Musto
 
SubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntitySubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntityAnkita Kumari
 
Qtp Training Deepti 2 Of 44780
Qtp Training Deepti 2 Of 44780Qtp Training Deepti 2 Of 44780
Qtp Training Deepti 2 Of 44780Azhar Satti
 
Building largescalepredictionsystemv1
Building largescalepredictionsystemv1Building largescalepredictionsystemv1
Building largescalepredictionsystemv1
arthi v
 
UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4
DianaGray10
 

Similar to Weka (20)

Name IDPractical Data MiningCOMP-321BTutorial 5.docx
Name IDPractical Data MiningCOMP-321BTutorial 5.docxName IDPractical Data MiningCOMP-321BTutorial 5.docx
Name IDPractical Data MiningCOMP-321BTutorial 5.docx
 
Nbvtalkonfeatureselection
NbvtalkonfeatureselectionNbvtalkonfeatureselection
Nbvtalkonfeatureselection
 
Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...
 
An introduction to variable and feature selection
An introduction to variable and feature selectionAn introduction to variable and feature selection
An introduction to variable and feature selection
 
Data Structure & Algorithms - Operations
Data Structure & Algorithms - OperationsData Structure & Algorithms - Operations
Data Structure & Algorithms - Operations
 
Machine Learning in GATE Valentin Tablan
Machine Learning in GATE Valentin TablanMachine Learning in GATE Valentin Tablan
Machine Learning in GATE Valentin Tablan
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Search Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source ApproachSearch Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source Approach
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document Classification
 
Unit iii
Unit iiiUnit iii
Unit iii
 
Query-porcessing-& Query optimization
Query-porcessing-& Query optimizationQuery-porcessing-& Query optimization
Query-porcessing-& Query optimization
 
QTP Tutorial
QTP TutorialQTP Tutorial
QTP Tutorial
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
SubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntitySubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an Entity
 
Qtp Training Deepti 2 Of 44780
Qtp Training Deepti 2 Of 44780Qtp Training Deepti 2 Of 44780
Qtp Training Deepti 2 Of 44780
 
Building largescalepredictionsystemv1
Building largescalepredictionsystemv1Building largescalepredictionsystemv1
Building largescalepredictionsystemv1
 
UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4
 

Recently uploaded

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 

Recently uploaded (20)

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
PCI PIN Basics Webinar from the Controlcase Team

Weka

  • 1. Brief Weka Introduction Shuang Wu Guided by Dr. Thanh Tran
  • 2. Weka • The software: Waikato Environment for Knowledge Analysis – Machine learning/data mining software written in Java (distributed under the GNU General Public License) • The bird: an endemic bird of New Zealand
  • 3. Outline • ARFF format and loading files to Weka • Basic preprocess and classifier Demo • Attribute selection & Demo • Filtering datasets & Demo
  • 4. ARFF format and loading files to Weka
  • 5. Attribute-Relation File Format (ARFF) • Two distinct sections – Header & Data • Four data types supported – numeric – <nominal-specification> – string – date [<date-format>] • E.g.: DATE "yyyy-MM-dd HH:mm:ss" (http://www.cs.waikato.ac.nz/ml/weka/arff.html)
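To make the header/data split concrete, here is a minimal sketch in Python (not part of Weka) that parses a tiny ARFF string into its relation name, attribute declarations, and data rows. The sample relation, the attribute names, and the helper `parse_arff` are illustrative assumptions; the parser only handles simple comma-separated values and is not a full ARFF reader.

```python
SAMPLE_ARFF = """\
@relation weather_sample

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute note string
@attribute recorded date "yyyy-MM-dd HH:mm:ss"

@data
sunny,85,hot,"2014-04-05 12:00:00"
rainy,65,cool,"2014-04-06 09:30:00"
"""


def parse_arff(text):
    """Split ARFF text into (relation, attributes, rows).

    Simplified: assumes attribute names without spaces and data values
    without embedded commas; comment lines (%) and blank lines are skipped.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        lower = line.lower()
        if in_data:
            # Data section: one instance per line
            rows.append([v.strip().strip('"') for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            # Header section: "@attribute <name> <type specification>"
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
for name, spec in attributes:
    print(name, "->", spec)
```

The four declarations in the sample correspond to the four supported data types: a nominal specification in braces, numeric, string, and date with an optional format.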
  • 6. Converting Files to ARFF • Weka has converters for the following file formats: – Spreadsheet files with extension .csv. – C4.5’s native file format with extensions .names and .data. – Serialized instances with extension .bsi. – LIBSVM format files with extension .libsvm. – SVM-Light format files with extension .dat. – XML-based ARFF format files with extension .xrff. (Witten, Frank & Hall, 2011)
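As a rough illustration of what a CSV-to-ARFF conversion involves, the following Python sketch infers an attribute type per column and writes out a header and data section. It is not Weka's converter: the function `csv_to_arff`, the type-inference rule, and the sample data are assumptions made for this example, and missing values or quoting are not handled.

```python
import csv
import io


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="converted"):
    """Convert CSV text with a header row into ARFF text.

    A column is declared numeric if every value parses as a float,
    otherwise nominal with its distinct values in first-seen order.
    """
    records = list(csv.reader(io.StringIO(csv_text)))
    header, data = records[0], records[1:]
    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            distinct = list(dict.fromkeys(values))
            lines.append(f"@attribute {name} {{{','.join(distinct)}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


CSV_TEXT = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
arff_text = csv_to_arff(CSV_TEXT)
print(arff_text)
```

In practice the nominal value lists and column types usually deserve a manual check after conversion, since a numeric-looking code column may really be nominal.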
  • 7. (Witten, Frank & Hall, 2011)
  • 8. (Witten, Frank & Hall, 2011)
  • 9. Basic preprocess and classifier Demo
  • 20. More Information can be seen from here.
  • 27. Why Feature Selection • Not all the features contained in the datasets of a classification problem are useful • Redundant or irrelevant features may even reduce the classification performance • Eliminating noisy and unnecessary features can – Improve classification performance – Make learning and executing processes faster – Simplify the structure of the learned models
  • 28. Feature Selection • Two categories of feature selection – Wrapper approaches: • Conduct a search for the best feature subset using the learning algorithm itself as part of the evaluation function • A feature selection algorithm exists as a wrapper around a learning algorithm – Filter approaches: • Independent of a learning algorithm • Argued to be computationally less expensive and more general • By considering the performance of the selected feature subset on a particular learning algorithm, wrappers can usually achieve better results than filter approaches
  • 30. Filter: one example • One algorithm that falls into the filter approach: the FOCUS algorithm – Exhaustively examines all subsets of features, selecting the minimal subset of features that is sufficient to determine the label value for all instances in the training set. – May introduce the MIN-FEATURES bias. – For example, in a medical diagnosis task, a set of features describing a patient might include the patient’s social security number (SSN). When FOCUS searches for the minimum set of features, it will pick the SSN as the only feature needed to uniquely determine the label. Given only the SSN, any induction algorithm is expected to generalize very poorly. (Kohavi & John, 1997)
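The SSN pitfall can be reproduced with a small Python sketch of a FOCUS-style search. This is a simplified reading of the description above, not the published algorithm's implementation; the patient records, feature names, and function names are hypothetical.

```python
from itertools import combinations


def is_sufficient(rows, labels, subset):
    """True if no two instances agree on `subset` but differ in label."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True


def focus(rows, labels):
    """Return the first smallest feature subset consistent with the labels."""
    n_features = len(rows[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_sufficient(rows, labels, subset):
                return subset
    return None  # inconsistent data: identical rows with different labels


# Hypothetical patients: (ssn, fever, cough); the label is the diagnosis.
FEATURES = ["ssn", "fever", "cough"]
ROWS = [
    ("001", "yes", "yes"),
    ("002", "yes", "no"),
    ("003", "no", "yes"),
    ("004", "no", "no"),
]
LABELS = ["flu", "cold", "cold", "healthy"]

chosen = focus(ROWS, LABELS)
print([FEATURES[i] for i in chosen])  # prints ['ssn']
```

Because every SSN is unique, the single-feature subset {ssn} already separates all training labels, so the search stops there even though fever and cough together are the features that would generalize.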
  • 31. Searching Attribute Space • The size of the search space for n features is 2^n, so it is impractical to search the whole space exhaustively in most situations • Single Feature Ranking – A relaxed version of feature selection that only requires the computation of the relative importance of the features and subsequently sorting them – Computationally cheap, but the combination of the top-ranked features may be a redundant subset • Feature Subset Ranking, such as – Greedy Algorithms – Genetic Algorithm (GA)
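A greedy search trades completeness for speed: instead of scoring all 2^n subsets, forward selection scores at most n(n+1)/2 of them. The Python sketch below illustrates the idea on a toy dataset. The scoring function (training-set consistency) is only a stand-in for whatever subset evaluator is actually used, and the data and function names are assumptions.

```python
from collections import Counter, defaultdict


def consistency(rows, labels, subset):
    """Fraction of instances matching the majority label of their group,
    where a group is the set of instances agreeing on all features in `subset`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][label] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)


def greedy_forward(rows, labels):
    """Add the single best feature each round until the score stops improving."""
    remaining = list(range(len(rows[0])))
    selected, best = [], consistency(rows, labels, [])
    while remaining:
        # max() keeps the first of several equally good candidates
        feature = max(remaining, key=lambda f: consistency(rows, labels, selected + [f]))
        score = consistency(rows, labels, selected + [feature])
        if score <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Toy data: the label is "y" only when features 0 and 1 are both 1;
# feature 2 is irrelevant.
ROWS = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)]
LABELS = ["n", "n", "n", "y", "y", "n"]

n = len(ROWS[0])
print(2 ** n, "subsets in the full search space")
print(greedy_forward(ROWS, LABELS))
```

On this data the search picks feature 1 first, then feature 0, and stops without adding the irrelevant feature 2. Greedy search can miss good subsets whose features are only useful in combination, which is one motivation for alternatives such as genetic algorithms.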
  • 32. WEKA Attribute Selection Function • Two ways to do attribute selection: – Normally done by searching the space of attribute subsets, evaluating each one (Feature Subset Ranking) • By combining 1 attribute subset evaluator and 1 search method – A potentially faster but less accurate approach is to evaluate the attributes individually and sort them, discarding attributes that fall below a chosen cutoff point (Single Feature Ranking) • By using 1 single-attribute evaluator and the ranking method
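The second approach (score each attribute on its own, sort, and cut off) can be sketched in a few lines of Python. Information gain is used here as one possible single-attribute score; in Weka terms this resembles pairing an evaluator such as InfoGainAttributeEval with the Ranker search method, but the code below is an independent illustration. The data is assumed to be the 14-instance nominal weather dataset bundled with Weka, and the 0.1 cutoff is arbitrary.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, labels, feature):
    """Class entropy minus the expected class entropy after splitting on one feature."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def rank_attributes(rows, labels, names, threshold=0.0):
    """Score each attribute alone, sort descending, keep scores above the cutoff."""
    scored = [(info_gain(rows, labels, i), name) for i, name in enumerate(names)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored if score > threshold]


NAMES = ["outlook", "temperature", "humidity", "windy"]
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
ROWS = [row[:-1] for row in DATA]
LABELS = [row[-1] for row in DATA]

ranking = rank_attributes(ROWS, LABELS, NAMES, threshold=0.1)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With this cutoff only outlook and humidity survive. Each attribute is scored in isolation, so two highly correlated attributes would both rank well even though keeping both adds little, which is the redundancy risk of single feature ranking.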
  • 33. Two Wrapper Methods in Weka • ClassifierSubsetEval – Use a classifier, specified in the object editor as a parameter, to evaluate sets of attributes on the training data or on a separate holdout set. • WrapperSubsetEval – Also use a classifier to evaluate attribute sets, but employ cross-validation to estimate the accuracy of the learning scheme for each set
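The wrapper idea of scoring an attribute set by cross-validating a learner on it can be sketched as follows. This is a minimal Python illustration, not Weka's implementation: the lookup-table learner, the round-robin fold assignment, and the toy data are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import combinations


def train_table(rows, labels, subset):
    """A deliberately simple learner: majority label per feature-value combination."""
    table = {}
    for row, label in zip(rows, labels):
        table.setdefault(tuple(row[i] for i in subset), Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]
    return table, default


def predict(model, row, subset):
    table, default = model
    votes = table.get(tuple(row[i] for i in subset))
    return votes.most_common(1)[0][0] if votes else default


def cv_accuracy(rows, labels, subset, folds=3):
    """Estimate the learner's accuracy on one attribute subset with k-fold CV."""
    correct = 0
    for k in range(folds):
        test_idx = set(range(k, len(rows), folds))  # round-robin fold assignment
        train_idx = [i for i in range(len(rows)) if i not in test_idx]
        model = train_table(
            [rows[i] for i in train_idx], [labels[i] for i in train_idx], subset
        )
        correct += sum(predict(model, rows[i], subset) == labels[i] for i in test_idx)
    return correct / len(rows)


# Toy data: the label copies feature 0; feature 1 is noise.
ROWS = [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y"), ("a", "x"), ("b", "y")]
LABELS = ["A", "B", "A", "B", "A", "B"]

candidates = [s for size in (1, 2) for s in combinations(range(2), size)]
best_subset = max(candidates, key=lambda s: cv_accuracy(ROWS, LABELS, s))
print(best_subset, cv_accuracy(ROWS, LABELS, best_subset))
```

Every candidate subset costs a full round of training and testing per fold, which is why wrappers tend to be more expensive than filter approaches while being tuned to the chosen learner.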
  • 34. Attribute Subset Evaluators (Witten, Frank & Hall, 2011) This one will be used in the Demo.
  • 35. Search Methods (Witten, Frank & Hall, 2011) This one will be used in the Demo.
  • 37. Ranking Method (Witten, Frank & Hall, 2011)
  • 49. Filtering Algorithms • There are two kinds of filter – Supervised: takes advantage of the class information. A class must be assigned. Default behavior uses the last attribute as class. – Unsupervised: the class is not taken into consideration. • Both unsupervised and supervised filters have – Attribute filters, which work on the attributes in the datasets, and – Instance filters, which work on the instances
  • 50. Unsupervised Attribute Filters • Including operations of – Adding and Removing Attributes – Changing Values – Converting attributes from one form to another – Converting multi-instance data into single-instance format – Working with time series data – Randomizing
  • 51. (Witten, Frank & Hall, 2011) This one will be used in the Demo.
  • 52. (Witten, Frank & Hall, 2011)
  • 53. (Witten, Frank & Hall, 2011)
  • 54. Unsupervised Instance Filters (Witten, Frank & Hall, 2011) This one will be used in the Demo.
  • 55. Supervised Attribute and Instance Filters (Witten, Frank & Hall, 2011)
  • 59. Note that the data type of the attribute “temperature” is numeric.
  • 60. First, let’s filter the attributes.
  • 63. Set the “attributeIndices” to 2 (the “temperature” attribute) and the “bins” to 5 (which means the attribute’s values are discretized into 5 bins)
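What a 5-bin discretization does to a numeric attribute can be sketched in Python with equal-width binning. The temperature values are assumed to be those of the numeric weather dataset bundled with Weka, and equal-width binning is assumed to match the filter's default behavior; the function names are illustrative and this is not Weka's code.

```python
from bisect import bisect_left
from collections import Counter


def equal_width_cuts(values, bins):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_labels(cuts):
    """Interval labels in the style shown in the demo, e.g. '(-inf-68.2]'."""
    edges = ["-inf"] + [f"{c:g}" for c in cuts] + ["inf"]
    labels = [f"({a}-{b}]" for a, b in zip(edges, edges[1:])]
    labels[-1] = labels[-1][:-1] + ")"  # the last interval is open on the right
    return labels


def discretize(values, bins):
    cuts = equal_width_cuts(values, bins)
    labels = bin_labels(cuts)
    # bisect_left sends a value equal to a cut point into the lower bin
    return [labels[bisect_left(cuts, v)] for v in values]


TEMPERATURE = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
binned = discretize(TEMPERATURE, bins=5)
for label, count in sorted(Counter(binned).items()):
    print(label, count)
```

With a minimum of 64 and a maximum of 85, each bin is 4.2 wide, so the first cut point is 68.2 and three values (64, 65, 68) land in the first bin.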
  • 66. We can also filter the instances.
  • 67. Note here that there are 3 instances that have the label (-inf-68.2].
  • 68. Set the “attributeIndex” to 2 (the “temperature” attribute) and the “nominalIndices” to 1 (which means to remove all the instances with label (-inf-68.2].)
  • 70. All the instances labeled as (-inf-68.2] have been removed.
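The effect of this instance filter can be mimicked with a short Python sketch. The function `remove_with_values` and the four sample rows are hypothetical; only the 1-based “attributeIndex” and “nominalIndices” settings come from the demo.

```python
def remove_with_values(rows, attribute_index, nominal_indices, labels):
    """Drop rows whose nominal attribute takes one of the selected labels.

    Indices are 1-based, mirroring the "attributeIndex" and
    "nominalIndices" settings described in the demo.
    """
    unwanted = {labels[i - 1] for i in nominal_indices}
    return [row for row in rows if row[attribute_index - 1] not in unwanted]


# Labels of the discretized "temperature" attribute, in declaration order.
TEMP_LABELS = ["(-inf-68.2]", "(68.2-72.4]", "(72.4-76.6]", "(76.6-80.8]", "(80.8-inf)"]

# Hypothetical instances: (outlook, temperature, play)
ROWS = [
    ("sunny", "(80.8-inf)", "no"),
    ("rainy", "(-inf-68.2]", "yes"),
    ("overcast", "(-inf-68.2]", "yes"),
    ("rainy", "(68.2-72.4]", "yes"),
]

kept = remove_with_values(ROWS, attribute_index=2, nominal_indices=[1], labels=TEMP_LABELS)
print(kept)
```

The nominal index refers to a label's position in the attribute declaration, not to any instance number, so index 1 selects the label (-inf-68.2].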
  • 71. Then, when you run the classification, it will be based on the filtered dataset, as shown here.
  • 72. Resources • Weka official website: http://www.cs.waikato.ac.nz/ml/weka/ • Two Weka tutorials on YouTube: – https://www.youtube.com/user/WekaMOOC – https://www.youtube.com/user/rushdishams/videos • Book: Data Mining: Practical Machine Learning Tools and Techniques. Please refer to http://www.cs.waikato.ac.nz/ml/weka/book.html for more details.
  • 73. References • Frank, E. Machine Learning with WEKA. Retrieved April 05, 2014, from http://www.cs.waikato.ac.nz/ml/weka/documentation.html • Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324. • Reservoir sampling. Retrieved April 05, 2014, from http://en.wikipedia.org/wiki/Reservoir_sampling • Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann. • Xue, B., Zhang, M., & Browne, W. N. (2012). Single feature ranking and binary particle swarm optimisation based feature subset ranking for feature selection. Paper presented at the Proceedings of the Thirty-fifth Australasian Computer Science Conference - Volume 122, Melbourne, Australia.