Wek1
 

A short description of the WEKA data mining tool

Usage Rights

© All Rights Reserved

Wek1 Presentation Transcript

  • Machine Learning with WEKA: MaxQDPro Team (Anjan.K, Harish.R), II Sem M.Tech CSE, 06/10/09
  • Agenda
  • Introduction to WEKA
    • Waikato Environment for Knowledge Analysis
    • Weka is a collection of machine learning algorithms for data mining tasks.
    • Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
    • Official Web Site: http://www.cs.waikato.ac.nz/ml/weka/
  • WEKA System Hierarchy April 10, 2006
  • Weka’s Role in the Big Picture
    • Input: raw data
    • Data mining by Weka
      • Pre-processing
      • Classification
      • Regression
      • Clustering
      • Association rules
      • Visualization
    • Output: result
  • KDD Process
  • WEKA: the software
    • Machine learning/data mining software written in Java (distributed under the GNU General Public License)
    • Used for research, education, and applications
    • Complements “Data Mining” by Witten & Frank
    • Main features:
      • Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
      • Graphical user interfaces (incl. data visualization)
      • Environment for comparing learning algorithms
  • History
    • Project funded by the NZ government since 1993
      • Develop a state-of-the-art workbench of data mining tools
      • Explore fielded applications
      • Develop new fundamental methods
  • History
    • July 1997 - WEKA 2.2
      • Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support for C5
      • Included a facility (based on Unix makefiles) for configuring and running large-scale experiments
    • Early 1997 - decision was made to rewrite WEKA in Java
      • Originated from code written by Eibe Frank for his PhD
      • Originally codenamed JAWS (JAva Weka System)
    • May 1998 - WEKA 2.3
      • Last release of the TCL/TK-based system
    • Mid 1999 - WEKA 3 (100% Java) released
      • Version to complement the Data Mining book
      • Development version (including GUI)
  • WEKA: versions
    • There are several versions of WEKA:
      • WEKA 3.4: “book version” compatible with description in data mining book
      • WEKA 3.5.5: “development version” with lots of improvements
    • This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007)
    • The latest release as of these slides is the WEKA 3.6 series
  • Starting the GUI: java weka.gui.GUIChooser
  • Explorer - Preprocessing
    • Import from files: ARFF, CSV, C4.5, binary
    • Import from URL or an SQL database (using JDBC)
    • Preprocessing filters
      • Adding/removing attributes
      • Attribute value substitution
      • Discretization (MDL, Kononenko, etc.)
      • Time series filters (delta, shift)
      • Sampling, randomization
      • Missing value management
      • Normalization and other numeric transformations
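Two of the filter types listed above, missing value management and normalization, can be sketched in plain Python. This is only an illustration of the idea, not WEKA's filter code: the function names are hypothetical, and it assumes missing values are replaced by the column mean and that normalization means linear rescaling to [0, 1].

```python
from typing import List, Optional


def replace_missing_with_mean(column: List[Optional[float]]) -> List[float]:
    """Fill None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_normalize(column: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread to rescale; map it to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical numeric attribute with one missing value (None).
raw = [5.0, None, 7.0, 9.0]
filled = replace_missing_with_mean(raw)
scaled = min_max_normalize(filled)
print(filled, scaled)
```

The two steps are kept as separate functions so they can be chained in either order, much as filters are applied one after another in the Preprocess panel.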
  • ARFF File Format
    • Requires declarations of @RELATION, @ATTRIBUTE and @DATA
    • @RELATION declaration associates a name with the dataset
      • @RELATION <relation-name>
        • @RELATION iris
    • @ATTRIBUTE declaration specifies the name and type of an attribute
      • @ATTRIBUTE <attribute-name> <datatype>
      • Datatype can be numeric, nominal, string or date
        • @ATTRIBUTE sepallength NUMERIC
        • @ATTRIBUTE petalwidth NUMERIC
        • @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
    • @DATA declaration is a single line denoting the start of the data segment
      • Missing values are represented by ?
        • @DATA
        • 5.1, 3.5, 1.4, 0.2, Iris-setosa
        • 4.9, ?, 1.4, ?, Iris-versicolor
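To make the three declarations concrete, here is a minimal ARFF reader sketched in Python. It is a simplification, not WEKA's loader: `parse_arff` is a hypothetical name, values are kept as strings, and quoted names, sparse rows and date formats are not handled. The sample text is a trimmed variant of the iris example above, cut down so that each row has exactly one value per declared attribute.

```python
from typing import List, Optional, Tuple

SAMPLE = """\
% trimmed iris example (illustrative only)
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1, 0.2, Iris-setosa
4.9, ?, Iris-versicolor
"""


def parse_arff(text: str) -> Tuple[str, List[Tuple[str, str]], List[List[Optional[str]]]]:
    """Return (relation name, [(attribute, datatype)], data rows)."""
    relation = ""
    attributes: List[Tuple[str, str]] = []
    rows: List[List[Optional[str]]] = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):  # skip blank lines and % comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # "?" marks a missing value; represent it as None.
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, datatype = line.split(None, 2)
            attributes.append((name, datatype))
        elif keyword == "@data":
            in_data = True  # everything after this line is data
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```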
  • Explorer - Classification
    • Predicted attribute is categorical
    • Implemented methods
      • Naïve Bayes
      • decision trees and rules
      • neural networks
      • support vector machines
      • instance-based classifiers …
    • Evaluation
      • test set
      • cross-validation ...
  • J48 = Decision Tree
    • Leaf counts read as (number of instances under the node / number of those classified wrongly)
    • petalwidth <= 0.6: Iris-setosa (50.0)
    • petalwidth > 0.6
    • | petalwidth <= 1.7
    • | | petallength <= 4.9: Iris-versicolor (48.0/1.0)
    • | | petallength > 4.9
    • | | | petalwidth <= 1.5: Iris-virginica (3.0)
    • | | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    • | petalwidth > 1.7: Iris-virginica (46.0/1.0)
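The printed tree can be read as nested conditions. The sketch below is a hand translation of the tree above into a Python function (the name `classify_iris` is hypothetical); it shows how a new instance is routed to a leaf and is not code produced by WEKA.

```python
def classify_iris(petallength: float, petalwidth: float) -> str:
    """Follow the printed J48 tree from the root to a leaf."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    # petalwidth > 0.6
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"


print(classify_iris(petallength=1.4, petalwidth=0.2))
```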
  • Cross-validation
    • Correctly Classified Instances: 143 (95.33%)
    • Incorrectly Classified Instances: 7 (4.67%)
    • Default: 10-fold cross-validation, i.e.
      • Split data into 10 equal sized pieces
      • Train on 9 pieces and test on remainder
      • Do for all possibilities and average
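The three steps above (split, train on all but one piece, average) can be sketched as follows. This is a plain-Python illustration with hypothetical names; it assigns instances to folds round-robin and does no shuffling or stratification, which a real tool would typically add. The `evaluate` callback stands in for "train on these indices, test on those, return a score".

```python
from typing import Callable, List, Tuple


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return one (train, test) pair of index lists per fold."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(n: int, k: int,
                   evaluate: Callable[[List[int], List[int]], float]) -> float:
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


# With 150 instances and k = 10, each fold tests on 15 and trains on 135.
for train, test in k_fold_indices(150, 10)[:2]:
    print(len(train), len(test))
```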
  • J48 Confusion Matrix
    • Old data set from statistics: 50 of each class
    • a b c <-- classified as
    • 49 1 0 | a = Iris-setosa
    • 0 47 3 | b = Iris-versicolor
    • 0 3 47 | c = Iris-virginica
  • Precision, Recall, and Accuracy
    • Precision: probability of being correct given your decision (i.e., given the predicted class).
      • Precision of iris-setosa is 49/49 = 100%
      • Called positive predictive value in medical literature
    • Recall: probability of correctly identifying a class.
      • Recall for iris-setosa is 49/50 = 98%
      • Called sensitivity in medical literature
    • Accuracy: number right / total = 143/150 ≈ 95.3%
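These three measures can be recomputed directly from the J48 confusion matrix shown earlier. The sketch below uses hypothetical helper names and assumes the same layout as the WEKA output: rows are the actual classes, columns are the predicted classes.

```python
from typing import List

# Rows: actual class (setosa, versicolor, virginica); columns: predicted class.
CONFUSION = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix: List[List[int]], c: int) -> float:
    """Correct predictions of class c divided by all predictions of class c."""
    predicted = sum(row[c] for row in matrix)
    return matrix[c][c] / predicted if predicted else 0.0


def recall(matrix: List[List[int]], c: int) -> float:
    """Correct predictions of class c divided by all actual instances of c."""
    actual = sum(matrix[c])
    return matrix[c][c] / actual if actual else 0.0


def accuracy(matrix: List[List[int]]) -> float:
    """Diagonal total divided by the total number of instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


print(precision(CONFUSION, 0), recall(CONFUSION, 0), accuracy(CONFUSION))
```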
  • Explorer - Clustering
    • Implemented methods
      • k-Means
      • EM
      • Cobweb
      • X-means
      • FarthestFirst …
    • Clusters can be visualized and compared to “true” clusters (if given)
    • Evaluation based on log-likelihood if the clustering scheme produces a probability distribution
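As an illustration of the first method in the list, here is a bare-bones k-means loop in Python. It is a sketch under simple assumptions (Euclidean distance, caller-supplied initial centers, hypothetical function names) and is not WEKA's implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(points: Sequence[Point], initial_centers: Sequence[Point],
            max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and center update until assignments stop changing."""
    centers = list(initial_centers)
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return centers, assignment


# Hypothetical 2-d data with two obvious groups.
data = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
print(k_means(data, [(0.0, 0.0), (10.0, 10.0)]))
```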
  • Explorer - Associations
    • WEKA contains the Apriori algorithm (among others) for learning association rules
      • Works only with discrete data
    • Can identify statistical dependencies between groups of attributes:
      • milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
    • Apriori can compute all rules that have a given minimum support and exceed a given confidence
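Support and confidence for a rule such as "milk, butter ⇒ bread, eggs" can be computed by counting transactions. The baskets below are made up for illustration, and the function names are hypothetical; the sketch shows only the two measures, not the Apriori search itself.

```python
from typing import FrozenSet, List

# Hypothetical market baskets, one set of items per transaction.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"milk", "butter", "bread", "eggs"}),
    frozenset({"milk", "butter", "bread", "eggs"}),
    frozenset({"milk", "butter", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"eggs"}),
]


def support_count(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> int:
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: FrozenSet[str], consequent: FrozenSet[str]) -> float:
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    base = support_count(transactions, antecedent)
    if base == 0:
        return 0.0
    return support_count(transactions, antecedent | consequent) / base


lhs = frozenset({"milk", "butter"})
rhs = frozenset({"bread", "eggs"})
print(support_count(BASKETS, lhs | rhs), confidence(BASKETS, lhs, rhs))
```

A rule is reported only if its support reaches the minimum support and its confidence reaches the chosen threshold.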
  • Multiple-Level Association Rule Mining in Weka: Concept Hierarchy
    • [Figure, shown at Levels 1, 2 and 3. Node labels: Food; Milk, Bread, Fruit; 2%, Skimmed, Fat Free; Wheat, White; Apple, Banana, Orange; Inorganic, Organic]
  • Sample Execution (1)
    • java weka.associations.Apriori -t data/weather.nominal.arff -I yes
    • Apriori
    • =======
    • Minimum support: 0.2
    • Minimum confidence: 0.9
    • Number of cycles performed: 17
    • Generated sets of large itemsets:
    • Size of set of large itemsets L(1): 12
  • Sample Execution (2)
    • Best rules found:
    • 1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
    • 2. temperature=cool 4 ==> humidity=normal 4 (1)
    • 3. outlook=overcast 4 ==> play=yes 4 (1)
    • 4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
    • 5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
    • 6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
    • 7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
    • 8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
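Each rule line reads "antecedent count ==> consequent count (confidence)". The sketch below recounts a few of the rules in Python. It assumes the 14-instance weather data normally shipped with WEKA as data/weather.nominal.arff; the rows are reproduced here as an assumption and should be checked against that file. The helper names are hypothetical.

```python
from typing import Dict, List, Tuple

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

# Assumed contents of weather.nominal.arff (verify against the real file).
RAW_ROWS = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""

ROWS: List[Dict[str, str]] = [
    dict(zip(ATTRIBUTES, line.split(","))) for line in RAW_ROWS.splitlines()
]


def count(conditions: Dict[str, str]) -> int:
    """Number of instances matching every attribute=value condition."""
    return sum(all(row[k] == v for k, v in conditions.items()) for row in ROWS)


def rule_stats(antecedent: Dict[str, str],
               consequent: Dict[str, str]) -> Tuple[int, int, float]:
    """Return (antecedent count, antecedent-and-consequent count, confidence)."""
    lhs = count(antecedent)
    both = count({**antecedent, **consequent})
    return lhs, both, both / lhs


# Rule 1: humidity=normal windy=FALSE ==> play=yes
print(rule_stats({"humidity": "normal", "windy": "FALSE"}, {"play": "yes"}))
```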
  • Regression
    • Predicted attribute is continuous
    • Implemented methods
      • (linear regression)
      • neural networks
      • regression trees …
  • Explorer - Attribute Selection
    • Very flexible: arbitrary combination of search and evaluation methods
    • Both filtering and wrapping methods
    • Search methods
      • best-first
      • genetic
      • ranking ...
    • Evaluation measures
      • ReliefF
      • information gain
      • gain ratio …
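Information gain, one of the evaluation measures listed above, scores an attribute by how much it reduces the entropy of the class. The sketch below is a plain-Python illustration with hypothetical names; the two columns are assumed to match the outlook and play attributes of the 14-instance weather data and are included only as example input.

```python
import math
from collections import Counter
from typing import List, Sequence

# Assumed outlook/play columns of the weather data (illustrative input).
OUTLOOK = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
PLAY = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy of the label distribution, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values: Sequence[str], labels: Sequence[str]) -> float:
    """Class entropy minus the weighted entropy after splitting on values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset: List[str] = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


print(round(entropy(PLAY), 3), round(information_gain(OUTLOOK, PLAY), 3))
```

A ranking search would compute this score for every attribute and sort the attributes by it.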
  • Explorer - Data Visualization
    • Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem
    • WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
      • To do: rotating 3-d visualizations (Xgobi-style)
    • Color-coded class values
    • “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
    • “Zoom-in” function
  • Performing experiments
    • Experimenter makes it easy to compare the performance of different learning schemes
    • For classification and regression problems
    • Results can be written into file or database
    • Evaluation options: cross-validation, learning curve, hold-out
    • Can also iterate over different parameter settings
    • Significance-testing built in!
  • The Knowledge Flow GUI
    • Java-Beans-based interface for setting up and running machine learning experiments
    • Data sources, classifiers, etc. are beans and can be connected graphically
    • Data “flows” through components: e.g.,
    • “data source” -> “filter” -> “classifier” -> “evaluator”
    • Layouts can be saved and loaded again later
    • cf. Clementine™
  • Projects based on WEKA
    • 45 projects currently (30/01/07) listed on the WekaWiki
    • Incorporate/wrap WEKA
      • GRB Tool Shed - a tool to aid gamma ray burst research
      • YALE - facility for large scale ML experiments
      • GATE - NLP workbench with a WEKA interface
      • Judge - document clustering and classification
      • RWeka - an R interface to Weka
    • Extend/modify WEKA
      • BioWeka - extension library for knowledge discovery in biology
      • WekaMetal - meta learning extension to WEKA
      • Weka-Parallel - parallel processing for WEKA
      • Grid Weka - grid computing using WEKA
      • Weka-CG - computational genetics tool library
  • WEKA and Pentaho
    • Pentaho – The leader in Open Source Business Intelligence (BI)
    • September 2006 – Pentaho acquires the Weka project (exclusive license and SF.net page)
    • Weka will be used/integrated as the data mining component in their BI suite
    • Weka will still be available as GPL open source software
    • Most likely to evolve into two editions:
      • Community edition
      • BI oriented edition
  • Limitations of WEKA
    • Traditional algorithms need to have all data in main memory
    • ==> big datasets are an issue
    • Solution:
      • Incremental schemes
      • Stream algorithms
      • MOA: “Massive Online Analysis”
      • (not only a flightless bird, but also extinct!)
  • Summary
    • Introduction to WEKA
    • WEKA System Hierarchy
    • WEKA features
    • Brief History
    • Explorer
    • Experimenter
    • CLI
    • Knowledge Flow
    • Projects based on WEKA
    • Limitations of WEKA
  • References
    • Ian H. Witten and Eibe Frank (2005), “Data Mining: Practical Machine Learning Tools and Techniques”, 2nd Edition, Morgan Kaufmann, San Francisco.
    • http://www.itl.nist.gov/div898/handbook/index.htm
    26/Sep/2006 S.P.Vimal, CS IS Group, BITS-Pilani