Machine Learning with WEKA
MaxQDPro Team: Anjan K. and Harish R., II Sem M.Tech CSE (06/10/09)
Agenda
Introduction to WEKA
WEKA: Waikato Environment for Knowledge Analysis. Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Official web site: http://www.cs.waikato.ac.nz/ml/weka/
WEKA System Hierarchy
Weka's Role in the Big Picture
Input (raw data) -> data mining by Weka (pre-processing, classification, regression, clustering, association rules, visualization) -> output (result)
KDD Process
WEKA: the software
Machine learning / data mining software written in Java (distributed under the GNU General Public License). Used for research, education, and applications. Complements "Data Mining" by Witten & Frank. Main features: a comprehensive set of data pre-processing tools, learning algorithms and evaluation methods; graphical user interfaces (including data visualization); an environment for comparing learning algorithms.
History
Project funded by the NZ government since 1993. Goals: develop a state-of-the-art workbench of data mining tools, explore fielded applications, and develop new fundamental methods.
History
- Early 1997: the decision was made to rewrite WEKA in Java. It originated from code written by Eibe Frank for his PhD and was originally code-named JAWS (JAva Weka System).
- July 1997: WEKA 2.2. Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support for C5. Included a facility (based on Unix makefiles) for configuring and running large-scale experiments.
- May 1998: WEKA 2.3, the last release of the TCL/TK-based system.
- Mid 1999: WEKA 3 (100% Java) released: the version complementing the Data Mining book, plus a development version (including GUI).
WEKA: versions
There are several versions of WEKA: WEKA 3.4, the "book version", compatible with the description in the data mining book; WEKA 3.5.5, the "development version", with many improvements. This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007); the latest is the WEKA 3.6 series.
Launching the WEKA GUI Chooser: java weka.gui.GUIChooser
Explorer - Preprocessing
Import from files (ARFF, CSV, C4.5, binary) or from a URL or an SQL database (using JDBC). Preprocessing filters: adding/removing attributes, attribute value substitution, discretization (MDL, Kononenko, etc.), time series filters (delta, shift), sampling, randomization, missing value management, normalization and other numeric transformations.
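As a rough sketch of what these preprocessing steps look like when driven from Java rather than from the Explorer GUI, the snippet below loads an ARFF file and applies WEKA's supervised (MDL-based) Discretize filter. The file path and the choice of the last attribute as the class are assumptions made for illustration, not part of the original slides.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    public class PreprocessSketch {
        public static void main(String[] args) throws Exception {
            // Assumed path: the iris data set shipped in WEKA's data/ directory.
            Instances data = DataSource.read("data/iris.arff");
            data.setClassIndex(data.numAttributes() - 1);   // last attribute as class (assumption)

            // Supervised (MDL-based) discretization of all numeric attributes.
            Discretize discretize = new Discretize();
            discretize.setInputFormat(data);                // must be called before filtering
            Instances discretized = Filter.useFilter(data, discretize);

            System.out.println(discretized.numAttributes() + " attributes after discretization");
        }
    }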
ARFF File Format
An ARFF file requires @RELATION, @ATTRIBUTE and @DATA declarations.
- The @RELATION declaration associates a name with the dataset: @RELATION <relation-name>, e.g. @RELATION iris
- An @ATTRIBUTE declaration specifies the name and type of an attribute: @attribute <attribute-name> <datatype>, where the datatype can be numeric, nominal, string or date, e.g.:
  @ATTRIBUTE sepallength NUMERIC
  @ATTRIBUTE petalwidth NUMERIC
  @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
- The @DATA declaration is a single line denoting the start of the data segment; missing values are represented by ?:
  @DATA
  5.1, 3.5, 1.4, 0.2, Iris-setosa
  4.9, ?, 1.4, ?, Iris-versicolor
Explorer - Classification
The predicted attribute is categorical. Implemented methods: Naïve Bayes, decision trees and rules, neural networks, support vector machines, instance-based classifiers, and more. Evaluation: test set, cross-validation, etc.
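A minimal sketch of training one of these classifiers programmatically, here the J48 decision tree learner whose output appears on the next slide. It assumes the iris data set and the class as the last attribute; both are illustrative assumptions.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassifySketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();            // C4.5-style decision tree learner
            tree.buildClassifier(data);
            System.out.println(tree);        // prints a text tree like the one on the next slide

            // Predict the class of the first training instance.
            double label = tree.classifyInstance(data.instance(0));
            System.out.println("Predicted: " + data.classAttribute().value((int) label));
        }
    }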
J48 = Decision Tree
J48 output on the iris data (the first number in parentheses is the number of instances under the node, the second the number classified wrongly):
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Cross-validation
Correctly Classified Instances: 143 (95.33%). Incorrectly Classified Instances: 7 (4.67%). The default is 10-fold cross-validation, i.e. split the data into 10 equal-sized pieces, train on 9 pieces and test on the remainder, do this for all 10 possibilities and average the results.
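The same 10-fold cross-validation can be run from code via weka.classifiers.Evaluation. The sketch below reuses the J48 example; the data path and the random seed are arbitrary choices for illustration.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));  // 10 folds, arbitrary seed
            System.out.println(eval.toSummaryString());   // correct/incorrect counts as above
            System.out.println(eval.toMatrixString());    // confusion matrix, as on the next slide
        }
    }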
J48 Confusion Matrix
Old data set from statistics: 50 instances of each class.
  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  3 47 |  c = Iris-virginica
Precision, Recall, and Accuracy
Precision: the probability of being correct given your decision; the precision for Iris-setosa is 49/49 = 100% (positive predictive value in the medical literature). Recall: the probability of correctly identifying a class; the recall for Iris-setosa is 49/50 = 98% (sensitivity in the medical literature). Accuracy: number right / total = 143/150, roughly 95%.
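Continuing the cross-validation sketch above, the per-class figures quoted on this slide can be read off the Evaluation object directly (in the iris data, class index 0 is Iris-setosa); the exact values depend on the fold split, so the comments show the slide's figures rather than guaranteed output.

    // Inside CrossValidationSketch.main(), after crossValidateModel(...):
    System.out.println("Precision(setosa) = " + eval.precision(0));        // slide: 49/49 = 1.0
    System.out.println("Recall(setosa)    = " + eval.recall(0));           // slide: 49/50 = 0.98
    System.out.println("Accuracy          = " + eval.pctCorrect() + " %"); // slide: 143/150 = 95.33 %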
Explorer - Clustering
Implemented methods: k-means, EM, Cobweb, X-means, FarthestFirst, and more. Clusters can be visualized and compared to "true" clusters (if given). Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution.
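A hedged sketch of running one of these clusterers, SimpleKMeans, from code. The class attribute is removed first because clustering ignores it; the iris path, the "last attribute is the class" assumption and k = 3 are illustrative choices.

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path

            // Drop the class attribute (assumed to be last) so it is not used for clustering.
            Remove rm = new Remove();
            rm.setAttributeIndices("last");
            rm.setInputFormat(data);
            Instances noClass = Filter.useFilter(data, rm);

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(3);             // k chosen to match the three iris classes
            km.buildClusterer(noClass);
            System.out.println(km);           // cluster centroids and sizes
        }
    }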
Explorer - Associations
WEKA contains the Apriori algorithm (among others) for learning association rules. It works only with discrete data and can identify statistical dependencies between groups of attributes, e.g. milk, butter -> bread, eggs (with confidence 0.9 and support 2000). Apriori can compute all rules that have a given minimum support and exceed a given confidence.
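Besides the Explorer and the command line shown on the later "Sample Execution" slides, Apriori can also be invoked from Java. The sketch below uses default settings and assumes the all-nominal weather data set that ships with WEKA; the path is an assumption.

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriSketch {
        public static void main(String[] args) throws Exception {
            // Apriori needs nominal data, e.g. the weather.nominal data set (assumed path).
            Instances data = DataSource.read("data/weather.nominal.arff");

            Apriori apriori = new Apriori();      // default minimum support and confidence
            apriori.buildAssociations(data);
            System.out.println(apriori);          // itemsets and best rules, as on the next slides
        }
    }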
Multiple-Level Association Rule Mining in Weka: concept hierarchy, levels 1-3.
[Hierarchy diagram: Food -> Milk, Bread, Fruit; Milk -> 2%, Skimmed, Fat Free; Bread -> Wheat, White; Fruit -> Apple, Banana, Orange; with Inorganic/Organic subdivisions. The three slides highlight levels 1, 2 and 3 in turn.]
Sample Execution (1)
java weka.associations.Apriori -t data/weather.nominal.arff -I yes

Apriori
=======
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Sample Execution (2)
Best rules found:
1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
2. temperature=cool 4 ==> humidity=normal 4 (1)
3. outlook=overcast 4 ==> play=yes 4 (1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
Regression
The predicted attribute is continuous. Implemented methods: linear regression, neural networks, regression trees, and more.
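For a numeric class the workflow mirrors classification. As a sketch, assuming the cpu.arff data set distributed with WEKA (path and seed are assumptions), linear regression can be fitted and cross-validated like this:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RegressionSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/cpu.arff");    // assumed path, numeric class
            data.setClassIndex(data.numAttributes() - 1);

            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(data);
            System.out.println(lr);                               // the fitted linear model

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(lr, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());           // correlation, MAE, RMSE, ...
        }
    }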
Explorer - Attribute Selection
Very flexible: arbitrary combinations of search and evaluation methods, with both filter and wrapper approaches. Search methods: best-first, genetic, ranking, ... Evaluation measures: ReliefF, information gain, gain ratio, ...
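A sketch of one such combination driven from code: ranking all attributes by information gain (evaluator InfoGainAttributeEval, search method Ranker). The iris path and class index are again assumptions for illustration.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttributeSelectionSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new InfoGainAttributeEval());   // evaluation measure: information gain
            sel.setSearch(new Ranker());                     // search method: ranking
            sel.SelectAttributes(data);
            System.out.println(sel.toResultsString());       // ranked attribute list
        }
    }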
Explorer - Data Visualization
Visualization is very useful in practice, e.g. it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d); rotating 3-d visualizations (Xgobi-style) are still to do. Features include color-coded class values, a "jitter" option to deal with nominal attributes (and to detect "hidden" data points), and a "zoom-in" function.
Performing experiments
The Experimenter makes it easy to compare the performance of different learning schemes, for classification and regression problems. Results can be written to a file or a database. Evaluation options: cross-validation, learning curve, hold-out. It can also iterate over different parameter settings, and significance testing is built in.
The Knowledge Flow GUI
A Java-Beans-based interface for setting up and running machine learning experiments. Data sources, classifiers, etc. are beans and can be connected graphically. Data "flows" through components, e.g. "data source" -> "filter" -> "classifier" -> "evaluator". Layouts can be saved and loaded again later (cf. Clementine™).
Projects based on WEKA
45 projects are currently (30/01/07) listed on the WekaWiki.
Incorporate/wrap WEKA: GRB Tool Shed (a tool to aid gamma ray burst research), YALE (facility for large scale ML experiments), GATE (NLP workbench with a WEKA interface), Judge (document clustering and classification), RWeka (an R interface to Weka).
Extend/modify WEKA: BioWeka (extension library for knowledge discovery in biology), WekaMetal (meta-learning extension to WEKA), Weka-Parallel (parallel processing for WEKA), Grid Weka (grid computing using WEKA), Weka-CG (computational genetics tool library).
WEKA and Pentaho
Pentaho, the leader in open source Business Intelligence (BI), acquired the Weka project in September 2006 (exclusive license and SF.net page). Weka will be used and integrated as the data mining component in their BI suite, and will still be available as GPL open source software. It is most likely to evolve into two editions: a community edition and a BI-oriented edition.
Limitations of WEKA
Traditional algorithms need to have all data in main memory, so big datasets are an issue. Solutions: incremental schemes, stream algorithms, MOA "Massive Online Analysis" (not only a flightless bird, but also extinct!).
Summary
Introduction to WEKA, WEKA system hierarchy, WEKA features, brief history, Explorer, Experimenter, CLI, Knowledge Flow, projects based on WEKA, limitations of WEKA.
References
Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical Machine Learning Tools and Techniques", 2nd edition, Morgan Kaufmann, San Francisco.
http://www.itl.nist.gov/div898/handbook/index.htm (26/Sep/2006)
S. P. Vimal, CS & IS Group, BITS-Pilani
