Wek1
A short description of the Weka data mining tool
Machine Learning with WEKA
MaxQDPro Team: Anjan.K, Harish.R (II Sem M.Tech CSE), 06/10/09
Agenda
Introduction to WEKA
- Waikato Environment for Knowledge Analysis
- Weka is a collection of machine learning algorithms for data mining tasks.
- Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
- Official web site: http://www.cs.waikato.ac.nz/ml/weka/
WEKA System Hierarchy (April 10, 2006)
Weka's Role in the Big Picture
- Input: raw data
- Data mining by Weka: pre-processing, classification, regression, clustering, association rules, visualization
- Output: result
KDD Process
WEKA: the software
- Machine learning/data mining software written in Java (distributed under the GNU General Public License)
- Used for research, education, and applications
- Complements "Data Mining" by Witten & Frank
- Main features:
  - Comprehensive set of data pre-processing tools, learning algorithms, and evaluation methods
  - Graphical user interfaces (incl. data visualization)
  - Environment for comparing learning algorithms
History
- Project funded by the NZ government since 1993
  - Develop a state-of-the-art workbench of data mining tools
  - Explore fielded applications
  - Develop new fundamental methods
History
- July 1997: WEKA 2.2
  - Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support for C5
  - Included a facility (based on Unix makefiles) for configuring and running large-scale experiments
- Early 1997: decision was made to rewrite WEKA in Java
  - Originated from code written by Eibe Frank for his PhD
  - Originally codenamed JAWS (JAva Weka System)
- May 1998: WEKA 2.3
  - Last release of the Tcl/Tk-based system
- Mid 1999: WEKA 3 (100% Java) released
  - Version to complement the Data Mining book
  - Development version (including GUI)
WEKA: versions
- There are several versions of WEKA:
  - WEKA 3.4: “book version” compatible with the description in the data mining book
  - WEKA 3.5.5: “development version” with lots of improvements
- This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007)
- The latest is the WEKA 3.6 series
java weka.gui.GUIChooser
Explorer - Preprocessing
- Import from files: ARFF, CSV, C4.5, binary
- Import from URL or an SQL database (using JDBC)
- Preprocessing filters:
  - Adding/removing attributes
  - Attribute value substitution
  - Discretization (MDL, Kononenko, etc.)
  - Time series filters (delta, shift)
  - Sampling, randomization
  - Missing value management
  - Normalization and other numeric transformations
ARFF File Format
- Requires declarations of @RELATION, @ATTRIBUTE and @DATA
- @RELATION associates a name with the dataset
  - @RELATION <relation-name>
  - Example: @RELATION iris
- @ATTRIBUTE specifies the name and type of an attribute
  - @ATTRIBUTE <attribute-name> <datatype>
  - Datatype can be numeric, nominal, string or date
  - Examples: @ATTRIBUTE sepallength NUMERIC, @ATTRIBUTE petalwidth NUMERIC, @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
- @DATA is a single line denoting the start of the data segment
  - Missing values are represented by ?
  - Example rows: 5.1,3.5,1.4,0.2,Iris-setosa and 4.9,?,1.4,?,Iris-versicolor
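Putting the declarations above together, a complete minimal ARFF file for the iris data (abbreviated to two data rows) looks like:

```
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth  NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth  NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,?,1.4,?,Iris-versicolor
```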
Explorer - Classification
- Predicted attribute is categorical
- Implemented methods:
  - Naïve Bayes
  - decision trees and rules
  - neural networks
  - support vector machines
  - instance-based classifiers ...
- Evaluation:
  - test set
  - cross-validation ...
J48 = Decision Tree
petalwidth <= 0.6: Iris-setosa (50.0)   # (instances under node / number wrong)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
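The printed tree is just a cascade of threshold tests. A hand transcription in Python (illustrative only; J48 learns these thresholds from the data) makes the classification logic explicit:

```python
def classify_iris(petallength, petalwidth):
    """Hand-coded copy of the J48 tree above. A leaf annotation like
    (48.0/1.0) means 48 training instances reached that leaf, 1 of
    which was misclassified."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # deeper split for the long-petal, mid-width region
        return "Iris-virginica" if petalwidth <= 1.5 else "Iris-versicolor"
    return "Iris-virginica"

print(classify_iris(1.4, 0.2))  # Iris-setosa
```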
Cross-validation
- Correctly Classified Instances: 143 (95.33%)
- Incorrectly Classified Instances: 7 (4.67%)
- Default is 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all 10 possibilities and average the results
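The three steps above can be sketched directly. This is a generic sketch, not WEKA's implementation; `train_and_eval` is a hypothetical caller-supplied function that fits a model on the training indices and returns accuracy on the held-out indices:

```python
import random

def cross_val_accuracy(n_instances, train_and_eval, k=10, seed=1):
    """k-fold cross-validation: shuffle, split into k near-equal folds,
    hold each fold out once, and average the k accuracies."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized pieces
    scores = []
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_eval(train, test))
    return sum(scores) / k
```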
J48 Confusion Matrix
- Old data set from statistics: 50 of each class
   a  b  c   <-- classified as
  49  1  0 |  a = Iris-setosa
   0 47  3 |  b = Iris-versicolor
   0  3 47 |  c = Iris-virginica
Precision, Recall, and Accuracy
- Precision: probability of being correct given your decision (predicted class)
  - Precision for Iris-setosa is 49/49 = 100%
  - Positive predictive value in the medical literature
- Recall: probability of correctly identifying a class
  - Recall for Iris-setosa is 49/50 = 98%
  - Sensitivity in the medical literature
- Accuracy: # right / total = 143/150 ≈ 95.3%
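The slide's numbers can be recomputed directly from the confusion matrix on the previous slide (rows are the actual class, columns the predicted class):

```python
# J48 confusion matrix for iris: rows = actual, columns = predicted
conf = [
    [49, 1, 0],   # a = Iris-setosa
    [0, 47, 3],   # b = Iris-versicolor
    [0, 3, 47],   # c = Iris-virginica
]

def precision(m, c):
    """Correct predictions of class c over all predictions of class c."""
    return m[c][c] / sum(row[c] for row in m)

def recall(m, c):
    """Correct predictions of class c over all actual instances of class c."""
    return m[c][c] / sum(m[c])

accuracy = sum(conf[i][i] for i in range(3)) / sum(map(sum, conf))

print(precision(conf, 0))  # 1.0    (49/49)
print(recall(conf, 0))     # 0.98   (49/50)
print(round(accuracy, 4))  # 0.9533 (143/150)
```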
Explorer - Clustering
- Implemented methods:
  - k-Means
  - EM
  - Cobweb
  - X-means
  - FarthestFirst ...
- Clusters can be visualized and compared to “true” clusters (if given)
- Evaluation based on log-likelihood if the clustering scheme produces a probability distribution
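Of the listed methods, k-means is the simplest. A minimal n-dimensional sketch of the bare algorithm (not WEKA's implementation): assign each point to its nearest centre, move each centre to the mean of its cluster, and repeat.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means. `points` is a list of equal-length coordinate lists;
    returns the k cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centre with the smallest squared distance to p
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centres[c])))
            clusters[nearest].append(p)
        # move each centre to its cluster mean (keep it if the cluster is empty)
        centres = [[sum(xs) / len(c) for xs in zip(*c)] if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres
```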
Explorer - Associations
- WEKA contains the Apriori algorithm (among others) for learning association rules
  - Works only with discrete data
- Can identify statistical dependencies between groups of attributes:
  - milk, butter => bread, eggs (with confidence 0.9 and support 2000)
- Apriori can compute all rules that have a given minimum support and exceed a given confidence
Concept Hierarchy: Multiple-Level Association Rule Mining in Weka (built up over three slides as Levels 1-3)
- Level 1: Food
- Level 2: Milk, Bread, Fruit
- Level 3: 2%, Skimmed, Fat Free; Wheat, White; Apple, Banana, Orange; Inorganic, Organic
Sample Execution (1)
java weka.associations.Apriori -t data/weather.nominal.arff -I yes

Apriori
=======
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Sample Execution (2)
Best rules found:
1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
2. temperature=cool 4 ==> humidity=normal 4 (1)
3. outlook=overcast 4 ==> play=yes 4 (1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
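The trailing numbers in each rule are counts: `humidity=normal windy=FALSE 4 ==> play=yes 4 (1)` says the left-hand side covers 4 instances, the right-hand side holds for all 4 of them, giving confidence 4/4 = 1. A sketch of the two measures over toy transactions (hypothetical, not the actual weather data):

```python
def support_count(transactions, items):
    """Number of transactions containing every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def confidence(transactions, lhs, rhs):
    """Fraction of lhs-covering transactions that also cover rhs."""
    return support_count(transactions, lhs | rhs) / support_count(transactions, lhs)

# Toy attribute=value transactions (made up for illustration)
ts = [frozenset(t) for t in (
    {"humidity=normal", "windy=FALSE", "play=yes"},
    {"humidity=normal", "windy=FALSE", "play=yes"},
    {"humidity=high", "windy=TRUE", "play=no"},
    {"humidity=normal", "windy=FALSE", "play=yes"},
    {"humidity=normal", "windy=TRUE", "play=no"},
)]
lhs, rhs = {"humidity=normal", "windy=FALSE"}, {"play=yes"}
print(support_count(ts, lhs))   # 3
print(confidence(ts, lhs, rhs)) # 1.0
```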
Regression
- Predicted attribute is continuous
- Implemented methods:
  - linear regression
  - neural networks
  - regression trees ...
Explorer - Attribute Selection
- Very flexible: arbitrary combination of search and evaluation methods
- Both filter and wrapper methods
- Search methods:
  - best-first
  - genetic
  - ranking ...
- Evaluation measures:
  - ReliefF
  - information gain
  - gain ratio ...
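Of the evaluation measures listed, information gain scores an attribute by how much knowing its value reduces the entropy of the class; a minimal sketch for discrete attributes:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the expected class entropy after splitting
    on the attribute's values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

# A perfectly predictive attribute recovers the full class entropy:
print(info_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0
# An uninformative attribute gains nothing:
print(info_gain(["a", "b", "a", "b"], ["yes", "yes", "no", "no"]))  # 0.0
```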
Explorer - Data Visualization
- Visualization is very useful in practice: e.g. it helps to determine the difficulty of the learning problem
- WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
  - To do: rotating 3-d visualizations (Xgobi-style)
- Color-coded class values
- “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
- “Zoom-in” function
Performing experiments
- The Experimenter makes it easy to compare the performance of different learning schemes
- For classification and regression problems
- Results can be written to a file or database
- Evaluation options: cross-validation, learning curve, hold-out
- Can also iterate over different parameter settings
- Significance testing built in!
The Knowledge Flow GUI
- JavaBeans-based interface for setting up and running machine learning experiments
- Data sources, classifiers, etc. are beans and can be connected graphically
- Data “flows” through components, e.g.:
  “data source” -> “filter” -> “classifier” -> “evaluator”
- Layouts can be saved and loaded again later
- cf. Clementine™
Projects based on WEKA
- 45 projects currently (30/01/07) listed on the WekaWiki
- Incorporate/wrap WEKA:
  - GRB Tool Shed - a tool to aid gamma ray burst research
  - YALE - facility for large-scale ML experiments
  - GATE - NLP workbench with a WEKA interface
  - Judge - document clustering and classification
  - RWeka - an R interface to Weka
- Extend/modify WEKA:
  - BioWeka - extension library for knowledge discovery in biology
  - WekaMetal - meta-learning extension to WEKA
  - Weka-Parallel - parallel processing for WEKA
  - Grid Weka - grid computing using WEKA
  - Weka-CG - computational genetics tool library
WEKA and Pentaho
- Pentaho: the leader in open-source Business Intelligence (BI)
- September 2006: Pentaho acquires the Weka project (exclusive license and SF.net page)
- Weka will be used/integrated as the data mining component in their BI suite
- Weka will still be available as GPL open-source software
- Most likely to evolve into 2 editions:
  - Community edition
  - BI-oriented edition
Limitations of WEKA
- Traditional algorithms need to have all data in main memory
  ==> big datasets are an issue
- Solution:
  - Incremental schemes
  - Stream algorithms
  - MOA “Massive Online Analysis” (not only a flightless bird, but also extinct!)
Summary
- Introduction to WEKA
- WEKA System Hierarchy
- WEKA features
- Brief History
- Explorer
- Experimenter
- CLI
- Knowledge Flow
- Projects based on WEKA
- Limitations of WEKA
References
- Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical Machine Learning Tools and Techniques", 2nd Edition, Morgan Kaufmann, San Francisco.
- http://www.itl.nist.gov/div898/handbook/index.htm
26/Sep/2006 S.P.Vimal, CS IS Group, BITS-Pilani