Machine Learning with WEKA
MaxQDPro Team: Anjan K. and Harish R., II Sem M.Tech CSE (06/10/09)
Agenda
Introduction to WEKA
WEKA: Waikato Environment for Knowledge Analysis. Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Official web site: http://www.cs.waikato.ac.nz/ml/weka/
WEKA System Hierarchy
Weka's Role in the Big Picture
Input (raw data) -> data mining by Weka (pre-processing, classification, regression, clustering, association rules, visualization) -> output (result)
KDD Process
WEKA: the software
Machine learning / data mining software written in Java (distributed under the GNU General Public License). Used for research, education, and applications. Complements "Data Mining" by Witten & Frank. Main features: a comprehensive set of data pre-processing tools, learning algorithms and evaluation methods; graphical user interfaces (including data visualization); an environment for comparing learning algorithms.
History
Project funded by the NZ government since 1993. Goals: develop a state-of-the-art workbench of data mining tools, explore fielded applications, and develop new fundamental methods.
History
- Early 1997: the decision was made to rewrite WEKA in Java. It originated from code written by Eibe Frank for his PhD and was originally code-named JAWS (JAva Weka System).
- July 1997: WEKA 2.2. Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support for C5. Included a facility (based on Unix makefiles) for configuring and running large-scale experiments.
- May 1998: WEKA 2.3, the last release of the TCL/TK-based system.
- Mid 1999: WEKA 3 (100% Java) released: the version complementing the Data Mining book, plus a development version (including GUI).
WEKA: versions
There are several versions of WEKA: WEKA 3.4, the "book version", compatible with the description in the data mining book; WEKA 3.5.5, the "development version", with many improvements. This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007); the latest is the WEKA 3.6 series.
Launching the WEKA GUI Chooser: java weka.gui.GUIChooser
Explorer - Preprocessing
Import from files (ARFF, CSV, C4.5, binary) or from a URL or an SQL database (using JDBC). Preprocessing filters: adding/removing attributes, attribute value substitution, discretization (MDL, Kononenko, etc.), time series filters (delta, shift), sampling, randomization, missing value management, normalization and other numeric transformations.
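As a rough sketch of what these preprocessing steps look like when driven from Java rather than from the Explorer GUI, the snippet below loads an ARFF file and applies WEKA's supervised (MDL-based) Discretize filter. The file path and the choice of the last attribute as the class are assumptions made for illustration, not part of the original slides.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    public class PreprocessSketch {
        public static void main(String[] args) throws Exception {
            // Assumed path: the iris data set shipped in WEKA's data/ directory.
            Instances data = DataSource.read("data/iris.arff");
            data.setClassIndex(data.numAttributes() - 1);   // last attribute as class (assumption)

            // Supervised (MDL-based) discretization of all numeric attributes.
            Discretize discretize = new Discretize();
            discretize.setInputFormat(data);                // must be called before filtering
            Instances discretized = Filter.useFilter(data, discretize);

            System.out.println(discretized.numAttributes() + " attributes after discretization");
        }
    }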
ARFF File Format
An ARFF file requires @RELATION, @ATTRIBUTE and @DATA declarations.
- The @RELATION declaration associates a name with the dataset: @RELATION <relation-name>, e.g. @RELATION iris
- An @ATTRIBUTE declaration specifies the name and type of an attribute: @attribute <attribute-name> <datatype>, where the datatype can be numeric, nominal, string or date, e.g.:
  @ATTRIBUTE sepallength NUMERIC
  @ATTRIBUTE petalwidth NUMERIC
  @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
- The @DATA declaration is a single line denoting the start of the data segment; missing values are represented by ?:
  @DATA
  5.1, 3.5, 1.4, 0.2, Iris-setosa
  4.9, ?, 1.4, ?, Iris-versicolor
Explorer - Classification
The predicted attribute is categorical. Implemented methods: Naïve Bayes, decision trees and rules, neural networks, support vector machines, instance-based classifiers, and more. Evaluation: test set, cross-validation, etc.
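A minimal sketch of training one of these classifiers programmatically, here the J48 decision tree learner whose output appears on the next slide. It assumes the iris data set and the class as the last attribute; both are illustrative assumptions.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassifySketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();            // C4.5-style decision tree learner
            tree.buildClassifier(data);
            System.out.println(tree);        // prints a text tree like the one on the next slide

            // Predict the class of the first training instance.
            double label = tree.classifyInstance(data.instance(0));
            System.out.println("Predicted: " + data.classAttribute().value((int) label));
        }
    }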
J48 = Decision Tree
J48 output on the iris data (the first number in parentheses is the number of instances under the node, the second the number classified wrongly):
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Cross-validation
Correctly Classified Instances: 143 (95.33%). Incorrectly Classified Instances: 7 (4.67%). The default is 10-fold cross-validation, i.e. split the data into 10 equal-sized pieces, train on 9 pieces and test on the remainder, do this for all 10 possibilities and average the results.
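The same 10-fold cross-validation can be run from code via weka.classifiers.Evaluation. The sketch below reuses the J48 example; the data path and the random seed are arbitrary choices for illustration.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));  // 10 folds, arbitrary seed
            System.out.println(eval.toSummaryString());   // correct/incorrect counts as above
            System.out.println(eval.toMatrixString());    // confusion matrix, as on the next slide
        }
    }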
J48 Confusion Matrix
Old data set from statistics: 50 instances of each class.
  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  3 47 |  c = Iris-virginica
Precision, Recall, and Accuracy
Precision: the probability of being correct given your decision; the precision for Iris-setosa is 49/49 = 100% (positive predictive value in the medical literature). Recall: the probability of correctly identifying a class; the recall for Iris-setosa is 49/50 = 98% (sensitivity in the medical literature). Accuracy: number right / total = 143/150, roughly 95%.
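Continuing the cross-validation sketch above, the per-class figures quoted on this slide can be read off the Evaluation object directly (in the iris data, class index 0 is Iris-setosa); the exact values depend on the fold split, so the comments show the slide's figures rather than guaranteed output.

    // Inside CrossValidationSketch.main(), after crossValidateModel(...):
    System.out.println("Precision(setosa) = " + eval.precision(0));        // slide: 49/49 = 1.0
    System.out.println("Recall(setosa)    = " + eval.recall(0));           // slide: 49/50 = 0.98
    System.out.println("Accuracy          = " + eval.pctCorrect() + " %"); // slide: 143/150 = 95.33 %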
Explorer - Clustering
Implemented methods: k-means, EM, Cobweb, X-means, FarthestFirst, and more. Clusters can be visualized and compared to "true" clusters (if given). Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution.
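A hedged sketch of running one of these clusterers, SimpleKMeans, from code. The class attribute is removed first because clustering ignores it; the iris path, the "last attribute is the class" assumption and k = 3 are illustrative choices.

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path

            // Drop the class attribute (assumed to be last) so it is not used for clustering.
            Remove rm = new Remove();
            rm.setAttributeIndices("last");
            rm.setInputFormat(data);
            Instances noClass = Filter.useFilter(data, rm);

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(3);             // k chosen to match the three iris classes
            km.buildClusterer(noClass);
            System.out.println(km);           // cluster centroids and sizes
        }
    }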
Explorer - Associations
WEKA contains the Apriori algorithm (among others) for learning association rules. It works only with discrete data and can identify statistical dependencies between groups of attributes, e.g. milk, butter -> bread, eggs (with confidence 0.9 and support 2000). Apriori can compute all rules that have a given minimum support and exceed a given confidence.
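Besides the Explorer and the command line shown on the later "Sample Execution" slides, Apriori can also be invoked from Java. The sketch below uses default settings and assumes the all-nominal weather data set that ships with WEKA; the path is an assumption.

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriSketch {
        public static void main(String[] args) throws Exception {
            // Apriori needs nominal data, e.g. the weather.nominal data set (assumed path).
            Instances data = DataSource.read("data/weather.nominal.arff");

            Apriori apriori = new Apriori();      // default minimum support and confidence
            apriori.buildAssociations(data);
            System.out.println(apriori);          // itemsets and best rules, as on the next slides
        }
    }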
Multiple-Level Association Rule Mining in Weka: concept hierarchy, levels 1-3.
[Hierarchy diagram: Food -> Milk, Bread, Fruit; Milk -> 2%, Skimmed, Fat Free; Bread -> Wheat, White; Fruit -> Apple, Banana, Orange; with Inorganic/Organic subdivisions. The three slides highlight levels 1, 2 and 3 in turn.]
Sample Execution (1)
java weka.associations.Apriori -t data/weather.nominal.arff -I yes

Apriori
=======
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Sample Execution (2)
Best rules found:
1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
2. temperature=cool 4 ==> humidity=normal 4 (1)
3. outlook=overcast 4 ==> play=yes 4 (1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
Regression
The predicted attribute is continuous. Implemented methods: linear regression, neural networks, regression trees, and more.
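For a numeric class the workflow mirrors classification. As a sketch, assuming the cpu.arff data set distributed with WEKA (path and seed are assumptions), linear regression can be fitted and cross-validated like this:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RegressionSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/cpu.arff");    // assumed path, numeric class
            data.setClassIndex(data.numAttributes() - 1);

            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(data);
            System.out.println(lr);                               // the fitted linear model

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(lr, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());           // correlation, MAE, RMSE, ...
        }
    }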
Explorer - Attribute Selection
Very flexible: arbitrary combinations of search and evaluation methods, with both filter and wrapper approaches. Search methods: best-first, genetic, ranking, ... Evaluation measures: ReliefF, information gain, gain ratio, ...
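A sketch of one such combination driven from code: ranking all attributes by information gain (evaluator InfoGainAttributeEval, search method Ranker). The iris path and class index are again assumptions for illustration.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttributeSelectionSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");   // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new InfoGainAttributeEval());   // evaluation measure: information gain
            sel.setSearch(new Ranker());                     // search method: ranking
            sel.SelectAttributes(data);
            System.out.println(sel.toResultsString());       // ranked attribute list
        }
    }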
Explorer - Data Visualization
Visualization is very useful in practice, e.g. it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d); rotating 3-d visualizations (Xgobi-style) are still to do. Features include color-coded class values, a "jitter" option to deal with nominal attributes (and to detect "hidden" data points), and a "zoom-in" function.
Performing experiments
The Experimenter makes it easy to compare the performance of different learning schemes, for classification and regression problems. Results can be written to a file or a database. Evaluation options: cross-validation, learning curve, hold-out. It can also iterate over different parameter settings, and significance testing is built in.
The Knowledge Flow GUI
A Java-Beans-based interface for setting up and running machine learning experiments. Data sources, classifiers, etc. are beans and can be connected graphically. Data "flows" through components, e.g. "data source" -> "filter" -> "classifier" -> "evaluator". Layouts can be saved and loaded again later (cf. Clementine™).
Projects based on WEKA
45 projects are currently (30/01/07) listed on the WekaWiki.
Incorporate/wrap WEKA: GRB Tool Shed (a tool to aid gamma ray burst research), YALE (facility for large scale ML experiments), GATE (NLP workbench with a WEKA interface), Judge (document clustering and classification), RWeka (an R interface to Weka).
Extend/modify WEKA: BioWeka (extension library for knowledge discovery in biology), WekaMetal (meta-learning extension to WEKA), Weka-Parallel (parallel processing for WEKA), Grid Weka (grid computing using WEKA), Weka-CG (computational genetics tool library).
WEKA and Pentaho
Pentaho, the leader in open source Business Intelligence (BI), acquired the Weka project in September 2006 (exclusive license and SF.net page). Weka will be used and integrated as the data mining component in their BI suite, and will still be available as GPL open source software. It is most likely to evolve into two editions: a community edition and a BI-oriented edition.
Limitations of WEKA
Traditional algorithms need to have all data in main memory, so big datasets are an issue. Solutions: incremental schemes, stream algorithms, MOA "Massive Online Analysis" (not only a flightless bird, but also extinct!).
Summary
Introduction to WEKA, WEKA system hierarchy, WEKA features, brief history, Explorer, Experimenter, CLI, Knowledge Flow, projects based on WEKA, limitations of WEKA.
References
Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical Machine Learning Tools and Techniques", 2nd edition, Morgan Kaufmann, San Francisco.
http://www.itl.nist.gov/div898/handbook/index.htm (26/Sep/2006)
S. P. Vimal, CS & IS Group, BITS-Pilani
