  1. Brief Weka Introduction (Shuang Wu, guided by Dr. Thanh Tran)
  2. Weka • The software: Waikato Environment for Knowledge Analysis – machine learning/data mining software written in Java (distributed under the GNU General Public License) • The bird: an endemic bird of New Zealand
  3. Outline • ARFF format and loading files to Weka • Basic preprocess and classifier Demo • Attribute selection & Demo • Filtering datasets & Demo
  4. ARFF format and loading files to Weka
  5. Attribute-Relation File Format (ARFF) • Two distinct sections – Header & Data • Four data types supported – numeric – <nominal-specification> – string – date [<date-format>] • E.g.: DATE "yyyy-MM-dd HH:mm:ss" (http://www.cs.waikato.ac.nz/ml/weka/arff.html)
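As a concrete illustration of the two sections, here is a minimal ARFF file modeled on Weka's bundled weather dataset (the same kind of data used in the filtering demo later; `%` lines are comments):

```
% Header section: relation name and attribute declarations
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

% Data section: one comma-separated instance per line
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```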
  6. Converting Files to ARFF • Weka has converters for the following file formats: – Spreadsheet files with extension .csv. – C4.5’s native file format with extensions .names and .data. – Serialized instances with extension .bsi. – LIBSVM format files with extension .libsvm. – SVM-Light format files with extension .dat. – XML-based ARFF format files with extension .xrff. (Witten, Frank & Hall, 2011)
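A CSV-to-ARFF conversion can also be done programmatically. The sketch below uses Weka's CSVLoader and ArffSaver classes (the class name CsvToArff and the file names are placeholders; weka.jar must be on the classpath):

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load a CSV file into a Weka Instances object
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));
        Instances data = loader.getDataSet();

        // Write the same data back out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("data.arff"));
        saver.writeBatch();
    }
}
```

The same conversion is available interactively via the Explorer's open/save dialogs.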
  7. (Witten, Frank & Hall, 2011)
  8. (Witten, Frank & Hall, 2011)
  9. Basic preprocess and classifier Demo
  10. More information can be found here.
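The load-and-classify workflow from the demo can be sketched in code roughly as follows (J48 with 10-fold cross-validation is one common choice, not the only one; the file name is a placeholder, and weka.jar must be on the classpath):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierSketch {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file; the last attribute is taken as the class
        Instances data = DataSource.read("weather.numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate a C4.5-style decision tree with 10-fold cross-validation
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```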
  11. Attribute selection
  12. Why Feature Selection • Not all the features contained in the datasets of a classification problem are useful • Redundant or irrelevant features may even reduce the classification performance • Eliminating noisy and unnecessary features can – Improve classification performance – Make learning and executing processes faster – Simplify the structure of the learned models
  13. Feature Selection • Two categories of feature selection – Wrapper approaches: • Conduct a search for the best feature subset using the learning algorithm itself as part of the evaluation function • A feature selection algorithm exists as a wrapper around a learning algorithm – Filter approaches: • Independent of a learning algorithm • Argued to be computationally less expensive and more general • By considering the performance of the selected feature subset on a particular learning algorithm, wrappers can usually achieve better results than filter approaches
  14. Wrapper vs. Filter (Kohavi & John, 1997)
  15. Filter: one example • One algorithm that falls into the filter approach: the FOCUS algorithm – Exhaustively examines all subsets of features, selecting the minimal subset of features that is sufficient to determine the label value for all instances in the training set. – May introduce the MIN-FEATURES bias. – For example, in a medical diagnosis task, a set of features describing a patient might include the patient’s social security number (SSN). When FOCUS searches for the minimum set of features, it will pick the SSN as the only feature needed to uniquely determine the label. Given only the SSN, any induction algorithm is expected to generalize very poorly. (Kohavi & John, 1997)
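A plain-Java toy (not Weka code; the class name FocusSketch and the four-row dataset are invented for illustration) makes the SSN problem concrete: with a unique ID column in the data, the smallest subset consistent with the labels is the ID alone.

```java
import java.util.*;

public class FocusSketch {
    // Toy training set: features {id, outlook, windy}, label "play" last.
    // The "id" column is unique per instance, like an SSN.
    static final String[][] DATA = {
        {"001", "sunny", "no",  "no"},
        {"002", "sunny", "yes", "no"},
        {"003", "rainy", "no",  "yes"},
        {"004", "rainy", "yes", "no"},
    };
    static final int NUM_FEATURES = 3;

    // A subset is consistent if no two instances agree on the
    // selected features but disagree on the label.
    static boolean consistent(List<Integer> subset) {
        Map<String, String> seen = new HashMap<>();
        for (String[] row : DATA) {
            StringBuilder key = new StringBuilder();
            for (int f : subset) key.append(row[f]).append('|');
            String prev = seen.putIfAbsent(key.toString(), row[NUM_FEATURES]);
            if (prev != null && !prev.equals(row[NUM_FEATURES])) return false;
        }
        return true;
    }

    // FOCUS-style exhaustive search: keep the smallest consistent subset.
    static List<Integer> minimalSubset() {
        List<Integer> best = null;
        for (int mask = 1; mask < (1 << NUM_FEATURES); mask++) {
            List<Integer> subset = new ArrayList<>();
            for (int f = 0; f < NUM_FEATURES; f++)
                if ((mask & (1 << f)) != 0) subset.add(f);
            if (consistent(subset) && (best == null || subset.size() < best.size()))
                best = subset;
        }
        return best;
    }

    public static void main(String[] args) {
        // The unique "id" feature alone determines every label, so the
        // MIN-FEATURES bias selects only feature 0 -- the SSN problem.
        System.out.println(minimalSubset());  // prints [0]
    }
}
```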
  16. Searching Attribute Space • The size of the search space for n features is 2^n, so it is impractical to search the whole space exhaustively in most situations • Single Feature Ranking – A relaxed version of feature selection that only requires the computation of the relative importance of the features and subsequently sorting them – Computationally cheap, but the combination of the top-ranked features may be a redundant subset • Feature Subset Ranking, such as – Greedy Algorithms – Genetic Algorithm (GA)
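The exponential growth is easy to verify numerically (a minimal sketch; the class name is invented for illustration):

```java
public class SubsetCount {
    // Number of candidate feature subsets (including the empty set)
    // for n features: each feature is either in or out, so 2^n.
    static long subsetCount(int n) {
        return 1L << n;
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 20, 30}) {
            System.out.println(n + " features -> " + subsetCount(n) + " subsets");
        }
        // Already at 30 features there are over a billion subsets,
        // which is why exhaustive search is usually impractical.
    }
}
```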
  17. WEKA Attribute Selection Function • Two ways to do attribute selection: – Normally done by searching the space of attribute subsets, evaluating each one (Feature Subset Ranking) • By combining one attribute subset evaluator and one search method – A potentially faster but less accurate approach is to evaluate the attributes individually and sort them, discarding attributes that fall below a chosen cutoff point (Single Feature Ranking) • By using one single-attribute evaluator and the ranking method
  18. Two Wrapper Methods in Weka • ClassifierSubsetEval – Uses a classifier, specified in the object editor as a parameter, to evaluate sets of attributes on the training data or on a separate holdout set • WrapperSubsetEval – Also uses a classifier to evaluate attribute sets, but employs cross-validation to estimate the accuracy of the learning scheme for each set
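Programmatically, the evaluator-plus-search combination looks roughly like this (CfsSubsetEval with BestFirst is one common pairing; the file name is a placeholder, and weka.jar must be on the classpath):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Combine one attribute subset evaluator with one search method
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);

        // Print the selected attributes (0-based indices; the class
        // attribute index is included as the last element)
        for (int idx : selector.selectedAttributes()) {
            System.out.println(idx + ": " + data.attribute(idx).name());
        }
    }
}
```

For Single Feature Ranking, swap in a single-attribute evaluator such as InfoGainAttributeEval together with the Ranker search method.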
  19. Attribute Subset Evaluators (Witten, Frank & Hall, 2011) This one will be used in the Demo
  20. Search Methods (Witten, Frank & Hall, 2011) This one will be used in the Demo
  21. Single-Attribute Evaluators (Witten, Frank & Hall, 2011)
  22. Ranking Method (Witten, Frank & Hall, 2011)
  23. Attribute selection Demo
  24. Filtering datasets
  25. Filtering Algorithms • There are two kinds of filters – Supervised: taking advantage of the class information. A class must be assigned; the default behavior uses the last attribute as the class – Unsupervised: the class is not taken into consideration here • Both unsupervised and supervised filters have – Attribute filters, which work on the attributes in the datasets, and – Instance filters, which work on the instances
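The distinction shows up directly in the package names and in the need to set a class attribute. A sketch of applying a supervised filter (the supervised Discretize here is illustrative; the file name is a placeholder, and weka.jar must be on the classpath):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class SupervisedFilterSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff");
        // Supervised filters need the class attribute to be set;
        // Weka's default convention is the last attribute
        data.setClassIndex(data.numAttributes() - 1);

        // weka.filters.supervised.* uses the class information;
        // weka.filters.unsupervised.* ignores it
        Discretize filter = new Discretize();
        filter.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, filter);
        System.out.println(filtered.numAttributes() + " attributes after filtering");
    }
}
```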
  26. Unsupervised Attribute Filters • Including operations of – Adding and Removing Attributes – Changing Values – Converting attributes from one form to another – Converting multi-instance data into single-instance format – Working with time series data – Randomizing
  27. (Witten, Frank & Hall, 2011) This one will be used in the Demo.
  28. (Witten, Frank & Hall, 2011)
  29. (Witten, Frank & Hall, 2011)
  30. Unsupervised Instance Filters (Witten, Frank & Hall, 2011) This one will be used in the Demo.
  31. Supervised Attribute and Instance Filters (Witten, Frank & Hall, 2011)
  32. Filtering datasets Demo
  33. Note that the data type of the attribute “temperature” is numeric.
  34. First, let’s filter the attributes.
  35. Set the “attributeIndices” to 2 (the “temperature” attribute) and the “bins” to 5 (which means to discretize the dataset into 5 bins)
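The same settings made in the Explorer's object editor can be made in code with the unsupervised Discretize filter (the file name is a placeholder, and weka.jar must be on the classpath):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff");

        Discretize filter = new Discretize();
        filter.setAttributeIndices("2"); // the "temperature" attribute (1-based)
        filter.setBins(5);               // discretize into 5 equal-width bins
        filter.setInputFormat(data);

        Instances discretized = Filter.useFilter(data, filter);
        // The second attribute is now nominal, with 5 interval labels
        System.out.println(discretized.attribute(1));
    }
}
```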
  36. Note the discretization result.
  37. We can also filter the instances.
  38. Note that there are 3 instances with the label (-inf-68.2].
  39. Set the “attributeIndex” to 2 (the “temperature” attribute) and the “nominalIndices” to 1 (which means to remove all the instances with the label (-inf-68.2])
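In code, this step corresponds to the unsupervised instance filter RemoveWithValues (the file name is a placeholder assuming the discretized data was saved first; weka.jar must be on the classpath):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class RemoveInstancesSketch {
    public static void main(String[] args) throws Exception {
        // Assumes "temperature" was already discretized, so its
        // first nominal label is (-inf-68.2]
        Instances data = DataSource.read("weather.discretized.arff");

        RemoveWithValues filter = new RemoveWithValues();
        filter.setAttributeIndex("2");  // the "temperature" attribute (1-based)
        filter.setNominalIndices("1");  // remove instances with the first label
        filter.setInputFormat(data);

        Instances filtered = Filter.useFilter(data, filter);
        System.out.println(filtered.numInstances() + " instances remain");
    }
}
```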
  40. All the instances labeled as (-inf-68.2] have been removed.
  41. Then when you do the classification, it will be based on the filtered dataset, as shown here.
  42. Resources • Weka official website: http://www.cs.waikato.ac.nz/ml/weka/ • Two Weka tutorials on YouTube: – https://www.youtube.com/user/WekaMOOC – https://www.youtube.com/user/rushdishams/videos • Book: Data Mining: Practical Machine Learning Tools and Techniques. Please refer to http://www.cs.waikato.ac.nz/ml/weka/book.html for more details.
  43. References • Frank, E., Machine Learning with WEKA. Retrieved April 05, 2014, from http://www.cs.waikato.ac.nz/ml/weka/documentation.html • Kohavi, R. & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 315–333. • Reservoir sampling. Retrieved April 05, 2014, from http://en.wikipedia.org/wiki/Reservoir_sampling • Witten, I. H., Frank, E., & Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann. • Xue, B., Zhang, M., & Browne, W. N. (2012). Single feature ranking and binary particle swarm optimisation based feature subset ranking for feature selection. In Proceedings of the Thirty-fifth Australasian Computer Science Conference, Volume 122, Melbourne, Australia.