Department of Computer Science, University of Waikato, New ...


  1. 1. Data Mining using WEKA
  2. 2. Waikato Environment for Knowledge Analysis <ul><li>Copyright: Martin Kramer ( </li></ul><ul><li>The PGSF/NERF project has been running since 1994. </li></ul><ul><li>New Java software development from 1998 on. </li></ul><ul><li>Project goals: </li></ul><ul><ul><li>Develop a state-of-the-art workbench of data mining tools </li></ul></ul><ul><ul><li>Explore fielded applications </li></ul></ul><ul><ul><li>Develop new fundamental methods </li></ul></ul>
  3. 3. WEKA TEAM <ul><li>Geoff Holmes, Ian Witten, Bernhard Pfahringer, Eibe Frank, Mark Hall, Yong Wang, Remco Bouckaert, Peter Reutemann, Gabi Schmidberger, Dale Fletcher, Tony Smith, Mike Mayo and Richard Kirkby </li></ul><ul><li>Members on editorial board of MLJ, programme committees for ICML, ECML, KDD, …. </li></ul><ul><li>Authors of a widely adopted data mining textbook. </li></ul>
  4. 4. Data mining process: Select → Selected data → Preprocess → Preprocessed data → Transform → Transformed data → Mine → Extracted information → Analyze & Assimilate → Assimilated knowledge
  5. 5. Data mining software <ul><li>Commercial packages (Cost ≈ X × 10^6 dollars) </li></ul><ul><ul><li>IBM Intelligent Miner </li></ul></ul><ul><ul><li>SAS Enterprise Miner </li></ul></ul><ul><ul><li>Clementine </li></ul></ul><ul><li>WEKA (Free = GPL licence!) </li></ul><ul><ul><li>Java => Multi-platform </li></ul></ul><ul><ul><li>Open source – means you get source code </li></ul></ul>
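  6. 6. Data format <ul><li>Rectangular table format (flat file) very common </li></ul><ul><ul><li>Most techniques exist to deal with table format </li></ul></ul><ul><li>Row=instance=individual=data point=case=record </li></ul><ul><li>Column=attribute=field=variable=characteristic=dimension </li></ul>
     <table>
     <tr><th>Outlook</th><th>Temperature</th><th>Humidity</th><th>Windy</th><th>Play</th></tr>
     <tr><td>Sunny</td><td>Hot</td><td>High</td><td>False</td><td>No</td></tr>
     <tr><td>Sunny</td><td>Hot</td><td>High</td><td>True</td><td>No</td></tr>
     <tr><td>Overcast</td><td>Hot</td><td>High</td><td>False</td><td>Yes</td></tr>
     <tr><td>Rainy</td><td>Mild</td><td>Normal</td><td>False</td><td>Yes</td></tr>
     <tr><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td></tr>
     </table>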
  7. 7. Data complications <ul><li>Volume of data – sampling; essential attributes </li></ul><ul><li>Missing data </li></ul><ul><li>Inaccurate data </li></ul><ul><li>Data filtering </li></ul><ul><li>Data aggregation </li></ul>
  8. 8. WEKA’s ARFF format
     %
     % ARFF file for weather data with some numeric features
     %
     @relation weather
     @attribute outlook {sunny, overcast, rainy}
     @attribute temperature numeric
     @attribute humidity numeric
     @attribute windy {true, false}
     @attribute play? {yes, no}
     @data
     sunny, 85, 85, false, no
     sunny, 80, 90, true, no
     overcast, 83, 86, false, yes
     ...
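To make the layout of the ARFF file above concrete, here is a minimal sketch of a reader for it in plain Python. It is not WEKA's own loader: the function name `parse_arff` is hypothetical, and the sketch assumes only `numeric` and `{...}` nominal attributes, unquoted names, and dense comma-separated data rows. The elided rows ("...") are left out of the sample text.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Simplified on purpose: handles only `numeric` and `{a, b, c}` nominal
    attributes, unquoted names, and comma-separated dense data rows.
    """
    relation = None
    attributes = []   # (name, "numeric") or (name, [nominal values])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # blank line or comment
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":               # ARFF writes missing values as "?"
                    row.append(None)
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):           # nominal: list the allowed values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


WEATHER = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

relation, attributes, rows = parse_arff(WEATHER)
print(relation, [name for name, _ in attributes])
print(rows[0])
```

The header fixes the attribute order, so each data row can be decoded positionally; that is the main design point the sketch is meant to show.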
  9. 9. Attribute types <ul><li>ARFF supports numeric and nominal attributes </li></ul><ul><li>Interpretation depends on learning scheme </li></ul><ul><ul><li>Numeric attributes are interpreted as </li></ul></ul><ul><ul><ul><li>ordinal scales if less-than and greater-than are used </li></ul></ul></ul><ul><ul><ul><li>ratio scales if distance calculations are performed (normalization/standardization may be required) </li></ul></ul></ul><ul><ul><li>Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise) </li></ul></ul><ul><li>Integers: nominal, ordinal, or ratio scale? </li></ul>
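The two points about distance calculations can be sketched in a few lines of plain Python. This is an illustration only, not WEKA code: the helper names `min_max_normalize` and `distance` are hypothetical, min-max scaling is just one possible normalization, and the Euclidean combination of attributes is an assumption.

```python
import math

def min_max_normalize(column):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: nothing to rescale
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def distance(a, b, numeric_flags):
    """Euclidean-style distance over mixed attributes.

    Numeric attributes contribute their (already normalized) difference;
    nominal attributes contribute 0 if the values are equal and 1 otherwise.
    """
    total = 0.0
    for x, y, is_numeric in zip(a, b, numeric_flags):
        diff = (x - y) if is_numeric else (0.0 if x == y else 1.0)
        total += diff * diff
    return math.sqrt(total)

temperature = [85, 80, 83]            # numeric column from the ARFF example
norm = min_max_normalize(temperature)

# Compare instance 1 (sunny) with instance 3 (overcast) on outlook + temperature.
d = distance(("sunny", norm[0]), ("overcast", norm[2]), (False, True))
print(norm, d)
```

Without the normalization step, a numeric attribute measured on a large scale would dominate the 0/1 contributions of the nominal attributes.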
  10. 10. Missing values <ul><li>Frequently indicated by out-of-range entries </li></ul><ul><ul><li>Types: unknown, unrecorded, irrelevant </li></ul></ul><ul><ul><li>Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible </li></ul></ul><ul><li>Missing value may have significance in itself (e.g. missing test in a medical examination) </li></ul><ul><ul><li>Most schemes assume that is not the case ⇒ “missing” may need to be coded as an additional value </li></ul></ul>
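Coding "missing" as an additional value can be sketched as follows. This is plain Python rather than a WEKA filter; the marker set and the names `MISSING_MARKERS` and `code_missing` are hypothetical, chosen only to show how out-of-range entries could be mapped to one explicit category.

```python
from collections import Counter

# Hypothetical out-of-range codes that a dataset might use for "no value".
MISSING_MARKERS = {"?", "", "-1", "NA"}

def code_missing(column, markers=MISSING_MARKERS, label="missing"):
    """Replace missing entries with an explicit extra nominal value."""
    return [label if v is None or str(v).strip() in markers else v
            for v in column]

outlook = ["sunny", "?", "rainy", None, "overcast"]
coded = code_missing(outlook)
print(coded)
print(Counter(coded))
```

After recoding, a learning scheme sees "missing" as just another value of the attribute, so any significance it carries can be picked up by the model.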
  11. 11. Getting to know the data <ul><li>Simple visualization tools are very useful for identifying problems </li></ul><ul><ul><li>Nominal attributes: histograms (Distribution consistent with background knowledge?) </li></ul></ul><ul><ul><li>Numeric attributes: graphs (Any obvious outliers?) </li></ul></ul><ul><li>2-D and 3-D visualizations show dependencies </li></ul><ul><li>Domain experts need to be consulted </li></ul><ul><li>Too much data to inspect? Take a sample! </li></ul>
  12. 12. Learning and using a model <ul><li>Learning </li></ul><ul><ul><li>Learning algorithm takes instances of concept as input </li></ul></ul><ul><ul><li>Produces a structural description (model) as output </li></ul></ul>Input: concept to learn → Learning algorithm → Model <ul><li>Prediction </li></ul><ul><ul><li>Model takes new instance as input </li></ul></ul><ul><ul><li>Outputs prediction </li></ul></ul>Input → Model → Prediction
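The learn-then-predict split can be illustrated with a deliberately tiny learner. The sketch below picks the single attribute whose value-to-class rules make the fewest training errors; the function names are hypothetical and this is not presented as any particular WEKA scheme. The training rows are the four visible rows of the weather table on the "Data format" slide.

```python
from collections import Counter, defaultdict

def learn_one_rule(instances, labels):
    """Learning step: instances in, model out.

    The model is (attribute index, {value: predicted class}, default class).
    """
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for index in range(len(instances[0])):
        counts = defaultdict(Counter)
        for instance, label in zip(instances, labels):
            counts[instance[index]][label] += 1
        # For each attribute value, predict its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[inst[index]] != lab
                     for inst, lab in zip(instances, labels))
        if best is None or errors < best[0]:
            best = (errors, index, rules)
    _, index, rules = best
    return index, rules, default

def predict(model, instance):
    """Prediction step: new instance in, class out."""
    index, rules, default = model
    return rules.get(instance[index], default)

# Outlook, Temperature, Humidity, Windy -> Play
instances = [
    ("Sunny", "Hot", "High", "False"),
    ("Sunny", "Hot", "High", "True"),
    ("Overcast", "Hot", "High", "False"),
    ("Rainy", "Mild", "Normal", "False"),
]
labels = ["No", "No", "Yes", "Yes"]

model = learn_one_rule(instances, labels)
print(model)
print(predict(model, ("Sunny", "Mild", "Normal", "False")))
```

The model here is a small, readable rule table, which places it at the "easy to understand" end of the range of structural descriptions.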
  13. 13. Structural descriptions (models) <ul><li>Some models are better than others </li></ul><ul><ul><li>Accuracy </li></ul></ul><ul><ul><li>Understandability </li></ul></ul><ul><li>Models range from “easy to understand” to virtually incomprehensible (listed here from easier to harder) </li></ul><ul><ul><li>Decision trees </li></ul></ul><ul><ul><li>Rule induction </li></ul></ul><ul><ul><li>Regression models </li></ul></ul><ul><ul><li>Neural networks </li></ul></ul>
  14. 14. Pre-processing the data <ul><li>Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary </li></ul><ul><li>Data can also be read from a URL or from SQL databases using JDBC </li></ul><ul><li>Pre-processing tools in WEKA are called “filters” </li></ul><ul><li>WEKA contains filters for: </li></ul><ul><ul><li>Discretization, normalization, resampling, attribute selection, attribute combination, … </li></ul></ul>
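As an illustration of what a discretization filter does, here is a minimal equal-width binning sketch in plain Python. It is not WEKA's filter: the function name `discretize` is hypothetical, equal-width binning is only one possible strategy, and the temperature values are made up for the example.

```python
def discretize(values, bins):
    """Map each number to the index of one of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:                    # constant column: everything in bin 0
        return [0 for _ in values]
    # The maximum value falls exactly on the upper edge, so clamp it
    # into the last bin instead of opening a new one.
    return [min(int((v - lo) / width), bins - 1) for v in values]

temperature = [64, 70, 75, 80, 85]    # hypothetical numeric attribute values
print(discretize(temperature, 3))
```

The bin indices can then be treated as nominal values by schemes that only handle discrete data.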
  15. 15. Explorer: pre-processing
  16. 16. Building classification models <ul><li>“Classifiers” in WEKA are models for predicting nominal or numeric quantities </li></ul><ul><li>Implemented schemes include: </li></ul><ul><ul><li>Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, … </li></ul></ul><ul><li>“Meta”-classifiers include: </li></ul><ul><ul><li>Bagging, boosting, stacking, error-correcting output codes, data cleansing, … </li></ul></ul>
  17. 17. Explorer: classification
  18. 18. Explorer: classification
  19. 19. Explorer: classification
  20. 20. Explorer: classification
  21. 21. Explorer: classification
  22. 22. Explorer: classification
  23. 23. Explorer: classification
  24. 24. Explorer: classification
  25. 25. Explorer: classification/regression
  26. 26. Explorer: classification
  27. 27. Clustering data <ul><li>WEKA contains “clusterers” for finding groups of instances in a dataset </li></ul><ul><li>Implemented schemes are: </li></ul><ul><ul><li>k-Means, EM, Cobweb </li></ul></ul><ul><li>Coming soon: x-means </li></ul><ul><li>Clusters can be visualized and compared to “true” clusters (if given) </li></ul><ul><li>Evaluation based on log-likelihood if clustering scheme produces a probability distribution </li></ul>
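For readers unfamiliar with k-means, the basic iteration can be sketched in plain Python. This is a textbook-style illustration, not WEKA's implementation: the function name `k_means` is hypothetical, the starting centroids are passed in explicitly to keep the run deterministic, and the points are made up.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means on tuples of numbers, from given starting centroids."""
    def nearest(p):
        return min(range(len(centroids)),
                   key=lambda i: math.dist(p, centroids[i]))

    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)   # empty cluster: leave in place
        if new_centroids == centroids:      # converged
            break
        centroids = new_centroids
    return centroids, [nearest(p) for p in points]

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignments = k_means(points, [(1.0, 1.0), (9.0, 8.0)])
print(centroids)
print(assignments)
```

When "true" cluster labels are available, the returned assignments are what would be compared against them.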
  28. 28. Explorer: clustering
  29. 29. Explorer: clustering
  30. 30. Explorer: clustering
  31. 31. Explorer: clustering
  32. 32. Finding associations <ul><li>WEKA contains an implementation of the Apriori algorithm for learning association rules </li></ul><ul><ul><li>Works only with discrete data </li></ul></ul><ul><li>Allows you to identify statistical dependencies between groups of attributes: </li></ul><ul><ul><li>milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000) </li></ul></ul><ul><li>Apriori can compute all rules that have a given minimum support and exceed a given confidence </li></ul>
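The two numbers attached to a rule such as "milk, butter ⇒ bread, eggs" can be computed directly from a list of transactions. The sketch below shows only that calculation, not the Apriori search itself; the helper names and the five transactions are hypothetical, and support is taken as a raw count, as in the slide's "support 2000".

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    base = support_count(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical market baskets, made up for illustration.
transactions = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

rule_support = support_count(transactions, {"milk", "butter", "bread", "eggs"})
rule_confidence = confidence(transactions, {"milk", "butter"}, {"bread", "eggs"})
print(rule_support, rule_confidence)
```

Here three baskets contain milk and butter, and two of those also contain bread and eggs, so the rule has support 2 and confidence 2/3.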
  33. 33. Explorer: association rules
  34. 34. Attribute selection <ul><li>Separate panel allows you to investigate which (subsets of) attributes are the most predictive ones </li></ul><ul><li>Attribute selection methods contain two parts: </li></ul><ul><ul><li>A search method: best-first, forward selection, random, exhaustive, race search, ranking </li></ul></ul><ul><ul><li>An evaluation method: correlation-based, wrapper, information gain, chi-squared, PCA, … </li></ul></ul><ul><li>Very flexible: WEKA allows (almost) arbitrary combinations of these two </li></ul>
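One of the simplest combinations listed above is a ranking search with information gain as the evaluation method. A minimal sketch in plain Python follows; it is not WEKA's implementation, the function names are hypothetical, and the data are the four visible rows of the weather table on the "Data format" slide.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(column, labels):
    """Reduction in class entropy obtained by splitting on one attribute."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder

names = ["outlook", "temperature", "humidity", "windy"]
rows = [
    ("Sunny", "Hot", "High", "False"),
    ("Sunny", "Hot", "High", "True"),
    ("Overcast", "Hot", "High", "False"),
    ("Rainy", "Mild", "Normal", "False"),
]
labels = ["No", "No", "Yes", "Yes"]

# Evaluate every attribute on its own, then rank by score.
gains = {name: information_gain([r[i] for r in rows], labels)
         for i, name in enumerate(names)}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
print(gains)
```

On these four rows, outlook separates the classes perfectly, so it is ranked first; with so little data the scores of the other attributes are not meaningful beyond illustration.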
  35. 35. Explorer: attribute selection
  36. 36. Data visualization <ul><li>Visualization is very useful in practice: e.g. helps to determine difficulty of the learning problem </li></ul><ul><li>WEKA can visualize single attributes (1-d) and pairs of attributes (2-d) </li></ul><ul><ul><li>To do: rotating 3-d visualizations (Xgobi-style) </li></ul></ul><ul><li>Color-coded class values </li></ul><ul><li>“Jitter” option to deal with nominal attributes (and to detect “hidden” data points) </li></ul><ul><li>“Zoom-in” function </li></ul>
  37. 37. Explorer: data visualization
  38. 38. Performing experiments <ul><li>The Experimenter makes it easy to compare the performance of different learning schemes applied to the same data. </li></ul><ul><li>Designed for nominal and numeric class problems </li></ul><ul><li>Results can be written to a file or database </li></ul><ul><li>Evaluation options: cross-validation, learning curve, hold-out </li></ul><ul><li>Can also iterate over different parameter settings </li></ul><ul><li>Significance-testing built in! </li></ul>
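Of the evaluation options above, cross-validation is the easiest to sketch. The Python below is an illustration only and does not reproduce the Experimenter: the function names are hypothetical, folds are contiguous with no shuffling or stratification, and the baseline learner simply predicts the majority class of its training data.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(instances, labels, learn, predict, k):
    """Return overall accuracy from k-fold cross-validation."""
    correct = 0
    for train, test in k_fold_indices(len(instances), k):
        model = learn([instances[i] for i in train],
                      [labels[i] for i in train])
        correct += sum(predict(model, instances[i]) == labels[i]
                       for i in test)
    return correct / len(instances)

# Baseline scheme: always predict the most frequent training class.
def learn_majority(instances, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, instance):
    return model

# Hypothetical labels; the instances themselves are unused by this baseline.
labels = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes"]
instances = list(range(len(labels)))

accuracy = cross_validate(instances, labels,
                          learn_majority, predict_majority, k=3)
print(accuracy)
```

Passing a different `learn`/`predict` pair to the same `cross_validate` call is the simplest form of comparing schemes on the same data; a significance test would then be applied to the per-fold results.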
  39. 39. Experimenter: setting it up
  40. 40. Experimenter: running it
  41. 41. Experimenter: analysis
  42. 42. New Directions for Weka <ul><li>New user interface based on work flows </li></ul><ul><li>New data mining techniques </li></ul><ul><ul><li>PACE regression </li></ul></ul><ul><ul><li>Bayesian Networks </li></ul></ul><ul><ul><li>Logistic option trees </li></ul></ul><ul><li>New frameworks for very large data sources (MOA) </li></ul><ul><li>New applications in the agricultural sector </li></ul><ul><ul><li>Matchmaker for RPBC Ltd </li></ul></ul><ul><ul><li>Pest control for kiwifruit management </li></ul></ul><ul><ul><li>Crop forecasting </li></ul></ul><ul><ul><li>Soil element prediction from NIR data (Nitrogen, Carbon) </li></ul></ul>
  43. 43. Next Generation Weka: Knowledge flow GUI
  44. 44. Conclusions <ul><li>Weka is a comprehensive suite of Java programs united under a common interface to permit exploration and experimentation on datasets using state-of-the-art techniques. </li></ul><ul><li>The software is available under the GPL from </li></ul><ul><li>Weka provides the perfect environment for ongoing research in data mining. </li></ul>