An Introduction to WEKA        Saeed Iqbal
Content    What is WEKA?    Data set in WEKA    The Explorer:      Preprocess data      Classification      Clusteri...
What is WEKA?    Waikato Environment for Knowledge Analysis      It’s a data mining/machine learning tool developed by  ...
Download and Install WEKA    Website:     http://www.cs.waikato.ac.nz/~ml/weka/index.html    Support multiple platforms ...
Main Features    49 data preprocessing tools    76 classification/regression algorithms    8 clustering algorithms    ...
Main GUI     Three graphical user interfaces      “The Explorer” (exploratory data analysis)      “The Experimenter” (e...
Datasets in WekaEach entry in a dataset is an instance of the java class:  weka.core.InstanceEach instance consists of ...
ARFF FilesWeka wants its input data in ARFF format.  A dataset has to start with a declaration of its name:     @relati...
ARFF File@relation weather@attribute outlook { sunny, overcast, rainy }@attribute temperature numeric@attribute humidity n...
WEKA: ExplorerPreprocess: Choose and modify the data being acted on.Classify: Train and test learning schemes that class...
Content     What is WEKA?     Data set in WEKA     The Explorer:       Preprocess data       Classification       Cl...
Explorer: pre-processing the data     Data can be imported from a file in various formats: ARFF,      CSV, C4.5, binary  ...
WEKA only deals with “flat” files @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male}...
WEKA only deals with “flat” files @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male}...
load     filter                              analyze15    University of Waikato             01/07/13
16   University of Waikato   01/07/13
17   University of Waikato   01/07/13
18   University of Waikato   01/07/13
19   University of Waikato   01/07/13
20   University of Waikato   01/07/13
21   University of Waikato   01/07/13
22   University of Waikato   01/07/13
23   University of Waikato   01/07/13
24   University of Waikato   01/07/13
25   University of Waikato   01/07/13
26   University of Waikato   01/07/13
27   University of Waikato   01/07/13
28   University of Waikato   01/07/13
29   University of Waikato   01/07/13
30   University of Waikato   01/07/13
31   University of Waikato   01/07/13
32   University of Waikato   01/07/13
33   University of Waikato   01/07/13
34   University of Waikato   01/07/13
35   University of Waikato   01/07/13
Explorer: building “classifiers”     Classifiers in WEKA are models for predicting nominal or      numeric quantities    ...
Decision Tree Induction: Training Dataset                 age    income student credit_rating   buys_computer             ...
Output: A Decision Tree for “buys_computer”                                age?                  <=30      overcast       ...
Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm)     Tree is constructed in a top-down recurs...
classifiers  DEMO
41   University of Waikato   01/07/13
42   University of Waikato   01/07/13
43   University of Waikato   01/07/13
44   University of Waikato   01/07/13
45   University of Waikato   01/07/13
46   University of Waikato   01/07/13
47   University of Waikato   01/07/13
48   University of Waikato   01/07/13
49   University of Waikato   01/07/13
50   University of Waikato   01/07/13
51   University of Waikato   01/07/13
52   University of Waikato   01/07/13
53   University of Waikato   01/07/13
54   University of Waikato   01/07/13
55   University of Waikato   01/07/13
56   University of Waikato   01/07/13
57   University of Waikato   01/07/13
58   University of Waikato   01/07/13
59   University of Waikato   01/07/13
60   University of Waikato   01/07/13
61   University of Waikato   01/07/13
62   University of Waikato   01/07/13
Explorer: clustering data     WEKA contains “clusterers” for finding groups of similar      instances in a dataset     I...
Clustering  DEMO
The K-Means Clustering Method      Given k, the k-means algorithm is implemented in four steps:       Partition objects ...
Explorer: finding associations     WEKA contains an implementation of the Apriori algorithm       for learning associatio...
Basic Concepts: Frequent PatternsTid                Items bought                itemset: A set of one or more items10    ...
Basic Concepts: Association RulesTid                Items bought                Find all the rules X  Y with minimum10  ...
Association  DEMO
70   University of Waikato   01/07/13
71   University of Waikato   01/07/13
72   University of Waikato   01/07/13
73   University of Waikato   01/07/13
74   University of Waikato   01/07/13
Explorer: attribute selection     Panel that can be used to investigate which (subsets of)      attributes are the most p...
Explorer: attribute selection           DEMO
77   University of Waikato   01/07/13
78   University of Waikato   01/07/13
79   University of Waikato   01/07/13
80   University of Waikato   01/07/13
81   University of Waikato   01/07/13
82   University of Waikato   01/07/13
83   University of Waikato   01/07/13
84   University of Waikato   01/07/13
Explorer: data visualization     Visualization very useful in practice: e.g. helps to determine      difficulty of the le...
data visualization      DEMO
87   University of Waikato   01/07/13
88   University of Waikato   01/07/13
89   University of Waikato   01/07/13
90   University of Waikato   01/07/13
91   University of Waikato   01/07/13
92   University of Waikato   01/07/13
93   University of Waikato   01/07/13
94   University of Waikato   01/07/13
95   University of Waikato   01/07/13
96   University of Waikato   01/07/13
References and Resources References:  WEKA website:   http://www.cs.waikato.ac.nz/~ml/weka/index.html  WEKA Tutorial:  ...
Upcoming SlideShare
Loading in...5
×

Weka presentation

6,117

Published on

Published in: Technology
1 Comment
3 Likes
Statistics
Notes
No Downloads
Views
Total Views
6,117
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
492
Comments
1
Likes
3
Embeds 0
No embeds

No notes for slide

Weka presentation

  1. 1. An Introduction to WEKA Saeed Iqbal
  2. 2. Content What is WEKA? Data set in WEKA The Explorer: Preprocess data Classification Clustering Association Rules Attribute Selection Data Visualization References and Resources2 01/07/13
  3. 3. What is WEKA? Waikato Environment for Knowledge Analysis It’s a data mining/machine learning tool developed by Department of Computer Science, University of Waikato, New Zealand. Weka is a collection of machine learning algorithms for data mining tasks. Weka is open source software issued under the GNU General Public License.3 01/07/13
  4. 4. Download and Install WEKA Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html Support multiple platforms (written in java):  Windows, Mac OS X and Linux Weka Manual: http://transact.dl.sourceforge.net/sourcefor ge/weka/WekaManual-3.6.0.pdf4 01/07/13
  5. 5. Main Features 49 data preprocessing tools 76 classification/regression algorithms 8 clustering algorithms 3 algorithms for finding association rules 15 attribute/subset evaluators + 10 search algorithms for feature selection5 01/07/13
  6. 6. Main GUI  Three graphical user interfaces “The Explorer” (exploratory data analysis) “The Experimenter” (experimental environment) “The KnowledgeFlow” (new process model inspired interface) Simple CLI (Command prompt)  Offers some functionality not available via the GUI6 01/07/13
  7. 7. Datasets in WekaEach entry in a dataset is an instance of the java class: weka.core.InstanceEach instance consists of a number of attributes Nominal: one of a predefined list of values  e.g. red, green, blue Numeric: A real or integer number String: Enclosed in “double quotes” Date Relational
  8. 8. ARFF FilesWeka wants its input data in ARFF format. A dataset has to start with a declaration of its name:  @relation name @attribute attribute_name specification  If an attribute is nominal, specification contains a list of the possible attribute values in curly brackets:  @attribute nominal_attribute {first_value, second_value, third_value}  If an attribute is numeric, specification is replaced by the keyword numeric: (Integer values are treated as real numbers in WEKA.)  @attribute numeric_attribute numeric After the attribute declarations, the actual data is introduced by a tag:  @data
  9. 9. ARFF File@relation weather@attribute outlook { sunny, overcast, rainy }@attribute temperature numeric@attribute humidity numeric@attribute windy { TRUE, FALSE }@attribute play { yes, no }@datasunny, 85, 85, FALSE, nosunny, 80, 90, TRUE, noovercast, 83, 86, FALSE, yesrainy, 70, 96, FALSE, yesrainy, 68, 80, FALSE, yesrainy, 65, 70, TRUE, noovercast, 64, 65, TRUE, yessunny, 72, 95, FALSE, nosunny, 69, 70, FALSE, yesrainy, 75, 80, FALSE, yessunny, 75, 70, TRUE, yesovercast, 72, 90, TRUE, yesovercast, 81, 75, FALSE, yesrainy, 71, 91, TRUE, no
  10. 10. WEKA: ExplorerPreprocess: Choose and modify the data being acted on.Classify: Train and test learning schemes that classify or perform regression.Cluster: Learn clusters for the data.Associate: Learn association rules for the data.Select attributes: Select the most relevant attributes in the data.Visualize: View an interactive 2D plot of the data.
  11. 11. Content What is WEKA? Data set in WEKA The Explorer: Preprocess data Classification Clustering Association Rules Attribute Selection Data Visualization References and Resources11 01/07/13
  12. 12. Explorer: pre-processing the data Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary Data can also be read from a URL or from an SQL database (using JDBC) Pre-processing tools in WEKA are called “filters” WEKA contains filters for: Discretization, normalization, resampling, attribute selection, transforming and combining attributes, …12 01/07/13
  13. 13. WEKA only deals with “flat” files @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ...13 01/07/13
  14. 14. WEKA only deals with “flat” files @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ...14 01/07/13
  15. 15. load filter analyze15 University of Waikato 01/07/13
  16. 16. 16 University of Waikato 01/07/13
  17. 17. 17 University of Waikato 01/07/13
  18. 18. 18 University of Waikato 01/07/13
  19. 19. 19 University of Waikato 01/07/13
  20. 20. 20 University of Waikato 01/07/13
  21. 21. 21 University of Waikato 01/07/13
  22. 22. 22 University of Waikato 01/07/13
  23. 23. 23 University of Waikato 01/07/13
  24. 24. 24 University of Waikato 01/07/13
  25. 25. 25 University of Waikato 01/07/13
  26. 26. 26 University of Waikato 01/07/13
  27. 27. 27 University of Waikato 01/07/13
  28. 28. 28 University of Waikato 01/07/13
  29. 29. 29 University of Waikato 01/07/13
  30. 30. 30 University of Waikato 01/07/13
  31. 31. 31 University of Waikato 01/07/13
  32. 32. 32 University of Waikato 01/07/13
  33. 33. 33 University of Waikato 01/07/13
  34. 34. 34 University of Waikato 01/07/13
  35. 35. 35 University of Waikato 01/07/13
  36. 36. Explorer: building “classifiers” Classifiers in WEKA are models for predicting nominal or numeric quantities Implemented learning schemes include: Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …36 01/07/13
  37. 37. Decision Tree Induction: Training Dataset age income student credit_rating buys_computer <=30 high no fair noThis follows <=30 high no excellent no 31…40 high no fair yesan example >40 medium no fair yesof Quinlan’s >40 low yes fair yesID3 (Playing >40 low yes excellent no 31…40 low yes excellent yesTennis) <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no37 January 7, 2013
  38. 38. Output: A Decision Tree for “buys_computer” age? <=30 overcast 31..40 >40 student? yes credit rating? no yes excellent fair no yes no yes38 January 7, 2013
  39. 39. Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)39 January 7, 2013
  40. 40. classifiers DEMO
  41. 41. 41 University of Waikato 01/07/13
  42. 42. 42 University of Waikato 01/07/13
  43. 43. 43 University of Waikato 01/07/13
  44. 44. 44 University of Waikato 01/07/13
  45. 45. 45 University of Waikato 01/07/13
  46. 46. 46 University of Waikato 01/07/13
  47. 47. 47 University of Waikato 01/07/13
  48. 48. 48 University of Waikato 01/07/13
  49. 49. 49 University of Waikato 01/07/13
  50. 50. 50 University of Waikato 01/07/13
  51. 51. 51 University of Waikato 01/07/13
  52. 52. 52 University of Waikato 01/07/13
  53. 53. 53 University of Waikato 01/07/13
  54. 54. 54 University of Waikato 01/07/13
  55. 55. 55 University of Waikato 01/07/13
  56. 56. 56 University of Waikato 01/07/13
  57. 57. 57 University of Waikato 01/07/13
  58. 58. 58 University of Waikato 01/07/13
  59. 59. 59 University of Waikato 01/07/13
  60. 60. 60 University of Waikato 01/07/13
  61. 61. 61 University of Waikato 01/07/13
  62. 62. 62 University of Waikato 01/07/13
  63. 63. Explorer: clustering data WEKA contains “clusterers” for finding groups of similar instances in a dataset Implemented schemes are: k-Means, EM, Cobweb, X-means, FarthestFirst Clusters can be visualized and compared to “true” clusters (if given) Evaluation based on loglikelihood if clustering scheme produces a probability distribution63 01/07/13
  64. 64. Clustering DEMO
  65. 65. The K-Means Clustering Method  Given k, the k-means algorithm is implemented in four steps: Partition objects into k nonempty subsets Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) Assign each object to the cluster with the nearest seed point Go back to Step 2, stop when no more new assignment65 January 7, 2013
  66. 66. Explorer: finding associations WEKA contains an implementation of the Apriori algorithm for learning association rules Works only with discrete data Can identify statistical dependencies between groups of attributes: milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000) Apriori can compute all rules that have a given minimum support and exceed a given confidence66 01/07/13
  67. 67. Basic Concepts: Frequent PatternsTid Items bought  itemset: A set of one or more items10 Beer, Nuts, Diaper  k-itemset X = {x1, …, xk}20 Beer, Coffee, Diaper  (absolute) support, or, support count of X:30 Beer, Diaper, Eggs Frequency or occurrence of an itemset40 Nuts, Eggs, Milk X50 Nuts, Coffee, Diaper, Eggs, Milk  (relative) support, s, is the fraction of Customer Customer transactions that contains X (i.e., the buys both buys diaper probability that a transaction contains X)  An itemset X is frequent if X’s support is no less than a minsup threshold Customer buys beer67 January 7, 2013
  68. 68. Basic Concepts: Association RulesTid Items bought  Find all the rules X  Y with minimum10 Beer, Nuts, Diaper support and confidence20 Beer, Coffee, Diaper30 Beer, Diaper, Eggs  support, s, probability that a40 Nuts, Eggs, Milk transaction contains X ∪ Y50 Nuts, Coffee, Diaper, Eggs, Milk  confidence, c, conditional probability Customer Customer that a transaction having X also buys both buys contains Y diaper Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 Customer buys beer  Association rules: (many more!)  Beer  Diaper (60%, 100%)  Diaper  Beer (60%, 75%)68 January 7, 2013
  69. 69. Association DEMO
  70. 70. 70 University of Waikato 01/07/13
  71. 71. 71 University of Waikato 01/07/13
  72. 72. 72 University of Waikato 01/07/13
  73. 73. 73 University of Waikato 01/07/13
  74. 74. 74 University of Waikato 01/07/13
  75. 75. Explorer: attribute selection Panel that can be used to investigate which (subsets of) attributes are the most predictive ones Attribute selection methods contain two parts: A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking An evaluation method: correlation-based, wrapper, information gain, chi-squared, … Very flexible: WEKA allows (almost) arbitrary combinations of these two75 01/07/13
  76. 76. Explorer: attribute selection DEMO
  77. 77. 77 University of Waikato 01/07/13
  78. 78. 78 University of Waikato 01/07/13
  79. 79. 79 University of Waikato 01/07/13
  80. 80. 80 University of Waikato 01/07/13
  81. 81. 81 University of Waikato 01/07/13
  82. 82. 82 University of Waikato 01/07/13
  83. 83. 83 University of Waikato 01/07/13
  84. 84. 84 University of Waikato 01/07/13
  85. 85. Explorer: data visualization Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem WEKA can visualize single attributes (1-d) and pairs of attributes (2-d) To do: rotating 3-d visualizations (Xgobi-style) Color-coded class values “Jitter” option to deal with nominal attributes (and to detect “hidden” data points) “Zoom-in” function85 01/07/13
  86. 86. data visualization DEMO
  87. 87. 87 University of Waikato 01/07/13
  88. 88. 88 University of Waikato 01/07/13
  89. 89. 89 University of Waikato 01/07/13
  90. 90. 90 University of Waikato 01/07/13
  91. 91. 91 University of Waikato 01/07/13
  92. 92. 92 University of Waikato 01/07/13
  93. 93. 93 University of Waikato 01/07/13
  94. 94. 94 University of Waikato 01/07/13
  95. 95. 95 University of Waikato 01/07/13
  96. 96. 96 University of Waikato 01/07/13
  97. 97. References and Resources References: WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html WEKA Tutorial:  Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.  A presentation which explains how to use Weka for exploratory data mining. WEKA Data Mining Book:  Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page Others:  Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed.
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×