COMPUTER TECHNOLOGY INSTITUTE                                             1999
              ________________________________________________________________________________




                              TECHNICAL REPORT No. 99.03.01


               “Efficient decision tree induction with set-valued attributes”

                             Dimitrios Kalles, Athanasios Papagelis

                                               March 16, 1999




Abstract
Conventional algorithms for the induction of decision trees use an attribute-value representation scheme for
instances. This paper explores the empirical consequences of using set-valued attributes. This simple
representational extension is shown to yield significant gains in speed and accuracy. In the process, the paper
also describes an intuitive and practical version of pre-pruning.







1. Introduction
Describing instances using attribute-value pairs is a widely used practice in the machine learning and
pattern recognition literature. Though we acknowledge the utility of this practice, in this paper we aim to
explore the extent to which using such a representation for decision tree learning is as good as one would
hope for. Specifically, we will make a departure from the attribute-single-value convention and deal with
set-valued attributes.
In doing so, we will have to face all the usual guises of decision tree learning problems. Splitting criteria have
to be re-defined. The set-valued attribute approach puts pre-pruning back into the picture, as a viable and
reasonable pruning strategy. The classification task also changes. Of course, all of this depends on us
providing a plausible explanation of why we need set-valued attributes, and of how we can obtain them.
The first and foremost characteristic of set-valued attributes is that, during splitting, an instance may be
instructed to follow both paths of a (binary) decision tree, thus ending up in more than one leaf. Such
instance replication across tree branches (hopefully) improves the quality of the splitting process, as splitting
decisions are based on larger instance samples. The idea, then, is that classification accuracy should
increase, and this is the strategic goal. This increase is effected not only by more educated splits during
learning but also by the fact that instances follow more than one path during testing too. What we then get is a
combined decision on class membership.
The reader may be questioning the value of this recipe for instance proliferation, across and along tree
branches. Truly, this might be a problem. At some nodes, it may be that attributes are not informative
enough for splitting and that the overlap in instances between children nodes may be excessive. Defining
what may be considered excessive has been a line of research in this work and it has turned out that very
simple pre-pruning criteria suffice to contain this excessiveness. The decision tree building algorithm can
be efficiently modified to contain a couple of extra stopping conditions which limit the extent to which
instance replication may be allowed.
As probabilistic classification has naturally emerged as an option in this research, it is only natural that we
would like to give credit, for some of the ideas that come up in this paper, to the work carried out earlier by
Quinlan [1987]. At this early point, we should emphasize that the proposed approach does not relate to
splitting using feature sets, as employed by C4.5 [Quinlan, 1993].
An earlier attempt to capitalize on these ideas appeared in [Kalles, 1994]. Probably due to the immaturity of
those results and, to some extent, to the awkwardness of the terminology (the term pattern replication was
used to denote the concept of an instance being sent down more than one branch due to set-valued
attributes), the potential has not been fully explored. Of course, this research flavor has been apparent in
[Martin and Billman, 1994], who discuss overlapping concepts, in [Ali and Pazzani, 1996] and in [Breiman,
1996]; the latter two are oriented towards the exploitation of multiple models (see also [Kwok and Carter,
1990]). Our work focuses on exploiting multiple classification paths for a given testing instance, where
these paths are all generated by the same decision tree. In that sense, we feel rather closer to
Quinlan’s approach, at least at the start; we then depart to study what we consider more interesting views of
the problem.
Our short review would be incomplete without mentioning RIPPER [Cohen, 1996], a system that exploits
set-valued features to solve categorization problems in the linguistic domain. More than a few algorithms
have been developed along RIPPER’s lines, yet we shall limit our references here for the sake of containing
the problem within the decision tree domain.
The rest of this paper is organized in four sections. In the next section we elaborate on the intuition behind
the set-valued approach and how it leads to the heuristics employed. The actual modifications to the
standard decision tree algorithms come next, including descriptions of splitting, pruning and classifying.
We then demonstrate via an extended experimental session that the proposed heuristics indeed work and
identify potential pitfalls. Finally, we put all the details together and discuss lines of research that have been
deemed worthy of following.







2. An Overview of the Problem
Symbolic attributes are usually described by a single value (“Ford is the make of this car”). That a symbolic
attribute may obtain an unequivocally specified value is a result of the domain itself (car makes are finite in
number). On the other hand, numeric values are drawn from a continuum, where errors or
variation may turn up in different guises of the same underlying phenomenon: noise. Note that this can hold
for seemingly symbolic values too; colors correspond to a de facto discretization of frequencies. It is
plausible that, for some discretization in some learning problem, we would like to be able to distinguish
between values such as green, not-so-green, etc. This explains why we feel that discretizing numeric values
is a practice that easily fits the set-valued attributes context.
A casual browse of the data sets in the Machine Learning repository [Blake et al., 1998] shows that there
are no data sets where the same concept applies in a purely symbolic domain. It should come as no
surprise that such experiments are indeed limited, and the ones that seem practically justifiable have
appeared in a linguistic context (in robust or predictive parsing, for example, where a word can be
categorized as either a verb or a noun). As linguistic information is inherently symbolic, we feel confident that the
set-valued attribute approach will eventually demonstrate its worth, starting from that domain.
Having mentioned the applicability of the concept in real domains we now argue why it should work,
starting with numeric domains.
First, note that one of the typical characteristics of decision trees is that their divide-and-conquer approach
quickly trims down the availability of large instance chunks, as one progresses towards the tree fringe. The
splitting process calculates some level of (some type of) information gain and splits the tree accordingly. In
doing so, it forces instances that are near a splitting threshold to follow one path. The other path, even if
missed by only some small Δε, is a loser. This amounts to a reduced sample in the losing branch. By treating
numeric attributes as set-valued ones we artificially enlarge the learning population at critical nodes. This
can be done by employing a straightforward discretization step before any splitting starts. It also means that
threshold values do not have to be re-calculated using the actual values but using the substituted discrete ones
(actually, sets of values, where applicable).
During testing, the same rule applies and test instances may end up in more than one leaf. Based on the
assumption that such instance replication will only occur where there is doubt about the existence of a
single value for an attribute, it follows that one can determine class membership of an instance by
considering all leaves it has reached. This delivers a more educated classification as it is based on a larger
sample and, conceptually, resembles an ensemble of experts. For this work, we have used a simple majority
scheme; it may well be, however, that even better results could be obtained using more informed and
sophisticated averaging.
The discretization procedure used to convert numeric values to sets of integer values is an investment in
longer-term learning efficiency as well. Paying a small price for determining how instances’ values should
be discretized is compensated for by having to manipulate only integer values at the learning stage. In
symbolic domains, where an attribute may be assigned a set of values, the set-valued representation is
arguably the only one that does not incur a representation loss.







3. Learning with Set-Valued Attributes
This section will describe how we handle set-valued attributes. While describing how one can generate such
attributes, we shall limit our presentation to the numeric case, as available data sets are not populated by
sets of symbolic values.
To obtain set-valued attributes we have to discretize the raw data of a dataset. The discretization step
produces instances that have (integer) set-valued attributes. Our algorithm uses these normalized instances
to build the tree.
Every attribute’s values are mapped to integers. Instances sharing an attribute value are said to belong to the
same bucket, which is characterized by that integer value. Each attribute has as many buckets as distinct
values.
For continuous (numeric) attributes we split the continuum of values into a small number of non-
overlapping intervals, with each interval assigned to a bucket (one-to-one correspondence). An instance’s
value (a point) for a specific attribute may then belong to more than one of these buckets. Buckets may be
merged when values so allow. Missing values are directed by default to the first bucket for the
corresponding attribute.1
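
To make the mapping concrete, the following Python sketch shows one way to assign bucket ids; the data
layout (sorted interval boundaries per attribute) and the function names are our own illustrative assumptions
rather than the report's actual code.

    # Minimal sketch of the bucket mapping described above; illustrative only.

    def symbolic_buckets(values):
        """Map each distinct symbolic value to an integer bucket id."""
        return {v: i for i, v in enumerate(sorted(set(values)))}

    def numeric_bucket(value, boundaries):
        """Return the bucket id of a numeric value, given the sorted upper
        boundaries of its intervals (one bucket per interval). Missing values
        (None) default to the first bucket, as described above."""
        if value is None:
            return 0
        for i, upper in enumerate(boundaries):
            if value <= upper:
                return i
        return len(boundaries)

    # Example: three buckets for a numeric attribute, split at 12.5 and 24.0
    print(numeric_bucket(14, [12.5, 24.0]))    # bucket 1
    print(numeric_bucket(None, [12.5, 24.0]))  # bucket 0 (missing value)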
We use the classic information gain [Quinlan, 1986] metric to select which attribute value will be the test
at a specific node. An attribute value that has been used is excluded from being used again in any
subtree. Every instance follows at least one path from the root of the tree to some leaf. An instance can
follow more than one branch of a node when the attribute being examined at that node has two values that
direct it to different branches. Thus, the final number of instances at the leaves may be larger than the starting
number of instances.
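
As an illustration of how such routing can work, the sketch below assumes the node tests a single bucket id of
one attribute: an instance whose value set contains that bucket, and some other bucket as well, is sent down
both branches. This is one plausible reading of the description above, not the authors' exact formulation.

    # Sketch of routing an instance through a binary test on a set-valued
    # attribute; an instance may satisfy both outcomes and be replicated.

    def branches_followed(value_set, tested_bucket):
        """Return the branches an instance follows, given the set of bucket
        ids it holds for the attribute tested at this node."""
        followed = []
        if tested_bucket in value_set:
            followed.append("left")               # test satisfied
        if value_set - {tested_bucket}:
            followed.append("right")              # some other value fails the test
        return followed

    print(branches_followed({2}, 2))       # ['left']
    print(branches_followed({2, 3}, 2))    # ['left', 'right'] - instance replicated
    print(branches_followed({3}, 2))       # ['right']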
An interesting side effect of having instances follow both branches of a node is that a child node can have
exactly the same instances as its father. Although this is not necessarily a disadvantage, a repeating
pattern of this behavior along a path can cause a serious overhead due to the size of the resulting tree
(while, at the same time, being highly unlikely to contribute to tree accuracy, as the proliferation of the
same instances suggests that all meaningful splits have already been made).
These considerations lead to an extremely simple pre-pruning technique. We prune the tree at a node (we
make that node a leaf) when that node shares the same instance set with some of its predecessors. Quantifying
the “some” term is an ad hoc policy, captured by the notion of the pruning level (and thus allowing flexibility).
So, for example, with a pruning level of 1 we will prune the tree at a node that shares the same instance set
with its father but not with its grandfather (a pruning level of 0 is meaningless). Note that instance sets can
only be identical for adjacent nodes in a path.
In doing so we want to limit excessive instance replication, and we expect that this kind of pre-pruning
should have a minor impact on accuracy, since the lost information is most likely contained in some of the
surviving (replicated) instance sets.
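
A sketch of the stopping condition follows, under our reading of the pruning level (prune when the node's
instance set is identical to the instance sets of its nearest pruning-level ancestors); the names and data layout
are illustrative assumptions.

    # Sketch of the pre-pruning test under our reading of the pruning level;
    # ancestor_sets[0] is the father's instance set, ancestor_sets[1] the
    # grandfather's, and so on (each a frozenset of instance ids).

    def should_prune(instance_set, ancestor_sets, pruning_level):
        if pruning_level < 1 or len(ancestor_sets) < pruning_level:
            return False
        return all(instance_set == ancestor_sets[k] for k in range(pruning_level))

    node = frozenset({1, 2, 3})
    father, grandfather = frozenset({1, 2, 3}), frozenset({1, 2, 3, 4})
    print(should_prune(node, [father, grandfather], 1))   # True: identical to father
    print(should_prune(node, [father, grandfather], 2))   # False: grandfather differs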
Quite clearly, using the proposed method may well result in testing instances that assign their values to
more than one bucket. Thus, the classification stage requires an instance to be able to follow more than one
branch of a node, possibly ending up in more than one leaf. Classification is then straightforward, by
averaging the instance classes available at all the leaves reached by an instance.
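
The combination step can be as simple as pooling the class counts of every leaf the instance reaches and
taking the majority, as in the sketch below (the simple majority scheme mentioned earlier; the data layout is
our own assumption).

    # Classification by pooling the class counts of all the leaves reached by
    # a test instance and taking a simple majority over the pooled counts.

    from collections import Counter

    def classify(leaves_reached):
        """leaves_reached: a list with one {class_label: instance_count}
        dictionary per leaf the instance ended up in."""
        pooled = Counter()
        for leaf_counts in leaves_reached:
            pooled.update(leaf_counts)
        return pooled.most_common(1)[0][0]

    print(classify([{"A": 5, "B": 1}, {"B": 4, "A": 1}]))   # 'A' (6 votes vs 5)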
An algorithm that uses the χ2 metric, Chimerge [Kerber, 1992], was used to discretize continuous attributes.
Chimerge employs a χ2-related threshold to find the best possible points to split a continuum of values.
The value for χ2-threshold is determined by selecting a desired significance level and then using a table to
obtain the corresponding χ2 value (obtaining the χ2 value also requires specifying the number of degrees of
freedom, which will be 1 less than the number of classes). For example, when there are 3 classes (thus 2
degrees of freedom) the χ2 value at the .90 percentile level is 4.6. The meaning of χ2-threshold is that among
cases where the class and attribute are independent there is a 90% probability that the computed χ2 value
will be less than 4.6; thus, χ2 values in excess of this threshold imply that the attribute and the class are not
independent. As a result, choosing higher values for χ2-threshold causes the merging process to continue
longer, resulting in discretizations with fewer and larger intervals (buckets). The user can also override χ2-
threshold by setting a max-buckets parameter, thus specifying an upper limit on the number of intervals to
create.
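
For instance, the quoted threshold can be reproduced with a standard chi-square quantile function, and the
merging loop then repeatedly joins the pair of adjacent intervals with the smallest chi-square statistic until
every remaining pair exceeds the threshold (or until the max-buckets limit is satisfied). The sketch below is a
simplified rendition of Chimerge using scipy, not the report's implementation.

    # Simplified rendition of the Chimerge merging step; illustrative only.

    from scipy.stats import chi2, chi2_contingency

    def chimerge(intervals, significance=0.90, num_classes=3, max_buckets=20):
        """intervals: one class-count list per initial interval, in value order."""
        threshold = chi2.ppf(significance, df=num_classes - 1)   # about 4.6 for 3 classes at .90
        while len(intervals) > 1:
            # chi-square statistic for every pair of adjacent intervals
            stats = [chi2_contingency([intervals[i], intervals[i + 1]])[0]
                     for i in range(len(intervals) - 1)]
            i = min(range(len(stats)), key=stats.__getitem__)
            if stats[i] >= threshold and len(intervals) <= max_buckets:
                break                       # every adjacent pair now looks class-dependent
            merged = [a + b for a, b in zip(intervals[i], intervals[i + 1])]
            intervals[i:i + 2] = [merged]   # merge the most similar adjacent pair
        return intervals

    # three classes; each row is the class-count vector of one raw interval
    print(chimerge([[8, 1, 1], [7, 2, 1], [1, 9, 1], [2, 8, 1], [1, 1, 9]],
                   significance=0.90, num_classes=3, max_buckets=3))
    # three buckets remain: [[15, 3, 2], [3, 17, 2], [1, 1, 9]]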

1
 Admittedly this approach is counterintuitive, yet quite straightforward. Our research agenda suggests that instances with missing
values should be directed to all buckets.




During this research we extended the use of the χ2 metric to create left and right margins extending from a
bucket’s boundaries. Every attribute value that belongs to a specific bucket but also falls within the left
(right) margin of that bucket is also considered to belong to the previous (next) bucket, respectively.
To define a margin’s length (a right margin, for example) we start by constructing a test bucket, which
initially holds only the last value of the current bucket, and test it against the next bucket as a whole using the χ2
metric. While the result does not exceed χ2-threshold, we extend the right margin to the left, thus enlarging
the test bucket. We know (from the initial bucket construction) that the right margin can only grow a finite
number of steps and never beyond the bucket’s own size.
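
A sketch of this right-margin computation is given below; the per-value class-count layout, the function name
and the worked numbers are illustrative assumptions chosen to match the whale-size example that follows.

    # Sketch of extending a bucket's right margin: grow a test bucket leftwards
    # from the current bucket's last value while its chi-square statistic
    # against the whole next bucket stays below the threshold.

    from scipy.stats import chi2_contingency

    def right_margin(bucket_counts, next_bucket_total, threshold):
        """bucket_counts: one class-count vector per distinct value of the
        current bucket, in ascending value order.
        next_bucket_total: class-count vector of the whole next bucket.
        Returns how many of the rightmost values fall in the right margin."""
        test_bucket = [0] * len(next_bucket_total)
        margin = 0
        for counts in reversed(bucket_counts):
            candidate = [a + b for a, b in zip(test_bucket, counts)]
            if chi2_contingency([candidate, next_bucket_total])[0] >= threshold:
                break                         # classes now look different: stop
            test_bucket, margin = candidate, margin + 1
        return margin

    # two classes; values 21, 22, 23 of bucket 2 tested against bucket 3 as a whole;
    # threshold 2.71 is the .90 level for 1 degree of freedom (two classes)
    print(right_margin([[9, 1], [2, 3], [1, 4]], [1, 9], threshold=2.71))   # -> 2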
For example, suppose we have an attribute named whale-size, which represents the length of, say, 50
whales. To create buckets we sort the values of that attribute in ascending order and use Chimerge to
split the continuum of values into buckets (see Figure 1).

                  Bucket 1: 4 6 7 8 9 10 11 12      Bucket 2: 13 14 16 19 21 22 23      Bucket 3: 25 27 28 30 31 32

                              Figure 1: Splitting the continuum of values into buckets
We then move on to find the margins. The first bucket has only a right margin and the last has only a left
margin. Consider bucket 2 in the figure below; its boundary values are 13 and 23 and its left and right
margins are at values 16 and 22, respectively.

                  Bucket 1: 4 6 7 8 9 10 11 12      Bucket 2: 13 14 16 19 21 22 23      Bucket 3: 25 27 28 30 31 32
                                                    (left margin at 16, right margin at 22)

                                     Figure 2: Getting margins for every bucket
To obtain the right margin we start by testing value 23 against bucket 3 as a whole using the χ2 metric.
Suppose that the first test returns a value lower than χ2-threshold; we then set 23 as the right margin of bucket
2 and combine 23 with 22 (the next value to the left in bucket 2) into a test bucket.2 This test bucket is tested
again against bucket 3 using the χ2 statistic. Suppose that, once more, the returned value does not exceed χ2-
threshold, so we extend bucket 2’s right margin to 22. Also suppose that the next attempt to extend the
margin fails because χ2-threshold is exceeded. From then on, when an instance has the value 22 or 23 for its
whale-size attribute, this value is mapped to both buckets 2 and 3.




2
 If the very first test were unsuccessful we would set the margin to be the mid-point between bucket 3’s first value (25) and
bucket 2’s last value (23), which is 24.




4. Experimental Validation
Experimentation consisted of using several databases from the Machine Learning Repository [Blake et al.,
1998]. We compared the performance of the proposed approach to the results obtained by experimenting
with ITI [Utgoff, 1997]. ITI is a decision tree learning program that is freely available for research
purposes. All databases were chosen to have continuous-valued attributes, to demonstrate the benefits of our
approach. Testing took place on a Sun SparcStation/10 with 64 MB of RAM under SunOS 5.5.1.
The main goal during this research was to come up with (a modification to) an algorithm capable of more
accurate results than existing algorithms. A parallel goal was to build a time-efficient algorithm.
As the algorithm depends on a particular configuration of χ2-threshold, maximum number of buckets and
pruning level, we conducted our testing for various combinations of those parameters. Specifically, every
database was tested using χ2-threshold values of 0.90, 0.95, 0.975 and 0.99, with each of those tested for 5,
10, 15, and 20 maximum number of buckets and with a pruning level of 1, 2, 3, and 4, thus resulting in 64
different tests for every database. These tests were then compared to a single run of ITI. Note that each
such test is the outcome of a 10-fold cross-validation process.
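
The parameter grid can be enumerated as in the sketch below (4 x 4 x 4 = 64 combinations per database, each
scored by 10-fold cross-validation); the train_and_evaluate function is a placeholder for the actual induction
and testing code, not part of the report.

    # Enumeration of the 64 parameter combinations per database, in the same
    # order as the charts: chi-square threshold, then max-buckets, then pruning level.

    from itertools import product

    CHI2_THRESHOLDS = (0.90, 0.95, 0.975, 0.99)
    MAX_BUCKETS = (5, 10, 15, 20)
    PRUNING_LEVELS = (1, 2, 3, 4)

    def run_experiments(database, train_and_evaluate):
        results = []
        for chi2_t, buckets, level in product(CHI2_THRESHOLDS, MAX_BUCKETS, PRUNING_LEVELS):
            accuracy = train_and_evaluate(database, chi2_threshold=chi2_t,
                                          max_buckets=buckets, pruning_level=level,
                                          folds=10)          # 10-fold cross-validation
            results.append(((chi2_t, buckets, level), accuracy))
        return results                                       # 64 results per database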
The following databases were used:

  Database        Characteristics
  Abalone         4177 instances, 8 attributes (one nominal), no missing values
  Adult           48842 instances, 14 attributes (8 nominal), missing values exist
  Crx             690 instances, 15 attributes (6 nominal), no missing values
  Ecoli           336 instances, 8 attributes (one nominal), no missing values
  Glass           214 instances, 9 attributes (all numeric), no missing values
  Pima            768 instances, 8 attributes (all numeric), no missing values
  Wine            178 instances, 13 attributes (all numeric), no missing values
  Yeast           1484 instances, 8 attributes (one nominal), no missing values

Three basic metrics were recorded: accuracy, speed and size. Accuracy is the percentage of the correctly
classified instances, speed is the time that the algorithm spends to create the tree, and size is the number of
leaves of the resulting decision tree.
We first present the accuracy results (see Appendix A). The rather unconventional numbering on the x-axis
is simply an indexing scheme; we felt that a graphical representation of the relative superiority of the
proposed approach would be more emphatic. Two lines (one solid and one dotted) are used to
demonstrate the results of ITI with pruning turned on and off, respectively.
The experiments are sorted starting from the 0.90 χ2-threshold, with 5 buckets (maximum) and a pruning
level of 1. The first four results are for pruning levels 1-4. The number of buckets is then increased by 5 and
the pruning level again runs from 1 to 4. The same procedure continues until the number of buckets reaches
20, at which point we change the χ2-threshold to 0.95 and start again. The whole procedure is repeated for χ2-
thresholds of 0.975 and 0.99.
As for speed, Appendix B demonstrates how the proposed approach compares to ITI in inducing the
decision tree (the reported results for our approach are the splitting time and the sum of splitting and
discretization time).
Regarding size (see Appendix C), it is clear that the set-valued approach produces larger trees.
Accuracy seems to benefit. It is interesting to note that under-performance (regardless of whether this
is observed against ITI with or without pruning) is usually associated with lower χ2-threshold values or
lower max-buckets values. Finding a specific combination of these parameters under which our approach
does best in all circumstances is a difficult task and, quite surely, theoretically intractable. It rather seems
that different combinations suit different databases, depending on their idiosyncrasies.
However, it can be argued that unless the χ2-threshold is set to low values, thus injecting more “anarchy” than
would normally be accommodated, or the max-buckets value is set too low, thus enforcing unwarranted
value interval merges, we can expect accuracy to rise compared to ITI.
Accuracy is also dependent on the pruning level, but not with the same pattern throughout. If we focus on
experiments on the right-hand side of the charts (where the χ2-threshold and max-buckets values are reasonable), we
observe that accuracy displays a data-set-specific pattern of correlation with the pruning level, yet this
pattern is different across domains. We consider this to be a clear indication that pre-pruning is not a
solution per se but should be complemented with a post-pruning step to enhance the predictability of the
outcomes.
The results are quite interesting as far as speed is concerned. While experimenting with most databases, we
were confident that the discretization step would be of minor overhead. By experimenting with the wine
and adult databases, however, we were surprised to see that our approach spent most of the time
discretizing numeric attributes. Although this led to a thorough re-evaluation of the software code and a
decision to overhaul it as soon as possible, it also stimulated us to present the graphs as they appear, with a
clear view of the speed of the splitting phase. We can claim that splitting speed is comfortably improved, yet
discretization should be further looked into. Note that pre-pruning also helps keep the tree at a
reasonable (yet larger than usual) size at a small price in time.
The main drawback of the algorithm is the size of the tree that it constructs. Having instances replicate
themselves can result in extremely large, difficult-to-handle trees. Two things can prevent this from
happening: normalization reduces the size of the data that the algorithm manipulates, and pre-pruning
reduces unnecessary instance proliferation along tree branches. We have seen that both approaches
seem to work well, yet neither is a solution per se.







5. Discussion
We have proposed a modification to the classical decision tree induction approach in order to handle set-
valued attributes with relatively straightforward conceptual extensions to the basic model. Our experiments
have demonstrated that by employing a simple pre-processing stage, the proposed approach can handle
numeric attributes (for a start) more efficiently and yet yield significant accuracy improvements.
As we speculated in the introductory sections, the applicability of the approach to numeric attributes
seemed all too obvious. Although numeric attributes have values that span a continuum, it is not difficult to
imagine that conceptual groupings of similar objects would also share neighboring values. In this sense, the
discretization step simply captures the clusters that are in any case present in the instance space along particular axes.
We view this as an example of everyday representation bias; we tend to favor descriptions that make it
easier for us to group objects rather than allow fuzziness. This bias can be captured in a theoretical sense by
the observation that entropy gain is maximized at class boundaries for numeric attributes [Fayyad and Irani,
1992].
When numeric attributes truly have values that make it difficult to identify class-related ranges, splitting
will be "terminally" affected by the observed values and testing is affected too. Our approach serves to offer
a second chance to near-misses and, as the experiments show, near-misses do survive. The χ2 approach to
determining secondary bucket preference ensures that neighboring instances are allowed to be of value for longer
than the classical approach would allow. For the vast majority of cases this interaction has proved
beneficial by a comfortable margin.
As far as the experiments are concerned, the χ2-threshold value for merging buckets has proved important.
The greater the confidence level required, the better the accuracy attained, in most cases. However, there
have been cases where increasing χ2-threshold decreased the accuracy (compared to lower χ2-threshold
values). We attribute such behavior to the fact that the corresponding data sets demonstrate classes that are not
easily separable. In such cases, increasing the χ2-threshold, thus making buckets more rigid,
imposes an artificial clustering which deteriorates accuracy. That our approach still outperforms ITI is a
sign that the few instances which are directed to more than one bucket compensate for the inflexibility of
the “few” buckets available.
Scheduled improvements to the proposed approach for handling set-valued attributes span a broad
spectrum. Priority is given to the classification scheme. We aim to examine the extent to which we can use
a confidence status for each instance traveling through the tree (for training or testing) to obtain weighted
averages for class assignment at leaves, instead of simply summing up counters.
It is interesting to note that using set-valued attributes naturally paves the way for deploying more effective
post-processing schemes for various tasks (character recognition is an obvious example). By following a
few branches, one need not make a guess about the second best option based on only one node, as is usually
the case (where, actually, the idea is to eliminate all but the winner!).
Another important line of research concerns the development of a suitable post-pruning strategy. Instance
replication complicates the maths involved in estimating error, and standard approaches need to be better
studied (for example, it is not at all obvious how error-complexity pruning [Breiman et al., 1984] might be
readily adapted to set-valued attributes). Last, but not least, we need to explore the extent to which instance
replication does (or, hopefully, does not) seriously affect the efficiency of incremental decision tree
algorithms.
We feel that the proposed approach opens up a range of possibilities for decision tree induction. By beefing
it up with the appropriate mathematical framework we may be able to say that a little bit of carefully
introduced fuzziness improves the learning process.







References
[Ali and Pazzani, 1996] K.M. Ali, M.J. Pazzani. Error Reduction through Learning Multiple Descriptions.
Machine Learning, 24:173-202 (1996).
[Breiman et al., 1984] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and
Regression Trees. Wadsworth, Belmont, CA (1984).
[Breiman, 1996] L. Breiman. Bagging Predictors. Machine Learning, 24:123-140 (1996).
[Cohen, 1996] W.W. Cohen. Learning Trees and Rules with Set-valued Features. Proceedings of AAAI-96.
[Fayyad and Irani, 1992] U.M. Fayyad, K.B. Irani. On the Handling of Continuous-Valued Attributes on
Decision Tree Generation. Machine Learning, 8: 87-102 (1992).
[Kalles, 1994] D. Kalles. Decision trees and domain knowledge in pattern recognition. Phd Thesis,
University of Manchester (1994).
[Kerber, 1992] R. Kerber. Chimerge: Discretization of numeric attributes. Proceedings of the 10th National
Conference on Artificial Intelligence, San Jose, CA, MIT Press, 123-128 (1992).
[Kwok and Carter, 1990] S.W. Kwok, C. Carter. Multiple Decision Trees. Proceedings of the Uncertainty
in Artificial Intelligence 4. R.D. Shachter, T.S. Levitt, L.N. Kanal, J.F. Lemmer (Editors) (1990).
[Martin and Billman, 1994] J.D. Martin, D.O. Billman. Acquiring and Combining Overlapping Concepts.
Machine Learning, 16:121-160 (1994).
[Quinlan, 1986] J.R. Quinlan. Induction of Decision Trees, Machine Learning, 1:81-106 (1986).
[Quinlan, 1987] J.R. Quinlan. Decision trees as probabilistic classifiers. In Proceedings of the 4th
International Workshop on Machine Learning, pages 31-37, Irvine, CA, June 1987.
[Quinlan, 1993] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA
(1993).
[Utgoff, 1997] P.E. Utgoff. Decision Tree Induction Based on Efficient Tree Restructuring, Machine
Learning, 29:5-44 (1997).
[Blake et al., 1998] C. Blake, E. Keogh, and J. Merz. UCI Repository of machine learning databases [http://
www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of
Information and Computer Science (1998).







A Appendix on Accuracy Results
The following charts demonstrate the accuracy results for the tested databases.

[Charts: classification accuracy (%) versus experiment number (0-64), one panel each for the Ecoli, Wine,
Yeast, Crx, Pima, Glass, Abalone and Adult databases.]



B Appendix on Speed Results
The following charts demonstrate the speed results for the tested databases. Triangles represent splitting
time. ITI results are presented with one straight line.
[Charts: tree induction time in seconds versus experiment number, one panel each for the Ecoli, Wine,
Yeast, Crx, Pima, Glass, Abalone and Adult databases.]



C Appendix on Size Results
The following charts demonstrate the size results (number of leaves) for the tested databases.
[Charts: number of leaves versus experiment number, one panel each for the Ecoli, Wine, Yeast, Crx, Pima,
Glass, Abalone and Adult databases.]
 ____________________________________________________________________________________
 TECHNICAL REPORT No. 99.03.01                                                                                                                                                                          12
