DBM630: Data Mining and Data Warehousing
MS.IT., Rangsit University, Semester 2/2011

Lecture 4: Data Mining Concepts
Data Preprocessing and Postprocessing

by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
       Data Mining vs. Machine Learning vs. Statistics
       Instances with attributes and concepts (input)
       Knowledge representation (output)
       Why do we need data preprocessing and postprocessing?
       Engineering the input
           Data cleaning
           Data integration
           Data transformation and data reduction
       Engineering the output
           Combining multiple models

Data Mining vs. Machine Learning
       We are overwhelmed with electronic/recorded data; how can we discover knowledge from such data?
       Data Mining (DM) is the process of discovering patterns in data. The process must be automatic or semi-automatic.
       Many of the techniques used have been developed within a field known as Machine Learning (ML).
       DM is a practical topic and involves learning in a practical, not a theoretical, sense, while ML focuses on the theoretical side.
       DM is for gaining knowledge, not just good prediction.
       DM = ML + topic-oriented + knowledge-oriented
DM&ML vs. Statistics
       DM = Statistics + Marketing
       Machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses.
       Statistics has been more concerned with testing hypotheses.
       Very similar schemes have been developed in parallel in machine learning and statistics, e.g., decision tree induction, classification and regression trees, nearest-neighbor methods.
       Most learning algorithms use statistical tests when constructing rules or trees and for correcting models that are "overfitted", i.e., that depend too strongly on the details of the particular examples used to build the model.
Generalization as Search
       An aspect that distinguishes ML from statistical approaches is the view of generalization as a search through a space of possible concept descriptions for one that fits the data.
       Three properties are important in characterizing a machine learning process:
           language bias: the concept description language, e.g., decision trees, classification rules, association rules
           search bias: the order in which the space is explored, e.g., greedy search, beam search
           overfitting-avoidance bias: the way overfitting to the particular training data is avoided, e.g., forward pruning or backward pruning
An Example of Structural Patterns
Part of a structural description of the contact lens data might be as follows. One path in the decision tree reads: age = young, spectacle prescription = myope, astigmatism = no, tear production rate = reduced -> recommended lenses = none.

The contact lens data:

      age            | spectacle prescription | astigmatism | tear prod. rate | recom. lenses
      young          | myope        | no  | reduced | none
      young          | myope        | no  | normal  | soft
      young          | myope        | yes | reduced | none
      young          | myope        | yes | normal  | hard
      young          | hypermetrope | no  | reduced | none
      young          | hypermetrope | no  | normal  | soft
      young          | hypermetrope | yes | reduced | none
      young          | hypermetrope | yes | normal  | hard
      pre-presbyopic | myope        | no  | reduced | none
      pre-presbyopic | myope        | no  | normal  | soft
      pre-presbyopic | myope        | yes | reduced | none
      pre-presbyopic | myope        | yes | normal  | hard
      pre-presbyopic | hypermetrope | no  | reduced | none
      pre-presbyopic | hypermetrope | no  | normal  | soft
      pre-presbyopic | hypermetrope | yes | reduced | none
      pre-presbyopic | hypermetrope | yes | normal  | none
      presbyopic     | myope        | no  | reduced | none
      presbyopic     | myope        | no  | normal  | none
      presbyopic     | myope        | yes | reduced | none
      presbyopic     | myope        | yes | normal  | hard
      presbyopic     | hypermetrope | no  | reduced | none
      presbyopic     | hypermetrope | no  | normal  | soft
      presbyopic     | hypermetrope | yes | reduced | none
      presbyopic     | hypermetrope | yes | normal  | none

All combinations of possible attribute values = 3 x 2 x 2 x 2 = 24 possibilities.

If tear_production_rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
Input: Concepts, Instances & Attributes
       Concept description
           the thing that is to be learned (the learning result)
           hard to pin down precisely, but should be intelligible and operational
       Instances ('examples', referred to as input)
           the information the learner is given
           a single table vs. multiple tables (denormalization into a single table)
           denormalization sometimes produces apparent regularities, such as a supplier and its address always occurring together
       Attributes (features)
           each instance is characterized by a fixed, predefined set of features or attributes
Input: Concepts, Instance & Attributes
   Attributes may be ordinal (outlook), numeric (temp., humidity), or nominal (windy, Sponsor); the concepts are play-time (numeric) and play (nominal).

      outlook  | temp. | humidity | windy | Sponsor | play-time | play
      sunny    | 85 | 87 | True  | Sony  | 85 | Y
      sunny    | 80 | 90 | False | HP    | 90 | Y
      overcast | 87 | 75 | True  | Ford  | 63 | Y
      rainy    | 70 | 95 | True  | Ford  |  5 | N
      rainy    | 75 | 65 | False | HP    | 56 | Y
      sunny    | 90 | 94 | True  | ?     | 25 | N
      rainy    | 65 | 86 | True  | Nokia |  5 | N
      overcast | 88 | 92 | True  | Honda | 86 | Y
      rainy    | 79 | 75 | False | Ford  | 78 | Y
      overcast | 85 | 88 | True  | Sony  | 74 | Y

   Each row is an instance (example); the "?" in the Sponsor column is a missing value.
Independent vs. Dependent Instances
       Normally, the input data are represented as a set of independent instances.
       But many problems involve relationships between objects; that is, some instances depend on others.
       Ex.: a family tree and the sister-of relation, under the closed-world assumption.

   [Family tree: Harry (M) and Sally (F) are the parents of Steven (M), Bruce (M), and Demi (F); Richard (M) and Julia (F) are the parents of Tison (M), Diana (F), and Bill (M); Tison and Demi are the parents of Nina (F) and Rica (F).]

      first person | second person | sister-of
      Harry  | Sally  | N
      Harry  | Steven | N
      Steven | Peter  | N
      Steven | Bruce  | N
      Steven | Demi   | Y
      Bruce  | Demi   | Y
      Tison  | Diana  | Y
      Bill   | Diana  | Y
      Nina   | Rica   | Y
      Rica   | Nina   | Y
      all the rest   | N

Independent vs. Dependent Instances
   The same family tree flattened into two tables:

      name    | gender | parent1 | parent2
      Harry   | Male   | ?       | ?
      Sally   | Female | ?       | ?
      Richard | Male   | ?       | ?
      Julia   | Female | ?       | ?
      Steven  | Male   | Harry   | Sally
      Bruce   | Male   | Harry   | Sally
      Demi    | Female | Harry   | Sally
      Tison   | Male   | Richard | Julia
      Diana   | Female | Richard | Julia
      Bill    | Male   | Richard | Julia
      Nina    | Female | Tison   | Demi
      Rica    | Female | Tison   | Demi

      first person | second person | sister-of
      Steven | Demi  | Y
      Bruce  | Demi  | Y
      Tison  | Diana | Y
      Bill   | Diana | Y
      Nina   | Rica  | Y
      Rica   | Nina  | Y
      all the rest   | N

   The relation can be expressed as a rule:

      sister_of(X,Y) :- female(Y),
                        parent(Z,X),
                        parent(Z,Y).

   Denormalization joins the two tables into one:

      first person | gender | parent1 | parent2 | second person | gender | parent1 | parent2 | sister
      Steven | Male   | Harry   | Sally | Demi  | Female | Harry   | Sally | Y
      Bruce  | Male   | Harry   | Sally | Demi  | Female | Harry   | Sally | Y
      Tison  | Male   | Richard | Julia | Diana | Female | Richard | Julia | Y
      Bill   | Male   | Richard | Julia | Diana | Female | Richard | Julia | Y
      Nina   | Female | Tison   | Demi  | Rica  | Female | Tison   | Demi  | Y
      Rica   | Female | Tison   | Demi  | Nina  | Female | Tison   | Demi  | Y
      all the rest | ... | ... | ... | ... | ... | ... | ... | N
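
   The same join can be sketched in a few lines of pandas (the DataFrame and column names here are illustrative, not part of the lecture):

    import pandas as pd

    # The flattened family table from the slide (children only).
    people = pd.DataFrame(
        [("Steven", "Male", "Harry", "Sally"), ("Bruce", "Male", "Harry", "Sally"),
         ("Demi", "Female", "Harry", "Sally"), ("Tison", "Male", "Richard", "Julia"),
         ("Diana", "Female", "Richard", "Julia"), ("Bill", "Male", "Richard", "Julia"),
         ("Nina", "Female", "Tison", "Demi"), ("Rica", "Female", "Tison", "Demi")],
        columns=["name", "gender", "parent1", "parent2"])

    # Self-join on the shared parents: every pair of rows with the same
    # parent1/parent2 becomes one denormalized row.
    pairs = people.merge(people, on=["parent1", "parent2"],
                         suffixes=("_first", "_second"))
    pairs = pairs[pairs["name_first"] != pairs["name_second"]]

    # sister-of holds when the second person of the pair is female;
    # the remaining sibling pairs get N, matching "all the rest: N".
    pairs["sister"] = (pairs["gender_second"] == "Female").map({True: "Y", False: "N"})
    print(pairs[["name_first", "name_second", "sister"]])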
Problems of Denormalization
    A large table with many duplicated values.
    Relations among instances (rows) are ignored.
    Some regularities in the data are merely reflections of the original database structure but might be "found" by the data mining process, e.g., supplier and supplier address.
    Some relations are not finite, e.g., the ancestor-of relation. Inductive logic programming can use recursion to deal with such situations (an infinite number of possible instances):
                     If person1 is a parent of person2
                          then person1 is an ancestor of person2
                     If person1 is a parent of person2 and
                        person2 is an ancestor of person3
                          then person1 is an ancestor of person3
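
    To make the recursion concrete, here is a small sketch in plain Python (the parent facts are taken from the family tree above; the function and variable names are illustrative):

    # Parent facts from the family tree: (parent, child).
    parent_of = [("Harry", "Steven"), ("Harry", "Bruce"), ("Harry", "Demi"),
                 ("Tison", "Nina"), ("Tison", "Rica"), ("Demi", "Nina")]

    def ancestors(person, facts):
        """Recursively collect all ancestors of `person`.

        Base case: every parent is an ancestor.
        Recursive case: an ancestor of a parent is also an ancestor.
        """
        direct = {p for (p, c) in facts if c == person}
        result = set(direct)
        for p in direct:
            result |= ancestors(p, facts)  # recursion replaces an unbounded rule chain
        return result

    print(ancestors("Nina", parent_of))  # {'Tison', 'Demi', 'Harry'}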

Missing, Inaccurate, and Duplicated Values
    Many practical datasets include three types of errors:
        Missing values
           frequently indicated by out-of-range entries (e.g., a negative number)
           unknown vs. unrecorded vs. irrelevant values
        Inaccurate values
           typographical errors: misspelling, mistyping
           measurement errors: errors generated by a measuring device
           deliberate errors, e.g., entering the zip code of the rental agency instead of the renter's zip code
        Duplicated values
           repetition of data gives that data more influence on the result
Output: Knowledge Representation
    There are many different ways of representing the patterns that machine learning can discover. Some popular ones are:
        Decision tables
        Decision trees
        Classification rules
        Association rules
        Rules with exceptions
        Rules involving relations
        Trees for numeric prediction
        Instance-based representation
        Clusters

Decision Tables
    The simplest, most rudimentary way of representing the output from machine learning or data mining.
    Ex.: a decision table for the weather data to decide whether or not to "play":

      outlook  | temp. | humidity | windy | Sponsor | play-time | play
      sunny    | hot  | high   | True  | Sony  | 85 | Y
      sunny    | hot  | high   | False | HP    | 90 | Y
      overcast | hot  | normal | True  | Ford  | 63 | Y
      rainy    | mild | high   | True  | Ford  |  5 | N
      rainy    | cool | low    | False | HP    | 56 | Y
      sunny    | hot  | low    | True  | Sony  | 25 | N
      rainy    | cool | normal | True  | Nokia |  5 | N
      overcast | mild | high   | True  | Honda | 86 | Y
      rainy    | mild | low    | False | Ford  | 78 | Y
      overcast | hot  | high   | True  | Sony  | 74 | Y

    Two issues: (1) how to make a smaller, condensed table in which useless attributes are omitted, and (2) how to cope with a case that does not exist in the table.

Decision Trees (1)
    A "divide-and-conquer" approach to the problem of learning.
    Ex.: a decision tree (DT) for the contact lens data to decide which type of contact lens is suitable:

      Tear production rate?
         reduced -> none
         normal  -> Astigmatism?
            no  -> soft
            yes -> Spectacle prescription?
               myope        -> hard
               hypermetrope -> none
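
    Read as nested tests, the tree above could be hand-coded as in this minimal sketch (function and argument names are illustrative):

    def recommend_lens(tear_rate, astigmatism, prescription):
        """Walk the contact lens decision tree from root to leaf."""
        if tear_rate == "reduced":       # root test
            return "none"
        if astigmatism == "no":          # second-level test
            return "soft"
        if prescription == "myope":      # third-level test
            return "hard"
        return "none"                    # hypermetrope branch

    print(recommend_lens("normal", "yes", "myope"))  # hard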
Decision Trees (2)
    Nodes in a DT involve testing a particular attribute against a constant. However, it is also possible to compare two attributes with each other, or to use some function of one or more attributes.
    If the attribute tested at a node is nominal, the number of children is usually the number of possible values of the attribute.
    In this case, the same attribute will not be tested again further down the tree.
    If instead the attribute's values are divided into two subsets, the attribute might be tested more than once on a path.
Decision Trees (3)
 If the attribute is numeric, the test at a node usually
  determines whether its value is greater or less than a
  predetermined constant.
 If a missing value is treated as an attribute value, there
  will be a third branch.
 An alternative is a three-way split into (1) less-than, equal-to,
  and greater-than, or (2) below, within, and above.



Classification Rules (1)
    A popular alternative to decision trees; also called a decision list.
    Ex.:  If outlook = sunny and humidity = high then play = yes
          If outlook = rainy and windy = true then play = no
          If outlook = overcast then play = yes

    (derived from the weather decision table shown above)
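
    A sketch of how such a decision list is applied: rules are tried in order and the first one that fires decides. The encoding of rules as Python functions and the default class are illustrative:

    # Each rule: (condition over an instance dict, predicted class).
    rules = [
        (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "yes"),
        (lambda x: x["outlook"] == "rainy" and x["windy"], "no"),
        (lambda x: x["outlook"] == "overcast", "yes"),
    ]

    def classify(instance, rules, default="no"):
        """Apply the rules in sequence; the first rule that fires decides."""
        for condition, label in rules:
            if condition(instance):
                return label
        return default  # no rule fired (an assumed fallback)

    print(classify({"outlook": "sunny", "humidity": "high", "windy": False}, rules))
    # -> 'yes'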

Classification Rules (2)
    A set of rules is interpreted in sequence.
    The antecedent (or precondition) is a series of tests, while the consequent (or conclusion) gives the class or classes that apply to instances covered by the rule.
    It is easy to read a set of rules directly off a decision tree, but the opposite direction is not quite so straightforward.
    Ex.: the replicated subtree problem. The rule set
        If a and b then x
        If c and d then x
    can only be expressed as a tree by replicating the subtree that tests c and d under the branches that test a and b.

    [Figure: a decision tree for these two rules, with the c-d subtree replicated.]

Classification Rules (3)
    One reason why classification rules are popular:
        Each rule seems to represent an independent "nugget" of knowledge.
        New rules can be added to an existing rule set without disturbing those already there (in the DT case, it is necessary to reshape the whole tree).
    If a rule set gives multiple classifications for a particular example, one solution is to give no conclusion at all.
    Another solution is to count how often each rule fires on the training data and go with the most popular one.
    A further problem occurs when an instance is encountered that the rules fail to classify at all.
        Solutions: (1) fail to classify, or (2) choose the most popular class.


Classification Rules (4)
 In a particularly straightforward situation, rules lead to a
  class that is boolean (y/n) and only rules leading to one
  outcome (say, yes) are expressed.
 This is a form of the closed-world assumption.
 The resulting rules cannot conflict, and there is no
  ambiguity in rule interpretation.
 Such a set of rules can be written as a logic expression in
  disjunctive normal form (a disjunction (OR) of
  conjunctive (AND) conditions).

Association Rules (1)
    Association rules are really no different from classification rules except that they can predict any attribute, not just the class.
    This gives them the freedom to predict combinations of attributes, too.
    Association rules (ARs) are not intended to be used together as a set, as classification rules are.
    Different ARs express different regularities that underlie the dataset, and they generally predict different things.
    Even from a small dataset, a large number of ARs can be generated, so some constraints are needed to find useful rules. The two most popular are (1) support and (2) confidence.
Association Rules (2)
    For a rule x -> y: support s = p(x,y), confidence c = p(x,y)/p(x). For example (over the weather table above):
        If temperature = hot then humidity = high (s = 3/10, c = 3/5)
        If windy = true and play = Y then humidity = high and outlook = overcast (s = 2/10, c = 2/4)
        If windy = true and play = Y and humidity = high then outlook = overcast (s = 2/10, c = 2/3)
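
    A small sketch that recomputes the support and confidence of the first rule from the weather table (the data is hard-coded from the slide):

    # Weather table rows: (outlook, temp, humidity, windy, play).
    data = [
        ("sunny", "hot", "high", True, "Y"), ("sunny", "hot", "high", False, "Y"),
        ("overcast", "hot", "normal", True, "Y"), ("rainy", "mild", "high", True, "N"),
        ("rainy", "cool", "low", False, "Y"), ("sunny", "hot", "low", True, "N"),
        ("rainy", "cool", "normal", True, "N"), ("overcast", "mild", "high", True, "Y"),
        ("rainy", "mild", "low", False, "Y"), ("overcast", "hot", "high", True, "Y"),
    ]

    # Rule: if temperature = hot then humidity = high.
    n = len(data)
    antecedent = [row for row in data if row[1] == "hot"]
    both = [row for row in antecedent if row[2] == "high"]

    support = len(both) / n                   # p(x,y)        = 3/10
    confidence = len(both) / len(antecedent)  # p(x,y)/p(x)   = 3/5
    print(f"s = {support}, c = {confidence}")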

Rules with Exception (1)
    For classification rules, incremental modifications can be made to a rule set by expressing exceptions to existing rules rather than by reengineering the entire set. Ex.:
        If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor

    A new case arrives:

      sepal length | sepal width | petal length | petal width | type
      5.1 | 3.5 | 2.6 | 0.2 | Iris-setosa

    The rule is patched with an exception rather than rewritten:
        If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor
          EXCEPT if petal-width < 1.0 then Iris-setosa
    Of course, we can have exceptions to the exceptions, exceptions to those, and so on.

Rules with Exception (2)
    Rules with exceptions can be used to represent the entire concept description in the
     first place.
    Ex.:
              Default: Iris-setosa
              except if petal-length >= 2.45 and petal-length < 5.355 and
                              petal-width < 1.75
                           then Iris-versicolor
                              except if petal-length >= 4.95 and petal-width < 1.55
                                   then Iris-virginica
                                   else if sepal-length < 4.95 and sepal-width >=2.45
                                          then Iris-virginica
                            else if petal-length >= 3.35
                               then Iris-virginica
                                  except if petal-length < 4.85 and sepal-length<5.95
                                          then Iris-versicolor

Rules with Exception (3)
 Rules with exceptions can be proved to be logically
  equivalent to if-else statements.
 It is plausible, though, that an expression in terms of
  (common) rules and (rare) exceptions is easier for the
  user to grasp than a plain if-else structure.




Rules involving relations (1)
    So far the conditions in rules have involved testing an attribute value against a constant.
    Such rules are called propositional (as in propositional calculus).
    However, there are situations where a more expressive form of rule would provide a more intuitive and concise concept description.
    Ex.: the concept of standing up.
        There are two classes: standing and lying.
        The information given is the width, height, and number of sides of each block.

    [Figure: blocks labeled standing and lying.]

Rules involving relations (2)
    A propositional rule set produced for this data might be:
        If width >= 3.5 and height < 7.0 then lying
        If height >= 3.5 then standing
    A rule set using relations is more concise:
        If width(b) > height(b) then lying
        If height(b) > width(b) then standing

      width | height | sides | class
        2   |   4    |   4   | standing
        3   |   6    |   4   | standing
        4   |   3    |   4   | lying
        7   |   8    |   3   | standing
        7   |   6    |   3   | lying
        2   |   9    |   3   | standing
        9   |   1    |   4   | lying
       10   |   2    |   3   | lying
Trees for numeric prediction
 Instead of predicting categories, predicting numeric
  quantities is also very important.
 We can use a regression equation.
 There are two further knowledge representations:
  the regression tree and the model tree.
        Regression trees are decision trees with averaged numeric
         values at the leaves.
        It is possible to combine regression equations with
         regression trees. The resulting model is a model tree, a tree
         whose leaves contain linear expressions.

An example of numeric prediction
CPU performance (numeric prediction)

Linear regression equation:

    PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

Sample of the data (cycle time MYCT; main memory min/max MMIN/MMAX; cache CACH in KB; channels min/max CHMIN/CHMAX; performance PRP):

      #   | MYCT | MMIN | MMAX  | CACH | CHMIN | CHMAX | PRP
      1   | 125  | 256  | 6000  | 256  | 16    | 128   | 198
      2   | 29   | 8000 | 32000 | 32   | 8     | 32    | 269
      3   | 29   | 8000 | 32000 | 32   | 8     | 32    | 220
      4   | 29   | 8000 | 32000 | 32   | 8     | 32    | 172
      5   | 29   | 8000 | 16000 | 32   | 8     | 16    | 132
      ... | ...  | ...  | ...   | ...  | ...   | ...   | ...
      207 | 125  | 2000 | 8000  | 0    | 2     | 14    | 52
      208 | 480  | 512  | 8000  | 32   | 0     | 0     | 67
      209 | 480  | 1000 | 4000  | 0    | 0     | 0     | 45

[Figure: a regression tree splitting on CHMIN, CACH, MMAX, MYCT, and MMIN, with averaged PRP values at the leaves, e.g., 19.3, 29.8, 37.3, 18.3, 59.3, 75.7, 133, 157, 281, 492, 783.]

[Figure: a model tree splitting on CHMIN, CACH, and MMAX, with linear models LM1-LM6 at the leaves:]

    LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
    LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
    LM3: PRP = 38.1 + 0.12 MMIN
    LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX
    LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
    LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
Instance-based representation (1)
    The simplest form of learning is plain memorization.
    When a new instance is encountered, memory is searched for the training instance that most strongly resembles it.
    This is a completely different way of representing the "knowledge" extracted from a set of instances: just store the instances themselves, and operate by relating new instances whose class is unknown to existing ones whose class is known.
    Instead of creating rules, work directly from the examples themselves.
Instance-based representation (2)
    Instance-based learning is lazy, deferring the real work as long
     as possible.
    Other methods are eager, producing a generalization as soon
     as the data has been seen.
    In instance-based learning, each new instance is compared with
     existing ones using a distance metric, and the closest existing
     instance is used to assign the class to the new one. This is also
     called the nearest-neighbor classification method.
    Sometimes more than one nearest neighbor is used, and the
     majority class of the closest k neighbors is assigned to the new
     instance. This is termed the k-nearest-neighbor method.


Instance-based representation (3)
    When computing the distance between two examples, the standard Euclidean distance may be used.
    When nominal attributes are present, the following procedure may be used:
        a distance of 0 is assigned if the values are identical; otherwise the distance is 1.
    Some attributes will be more important than others, so some kind of attribute weighting is needed. Deriving suitable attribute weights from the training set is a key problem.
    It may not be necessary, or desirable, to store all the training instances:
        to reduce the nearest-neighbor calculation time
        to avoid unrealistic storage requirements
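
    A minimal 1-nearest-neighbor sketch with this mixed numeric/nominal distance (plain Python; the instances reuse the weather data, and all names are illustrative; in practice the numeric attributes would be normalized first so they do not dominate):

    import math

    # (outlook, temp, humidity, play): one nominal and two numeric attributes.
    train = [("sunny", 85, 87, "Y"), ("rainy", 70, 95, "N"),
             ("overcast", 87, 75, "Y"), ("rainy", 65, 86, "N")]

    def distance(a, b):
        """Euclidean over numeric attributes; 0/1 mismatch for the nominal one."""
        nominal = 0.0 if a[0] == b[0] else 1.0
        numeric = (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2
        return math.sqrt(nominal ** 2 + numeric)

    def nearest_neighbor_class(query, train):
        best = min(train, key=lambda row: distance(query, row[:3]))
        return best[3]  # class of the closest stored instance

    print(nearest_neighbor_class(("sunny", 80, 90), train))  # 'Y'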

Instance-based representation (4)
 Generally, some regions of attribute space are more
  stable with regard to class than others, and just a few
  exemplars are needed inside stable regions.
 An apparent drawback of instance-based
  representations is that they do not make explicit the
  structures that are learned.

    [Figure: three panels (a), (b), (c) illustrating instance-based class regions in attribute space.]

Clusters
    The output takes the form of a diagram that shows how the instances fall into clusters.
    The simplest case involves associating a cluster number with each instance (Fig. a).
    Some clustering algorithms allow one instance to belong to more than one cluster, drawn as a Venn diagram (Fig. b).
    Some algorithms associate instances with clusters probabilistically rather than categorically (Fig. c), e.g.:

      instance | cluster 1 | cluster 2 | cluster 3
      a | 0.4 | 0.3 | 0.3
      b | 0.6 | 0.3 | 0.1
      c | 0.1 | 0.4 | 0.5
      d | 0.5 | 0.2 | 0.3
      e | 0.6 | 0.3 | 0.1
      f | 0.4 | 0.1 | 0.5
      g | 0.1 | 0.4 | 0.5
      h | 0.2 | 0.7 | 0.1

    Other algorithms produce a hierarchical structure of clusters, called a dendrogram (Fig. d).
    Clustering may be combined with other learning methods for better performance.

    [Figures a, b, d: hard cluster assignment, overlapping clusters, and a dendrogram over instances a-h.]
Why Data Preprocessing? (1)
    Data in the real world is dirty:
        incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. Ex: occupation = ""
        noisy: containing errors or outliers. Ex: salary = "-10"
        inconsistent: containing discrepancies in codes or names. Ex: Age = "42" but Birthday = "01/01/1997"; ratings were "1, 2, 3" but are now "A, B, C"
    No quality data, no quality mining results!
        Quality decisions must be based on quality data.
        A data warehouse needs consistent integration of quality data.
Why Data Preprocessing? (2)
    To integrate multiple sources of data into a more meaningful whole.
    To transform data into a form that makes sense and is more descriptive.
    To reduce the size, (1) in cardinality and/or (2) in variety, in order to improve computational time and accuracy.

                   Multi-Dimensional Measure of Data Quality
            A well-accepted multidimensional view:
           •   Accuracy                      •    Believability
           •   Completeness                  •    Value added
           •   Consistency                   •    Interpretability
           •   Timeliness                    •    Accessibility
Major Tasks in Data Preprocessing
    Data cleaning
        Fill in missing values, smooth noisy data, identify or remove
         outliers, and resolve inconsistencies
    Data integration
        Integration of multiple databases, data cubes, or files
    Data transformation and data reduction
        Normalization and aggregation
        Obtains reduced representation in volume but produces the
         same or similar analytical results
        Data discretization: data reduction, especially for numerical
         data

Forms of Data Preprocessing

   [Figure: the chain of preprocessing steps applied to the raw data:]

   Data Cleaning
   Data Integration
   Data Transformation
   Data Reduction

Data Cleaning
Topics in Data Cleaning
   Data cleaning tasks
       Fill in missing values
       Identify outliers and smooth out noisy data
       Correct inconsistent data
   Advanced techniques for automatic data cleaning
       Improving decision trees
       Robust regression
       Detecting anomalies



Missing Data
    Data is not always available
        e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
    Missing data may be due to
        equipment malfunction
        data deleted because it was inconsistent with other recorded data
        data not entered due to misunderstanding
        certain data not being considered important at the time of entry
        history or changes of the data not being registered
    Missing data may need to be inferred.

How to Handle Missing Data?
    Ignore the tuple: usually done when the class label is missing
    Fill in the missing value manually: tedious + infeasible?
    Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
    Use the attribute mean to fill in the missing value
    Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter
    Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
        The most popular; preserves the relationship between the missing attribute and the other attributes


How to Handle Missing Data? (Examples)

      outlook  | temp. | humidity | windy | Sponsor | play-time | play
      sunny    | 85 | 87 | True  | Sony  | 85 | Y
      sunny    | 80 | 90 | False | HP    | 90 | Y
      overcast | 87 | 75 | True  | Ford  | 63 | ?    <- (1) ignore the tuple (class label missing)
      rainy    | 70 | 95 | True  | Ford  |  5 | N
      rainy    | 75 | ?  | False | HP    | 56 | Y    <- (4) attribute mean: humidity = 86.9, or (5) class mean: humidity|play=Y = 86.4
      sunny    | 90 | 94 | True  | ?     | 25 | N    <- (3) add a new value "Unknown"
      rainy    | 65 | 86 | True  | Nokia |  5 | N
      overcast | 88 | 92 | True  | Honda | 86 | Y
      rainy    | 79 | 75 | False | Ford  | 78 | Y
      overcast | 85 | 88 | ?    | Sony  | 74 | Y    <- (2) check manually

   (6) Or predict the missing value with a Bayesian formula or a decision tree.
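
    A sketch of options (4) and (5) with pandas (column names follow the table; this is illustrative, not the lecture's own code):

    import pandas as pd

    df = pd.DataFrame({
        "humidity": [87, 90, 75, 95, None, 94, 86, 92, 75, 88],
        "play":     ["Y", "Y", "?", "N", "Y", "N", "N", "Y", "Y", "Y"],
    })

    # (4) Fill with the overall attribute mean.
    overall = df["humidity"].fillna(df["humidity"].mean())

    # (5) Smarter: fill with the mean of samples in the same class.
    by_class = df.groupby("play")["humidity"].transform(lambda s: s.fillna(s.mean()))

    print(overall[4], by_class[4])  # ~86.9 and 86.4, as on the slide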
Noisy Data
    Noise: random error or variance in a measured variable
    Incorrect attribute values may be due to
        faulty data collection instruments
        data entry problems
        data transmission problems
        technology limitations
        inconsistency in naming conventions
    Other data problems which require data cleaning:
        duplicate records
        incomplete data
        inconsistent data

How to Handle Noisy Data
   Binning method (Data smoothing):
       first sort data and partition into (equi-depth) bins
       then one can smooth by bin means, smooth by bin
        median, smooth by bin boundaries, etc.
   Clustering
       detect and remove outliers
   Combined computer and human inspection
       detect suspicious values and check by human
   Regression
       smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
    Equal-width (distance) partitioning:
        divides the range into N intervals of equal size: a uniform grid
        if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N
        the most straightforward approach
        but outliers may dominate the presentation (since the lowest/highest values are used)
        skewed (asymmetrical) data is not handled well
    Equal-depth (frequency) partitioning:
        divides the range into N intervals, each containing around the same number of samples
        good data scaling
        managing categorical attributes can be tricky

Binning Methods for Data Smoothing
    Sorted data for price (in dollars):
               4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34
    Partition into (equi-depth) bins (3 bins, depth = 4):
        Bin 1: 4, 8, 9, 15    (mean = 9, median = 8.5)
        Bin 2: 21, 21, 24, 25 (mean = 22.75, median = 22.5)
        Bin 3: 26, 27, 29, 34 (mean = 29, median = 28)
    Smoothing by bin means (each value is replaced by the mean of its bin):
        Bin 1: 9, 9, 9, 9
        Bin 2: 22.75, 22.75, 22.75, 22.75
        Bin 3: 29, 29, 29, 29
    Smoothing by bin medians (similarly, each value is replaced by the median of its bin):
        Bin 1: 8.5, 8.5, 8.5, 8.5
        Bin 2: 22.5, 22.5, 22.5, 22.5
        Bin 3: 28, 28, 28, 28
    Smoothing by bin boundaries (the minimum and maximum values in a bin are the bin boundaries; each value is replaced by the closer boundary):
        Bin 1: 4, 4, 4, 15
        Bin 2: 21, 21, 25, 25
        Bin 3: 26, 26, 26, 34
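
    A sketch of equi-depth binning and two of the smoothing variants in plain Python, reproducing the numbers above:

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34]  # already sorted

    def equi_depth_bins(values, n_bins):
        """Split sorted values into n_bins bins of (near-)equal counts."""
        depth = len(values) // n_bins
        return [values[i * depth:(i + 1) * depth] for i in range(n_bins)]

    def smooth_by_means(bins):
        """Replace every value in a bin by the bin mean."""
        return [[sum(b) / len(b)] * len(b) for b in bins]

    def smooth_by_boundaries(bins):
        """Replace every value by the closer of the bin's min/max boundary."""
        return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

    bins = equi_depth_bins(prices, 3)
    print(smooth_by_means(bins))       # [[9.0]*4, [22.75]*4, [29.0]*4]
    print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]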


Cluster Analysis




   [Figure: cluster analysis; clustering can detect and remove outliers.]


Regression
   [Figure: data points fitted by the regression line y = x + 1; an observed value Y1 at X1 is smoothed to the fitted value Y1' on the line.]

   Regression smooths data by fitting it to a regression function.
Automatic Data Cleaning
(Improving Decision Trees)
 Improving decision trees: relearn the tree with
  misclassified instances removed, or prune away
  some subtrees.
 A better strategy (of course): let a human expert check
  the misclassified instances.
 When systematic noise is present, it is better not to
  modify the data.
 Also, attribute noise should be left in the training set.
 (Unsystematic) class noise in the training set should be
  eliminated if possible.
Automatic Data Cleaning
(Robust Regression - I)
 Statistical methods that address the problem of outliers
  are called robust.
 Possible ways of making regression more robust:
       Minimize absolute error instead of squared error.
         Remove outliers (e.g., the 10% of points farthest from the
          regression plane).
       Minimize the median instead of the mean of squares (copes
        with outliers in any direction):
         finds the narrowest strip covering half the observations.
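
    A sketch comparing a squared-error fit with an absolute-error fit on data containing one outlier (numpy/scipy; the data is made up for illustration):

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.0, 3.9, 5.2, 20.0])   # the last point is an outlier

    def fit(loss):
        """Fit y ~ a + b*x by minimizing `loss` over the residuals."""
        res = minimize(lambda p: loss(y - (p[0] + p[1] * x)),
                       x0=[0.0, 0.0], method="Nelder-Mead")
        return res.x

    a2, b2 = fit(lambda r: np.sum(r ** 2))     # squared error: dragged by the outlier
    a1, b1 = fit(lambda r: np.sum(np.abs(r)))  # absolute error: more robust
    print(f"squared:  y = {a2:.2f} + {b2:.2f} x")
    print(f"absolute: y = {a1:.2f} + {b1:.2f} x")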



Automatic Data Cleaning
(Robust Regression - II)

   [Figure: regression lines fitted to data with outliers, comparing least-squares, least-absolute-error, and perpendicular-distance fits.]
Automatic Data Cleaning
(Detecting Anomalies)
 Visualization is the best way of detecting anomalies
  (but often can't be done).
 Automatic approach:
       use a committee of different learning schemes, e.g., a decision
        tree, a nearest-neighbor learner, and a linear discriminant
        function
       conservative approach: only delete instances that are
        incorrectly classified by all of them
       problem: this might sacrifice instances of small classes


Data Integration
Data Integration
    Data integration:
        combines data from multiple sources into a coherent store
    Schema integration
        integrate metadata from different sources
        entity identification problem: identify real-world entities from
         multiple data sources, e.g., how to match A.cust-num with
         B.customer-id
    Detecting and resolving data value conflicts
        for the same real-world entity, attribute values from different
         sources differ
        possible reasons: different representations, different scales,
         e.g., metric vs. British units

Handling Redundant Data in Data Integration
    Redundant data occur often when multiple databases are integrated:
        The same attribute may have different names in different databases.
        One attribute may be a "derived" attribute in another table, e.g., annual revenue.
    Some redundancies can be detected by correlation analysis. The correlation between attributes A and B is

        R_{A,B} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B}

    where the standard deviation is

        \sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n-1)}}

    If R_{A,B} > 0, A and B are positively correlated; if R_{A,B} = 0, they are independent; if R_{A,B} < 0, they are negatively correlated.
    Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
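
    A quick sketch of this correlation check with numpy (the data is illustrative):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    b = 2 * a + np.array([0.1, -0.2, 0.0, 0.3, -0.1])  # nearly a derived attribute

    # The formula above, with sample standard deviations (ddof=1).
    n = len(a)
    r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
    print(r)                        # close to 1: strong positive correlation
    print(np.corrcoef(a, b)[0, 1])  # the same value via numpy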

Data Transformation and Data Reduction

Data Transformation
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified
  range
       min-max normalization
       z-score normalization
       normalization by decimal scaling
   Attribute/feature construction
       New attributes constructed from the given ones
Data Transformation: Normalization
    min-max normalization:

        v' = \frac{v - v_{min}}{v_{max} - v_{min}}\,(new\_max - new\_min) + new\_min

    z-score normalization:

        v' = \frac{v - \bar{v}}{\sigma_v}, \quad \sigma_v = \sqrt{\frac{\sum (v - \bar{v})^2}{n-1}} = \sqrt{\frac{n \sum v^2 - (\sum v)^2}{n(n-1)}}

    normalization by decimal scaling:

        v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
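
    A sketch of the three normalizations in numpy (a new range of [0, 1] is assumed for min-max; the values are illustrative):

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # min-max normalization to a new range [new_min, new_max].
    def min_max(v, new_min=0.0, new_max=1.0):
        return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization: zero mean, unit (sample) standard deviation.
    def z_score(v):
        return (v - v.mean()) / v.std(ddof=1)

    # decimal scaling: divide by the smallest power of 10 making all |v'| < 1.
    def decimal_scaling(v):
        j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
        return v / 10 ** j

    print(min_max(v), z_score(v), decimal_scaling(v), sep="\n")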

Data Reduction
Data Reduction Strategies
    Warehouse may store terabytes of data: Complex data
     analysis/mining may take a very long time to run on the
     complete data
    Data reduction
        Obtains a reduced representation of the data set that is much
         smaller in volume but yet produces the same (or almost the
         same) analytical results
    Data reduction strategies
        Data cube aggregation (reduce rows)
        Dimensionality reduction (reduce columns)
        Numerosity reduction (reduce columns or values)
        Discretization / Concept hierarchy generation (reduce values)

Three Types of Data Reduction
    The three types of data reduction are:
        reduce the number of columns (features or attributes)
        reduce the number of rows (cases, examples, or instances)
        reduce the number of distinct values in a column (numeric or nominal)

    [Illustration: in the weather table, the rows are instances, the columns are attributes, and the cell entries are the values.]


Data Cube Aggregation
    Ex.: you are interested in annual sales rather than the total per quarter, so the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
        The resulting data set is smaller in volume, without loss of the information necessary
         for the analysis task
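
    A sketch of that aggregation with pandas (the sales figures are invented for illustration):

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010],
        "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
        "amount":  [224, 408, 350, 586, 310, 402, 391, 622],
    })

    # Aggregate away the quarter dimension: one row per year.
    annual = sales.groupby("year", as_index=False)["amount"].sum()
    print(annual)  # 2 rows instead of 8, with no loss for annual analysis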




Dimensionality Reduction
    Feature selection (i.e., attribute subset selection):
        Select a minimum set of features such that the probability distribution of the different classes, given the values of those features, is as close as possible to the original distribution given the values of all features.
        Fewer patterns result, and they are easier to understand.
    Heuristic methods (needed because the number of attribute subsets is exponential):
        decision-tree induction (wrapper approach)
        independent assessment (filter method)
        step-wise forward selection
        step-wise backward elimination
        combined forward selection and backward elimination
    A forward-selection sketch is given below.
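
    A sketch of step-wise forward selection in plain Python; `score` is a stand-in for any evaluation of a feature subset (e.g., cross-validated accuracy), and the toy scorer below is invented so the example runs:

    def forward_selection(features, score):
        """Greedily add the feature that most improves the subset score."""
        selected, best = [], score([])
        while True:
            candidates = [f for f in features if f not in selected]
            if not candidates:
                break
            # Try adding each remaining feature; keep the best addition.
            f_best = max(candidates, key=lambda f: score(selected + [f]))
            new = score(selected + [f_best])
            if new <= best:   # stop when no candidate improves the score
                break
            selected, best = selected + [f_best], new
        return selected

    # Toy scorer: pretend A4, A1, A6 are the useful attributes.
    useful = {"A4": 0.3, "A1": 0.2, "A6": 0.1}
    score = lambda subset: sum(useful.get(f, -0.05) for f in subset)
    print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score))
    # ['A4', 'A1', 'A6']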
Decision Tree Induction
(Wrapper Approach)
          Initial attribute set: {A1, A2, A3, A4, A5, A6}

          [Decision tree learned from the data: the root tests A4; its
          subtrees test A1 and A6, and the branches end in Class 1 /
          Class 2 leaves.]

          Reduced attribute set (the attributes that appear in the tree): {A1, A4, A6}
62                              Data Warehousing and Data Mining by Kritsada Sriphaew
Numerosity Reduction
   Parametric methods
       Assume the data fits some model, estimate model
        parameters, store only the parameters, and discard the
        data (except possible outliers)
        Log-linear models: obtain the value at a point in m-D space as
         the product of appropriate marginal subspaces (estimate
         the probability of each cell in a larger cuboid based on the
         smaller cuboids)
   Non-parametric methods
       Do not assume models
       Major families: histograms, clustering, sampling

65                                 Data Warehousing and Data Mining by Kritsada Sriphaew
Regression
   Linear regression: Y = a + bX
        Two parameters, a and b, specify the line and are estimated
         from the data at hand, by applying the least squares criterion
         to the known values (X1, Y1), (X2, Y2), ….

   Multiple regression: Y = a + b1X1 + b2X2.
       Many nonlinear functions can be transformed into the
        above.


66                                 Data Warehousing and Data Mining by Kritsada Sriphaew
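
A small sketch of fitting Y = a + bX by least squares with NumPy (the data points are invented):

```python
import numpy as np

# Fit Y = a + bX by least squares; store only (a, b) instead of the raw points.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(X, Y, deg=1)   # polyfit returns highest degree first
print(f"Y = {a:.2f} + {b:.2f} X")

# The two parameters reconstruct (approximately) every Y, so the
# original data values can be discarded apart from possible outliers.
print(a + b * X)
```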
Histograms
    A popular data reduction technique
    Divide data into buckets and store average (or sum) for
     each bucket
    Related to quantization problems.
      [Bar chart: counts per equal-width price bucket;
       x-axis 10,000 to 90,000, y-axis 0 to 40.]

    67                                  Data Warehousing and Data Mining by Kritsada Sriphaew
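
A sketch of histogram-based reduction: replace a numeric column by per-bucket boundaries, counts and means (NumPy assumed; the price values are illustrative):

```python
import numpy as np

# Reduce a numeric column to equal-width buckets, storing only each
# bucket's boundaries, count and mean instead of all raw values.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34], dtype=float)

counts, edges = np.histogram(prices, bins=3)    # three equal-width buckets
bucket_ids = np.digitize(prices, edges[1:-1])   # bucket index of each value
means = [prices[bucket_ids == i].mean() for i in range(3)]

for i in range(3):
    print(f"[{edges[i]:.0f}, {edges[i+1]:.0f}]: n={counts[i]}, mean={means[i]:.2f}")
```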
Clustering
 Partition data set into clusters, and one can store
  cluster representation only
 Can be very effective if data is clustered but not if
  data is “smeared (dirty)”
 Can have hierarchical clustering and be stored in
  multi-dimensional index tree structures
 There are many choices of clustering definitions and
  clustering algorithms.


 68                        Data Warehousing and Data Mining by Kritsada Sriphaew
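
A minimal sketch of clustering-based numerosity reduction with k-means, storing only the centroids and cluster sizes (scikit-learn assumed; the data is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

# Numerosity reduction by clustering: keep only each cluster's centroid
# (and size) as the stored representation of the data set.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
sizes = np.bincount(km.labels_)

for c, n in zip(km.cluster_centers_, sizes):
    print(f"centroid={np.round(c, 2)}, size={n}")   # 150 rows -> 3 summaries
```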
Sampling
 Allow a mining algorithm to run in complexity that is
  potentially sub-linear to the size of the data
 Choose a representative subset of the data
       Simple random sampling may have very poor performance
        in the presence of skew (bias)
   Develop adaptive sampling methods
        Stratified sampling:
         Approximate the percentage of each class (or
          subpopulation of interest) in the overall database
         Used in conjunction with skewed (biased) data


 69                                Data Warehousing and Data Mining by Kritsada Sriphaew
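
A small stratified-sampling sketch with pandas, drawing the same fraction from each class so a skewed class distribution is preserved in the sample (the toy table is invented):

```python
import pandas as pd

# Stratified sampling: sample the same fraction from every class, so a
# rare class keeps its share of the sample (column names illustrative).
df = pd.DataFrame({
    "feature": range(10),
    "play":    ["Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N", "N"],
})

sample = (df.groupby("play", group_keys=False)
            .apply(lambda g: g.sample(frac=0.5, random_state=0)))
print(sample["play"].value_counts())   # Y: 4, N: 1 -- proportions preserved
```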
Sampling

      [Figure: a random sample drawn from the raw data.]

70              Data Warehousing and Data Mining by Kritsada Sriphaew
Sampling

      [Figure: the raw data and a cluster/stratified sample drawn from it.]

71              Data Warehousing and Data Mining by Kritsada Sriphaew
Discretization and concept hierarchy generation
Discretization
   Three types of attributes:
       Nominal: values from an unordered set
       Ordinal: values from an ordered set
       Continuous: real numbers
   Discretization:
       divide the range of a continuous attribute into intervals
       Some classification algorithms only accept categorical
        attributes.
       Reduce data size by discretization
       Prepare for further analysis

72                                 Data Warehousing and Data Mining by Kritsada Sriphaew
Discretization and Concept hierarchy
   Discretization
       reduce the number of values for a given continuous
        attribute by dividing the range of the attribute into
        intervals. Interval labels can then be used to replace actual
        data values.
   Concept hierarchies
       reduce the data by collecting and replacing low level
        concepts (such as numeric values for the attribute age) by
        higher level concepts (such as young, middle-aged, or
        senior).

73                                 Data Warehousing and Data Mining by Kritsada Sriphaew
Discretization and Concept hierarchy generation
- numeric data
 Binning (see sections before)
 Histogram analysis (see sections before)
 Clustering analysis (see sections before)
 Entropy-based discretization
 Keywords:
       Supervised discretization
         Entropy-based discretization
       Unsupervised discretization
         Clustering, Binning, Histogram

 74                                Data Warehousing and Data Mining by Kritsada Sriphaew
Entropy-Based Discretization
    Given a set of samples S, if S is partitioned into two
     intervals S1 and S2 using boundary T, the entropy after
     partitioning is
       info(S,T) = (|S1|/|S|) × info(S1) + (|S2|/|S|) × info(S2)
    The boundary that minimizes the entropy function over
     all possible boundaries is selected as a binary
     discretization.
    The process is recursively applied to partitions obtained
     until some stopping criterion is met, e.g.,
                  info(S) - info(S,T) < threshold
    Experiments show that it may reduce data size and
     improve classification accuracy
    75                           Data Warehousing and Data Mining by Kritsada Sriphaew
Entropy-Based Discretization
     Ex.: the temperature attribute of the weather data (14 instances):

          temp: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
          play: y  n  y  y  y  n  y  n  y  y  n  y  y  n

     Entropy of a label distribution: info(X) = -sum_{i=1..N} p_i log2 p_i

     The candidate boundary Temp = 71.5 splits the instances into
     [4 yes, 2 no] on the left and [5 yes, 3 no] on the right:

          info([4,2], [5,3]) = (6/14) * info([4,2]) + (8/14) * info([5,3])
                             = 0.939 bits

     which is slightly below the entropy of the unsplit set,
     info([9,5]) = 0.940 bits, so the split yields a small information gain.

    76
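
The worked example can be reproduced with a short script (plain Python; the numbers match the slide):

```python
from math import log2

# One step of entropy-based discretization on the temperature data.
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["y","n","y","y","y","n","y","n","y","y","n","y","y","n"]

def info(lab):
    """Entropy of a list of class labels: -sum p_i log2 p_i."""
    n = len(lab)
    return -sum((c / n) * log2(c / n)
                for c in (lab.count("y"), lab.count("n")) if c)

def info_after_split(t):
    """Weighted entropy after splitting at boundary t."""
    left  = [l for v, l in zip(temps, labels) if v <= t]
    right = [l for v, l in zip(temps, labels) if v > t]
    n = len(labels)
    return len(left) / n * info(left) + len(right) / n * info(right)

print(round(info(labels), 3))            # 0.940 bits for [9 yes, 5 no]
print(round(info_after_split(71.5), 3))  # 0.939 bits for the T = 71.5 split
```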
Specification of a set of attributes (Concept
hierarchy generation)
    A concept hierarchy can be automatically generated
     based on the number of distinct values per attribute
     in the given attribute set. The attribute with the most
     distinct values is placed at the lowest level of the hierarchy.

          country                     15 distinct values
          province_or_state           65 distinct values
          city                     3,567 distinct values
          street                 674,339 distinct values

77                                Data Warehousing and Data Mining by Kritsada Sriphaew
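
A tiny sketch of this heuristic: sort the attributes by their distinct-value counts (counts taken from the slide) to obtain the hierarchy levels:

```python
# Order attributes into a hierarchy by distinct-value counts:
# fewer distinct values = higher conceptual level.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674_339,
}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)
for level, attr in enumerate(hierarchy):
    print(f"level {level}: {attr} ({distinct_counts[attr]:,} distinct values)")
```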
Why Postprocessing?
 To improve the acquired model (the mined knowledge)
 Techniques combine several mining approaches to find better results

      [Diagram: the Input Data is fed to Method 1 through Method N;
       their outputs are Combined into the Output Data.]

 78                        Data Warehousing and Data Mining by Kritsada Sriphaew
Combining Multiple Models                               Engineering the Output
(Overview)
   Basic idea of “meta” learning schemes: build
    different “experts” and let them vote
       Advantage: often improves predictive performance
       Disadvantage: produces output that is very hard to analyze
 Schemes we will discuss are bagging, boosting and
  stacking (or stacked generalization)
 These approaches can be applied to both nominal
   classification and numeric prediction


 79                               Data Warehousing and Data Mining by Kritsada Sriphaew
Combining Multiple Models
(Bagging - general)
    Employs the simplest way of combining predictions: voting/averaging
    Each model receives equal weight
    “Idealized” version of bagging:
        Sample several training sets of size n (instead of just having one
         training set of size n)
        Build a classifier for each training set
        Combine the classifiers’ predictions
    This improves performance in almost all cases if the learning
     scheme is unstable (e.g., decision trees)
    80                                Data Warehousing and Data Mining by Kritsada Sriphaew
Combining Multiple Models
(Bagging - algorithm)
   Model generation
        Let n be the number of instances in the training data.
        For each of t iterations:
           Sample n instances with replacement from the training set.
           Apply the learning algorithm to the sample.
           Store the resulting model.
    Classification
        For each of the t models:
           Predict the class of the instance using the model.
        Return the class that has been predicted most often.

81                                Data Warehousing and Data Mining by Kritsada Sriphaew
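
A minimal bagging sketch following this pseudocode, with a decision tree as the (unstable) base learner (scikit-learn and the iris data assumed for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
t, n = 25, len(X)

models = []
for _ in range(t):                       # model generation
    idx = rng.integers(0, n, size=n)     # sample n instances with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):                   # classification: majority vote
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return np.bincount(votes).argmax()

print(bagged_predict(X[0]), "vs true class", y[0])
```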
Combining Multiple Models
(Boosting - general)
    Also uses voting/averaging, but models are weighted
     according to their performance
       Iterative procedure: new models are influenced by
        performance of previously built ones
       New model is encouraged to become expert for instances
        classified incorrectly by earlier models
       Intuitive justification: models should be experts that
        complement each other

   (There are several variants of this algorithm)

82                               Data Warehousing and Data Mining by Kritsada Sriphaew
Combining Multiple Models
(Stacking - I)
 Hard to analyze theoretically: “black magic”
 Uses “meta learner” instead of voting to combine
  predictions of base learners
       Predictions of base learners (level-0 models) are used as
        input for meta learner (level-1 model)
 Base learners usually have different learning schemes
 Predictions on training data can’t be used to
  generate data for level-1 model!
       Cross-validation-like scheme is employed

 83                               Data Warehousing and Data Mining by Kritsada Sriphaew
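
A hedged stacking sketch: out-of-fold (cross-validated) predictions of two level-0 learners become the input features for a level-1 meta learner (scikit-learn assumed; the choice of learners is purely illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
level0 = [DecisionTreeClassifier(random_state=0), GaussianNB()]

# Out-of-fold predictions for each base learner, stacked as columns,
# so the level-1 data is not generated from the models' own training data.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in level0
])

meta = LogisticRegression(max_iter=1000).fit(meta_X, y)   # level-1 model
print("meta-learner training accuracy:", round(meta.score(meta_X, y), 3))
```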
Combining Multiple Models
(Stacking - II)

      [Figure: the predictions of the level-0 base learners are fed as
       input to the level-1 meta learner.]

84                    Data Warehousing and Data Mining by Kritsada Sriphaew

More Related Content

More from Tokyo Institute of Technology (12)

Lecture0: introduction Online Marketing
Lecture0: introduction Online MarketingLecture0: introduction Online Marketing
Lecture0: introduction Online Marketing
 
Lecture2: Marketing and Social Media
Lecture2: Marketing and Social MediaLecture2: Marketing and Social Media
Lecture2: Marketing and Social Media
 
Lecture1: E-Commerce Business Model
Lecture1: E-Commerce Business ModelLecture1: E-Commerce Business Model
Lecture1: E-Commerce Business Model
 
Lecture0: Introduction Social Commerce
Lecture0: Introduction Social CommerceLecture0: Introduction Social Commerce
Lecture0: Introduction Social Commerce
 
Dbm630 lecture10
Dbm630 lecture10Dbm630 lecture10
Dbm630 lecture10
 
Dbm630 lecture09
Dbm630 lecture09Dbm630 lecture09
Dbm630 lecture09
 
Dbm630 lecture08
Dbm630 lecture08Dbm630 lecture08
Dbm630 lecture08
 
Dbm630 lecture07
Dbm630 lecture07Dbm630 lecture07
Dbm630 lecture07
 
Dbm630 lecture06
Dbm630 lecture06Dbm630 lecture06
Dbm630 lecture06
 
Dbm630 lecture05
Dbm630 lecture05Dbm630 lecture05
Dbm630 lecture05
 
Coursesyllabus_dbm630
Coursesyllabus_dbm630Coursesyllabus_dbm630
Coursesyllabus_dbm630
 
Dbm630_lecture01
Dbm630_lecture01Dbm630_lecture01
Dbm630_lecture01
 

Recently uploaded

Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 

Recently uploaded (20)

Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 

Dbm630 lecture04

  • 1. DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Semester 2/2011 Lecture 4 Data Mining Concepts Data Preprocessing and Postprocessing by Kritsada Sriphaew (sriphaew.k AT gmail.com) 1
  • 2. Topics  Data Mining vs. Machine Learning vs. Statistics  Instances with attributes and concepts(input)  Knowledge Representation (output)  Why we need data preprocessing and postprocessing?  Engineering the input  Data cleaning  Data integration  Data transformation and data reduction  Engineering the output  Combining multiple models 2 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 3. Data Mining vs. Machine Learning  We are overwhelmed with electronic/recorded data, how we can discover the knowledge from such data.  Data Mining (DM) is a process of discovering patterns in data. The process must be automatic or semi-automatic.  Many techniques have been developed within a field known as Machine Learning (ML).  DM is a practical topic and involves learning in a practical, not a theoretical sense while ML focuses on theoretical one.  DM is for gaining knowledge, not just good prediction.  DM = ML + topic-oriented + knowledge-oriented 3 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 4. DM&ML vs. Statistics  DM = Statistics + Marketing  Machine learning has been more concerned with formulating the process of generalization as a search through possible hypothesis  Statistics has been more concerned with testing hypotheses.  Very similar schemes have been developed in parallel in machine learning and statistics, e.g., decision tree induction, classification and regression tree, nearest-neighbor methods.  Most learning algorithms use statistical tests when constructing rules or trees and for correcting models that are “overfitted” in that they depend too strongly on the details of particular examples used for building the model. 4 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 5. Generalization as Search  An aspect that distinguishes ML from statistical approaches, is a search process through a space of possible concept descriptions for one that fits the data.  Three properties that are important to characterize a machine learning process, are  language bias: the concept description language, e.g., decision tree, classification rule, association rules  search bias: the order in which the space is explored, e.g., greedy search, beam search  overfitting-avoidance bias: the way to avoid overfitting to the particular training data, e.g., forward pruning or backward pruning. 5 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 6. An Example of Structural Patterns  Part of a structural description of the age young Spectacle prescription myope astigmatism no Tear prod. rate reduced Recom. lenses none contact lens data might be as follows: young myope no normal soft Spectacle Tear prod. Recom. age prescription astigmatism rate lenses young myope yes reduced none presbyopic myope no reduced none young myope yes normal hard presbyopic myope no normal none young hypermetrope no reduced none presbyopic myope yes reduced none young hypermetrope no normal soft presbyopic myope yes normal hard young hypermetrope yes reduced none presbyopic hypermetrope no reduced none young hypermetrope yes normal hard presbyopic hypermetrope no normal soft Pre-presbyopic myope no reduced none presbyopic hypermetrope yes reduced none Pre-presbyopic myope no normal soft presbyopic hypermetrope yes normal none Pre-presbyopic myope yes reduced none Pre-presbyopic myope yes normal hard Pre-presbyopic hypermetrope Pre-presbyopic hypermetrope no no reduced normal none soft All combinations of possible values = 3x2x2x2= 24 possibilities Pre-presbyopic hypermetrope yes reduced none Pre-presbyopic hypermetrope yes normal none If tear_production_rate = reduced then recommendation = none Otherwise, if age = young and astigmatic = no then recommendation = soft 6 Knowledge Management and Discovery © Kritsada Sriphaew
  • 7. Input: Concepts, Instance & Attributes  Concept description  the thing that is to be learned (learning result)  hard to pin down precisely but  intelligible and operational  Instances (‘examples’ referred as input)  Information that the learner is given  A single table vs. multiple tables (denormalization to a single table)  Denormalization sometimes produces apparent regularities, such as supplier vs. supplier address do always match together.  Attribute (features)  Each instance is characterized by a fixed, predefined set of features or attributes 7 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 8. Input: Concepts, Instance & Attributes Attributes Concepts Ordinal Attr. Numeric Attr. Nominal Attr. Numeric Nominal outlook temp. humidity windy Sponsor play-time play sunny 85 87 True Sony 85 Y sunny 80 90 False HP 90 Y Instances (Examples) overcast 87 75 True Ford 63 Y rainy 70 95 True Ford 5 N rainy 75 65 False HP 56 Y sunny 90 94 True ? 25 N rainy 65 86 True Nokia 5 N overcast 88 92 True Honda 86 Y rainy 79 75 False Ford 78 Y Missing value overcast 85 88 True Sony 74 Y 8 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 9. Independent vs. Dependent Instances  Normally, the input data are represented as a set of independent instances.  But there are many problems involving relationship between objects. That is, some instances depend with the others.  Ex.: A family tree: the sister-of relation Close World Assumption first second sis first second sis person person ter person person ter Harry Sally Richard Julia Harry Sally N Steven Demi Y M F M F Harry Steven N Bruce Demi Y Tison Diana Y Steven Peter N Bill Diana Y Steven Bruce Demi Tison Diana Bill Steven Bruce N Nina Rica Y M M F M F M Steven Demi Y Rica Nina Y Bruce Demi Y All the rest N Nina Rica Rica Nina Y F F 9 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 10. Independent vs. Dependent Instances Harry Sally Richard Julia name gender parent1 parent2 first second sis M F M F person person ter Harry Male ? ? Sally Female Steven Demi Y Steven Bruce Demi Tison Diana Bill ? ? Bruce Demi Y M M F M F M Richard Male ? ? Julia Female ? ? Tison Diana Y Steven Male Harry Sally Bill Diana Y Nina Rica Bruce Male Harry Sally Nina Rica Y F F Demi Female Harry Sally Rica Nina Y Tison Male Richard Julia sister_of(X,Y) :- female(Y), Diana Female Richard Julia All the rest N parent(Z,X), Bill Male Richard Julia parent(Z,Y). Nina Female Tison Demi Rica Female Tison Demi Denormalization first second sister gender parent1 parent2 gender parent1 parent2 person person Steven Male Harry Sally Demi Female Harry Sally Y Bruce Male Harry Sally Demi Female Harry Sally Y Tison Male Richard Julia Diana Female Richard Julia Y Bill Male Richard Julia Diana Female Richard Julia Y Nina Female Tison Demi Rica Female Tison Demi Y Rica Female Tison Demi Nina Female Tison Demi Y All the rest N 10 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 11. Problems of Denormalization  A large table with duplication values included.  Relations among instances (rows) are ignored.  Some regularities in the data are merely reflections of the original database structure but might be found by the data mining process, e.g., supplier and supplier address.  Some relations are not finite, e.g., ancestor-of relations. Inductive logic programming can use recursion to deal with this situations (the infinite number of possible instances) If person1 is a parent of person2 then person1 is an ancestor of person2 If person1 is a parent of person2 and person2 is a parent of person3 then person1 is an ancestor of person3 11 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 12. Missing, Inaccurate, duplicated values  Many practical datasets may include three types of errors:  Missing values  frequently indicated by out-of-range entries (a negative number)  unknown vs. unrecorded vs. irrelevant values  Inaccurate values  typographical errors: misspelling, mistyping  measurement errors: errors generated by a measuring machine.  Intended errors: Ex.: input the zip code of the rental agency instead of the renter’s zip code.  Duplicated values  repetition of data gives such data more influence on the result. 12 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 13. Output: Knowledge Representation  There are many different ways for representing the patterns that can be discovered by machine learning. Some popular ones are:  Decision tables  Decision trees  Classification rules  Association rules  Rules with exceptions  Rules involving relations  Trees for numeric prediction  Instance-based representation  Clusters 13 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 14. Decision Tables  The simplest, most rudimentary way of representing the output from machine learning or data mining  Ex.: A decision table for the weather data to decide whether or not to “play” outlook temp. humidity windy Sponsor play-time play sunny hot high True Sony 85 Y (1) How to make a sunny hot high False HP 90 Y smaller, condensed overcast hot normal True Ford 63 Y table with some useless attributes rainy mild high True Ford 5 N are omitted. rainy cool low False HP 56 Y sunny hot low True Sony 25 N (2) How to cope with a rainy cool normal True Nokia 5 N case which does overcast mild high True Honda 86 Y not exist in the rainy mild low False Ford 78 Y table. overcast hot high True Sony 74 Y 14 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 15. Decision Trees (1)  A “divide-and-onquer” approach to the problem of learning.  Ex.: A decision tree (DT) for the contact lens data to decide which type of contact lens is suitable. Tear production rate reduced normal none astigmatism no yes soft Spectacle prescription myope hyperope hard none 15 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 16. Decision Trees (2)  Nodes in a DT involve testing a particular attribute with a constant. However, it is possible to compare two attributes with each other, or to utilized some function of one or more attributes.  If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible values of the attributes.  In this case, the same attribute will not be tested again further down the tree.  In the case that the attributes are divided into two subsets, the attribute might be tested more than one times in a path. 16 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 17. Decision Trees (3)  If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a predetermined constant.  If missing value is treated as an attribute value, there will be a third branch.  Three-way split into (1) less-than, equal-to and greater-than, or (2) below, within and above. 17 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 18. Classification Rules (1)  A popular alternative to decision trees. Also called a decision list.  Ex.: If outlook = sunny and humidity = high then play = yes If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes outlook temp. humidity windy Sponsor play-time play Decision Table sunny hot high True Sony 85 Y sunny hot high False HP 90 Y overcast hot normal True Ford 63 Y rainy mild high True Ford 5 N rainy cool low False HP 56 Y sunny hot low True Sony 25 N rainy cool normal True Nokia 5 N overcast mild high True Honda 86 Y rainy mild low False Ford 78 Y overcast hot high True Sony 74 Y 18 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 19. Classification Rules (2)  A set of rules is interpreted in sequence. a y n  The antecedent (or precondition) is a series of tests while the consequent (or y b c conclusion) gives the class or classes n y n x to the instances. c d y n  It is easy to read a set of rules directly n y off a decision trees but the opposite y d n x function is not quite straightforward. x  Ex.: replicated subtree problem  If a and b then x  If c and d then x 19 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 20. Classification Rules (3)  One reason why classification rules are popular:  Each rule seems to represent an independent “nugget” of knowledge.  New rules can be added to an existing rule set without disturbing those already there (In the DT case, it is necessary to reshaping the whole tree).  If a rule set gives multiple classifications for a particular example, one solution is to give no conclusion at all.  Another solution is to count how often each rule fires on the training data and go with the most popular one.  One more problem occurs when an instance is encountered that the rules fail to classify at all.  Solutions: (1) fail to classify, or (2) choose the most popular class 20 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 21. Classification Rules (4)  In a particularly straightforward situation, when rules lead to a class that is boolean (y/n) and when only rules leading to one outcome (say yes) are expressed  A form of closed world assumption.  The result rules cannot be conflict and there is no ambiguity in rule interpretation.  A set of rules can be written as a logic expression disjunctive normal form ( a disjunction (OR) of conjunctive (AND) conditions ). 21 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 22. Association Rules (1)  Association rules are really no different from classification rules except that they can predict any attribute, not just the class.  This gives them the freedom to predict combinations of attributes, too.  Association rules (ARs) are not intended to be used together as a set, as classification rules are  Different ARs express different regularities that underlies the dataset, and they generally predict different things.  From even a small dataset, a large number of ARs can be generated. Therefore, some constraints are needed for finding useful rules. Two most popular ones are (1) support and (2) confidence. 22 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 23. Association Rules (2)  For example, xy [ s = p(x,y), c = p(x,y)/p(x) ]  If temperature = hot then humidity = high (s=3/10,c=3/5)  If windy=true and play=Y then humidity=high and outlook=overcast (s=2/10, c=2/4)  If windy=true and play=Y and humidity=high then outlook=overcast (s=2/10, c=2/3) outlook temp. humidity windy Sponsor play-time play sunny hot high True Sony 85 Y sunny hot high False HP 90 Y overcast hot normal True Ford 63 Y rainy mild high True Ford 5 N rainy cool low False HP 56 Y sunny hot low True Sony 25 N rainy cool normal True Nokia 5 N overcast mild high True Honda 86 Y rainy mild low False Ford 78 Y overcast hot high True Sony 74 Y 23 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 24. Rules with Exception (1)  For classification rules, incremental modifications can be made to a rule set by expressing exceptions to existing rules rather than by reengineering the entire set. Ex.:  If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor Sepal length Sepal width Petal length Petal width type A new case 5.1 3.5 2.6 0.2 Iris-setosa  If petal-length >= 2.45 and petal-length < 4.45 then Iris- versicolor EXCEPT if petal-width < 1.0 then Iris-setosa  Of course, we can have exceptions to the exceptions, exceptions to these and so on. 24 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 25. Rules with Exception (2)  Rules with exceptions can be used to represent the entire concept description in the first place.  Ex.: Default: Iris-setosa except if petal-length >= 2.45 and petal-length < 5.355 and petal-width < 1.75 then Iris-versicolor except if petal-length >= 4.95 and petal-width < 1.55 then Iris-virginica else if sepal-length < 4.95 and sepal-width >=2.45 then Iris-virginica else if petal-length >= 3.35 then Iris-virginica except if petal-length < 4.85 and sepal-length<5.95 then Iris-versicolor 25 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 26. Rules with Exception (3)  Rules with exceptions can be proved to be logically equivalent to an if-else statements.  The user can see that it is plausible, the expression in terms of (common) rules and (rare) exceptions will be easier to grasp than a normal structure (if-else). 26 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 27. Rules involving relations (1)  So far the conditions in rules involve testing an attribute value against a constant.  This is called propositional (in propositional calculus).  Anyway, there are situation where a more expressive form of rule would provide more intuitive&concise concept description.  Ex.: the concept of standing up.  There are two classes: standing and lying.  The information given is the width, height and the number of sides of each block. standing lying 27 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 28. Rules involving relations (2)  A propositional rule set produced for this data might be  If width >= 3.5 and height < 7.0 then lying  If height >= 3.5 then standing  A rule set with relations that will be produced, is  If width(b)>height(b) then lying  If height(b)>width(b) then standing lying width height sides class 2 4 4 stand 3 6 4 stand 4 3 4 lying standing 7 8 3 stand 7 6 3 lying 2 9 3 stand 9 1 4 lying 10 2 3 lying 28 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 29. Trees for numeric prediction  Instead of predicting categories, predicting numeric quantities is also very important.  We can use regression equation.  There are two more knowledge representations: regression tree and model tree.  Regression trees are decision tree with averaged numeric values at the leaves.  It is possible to combine regression equations with regression trees. The result model is model tree, a tree whose leaves contain linear expressions. 29 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 30. An example of numeric prediction CPU performance (Numeric prediction)  PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + <=7.5 CHMIN >7.5  0.0056 MMAX + 0.6410 CACH - CACH MMAX <=8.5 >28 (8.5,28]  0.2700 CHMIN + 1.480 CHMAX 19.3 <=28000 >28000 CHMAX MMAX (28/8.7%) MMAX 157(21/73.7% ) <=58 >58 cycle main memory cache channels perfor <=2500 (2500,4250] >4250 <=10000>10000 time min max (Kb) min max mace 19.3 29.8 75.7 133 783 CACH MMIN (28/8.7%) (37/8.18%) (10/24.6%) (16/28.8%) (5/35.9%) MYCT MMIN MMAX CACH CHMIN CHMAX PRP <=0.5 <=12000 >12000 1 125 256 6000 256 16 128 198 (0.5,8.5] MYCT 2 29 8000 32000 32 8 32 269 59.3 281 492 3 29 8000 32000 32 8 32 220 <=550 >550 (24/16.9%) (11/56%) (7/53.9%) 4 29 8000 32000 32 8 32 172 37.3 18.3 5 29 8000 16000 32 8 16 132 (19/11.3%) (7/3.83%) … ... ... ... ... ... ... ... 207 125 2000 8000 0 2 14 52 Regression CHMIN 208 480 512 8000 32 0 0 67 <=7.5 >7.5 Tree 209 480 1000 4000 0 0 0 45 CACH MMAX <=8.5 >8.5 LM1: PRP = 8.29 + 0.004 MMAX +2.77 CHMIN <=28000 >28000 LM2: PRP = 20.3 + 0.004 MMIN -3.99 CHMIN MMAX LM4 LM5(21/45.5 LM6 + 0.946 CHMAX <=4250 >4250 (50/22.1%) %) (23/63.5%) LM3: PRP = 38.1 + 0.12 MMIN LM1 LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH (65/7.32%) CACH + 0.969 CHMAX <=0.5 (0.5,8.5] LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH LM2 LM3 - 9.39 CHMIN (26/6.37%) (24/14.5%) Model Tree LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN 30 + 4.98 CHMAX Data Warehousing and Data Mining by Kritsada Sriphaew
  • 31. Instance-based representation (1)  The simplest form of learning is plain memorization.  Encountering a new instance the memory is searched for the training instance that most strongly resembles the new one.  This is a completely different way of representing the “knowledge” extracted from a set of instances: just store the instances themselves and operate by relating new instances whose class is unknown to existing ones whose class is known.  Instead of creating rules, work directly from the examples themselves. 31 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 32. Instance-based representation (2)  Instance-based learning is lazy, deferring the real work as long as possible.  Other methods are eager, producing a generalization as soon as the data has been seen.  In instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is also called the nearest-neighbor classification method.  Sometimes more than one nearest neighbor is used, and the majority class of the closest k neighbors is assigned to the new instance. This is termed the k-nearest-neighbor method. 32 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 33. Instance-based representation (3)  When computing the distance between two examples, the standard Euclidean distance may be used.  When nominal attributes are present, we may use the following procedure.  A distance of 0 is assigned if the values are identical, otherwise the distance is 1.  Some attributes will be more important than others. We need some kinds of attribute weighting. To get suitable attribute weights from the training set is a key problem.  It may not be necessary, or desirable, to store all the training instances.  To reduce the nearest neighbor calculation time.  To reduce the unrealistic amounts of storages. 33 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 34. Instance-based representation (4)  Generally some regions of attribute space are more stable with regard to class than others, and just a few examples are needed inside stable regions.  An apparent drawback to instance-based representation is that they do not make explicit the structures that are learned. (a) (b) (c) 34 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 35. Clusters  The output takes the form of a diagram that shows how the instances fall into clusters.  The simplest case involving associating a cluster number with each instance (Fig. a).  Some clustering algorithm allow one instance to belong to more than one cluster, a Venn diagram (Fig. b).  Some algorithms associate instances with clusters probabilistically rather than categorically (Fig. c).  Other algorithms produce a hierarchical structure of clusters, called dendrograms (Fig. d).  Clustering may work with other learning methods for more performance. 1 2 3 g a 0.4 0.3 0.3 g b 0.6 0.3 0.1 h e a h e a c 0.1 0.4 0.5 d d d 0.5 0.2 0.3 c b f c b f e 0.6 0.3 0.1 f 0.4 0.1 0.5 g 0.1 0.4 0.5 (a) (b) h 0.2 0.7 0.1 a b c d e f g h 35 (c) (d)
  • 36. Why Data Preprocessing? (1)  Data in the real world is dirty  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. Ex: occupation =“”  noisy: containing errors or outliers. Ex: salary = “-10”  inconsistent: containing discrepancies in codes or names. Ex: Age=“42” but Birthday = “01/01/1997” Was rating “1,2,3” but now rating “A,B,C”  No quality data, no quality mining results!  Quality decisions must be based on quality data  Data warehouse needs consistent integration of quality data 36 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 37. Why Data Preprocessing? (2)  To integrate multiple sources of data to more meaningful one.  To transform data to the form that makes sense and is more descriptive  To reduce the size (1) in cardinality aspect and/or (2) in variety aspect in order to improve the computational time and accuracy Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: • Accuracy • Believability • Completeness • Value added • Consistency • Interpretability • Timeliness • Accessibility 37 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 38. Major Tasks in Data Preprocessing  Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data integration  Integration of multiple databases, data cubes, or files  Data transformation and data reduction  Normalization and aggregation  Obtains reduced representation in volume but produces the same or similar analytical results  Data discretization: data reduction, especially for numerical data 38 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 39. Forms of Data Preprocessing Data Cleaning Data Integration Data Transformation Data Reduction 39 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 40. Data Cleaning Topics in Data Cleaning  Data cleaning tasks  Fill in missing values  Identify outliers and smooth out noisy data  Correct inconsistent data  Advanced techniques for automatic data cleaning  Improving decision tree  Robust regression  Detecting anomalies 40 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 41. Missing Data  Data is not always available  e.g., many tuples have no recorded value for several attributes, such as customer income in sales data  Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data  Missing data may need to be inferred. 41 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 42. How to Handle Missing Data?  Ignore the tuple: usually done when class label is missing  Fill in the missing value manually: tedious + infeasible?  Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!  Use the attribute mean to fill in the missing value  Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter  Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree  The most popular, preserve relationship between missing attributes and other attributes 42 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 43. How to Handle Missing Data? (Examples) Attributes Concepts outlook temp. humidity windy Sponsor play-time play sunny 85 87 True Sony 85 Y 1 sunny 80 90 False HP 90 Y ignore overcast 87 75 True Ford 63 ? 4 rainy 70 95 True Ford 5 N humid = 86.9 rainy 75 ? False HP 56 Y humid|play=y 5 sunny 90 94 True ? 25 N = 86.4 rainy 65 86 True Nokia 5 N overcast 88 92 True Honda 86 Y 3 rainy 79 75 False Ford 78 Y Add Unknown overcast 85 88 ? Sony 74 Y 2 6 Predict by Bayesian formula or decision tree Manually Checking 43 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 44. Noisy Data  Noise: random error or variance in a measured variable  Incorrect attribute values may due to  faulty data collection instruments  data entry problems  data transmission problems  technology limitation  inconsistency in naming convention  Other data problems which requires data cleaning  duplicate records  incomplete data  inconsistent data 44 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 45. How to Handle Noisy Data  Binning method (Data smoothing):  first sort data and partition into (equi-depth) bins  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.  Clustering  detect and remove outliers  Combined computer and human inspection  detect suspicious values and check by human  Regression  smooth by fitting the data into regression functions 45 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 46. Simple Discretization Methods: Binning  Equal-width (distance) partitioning:  It divides the range into N intervals of equal size: uniform grid  if A and B are the lowest and highest values of the attribute, the width of intervals W = (B-A)/N.  The most straightforward  But outliers may dominate presentation (since we use lowest/highest values)  Skewed (asymmetrical) data is not handled well.  Equal-depth (frequency) partitioning:  It divides the range into N intervals, each containing around same number of samples  Good data scaling  Managing categorical attributes can be tricky. 46 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 47. Binning Methods for Data Smoothing  Sorted data for price (in dollars):  4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34  Partition into (equi-depth) bins:  - Bin 1: 4, 8, 9, 15 (mean = 9, median = 8.5) Partition into  - Bin 2: 21, 21, 24, 25 (mean = 22.75, median = 23) equidepth bin  - Bin 3: 26, 27, 29, 34 (mean = 29, median = 28) (depth=3)  Smoothing by bin means:  - Bin 1: 9, 9, 9, 9  - Bin 2: 22.75, 22.75, 22.75, 22.75  - Bin 3: 29, 29, 29, 29 Each value in a bin is replaced by the  Smoothing by bin medians: mean (or median) value of the bin.  - Bin 1: 8.5, 8.5, 8.5, 8.5 Similarly, smoothing by bin median  - Bin 2: 23, 23, 23, 23  - Bin 3: 28, 28, 28, 28  Smoothing by bin boundaries:  - Bin 1: 4, 4, 4, 15 The minimum and  - Bin 2: 21, 21, 25, 25 maximum values in a given bin are identified  - Bin 3: 26, 26, 26, 34 as the bin boundaries 47 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 48. Cluster Analysis [Clustering] detect and remove outliers 48 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 49. Regression y Y1 Y1’ y=x+1 X1 x [Regression] smooth by fitting the data into regression functions 49 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 50. Automatic Data Cleaning (Improving Decision Trees)  Improving decision trees: relearn tree with misclassified instances removed or pruning away some subtrees  Better strategy (of course): let human expert check misclassified instances  When systematic noise is present it is better not to modify the data  Also: attribute noise should be left in training set  (Unsystematic) class noise in training set should be eliminated if possible 50 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 51. Automatic Data Cleaning (Robust Regression - I)  Statistical methods that address problem of outliers are called robust  Possible way of making regression more robust:  Minimize absolute error instead of squared error  Remove outliers (i. e. 10% of points farthest from the regression plane)  Minimize median instead of mean of squares (copes with outliers in any direction)  Finds narrowest strip covering half the observations 51 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 52. Automatic Data Cleaning (Robust Regression - II) Least absolute perpendicular 52 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 53. Automatic Data Cleaning (Detecting Anomalies)  Visualization is a best way of detecting anomalies (but often can’t be done)  Automatic approach:  committee of different learning schemes, e.g. decision tree, nearest- neighbor learner, and a linear discriminant function  Conservative approach: only delete instances which are incorrectly classified by all of them  Problem: might sacrifice instances of small classes 53 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 54. Data Integration Data Integration  Data integration:  combines data from multiple sources into a coherent store  Schema integration  integrate metadata from different sources  Entity identification problem: identify real world entities from multiple data sources, e.g., How to match A.cust-num with B.customer-id  Detecting and resolving data value conflicts  for the same real world entity, attribute values from different sources are different  possible reasons: different representations, different scales, e.g., metric vs. British units 54 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 55. Handling Redundant Data in Data Integration  Redundant data occur often A correlation between when integration of multiple attribute A and B databases n  The same attribute may have different names in different  ( A  A )( B  B ) i i databases R A, B  i 1  One attribute may be a (n  1) A B “derived” attribute in another table, e.g., annual revenue  Some redundancies can be  (x  x) 2 n x 2  ( x ) 2 detected by correlational   n 1 n(n  1) analysis where   standard deviation  Careful integration of the data from multiple sources may help If RA,B > 0 then A and B are positively correlated. reduce/avoid redundancies and If RA,B = 0 then A and B are independent. If RA,B < 0 then A and B are negatively correlated. inconsistencies and improve mining speed and quality 55 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 56. Data Transformation and Data Reduction Data Transformation  Smoothing: remove noise from data  Aggregation: summarization, data cube construction  Generalization: concept hierarchy climbing  Normalization: scaled to fall within a small, specified range  min-max normalization  z-score normalization  normalization by decimal scaling  Attribute/feature construction  New attributes constructed from the given ones 56 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 57. Data Transformation: Normalization  min-max normalization v  vmin v'  (vmax  vmin )  vmin new new new vmax  vmin xx z ( x)   z-score normalization  vv  (x  x)2 n x 2  ( x) 2 v'    v n 1 n(n  1)  normalization by decimal scaling v v'  j Where j is the smallest integer such that Max(| v' |)<1 10 57 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 58. Data Reduction Data Reduction Strategies  Warehouse may store terabytes of data: Complex data analysis/mining may take a very long time to run on the complete data  Data reduction  Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results  Data reduction strategies  Data cube aggregation (reduce rows)  Dimensionality reduction (reduce columns)  Numerosity reduction (reduce columns or values)  Discretization / Concept hierarchy generation (reduce values) 58 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 59. Three Types of Data Reduction  Three types of data reduction are:  Reduce no. of column (feature or attribute)  Reduce no. of row (case, example or instance)  Reduce no. of the values in a column (numeric/nominal) Columns Values outlook temp. humidity windy Sponsor play-time play sunny 85 87 True Sony 85 Y sunny 80 90 False HP 90 Y Rows overcast 87 75 True Ford 63 Y rainy 70 95 True Ford 5 N rainy 75 65 False HP 56 Y 59 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 60. Data Cube Aggregation  Ex. You are interested in the annual sales rather than the total per quarter, thus the data can be aggregated resulting data summarize the total sales per year instead of per quarter  The resulting data set is smaller in volume, without loss of information necessary for the analysis task 60 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 61. Dimensionality Reduction  Feature selection (i.e., attribute subset selection):  Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features  reduce the number of patterns, easier to understand  Heuristic methods (due to exponential number of choices):  decision-tree induction (wrapper approach)  independent assessment (filter method)  step-wise forward selection  step-wise backward elimination  combining forward selection+backward elimination 61 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 62. Decision Tree Induction (Wrapper Approach) Initial attribute set: {A1, A2, A3, A4, A5, A6} A4 ? A1? A6? Class 2 Class 1 Class 2 Class 1 Reduced attribute set: {A1, A4, A6} 62 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 63. Numerosity Reduction  Parametric methods  Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)  Log-linear models: obtain value at a point in m-D space as the product on appropriate marginal subspaces (estimate the probability of each cell in a larger cuboid based on the smaller cuboids)  Non-parametric methods  Do not assume models  Major families: histograms, clustering, sampling 65 Data Warehousing and Data Mining by Kritsada Sriphaew
• 64. Regression
- Linear regression: Y = a + bX
  - The two parameters a and b specify the line and are estimated from the data at hand, using the least-squares criterion on the known values (X1, Y1), (X2, Y2), ...
- Multiple regression: Y = a + b1X1 + b2X2
- Many nonlinear functions can be transformed into the forms above.
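A least-squares fit of Y = a + bX, shown on hypothetical data using the closed-form estimates:

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
    Y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

    # Closed-form least-squares estimates for Y = a + bX.
    b = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    a = Y.mean() - b * X.mean()

    # Multiple regression Y = a + b1*X1 + b2*X2 can be solved the same way
    # with np.linalg.lstsq on a design matrix with columns [1, X1, X2].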
• 65. Histograms
- A popular data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Related to quantization problems
[Figure: equal-width histogram with buckets centered at 10000, 30000, 50000, 70000 and 90000]
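A sketch of equal-width bucketing with NumPy (the prices are synthetic and the bucket count is illustrative):

    import numpy as np

    prices = np.random.default_rng(0).integers(0, 100_000, size=1_000)
    counts, edges = np.histogram(prices, bins=5, range=(0, 100_000))
    sums, _ = np.histogram(prices, bins=5, range=(0, 100_000), weights=prices)
    means = sums / np.maximum(counts, 1)    # per-bucket average
    # 1,000 raw values are reduced to 5 (edge, count, mean) triples.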
• 66. Clustering
- Partition the data set into clusters, then store only a representation of each cluster
- Can be very effective if the data is clustered, but not if the data is "smeared" (noisy)
- Clusterings can be hierarchical and stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
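A sketch of clustering-based reduction, assuming scikit-learn is available: store only the centroids (plus cluster sizes) instead of the raw points.

    import numpy as np
    from sklearn.cluster import KMeans   # scikit-learn assumed installed

    X = np.random.default_rng(1).normal(size=(10_000, 2))   # synthetic points
    km = KMeans(n_clusters=50, n_init=10, random_state=1).fit(X)

    # Keep only 50 centroids and their sizes instead of 10,000 raw points.
    centroids = km.cluster_centers_
    sizes = np.bincount(km.labels_)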
• 67. Sampling
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skew (bias)
- Adaptive sampling methods address this; stratified sampling:
  - Approximates the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed (biased) data
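A stratified-sampling sketch with pandas (the helper name and its arguments are illustrative): draw the same fraction from every stratum so skewed subpopulations keep their overall proportions.

    import pandas as pd

    def stratified_sample(df, by, frac, seed=0):
        # Sample the same fraction from every stratum (e.g. every class label).
        return (df.groupby(by, group_keys=False)
                  .apply(lambda g: g.sample(frac=frac, random_state=seed)))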
• 68. Sampling
[Figure: samples drawn from the raw data]
• 69. Sampling
[Figure: cluster/stratified sample drawn from the raw data]
• 70. Discretization and Concept Hierarchy Generation
Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization divides the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduces data size
  - Prepares the data for further analysis
• 71. Discretization and Concept Hierarchy
- Discretization reduces the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then replace the actual data values.
- Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values of the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
• 72. Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)
- Entropy-based discretization
- Keywords:
  - Supervised discretization: entropy-based discretization
  - Unsupervised discretization: clustering, binning, histograms
• 73. Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
    info(S, T) = (|S1|/|S|) × info(S1) + (|S2|/|S|) × info(S2)
- The boundary T that minimizes this entropy over all possible boundaries is selected as a binary discretization.
- The process is applied recursively to the resulting partitions until some stopping criterion is met, e.g. the gain info(S) - info(S, T) falls below a threshold.
- Experiments show that it may reduce data size and improve classification accuracy.
• 74. Entropy-Based Discretization
- Example: the temperature attribute of the weather data, with the play class under each value:

    temp: 64 65 68 69 70 71  72   75  80 81 83 85
    play: y  n  y  y  y  n  y/n  y/y  n  y  y  n

- Entropy: info(X) = -Σ_i p_i log2 p_i
- Splitting at Temp = 71.5 leaves [4 yes, 2 no] on one side and [5 yes, 3 no] on the other:
    info([4,2], [5,3]) = (6/14) × info([4,2]) + (8/14) × info([5,3]) = 0.939 bits
- Without the split: info([9,5]) = 0.940 bits
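The slide's numbers can be reproduced directly (pure Python; no assumptions beyond the class counts above):

    import math

    def info(counts):
        # Entropy in bits of a class distribution, e.g. info([9, 5]).
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    left, right = [4, 2], [5, 3]          # class counts on each side of T = 71.5
    n = sum(left) + sum(right)
    after = sum(left) / n * info(left) + sum(right) / n * info(right)
    print(round(after, 3))                # 0.939 bits
    print(round(info([9, 5]), 3))         # 0.940 bits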
• 75. Specification of a Set of Attributes (Concept Hierarchy Generation)
- A concept hierarchy can be generated automatically from the number of distinct values per attribute in the given attribute set: the attribute with the most distinct values is placed at the lowest level of the hierarchy.

    country              (15 distinct values)
      province_or_state    (65 distinct values)
        city                 (3,567 distinct values)
          street               (674,339 distinct values)
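A sketch of this distinct-value heuristic with pandas (the helper name is illustrative): order the attributes by nunique(), fewest distinct values at the top.

    import pandas as pd

    def hierarchy_order(df, attrs):
        # Attributes with fewer distinct values (e.g. country) come first;
        # those with the most (e.g. street) end up at the lowest level.
        return sorted(attrs, key=lambda a: df[a].nunique())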
• 76. Why Postprocessing?
- To improve the acquired model (the mined knowledge)
- Techniques combine several mining approaches to find better results
[Figure: Input Data feeds Method 1, Method 2, ..., Method N; their outputs are combined into the final Output Data]
• 77. Combining Multiple Models: Engineering the Output (Overview)
- Basic idea of "meta" learning schemes: build different "experts" and let them vote
- Advantage: often improves predictive performance
- Disadvantage: produces output that is very hard to analyze
- Schemes discussed here: bagging, boosting and stacking (stacked generalization)
- These approaches can be applied to both numeric prediction and nominal classification
• 78. Combining Multiple Models (Bagging - general)
- Employs the simplest way of combining predictions: voting/averaging
- Each model receives equal weight
- "Idealized" version of bagging:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers' predictions
- This improves performance in almost all cases if the learning scheme is unstable (e.g. decision trees)
• 79. Combining Multiple Models (Bagging - algorithm)
Model generation:
- Let n be the number of instances in the training data.
- For each of t iterations:
  - Sample n instances with replacement from the training set.
  - Apply the learning algorithm to the sample.
  - Store the resulting model.
Classification:
- For each of the t models, predict the class of the instance.
- Return the class that has been predicted most often.
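A direct transcription of this algorithm into Python; the `learn` function, which maps a training sample to a callable model, is assumed rather than given in the lecture:

    import random
    from collections import Counter

    def bagging_fit(train, learn, t=10, seed=0):
        # Build t models, each trained on a bootstrap sample of size n
        # drawn with replacement from the training data.
        rng = random.Random(seed)
        n = len(train)
        return [learn([train[rng.randrange(n)] for _ in range(n)])
                for _ in range(t)]

    def bagging_predict(models, x):
        # Return the class predicted most often across the t models.
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]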
• 80. Combining Multiple Models (Boosting - general)
- Also uses voting/averaging, but models are weighted according to their performance
- Iterative procedure: each new model is influenced by the performance of those built previously
  - A new model is encouraged to become an expert on instances the earlier models classified incorrectly
  - Intuitive justification: the models should be experts that complement each other
- There are several variants of this algorithm
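One common variant is an AdaBoost.M1-style procedure; a minimal sketch follows, assuming a weight-aware `learn(X, y, w)` function (not specified in the lecture) that returns a classifier m with m(x) -> label:

    import math

    def boost_fit(X, y, learn, t=10):
        n = len(X)
        w = [1.0 / n] * n                 # start with uniform instance weights
        models = []
        for _ in range(t):
            m = learn(X, y, w)
            err = sum(wi for wi, xi, yi in zip(w, X, y) if m(xi) != yi)
            if err == 0 or err >= 0.5:    # stop: perfect, or no better than chance
                break
            alpha = 0.5 * math.log((1 - err) / err)   # model weight
            models.append((alpha, m))
            # Re-weight: raise misclassified instances, lower the rest, renormalize.
            w = [wi * math.exp(alpha if m(xi) != yi else -alpha)
                 for wi, xi, yi in zip(w, X, y)]
            s = sum(w)
            w = [wi / s for wi in w]
        return models

    def boost_predict(models, x):
        # Performance-weighted vote of the stored models.
        score = {}
        for alpha, m in models:
            score[m(x)] = score.get(m(x), 0.0) + alpha
        return max(score, key=score.get)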
• 81. Combining Multiple Models (Stacking - I)
- Hard to analyze theoretically: "black magic"
- Uses a "meta learner" instead of voting to combine the predictions of base learners
  - Predictions of the base learners (level-0 models) are used as input for the meta learner (level-1 model)
- Base learners usually employ different learning schemes
- Predictions on the training data can't be used to generate the data for the level-1 model; a cross-validation-like scheme is employed (see the sketch below)
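A sketch of generating the level-1 training data, assuming scikit-learn is available: cross_val_predict ensures each base learner's predictions come from folds it was not trained on.

    import numpy as np
    from sklearn.model_selection import cross_val_predict  # scikit-learn assumed

    def level1_training_data(base_learners, X, y, cv=5):
        # Each column holds one base learner's cross-validated predictions,
        # so no learner predicts an instance it was trained on.
        cols = [cross_val_predict(m, X, y, cv=cv) for m in base_learners]
        return np.column_stack(cols)

    # The meta learner (level-1 model) is then fit on these columns against y,
    # while each base learner is re-fit on all of X for use at prediction time.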
• 82. Combining Multiple Models (Stacking - II)
[Figure: stacking architecture - the level-0 base learners produce predictions that form the training data for the level-1 meta learner]