Copyright © 2013 Geoffrey I Webb




Fundamental and Advanced Machine Learning Methods for
                 Big Data Applications

                    Geoffrey I Webb,
        Ana Martinez, Nayyar Zaidi, Shenglei Chen
                    Monash University
                 http://www.csse.monash.edu.au/~webb




                             Overview

•   Big data
•   Classification learning
•   Sampling
•   Dimensionality reduction
•   Scaling-up existing algorithms
•   Stream learning
•   Bias and variance and big data
•   Selective KDB
•   Incremental Bayesian Network Classifiers




                                                         Big data

   • Can mean many things
      – Complex integration of many heterogeneous data sources
      – Very large/streaming data

                   Name (SI decimal prefixes)    Value    Binary usage
                   kilobyte (kB)                 10^3     2^10
                   megabyte (MB)                 10^6     2^20
                   gigabyte (GB)                 10^9     2^30
                   terabyte (TB)                 10^12    2^40
                   petabyte (PB)                 10^15    2^50
                   exabyte (EB)                  10^18    2^60
                   zettabyte (ZB)                10^21    2^70
                   yottabyte (YB)                10^24    2^80




                                                  What is ‘big’?
   • Number of
       – instances
       – dimensions
       – classes
   • Big data usually includes data sets with sizes beyond the ability of
       commonly used software tools to capture, curate, manage, and process
       the data within a tolerable elapsed time. Big data sizes are a
       constantly moving target, as of 2012 ranging from a few dozen
       terabytes to many petabytes of data in a single data set.
         – Wikipedia
   • Machine learning research usually treats more than 1 million examples
     as very large.







                                                                     Examples

   •    Spelling correction
   •    Translation
   •    Farecast
   •    Recommender systems
   •    Electoral outcomes




Whitelaw, C, B Hutchinson, GY Chung, & G Ellis. "Using the web for language independent spellchecking and autocorrection." In Proceedings of the 2009 Conference
on Empirical Methods in Natural Language Processing: Volume 2, pp. 890-899. Association for Computational Linguistics, 2009.
Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. Penguin Press, 2012.





                                           Not a universal panacea

   • Jeopardy but not chess
   • Spelling correction and translation but not comprehension




                            http://www.engadget.com/2011/02/15/watson-soundly-beats-the-humans-in-first-round-of-jeopardy/





                                            Classification learning








                                             Evolving distributions

   • Key issue
      – Is the distribution from which the data are drawn static or dynamic?
   • Forms of change
      – Concept drift: class membership changes, e.g. what counts as 'rich' shifts over time
      – Concept evolution: new classes emerge
      – Distribution drift: probabilities change





                                              Dimension of change

   • Normally time, but may be another dimension such as location
   • A classifier can only take the dimension of change into account if the
     data to be classified fall within the scope of the training data, or if
     it is possible to extrapolate


   [Two plots: values against the dimension of change (0–28), with Training and Testing regions marked]




                                                   Loss functions








                                              Imbalanced classes

   • Many big datasets have a rare class of interest and a
     majority class from which we seek to distinguish it.
      – Ad click-through
      – Conversions
      – Disease
      – Fraud
      – Homeland security












                           Loss functions for imbalanced classes




                                 Predictions
                                 Pos     Neg
                   Actual  Pos   TP      FN
                           Neg   FP      TN








                                  Loss functions for imbalanced classes

   • Area under the ROC curve
      – True Positive Rate (TPR) = TP / (TP + FN)
      – False Positive Rate (FPR) = FP / (FP + TN)

                                 Predictions
                                 Pos     Neg
                   Actual  Pos   TP      FN
                           Neg   FP      TN
          Prof. William H. Press, “Unit 17: Classifier Performance: ROC, Precision-Recall, and All That.”
          http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf





                                  Loss functions for imbalanced classes

   • Area under the Precision-Recall curve
      – Recall = True Positive Rate (TPR) = TP / (TP + FN)
      – Precision = TP / (TP + FP)

                                 Predictions
                                 Pos     Neg
                   Actual  Pos   TP      FN
                           Neg   FP      TN
          Prof. William H. Press, “Unit 17: Classifier Performance: ROC, Precision-Recall, and All That.”
          http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf





                                                Mutual information
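   For reference, the quantity used throughout these slides to rank attributes is
   the mutual information between an attribute X and the class C; in LaTeX (the
   discrete, plug-in form is assumed):

       I(X;C) = \sum_{x}\sum_{c} P(x,c)\,\log\frac{P(x,c)}{P(x)\,P(c)}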








                                                                   Learning curves
   [Learning curve: Root Mean Squared Error against data quantity (0–1,000,000) for KDB k=2]






                                                         Sampling

   • Select s instances from a dataset of size n
   • Important that sample be selected randomly
   • Make sure you use a robust random number generator








                                                                      Ideal Sampling
   • Select the data quantity at which the learning curve approaches its
     asymptotic error and learn from a sample of that size
   [Learning curve: Root Mean Squared Error against data quantity (0–1,000,000) for KDB k=2]







                                                                 Finding asymptotic error

     • Progressive sampling: learn from successively larger samples until error
       stops improving

   [Learning curve: Root Mean Squared Error against data quantity (0–1,000,000) for KDB k=2]

Provost, F, D Jensen, T Oates. “Efficient progressive sampling.” In Proc 5th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 23-32. ACM, 1999.





                                                                                 Hoeffding's bound

   With probability at least 1 − δ, the population mean of a random variable
   with range R lies within ε of the sample mean after n independent
   observations, where

       ε = √( R² ln(1/δ) / (2n) )

   (error margin ε, sample size n)
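   A one-line helper makes the bound concrete (a sketch; the function name is ours):

       import math

       def hoeffding_epsilon(R, delta, n):
           # Error margin: the population mean lies within epsilon of the
           # sample mean with probability at least 1 - delta.
           return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))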




Hulten, G, and P Domingos. "Mining complex models from arbitrarily large databases in constant time." In Proceedings 8th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 525-531. ACM, 2002.






                                                                   Maximum sample

   • Take the largest sample that capacity can handle
   [Learning curve: Root Mean Squared Error against data quantity (0–1,000,000) for KDB k=2]







                                                Maximum sample

   • Take the largest sample that capacity can handle
   • Saves the overhead of repeated sampling and avoids the risk of
     terminating too soon
   • Has risk that asymptotic error may not be reached
       – but alternative techniques wouldn’t be able to handle a
         larger sample anyway!








                          Sampling with and without replacement

   • Sampling involves deciding how many times K_i each element i
     of a collection should occur in the sample
   • Sampling without replacement restricts K_i to 0 or 1








              Uniform fixed-sized sampling with replacement for fixed n

   selected ← 0
   while selected < s
         add a randomly selected instance to the sample
         increment selected
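
   A Python rendering of this procedure (a sketch; the function name is ours):

       import random

       def sample_with_replacement(data, s, rng=random):
           # Each draw is uniform over the whole dataset, so an instance
           # may occur more than once in the sample.
           return [data[rng.randrange(len(data))] for _ in range(s)]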








          Uniform sequential variable-sized sampling without replacement

   i ← 1
   while i ≤ n
        with fixed probability p do
             add the next instance to the sample
        increment i
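
   In Python this is Bernoulli sampling (a sketch; the sample size is
   variable, binomially distributed):

       import random

       def bernoulli_sample(stream, p, rng=random):
           # Include each instance independently with fixed probability p.
           return [x for x in stream if rng.random() < p]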







         Uniform sequential fixed-sized sampling without replacement for
                                     known n
   selected ← 0
   i ← 1
   while selected < s
         with probability (s - selected )/(n-i+1) do
              add the next instance to the sample
              increment selected
         increment i
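
   A Python sketch of the same scheme (0-based index, so the inclusion
   probability (s − selected)/(n − i) matches the pseudocode's
   (s − selected)/(n − i + 1)):

       import random

       def selection_sample(data, s, rng=random):
           n = len(data)
           sample, selected = [], 0
           for i, x in enumerate(data):
               # Probability: (instances still needed) / (instances remaining)
               if rng.random() < (s - selected) / (n - i):
                   sample.append(x)
                   selected += 1
                   if selected == s:
                       break
           return sample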







          Uniform sequential fixed-sized sampling without replacement for
                                     unknown n
   count ← 0
   while count < s and more instances remain
       add the next instance to the sample
       increment count
   while more instances remain
       increment count
       with probability s/count do
           add the next instance to the sample, replacing an existing
           instance selected at random
       else
           discard the next instance
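
   This is reservoir sampling (Algorithm R); a Python sketch (names are ours):

       import random

       def reservoir_sample(stream, s, rng=random):
           reservoir = []
           for t, x in enumerate(stream, start=1):
               if t <= s:
                   reservoir.append(x)          # fill the reservoir first
               elif rng.random() < s / t:
                   # Replace a random existing element with probability s/t
                   reservoir[rng.randrange(s)] = x
           return reservoir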
 Tille, Yves. Sampling algorithms. Springer, 2006.





                                          Dimensionality reduction

   • Many learning algorithms are super-linear with respect to
     dimensionality
   • Dimensionality can be reduced by
       – feature selection
       – feature projection








                                                 Feature selection
   • Most powerful techniques are too computationally intensive for big
     data
      – Eg wrapper techniques
      – Best approach varies depending on base learner
   • Techniques that consider only the relationship between an attribute
     and the class are efficient
      – Eg top-k mutual information (sketched after this list)
      – However, overlook complex interactions between attributes
   • May be most effective to use powerful technique on a sample
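
   The top-k mutual-information filter mentioned above, as a sketch for
   discrete attributes (plug-in estimates; function names are ours):

       import math
       from collections import Counter

       def mutual_information(xs, ys):
           # Plug-in estimate of I(X;C) from paired discrete observations.
           n = len(xs)
           pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
           return sum(c / n * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
                      for (x, y), c in pxy.items())

       def top_k_attributes(X_columns, y, k):
           # Rank attributes by mutual information with the class.
           scored = [(mutual_information(col, y), i)
                     for i, col in enumerate(X_columns)]
           return [i for _, i in sorted(scored, reverse=True)[:k]]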







                                                Feature Projection
   • Project feature space onto lower dimensional space
   • Principal Components Analysis
   • First principal component is the linear projection that maximises
     variance (= minimises RMSE with respect to the original)
   • Subsequent principal components maximise variance (= minimise RMSE)
     while being uncorrelated with prior components
   • The first few principal components capture most of the variation (=
     information) in the data (see the sketch after this list)
   • Generalisations including principal curves and manifolds project onto
     manifolds instead of planes
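
   A minimal NumPy sketch of the projection described above (assumes
   instances are rows of X; d is the target dimensionality):

       import numpy as np

       def pca_project(X, d):
           Xc = X - X.mean(axis=0)          # centre each attribute
           # Right singular vectors of the centred data are the principal
           # components, ordered by the variance they capture.
           U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
           return Xc @ Vt[:d].T             # coordinates in the top-d subspace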







                                      Scaling-up existing algorithms

   • Distributed cloud/cluster computing
   • Hadoop
      – Commodity clusters
      – MapReduce (sketched below)
         • Map the problem onto sub-problems and distribute these
         • Assemble the solution from solutions to the sub-problems
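
   As an illustration of the decomposition (a sketch, not Hadoop code):
   collecting the attribute-value/class counts used by the Bayesian
   classifiers later in these slides maps cleanly onto shards:

       from collections import Counter
       from functools import reduce

       def map_counts(shard):
           # Each worker counts (attribute index, value, class) triples
           # in its shard of instances, e.g. [(x1, ..., xa, y), ...].
           c = Counter()
           for *attrs, label in shard:
               for i, v in enumerate(attrs):
                   c[(i, v, label)] += 1
           return c

       def reduce_counts(partials):
           # Merge the partial counts produced by all workers.
           return reduce(lambda a, b: a + b, partials, Counter())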




 White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012





                                             Streaming algorithms
   • Handle data that are too large to retain
       – computer network/phone traffic, financial transactions, web
          searches, sensor data
   • May be difficult to get labelled data
   • Strong memory and running time constraints
       – learning rate must be greater than the data rate
       – only limited data can be retained
   • Accuracy must be evaluated in real time, mainly so that parameters
     can be adjusted accordingly






                                     Online and incremental learning

   • Online learning
      – Data arrive as an input stream
      – Classifier makes a prediction
      – Then the correct classification is revealed and the classifier is updated
      – Examples
         • Ad placement, online conversions
   • Incremental learning
      – Classifier is updated as input arrives
      – Classifier is identical to the batch classifier
Auer, Peter. “Online Learning.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 736-743.




                                             Streaming Strategies
   • Retain samples of data and learn from these
      – Continually assess current model against incoming data and
          when models lose accuracy take new samples and relearn
   • Continually update a model using current data
      – Refine using new data
      – Prune elements that decline in accuracy
   • Create ensemble of classifiers each learned from successive time
     periods
      – Retire older classifiers as newer ones are created






                                              Weighted majority algorithm

   • Each classifier E has a weight w_E
   • Classification is by weighted majority vote
   • Every classifier that errs has its weight reduced:
     w_E ← β·w_E, 0 < β < 1 (see the sketch below)
   • The number of mistakes is bounded by roughly twice the mistakes of the
     best classifier, plus a term logarithmic in the number of classifiers
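
   A sketch of a single round for binary classes (β = 0.5 and all names are
   our choices):

       def weighted_majority_round(experts, weights, x, y, beta=0.5):
           # experts: callables mapping an instance to a class in {0, 1}
           preds = [e(x) for e in experts]
           vote = sum(w if p == 1 else -w for p, w in zip(preds, weights))
           prediction = 1 if vote >= 0 else 0
           # Once the true label y is revealed, penalise mistaken experts.
           weights = [w * beta if p != y else w
                      for p, w in zip(preds, weights)]
           return prediction, weights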




Littlestone, N, and MK Warmuth. "The weighted majority algorithm." In 30th Annual Symposium on Foundation of Computer Science, pp. 256-261. IEEE, 1989.





                                                               Winnow

   • Linear-threshold classifier: predict 1 iff Σ_i w_i·x_i > θ
      – binary attributes x_i, non-negative real-valued weights w_i, threshold θ
   • Weights are updated multiplicatively, and only on mistakes (α > 1):

        Prediction   Correct    x_i = 0      x_i = 1
            1           0       unchanged    w_i ← w_i / α (demote)
            0           1       unchanged    w_i ← α·w_i (promote)
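
   A sketch of one Winnow round with the multiplicative updates above
   (names are ours; θ is commonly set to the number of attributes):

       def winnow_round(weights, x, y, theta, alpha=2.0):
           # x: binary attribute vector; y: true class in {0, 1}
           prediction = 1 if sum(w * xi for w, xi in zip(weights, x)) > theta else 0
           if prediction == 1 and y == 0:
               # False positive: demote the weights of active attributes
               weights = [w / alpha if xi else w for w, xi in zip(weights, x)]
           elif prediction == 0 and y == 1:
               # False negative: promote the weights of active attributes
               weights = [w * alpha if xi else w for w, xi in zip(weights, x)]
           return prediction, weights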




Littlestone, N. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm." Machine Learning 2(4)(1988): 285-318.




                                                          Stochastic gradient descent
    • Many classifiers have parameters that are learned by optimisation,
      e.g. logistic regression and SVM
       – usually requires many passes through the data
    • For linear classifiers stochastic gradient descent often converges
      before a single pass is completed (sketched below)
       – the global gradient is approximated by the gradient at each example
       – performs sequential updates
       – a good step size is essential
          • learn it from an initial sample
       – must take examples in random order
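
   A single-pass SGD sketch for logistic regression under log-loss (step
   size η and all names are our choices):

       import math

       def sgd_logistic(stream, n_features, eta=0.01):
           w = [0.0] * n_features
           for x, y in stream:                 # x: feature list, y in {0, 1}
               z = sum(wi * xi for wi, xi in zip(w, x))
               p = 1.0 / (1.0 + math.exp(-z))  # predicted P(y=1 | x)
               for i, xi in enumerate(x):
                   w[i] -= eta * (p - y) * xi  # per-example gradient step
           return w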


Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." In Proceedings 21st International Conference on Machine learning, p. 116. ACM, 2004.





                                                               Bias and variance

                           • Learning curves are not all equal
   [Learning curves: Root Mean Squared Error against data quantity (0–1,000,000) for KDB k=2 and KDB k=5]





                                                    Bias and variance

   • A major factor in the difference between learning curves
   • Decomposition of 0-1 loss
   • Bias and variance relate to the performance of the learner
     given different training sets




 “Bias Variance Decomposition.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 100-101.









                                               Bias and Variance


                         1,0,1,1,0,1,0,0,1
                         1,1,0,1,0,1,1,1,1
                         0,0,1,1,0,1,0,1,0
                                                                                                                           1X
                         1,0,1,1,1,0,1,1,0                                                                                 1

                         1,0,1,1,0,1,0,0,1
                         1,1,0,1,0,1,1,1,1
                         0,0,1,1,0,1,0,1,0          Learner                                    1,0,1,1,0,1,0,0,?           1X
                                                                                               1,1,0,1,0,1,1,1,?
                         1,0,1,1,1,0,1,1,0                                                                                 0X

                         1,0,1,1,0,1,0,0,1
                         1,1,0,1,0,1,1,1,1
                                                                                                                           0
                         0,0,1,1,0,1,0,1,0
                         1,0,1,1,1,0,1,1,0
                                                                                                                           0X


    Variance ≈ (lower limit on) error due to variability in response to sampling
    Bias ≈ error due to central tendency of the learner
    Bias = error - variance





                                                            Bias and variance




   [Four dartboard panels: high bias & high variance | low bias & high variance | high bias & low variance | low bias & low variance]




            Image from Bias Variance Decomposition, in Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 100-101.





                                                     Intrinsic error

   • Many bias/variance analyses also include intrinsic error
   • For our purposes this is included in bias








                                      Bias/variance and big data

   • As data quantity increases, variance should decrease
   • Low variance important for small data
   • Low bias important for big data








                                 Low bias important for big data

   • Low bias requires capacity to describe wide variety of
     multivariate distributions
   • Big datasets contain fine detail needed to precisely delineate
     complex multivariate distributions








                                                        Bias/variance and big data
   [Learning curves: Root Mean Squared Error against data quantity (0–1,000,000) for Naïve Bayes, KDB k=2 and KDB k=5]






                                    Most machine learning research has used small data
   [The same learning curves as on the previous slide: RMSE against data quantity for Naïve Bayes, KDB k=2 and KDB k=5]






                                         Computational tractability

   • Error will be minimised by low bias algorithms
   • Big data require efficient computation
      – Linear wrt size
      – Learn in a limited number of passes
   • Most low-bias learners are compute intensive
      – super-linear with respect to data quantity
      – e.g. kernel SVM and Random Forests








                       k-dependence Bayesian classifier (KDB)
   • Bayesian network classifier proposed by Sahami (1995).
   • KDB
      – the probability of each attribute value is conditioned on the class
        and at most k other attributes
      – extends TAN to multiple parents

   [Network diagram: class C is a parent of attributes A1–A4; each attribute
   also has up to k attribute parents]







                                           k-dependence Bayesian classifier (KDB)
   • k = 0 is Naïve Bayes
   • Increasing k increases variance and decreases bias
   • High k, with its low bias, should have low error for big data

   [Network diagram: class C parent of attributes A1–A4]
   [Learning curves: RMSE against data quantity (0–1,000,000) for Naïve Bayes, KDB k=2 and KDB k=5]




                                                   KDB algorithm


        1st pass:
        • Order attributes according to mutual information (MI) with the class.

        2nd pass:
        • Assign k parents to each attribute according to MI conditioned on
          the class.
        • Add the class as a parent of all attributes.

        [Network diagram: class C parent of A1–A4, with attribute-to-attribute
        edges for the selected parents]
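
   A sketch of the structure selection implied by the two passes (mi(i),
   approximating I(X_i;C), and cmi(i, j), the class-conditioned MI between
   attributes, are assumed helpers computed from first-pass statistics):

       def kdb_structure(n_attributes, k, mi, cmi):
           # Attributes ordered by mutual information with the class.
           order = sorted(range(n_attributes), key=mi, reverse=True)
           parents = {}
           for pos, i in enumerate(order):
               # Parents come from attributes earlier in the MI ordering;
               # keep the (at most) k with the highest conditional MI.
               candidates = order[:pos]
               parents[i] = sorted(candidates, key=lambda j: cmi(i, j),
                                   reverse=True)[:k]
           return order, parents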








                                               Two pass learning




   [Complexity annotations in terms of: number of instances, number of
   attributes, number of classes, average number of values per attribute]








                                      Selective KDB - Motivation

   •   KDB is efficient and effective for large data.
   •   Irrelevant attributes can increase error.
   •   Cannot predetermine the best k for a given data quantity.
   •   Want an efficient way to select attributes and best k.








                                                   Selective KDB




   MI(Ai;C): attributes ordered by mutual information with the class
   MI(Ai;Aj,C): candidate parents ranked by class-conditioned mutual information

   [Diagram: the full KDB network over A1–A4 is grown attribute by attribute;
   each nested submodel is scored with a loss function (LF1–LF4) and the best
   is retained]

   • Leave-one-out cv (Pazzani's trick)
   • Attributes ordered by MI
   • Each alternative model tested is a minor addition to the previous





                                                     Selective KDB

   • Loss function can be RMSE, 0-1 loss, Matthews Correlation Coefficient
     (for unbalanced datasets), etc.
   • Still, the value of k has to be tuned
      – Solution: Selective² KDB: a k × n matrix of loss-function results

   [Diagram: triangular matrix of candidate models; columns a1–a6 index
   attribute-ordered submodels, rows p1–p3 numbers of parents]





                                                   Selective KDB


   [Table: training time and test time compared across KDB, Selective KDB and Selective² KDB]






                              Selective KDB – Results (RMSE)


   • Competitive with KDB in 16 very large datasets (165K–54.6M examples).
     Win–draw–loss records against KDB:
        selective KDB:    8-8-0, 5-11-0, 5-11-0, 6-10-0, 6-9-1
        k-selective KDB:  5-11-0
   • Mean best k = 4.11
   • Mean % attributes selected = 82.66 ± 26.72





                              Selective KDB – Results (RMSE)

   • Comparison with Random Forest (win–draw–loss for k-selective KDB):

                              RF (5EF)                   RF (Num)
                        Trees = 10  Trees = 100    Trees = 10  Trees = 100
       k-selective KDB     6-1-6       4-1-7          5-0-8       4-0-8

   • Need to sample in 3/4 (out of 16) datasets to get RF 10/100 results.

     RMSE:
                               Mnist           MITC            Satellite       Splice
                               (250K/8.1M)     (600K/839K)     (2M/8.7M)       (10M/54.6M)
       RF (100), sample        0.2958±0.0017   0.0518±0.0007   0.4568±0.0006   0.0530±0.0005
       k-sel. KDB, sample      0.2324±0.0029   0.0455±0.0019   0.4531±0.0011   0.0521±0.0006
       k-sel. KDB, all data    0.1449±0.0007   0.0446±0.0020   0.4448±0.0004   0.0523±0.0002





                               Selective KDB – Results (MCC)
   • Unbalanced datasets: use MCC as the loss function.
   • Splice dataset: 0.32% positive class.

          KDB       selective KDB
          0.1768    0.1918
          0.1855    0.1984
          0.1932    0.2043
          0.1986    0.2105
          0.2061    0.2148

   • Comparison with Random Forest (MCC):

                               MITC (600K/839K,       Splice (10M/54.6M,
                               numeric attributes)    discrete attributes)
       RF (100), sample        0.9989                 0.0950
       k-sel. KDB, sample      0.9954                 0.1963
       k-sel. KDB, all data    0.9956                 0.2148





                         Incremental Bayesian Network Classifiers




   [Network diagram: class y is the parent of attributes x1, x2, x3, …, xn (naïve Bayes)]








                                         Incremental naïve Bayes

   • Probability estimates are based on counts of the frequency of
     each attribute value co-occurring with the class
   • These can be updated incrementally (see the sketch below)
   • Can these desirable features be generalised to more
     sophisticated learners?
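
   A minimal sketch of such an incremental learner (the class name, Laplace
   smoothing, and the binary-value assumption in the smoothing denominator
   are our choices):

       import math
       from collections import Counter

       class IncrementalNB:
           def __init__(self):
               self.n = 0
               self.class_counts = Counter()
               self.joint = Counter()   # (attribute index, value, class)

           def update(self, attrs, label):
               # One increment per instance: no pass over earlier data.
               self.n += 1
               self.class_counts[label] += 1
               for i, v in enumerate(attrs):
                   self.joint[(i, v, label)] += 1

           def predict(self, attrs):
               def log_score(c):
                   s = math.log(self.class_counts[c] / self.n)
                   for i, v in enumerate(attrs):
                       # Laplace-smoothed P(x_i = v | c), binary values assumed
                       s += math.log((self.joint[(i, v, c)] + 1)
                                     / (self.class_counts[c] + 2))
                   return s
               return max(self.class_counts, key=log_score)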








                                      Adding edges reduces bias

   • With additional edges it is possible to exactly represent all
     naïve Bayes distributions and more
      – Lower bias
      – Higher variance
      – Should be more accurate for bigger data
      – But which edges should we add?

   [Network diagram: naïve Bayes structure (class y parent of x1 … xn) augmented with edges between attributes]





                                                   Averaged n-Dependence Estimators
      • Develop all of a family of classifiers that each add edges to
        naïve Bayes
      • Select order of dependence, n
      • Each model selects n attributes
         – All other attributes are independent given these attributes
           and the class
         – Each model has lower bias but higher variance than NB
         – Ensembling reduces the variance
Webb, GI, JR Boughton, F Zheng, KM Ting, H Salem. "Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification." Machine Learning 86(2) (2012): 233-272.






                              Averaged n-Dependence Estimators



   [Formula: the AnDE estimate averages sub-models over all subsets of n attributes]
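
   Written out, following the Webb et al. (2012) reference cited earlier
   (x_s denotes the projection of x onto attribute subset s); in LaTeX:

       \hat{P}(y \mid \mathbf{x}) \propto
           \sum_{s \in \binom{\{1,\dots,a\}}{n}} \hat{P}(y, \mathbf{x}_s)
           \prod_{i=1}^{a} \hat{P}(x_i \mid y, \mathbf{x}_s)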








                              Averaged n-Dependence Estimators

   • Incremental learning in a single pass through the data
   • Training time complexity O(m·a^(n+1)), where m is the number of training
     examples and a the number of attributes
   • Classification time complexity O(a^(n+1)·k), where k is the number of
     classes
   • Space complexity O(a^(n+1)·v^(n+1)·k), where v is the average number of
     values per attribute








                                                   Averaged n-Dependence Estimators

   • As n increases bias decreases
      – Good for big data
   [Learning curves: Root Mean Squared Error against data quantity (0–1,000,000) for Naïve Bayes, A1DE, A2DE and A3DE]








                                                         Subsumption resolution
    • If P(x1 | x2) = 1.0 then P(y | x1,x2) = P(y | x2)
        – Eg P(oedema | female, pregnant) =
                    P(oedema | pregnant)
    • Subsumption resolution looks for subsuming attributes at classification
      time and ignores them
        – Simple correction for extreme form of violation of attribute
           independence assumption
        – Very effective in practice – reduce bias at small cost in variance –
           though not always applicable
        – For AnDE with n≥1 uses statistics collected already – no learning
           overhead – often reduces classification time


Zheng, F, GI Webb, P Suraweera, L Zhu. "Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning." Machine Learning 87(1)(2012): 93-125.
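One plausible implementation of the classification-time check, using class-independent value counts of the kind the A1DE sketch above could derive by summing over classes; the minimum-frequency threshold and the tie-breaking between mutually subsuming values are simplifications of the published technique:

```python
def resolve_subsumption(x, single_count, pair_count, min_count=100):
    """Return the attribute indices to keep for classifying example x.
    x_i is dropped when some other value x_j empirically implies it,
    i.e. count(x_i, x_j) == count(x_j), so x_i is a generalisation
    (e.g. drop 'female' when 'pregnant' is present).
    single_count[(j, v_j)] and pair_count[(i, v_i, j, v_j)] are
    frequency counts collected at training time (illustrative names)."""
    keep = set(range(len(x)))
    for i in range(len(x)):
        for j in range(len(x)):
            if i == j or j not in keep:
                continue
            cj = single_count[(j, x[j])]
            # demand enough evidence before trusting P(x_i | x_j) = 1
            if cj >= min_count and pair_count[(i, x[i], j, x[j])] == cj:
                keep.discard(i)  # ignore the generalisation x_i
                break
    return sorted(keep)
```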




                                                              Weighting




Jiang, Liangxiao, and Harry Zhang. "Weightily averaged one-dependence estimators." In PRICAI 2006, pp. 970-974. Springer Berlin Heidelberg, 2006.




                                                                   Weighting

• Weighting also reduces bias at the cost of a small increase in variance (see the weight computation sketch below)
[Learning curves: root mean squared error (0 to 0.6) against data quantity (0 to 1,000,000) for A3DE and weighted A3DE (A3DE W)]
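Jiang and Zhang's weighting replaces the uniform average over super-parent models with a weighted one, each model weighted by the mutual information between its super-parent attribute and the class. A sketch computing such weights from the (y, i, v) counts kept by the A1DE sketch earlier (argument names are illustrative):

```python
import math
from collections import defaultdict

def super_parent_weights(pair, n, a, classes, n_vals):
    """Weight for super-parent attribute i: the estimated mutual
    information I(X_i; Y) = sum over y,v of P(y,v) log[P(y,v) / (P(y) P(v))]."""
    weights = []
    for i in range(a):
        val = defaultdict(float)  # marginal counts of values of X_i
        cls = defaultdict(float)  # marginal counts of the class
        for y in classes:
            for v in range(n_vals[i]):
                c = pair[(y, i, v)]
                val[v] += c
                cls[y] += c
        mi = 0.0
        for y in classes:
            for v in range(n_vals[i]):
                c = pair[(y, i, v)]
                if c > 0:
                    mi += (c / n) * math.log(c * n / (cls[y] * val[v]))
        weights.append(mi)
    return weights
```

These weights would then multiply the corresponding super-parent terms in predict_scores in place of the uniform 1/a averaging.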






         Weighting and subsumption resolution are complementary

   • When SR is applicable, both in combination have lower bias
     but slightly higher variance than either alone

                                          RMSE
        Dataset              Size        A2DE   A2DE-SR   A2DE-W   A2DE-WSR
small   cleveland               303      0.359    0.360    0.361     0.361
        balance-scale           625      0.430    0.430    0.430     0.430
        anneal                  898      0.118    0.098    0.116     0.096
large   adult                48,842      0.313    0.306    0.308     0.303
        localization        164,860      0.499    0.499    0.498     0.498
        covtype             581,102      0.371    0.349    0.350     0.335
        poker-hand        1,025,010      0.496    0.496    0.420     0.420
        kddcup            5,209,460      0.044    0.040    0.043     0.039





Questions?




                                                         References
Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. Penguin Press, 2012.
Whitelaw, Casey, Ben Hutchinson, Grace Y. Chung, and Gerard Ellis. "Using the web for language independent spellchecking and autocorrection." In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pp. 890-899. Association for Computational Linguistics, 2009.
Press, William H. "Unit 17: Classifier Performance: ROC, Precision-Recall, and All That." http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf
Provost, Foster, David Jensen, and Tim Oates. "Efficient progressive sampling." In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 23-32. ACM, 1999.
Hulten, Geoff, and Pedro Domingos. "Mining complex models from arbitrarily large databases in constant time." In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 525-531. ACM, 2002.
Tille, Yves. Sampling Algorithms. Springer, 2006.
White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
Auer, Peter. "Online Learning." In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. Springer: New York, 2010, pp. 736-743.
Littlestone, Nick, and Manfred K. Warmuth. "The weighted majority algorithm." In 30th Annual Symposium on Foundations of Computer Science, pp. 256-261. IEEE, 1989.
Littlestone, Nick. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm." Machine Learning 2(4) (1988): 285-318.
Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." In Proceedings of the 21st International Conference on Machine Learning, p. 116. ACM, 2004.
"Bias Variance Decomposition." In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. Springer: New York, 2010, pp. 100-101.
Sahami, Mehran. "Learning limited dependence Bayesian classifiers." In KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 335-338. 1996.
Webb, Geoffrey I., Janice R. Boughton, Fei Zheng, Kai Ming Ting, and Houssam Salem. "Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification." Machine Learning 86(2) (2012): 233-272.
Zheng, Fei, Geoffrey I. Webb, Pramuditha Suraweera, and Liguang Zhu. "Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning." Machine Learning 87(1) (2012): 93-125.
Jiang, Liangxiao, and Harry Zhang. "Weightily averaged one-dependence estimators." In PRICAI 2006: Trends in Artificial Intelligence, pp. 970-974. Springer Berlin Heidelberg, 2006.
