Dealing with large datasets
Avoiding the dangers
Adrien Ickowicz, Ross Sparks




MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au
Managing the data

       Can the input be massaged to make it more amenable to learning
       methods? (And how can you do it safely?)


          Attribute Selection
                  – Scheme-independent selection
                  – Searching the attribute space
                  – Scheme-specific selection

          Attribute Discretization
                  – Unsupervised discretization
                  – Entropy-based discretization
                  – Other methods

          Data Transformation
                  – Linear and non-linear PCA
                  – Random projections
                  – Time series

          Data Cleansing
                  – Improving decision trees
                  – Robust regression
                  – Detecting anomalies




Dealing with large datasets: Slide 2 of 17
Attribute Selection




       Justification

       An irrelevant attribute will often degrade the performance
       of state-of-the-art decision tree and rule learners...

                    • Example: random binary attribute
                             – Deteriorates the classification performance 5% to 10% of the time

       But a relevant attribute can be harmful as well...

                    • Example: binary attribute that matches the class value 65% of the time
                             – Deteriorates the classification performance 1% to 5% of the time
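
The two synthetic attributes in the examples above can be generated as in the sketch below, so that they can be appended to a dataset before re-evaluating a learner. The function names and the 10,000 toy labels are assumptions made for illustration; the sketch only builds the attributes and does not reproduce the performance figures quoted on the slide.

```python
import random

def random_binary_attribute(n, rng):
    """Irrelevant attribute: an independent fair coin flip for each instance."""
    return [rng.randint(0, 1) for _ in range(n)]

def partially_matching_attribute(labels, match_rate, rng):
    """Relevant but noisy attribute: copies the binary class label with
    probability match_rate and takes the opposite value otherwise."""
    return [y if rng.random() < match_rate else 1 - y for y in labels]

def agreement(attribute, labels):
    """Fraction of instances where the attribute equals the class label."""
    return sum(a == y for a, y in zip(attribute, labels)) / len(labels)

rng = random.Random(0)  # fixed seed so the run is repeatable
labels = [rng.randint(0, 1) for _ in range(10_000)]  # toy binary class (assumption)
coin = random_binary_attribute(len(labels), rng)
noisy = partially_matching_attribute(labels, 0.65, rng)
print(f"coin agreement:  {agreement(coin, labels):.3f}")
print(f"noisy agreement: {agreement(noisy, labels):.3f}")
```

The coin attribute should agree with the class about half the time and the noisy one about 65% of the time.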




Attribute Selection

   1 - Scheme-independent selection
       • No universal relevance measure
       • Beware of overfitting and model redundancy
       • Make sure that the attribute scales are the same
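
A minimal sketch of a scheme-independent (filter) ranking: each attribute is put on a common scale with z-scores and then ranked by absolute Pearson correlation with the target. Correlation is only one possible relevance measure, chosen here as an assumption; the function names and toy columns are hypothetical.

```python
import math

def standardize(column):
    """Rescale a numeric column to zero mean and unit variance (z-scores)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0:
        return [0.0] * n  # a constant column carries no information
    return [(x - mean) / std for x in column]

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

def rank_attributes(columns, target):
    """Attribute names sorted by |correlation| with the target, best first."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): 'a' tracks the target, 'b' lives on a much larger
# scale but is only loosely related, 'c' is constant.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "a": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "b": [900.0, 100.0, 700.0, 300.0, 800.0, 400.0],
    "c": [7.0] * 6,
}
print(rank_attributes(columns, target))
```

Pearson correlation is itself scale-free; the explicit standardization matters most when the relevance measure is distance-based.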
   2 - Searching the attribute space
       • Exhaustive search is impractical
       • Forward, backward, ...: need an expert to set the algorithm parameters
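
The greedy alternative to exhaustive search can be sketched as forward selection, shown here under stated assumptions: `score` stands in for whatever evaluation is used (for example cross-validated accuracy), and `min_gain` is one of the parameters that has to be set by hand. The attribute names and the toy scoring function are hypothetical.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy forward search: start from the empty set and repeatedly add the
    attribute that most improves score(subset); stop when no addition gains
    more than min_gain. `score` is any callable where higher is better."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    evaluations = 1
    while remaining:
        candidates = [(score(selected + [a]), a) for a in remaining]
        evaluations += len(candidates)
        top_score, top_attr = max(candidates, key=lambda c: c[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected, best, evaluations

# Hypothetical score: two attributes help, and every attribute kept costs a
# small complexity penalty.
USEFUL = {"x1": 0.30, "x3": 0.10}

def toy_score(subset):
    return sum(USEFUL.get(a, 0.0) for a in subset) - 0.01 * len(subset)

names = ["x1", "x2", "x3", "x4", "x5"]
subset, value, n_evals = forward_selection(names, toy_score)
print(subset, round(value, 2), n_evals, "evaluations vs", 2 ** len(names), "subsets")
```

With n attributes the greedy search needs on the order of n² evaluations, whereas exhaustive search needs 2ⁿ.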


   3 - Scheme-specific selection
       • Time consuming
       • "Burns" one classification method




Attribute Discretization




       Justification

       Deal with both continuous and discretized data

       Handle the extreme values

       Some algorithms make an unrealistic assumption about
       the attribute values...

                    • Example: normal distribution assumption

       ... or slow down the process.

                    • Example: need to sort the attribute values



Attribute Discretization

   1 - Unsupervised discretization
       • Avoid big differences in bin frequencies
       • Avoid small bins
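
A small sketch contrasting the two usual unsupervised schemes, equal-width and equal-frequency binning, on skewed data. The function names and toy values are assumptions for illustration; ties are split arbitrarily in the equal-frequency version.

```python
from collections import Counter

def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding nearly the same number of
    instances, by rank (simplified: tied values may land in different bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

# Skewed toy data (hypothetical): most values are small, one is extreme.
values = [1, 2, 2, 3, 3, 4, 5, 6, 8, 100]
print(sorted(Counter(equal_width_bins(values, 3)).items()))      # [(0, 9), (2, 1)]
print(sorted(Counter(equal_frequency_bins(values, 3)).items()))  # [(0, 4), (1, 3), (2, 3)]
```

Equal-width binning puts nine of the ten values in one bin and leaves one bin empty, which is the frequency imbalance the slide warns about.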


   2 - Entropy-based discretization
       • Recursive, so it needs a stopping criterion
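
A sketch of recursive entropy-based discretization. The stopping criterion used here is the Fayyad-Irani MDL rule, which is one common choice and an assumption on our part; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """pairs: (value, label) tuples sorted by value. Returns (gain, index) of
    the boundary with the highest information gain, or None if there is no
    boundary between distinct values."""
    labels = [y for _, y in pairs]
    base, n = entropy(labels), len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i)
    return best

def mdl_accepts(pairs, gain, i):
    """MDL stopping rule: accept the cut only if the information gain
    outweighs the cost of encoding the cut."""
    labels = [y for _, y in pairs]
    left, right = labels[:i], labels[i:]
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

def cut_points(pairs):
    """Recursively split the sorted (value, label) pairs; returns cut values."""
    found = best_split(pairs)
    if found is None or not mdl_accepts(pairs, *found):
        return []
    _, i = found
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return cut_points(pairs[:i]) + [cut] + cut_points(pairs[i:])

# Toy data (hypothetical): values 1..10 are "low", values 20..29 are "high".
data = sorted([(v, "low") for v in range(1, 11)] + [(v, "high") for v in range(20, 30)])
print(cut_points(data))  # [15.0]
```

Without the `mdl_accepts` check the recursion would keep cutting until every bin was pure or held a single value.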


   3 - Other methods
       • In practice, they do not perform better than entropy-based discretization
       • Some are time consuming




Data Transformation




       Justification

       Data often calls for general mathematical transformations
       of a set of attributes...

                    • Example: two date attributes may lead to a third attribute
                         representing age

       Test the robustness of a learning algorithm...

                    • Example: add noise to, or change a given percentage of, a nominal
                         attribute's values
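
The first example above, deriving an age attribute from two date attributes, can be sketched as follows. The record fields `date_of_birth` and `admission_date` are hypothetical names chosen for illustration.

```python
from datetime import date

def age_in_years(birth, reference):
    """Whole years elapsed between two dates (the derived 'age' attribute)."""
    years = reference.year - birth.year
    # Subtract one if the anniversary has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years

# Hypothetical records carrying two date attributes.
records = [
    {"date_of_birth": date(1980, 6, 15), "admission_date": date(2012, 6, 14)},
    {"date_of_birth": date(1980, 6, 15), "admission_date": date(2012, 6, 15)},
]
for r in records:
    r["age"] = age_in_years(r["date_of_birth"], r["admission_date"])
print([r["age"] for r in records])  # [31, 32]
```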




Data Transformation

   1 - Linear and non-linear PCA
       • Dimension reduction technique: there is a loss of information
       • Very costly in high dimensions
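
A minimal sketch of linear PCA on toy two-dimensional data, to make the information loss concrete: the leading component is found by power iteration and the fraction of variance it retains is reported. Power iteration, the function names and the synthetic data are all assumptions for illustration; building the d × d covariance matrix is the step that grows costly as the dimension d increases.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of rows (a list of equal-length numeric lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of a covariance matrix, by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, eigenvalue

# Synthetic data (assumption): points scattered closely around the line y = 2x.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(200)]
rows = [[x, 2 * x + rng.gauss(0, 0.1)] for x in xs]

cov = covariance_matrix(rows)
direction, eigenvalue = first_component(cov)
retained = eigenvalue / (cov[0][0] + cov[1][1])  # share of the total variance kept
print([round(c, 2) for c in direction], f"variance retained: {retained:.3f}")
```

Keeping only the first component discards the remaining share of the variance, which is small here only because the toy data is almost one-dimensional.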


   2 - Random projections
       • Perform worse than PCA
       • Preserve distance relationships well on average
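
A sketch of a Gaussian random projection, checking how well pairwise distances survive a reduction from 200 to 50 dimensions. The dimensions, the seed and the function names are assumptions chosen for illustration.

```python
import math
import random

def random_projection_matrix(d, k, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries, so that squared lengths are
    preserved in expectation."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Multiply the projection matrix by the vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

rng = random.Random(42)
d, k = 200, 50  # original and reduced dimension (assumptions)
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(15)]
matrix = random_projection_matrix(d, k, rng)
projected = [project(matrix, p) for p in points]

# Ratio of projected to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i in range(len(points)) for j in range(i + 1, len(points))]
mean_ratio = sum(ratios) / len(ratios)
print(f"mean ratio {mean_ratio:.3f}, min {min(ratios):.3f}, max {max(ratios):.3f}")
```

The mean ratio should be close to 1 while individual pairs deviate more, which is the "well on average" qualification on the slide.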


   3 - Time series
       • Pay attention to the sampling




Application Example

       - What is the difference between theory and practice?
       - There is no difference ... in theory. But in practice, there is.


                    • Example 1: Attribute Selection (Backward vs Filter)
                    • Example 2: Attribute Discretization (Chi-2 based vs Top-down)
                    • Example 3: Data Transformation




Example 1

       Data set: Wine Quality data

       Description of the data: 1599 observations of 12 variables

       Question: What makes a good (red) wine?




Example 1

       How many features do we keep?

       [Figure: RMSE, backward selection]

       Number of features: 5




Example 1

        How many features do we keep?




       [Figure: RMSE, filter selection]




Example 2

       How do we discretize the features?

       [Figures: Chi-2 discretization; MDL discretization]




Example 2

       How do we discretize the features?

       [Figures: Chi-2 Merge discretization; top-down discretization]
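
For reference, a bottom-up chi-square merge can be sketched as below. This is a simplified illustration and not the procedure used to produce the figures: it stops at a fixed number of intervals (`max_intervals`, an assumption) rather than at a chi-square threshold, and the toy data is hypothetical.

```python
from collections import Counter

def chi_square(counts_a, counts_b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    chi2 = 0.0
    for counts in (counts_a, counts_b):
        row_total = sum(counts.values())
        for c in classes:
            expected = row_total * (counts_a[c] + counts_b[c]) / total
            if expected > 0:  # simplification: skip classes absent from both intervals
                chi2 += (counts[c] - expected) ** 2 / expected
    return chi2

def chi_merge(pairs, max_intervals):
    """Start with one interval per distinct value and repeatedly merge the
    adjacent pair with the lowest chi-square statistic. Returns the lower
    bound of each remaining interval."""
    classes = sorted({y for _, y in pairs})
    by_value = {}
    for v, y in pairs:
        by_value.setdefault(v, Counter())[y] += 1
    intervals = [(v, by_value[v]) for v in sorted(by_value)]
    while len(intervals) > max_intervals:
        scores = [chi_square(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lower for lower, _ in intervals]

# Toy (value, class) pairs (hypothetical).
pairs = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b"),
         (11, "a"), (12, "a")]
print(chi_merge(pairs, max_intervals=3))  # [1, 7, 11]
```

Adjacent intervals with the same class distribution score 0 and are merged first, so the surviving boundaries fall where the class changes.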




Example 3

       How do we transform the data?




       [Figure: Principal Component Analysis]




Example 3
   How do we transform the data?



       [Figure: Projection Pursuit Regression]




CSIRO Mathematics, Informatics and Statistics   CSIRO Mathematics, Informatics and Statistics
Adrien Ickowicz                                 Ross Sparks
t   +61 2 9325 3260                             t   +61 2 9325 3262
e Adrien.Ickowicz@csiro.au                      e   Ross.Sparks@csiro.au
w Mathematics, Informatics and Statistics web   w   Mathematics, Informatics and Statistics web



