Big Data Workshop
1. Dealing with large datasets
Avoiding the dangers
Adrien Ickowicz, Ross Sparks
MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au
2. Managing the data
Can the input be massaged to make it more amenable to learning
methods? (And how can it be done safely?)
Attribute Selection
– Scheme-independent selection
– Searching the attribute space
– Scheme-specific selection
Attribute Discretization
– Unsupervised discretization
– Entropy-based discretization
– Other methods
Data Transformation
– Linear and non-linear PCA
– Random projections
– Time series
Data Cleansing
– Improving decision trees
– Robust regression
– Detecting anomalies
Dealing with large datasets: Slide 2 of 17
3. Attribute Selection
Justification
An irrelevant attribute will often degrade the performance
of state-of-the-art decision tree and rule learners...
• Example: random binary attribute
– Degrades the classification performance 5% to 10% of the time
But a relevant attribute can be harmful as well...
• Example: binary attribute that matches the class value 65% of the time
– Degrades the classification performance 1% to 5% of the time
4. Attribute Selection
1 - Scheme-independent selection
• No universal relevance measure
• Beware of overfitting and model redundancy
• Make sure that the attribute scales are the same
2 - Searching the attribute space
• Exhaustive search is impractical
• Forward, backward, ...: an expert is needed to set the algorithm parameters
3 - Scheme-specific selection
• Time consuming
• "Burns" one classification method
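A scheme-independent (filter) selection can be sketched as follows. This is only an illustration: the relevance measure (absolute Pearson correlation with the target), the function names and the toy data are assumptions, not the method used in the talk. Each attribute is standardized first, so that attributes on different scales are scored comparably.

```python
import math

def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0] * n  # constant attribute: carries no information
    return [(v - mean) / std for v in column]

def correlation_filter(columns, target, keep):
    """Rank attributes by |Pearson correlation| with the target, keep the top ones."""
    n = len(target)
    z_target = standardize(target)
    scores = {}
    for name, values in columns.items():
        z = standardize(values)
        scores[name] = abs(sum(a * b for a, b in zip(z, z_target)) / n)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores

# Toy data, made up for illustration (attribute names are hypothetical).
columns = {
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0, 14.0],
    "noise": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    "acidity_mg": [700.0, 650.0, 640.0, 500.0, 520.0, 400.0],
}
quality = [3.0, 4.0, 5.0, 6.0, 6.0, 8.0]

selected, scores = correlation_filter(columns, quality, keep=2)
print(selected, scores)
```

A filter like this scores each attribute on its own, so it cannot detect redundancy between two strongly correlated attributes.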
5. Attribute Discretization
Justification
Deal with both continuous and discretized data
Handle the extreme values
Some algorithms make an unrealistic assumption about
the attribute values...
• Example: normal distribution assumption
... or slow down the process.
• Example: need to sort the attribute values
6. Attribute Discretization
1 - Unsupervised discretization
• Avoid big differences in bin frequencies
• Avoid small bins
2 - Entropy-based discretization
• Recursive, so it needs a stopping criterion
3 - Other methods
• In practice, do not perform better than entropy-based discretization
• Some are time consuming
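The recursive nature of entropy-based discretization, and the need for a stopping criterion, can be sketched as below. This is a simplified illustration: the stopping rule (a fixed minimum information gain plus a minimum bin size) and the parameter names `min_gain` and `min_size` are assumptions chosen for brevity; an MDL-based test is a common alternative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels, min_size):
    """Return (gain, cut) of the best binary split, or (0.0, None) if none helps."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best = (0.0, None)
    # i is the size of the left bin; both bins must hold at least min_size items.
    for i in range(min_size, n - min_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

def entropy_discretize(values, labels, min_gain=0.1, min_size=2):
    """Recursively collect cut points; stop when the gain is too small."""
    if len(values) < 2 * min_size:
        return []
    gain, cut = best_split(values, labels, min_size)
    if cut is None or gain < min_gain:
        return []  # stopping criterion reached
    left = [(v, l) for v, l in zip(values, labels) if v <= cut]
    right = [(v, l) for v, l in zip(values, labels) if v > cut]
    cuts = entropy_discretize([v for v, _ in left], [l for _, l in left],
                              min_gain, min_size)
    cuts.append(cut)
    cuts += entropy_discretize([v for v, _ in right], [l for _, l in right],
                               min_gain, min_size)
    return cuts

# Toy data, made up for illustration: two well-separated classes.
values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
cuts = entropy_discretize(values, labels)
print(cuts)
```

Without the `min_gain` and `min_size` checks the recursion would keep splitting until every bin is pure, which produces many small bins.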
7. Data Transformation
Justification
Data often calls for general mathematical transformations
of a set of attributes...
• Example: two date attributes may lead to a third attribute
representing age
Test the robustness of a learning algorithm...
• Example: add noise, or change a given percentage of a
nominal attribute's values
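Both examples above can be sketched in a few lines. The function names, the category list and the choice of whole years as the age unit are assumptions made for illustration only.

```python
import random
from datetime import date

def age_in_years(birth, reference):
    """Derive an age attribute (whole years) from two date attributes."""
    years = reference.year - birth.year
    # Subtract one year if the anniversary has not yet been reached.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years

def corrupt_nominal(values, categories, fraction, seed=0):
    """Replace a given fraction of nominal values with a different category."""
    rng = random.Random(seed)
    corrupted = list(values)
    count = round(fraction * len(values))
    for i in rng.sample(range(len(values)), count):
        alternatives = [c for c in categories if c != corrupted[i]]
        corrupted[i] = rng.choice(alternatives)
    return corrupted

# Toy data, made up for illustration.
print(age_in_years(date(1980, 6, 15), date(2012, 6, 14)))

colours = ["red"] * 10
noisy = corrupt_nominal(colours, ["red", "white", "rose"], fraction=0.3)
print(noisy)
```

Training on `noisy` and comparing the result with training on the clean values gives a rough idea of how sensitive a learner is to label or attribute noise.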
8. Data Transformation
1 - Linear and Non-linear PCA
• Dimension reduction technique: there is a loss of information
• Very costly in high dimension
2 - Random projections
• Perform worse than PCA
• Preserve distance relationships well on average
3 - Time Series
• Pay attention to the sampling
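The claim that random projections preserve distances well on average can be checked with a small sketch. It assumes a Gaussian projection matrix scaled by 1/sqrt(k); the dimensions, seeds and function names are arbitrary choices for illustration.

```python
import math
import random
from itertools import combinations

def random_projection_matrix(d, k, seed=0):
    """k x d matrix of N(0, 1) entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Map a d-dimensional point to k dimensions."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Toy data, made up for illustration: 10 random points in 400 dimensions.
rng = random.Random(1)
points = [[rng.uniform(-1.0, 1.0) for _ in range(400)] for _ in range(10)]

matrix = random_projection_matrix(d=400, k=200)
projected = [project(matrix, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i, j in combinations(range(len(points)), 2)]
mean_ratio = sum(ratios) / len(ratios)
print(mean_ratio)
```

Unlike PCA, the projection matrix does not depend on the data, so it is cheap to build; individual ratios still fluctuate around 1, and the spread grows as k gets smaller.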
9. Application Example
- What is the difference between theory and practice?
- There is no difference ... in theory. But in practice, there is.
• Example 1: Attribute Selection (Backward vs Filter)
• Example 2: Attribute Discretization (Chi-2 based vs Top-down)
• Example 3: Data Transformation
10. Example 1
Data set: Wine Quality data
Description of the data: 1599 observations of 12 variables
Question: What makes a good (red) wine?
11. Example 1
How many features do we keep?
Backward RMSE
Number of features: 5
12. Example 1
How many features do we keep?
Filter RMSE
13. Example 2
How do we discretize the features?
Chi-2 discretization MDL discretization
14. Example 2
How do we discretize the features?
Chi-2 Merge discretization Top-down discretization
15. Example 3
How do we transform the data?
Principal Component Analysis
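As a reminder of what PCA computes, the first principal component can be sketched with power iteration on the covariance matrix. This is a minimal stdlib-only illustration on made-up data, not the analysis shown on the slide; the function name and iteration count are assumptions.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, explained variance ratio) of the first component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    # Assumes the starting vector is not orthogonal to the leading direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # data has no variance at all
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    ratio = variance / total if total > 0 else 0.0
    return v, ratio

# Toy data, made up for illustration: points lying exactly on the line y = 2x.
rows = [[float(t), 2.0 * t] for t in range(5)]
direction, ratio = first_principal_component(rows)
print(direction, ratio)
```

On real data the attributes would usually be standardized first, since PCA is sensitive to attribute scales.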
16. Example 3
How do we transform the data?
Projection Pursuit Regression
17. CSIRO Mathematics, Informatics and Statistics
Adrien Ickowicz
t +61 2 9325 3260
e Adrien.Ickowicz@csiro.au
w Mathematics, Informatics and Statistics web
Ross Sparks
t +61 2 9325 3262
e Ross.Sparks@csiro.au
w Mathematics, Informatics and Statistics web