Big Data Workshop

Dealing with large datasets
Avoiding the dangers
Adrien Ickowicz, Ross Sparks

MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au

Managing the data

Can the input be massaged to make it more amenable to learning
methods? (And how can you do it safely?)

Attribute Selection
– Scheme-independent selection
– Searching the attribute space
– Scheme-specific selection

Attribute Discretization
– Unsupervised discretization
– Entropy-based discretization
– Other methods

Data Transformation
– Linear and Non-linear PCA
– Random projections
– Time Series

Data Cleansing
– Improving Decision Tree
– Robust Regression
– Detecting anomalies


Attribute Selection

Justification

An irrelevant attribute will often degrade the performance
of state-of-the-art decision tree and rule learners...

• Example: Random binary attribute
– Deteriorates the classification performance 5% to 10% of the time

But a relevant attribute can be harmful as well...

• Example: a binary attribute that matches the class value 65% of the time
– Deteriorates the classification performance 1% to 5% of the time


Attribute Selection

1 - Scheme-independent selection
• No universal relevance measure
• Beware of overfitting and model redundancy
• Make sure that the attribute scales are the same

2 - Searching the attribute space
• Exhaustive search is impractical
• Forward, backward, ...: an expert is needed to set the algorithm parameters

3 - Scheme-specific selection
• Time consuming
• "Burns" one classification method
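
A scheme-independent (filter) selection can be sketched in a few lines. The example below ranks attributes by the absolute Pearson correlation with the target. This is only one possible relevance measure (the slide notes there is no universal one) and it only captures linear relationships. It has the convenient property of being unaffected by a linear rescaling of an attribute. The attribute names and values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def rank_attributes(columns, target):
    """Attribute names sorted by |correlation| with the target, best first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): "relevant" tracks the target, "noise" does not.
columns = {
    "noise":    [5.0, 1.0, 4.0, 2.0, 3.0, 1.0],
    "relevant": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
}
target = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ranking = rank_attributes(columns, target)
print(ranking)  # ['relevant', 'noise']
```

A filter like this is cheap, but it scores each attribute on its own, so it cannot detect redundancy between attributes.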


Attribute Discretization

Justification

Deal with both continuous and discretized data

Handle the extreme values

Some algorithms make an unrealistic assumption about
the attribute values...
• Example: normal distribution assumption

... or slow down the process.

• Example: need to sort the attribute values


Attribute Discretization

1 - Unsupervised discretization
• Avoid big differences in bin frequencies
• Avoid small bins
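
One common way to respect both points is equal-frequency binning, sketched below under the assumption that cut points are taken at evenly spaced ranks of the sorted values. The function names and the data are illustrative only.

```python
def equal_frequency_edges(values, n_bins):
    """Cut points that put roughly the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for k in range(1, n_bins):
        cut = ordered[(k * n) // n_bins]
        if not edges or cut > edges[-1]:  # skip duplicate cut points caused by ties
            edges.append(cut)
    return edges

def assign_bin(value, edges):
    """Index of the bin containing value; a value equal to an edge goes to the upper bin."""
    return sum(value >= e for e in edges)

values = [1, 2, 2, 3, 4, 5, 6, 8, 13, 40]  # made-up, skewed data
edges = equal_frequency_edges(values, 3)
counts = [0] * (len(edges) + 1)
for v in values:
    counts[assign_bin(v, edges)] += 1
print(edges, counts)  # [3, 6] [3, 3, 4]
```

For comparison, three equal-width bins on the same data would each span 13 units and put nine of the ten values in the first bin.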

2 - Entropy-based discretization
• Recursive, so need a stopping criterion

3 - Other methods
• In practice, do not perform better than entropy-based discretization
• Some are time consuming
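
The recursive nature of entropy-based discretization, and the need for a stopping criterion, can be seen in the following minimal sketch. It stops when the information gain falls below a threshold or when a bin would hold fewer than `min_size` values. Both thresholds are assumptions chosen for illustration, standing in for a more principled criterion such as MDL.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs, min_size):
    """Best (cut point, gain, index) for (value, label) pairs sorted by value."""
    labels = [label for _, label in pairs]
    base = entropy(labels)
    best = (None, 0.0, None)
    # Only consider cuts that leave at least min_size values on each side.
    for i in range(min_size, len(pairs) - min_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left, right = labels[:i], labels[i:]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain, i)
    return best

def entropy_discretize(pairs, min_gain=0.1, min_size=2):
    """Recursively split the value range; return the sorted list of cut points."""
    pairs = sorted(pairs, key=lambda p: p[0])
    cut, gain, i = best_split(pairs, min_size)
    if cut is None or gain < min_gain:
        return []  # stopping criterion reached
    return (entropy_discretize(pairs[:i], min_gain, min_size) + [cut]
            + entropy_discretize(pairs[i:], min_gain, min_size))

data = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]
cuts = entropy_discretize(data)
print(cuts)  # [5.0]
```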


Data Transformation

Justification

Data often calls for general mathematical transformations
of a set of attributes...

• Example: Two date attributes may lead to a third attribute
representing age
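
As a small illustration of such a derived attribute, the sketch below computes an age in whole years from two dates. The record and its field names are hypothetical.

```python
from datetime import date

def age_in_years(start, reference):
    """Whole years elapsed between two dates."""
    years = reference.year - start.year
    # Subtract one if the anniversary has not yet occurred in the reference year.
    if (reference.month, reference.day) < (start.month, start.day):
        years -= 1
    return years

# Hypothetical record with two date attributes.
record = {"date_of_birth": date(1980, 6, 15), "date_of_visit": date(2013, 3, 1)}
record["age"] = age_in_years(record["date_of_birth"], record["date_of_visit"])
print(record["age"])  # 32
```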

Test the robustness of a learning algorithm...

• Example: add noise, or change a given percentage of a nominal
attribute's values
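
A minimal sketch of the second idea: replace a chosen fraction of a nominal attribute's values with a different category, then retrain and compare results. The function name, the seed handling and the sample data are assumptions for illustration.

```python
import random

def corrupt_nominal(values, fraction, seed=0):
    """Copy of values in which `fraction` of the entries are swapped to another category."""
    rng = random.Random(seed)
    categories = sorted(set(values))
    if len(categories) < 2:
        return list(values)  # nothing to swap to
    noisy = list(values)
    n_changed = round(fraction * len(values))
    for i in rng.sample(range(len(values)), n_changed):
        alternatives = [c for c in categories if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

colours = ["red", "white", "red", "rose", "white", "red", "red", "white", "rose", "red"]
noisy = corrupt_nominal(colours, fraction=0.2)
changed = sum(a != b for a, b in zip(colours, noisy))
print(changed)  # 2
```

Because every selected entry is forced to a different category, the number of changed values equals the requested fraction of the data exactly.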


Data Transformation

1 - Linear and Non-linear PCA
• Dimension reduction technique: there is a loss of information
• Very costly in high dimension

2 - Random projections
• Perform worse than PCA
• Preserve distance relationship well on average

3 - Time Series
• Pay attention to the sampling
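
The claim that random projections preserve distances well on average can be checked with a short experiment. The sketch below assumes a Gaussian projection matrix scaled by 1/sqrt(k), which is one standard construction. The dimensions, seeds and data are arbitrary choices for illustration.

```python
import math
import random

def random_projection_matrix(d, k, seed=0):
    """k x d Gaussian matrix scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Multiply the projection matrix by the vector x."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in matrix]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

d, k = 200, 50  # original and reduced dimension (arbitrary)
rng = random.Random(1)
points = [[rng.random() for _ in range(d)] for _ in range(10)]
R = random_projection_matrix(d, k)
projected = [project(R, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i in range(10) for j in range(i + 1, 10)]
mean_ratio = sum(ratios) / len(ratios)
print(round(mean_ratio, 2))
```

The mean ratio should be close to 1, while individual pairs vary around it. The spread shrinks as k grows.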


Application Example

- What is the difference between theory and practice?
- There is no difference ... in theory. But in practice, there is.

• Example 1: Attribute Selection (Backward vs Filter)
• Example 2: Attribute Discretization (Chi-2 based vs Top-down)
• Example 3: Data Transformation


Example 1

Data Set : Wine quality Data

Description of the data: 1599 obs. of 12 variables

Question : What makes a good (red) wine?


Example 1

How many features do we keep?

Backward RMSE

Number of features: 5


Example 1

How many features do we keep?

Filter RMSE


Example 2

How do we discretize the features?

[Figures: Chi-2 discretization; MDL discretization]


Example 2

How do we discretize the features?

[Figures: Chi-2 Merge discretization; Top-down discretization]


Example 3

How do we transform the data?

Principal Component Analysis
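
For readers without the original figure, here is a minimal sketch of how a first principal component can be computed, using power iteration on the sample covariance matrix. The two-attribute data is made up and is not the wine data. With attributes on very different scales, standardizing them first is usually advisable.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of a covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # zero covariance: no direction of variance
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue

# Made-up data lying close to the line y = 2x.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
cov = covariance_matrix(rows)
direction, variance = first_component(cov)
explained = variance / sum(cov[i][i] for i in range(len(cov)))
print([round(x, 2) for x in direction], round(explained, 3))
```

Here the first component points roughly along (1, 2) and explains almost all of the variance, so dropping the second component loses very little information.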


Example 3
How do we transform the data?

Projection Pursuit Regression


CSIRO Mathematics, Informatics and Statistics
Adrien Ickowicz
t +61 2 9325 3260
e Adrien.Ickowicz@csiro.au
w Mathematics, Informatics and Statistics web

CSIRO Mathematics, Informatics and Statistics
Ross Sparks
t +61 2 9325 3262
e Ross.Sparks@csiro.au
w Mathematics, Informatics and Statistics web
