Deriving Knowledge from Data at Scale
Good data preparation is key to producing valid and reliable models…
Lecture 7 Agenda
Heads Up, Lecture 8 Agenda
Course Project
https://www.kaggle.com/c/springleaf-marketing-response
Determine whether or not to send a direct mail piece to a customer
The Data
The Rules
What is the data telling you?
The Tragedy of the Titanic
Does the new data point x* exactly match a previous point xi?
If so, assign it to the same class as xi.
Otherwise, just guess.
This is the "rote" classifier.
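A minimal sketch of the rote classifier in Python (not from the deck; the guess on a non-match is random by definition):

import random

def rote_fit(X, y):
    # Memorize every training point: map each point (as a tuple) to its class.
    return {tuple(x): label for x, label in zip(X, y)}

def rote_predict(memory, x, classes=(0, 1)):
    # Exact match -> the memorized class; otherwise, just guess.
    return memory.get(tuple(x), random.choice(classes))

memory = rote_fit([[1, 0], [0, 1]], [1, 0])
print(rote_predict(memory, [1, 0]))  # exact match -> 1
print(rote_predict(memory, [5, 5]))  # no match -> random guess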
Does the new data point x* match a set of previous points xi on some specific attribute?
If so, take a vote to determine the class.
Example: If most females survived, then assume every female survives.
But there are lots of possible rules like this, and an attribute can have more than two values:
If most people under 4 years old survive, then assume everyone under 4 survives.
If most people with 1 sibling survive, then assume everyone with 1 sibling survives.
How do we choose? One way, sketched below, is to score each candidate rule on the training data.
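A hedged sketch of that scoring idea (essentially OneR): for each attribute, predict the majority class per attribute value, and keep the attribute whose rule scores best on the training data. The column names ('sex', 'pclass', 'survived') are assumptions for illustration.

import pandas as pd

def one_r(df: pd.DataFrame, target: str):
    best = (None, -1.0, None)  # (attribute, accuracy, rule)
    for attr in df.columns.drop(target):
        # Majority class of the target for each value of this attribute.
        rule = df.groupby(attr)[target].agg(lambda s: s.mode()[0])
        acc = (df[attr].map(rule) == df[target]).mean()
        if acc > best[1]:
            best = (attr, acc, rule)
    return best

# e.g. titanic = pd.read_csv("titanic.csv")  # hypothetical file name
# attr, acc, rule = one_r(titanic[["sex", "pclass", "survived"]], "survived")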
IF sex='female' THEN survive=yes
ELSE IF sex='male' THEN survive=no
confusion matrix
no yes <-- classified as
468 109 | no
81 233 | yes
(468 + 233) / (468+109+81+233) = 79% correct (and 21% incorrect)
Not bad!
IF pclass='1' THEN survive=yes
ELSE IF pclass='2' THEN survive=yes
ELSE IF pclass='3' THEN survive=no
confusion matrix
no yes <-- classified as
372 119 | no
177 223 | yes
(372 + 223) / (372+119+223+177) = 67% correct (and 33% incorrect)
a little worse
Support Vector Machines
[Figure: labeled points, where + denotes +1 and - denotes -1; the classifier maps f(x) → y_est]
f(x, w, b) = sign(w · x - b)
How would you classify this data?
Estimation: w is the weight vector, x the data vector.
f(x, w, b) = sign(w · x - b)
Any of these would be fine… but which is best?
f(x, w, b) = sign(w · x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
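In symbols (a standard formulation, not spelled out on the slide): for separable data the margin of the boundary w · x - b = 0 is 2/||w||, so the maximum-margin classifier solves

\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \, (w \cdot x_i - b) \ge 1, \qquad i = 1, \dots, n.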
[Figure: the maximum-margin linear boundary on O/X data; the points touching the margin are the SUPPORT VECTORS]
f(x, w, b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Linear SVM
Support vectors are those datapoints that the margin pushes up against.
Why maximum margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. LOOCV (leave-one-out cross-validation) is easy, since the model is immune to removal of any non-support-vector data points.
4. Empirically it works very well.
[Figure: Class 1 and Class 2 separated by a margin of width m]
This is going to be a problem!
What if the data looks like that below?… This is where a kernel function comes in.
What if the data looks like that below?...
Everything you need to know in one slide…
Kernel choice is probably the trickiest part of using an SVM.
• RBF is a good first option…
• It depends on your data; try several.
• Kernels have even been developed for non-numeric data like sequences, structures, and trees/graphs.
• It may help to use a combination of several kernels.
• Don't touch your evaluation data while you're trying out different kernels and parameters; use cross-validation for this if you're short on data (a sketch follows).
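A sketch of that advice, assuming scikit-learn rather than the Weka tooling demonstrated later: compare kernels by cross-validation on development data only, and leave the evaluation split untouched until the end.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    scores = cross_val_score(SVC(kernel=kernel), X_dev, y_dev, cv=5)
    print(kernel, "CV accuracy: %.3f" % scores.mean())
# Only after committing to a kernel would you score once on X_eval, y_eval.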
The complexity of the optimization problem depends only on the dimensionality of the input space, not on that of the feature space!
• SVM 1 learns "Output==1" vs "Output != 1"
• SVM 2 learns "Output==2" vs "Output != 2"
…
• SVM N learns "Output==N" vs "Output != N"
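A minimal sketch of this one-vs-rest scheme (scikit-learn is an assumption here; it also ships a ready-made OneVsRestClassifier):

import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    # One binary SVM per class k: "Output==k" vs "Output!=k".
    return {k: SVC(kernel="linear").fit(X, (y == k).astype(int))
            for k in np.unique(y)}

def ovr_predict(models, X):
    # Predict the class whose SVM is most confident (largest signed margin).
    classes = np.array(sorted(models))
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return classes[np.argmax(scores, axis=1)]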
• weka.classifiers.functions.SMO
• weka.classifiers.functions.LibSVM
• weka.classifiers.functions.SMOreg
IF pclass='1' THEN survive=yes
ELSE IF pclass='2' THEN survive=yes
ELSE IF pclass='3' THEN survive=no
confusion matrix
no yes <-- classified as
372 119 | no
177 223 | yes
(372 + 223) / (372+119+223+177) = 67% correct (and 33% incorrect)
a little worse
Support Vector Machine Model, Titanic Data, Linear Kernel
Support Vector Machine Model, RBF Kernel
Titanic Data
Support Vector Machine Model, RBF Kernel
Titanic Data
overfitting?
Bill Howe, UW
Support Vector Machine Model, RBF Kernel
Titanic Data
gamma: a parameter that controls/balances model complexity against accuracy.
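A sketch of that trade-off, assuming scikit-learn's RBF SVC: as gamma grows, the boundary bends around individual points, so training accuracy climbs while cross-validated accuracy can fall, which is the overfitting flagged above.

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
for gamma in [0.01, 0.1, 1, 10, 100]:
    train_acc = SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y)
    cv_acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print("gamma=%g  train=%.3f  cv=%.3f" % (gamma, train_acc, cv_acc))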
How They Won It!
Lessons from data mining past Kaggle contests…
10 Minute Break…
Data Wrangling
gold standard data sets
one dominant attribute; even distribution
• Rule of thumb: 5,000 or more instances desired
• Rule of thumb: for each attribute, 10 or more instances
• Rule of thumb: >100 instances for each class
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
1. Missing values
2. Outliers
3. Coding
4. Constraints
• ReplaceMissingValues
• RemoveMisclassified
• MergeTwoValues
Missing values: in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. "Missing" can mean many things…
MAR, "Missing At Random":
– usually the best case
– usually not true
Non-randomly missing
Presumed normal, so not measured
Causally missing
– the attribute value is missing because of other attribute values (or because of the outcome value!)
Representation matters…
[Figure: Bank Data A and Bank Data B, each answering the same query Q]
Data Transformation
Simple transformations can often have a large impact on performance.
Example transformations (not all for performance improvement):
• Difference of two date attributes, distance between coordinates, …
• Ratio of two numeric (ratio-scale) attributes, average for smoothing, …
• Concatenating the values of nominal attributes
• Encoding (probabilistic) cluster membership
• Adding noise to data (for robustness tests)
• Removing data randomly or selectively
• Obfuscating the data (for anonymity)
Intuition: add features that increase class discrimination (entropy, information gain)… A short sketch follows.
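A hedged sketch of the first three transformations in pandas; all column names are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "opened": pd.to_datetime(["2015-01-10", "2015-06-01"]),
    "closed": pd.to_datetime(["2015-03-01", "2015-06-15"]),
    "debt": [5000.0, 1200.0],
    "income": [40000.0, 36000.0],
    "state": ["WA", "OR"],
    "channel": ["mail", "web"],
})

df["days_open"] = (df["closed"] - df["opened"]).dt.days    # date difference
df["debt_to_income"] = df["debt"] / df["income"]           # ratio feature
df["state_channel"] = df["state"] + "_" + df["channel"]    # nominal concatenation
print(df[["days_open", "debt_to_income", "state_channel"]])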
• Combine attributes
• Normalizing data
• Simplifying data
Change of scale
Aggregated data tends to have less variability and potentially more information for better predictions…
Fills in the missing values for an instance with the expected values.
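A sketch of that behaviour in pandas, assuming "expected value" means the column mean for numeric attributes and the mode for nominal ones (which is what Weka's ReplaceMissingValues filter computes from the training data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0],
                   "port": ["S", "C", None, "S"]})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())      # numeric -> mean
    else:
        df[col] = df[col].fillna(df[col].mode()[0])   # nominal -> mode
print(df)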
Specify the new ordering…
2-last, 1
1-5, 7-last, 6
Resample: a random subsample of instances, with or without replacement.
To replace or not…
The same random seed will result in the same (repeatable) sample.
Sample size is given as a percentage of the original data set size. A sketch follows.
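A sketch of those options in Python: the same seed reproduces the sample, and size is a percentage of the original.

import numpy as np

def resample(data, percent=50.0, seed=1, replacement=False):
    rng = np.random.default_rng(seed)        # same seed -> same sample
    n = int(len(data) * percent / 100.0)     # size as a % of the original
    idx = rng.choice(len(data), size=n, replace=replacement)
    return [data[i] for i in idx]

data = list(range(10))
print(resample(data, 50, seed=1))  # a repeatable subsample
print(resample(data, 50, seed=1))  # identical to the line above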
Suggestions for you to try…
Curse of Dimensionality: the amount of data needed to cover the space grows exponentially with the number of dimensions.
In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!
• Sampling
Sampling can work almost as well as using the entire data set, provided the sample has the same property (of interest) as the original set of data.
• Principal Component Analysis
Find k ≤ n new features (principal components) that can best represent the data → works for numeric data only…
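A sketch with scikit-learn (an assumption; Weka exposes the same idea as a filter), keeping enough components to explain 95% of the variance:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)  # a nearly redundant dimension

X_scaled = StandardScaler().fit_transform(X)     # PCA is scale-sensitive
pca = PCA(n_components=0.95)                     # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.round(2))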
Attribute Selection
(feature selection)
What is feature selection?
DIMENSIONALITY REDUCTION
Problem: Where to focus attention?
Feature selection starts with you… aim for the smallest subset of attributes.
What is evaluated: individual attributes, or subsets of attributes.
Evaluation method: independent of any learner (Filters), or using the learning algorithm itself (Wrappers).
A Ranker produces a list of attributes, each evaluated individually, from which you select a subset.
A correlation coefficient shows the degree of linear dependence of x and y. In other words, the coefficient shows how close two variables lie along a line. If the coefficient is equal to 1 or -1, all the points lie along a line. If the correlation coefficient is equal to zero, there is no linear relation between x and y. However, this does not necessarily mean that there is no relation between the two variables: there could, for example, be a non-linear relation.
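A small worked example of both the coefficient and the caveat:

import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

x = np.linspace(-1, 1, 101)
print(pearson_r(x, 2 * x + 1))  # points exactly on a line -> 1.0
print(pearson_r(x, x ** 2))     # strong non-linear relation -> ~0.0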
Tab for selecting attributes in a data set…
Interface for classes that evaluate attributes…
Interface for ranking or searching for a subset of attributes…
Select CorrelationAttributeEval for Pearson Correlation…
False: doesn't return the R scores; True: returns the R scores.
Ranks attributes by their individual evaluations; used in conjunction with GainRatio, Entropy, Pearson, etc.
Number of attributes to return; -1 returns all ranked attributes.
Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10].
Cutoff at which attributes can be discarded; -1 means no cutoff.
Predicting Self-Reported Health Status
The Data Set: NHANES_data.csv (National Health and Nutrition Examination Survey)
"How would you say your health in general is?"
An excellent predictor of mortality, health care utilization & disability.
How I processed it…
• 4000 variables;
• Attributes with > 30% missing values removed (dropped column);
• 105 variables remaining;
• Chi-square test between each variable and the target; remove variables with P value > .20;
• Impute all missing values using expectation maximization;
• 85 variables remaining;
Pearson Correlation Exercise…
NHANES_data.csv
NHANES_data.csv
• Convert the last column from numeric to nominal
• Find the top 15 features using Pearson Correlation
• OneRAttributeEval
• GainRatioAttributeEval
• InfoGainAttributeEval
• ChiSquaredAttributeEval
• ReliefFAttributeEval
• Right-click on the new line in the Result list;
• From the pop-up menu, select the item Save reduced data…
• Save the dataset with the 15 selected attributes to the file NHanesPearson.arff
• Switch to the Preprocess mode in Explorer
• Click on Open file… and open the file NHanesPearson.arff
• Switch to the Classify submode
• Click on Choose, select a classifier, and use this feature set and data to build a predictive model;
• Anything below 0.3 isn't highly correlated with the target…
Forward selection: at each step, add the attribute that helps most given the already picked features; the evaluation can be independent of the learner.
Backward elimination: start with all attributes and repeatedly remove the one that makes the least contribution; again, the evaluation can be independent of the learner.
Forward, Backward, Bi-Directional
Attributes to "seed" the search, listed individually or by range.
Cutoff for backtracking…
CfsSubsetEval
True: adds features that are correlated with the class and NOT intercorrelated with other features already in the selection. False: eliminates redundant features.
Precompute the correlation matrix in advance (useful for fast backtracking), or compute it lazily; when given a large number of attributes, compute lazily…
NHANES_data.csv
• Convert the last column from numeric to nominal
• Set the search method as Best First, Forward
• Set the attribute evaluator as CfsSubsetEval
• Run across all attributes in the data set… (a simplified sketch of what this does follows)
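A simplified sketch of what CfsSubsetEval plus a forward search do: plain greedy forward selection with the CFS merit and Pearson correlations on numeric data, not Weka's exact implementation. The names in the usage comment are hypothetical.

import numpy as np
import pandas as pd

def cfs_merit(df, subset, target):
    # Merit = k * mean|corr(feature, class)| / sqrt(k + k(k-1) * mean|corr(feature, feature)|)
    k = len(subset)
    rcf = np.mean([abs(df[f].corr(df[target])) for f in subset])
    rff = 0.0 if k == 1 else np.mean(
        [abs(df[a].corr(df[b])) for a in subset for b in subset if a != b])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

def forward_cfs(df, target):
    remaining = set(df.columns) - {target}
    selected, best = [], -np.inf
    while remaining:
        merit, f = max((cfs_merit(df, selected + [f], target), f)
                       for f in remaining)
        if merit <= best:   # stop when no candidate improves the merit
            break
        selected.append(f)
        remaining.remove(f)
        best = merit
    return selected, best

# e.g. subset, merit = forward_cfs(nhanes_numeric, "health_status")  # hypothetical names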
Important points 1/2
• Feature selection can significantly increase the performance of a learning algorithm (both accuracy and computation time), but it is not easy!
• Relevance <-> Optimality
• Correlation and mutual information between single variables and the target are often used as ranking criteria for variables.
Important points 2/2
• One cannot automatically discard variables with small scores; they may still be useful together with other variables.
• Filters - Wrappers - Embedded Methods
• How do we search the space of all feature subsets?
• How do we assess the performance of a learner that uses a particular feature subset?
It's not all about accuracy…
• Filtering is fast, linear, and intuitive.
• Filtering is model-oblivious and may not be optimal.
• Wrappers are model-aware but slow and non-intuitive.
• PCA and SVD are lossy and work on the entire data set.
• In practice, start with fast feature filtering first; sometimes the best choice is NOT to use any feature selection.
That's all for tonight…