2. CONTENTS
1. Importance of Good Features
2. Irrelevant and Redundant Features
3. Feature Pruning and Normalization
4. Evaluating Model Performance
5. Cross Validation
6. Hypothesis Testing and Statistical Significance
7. Debugging Learning Algorithms
8. Bias/Variance Trade-off
3. IMPORTANCE OF GOOD FEATURES
Feature:
• an individual measurable property of the phenomenon being observed
• the basis on which a model is built
Importance of Features:
• choosing features poorly will result in an unreliable model
Figure: Machine learning workflow
4. FEATURE EXTRACTION EXAMPLE
pixel representation
• a 100 x 100 pixel image = a 30,000-dimensional vector
• each dimension corresponds to the red, green, or blue value of one pixel
patch representation
• the unit of interest is a small rectangular block, rather than a single pixel
object recognition from images
Figure: pixel representation
Figure: patch representation
5. FEATURE EXTRACTION EXAMPLE
shape representation
• throw out all color and pixel information
• simply provide a bounding polygon of the object
text categorization: bag of words representation
object recognition from images
Figure: pixel representation
Figure: shape representation
Figure: text categorization
6. IRRELEVANT AND REDUNDANT FEATURES
Irrelevant Feature:
an irrelevant feature is one that is completely uncorrelated with the prediction task
eg: the presence of the word “the” is largely irrelevant for predicting whether a course review is positive or negative
7. IRRELEVANT AND REDUNDANT FEATURES
Redundant Feature:
two features are redundant if they are highly correlated
eg: having a bright red pixel at position (20, 93) in an image is probably highly redundant with having a bright red pixel at position (21, 93); both might be useful for identifying fire hydrants
Figure: fire hydrants
8. FEATURE PRUNING AND NORMALIZATION
Feature Pruning:
eg: the word “good” appears in exactly one training document, which happens to be positive. With only one training example, it is hard to tell whether the word is really correlated with the positive class or is just noise.
Pruning such rare features:
• reduces the size of decision trees
• reduces the complexity of the final classifier
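The pruning idea above can be sketched as dropping words that appear in too few training documents (the `min_df` threshold and helper name are assumptions for illustration, not from the slides):

```python
from collections import Counter

def prune_rare_features(docs, min_df=2):
    """Keep only words appearing in at least `min_df` documents.

    `docs` is a list of token lists; returns the surviving vocabulary.
    """
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for doc in docs:
        for word in set(doc):
            df[word] += 1
    return {w for w, c in df.items() if c >= min_df}

# Toy course-review corpus: "good", "boring", and "easy" each occur in
# only one document, so they are pruned as potential noise.
docs = [["good", "course", "fun"],
        ["boring", "course"],
        ["fun", "course", "easy"]]
vocab = prune_rare_features(docs, min_df=2)
```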
9. FEATURE PRUNING AND NORMALIZATION
Normalization:
rescale features to make it easier for your learning algorithm to learn.
eg: the height of the “A” has been reduced from 8 to 6 pixels, while the width has been reduced from 7 to 5 pixels.
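One common form of normalization is centering each feature and scaling it to unit variance; the slides' image-resizing example is an instance of the same rescaling idea (the function below is a sketch, not the slides' method):

```python
def standardize(columns):
    """Center each feature column at zero and scale to unit variance."""
    result = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = var ** 0.5 or 1.0  # guard: constant features divide by 1
        result.append([(x - mean) / std for x in col])
    return result

# Hypothetical feature column: character heights in pixels.
heights = standardize([[8.0, 6.0, 7.0]])[0]
```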
10. EVALUATING MODEL PERFORMANCE
Purpose:
build a highly accurate classifier, eg for medical diagnosis or spam detection.
There are two major types of binary classification problems:
1. “X versus Y.” For instance, positive versus negative sentiment.
2. “X versus not-X.” For instance, spam versus non-spam.
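In “X versus not-X” problems the classes are often imbalanced, which is why raw accuracy alone can mislead. A hypothetical illustration (the data and degenerate classifier are invented for this sketch):

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical spam data: only 5% of messages are spam (label 1).
labels = [1] * 5 + [0] * 95
always_ham = [0] * 100  # degenerate classifier that never predicts spam
acc = accuracy(always_ham, labels)  # high accuracy despite catching no spam
```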
11. CROSS VALIDATION
Cross validation is a method for:
• evaluating and comparing learning algorithms
• estimating how a model will perform in the future
It works by dividing the data into two segments: one used to learn or train a model, and the other used to validate the model.
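The train/validate split behind k-fold cross validation can be sketched as follows (the function name and interface are assumptions for illustration):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) pairs for k-fold CV.

    Each of the n examples lands in exactly one validation fold.
    """
    # Distribute examples as evenly as possible across the k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(k_fold_splits(10, 5))  # five folds, two validation examples each
```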
12. HYPOTHESIS TESTING AND STATISTICAL SIGNIFICANCE
eg: in cross validation, compare 7% error against 6.9% error over 1000 examples. Is the difference meaningful, or could it be due to chance?
Statistical significance plays the same role in machine learning as it does in statistical hypothesis testing.
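The 7% versus 6.9% comparison can be checked with a rough two-proportion z-test. This is a simplified, unpaired sketch; a paired test on the same examples would be more powerful:

```python
def error_difference_z(err_a, err_b, n):
    """Approximate z-statistic for two error rates, each measured on n examples."""
    p = (err_a + err_b) / 2              # pooled error estimate
    se = (p * (1 - p) * (2 / n)) ** 0.5  # standard error of the difference
    return (err_a - err_b) / se

z = error_difference_z(0.07, 0.069, 1000)
# |z| is far below the 1.96 threshold, so the 0.1% gap is not
# statistically significant at the 95% level on 1000 examples.
```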
13. DEBUGGING LEARNING ALGORITHMS
• Learning algorithms are notoriously hard to debug, because:
• it is often unclear whether there is a bug at all, or
• the problem is simply too hard, or
• there is too much noise in the data
• Moreover, sometimes bugs lead to learning algorithms performing better than expected.
14. BIAS/VARIANCE TRADE-OFF
There is a trade-off between estimation error and approximation error.
Let f be the learned classifier, selected from a set F of all possible classifiers using a fixed representation, and let f* be the optimal classifier in F.
• estimation error measures how far the actual learned classifier f is from the optimal classifier f*
• approximation error measures the quality of the model family F itself
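One common way to write this decomposition (a sketch; here ε(·) denotes expected test error, notation not taken from the slides):

```latex
\underbrace{\varepsilon(f)}_{\text{total error}}
  \;=\;
\underbrace{\varepsilon(f) - \varepsilon(f^{*})}_{\text{estimation error}}
  \;+\;
\underbrace{\varepsilon(f^{*})}_{\text{approximation error}}
```

Shrinking F reduces estimation error but can increase approximation error, and vice versa; this is the bias/variance trade-off.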