Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Allston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
In these slides I answer basic questions about machine learning, such as:
What is Machine Learning?
What are the types of machine learning?
How to deal with data?
How to test model performance?
Valencian Summer School 2015
Day 1
Lecture 3
Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Generative Adversarial Networks: Basic Architecture and Variants, by ananth
In this presentation we review the fundamentals behind GANs and look at different variants. We quickly review the theory, including the cost functions, training procedure, and common challenges, and go on to look at variants such as CycleGAN and SAGAN.
Deep Learning for Practitioners, Lecture 2: Selecting the right applications..., by ananth
In this presentation we articulate when deep learning techniques yield the best results from a practitioner's point of view. Do we apply deep learning techniques to every machine learning problem? What characteristics make an application suitable for deep learning? Does more data automatically imply better results regardless of the algorithm or model? Does "automated feature learning" obviate the need for data preprocessing and feature design?
A presentation about supervised learning, mainly discussing regression and classification, with further discussion of how to apply supervised learning in practice.
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 6: Time Series and Deepnets. By Charles Parker (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
This is the first lecture on Applied Machine Learning. The course focuses on the emerging and modern aspects of this subject, such as Deep Learning, Recurrent and Recursive Neural Networks (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Hidden Markov Models (HMM). It deals with several application areas, such as Natural Language Processing and Image Understanding. This presentation provides the landscape.
In this presentation we discuss the hypothesis of MaxEnt models, describe the role of feature functions and their applications to Natural Language Processing (NLP). The training of the classifier is discussed in a later presentation.
Generating Natural-Language Text with Neural Networks, by Jonathan Mugan
Automatic text generation enables computers to summarize text, to have conversations in customer-service and other settings, and to customize content based on the characteristics and goals of the human interlocutor. Using neural networks to automatically generate text is appealing because they can be trained through examples with no need to manually specify what should be said when. In this talk, we will provide an overview of the existing algorithms used in neural text generation, such as sequence2sequence models, reinforcement learning, variational methods, and generative adversarial networks. We will also discuss existing work that specifies how the content of generated text can be determined by manipulating a latent code. The talk will conclude with a discussion of current challenges and shortcomings of neural text generation.
Artificial Intelligence Course: Linear Models, by ananth
In this presentation we cover linear models for regression and classification, illustrated with several examples. Concepts such as underfitting (bias) and overfitting (variance) are presented. Linear models can be used as standalone classifiers in simple cases, and they are essential building blocks of larger deep learning networks.
This presentation discusses decision trees as a machine learning technique. It introduces the problem with several examples: cricket player selection, medical C-section diagnosis, and mobile phone price prediction. It presents the ID3 algorithm, shows how the decision tree is induced, and covers the definition and use of concepts such as entropy and information gain.
Feature Importance Analysis with XGBoost in Tax Audit, by Michael Benesty
A presentation of a real use case at the Taj law firm (Deloitte Paris): applying machine learning to accounting data to help clients prepare for their tax audit.
An introductory course on building ML applications, with a primary focus on supervised learning. It covers the typical ML application cycle (problem formulation, data definitions, offline modeling, platform design) and includes key tenets for building applications.
Note: This is an old slide deck. The content on building internal ML platforms is somewhat outdated, and the slides on model choices do not include deep learning models.
Valencian Summer School 2015
Day 1
Lecture 1
State of the Art in Machine Learning
Poul Petersen (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Feature engineering: the underdog of machine learning. This deck provides an overview of feature generation methods for text, image, and audio data, feature cleaning and transformation methods, how well they work, and why.
Top contenders in the 2015 KDD Cup include the team from DataRobot comprising Owen Zhang, the #1-ranked Kaggler, and top Kagglers Xavier Conort and Sergey Yurgenson. Get an in-depth look as Xavier describes their approach. DataRobot allowed the team to focus on feature engineering by automating model training, hyperparameter tuning, and model blending, giving the team a firm advantage.
The slides from my talk at the FOSDEM HPC, Big Data and Data Science devroom, with general tips from various sources about putting your first machine learning model into production.
The video is available from the FOSDEM website: https://fosdem.org/2017/schedule/event/machine_learning_zoo/
Can automated feature engineering prevent target leaks? By Meir Maor
In this talk we will review common and subtle ways of how problem definitions can go wrong. Exemplified by cases we encounter in the field, we will discuss target leaks (the use of information which cannot be available at prediction time), address sampling bias and consider ways to identify & tackle them.
You'll hear many real-life examples of how these issues manifested and see how introducing automated feature engineering can change the way data scientists discover and treat them.
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b... (Data Con LA)
Feature engineering, writing code to map raw input data into a set of signals that will be fed into a machine learning algorithm, is the dark art of data science. Although the process of crafting new features is tedious and failure-prone, the key to a successful model is a diverse set of high-quality features that are informed by domain experts. Recently, academic researchers have begun to focus on the problem of feature engineering, and have started to publish research that addresses the relative lack of tools designed to support the feature engineering process. In this talk, I will review some of my favorite papers and present some efforts to convert these ideas into tools that leverage the principles of reactive application design in order to make feature engineering (dare I say it) fun.
Valencian Summer School 2015
Day 1
Lecture 3
Ensembles of Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota (Data Con LA)
Producing highly accurate predictive models in social data mining can be a challenge. Feature engineering using traditional methodologies can only take you so far; trying to find that needle in a haystack when the subject matter is too domain-specific or prone to ambiguity can require large investments to achieve accurate results. In this presentation we will discuss methodologies used by Toyota's Research and Development Data Science Team and share secrets of building highly accurate predictive models for social data using innovative feature engineering techniques applied on the Apache Spark and MLlib platform.
Open Source Tools & Data Science Competitions, by ODSC
This talk shares the presenter's experience with open source tools in data science competitions. In the past several years, Kaggle and other competitions have created a large online community of data scientists. In addition to competing with each other for fame and glory, members of this community also generously share knowledge and insights through forums and open source code. The open competition and sharing have resulted in rapid progress in the sophistication of the entire community. This presentation briefly covers this journey from a competitor's perspective and shares hands-on tips on open source tools that have proven popular and useful in recent competitions.
04 Accelerating DL Inference with (Open)CAPI and Posit Numbers, by Yutaka Kawai
This was presented by Louis Ledoux and Marc Casas at the OpenPOWER Summit EU 2019. The original is available at:
https://static.sched.com/hosted_files/opeu19/1a/presentation_louis_ledoux_posit.pdf
Machine Learning for IoT - Unpacking the Black Box, by Ivo Andreev
Have you ever considered machine learning a black box? It can seem like a kind of magic. Although it is only one among many available solutions, Azure ML has proved to be a great balance between flexibility, usability, and affordable price. But how does Azure ML compare with the other ML providers? How do you choose the appropriate algorithm? Do you understand the key performance indicators and how to improve the quality of your models? The session is about understanding the black box and using it for IoT workloads, and not only those.
Invited talk at Tsinghua University on "Applications of Deep Neural Networks". As the technical lead of the deep learning task force at NIO USA Inc., I was invited to give this colloquium talk on general applications of deep neural networks.
Art of Feature Engineering for Data Science, with Nabeel Sarwar (Spark Summit)
We will discuss what feature engineering is all about, various techniques to use, and how to scale to 20,000-column datasets using random forests, SVD, and PCA. Also demonstrated is how we can build a service around these to save time and effort when building hundreds of models. We will share how we did all this using Spark ML to build logistic regression, neural networks, Bayesian networks, etc.
Smaller and Easier: Machine Learning on Embedded Things, by NUS-ISS
Machine learning, meet things. Embedded machine learning is the blend of Machine Learning with Internet of Things and Edge Computing. This talk will cover recent topics in the Embedded Machine Learning field that have made it easier for anyone to deploy ML on small devices. We'll look at: TinyML, Edge Impulse, and Eloquent Arduino.
A presentation on a special category of databases called Deductive Databases. It is an attempt to merge logic programming with relational database. Other types include Object-oriented databases, Graph databases, XML databases, Multi-model databases, etc.
Hadoop Summit 2012 | Bayesian Counters AKA In-Memory Data Mining for Large Da... (Cloudera, Inc.)
Processing large data requires new approaches to data mining: low, close-to-linear complexity, and stream processing. While in traditional data mining the practitioner is usually presented with a static dataset, which might have just a timestamp attached to it, from which to infer a model for predicting future or held-out observations, in stream processing the problem is often posed as extracting as much information as possible from the current data to convert it into an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I consider overall design goals as well as HBase schema design dilemmas that speed up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor, and Bayesian Learning on top of Bayesian Counters.
Nearest neighbor models are conceptually just about the simplest kind of model possible. The problem is that they generally aren’t feasible to apply. Or at least, they weren’t feasible until the advent of Big Data techniques. These slides will describe some of the techniques used in the knn project to reduce thousand-year computations to a few hours. The knn project uses the Mahout math library and Hadoop to speed up these enormous computations to the point that they can be usefully applied to real problems. These same techniques can also be used to do real-time model scoring.
PEARC17: A Real-Time Machine Learning and Visualization Framework for Scientif..., by Feng Li
High-performance computing resources are currently widely used in science and engineering. Typical post-hoc approaches use persistent storage to save data produced by a simulation, so reading from storage into memory is required for data analysis tasks. For large-scale scientific simulations, such I/O operations produce significant overhead. In-situ/in-transit approaches bypass I/O by accessing and processing in-memory simulation results directly, which suggests simulations and analysis applications should be more closely coupled. This paper constructs a flexible and extensible framework to connect scientific simulations with multi-step machine learning processes and in-situ visualization tools, providing plug-in analysis and visualization functionality over complex workflows in real time. A distributed simulation-time clustering method is proposed to detect anomalies in real turbulent flows.
A fascinating view of the Artificial Intelligence journey.
Ramón López de Mántaras, Ph.D.
Technical and Business Perspectives on the Current and Future Impact of Machine Learning - MLVLC
October 20, 2015
Real-world Stories and Long-term Risks and Opportunities.
Tom Dietterich, Ph.D.
Technical and Business Perspectives on the Current and Future Impact of Machine Learning - MLVLC
October 20, 2015
Valencian Summer School 2015
Day 2
Lecture 15
Machine Learning - Black Art
Charles Parker (Allston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Valencian Summer School 2015
Day 1
Lecture 9
Real World Machine Learning - Cooking Predictions
Andrés González (CleverTask)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Valencian Summer School 2015
Day 2
Lecture 11
The Future of Machine Learning
José David Martín-Guerrero (IDAL, UV)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Valencian Summer School 2015
Day 1
Lecture 7
A developers’ overview of the world of predictive APIs
Louis Dorard (PAPIs.io)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to evolve alongside supply, driven by institutional investment rotating out of offices and into work from home ("WFH"), and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting Primitives for Graphs: SHORT REPORT / NOTES, by Subhajit Sahu
Graph algorithms, like PageRank ... Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is ...
Multiply with different modes (map):
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparison of various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce):
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce):
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparison of various launch configs for CUDA-based vector element sum (memcpy).
4. Comparison of various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce):
1. Comparison of various launch configs for CUDA-based vector element sum (in-place).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Full Disclosure
• Oregon State University (structured output spaces)
  • Music recognition
  • Real-time strategy game-playing
• Kodak Research Labs
  • Media classification (audio, video)
  • Document classification
  • Performance evaluation
• BigML
• Allston Trading (applying machine learning to market data)
3. Data Transformation
• But it's "machine learning"!
• Your data sucks (or at least I hope it does) . . .
  • Data is broken
  • Data is incomplete
• . . . but you know about it!
  • Make the problem easier
  • Make the answer more obvious
• Don't waste time modeling the obvious
  • Until you find the right algorithm for it
4. Your Data Sucks I: Broken Features
• Suppose you have a market data feature called trade imbalance = (buy - sell) / total volume that you calculate every five minutes
• Now suppose there are no trades over five minutes
• What to do?
  • Point or feature removal
  • Easy default
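A minimal sketch of this guard in Python (not from the deck; the function and its inputs are hypothetical). Returning a missing marker corresponds to point removal; returning 0.0 would be the "easy default":

```python
def trade_imbalance(buy_volume, sell_volume):
    """(buy - sell) / total volume, or None when there were no trades."""
    total = buy_volume + sell_volume
    if total == 0:
        # No trades this window: mark the point as missing (point removal),
        # or return 0.0 here to take the "easy default" instead.
        return None
    return (buy_volume - sell_volume) / total

print(trade_imbalance(120, 80))  # 0.2
print(trade_imbalance(0, 0))     # None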
5. Your Data Sucks II: Missing Values
• Suppose you're building a model to predict the presence or absence of cancer
• Each feature is a medical test
  • Some are simple (height, weight, temperature)
  • Some are complex (blood counts, CAT scan)
• Some patients have had all of these done, some have not
• Does the presence or absence of a CAT scan tell you something? Should it be a feature?

Height | Weight | Blood Test | Cancer?
 179   |  80    |            | No
 160   |  60    | 2,4        | No
 150   |  65    | 4,5        | Yes
 155   |  70    |            | No
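One common way to act on this idea, sketched with pandas (my choice of tool, not the deck's): record the presence of the test as its own feature before imputing the value. I read the table's decimal-comma values ("2,4") as 2.4, which is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical recreation of the slide's table; NaN = test never performed.
df = pd.DataFrame({
    "height": [179, 160, 150, 155],
    "weight": [80, 60, 65, 70],
    "blood_test": [np.nan, 2.4, 4.5, np.nan],  # assumes "2,4" means 2.4
})

# The absence of a test may itself be informative: keep it as a feature.
df["had_blood_test"] = df["blood_test"].notna().astype(int)
df["blood_test"] = df["blood_test"].fillna(df["blood_test"].median())
print(df)
```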
6. Simplifying Your Problem
• What about the class variable?
• It's just another feature, so it can be engineered
• Change the problem
  • Do you need so many classes?
  • Do you need to do a regression?
7. Feature Engineering: What?
• Your data may be too "raw" for learning
  • Multimedia data
  • Raw text data
• Something must be done to make the data "learnable"
  • Compute edge histograms, SIFT features
  • Do word counts, latent topic modeling
8. An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another

Lat. 1    | Long. 1     | Lat. 2    | Long. 2     | Can Walk?
48.871507 | 2.354350    | 48.872111 | 2.354933    | Yes
48.872111 | 2.354933    | 44.597422 | -123.248367 | No
48.872232 | 2.354211    | 48.872111 | 2.354933    | Yes
44.597422 | -123.248367 | 48.872232 | 2.354211    | No
9. • Whether two points are walking distance from each other is not an obvious function of the latitude and longitude
• But it is an obvious function of the distance between the two points
• Unfortunately, that function is quite complicated
• Fortunately, you know it already!
10. An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another

Lat. 1    | Long. 1     | Lat. 2    | Long. 2     | Distance (km) | Can Walk?
48.871507 | 2.354350    | 48.872111 | 2.354933    | 2             | Yes
48.872111 | 2.354933    | 44.597422 | -123.248367 | 9059          | No
48.872232 | 2.354211    | 48.872111 | 2.354933    | 5             | Yes
44.597422 | -123.248367 | 48.872232 | 2.354211    | 9056          | No
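The deck doesn't say how the distance column was computed; the haversine great-circle distance is one standard choice for this feature, sketched here:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

# The second row of the table: Paris to Corvallis, roughly 9000 km.
print(round(haversine_km(48.872111, 2.354933, 44.597422, -123.248367)))
```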
11. Feature Engineering
• One of the core (maybe the core) competencies of a machine learning engineer
  • Requires domain understanding
  • Requires algorithm understanding
• If you do it really well, you eliminate the need for machine learning entirely
• Gives you another path to success; you can often substitute domain knowledge for modeling expertise
• But what if you don't have specific domain knowledge?
12. Techniques I: Discretization
• Construct meaningful bins for a continuous feature (two or more)
  • Body temperature
  • Credit score
• The new features are categorical features, each category of which has nice semantics
• Don't make the algorithm waste effort modeling things that you already know about
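A minimal binning sketch with pandas; the bin edges here are illustrative assumptions, not values from the deck:

```python
import pandas as pd

# Body temperature in Celsius, discretized into bins with clear semantics.
temps = pd.Series([35.8, 36.6, 37.9, 39.4])
labels = pd.cut(temps,
                bins=[0.0, 36.0, 37.5, 38.5, 45.0],
                labels=["low", "normal", "elevated", "fever"])
print(labels.tolist())  # ['low', 'normal', 'elevated', 'fever']
```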
13. Techniques II: Delta
• Sometimes, the difference between two features is the important bit
  • As it was in the distance example
• Also holds a lot in the time domain
  • Example: Hiss in speech recognition
• Struggling? Just differentiate! (In all seriousness, this sometimes works)
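In the time domain, "just differentiate" can be as simple as taking first differences; a sketch with made-up values:

```python
import pandas as pd

# First differences of a time-ordered signal as a new feature.
prices = pd.Series([100.0, 101.5, 101.0, 104.0])
delta = prices.diff()   # change relative to the previous observation
print(delta.tolist())   # [nan, 1.5, -0.5, 3.0]
```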
14. Techniques III: Windowing
• If points are distributed in time, previous points in the same window are often very informative
  • Weather
  • Stock prices
• Add this to a 1-d sequence of points to get an instant machine learning problem!
  • Sensor data
  • User behavior
• Maybe add some delta features?
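A sketch of the windowing idea: lagged copies of a 1-d series become the features and the current value becomes the target. The window size and column names are my own:

```python
import pandas as pd

# Turn a 1-d sequence into a supervised problem with a window of w lags.
series = pd.Series([3.0, 4.0, 5.0, 4.5, 6.0, 7.0])
w = 2
frame = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, w + 1)})
frame["target"] = series
print(frame.dropna())  # rows without a full window are dropped
```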
15. Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one (subtract the mean and divide by the standard deviation)
• Good for domains with heterogeneous but Gaussian-distributed data sources
  • Demographic data
  • Medical testing
• Note that this isn't in general effective for decision trees!
  • The transformation is order preserving
  • Decision tree splits rely only on ordering!
• Good for things like k-NN
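The transformation in one line of NumPy, on made-up data (scikit-learn's StandardScaler does the same thing):

```python
import numpy as np

# Z-score each column: subtract the mean, divide by the standard deviation.
X = np.array([[170.0, 60.0],
              [180.0, 90.0],
              [160.0, 75.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0), Z.std(axis=0))  # columns now have mean ~0, std 1
```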
16. Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0] and [1, 1, 1] -> [0.57, 0.57, 0.57])
• Nice for sparse feature spaces like text
  • Helps us tell the difference between documents and dictionaries
  • We'll come back to the idea of sparsity
• Note that this will affect decision trees
  • Does not necessarily preserve order (co-dependency between features)
  • A lesson against over-generalization of technique!
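A sketch reproducing the slide's example (the 0.57 on the slide is 1/sqrt(3) ≈ 0.577):

```python
import numpy as np

# Scale each row (feature vector) to unit L2 norm.
X = np.array([[0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
print(X_unit)  # [[0, 1, 0], [0.577, 0.577, 0.577]]
```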
17. What Do We Really Want?
• This is nice, but whatever happened to "machine learning"?
• Construct a feature space in which "learning is easy", whatever that means
• The space must preserve "important aspects of the data", whatever that means
• Are there general ways of posing this problem? (Spoiler alert: Yes)
18. Aside I: Projection
• A projection is a one-to-one mapping from one feature space to another
• We want a function f(x) that projects a point x into a space where a good classifier is obvious
• The axes (features) in your new space are called your new basis
19. A Hack Projection: Distance to Cluster
• Do clustering on your data
• For each point, compute the distance to each cluster centroid
• These distances are your new features
• The new space can be either higher or lower dimensional than your original space
• For highly clustered data, this can be a fairly powerful feature space
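A sketch of this hack with scikit-learn's k-means, whose transform method returns exactly these centroid distances (the data and cluster count are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Fit k clusters, then use the distance to each centroid as k new features.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
X_new = kmeans.transform(X)  # shape (200, 8): distance to each centroid
print(X_new.shape)
```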
20. Principal Component Analysis
• Find the axis through the data with the highest variance
• Repeat for the next orthogonal axis and so on, until you run out of data or dimensions
• Each axis is a feature
21. PCA is Nice!
• Generally quite fast (matrix decomposition)
• Features are linear combinations of the originals (which means you can project test data into the space)
• Features are linearly independent (great for some algorithms)
• Data can often be "explained" with just the first few components (so this can be "dimensionality reduction")
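A sketch of both nice properties, fitting on training data and projecting held-out data; the scikit-learn implementation and random data are my assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_test = rng.normal(size=(20, 10))

# Fit on training data; the learned linear projection applies to test data.
pca = PCA(n_components=3).fit(X_train)
print(pca.transform(X_test).shape)          # (20, 3)
print(pca.explained_variance_ratio_.sum())  # variance kept by 3 components
```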
22. Spectral Embeddings
• Two of the seminal ones are Isomap and LLE
• Generally, compute the nearest neighbor matrix and use this to create the embedding
• Pro: Pretty spectacular results
• Con: No projection matrix
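A sketch using scikit-learn's Isomap, one of the two seminal methods named above; note that, per the "con", the original formulations give no projection matrix for embedding new points:

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Build the nearest-neighbor graph and embed into 2 dimensions.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (100, 2)
```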
23. Combination Methods
• Large Margin Nearest Neighbor, Xing's Method
• Create an objective function that preserves neighbor relationships
  • Neighbor distances (unsupervised)
  • Closest points of the same class (supervised)
• Clever search for a projection matrix that satisfies this objective (usually an elaborate sort of gradient descent)
• I've had some success with these
24. Aside II: Sparsity
• Machine learning is essentially compression, and constantly plays at the edges of this idea
  • Minimum description length
  • Bayesian information criteria
  • L1 and L2 regularization
• Sparse representations are easily compressed
• So does that mean they're more powerful?
25. Sparsity I: Text Data
• Text data is inherently sparse
• The fact that we choose a small number of words to use gives a document its semantics
• Text features are incredibly powerful in the grand scheme of feature spaces
  • One or two words allow us to do accurate classification
  • But those one or two words must be sparse
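The sparsity is easy to see in a bag-of-words representation; a sketch with scikit-learn and two example sentences of my own:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each document uses only a handful of the vocabulary's words, so the
# document-term matrix is stored sparsely.
docs = ["the cat sat on the mat",
        "stock prices fell sharply today"]
X = CountVectorizer().fit_transform(docs)
print(X.shape, "-", X.nnz, "nonzero entries")
```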
26. Sparsity II: EigenFaces
• Here are the first few components of PCA applied to a collection of face images
• A small number of these explain a huge part of a huge number of faces
• First components are like stop words; the last few (sparse) components make recognition easy
27. Sparsity III: The Fourier Transform
• Very complex waveform
• Turns out to be easily expressible as a combination of a few (i.e., sparse) constant frequency signals
• Such representations make accurate speech recognition possible
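A sketch with NumPy's FFT: a waveform built from two sinusoids looks complex in the time domain but has just two peaks in the frequency domain:

```python
import numpy as np

# 1000 samples over one second; with this setup, rfft bin k is k Hz.
t = np.linspace(0, 1, 1000, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

spectrum = np.abs(np.fft.rfft(signal))
print(np.argsort(spectrum)[-2:])  # the two dominant bins: 40 and 5 Hz
```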
28. Sparse Coding
• Iterate:
  • Choose a basis
  • Evaluate that basis based on how well you can use it to reconstruct the input, and how sparse it is
  • Take some sort of gradient step to improve that evaluation
• Andrew Ng's efficient sparse coding algorithms and Hinton's deep autoencoders are both flavors of this
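Scikit-learn's dictionary learning is one concrete flavor of this iterate-and-evaluate loop (my choice of implementation, not the deck's exact algorithms):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# Learn a basis (dictionary) whose codes reconstruct X while staying sparse;
# alpha trades off reconstruction error against sparsity.
dico = MiniBatchDictionaryLearning(n_components=30, alpha=1.0, random_state=0)
codes = dico.fit_transform(X)
print(codes.shape, "fraction nonzero:", (codes != 0).mean())
```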
29. The New Basis
• Text: Topics
• Audio: Frequency transform
• Visual: Pen strokes
30. Another Hack: Totally Random Trees
• Train a bunch of decision trees
  • With no objective!
• Each leaf is a feature
• Ta-da! Sparse basis
• This actually works
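This hack exists off the shelf as scikit-learn's RandomTreesEmbedding; a sketch on random data:

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Trees are grown with no objective; each leaf becomes a binary feature.
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_sparse = embedder.fit_transform(X)  # sparse one-hot leaf indicators
print(X_sparse.shape)
```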
31. And More and More
• There are a ton of variations on these themes
  • Dimensionality Reduction
  • Metric Learning
  • "Coding" or "Encoding"
• Nice canonical implementations can be found at: http://lvdmaaten.github.io/drtoolbox/