2. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
3. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
4. Should I Drive?
• Building a predictive model to
recommend driving (or not)
• Have data from the beginning
and end of the trip, and
whether or not there are paved
roads between the two points
• Tracked human-made
decisions for several hundred
trips
5. A Simple Function
• Create a predictive model to
emulate a simple Boolean
function
• Features are Boolean
variables
• Objective is the inputs XORed
together (true if the number of
ones is odd and false
otherwise)
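A minimal sketch (not from the original deck) of what such a dataset looks like, in Python; it makes the trouble concrete: no single column predicts the parity objective, and flipping any one bit flips the class.

import itertools

# Hypothetical parity (XOR) dataset over four Boolean features.
rows = []
for bits in itertools.product([0, 1], repeat=4):
    objective = sum(bits) % 2 == 1   # true iff the number of ones is odd
    rows.append((bits, objective))

for bits, objective in rows:
    print(bits, objective)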
6. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
7. What?
• So we should just give up,
yes?
• All the model knows about are
the features we provide for it
• In the cases above the
features are broken
8. Broken?
• The features we’ve provided aren’t useful for prediction
Latitude and longitude don’t correlate especially well with drivability
Any single feature in the Xor problem doesn’t predict the outcome
(and, in fact, changing any one feature changes the class)
In both cases, the same feature value has different semantics in the
presence of other features
• Machine learning algorithms, in general, rely on some statistically
detectable relationship between the features and the class
• The nature of the relationship is particular to the algorithm
9. How to Confuse a Machine Learning Algorithm
• Remember that machine learning algorithms are searching for a
classifier in a particular hypothesis space
• Decision Trees
Thresholds on individual features
Are you able to set a meaningful threshold on any of your input
features?
• Logistic Regression
Weighted combinations of features
Can a good model be made on a weighted average of your input
features?
10. Feature Engineering
• Feature engineering: The
process of transforming raw
input data into
machine-learning-ready data
• Alternatively: Using your
existing features and some
math to make new features
that models will “like”
• Not covered here, but
important: Going out and
getting better information
• “Applied machine learning” is
domain-specific feature
engineering and evaluation!
11. Some Good Times to Engineer Features
• When the relationship between the feature and the objective is
mathematically unsatisfying
• When the relationship of a function of two or more features is far
more relevant than the original features
• When there is missing data
• When the data is time-series, especially when the previous time
period’s objective is known
• When the data can’t be used for machine learning in the obvious
way (e.g., timestamps, text data)
Rule of thumb: Every bit of work you do in feature engineering is a bit
that the model doesn’t have to figure out.
12. Aside: Flatline
• Feature engineering is the most important topic in real-world
machine learning
• BigML has its own domain specific language for it, Flatline, and
we’ll use it for our examples here
• Two things to note right off the bat:
Flatline uses lisp-like “prefix notation”
You get the value for a given feature using a function f
So, to create a new column in your dataset with the sum of
feature1 and feature2:
(+ (f "feature1") (f "feature2"))
13. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
14. Statistical Aggregations
• Many times you have a bunch of features that all “mean” the same
thing
Pixels in the wavelet transform of an image
Did or did not make a purchase for day n − 1 to day n − 30
• Easiest thing is sum, average, or count (especially with sparse
data)
• The all and all-but field selectors are helpful here:
(/ (+ (all-but "PurchasesM-0")) (count (all-but "PurchasesM-0")))
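For comparison, a rough equivalent outside of Flatline, in Python with pandas; the PurchasesM-1 ... PurchasesM-30 column names are invented for the example, and the current month (the objective) is deliberately left out of the aggregate.

import pandas as pd

# Hypothetical sparse purchase-history columns, one per past month.
df = pd.DataFrame({f"PurchasesM-{i}": [0, 1, 0] for i in range(1, 31)})

purchase_cols = [c for c in df.columns if c.startswith("PurchasesM-")]
df["purchases_sum"] = df[purchase_cols].sum(axis=1)    # total over the window
df["purchases_mean"] = df[purchase_cols].mean(axis=1)  # average over the window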
15. Better Categories
• Some categories are helped by collapsing them
Categories with elaborate hierarchies (occupation)
“Numeric” categories with too many levels (good, very good,
amazingly good)
Any category with a natural grouping (country, US state)
• Group categories with cond:
(cond (= "GRUNT" (f "job")) "Worker"
(> (occurrences (f "job") "Chief") 0) "Fancy Person"
"Everyone Else")
• Consider converting them to a numeric if they are ordinal
(cond (= (f "test1") "poor") 0
(= (f "test1") "fair") 1
(= (f "test1") "good") 2
(= (f "test1") "excellent") 3)
16. Binning or Discretization
• You can also turn a numeric variable into a categorical one
• Main idea is that you make bins and put the values into each one
(e.g., low, middle, high)
• A good way to tell the model that small differences are probably
noise (say, in body temperature)
• There are a whole bunch of ways to do it (in the interface):
Quartiles
Deciles
Any generic percentiles
• Note: This includes the objective itself!
Turns a regression problem into a classification problem
Might turn a hard problem into an easy one
Might be more what you actually care about
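A minimal sketch of percentile-based binning in Python with pandas (quartiles here); the column name and labels are only illustrative.

import pandas as pd

df = pd.DataFrame({"temperature": [97.1, 98.2, 98.6, 99.1, 100.4, 101.2]})

# Quartile binning: four bins with roughly equal numbers of points each.
df["temperature_bin"] = pd.qcut(df["temperature"], q=4,
                                labels=["low", "mid-low", "mid-high", "high"])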
17. Linearization
• Not important for decision trees (transformations that preserve
ordering have no effect)
• Can be important for logistic regression and clustering
• Common and simple cases are exp and log
(log (f "test"))
• Use cases
Monetary amounts (salaries, profits)
Medical tests
Various engagement metrics (e.g., website activity)
In general, hockey stick distributions
18. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
19. Missing Data: Why?
• Occasionally, a feature value might be missing from one or more
instances in your data
• This could be for a number of diverse reasons:
Random noise (corruption, formatting)
Feature is not computable (mathematical reasons)
Collection errors (network errors, systems down)
Missing for a reason (test not performed, value doesn’t apply)
• Key question: Does the missing value have semantics?
20. How Machine Learning Algorithms Treat Missing Data
• The standard treatment of
missing values by decision
trees is that they’re just due to
random noise (unless you
choose the BigML “or missing”
splits. Plug: This isn’t available
in a lot of other packages.)
They’re essentially “ignored”
during tree construction (bit
of an oversimplification)
This means that features
that have missing values are
less likely to be chosen for a
split than those that aren’t
• Can we “repair” the features?
21. Missing Value Replacement
• Simplest thing to do is just replace the missing value with a
common thing
Mean or median - For symmetric distributions
Mode - Especially for features where the “default” value is incredibly
common (e.g., word counts, medical tests)
• Such a common operation that it’s built into the interface
• Also available in Flatline:
(if (missing? "test1") (mean "test1") (f "test1"))
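The same replacement, sketched in Python with pandas for comparison (the column name is assumed):

import pandas as pd

df = pd.DataFrame({"test1": [12.0, None, 15.0, None, 11.0]})

# Replace missing values with the column mean (use the median or mode where appropriate).
df["test1"] = df["test1"].fillna(df["test1"].mean())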
22. Missing Value Induction
• But we can do better than the mean, can’t we? (Spoiler: Yes)
• If only we had an algorithm that could predict a value given the
other feature values for the same instance HEY WAIT A MINUTE
• Train a model to predict your missing values
Training set is all points where the value is non-missing
Predict for the points where that value is missing
Remember not to use your objective as part of the missing value
predictor
• Some good news: You probably don’t need to know or care exactly
how well this imputation model performs!
• If you’re modeling with a technique that’s robust to missing values,
you can model every column without getting into a “cycle”
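A minimal sketch of the idea in Python with scikit-learn; the feature names are made up, and a real pipeline would also need to handle categorical inputs and missing values in the predictor columns.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# "test1" has missing values; "age" and "bmi" are the other (non-objective) predictors.
df = pd.DataFrame({
    "age":   [34, 51, 29, 44, 61, 38],
    "bmi":   [22.1, 30.4, 25.0, 27.8, 31.2, 24.3],
    "test1": [1.2, None, 0.9, None, 2.4, 1.1],
})

predictors = ["age", "bmi"]          # do NOT include the real objective here
known = df[df["test1"].notna()]      # rows where the value is present: training set
unknown = df[df["test1"].isna()]     # rows to impute

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[predictors], known["test1"])
df.loc[df["test1"].isna(), "test1"] = model.predict(unknown[predictors])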
23. Constructing Features From Missing Data
• Maybe a more interesting thing to do is to use missing data as a
feature
• Does your missing value have semantics?
Feature is incomputable?
Presence/absence is more telling than actual value
• Easy to make a binary feature
• You could also make a quasi-numeric feature with a multilevel
categorical using Flatline’s cond operator:
(cond (missing? "test1") "not performed"
(< (f "test1") 10) "low value"
(< (f "test1") 20) "medium value"
"high value")
24. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
25. Time-series Prediction
• Occasionally a time series
prediction problem comes at
you as a 1-d prediction
problem
• The objective: Predict the
value of the sequence given
history.
• But . . . there aren’t any
features!
26. Some Very Simple Time-series Data
• Closing Prices for the S&P 500
• A useless objective!
• No features!
• What are we going to do?
(Spoiler: Either drink away our
sorrows or FEATURE
ENGINEERING)
Price (closing values): 2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93
27. A Better Objective: Percent Change
• Going to be really difficult to predict actual closing price. Why?
Price gets larger over long time periods
If we train on historical data, the future price will be out of range
• Predicting the percent change from the previous close is a more
stationary and more relevant objective
• In Flatline, we can get a previous row’s field value by passing an
offset to the f function:
(/ (- (f "price") (f "price" -1)) (f "price" -1))
28. Features: Delta from Previous Day (Week, Month, . . .)
• Percent change over the last n days
• Remember these are features, so don’t include the objective day’s
value - you won’t know it at prediction time!
(/ (- (f "price" -1) (f "price" -10)) (f "price" -10))
• Note that this could be anything, and exactly what it should be is
domain-specific
29. Features: Above/Below Moving Average
• The avg-window function makes it easy to compute a moving
average:
(avg-window "price" -50 -1)
• How far are we off the moving average?
(let (ma50 (avg-window "price" -50 -1))
(/ (- (f "price" -1) ma50) ma50))
30. Features: Recent Volatility
• Let’s do the standard deviation of a window:
(let (win-mean (avg-window "price" -10 -1))
(map (square (- _ win-mean)) (window "price" -10 -1)))
• With that, it’s easy to get the standard deviation:
(let (win-mean (avg-window "price" -10 -1)
sq-errs (map (square (- _ win-mean)) (window "price" -10 -1)))
(sqrt (/ (+ sq-errs) (- (count sq-errs) 1))))
• This is a reasonably nice measure of volatility of the objective over
the last n time periods.
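The same family of series features, sketched in Python with pandas; the 5-day windows are arbitrary, and everything is shifted by one row so the current (objective) day is never included.

import pandas as pd

prices = pd.Series([2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93])
past = prices.shift(1)              # everything below sees only previous days
ma5 = past.rolling(5).mean()        # 5-day moving average of past prices

features = pd.DataFrame({
    "ret_5d": (past - prices.shift(5)) / prices.shift(5),  # change from 5 days ago to yesterday
    "ma5_gap": (past - ma5) / ma5,                         # how far yesterday sits off the moving average
    "vol_5d": past.rolling(5).std(),                       # sample std dev of the last 5 past prices
})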
31. Go Nuts!
• Not hard to imagine all sorts of interesting features:
Moving average crosses
Breaking out of a range
All with different time parameters
• One of the difficulties of feature engineering is dealing with this
exponential explosion
• Makes it spectacularly easy to keep wasting effort (or losing
money)
32. Some Caveats
• The regularity in time of the
points has to match your
training data
• You have to keep track of past
points to compute your
windows
• Really easy to get information
leakage by including your
objective in a window
computation (and can be very
hard to detect)!
• Did I mention how awful
information leakage can be
here?
• WHAT ABOUT
INFORMATION LEAKAGE
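A concrete example of the leak, again in pandas: a rolling window over the raw series silently includes the current row’s value, while shifting first does not.

import pandas as pd

prices = pd.Series([2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93])

leaky_ma3 = prices.rolling(3).mean()           # includes today's price: leakage!
safe_ma3 = prices.shift(1).rolling(3).mean()   # only uses strictly earlier rows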
33. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
34. A Nearly Useless Datatype
• There’s no easy way to include
timestamps in our models
(really just a formatted text
field)
• What about epoch time?
Usually not what we want.
Weather forecasting?
Activity prediction?
Energy usage?
• A datetime is really a
collection of features
35. An Opportunity for Automatic Feature Engineering
• Timestamps are usually found in a fairly small (okay, massive)
number of standard formats
• Once parsed into epoch time, we can automatically extract a
bunch of features:
Date features - month, day, year, day of week
Time features - hour, minute, second, millisecond
• We do this “for free” in BigML
• You can also specify a specific format in Flatline
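A sketch of the same expansion in Python with pandas (the timestamp column is illustrative):

import pandas as pd

df = pd.DataFrame({"timestamp": ["2016-09-08 14:35:02", "2016-12-24 08:05:41"]})
ts = pd.to_datetime(df["timestamp"])

df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["day_of_week"] = ts.dt.dayofweek   # 0 = Monday
df["hour"] = ts.dt.hour
df["minute"] = ts.dt.minute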
36. Useful Features May Be Buried Even More Deeply
• Important to remember that the computer doesn’t have the
information about time that you do
• Example - Working Hours
Need to know if it’s between, say, 9:00 and 17:00
Also need to know if it’s Saturday or Sunday
(let (hour (f "SomeDay.hour")
day (f "SomeDay.day-of-week"))
(and (<= 9 hour 18) (< day 6)))
• Example - Is Daylight
Need to know hour of day
Also need to know day of year
37. Go Nuts!
• Date Features
endOfMonth? - Important feature for lots of clerical work
nationalHoliday? - What it says on the box
duringWorldCup? - Certain behaviors (e.g., TV watching) might be
different during this time
• Time Features
isRushHour? - Between 7 and 9am on a weekday
mightBeSleeping? - From midnight to 6am
mightBeDrinking? - Weekend evenings (or Wednesday at 1am, if
that’s your thing)
• There are a ton of these things that are spectacularly domain
dependent (think contract rolls in futures trading)
38. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
39. Bag of Words
• The standard way that BigML processes text is to create one
feature for each word that appears in the text field of any instance.
• This is the so-called “bag of words” approach
• Called this because all notion of sequence goes away after
processing
• In this case, any notion of correlation also disappears, as each
word becomes an independent feature.
40. Tokenization
• Tokenization seems trivial
Except if you have numbers or special characters in the tokens
What about hyphens? Apostrophes?
• Do we want to do n-grams?
• Keep only tokens that occur a certain number of times (not too rare,
not too frequent)
• Note that this is more difficult with languages that don’t have clear
word boundaries
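A minimal bag-of-words sketch in Python with scikit-learn that also shows the tokenization knobs above (n-grams and frequency cutoffs); the documents and thresholds are arbitrary.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the model plays well with text",
    "text features need tokenization",
    "tokenization of text is harder than it looks",
]

vectorizer = CountVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=1,             # minimum document frequency; raise on real data to drop rare tokens
    max_df=0.9,           # drop tokens that appear in almost every document
)
X = vectorizer.fit_transform(docs)   # one sparse column per token / n-gram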
41. Word Stemming
• Now we have a list of tokens
• But sometimes we get “forms” of the same term
playing, plays, play
confirm, confirmed, confirmation
• We can use “stemming” to map these different forms back to the
same root
• Most western languages have a reasonable set of rules
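A quick illustration with NLTK’s Snowball stemmer (assumes the nltk package is installed):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in ["playing", "plays", "play"]])   # all map to "play"
print([stemmer.stem(w) for w in ["confirm", "confirmed", "confirmation"]])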
42. Other Textual Features
• Example: Length of text
(length (f "textFeature"))
• Contains certain strings
• Dollar amounts? Dates? Salutations? Please and Thank you?
• Flatline has full regular expression capabilities
43. Latent Dirichlet Allocation
• Learn word distributions for topics
• Infer topic scores for each document
• Use the topic scores as features to a model
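A rough sketch in Python with scikit-learn (the documents and topic count are arbitrary); the per-document topic proportions become the new features.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the striker scored a late goal",
    "the central bank raised interest rates",
    "the goalkeeper saved the penalty",
    "markets fell after the rate decision",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_scores = lda.fit_transform(counts)   # shape (n_docs, n_topics): use as model features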
44. Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
45. Feature Construction as Projection
• Feature construction means increasing the space in which
learning happens
• Another set of techniques typically replaces the feature space
• Often these techniques are called dimensionality reduction, and
the models that are learned are a new basis for that data.
• Why would you do this?
New, possibly unrelated hypothesis space
Speed
Better visualization
46. Principal Component Analysis
• Find the axis that preserves the maximum amount of variance
from the data
• Find the axis, orthogonal to the first, that preserves the next
largest amount of variance, and so on
• In spite of this description, this isn’t an iterative algorithm (it can be
solved with a matrix decomposition)
• Projecting the data into the new space is accomplished with a
matrix multiplication
• Resulting features are a linear combination of the old features
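A minimal sketch in Python with scikit-learn; the data is a stand-in, and standardizing first is usually a good idea so no single feature dominates the variance.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).normal(size=(100, 10))   # stand-in for real features

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_projected = pca.fit_transform(X_scaled)   # 3 new features, linear combos of the old ones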
47. Distance to Cluster Centroids
• Do a k-Means clustering
• Compute the distances from each point to each cluster
centroid
• Ta-da! k new features
• Lots of variations on this theme:
Normalized / Unnormalized, and by what?
Average the class distributions of the resulting clusters
Take the number of points / spread of each cluster into account
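A sketch of the basic version in Python with scikit-learn, where transform returns the distance from each point to each centroid; the data and cluster count are arbitrary.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).normal(size=(200, 6))     # stand-in for real features

X_scaled = StandardScaler().fit_transform(X)           # normalization is one of the knobs above
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
kmeans.fit(X_scaled)
distances = kmeans.transform(X_scaled)                 # shape (n_points, 5): 5 new features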
48. Stacked Generalization: Classifiers as Features
• Idea: Use model scores as input to a “meta-classifier”
• Algorithm:
Split the training data into “base” and “meta” subsets
Learn several different classifiers on the “base” subset
Compute predictions for the “meta” subset with the “base”
classifiers
Use the scores on the “meta” subset as features for a classifier
learned on that subset
• Meta-classifier learns when the predictions of each of the “base”
classifiers are to be preferred.
• Pop quiz: Why do we need to split the data?
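A minimal sketch of the algorithm above in Python with scikit-learn (base learners, sizes, and data are arbitrary). The base/meta split also answers the pop quiz: without it, the meta-features would be predictions on the base models’ own training data and would look unrealistically good.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)      # stand-in objective

# Split the training data into "base" and "meta" subsets.
X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5, random_state=0)

# Learn several different classifiers on the "base" subset.
base_models = [DecisionTreeClassifier(random_state=0).fit(X_base, y_base),
               RandomForestClassifier(n_estimators=50, random_state=0).fit(X_base, y_base)]

# Their scores on the "meta" subset become features for the meta-classifier.
meta_features = np.column_stack([m.predict_proba(X_meta)[:, 1] for m in base_models])
meta_clf = LogisticRegression().fit(meta_features, y_meta)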
49. Caveats (Again)
• There’s obviously a lot of representational power here
• Surprising sometimes how little it helps (remember, the upstream
algorithms are already trying to do this)
• Really easy to get bogged down in details
• Some thoughts that might help you avoid that:
Something really good is likely to be good with little effort
Think of a decaying exponential where the vertical axis is “chance
of dramatic improvement” and the horizontal one is “amount of
tweaking”
Have faith in both your own intelligence and creativity
If you miss something really important, you’ll probably come back to
it
• Use Flatline as inspiration; it was designed with data scientists in mind