# VSSML16 L6. Feature Engineering

VSSML16 L6. Feature Engineering
Valencian Summer School in Machine Learning 2016
Day 2 VSSML16
Lecture 6
Feature Engineering
Charles Parker (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016

Published in: Data & Analytics
### VSSML16 L6. Feature Engineering

1. Feature Engineering (#VSSML16, September 2016)
2. Outline
1. Some Unfortunate Examples
2. Feature Engineering
3. Mathematical Transformations
4. Missing Values
5. Series Features
6. Datetime Features
7. Text Features
8. Advanced Topics
4. Should I Drive?
• Building a predictive model to recommend driving (or not)
• Have data for the beginning and end points of each trip, and whether or not there are paved roads between the two points
• Tracked human-made decisions for several hundred trips
5. A Simple Function
• Create a predictive model to emulate a simple Boolean function
• Features are Boolean variables
• Objective is the inputs XORed together (true if the number of ones is odd, false otherwise)
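To see why XOR is so unkind to feature-by-feature learners, here is a small Python check (illustrative, not from the slides): conditioned on any single input, the labels split exactly 50/50, so no individual feature carries any signal on its own.

```python
from itertools import product

# Truth table for 3-input XOR: label is true when the number of ones is odd.
rows = [(bits, sum(bits) % 2 == 1) for bits in product([0, 1], repeat=3)]

# Condition on any single feature: the labels always split 50/50,
# so no threshold on one feature can do better than chance.
for i in range(3):
    for v in (0, 1):
        labels = [label for bits, label in rows if bits[i] == v]
        assert labels.count(True) == labels.count(False)
```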
7. What?
• So we should just give up, yes?
• All the model knows about are the features we provide for it
• In the cases above, the features are broken
8. Broken?
• The features we’ve provided aren’t useful for prediction
  - Latitude and longitude don’t correlate especially well with drivability
  - No single feature in the XOR problem predicts the outcome (in fact, changing any one feature changes the class)
  - In both cases, the same feature value has different semantics in the presence of other features
• Machine learning algorithms, in general, rely on some statistically detectable relationship between the features and the class
• The nature of the relationship is particular to the algorithm
9. How to Confuse a Machine Learning Algorithm
• Remember that machine learning algorithms search for a classifier in a particular hypothesis space
• Decision Trees: thresholds on individual features
  - Can you set a meaningful threshold on any of your input features?
• Logistic Regression: weighted combinations of features
  - Can a good model be made from a weighted average of your input features?
10. Feature Engineering
• Feature engineering: the process of transforming raw input data into machine-learning-ready data
• Alternatively: using your existing features and some math to make new features that models will “like”
• Not covered here, but important: going out and getting better information
• “Applied machine learning” is domain-specific feature engineering and evaluation!
11. Some Good Times to Engineer Features
• When the relationship between a feature and the objective is mathematically unsatisfying
• When a function of two or more features is far more relevant than the original features
• When there is missing data
• When the data is a time series, especially when the previous time period’s objective is known
• When the data can’t be used for machine learning in the obvious way (e.g., timestamps, text data)
Rule of thumb: every bit of work you do in feature engineering is a bit that the model doesn’t have to figure out.
12. Aside: Flatline
• Feature engineering is the most important topic in real-world machine learning
• BigML has its own domain-specific language for it, Flatline, and we’ll use it for our examples here
• Two things to note right off the bat:
  - Flatline uses Lisp-like “prefix notation”
  - You get the value of a given feature using a function f
• So, to create a new column in your dataset with the sum of feature1 and feature2:

    (+ (f "feature1") (f "feature2"))
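For readers without Flatline at hand, the same derived column in plain Python; representing a row as a dict is an assumption of this sketch, not part of the slides.

```python
# One dataset row as a dict; (f "feature1") corresponds to row["feature1"].
row = {"feature1": 3.0, "feature2": 4.5}

# (+ (f "feature1") (f "feature2"))
new_column = row["feature1"] + row["feature2"]
```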
14. Statistical Aggregations
• Many times you have a bunch of features that all “mean” the same thing
  - Pixels in the wavelet transform of an image
  - Did or did not make a purchase for day n − 1 to day n − 30
• Easiest thing is a sum, average, or count (especially with sparse data)
• The all and all-but field selectors are helpful here:

    (/ (+ (all-but "PurchasesM-0"))
       (count (all-but "PurchasesM-0")))
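A rough Python equivalent of the aggregation above, averaging the historical purchase columns while excluding the current month; the column names follow the slide, and the row values are made up.

```python
row = {"PurchasesM-0": 5, "PurchasesM-1": 2, "PurchasesM-2": 0, "PurchasesM-3": 4}

# all-but "PurchasesM-0": every purchase column except the current month.
history = [v for k, v in row.items() if k != "PurchasesM-0"]

# Sum divided by count, i.e. the mean of the historical columns.
avg_past_purchases = sum(history) / len(history)
```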
15. Better Categories
• Some categories are helped by collapsing them
  - Categories with elaborate hierarchies (occupation)
  - “Numeric” categories with too many levels (good, very good, amazingly good)
  - Any category with a natural grouping (country, US state)
• Group categories with cond:

    (cond (= "GRUNT" (f "job")) "Worker"
          (> (occurrences (f "job") "Chief") 0) "Fancy Person"
          "Everyone Else")

• Consider converting them to numerics if they are ordinal:

    (cond (= (f "test1") "poor") 0
          (= (f "test1") "fair") 1
          (= (f "test1") "good") 2
          (= (f "test1") "excellent") 3)
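The two cond expressions map naturally onto Python conditionals and a lookup table; a sketch using the slide’s category names.

```python
def job_group(job):
    # Mirrors the first cond: exact match, substring test, default branch.
    if job == "GRUNT":
        return "Worker"
    if "Chief" in job:
        return "Fancy Person"
    return "Everyone Else"

# Ordinal categories become plain integers via a lookup table.
ORDINAL = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}
```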
16. Binning or Discretization
• You can also turn a numeric variable into a categorical one
• Main idea: make bins and put the values into them (e.g., low, middle, high)
• Good for giving the model information about potential noise (say, body temperature)
• There are a whole bunch of ways to do it (in the interface):
  - Quartiles
  - Deciles
  - Any generic percentiles
• Note: this includes the objective itself!
  - Turns a regression problem into a classification problem
  - Might turn a hard problem into an easy one
  - Might be more what you actually care about
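A minimal Python sketch of quartile binning using the standard library; the temperature values are invented for the example.

```python
import statistics

temps = [36.1, 36.5, 36.8, 37.0, 37.2, 38.5, 39.9]
q1, q2, q3 = statistics.quantiles(temps, n=4)  # quartile cut points

def temp_bin(x):
    # Collapse a noisy numeric reading into three coarse levels.
    if x < q1:
        return "low"
    if x < q3:
        return "middle"
    return "high"
```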
17. Linearization
• Not important for decision trees (transformations that preserve ordering have no effect)
• Can be important for logistic regression and clustering
• Common and simple cases are exp and log:

    (log (f "test"))

• Use cases:
  - Monetary amounts (salaries, profits)
  - Medical tests
  - Various engagement metrics (e.g., website activity)
  - In general, hockey-stick distributions
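In Python the log transform is one line; the point is how it compresses a hockey-stick-shaped range. The salary numbers are invented.

```python
import math

salaries = [30_000, 45_000, 120_000, 2_000_000]
log_salaries = [math.log(s) for s in salaries]

# The raw range spans a factor of ~67; after the log transform the spread
# is small and roughly additive, which linear models handle far better.
raw_ratio = max(salaries) / min(salaries)
log_spread = max(log_salaries) - min(log_salaries)
```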
19. Missing Data: Why?
• Occasionally, a feature value might be missing from one or more instances in your data
• This could be for a number of diverse reasons:
  - Random noise (corruption, formatting)
  - Feature is not computable (mathematical reasons)
  - Collection errors (network errors, systems down)
  - Missing for a reason (test not performed, value doesn’t apply)
• Key question: does the missing value have semantics?
20. How Machine Learning Algorithms Treat Missing Data
• The standard treatment of missing values by decision trees is that they’re just due to random noise (unless you choose the BigML “or missing” splits; plug: this isn’t available in a lot of other packages)
  - They’re essentially “ignored” during tree construction (a bit of an oversimplification)
  - This means that features with missing values are less likely to be chosen for a split than those without
• Can we “repair” the features?
21. Missing Value Replacement
• Simplest thing to do is just replace the missing value with a common thing
  - Mean or median, for symmetric distributions
  - Mode, especially for features where the “default” value is incredibly common (e.g., word counts, medical tests)
• Such a common operation that it’s built into the interface
• Also available in Flatline:

    (if (missing? "test1") (mean "test1") (f "test1"))
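Mean replacement in plain Python, with None standing in for a missing value; a sketch of what the interface does for you.

```python
values = [12.0, None, 15.0, None, 9.0]

# Mean of the observed values only...
observed = [v for v in values if v is not None]
col_mean = sum(observed) / len(observed)

# ...substituted wherever the value is missing.
filled = [col_mean if v is None else v for v in values]
```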
22. Missing Value Induction
• But we can do better than the mean, can’t we? (Spoiler: yes)
• If only we had an algorithm that could predict a value given the other feature values for the same instance . . . HEY WAIT A MINUTE
• Train a model to predict your missing values
  - Training set is all points where the value is non-missing
  - Predict for the points where the value is missing
  - Remember not to use your objective as part of the missing-value predictor
• Some good news: you probably don’t know or care what your performance is!
• If you’re modeling with a technique that’s robust to missing values, you can model every column without getting into a “cycle”
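A toy sketch of the idea in Python, with a 1-nearest-neighbour “model” standing in for whatever learner you would actually train; the data is invented.

```python
rows = [
    {"age": 25, "test1": 3.1},
    {"age": 40, "test1": 5.0},
    {"age": 62, "test1": 7.9},
    {"age": 41, "test1": None},  # the value we want to induce
]

# Training set: rows where test1 is present.
train = [r for r in rows if r["test1"] is not None]

# Predict for rows where test1 is missing (here, nearest neighbour on age).
for r in rows:
    if r["test1"] is None:
        nearest = min(train, key=lambda t: abs(t["age"] - r["age"]))
        r["test1"] = nearest["test1"]
```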
23. Constructing Features From Missing Data
• Maybe a more interesting thing to do is to use missing data as a feature
• Does your missing value have semantics?
  - Feature is incomputable?
  - Presence/absence is more telling than the actual value
• Easy to make a binary feature
• You could also make a quasi-numeric, multilevel categorical feature with Flatline’s cond operator:

    (cond (missing? "test1") "not performed"
          (< (f "test1") 10) "low value"
          (< (f "test1") 20) "medium value"
          "high value")
25. Time-series Prediction
• Occasionally a time-series prediction problem comes at you as a 1-d prediction problem
• The objective: predict the next value of the sequence given its history
• But . . . there aren’t any features!
26. Some Very Simple Time-series Data
• Closing prices for the S&P 500:

    Price: 2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93

• A useless objective!
• No features!
• What are we going to do? (Spoiler: either drink away our sorrows or FEATURE ENGINEERING)
27. A Better Objective: Percent Change
• It’s going to be really difficult to predict the actual closing price. Why?
  - Price gets larger over long time periods
  - If we train on historical data, the future price will be out of range
• Predicting the percent change from the previous close is a more stationary and more relevant objective
• In Flatline, we can get a previous field value by passing an index to the f function:

    (/ (- (f "price") (f "price" -1)) (f "price" -1))
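The same percent-change objective in Python, using the closing prices from the previous slide:

```python
prices = [2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93]

# (/ (- (f "price") (f "price" -1)) (f "price" -1)) for each day after the first.
pct_change = [(prices[i] - prices[i - 1]) / prices[i - 1]
              for i in range(1, len(prices))]
```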
28. Features: Delta from Previous Day (Week, Month, . . .)
• Percent change over the last n days
• Remember these are features, so don’t include the objective day; you won’t know it!

    (/ (- (f "price" -1) (f "price" -10)) (f "price" -10))

• Note that this could be anything, and exactly what it should be is domain-specific
29. Features: Above/Below Moving Average
• The avg-window function makes it easy to compute a moving average:

    (avg-window "price" -50 -1)

• How far are we off the moving average?

    (let (ma50 (avg-window "price" -50 -1))
      (/ (- (f "price" -1) ma50) ma50))
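The moving-average offset in Python, with a short toy price series and a 5-period window standing in for the 50-period one:

```python
prices = [100.0, 102.0, 101.0, 104.0, 108.0, 110.0]

# Average of the five closes before the most recent one,
# in the spirit of (avg-window "price" -5 -1).
window = prices[-6:-1]
ma = sum(window) / len(window)

# Fractional distance of the previous close from its moving average.
off_ma = (prices[-2] - ma) / ma
```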
30. Features: Recent Volatility
• Let’s compute the standard deviation of a window. First, the squared deviations from the window mean:

    (let (win-mean (avg-window "price" -10 -1))
      (map (square (- _ win-mean)) (window "price" -10 -1)))

• With that, it’s easy to get the standard deviation:

    (let (win-mean (avg-window "price" -10 -1)
          sq-errs (map (square (- _ win-mean)) (window "price" -10 -1)))
      (sqrt (/ (+ sq-errs) (- (count sq-errs) 1))))

• This is a reasonably nice measure of the volatility of the objective over the last n time periods.
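The same windowed standard deviation written out in Python and checked against the standard library’s statistics.stdev; the price series is invented.

```python
import statistics

prices = [100.0, 102.0, 101.0, 104.0, 108.0, 110.0, 107.0, 111.0, 115.0, 113.0]
window = prices[-10:]

# Squared deviations from the window mean, then the sample standard deviation.
win_mean = sum(window) / len(window)
sq_errs = [(p - win_mean) ** 2 for p in window]
volatility = (sum(sq_errs) / (len(sq_errs) - 1)) ** 0.5

# Sanity check against the library implementation.
assert abs(volatility - statistics.stdev(window)) < 1e-12
```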
31. Go Nuts!
• Not hard to imagine all sorts of interesting features:
  - Moving-average crosses
  - Breaking out of a range
  - All with different time parameters
• One of the difficulties of feature engineering is dealing with this exponential explosion
• It makes it spectacularly easy to keep wasting effort (or losing money)
32. Some Caveats
• The regularity in time of the points has to match your training data
• You have to keep track of past points to compute your windows
• It’s really easy to get information leakage by including your objective in a window computation (and it can be very hard to detect)!
• Did I mention how awful information leakage can be here?
• WHAT ABOUT INFORMATION LEAKAGE
34. A Nearly Useless Datatype
• There’s no easy way to include timestamps in our models (a timestamp is really just a formatted text field)
• What about epoch time? Usually not what we want. Weather forecasting? Activity prediction? Energy usage?
• A datetime is really a collection of features
35. An Opportunity for Automatic Feature Engineering
• Timestamps are usually found in a fairly small (okay, massive) number of standard formats
• Once parsed into epoch time, we can automatically extract a bunch of features:
  - Date features: month, day, year, day of week
  - Time features: hour, minute, second, millisecond
• We do this “for free” in BigML
• You can also specify a specific format in Flatline
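The standard-library datetime module performs the same fan-out; the format string and timestamp here are invented for the example.

```python
from datetime import datetime

ts = datetime.strptime("2016-09-08 14:30:05", "%Y-%m-%d %H:%M:%S")

# One timestamp fans out into many model-ready features.
features = {
    "year": ts.year, "month": ts.month, "day": ts.day,
    "day_of_week": ts.isoweekday(),  # Monday = 1 ... Sunday = 7
    "hour": ts.hour, "minute": ts.minute, "second": ts.second,
}
```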
36. Useful Features May Be Buried Even More Deeply
• It’s important to remember that the computer doesn’t have the information about time that you do
• Example: working hours
  - Need to know if it’s between, say, 9:00 and 18:00
  - Also need to know if it’s Saturday or Sunday

    (let (hour (f "SomeDay.hour")
          day (f "SomeDay.day-of-week"))
      (and (<= 9 hour 18) (< day 6)))

• Example: is it daylight?
  - Need to know the hour of the day
  - Also need to know the day of the year
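The working-hours feature in Python; note that Python’s weekday() returns Monday = 0 through Sunday = 6, unlike the slide’s 1-7 day-of-week convention.

```python
from datetime import datetime

def is_working_hours(ts):
    # Weekday (Mon=0 ... Sun=6) and roughly between 9:00 and 18:00.
    return 9 <= ts.hour < 18 and ts.weekday() < 5
```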
37. Go Nuts!
• Date features:
  - endOfMonth? An important feature of lots of clerical work
  - nationalHoliday? What it says on the box
  - duringWorldCup? Certain behaviors (e.g., TV watching) might be different during this time
• Time features:
  - isRushHour? Between 7 and 9 am on a weekday
  - mightBeSleeping? From midnight to 6 am
  - mightBeDrinking? Weekend evenings (or Wednesday at 1 am, if that’s your thing)
• There are a ton of these things, and they are spectacularly domain-dependent (think contract rolls in futures trading)
39. Bag of Words
• The standard way BigML processes text is to create one feature for each word found in any instance of the text field
• This is the so-called “bag of words” approach
• Called this because all notion of sequence goes away after processing
• Any notion of correlation between words also disappears, since the resulting features are treated as independent
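A minimal bag-of-words construction in Python, with a deliberately crude tokenizer; once counted, the word order of each document is gone.

```python
import re
from collections import Counter

docs = ["The cat plays.", "Cats play and play!"]

def tokenize(text):
    # Crude tokenizer: lowercase, keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

# One count per word per document: sequence information is gone.
bags = [Counter(tokenize(d)) for d in docs]
```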
40. Tokenization
• Tokenization seems trivial . . .
  - Except if you have numbers or special characters in the tokens
  - What about hyphens? Apostrophes?
• Do we want to do n-grams?
• Keep only tokens that occur a certain number of times (not too rare, not too frequent)
• Note that this is more difficult in languages that don’t have clear word boundaries
41. Word Stemming
• Now we have a list of tokens
• But sometimes we get different “forms” of the same term:
  - playing, plays, play
  - confirm, confirmed, confirmation
• We can use “stemming” to map these different forms back to the same root
• Most Western languages have a reasonable set of rules
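A toy suffix-stripping stemmer showing the idea on the slide’s examples; real stemmers (e.g., the Porter stemmer) use far richer rule sets.

```python
# Illustrative only: strip a few common English suffixes from long-enough words.
SUFFIXES = ("ation", "ing", "ed", "s")

def crude_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word
```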
42. Other Textual Features
• Example: length of the text

    (length (f "textFeature"))

• Contains certain strings?
  - Dollar amounts? Dates? Salutations? Please and thank you?
• Flatline has full regular-expression capabilities
43. Latent Dirichlet Allocation
• Learn word distributions for topics
• Infer topic scores for each document
• Use the topic scores as features to a model
45. Feature Construction as Projection
• Feature construction means increasing the space in which learning happens
• Another set of techniques typically replaces the feature space
• Often these techniques are called dimensionality reduction, and the models learned are a new basis for the data
• Why would you do this?
  - New, possibly unrelated hypothesis space
  - Speed
  - Better visualization
46. Principal Component Analysis
• Find the axis that preserves the maximum amount of variance in the data
• Find the axis, orthogonal to the first, that preserves the next largest amount of variance, and so on
• In spite of this description, this isn’t an iterative algorithm (it can be solved with a matrix decomposition)
• Projecting the data into the new space is accomplished with a matrix multiplication
• The resulting features are linear combinations of the old features
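For two features, the direction of the first principal component has a closed form, which makes the idea easy to see in plain Python; the toy data lies nearly on the line y = x, so the first axis comes out near 45 degrees.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance matrix entries [[a, b], [b, c]].
a = sum((x - mx) ** 2 for x in xs) / (n - 1)
c = sum((y - my) ** 2 for y in ys) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Direction of maximum variance; projecting is just a dot product.
theta = 0.5 * math.atan2(2 * b, a - c)
pc1 = [(x - mx) * math.cos(theta) + (y - my) * math.sin(theta)
       for x, y in zip(xs, ys)]
```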
47. Distance to Cluster Centroids
• Do a k-means clustering
• Compute the distance from each point to each cluster centroid
• Ta-da! k new features
• Lots of variations on this theme:
  - Normalized / unnormalized, and by what?
  - Average the class distributions of the resulting clusters
  - Take the number of points / spread of each cluster into account
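With centroids in hand (here fixed by hand rather than fit by k-means), the new features are just distances:

```python
import math

# Two hypothetical cluster centroids in a 2-D feature space.
centroids = [(0.0, 0.0), (5.0, 5.0)]

def centroid_distance_features(point):
    # k centroids -> k new distance features for this point.
    return [math.dist(point, c) for c in centroids]
```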
48. Stacked Generalization: Classifiers as Features
• Idea: use model scores as input to a “meta-classifier”
• Algorithm:
  - Split the training data into “base” and “meta” subsets
  - Learn several different classifiers on the “base” subset
  - Compute predictions for the “meta” subset with the “base” classifiers
  - Use the scores on the “meta” subset as features for a classifier learned on that subset
• The meta-classifier learns when the predictions of each of the “base” classifiers are to be preferred
• Pop quiz: why do we need to split the data?
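A schematic of the split in Python, with trivial stand-in functions playing the role of trained base classifiers:

```python
data = list(range(20))

# Hold the "meta" subset out of base-model training entirely.
base, meta = data[:10], data[10:]

# Stand-ins for classifiers fit on `base`; real models would be trained here.
base_models = [lambda x: x % 2, lambda x: int(x > 12)]

# Base-model scores on the held-out subset become the meta-features
# for a second-level classifier trained only on `meta`.
meta_features = [[m(x) for m in base_models] for x in meta]
```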
49. Caveats (Again)
• There’s obviously a lot of representational power here
• It’s surprising sometimes how little it helps (remember, the upstream algorithms are already trying to do this)
• It’s really easy to get bogged down in details
• Some thoughts that might help you avoid that:
  - Something really good is likely to be good with little effort
  - Think of a decaying exponential where the vertical axis is “chance of dramatic improvement” and the horizontal one is “amount of tweaking”
  - Have faith in both your own intelligence and creativity
  - If you miss something really important, you’ll probably come back to it
• Use Flatline as inspiration; it was designed with data scientists in mind
50. It’s over! Questions?