The How and Why of Feature Engineering

Feature engineering: the underdog of machine learning. This deck provides an overview of feature generation methods for text, images, and audio; feature cleaning and transformation methods; how well they work; and why.

The How and Why of Feature Engineering
Alice Zheng, Dato
March 29, 2016
Strata + Hadoop World, San Jose

My journey so far
Applied machine learning / data science → build ML tools → write a book.
Motivation: a shortage of expertise and good tools in the market.

Machine learning is great!
Model data. Make predictions. Build intelligent applications. Play chess and Go!

The machine learning pipeline
Raw data → Features → Models → Predictions → Deploy in production
Example raw data: “I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …”

If machine learning were hairstyles
• Models: magnificent, ornate, high-maintenance
• Feature engineering: street smart, ad hoc, hacky
Images courtesy of “A visual history of ancient hairdos” and “An animated history of 20th century hairstyles.”

Making sense of feature engineering
• Feature generation
• Feature cleaning and transformation
• How well do they work?
• Why?

Feature Generation
Feature: an individual measurable property of a phenomenon being observed.
⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”

Representing natural text
Raw text: “It is a puppy and it is extremely cute.”
What’s important? Phrases? Specific words? Ordering? Subject, object, verb?
Task: classify puppy or not.
Bag of words: {“it”: 2, “is”: 2, “a”: 1, “puppy”: 1, “and”: 1, “extremely”: 1, “cute”: 1}

Representing natural text
The same bag of words, viewed as a sparse vector over the whole vocabulary:
it 2, they 0, I 0, am 0, how 0, puppy 1, and 1, cat 0, aardvark 0, cute 1, extremely 1, …
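
A minimal sketch of this counting in Python, using only the standard library; the tokenizer (lowercase, split on letters) is a simplifying assumption, not the exact one behind the slides:

```python
# Bag-of-words counting for the example sentence on the slide.
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to {token: count}."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumed tokenizer
    return Counter(tokens)

print(bag_of_words("It is a puppy and it is extremely cute."))
# Counter({'it': 2, 'is': 2, 'a': 1, 'puppy': 1, 'and': 1,
#          'extremely': 1, 'cute': 1})
```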

Representing images
Raw image: millions of RGB triplets, one for each pixel.
Task: classify person or animal.
Raw image → bag of visual words.
Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005–2009.

Representing images
Task: classify person or animal.
Raw image → deep learning features: a dense vector representation, e.g.
[3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8]
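
One common way to produce such a vector is to read activations off a pretrained network. A sketch with torchvision and ResNet-18, which are illustrative assumptions, not the model behind the numbers above:

```python
# Image -> dense feature vector from a pretrained CNN (assumed: ResNet-18).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()  # drop the classifier, keep the 512-d embedding
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("puppy.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    features = model(preprocess(img).unsqueeze(0)).squeeze(0)
print(features.shape)  # torch.Size([512]): a dense vector representation
```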

Representing audio
Task: classify music or voice; type of instrument.
Raw audio → spectrogram features: a time series of dense vectors, one column per time frame.

       t=0      t=1      t=2
    6.1917  -0.3411   1.2418
    0.2205   0.0214   0.4503
    1.0423   0.2214  -1.0017
   -0.2340  -0.0392  -0.2617
    0.2750   0.0226   0.1229
    0.0653   0.0428  -0.4721
    0.3169   0.0541  -0.1033
   -0.2970  -0.0627   0.1960
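
A sketch of the raw-audio-to-spectrogram step with scipy (an assumed library choice; the file name is hypothetical and the clip is assumed to be mono):

```python
# Raw audio -> spectrogram: one dense feature vector per time frame.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("clip.wav")  # hypothetical mono audio file
freqs, times, spec = spectrogram(audio, fs=rate, nperseg=512)

log_spec = np.log(spec + 1e-10)  # log power is a common convention
print(log_spec.shape)  # (n_freq_bins, n_frames): a time series of dense vectors
```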

Feature generation for audio, image, text
Text (“I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …”) is “human native”; audio and images are conceptually abstract.
The lower the semantic content of the raw data, the higher the difficulty of feature generation.

Feature Cleaning and Transformation

Auto-generated features are noisy
Most popular words in the Yelp reviews dataset (~6M reviews):

Rank  Word  Doc Count      Rank  Word  Doc Count
1     the   1,416,058      11    was   929,703
2     and   1,381,324      12    this  844,824
3     a     1,263,126      13    but   822,313
4     i     1,230,214      14    my    786,595
5     to    1,196,238      15    that  777,045
6     it    1,027,835      16    with  775,044
7     of    1,025,638      17    on    735,419
8     for   993,430        18    they  720,994
9     is    988,547        19    you   701,015
10    in    961,518        20    have  692,749

Auto-generated features are noisy
Least popular words in the Yelp reviews dataset (~6M reviews):

Rank     Word          Doc Count      Rank     Word             Doc Count
357,480  cmtk8xyqg     1              357,470  attractif        1
357,479  tangified     1              357,469  chappagetti      1
357,478  laaaaaaasts   1              357,468  herdy            1
357,477  bailouts      1              357,467  csmpus           1
357,476  feautred      1              357,466  costoso          1
357,475  résine        1              357,465  freebased        1
357,474  chilyl        1              357,464  tikme            1
357,473  cariottis     1              357,463  traditionresort  1
357,472  enfeebled     1              357,462  jallisco         1
357,471  sparklely     1              357,461  zoawan           1

Feature cleaning
• Popular words and rare words are not helpful
• Manually defined blacklist: stopwords

a            b         c       d           e       f         g        h        i
able         be        came    definitely  each    far       get      had      ie
about        became    can     described   edu     few       gets     happens  if
above        because   cannot  despite     eg      fifth     getting  hardly   ignored
according    become    cant    did         eight   first     given    has      immediately
accordingly  becomes   cause   different   either  five      gives    have     in
across       becoming  causes  do          else    followed  go       having   inasmuch
…            …         …       …           …       …         …        …        …

Feature cleaning
• Frequency-based pruning: drop words that occur in too many (or too few) documents, as in the sketch below
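
Both cleaning strategies fit in a few lines with scikit-learn's CountVectorizer; the library choice and the tiny corpus are illustrative assumptions:

```python
# Stopword blacklist vs. frequency-based pruning.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]

# Manually defined blacklist: drop English stopwords.
stopped = CountVectorizer(stop_words="english")
print(stopped.fit(docs).get_feature_names_out())
# ['cat' 'dog' 'kitten' 'pen' 'puppy']

# Frequency-based pruning: drop words in more than 90% of documents.
# On a real corpus, raise min_df (e.g. to 5) to drop rare words too.
pruned = CountVectorizer(max_df=0.9, min_df=1)
print(pruned.fit(docs).get_feature_names_out())
# 'have' is gone: it appears in every document.
```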

Stopwords vs. frequency filters

Stopwords                 Frequency filters
No training required      Adapts to data
Can be exhaustive         Also deals with rare words
Inflexible                Needs tuning, hard to control

Both require manual attention.

Tf-idf: automatic “soft” filter
• Tf-idf = term frequency × inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Idf is large for uncommon words, small for popular words
• Tf-idf discounts popular words and highlights rare words (see the sketch below)
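
A from-scratch sketch of these definitions, using the natural log as the slides do; the toy corpus reuses the sentences from the plots that follow:

```python
# tfidf(w, d) = tf(w, d) * log(N / df(w))
import math
from collections import Counter

docs = [
    "i have a puppy".split(),
    "i have a cat".split(),
    "i have a kitten".split(),
    "i have a dog and i have a pen".split(),
]
N = len(docs)
df = Counter(w for doc in docs for w in set(doc))  # docs containing each word

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

print(tfidf(docs[0]))
# idf('have') = log(4/4) = 0, so 'have' is zeroed out;
# idf('puppy') = log 4, so 'puppy' is emphasized.
```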

Visualizing bag-of-words
[Plot: the sentences “I have a puppy,” “I have a cat,” “I have a kitten,” and “I have a dog and I have a pen” as points in word-count space, on axes “puppy,” “cat,” and “have.”]

Visualizing tf-idf
[The same plot, annotated with idf values:]
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0

Visualizing tf-idf
[After scaling: the “have” axis is zeroed out, and the remaining axes stretch by log 4:]
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0

Algebraically, tf-idf = column scaling
[Document-term matrix: rows d1 … dN are documents, columns w1 … wM are words.]
idf = log (N / L0 norm of word column)

Algebraically, tf-idf = column scaling
Multiply each word column by a scalar: the idf of that word.
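
In matrix terms this is right-multiplication by a diagonal matrix of idf values. A numpy sketch on the toy corpus above (the four-word vocabulary is a simplification):

```python
# tf-idf as column scaling of the document-term matrix X.
import numpy as np

# Columns: puppy, cat, kitten, have (deliberately tiny vocabulary).
X = np.array([
    [1, 0, 0, 1],   # "I have a puppy"
    [0, 1, 0, 1],   # "I have a cat"
    [0, 0, 1, 1],   # "I have a kitten"
    [0, 0, 0, 2],   # "I have a dog and I have a pen"
], dtype=float)

N = X.shape[0]
df = np.count_nonzero(X, axis=0)  # L0 norm of each word column
idf = np.log(N / df)              # idf = log(N / L0 norm)

X_tfidf = X @ np.diag(idf)        # scale every column by its idf
print(X_tfidf)
# The 'have' column (idf = log(4/4) = 0) is zeroed out entirely.
```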

Other types of column scaling
• L2 scaling = divide each column by its L2 norm, as in the sketch below
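
A numpy sketch of L2 column scaling (the matrix is arbitrary example data):

```python
# L2 column scaling: divide each column by its L2 norm.
import numpy as np

X = np.array([
    [1.0, 0.0, 3.0],
    [2.0, 0.5, 0.0],
    [2.0, 0.0, 4.0],
])

X_scaled = X / np.linalg.norm(X, axis=0)  # every column now has unit length
print(np.linalg.norm(X_scaled, axis=0))   # [1. 1. 1.]
```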

How well do they work?

Classify reviews using logistic regression
• Task: classify the business category of Yelp reviews
• Featurizations compared: bag-of-words vs. L2 normalization vs. tf-idf
• Model: logistic regression (a sketch of the setup follows)
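
A sketch of the setup with scikit-learn; the library, the tiny stand-in corpus, and the column-normalization helper are all assumptions standing in for the full Yelp experiment:

```python
# Same model, three featurizations of the same text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, normalize

# Stand-ins for Yelp reviews and their business categories.
texts = [
    "great tacos and friendly staff", "the pasta was delicious",
    "best burgers in town", "quick oil change and fair prices",
    "they fixed my brakes the same day", "honest mechanic, car runs great",
]
labels = ["restaurant"] * 3 + ["auto"] * 3

# L2 scaling of the *columns* of the document-term matrix.
l2_columns = FunctionTransformer(
    lambda X: normalize(X, norm="l2", axis=0), accept_sparse=True)

featurizers = {
    "bag-of-words": CountVectorizer(),
    "l2-normalized": make_pipeline(CountVectorizer(), l2_columns),
    "tf-idf": TfidfVectorizer(),
}

for name, feats in featurizers.items():
    model = make_pipeline(feats, LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, texts, labels, cv=3)
    print(f"{name}: {scores.mean():.3f}")
```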

Observations
• l2 regularization made no difference (with proper tuning)
• L2 normalization made no difference in accuracy
• Tf-idf did better, but barely
• Yet both are column scaling methods! Why the difference?

A Peek Under the Hood

Linear classification
[Scatter plot on axes Feature 1 and Feature 2.]
Find the best line to separate two classes.

Algebraically: solve linear systems
[Pictured: data matrix times weight vector equals label vector.]

How a matrix works
Any matrix factors into left singular vectors, singular values, and right singular vectors (the SVD).

How a matrix works
Any matrix acts in three steps: project (onto the right singular vectors), scale (by the singular values), project (onto the left singular vectors).
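
A numpy sketch of this factorization (random data for illustration):

```python
# SVD: A = U @ diag(s) @ Vt, i.e. project, scale, project.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

x = rng.standard_normal(3)
y = U @ (np.diag(s) @ (Vt @ x))   # project, scale, project back out
print(np.allclose(A @ x, y))      # True
```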

Null space
Singular value = 0.
The null space is the part of the input space that is squashed by the matrix.

Column space
Singular value ≠ 0.
The column space is the non-zero part of the output space.

Effect of column scaling
Scaling the columns changes the singular values (but zeros stay zero); the singular vectors may also change.

Effect of column scaling
• Changes the singular values and vectors, but not the rank of the null space or column space
  - … unless the scaling factor is zero, which can only happen with tf-idf
• L2 scaling improves the condition number, so the solver converges faster (see the sketch below)
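
A numpy sketch of the condition-number effect (synthetic data; the column scales are arbitrary):

```python
# L2 column scaling shrinks the condition number
# (ratio of largest to smallest singular value).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5)) * [1.0, 10.0, 100.0, 1e3, 1e4]

A_scaled = A / np.linalg.norm(A, axis=0)  # unit-length columns

print(np.linalg.cond(A))         # large: columns on wildly different scales
print(np.linalg.cond(A_scaled))  # close to 1: easier for iterative solvers
```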

Mystery resolved
• Tf-idf can emphasize some columns while zeroing out others (the uninformative features)
• L2 normalization makes all features equal in “size”
  - Improves the condition number of the matrix
  - The solver converges faster

Take-away points
• Many tricks for feature generation and transformation
• Features interact with models, making their effects difficult to predict
• But so much fun to play with!
• New book coming out: Mastering Feature Engineering
  - More tricks, intuition, analysis
@RainyData
