Feature engineering — HJ Van Veen (Nubank) @PAPIs Connect — São Paulo 2017


Feature engineering is one of the most important, yet elusive, skills to master if you want to be a good data scientist. Machine learning competitions are hardly ever won with strong modeling techniques alone -- it is the combination of creative feature engineering and powerful modeling techniques that makes the difference. This tutorial will give the audience practical tips and tricks to improve the performance of machine learning algorithms. We will broadly look at feature engineering for applied machine learning, touching on subjects like: categorical vs. numerical variables, data cleaning, feature extraction, transformations, and imputation.



Feature Engineering
HJ van Veen - Data Scientist at Nubank
Feature Engineering
• Better data beats big data
• Applied Machine Learning is data infra, feature engineering, and modeling
• Feature engineering is turning your data into something a model understands
• Creativity, inquisitiveness, agility
Feature Types
• Categorical: "female", "teacher_ID_115"
• Numbers: -0.734, 58, 71165.80
• Temporal: 21-12-2012, 18:15, "Domingo"
• Spatial: "Guarujá", latitude: 51.865
• Text: "<h1>Eu falo português!</h1>"
• Images: grayscale, .jpg
Onehot Encoding
• Encode a variable with k categories into a one-of-k array of size k
• Bag of words
• Linear algorithms and neural networks
• “NL” -> one_of_k(“NL”) -> [0, 0, 0, 1]
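The one_of_k step above can be sketched in plain Python; the category list is a made-up example matching the slide's four-element output:

```python
def one_of_k(value, categories):
    """Encode a single categorical value as a one-of-k (one-hot) array."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical vocabulary in which "NL" is the fourth category.
countries = ["BR", "DE", "FR", "NL"]
print(one_of_k("NL", countries))  # [0, 0, 0, 1]
```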
Hash Encoding
• Encode a variable with k categories into a one-of-h array of size h
• Collisions
• Fast & memory-friendly
• “NL” -> hash(“NL”) -> [0, 0, 1]
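A minimal sketch of the hashing trick, using a stable hash (hashlib rather than Python's per-run-randomized hash()) so the bucket is reproducible; collisions are possible by design:

```python
import hashlib

def hash_encode(value, h):
    """Map a category into a one-of-h array via a stable hash; distinct
    categories may collide into the same bucket."""
    bucket = int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % h
    arr = [0] * h
    arr[bucket] = 1
    return arr

print(hash_encode("NL", 3))
```

Note that h can be far smaller than the number of categories, which is where the memory savings come from.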
Label Encoding
• Give each of k categories a unique numerical ID
• Tree-based algorithms
• Dimensionality-friendly
• "NL" -> unique_id("NL") -> [3]
Binary Encoding
• Uses the binary representation of the label ID
• Can encode over 4 billion categories into 32 bits
• "NL" -> binary(unique_id("NL")) -> [1, 0, 1, 1, 1, 1, 1]
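A minimal sketch of the binary step; the slide's 7-bit example array corresponds to label ID 95, which is assumed here purely for illustration:

```python
def binary_encode(label_id, n_bits=32):
    """Fixed-width binary representation of a label ID, most significant bit first.
    32 bits cover 2**32 (over 4 billion) distinct IDs."""
    return [(label_id >> i) & 1 for i in reversed(range(n_bits))]

print(binary_encode(95, n_bits=7))  # [1, 0, 1, 1, 1, 1, 1]
```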
Count Encoding
• Replace a variable with its count in the train set
• Captures the popularity of the variable
• "NL" -> count_in_train(“NL”) -> [5]
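count_in_train reduces to a frequency table over the train column; the train list below is a made-up example in which "NL" appears 5 times:

```python
from collections import Counter

def count_encode(train_values):
    """Map each category to how often it appears in the train set."""
    return Counter(train_values)

counts = count_encode(["NL", "NL", "NL", "NL", "NL", "BR", "BR"])
print(counts["NL"])  # 5
```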
Rank Count Encoding
• Uniquely rank a variable's count in the train set
• Avoids collisions and outliers
• "NL" -> rank(count_in_train(“NL”)) -> [7]
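A minimal sketch of rank(count_in_train(...)); to keep ranks unique when two categories tie on count, ties are broken by category name here, which is one possible convention and an assumption of this sketch:

```python
from collections import Counter

def rank_count_encode(train_values):
    """Rank categories by their train-set count, ascending; ties broken by
    name so every category gets a unique rank (no collisions)."""
    counts = Counter(train_values)
    ordered = sorted(counts, key=lambda c: (counts[c], c))
    return {c: rank + 1 for rank, c in enumerate(ordered)}

ranks = rank_count_encode(["NL", "NL", "BR", "DE", "DE", "DE"])
print(ranks)  # {'BR': 1, 'NL': 2, 'DE': 3}
```

Ranking also tames outliers: a category with a count of 1,000,000 gets a rank only one step above the runner-up.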
Likelihood Encoding
• Replace a variable with its target average
• Avoid overfitting
• "NL" -> mean_of_target(“NL”) -> [0.66]
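A minimal sketch of mean_of_target on made-up train rows; in practice these means should be computed out-of-fold or with smoothing, since a naive in-fold mean leaks the target and overfits rare categories:

```python
def likelihood_encode(categories, targets):
    """Map each category to the mean of the target over its train rows."""
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

# Made-up rows: "NL" has targets 1, 1, 0 -> mean 0.66...
means = likelihood_encode(["NL", "NL", "NL", "BR"], [1, 1, 0, 1])
print(round(means["NL"], 2))  # 0.67
```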
Embedding Encoding
• Use a model to create an embedding
• Faster & more memory-friendly
• “NL”, “F” -> nn_embed([“NL”, “F”]) -> [0.66, 0.71, 0.05]
Numbers
• Imputation
• Binning, Rounding, Log-transforms
• Categorical encoding
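Two of the transforms above on made-up amounts: log1p compresses heavy-tailed values, and rounding to the nearest 10 acts as a crude form of binning (both choices here are illustrative, not prescriptive):

```python
import math

amounts = [0.0, 9.5, 58.0, 71165.80]  # hypothetical raw values
logged = [math.log1p(x) for x in amounts]   # log-transform, safe at zero
binned = [round(x, -1) for x in amounts]    # round to nearest 10 as crude bins
print(logged)
print(binned)
```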
Temporal
• Day of week, Hour of day
• Trends
• Proximity to major events
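The three bullets above, sketched on the slide's example timestamp (21-12-2012 18:15), using Christmas as the "major event":

```python
from datetime import datetime

t = datetime(2012, 12, 21, 18, 15)
features = {
    "day_of_week": t.weekday(),  # 0 = Monday, so 4 = Friday
    "hour_of_day": t.hour,
    # proximity to a major event: whole days until Christmas 2012
    "days_to_christmas": (datetime(2012, 12, 25) - t).days,
}
print(features)
```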
Spatial
• Proximity to major cities
• Kriging, Clustering
• Fraud signals
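Proximity to a major city can be computed with the haversine great-circle distance; the coordinates for Guarujá and São Paulo below are rough approximations used only for illustration:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate coordinates (assumed): Guarujá vs. São Paulo.
print(round(haversine_km(-23.99, -46.26, -23.55, -46.63), 1))
```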
Text
• TF-IDF
• n-Grams
• Reducing dimensionality
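A minimal TF-IDF sketch on whitespace tokens (real pipelines would add n-grams, smoothing, and normalization; the two documents are made up):

```python
import math
from collections import Counter

def tfidf(docs):
    """Term frequency weighted by inverse document frequency, per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for doc in tokenized for term in set(doc))
    out = []
    for doc in tokenized:
        tf = Counter(doc)
        out.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return out

weights = tfidf(["eu falo portugues", "eu falo ingles"])
print(weights[0])  # terms shared by all docs get weight 0
```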
Images
• Resize
• Rotate, skew, whiten
• Aggregate statistics
Missing
• Imputing
• Hardcoding
• Ignoring
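The three strategies above, sketched on a made-up column where None marks a missing value:

```python
values = [1.0, None, 3.0, None, 5.0]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)

imputed = [mean if v is None else v for v in values]     # impute with the mean
hardcoded = [-999 if v is None else v for v in values]   # hardcode a sentinel
is_missing = [int(v is None) for v in values]            # or just flag missingness
print(imputed, hardcoded, is_missing)
```

A missingness indicator column often helps even when you also impute, since "was missing" can itself be predictive.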
Consolidation
• Common & Rare
• Spelling errors
• Cleaning
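A minimal consolidation sketch: categories below a count threshold (often rare values or spelling errors like "gmial") are grouped under one label; the threshold and label are arbitrary choices:

```python
from collections import Counter

def consolidate_rare(values, min_count=2, other="OTHER"):
    """Replace categories seen fewer than min_count times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

raw = ["gmail", "gmail", "gmial", "hotmail", "hotmail"]
print(consolidate_rare(raw))  # ['gmail', 'gmail', 'OTHER', 'hotmail', 'hotmail']
```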
Expansion
• User agents
• Emails
• Hierarchical codes
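Expansion splits one raw string into several features. A sketch for emails, with hypothetical feature names and an invented example address:

```python
def expand_email(address):
    """Expand a single email string into several candidate features."""
    local, _, domain = address.partition("@")
    return {
        "local": local,
        "domain": domain,
        "tld": domain.rsplit(".", 1)[-1],
        "has_digits": any(ch.isdigit() for ch in local),
    }

print(expand_email("maria1984@gmail.com"))
```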
Interactions
• Hardcode interactions
• Division, Multiplication, Addition, Subtraction, Combination, Similarity
• Tools & Processes
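A sketch of the arithmetic interactions above, generated for every pair of numeric features in a row (the feature names are made up):

```python
from itertools import combinations

def numeric_interactions(row):
    """Pairwise sum, difference, product, and ratio of named numeric features."""
    out = {}
    for (a, va), (b, vb) in combinations(row.items(), 2):
        out[f"{a}+{b}"] = va + vb
        out[f"{a}-{b}"] = va - vb
        out[f"{a}*{b}"] = va * vb
        if vb != 0:
            out[f"{a}/{b}"] = va / vb
    return out

print(numeric_interactions({"age": 30, "income": 5000}))
```

Generating all pairs grows quadratically, so in practice interactions are pruned by feature importance or hardcoded from domain knowledge.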
Aggregate Statistics
• Row & Column statistics
• Counts
• Reading level
• Blacklist membership
Scaling
• Log transforms
• MinMax Scaling
• Z-Scoring / Standard Scaling
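Min-max scaling and z-scoring, sketched in plain Python on a made-up column (this version uses the population standard deviation; fit the statistics on train data only, then apply them to test data):

```python
def minmax_scale(xs):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

print(minmax_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
print(z_score([2.0, 4.0, 6.0]))
```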
Meta-features
• Unsupervised
• Model stacking
• Feature stacking
Case Study
• Predict fraudsters from a “name” and “email” form
• Expansion
• Temporal
• Aggregate Statistics
• Randomness
• Interactions
Conclusion
• Use XGBoost
• Label encode categorical variables
• Impute NaNs with -999, and set the missing parameter
• Use subsampling to quickly test new variables
• Try everything until you reach a plateau or deadline
