Feature Engineering

FEATURE ENGINEERING
HJ van Veen - Data Science - Nubank Brasil
–Peter Norvig
“More data beats clever algorithms, but better
data beats more data.”
Feature Engineering
• Most creative aspect of Data Science.
• Treat like any other creative endeavor, like writing a
comedy show:
• Hold brainstorming sessions
• Create templates / formula’s
• Check/revisit what worked before
Categorical Features
• Nearly always need some treatment
• High cardinality can create very sparse data
• Difficult to impute missing
Onehot encoding
• One-of-K encoding on an array of length K.
• Basic method: Used with most linear algorithms
• Dropping first column avoids collinearity
• Sparse format is memory-friendly
• Most current implementations don’t gracefully treat
missing, unseen variables
Onehot encoding
Sample: ["BR"]
country country=NL country=BR country=US
------- ---------- ---------- ----------
NL => [ 0, 1, 0]
BR
US
Encoded dense: [0, 1, 0]
Encoded sparse: 2:1
Hash encoding
• Does “OneHot-encoding” with arrays of a fixed
length.
• Avoids extremely sparse data
• May introduce collisions
• Can repeat with different hash functions and bag result for
small bump in accuracy
• Collisions usually degrade results, but may improve it.
• Gracefully deals with new variables (eg: new user-agents)
Hash encoding
Sample: ["BR"]
hash("BR") => `2`
country hash1 hash2 hash3 hash4 hash5
------- ----- ----- ----- ----- -----
NL => [ 0 1, 0 0, 0]
BR
US
Encoded dense: [0, 1, 0, 0, 0]
Encoded sparse: 2:1
Label encoding
• Give every categorical variable a unique numerical
ID
• Useful for non-linear tree-based algorithms
• Does not increase dimensionality
• Randomize the cat_var -> num_id mapping and
retrain, average, for small bump in accuracy.
Label encoding
Sample: ["Queenstown"]
city city
----------- ----
Cherbourg 1
Queenstown => 2
Southampton 3
Encoded: [2]
Count encoding
• Replace categorical variables with their count in the
train set
• Useful for both linear and non-linear algorithms
• Can be sensitive to outliers
• May add log-transform, works well with counts
• Replace unseen variables with `1`
• May give collisions: same encoding, different variables
Count encoding
Sample: ["A6GHBD78"]
teacher_ID teacher_ID
---------- ----------
DEADB33F 4
A6GHBD78 3
DEADB33F 4
FCKGWRHQ => 1
DEADB33F 4
A6GHBD78 3
A6GHBD78 3
DEADB33F 4
encoded: [3]
LabelCount encoding
• Rank categorical variables by count in train
set
• Useful for both linear and non-linear algorithms
• Not sensitive to outliers
• Won’t give same encoding to different variables
• Best of both worlds
LabelCount encoding
tld tld
--- ---
nl 3
nl 3
nl 3
nl => 3
de 2
de 2
fr 1
fr 1
Target encoding
• Encode categorical variables by their ratio of target (binary
classification or regression)
• Be careful to avoid overfit!
• Form of stacking: single-variable model which outputs average target
• Do in cross-validation manner
• Add smoothing to avoid setting variable encodings to 0.
• Add random noise to combat overfit
• When applied properly: Best encoding for both linear and non-linear
Target encoding
role y role
--------- - ----
manager | 1 0.5
engineer | 1 0.66
scientist | 1 => 1.
manager | 0 0.5
engineer | 0 0.66
engineer | 1 0.66
Category Embedding
• Use a Neural Network to create dense embeddings
from categorical variables.
• Map categorical variables in a function approximation
problem into Euclidean spaces
• Faster model training.
• Less memory overhead.
• Can give better accuracy than 1-hot encoded.
• https://arxiv.org/abs/1604.06737
Category Embedding
role role 3-D embedding
--------- —————————————————-
manager | [0.05, 0.10, 0.96]
engineer | [0.72, 0.66, 0.17]
scientist | [0.75, 0.62, 0.15]
manager | [0.05, 0.10, 0.96]
engineer | [0.72, 0.66, 0.17]
engineer | [0.72, 0.66, 0.17]
NaN encoding
• Give NaN values an explicit encoding instead
of ignoring
• NaN-values can hold information
• Be careful to avoid overfit!
• Use only when NaN-values in train and test set are
caused by the same, or when local validation proves
it holds signal
NaN encoding
Sample = [NaN]
UA UA=mobile UA=tablet UA=NaN
------- --------- --------- ------
mobile 0 0 1
tablet
mobile =>
NaN
mobile
Encoded = [0, 0, 1]
Polynomial encoding
• Encode interactions between categorical
variables
• Linear algorithms without interactions can not solve
the XOR problem
• A polynomial kernel *can* solve XOR
• Explodes the feature space: use FS, hashing and/or
VW
Polynomial encoding
A B y A=1*B=1 A=0*B=1 A=1*B=0 A=0*B=0 y
- - - ------- ------- ------- ------- -
1 1 | 1 1 0 0 0 | 1
0 1 | 0 => 0 1 0 0 | 0
1 0 | 0 0 0 1 0 | 0
0 0 | 1 0 0 0 1 | 1
Expansion encoding
• Create multiple categorical variables from a single variable
• Some high cardinality features, like user-agents, hold far more
information in them:
• is_mobile?
• is_latest_version?
• Operation_system
• Browser_build
• Etc.
Expansion encoding
Mozilla/5.0 (Macintosh; Intel Mac OS X
10_10_4) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/53.0.2785.143 Safari/537.36
|
v
UA1 UA2 UA3 UA4 UA5
------ ------------- ------- --- -------
Chrome 53.0.2785.143 Desktop Mac 10_10_4
Consolidation encoding
• Map different categorical variables to the
same variable
• Spelling errors, slightly different job descriptions,
full names vs. abbreviations
• Real data is messy, free text especially so
Expansion encoding
company_desc desc1 company_desc2
------------------ ----- -------------
Shell Shell Gas station
shel Shell Gas station
SHELL Shell Gas station
Shell Gasoline Shell Gas station
BP => BP Gas station
British Petr. BP Gas station
B&P BP Gas station
BP Gas Station BP Gas station
bp BP Gas station
Procter&Gamble P&G Manufacturer
–Andrew Ng
“Coming up with features is difficult, time-
consuming, requires expert knowledge. "Applied
machine learning" is basically feature
engineering.”
Numerical Features
• Can be more readily fed into algorithms
• Can constitute floats, counts, numbers
• Easier to impute missing data
Rounding
• Round numerical variables
• Form of lossy compression: retain most significant
features of the data.
• Sometimes too much precision is just noise
• Rounded variables can be treated as categorical
variables
• Can apply log-transform before rounding
Rounding
age age1 age2
------- ---- ----
23.6671 23 2
23.8891 23 2
22.1261 => 22 2
19.5506 19 1
18.2114 18 1
Binning
• Put numerical variables into a bin and
encode with bin-ID
• Binning can be set pragmatically, by quantiles,
evenly, or use models to find optimal bins
• Can work gracefully with variables outside of
ranges seen in the train set
Binning
risk_score rs[-inf,33] rs[33,66] rs[66,inf]
---------- ----------- --------- ----------
15 1 0 0
77 0 0 1
78 => 0 0 1
55 0 1 0
42 0 1 0
Binning
Scaling
• Scale to numerical variables into a certain
range
• Standard (Z) Scaling
• MinMax Scaling
• Root scaling
• Log scaling
Imputation
• Impute missing variables
• Hardcoding can be combined with imputation
• Mean: Very basic
• Median: More robust to outliers
• Ignoring: just postpones the problem
• Using a model: Can expose algorithmic bias
Imputation
wage hours gender | y wage hours | gender_y
---- ----- ------ | - ---- ----- | --------
1600 40 0 | 1 1600 40 | 0
2200 50 1 | 1 2200 50 | 1
1800 36 0 | 0 => 1800 36 | 0
2100 45 1 | 0 2100 45 | ?
2050 60 NaN | 0 2050 60 | ?
1650 36 0 | 1 1650 36 | ?
Interactions
• Specifically encodes the interactions between
numerical variables
• Try: Substraction, Addition, Multiplication, Divison
• Use: Feature selection by statistical tests, or trained
model feature importances
• Ignore: Human intuition; weird interactions can
give significant improvement!
–Pedro Domingos
“…some machine learning projects succeed and
some fail. What makes the difference? Easily the
most important factor is the features used.”
Non-linear encoding for linear algo’s
• Hardcode non-linearities to improve linear
algorithms
• Polynomial kernel
• Leafcoding (random forest embeddings)
• Genetic algorithms
• Locally Linear Embedding, Spectral Embedding, t-
SNE
Row statistics
• Create statistics on a row of data
• Number of NaN’s,
• Number of 0’s
• Number of negative values
• Mean, Max, Min, Skewness, etc.
Xavier Conort
“The algorithms we used are very standard for
Kagglers. […] We spent most of our efforts in
feature engineering. [...] We were also very
careful to discard features likely to expose us to
the risk of over-fitting our model.”
Temporal Variables
• Temporal variables, like dates, need better local
validation schemes (like backtesting)
• Easy to make mistakes here
• Lots of opportunity for major improvements
Projecting to a circle
• Turn single features, like day_of_week, into
two coordinates on a circle
• Ensures that distance between max and min is the
same as min and min +1.
• Use for day_of_week, day_of_month, hour_of_day,
etc.
Trendlines
• Instead of encoding: total spend, encode
things like: Spend in last week, spend in last
month, spend in last year.
• Gives a trend to the algorithm: two customers with
equal spend, can have wildly different behavior —
one customer may be starting to spend more, while
the other is starting to decline spending.
Closeness to major events
• Hardcode categorical features like:
date_3_days_before_holidays:1
• Try: National holidays, major sport events,
weekends, first Saturday of month, etc.
• These factors can have major influence on spending
behavior.
Scott Locklin
“feature engineering is another topic which
doesn’t seem to merit any review papers or
books, or even chapters in books, but it is
absolutely vital to ML success. […] Much of the
success of machine learning is actually success
in engineering features that a learner can
understand.”
Spatial Variables
• Spatial variables are variables that encode a location
in space
• Examples include: GPS-coordinates, cities,
countries, addresses
Categorizing location
• Kriging
• K-means clustering
• Raw latitude longitude
• Convert cities to latitude longitude
• Add zip codes to streetnames
Closeness to hubs
• Find closeness between a location to a major
hub
• Small towns inherit some of the culture/context of
nearby big cities
• Phone location can be mapped to nearby businesses
and supermarkets
Spatial fraudulent behavior
• Location event data can be indicative of
suspicious behavior
• Impossible travel speed: Multiple simultaneous
transactions in different countries
• Spending in different town than home or shipping
address
• Never spending at the same location
Shayne Miel
“you have to turn your inputs into things the
algorithm can understand”
Exploration
• Data exploration can find data health issues,
outliers, noise, feature engineering ideas,
feature cleaning ideas.
• Can use: Console, Notebook, Pandas
• Try simple stats: Min, max
• Incorporate the target so find correlation between
signal.
Iteration / Debugging
• Feature engineering is an iterative process:
Make your pipelines suitable for fast
iteration.
• Use sub-linear debugging: Output intermediate
information on the process, do spurious logging.
• Use tools that allow for fast experimentation
• More ideas will fail, than ideas will work
Label Engineering
• Can treat a label/target/dependent variable as a
feature of the data and vice versa.
• Log-transform: y -> log(y+1) | exp(y_pred) - 1
• Square-transform
• Box-Cox transform
• Create a score, to turn binary target in regression.
• Train regressor to predict a feature not available in test set.
Francois Chollet
“Developing good models requires iterating
many times on your initial ideas, up until the
deadline; you can always improve your models
further. Your final models will typically share
little in common with the solutions you
envisioned when first approaching the problem,
because a-priori plans basically never survive
confrontation with experimental reality.”
Natural Language Processing
• Can use the same ideas from categorical features.
• Deep learning (automatic feature engineering)
increasingly eating this field, but shallow learning
with well-engineered features is still competitive.
• High sparsity in data introduces you to “curse of
dimensionality”
• Many opportunities for feature engineering:
Natural Language Processing
• Lowercasing,
• Removing non-alphanumeric,
• Repairing,
• Encoding punctuation marks,
• Tokenizing,
• Token-grams,
• skipgrams,
• char-grams,
• Removing stopwords,
• Removing rare words
• and very common words,
• Spelling Correction,
• Chopping,
• Stemming,
• Lemmatization,
• Document features,
• Entitity Insertion & Extraction
• Simplification,
• Word2Vec and GloVe / Doc2Vec,
• String Similarity,
• Reading level,
• Nearest Neighbors,
• TF*IDF,
• BayesSVM, Vectorization, LDA, LSA.
Cleaning
• Lowercasing: Make tokens independant of capitalisation:
“I work at NASA” -> “i work at nasa”.
• Unidecode: Convert accented characters to their ascii-
counterparts: “Memórias Póstumas de Brás Cubas” ->
“Memorias Postumas de Bras Cubas”
• Removing non-alphanumeric: Clean text by removing
anything not in [a-z] [A-Z] [0-9]. “Breaking! Amsterdam
(2009)” -> “Breaking Amsterdam 2009”
• Repairing: Fix encoding issues or trim intertoken spaces.
“C a s a C a f é” -> “Casa Café”
Tokenizing
• Encode punctuation marks: Hardcode “!” and “?” as tokens.
• Tokenize: Chop sentences up in word tokens.
• N-Grams: Encode consecutive tokens as tokens: “I like the
Beatles” -> [“I like”, “like the”, “the Beatles”]
• Skip-grams: Encode consecutive tokens, but skip a few: “I like
the Beatles” -> [“I the”, “like Beatles”]
• Char-grams: Same as N-grams, but character level: “Beatles” -
> [“Bea”, “eat”, “atl”, “tle”, “les”]
• Affixes: Same as char-grams, but only the postfixes and prefixes
Removing
• Stopwords: Remove words/tokens that appear in
stopword lists.
• Rare words: Remove words that only appear few
times in training set.
• Common words: Remove extremely common
words that may not be in a stopword list.
Roots
• Spelling correction: Change tokens to their
correct spelling.
• Chop: Take only the first n (8) characters of a
word.
• Stem: Reduce a word/token to its root. “cars” ->
“car”
• Lemmatize: Find semantic root “never be late” ->
“never are late”
Enrich
• Document features: Count number of spaces, tabs,
newlines, characters, tokens, etc.
• Entity insertion: Add more general specifications to
text “Microsoft releases Windows” -> “Microsoft
(company) releases Windows (application)”
• Parse Trees: Parse a sentence into logic form: “Alice hits
Bill” -> Alice/Noun_subject hits/Verb Bill/Noun_object.
• Reading level: Compute the reading level of a
document.
Similarities
• Token similarity: Count number of tokens that
appear in two texts.
• Compression distance: Look if one text can be
compressed better using another text.
• Levenshtein/Hamming/Jaccard Distance: Check
similarity between two strings, by looking at number of
operations needed to transform one in the other.
• Word2Vec / Glove: Check cosine similarity between
two averaged vectors.
TF-IDF
• Term Frequency: Reduces bias to long
documents.
• Inverse Document Frequency: Reduces bias to
common tokens.
• TF-IDF: Use to identify most important tokens in a
document, to remove unimportant tokens, or as a
preprocessing step to dimensionality reduction.
Dimensionality Reduction
• PCA: Reduce text to 50 or 100-dimensional vector.
• SVD: Reduce text to 50 or 100-dimensional vector.
• LDA: TF-IDF followed by SVD.
• LSA: Create topic vectors.
External models
• Sentiment Analyzers: Get a vector for negative
or positive sentiment for any text.
• Topic models: Use another dataset to create topic
vectors for a new dataset.
Hal Daume III
“So many papers: feature engineering is hard
and time consuming. instead here's 8 pages in
which we have to design a new weird neural net
to do the same thing”
Neural Networks & Deep Learning
• Neural networks claim end-to-end automatic
feature engineering.
• Feature engineering dying field?
• No! Moves the focus to architecture engineering
• And despite promise: computer vision uses features
like: HOG, SIFT, whitening, perturbation, image
pyramids, rotation, z-scaling, log-scaling, frame-
grams, external semantic data, etc.
Leakage / Golden Features
• Feature engineering can help exploit leakage.
• Reverse engineer:
• Reverse MD5 hash with rainbow tables.
• Reverse TF-IDF back to Term Frequency
• Encode order of samples data set.
• Encode file creation dates.
• Rule mining:
• Find simple rules (and encode these) to help your model.
Case Study: Quora Duplicate
Questions Dataset
• Classify ~440.000 question pairs as duplicate
or non-duplicate.
• Benchmark 1: 0.79 accuracy (Stacked Siamese
Nets)
• Benchmark 2: 0.82 accuracy (Neural Bag of
Words)
• Benchmark 3: 0.88 accuracy (Bilateral Multi-
Perspective Matching)
Case Study: Quora Duplicate
Questions Dataset
• First attempt: Simple bag of words with logistic
regression.
• 0.75 accuracy
• Second attempt: Polynomial feature interactions
between the tokens in both questions.
• 0.80 accuracy
Case Study: Quora Duplicate
Questions Dataset
• Third attempt: Use stemming with
SnowballStemmer from NLTK.
• 0.805 accuracy
• Fourth attempt: Add 2-grams
• 0.81 accuracy
Case Study: Quora Duplicate
Questions Dataset
• Fifth attempt: Add manual features
• Normalized difference in length between question pairs
• Normalized compression distance between question pairs.
• Cosine distance between averaged word2vec vectors for the
question pairs.
• Chargram co-occurence between question pairs.
• Token count of words: “which, what, where”
• 0.827 Accuracy
Case Study: Quora Duplicate
Questions Dataset
• Can you think of any more features to
engineer?
• External & pre-trained models?
• Search engine models?
• Logic based models?
Robert J. Bennett
This one went unusually smoothly. When I
finished it, I remarked to a friend that I felt like
an engineer who had designed a machine and
then sat back and realized it did everything I’d
set out to do.
Which made him say, quite emphatically, “No
engineer has ever felt this.”
Questions?
Resources & Further Reading
• Kaggle forums & kernels: Far0n, KazAnova, Fchollet, Abhishek, Gilberto Titericz, Leustagos, Owen Zhang, Gert
Jacobusse …
• Introduction: http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-
to-get-good-at-it/
• Books:
• Mastering Feature Engineering (Alice Zheng),
• Feature Extraction (Isabelle Guyon et al.)
• Blogs:
• https://smerity.com/articles/2016/architectures_are_the_new_feature_engineering.html
• http://hunch.net/~jl/projects/hash_reps/
• https://blogs.technet.microsoft.com/machinelearning/2014/09/24/online-learning-and-sub-linear-debugging/
• http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/
• http://blog.kaggle.com/2016/08/24/avito-duplicate-ads-detection-winners-interview-1st-place-team-devil-team-
stanislav-dmitrii/
• http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost
• Data: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
• Software: https://github.com/trevorstephens/gplearn
1 of 76

Recommended

Feature Engineering by
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
10.7K views45 slides
General Tips for participating Kaggle Competitions by
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
85.7K views74 slides
Kaggle presentation by
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
13.4K views58 slides
Winning Kaggle 101: Introduction to Stacking by
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingTed Xiao
8.8K views21 slides
Feature Engineering by
Feature Engineering Feature Engineering
Feature Engineering odsc
9.8K views25 slides
Feature Engineering - Getting most out of data for predictive models - TDC 2017 by
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
10.1K views75 slides

More Related Content

What's hot

Kaggle and data science by
Kaggle and data scienceKaggle and data science
Kaggle and data scienceAkira Shibata
1.1K views29 slides
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기 by
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기NAVER Engineering
23.1K views82 slides
Variational Autoencoder by
Variational AutoencoderVariational Autoencoder
Variational AutoencoderMark Chang
7.7K views26 slides
Feature Engineering for ML - Dmitry Larko, H2O.ai by
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
2.4K views40 slides
Tips for data science competitions by
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
85.6K views32 slides
Decision Trees by
Decision TreesDecision Trees
Decision TreesStudent
12.4K views48 slides

What's hot(20)

Kaggle and data science by Akira Shibata
Kaggle and data scienceKaggle and data science
Kaggle and data science
Akira Shibata1.1K views
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기 by NAVER Engineering
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
NAVER Engineering23.1K views
Variational Autoencoder by Mark Chang
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
Mark Chang7.7K views
Feature Engineering for ML - Dmitry Larko, H2O.ai by Sri Ambati
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
Sri Ambati2.4K views
Tips for data science competitions by Owen Zhang
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
Owen Zhang85.6K views
Decision Trees by Student
Decision TreesDecision Trees
Decision Trees
Student12.4K views
Introduction to Generative Adversarial Networks (GANs) by Appsilon Data Science
Introduction to Generative Adversarial Networks (GANs)Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)
Appsilon Data Science13.3K views
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da... by Simplilearn
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Simplilearn6.1K views
Deep Learning for Recommender Systems RecSys2017 Tutorial by Alexandros Karatzoglou
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial
Feature Engineering - Getting most out of data for predictive models by Gabriel Moreira
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira36.3K views
Supervised learning and Unsupervised learning by Usama Fayyaz
Supervised learning and Unsupervised learning Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning
Usama Fayyaz5K views
Feature selection by Dong Guo
Feature selectionFeature selection
Feature selection
Dong Guo14.1K views
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning by akira-ai
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
akira-ai380 views
Machine Learning - Splitting Datasets by Andrew Ferlitsch
Machine Learning - Splitting DatasetsMachine Learning - Splitting Datasets
Machine Learning - Splitting Datasets
Andrew Ferlitsch4.6K views
Data preprocessing using Machine Learning by Gopal Sakarkar
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
Gopal Sakarkar2.7K views
Feature Engineering in Machine Learning by Knoldus Inc.
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
Knoldus Inc.1.9K views
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8 by Hakky St
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hakky St3.9K views
Dimensionality Reduction by mrizwan969
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
mrizwan96912.6K views

Viewers also liked

Matrix Factorisation (and Dimensionality Reduction) by
Matrix Factorisation (and Dimensionality Reduction)Matrix Factorisation (and Dimensionality Reduction)
Matrix Factorisation (and Dimensionality Reduction)HJ van Veen
5.1K views23 slides
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017 by
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017PAPIs.io
1.8K views25 slides
李俊良/Feature Engineering in Machine Learning by
李俊良/Feature Engineering in Machine Learning李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning台灣資料科學年會
10.3K views31 slides
CSPO、CSM研修に参加して by
CSPO、CSM研修に参加してCSPO、CSM研修に参加して
CSPO、CSM研修に参加してArata Fujimura
10.4K views74 slides
ユーザーストーリー駆動開発で行こう。 by
ユーザーストーリー駆動開発で行こう。ユーザーストーリー駆動開発で行こう。
ユーザーストーリー駆動開発で行こう。toshihiro ichitani
120.8K views66 slides
fabric8 ... and Docker, Kubernetes & OpenShift by
fabric8 ... and Docker, Kubernetes & OpenShiftfabric8 ... and Docker, Kubernetes & OpenShift
fabric8 ... and Docker, Kubernetes & OpenShiftroland.huss
9.6K views46 slides

Viewers also liked(20)

Matrix Factorisation (and Dimensionality Reduction) by HJ van Veen
Matrix Factorisation (and Dimensionality Reduction)Matrix Factorisation (and Dimensionality Reduction)
Matrix Factorisation (and Dimensionality Reduction)
HJ van Veen5.1K views
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017 by PAPIs.io
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017
PAPIs.io1.8K views
CSPO、CSM研修に参加して by Arata Fujimura
CSPO、CSM研修に参加してCSPO、CSM研修に参加して
CSPO、CSM研修に参加して
Arata Fujimura10.4K views
ユーザーストーリー駆動開発で行こう。 by toshihiro ichitani
ユーザーストーリー駆動開発で行こう。ユーザーストーリー駆動開発で行こう。
ユーザーストーリー駆動開発で行こう。
toshihiro ichitani120.8K views
fabric8 ... and Docker, Kubernetes & OpenShift by roland.huss
fabric8 ... and Docker, Kubernetes & OpenShiftfabric8 ... and Docker, Kubernetes & OpenShift
fabric8 ... and Docker, Kubernetes & OpenShift
roland.huss9.6K views
Docker Meetup - Melbourne 2015 - Kubernetes Deep Dive by Ken Thompson
Docker Meetup - Melbourne 2015 - Kubernetes Deep DiveDocker Meetup - Melbourne 2015 - Kubernetes Deep Dive
Docker Meetup - Melbourne 2015 - Kubernetes Deep Dive
Ken Thompson2K views
Cloud Computing Fundamental by Dony Riyanto
Cloud Computing FundamentalCloud Computing Fundamental
Cloud Computing Fundamental
Dony Riyanto1.8K views
Scalable Python with Docker, Kubernetes, OpenShift by Aarno Aukia
Scalable Python with Docker, Kubernetes, OpenShiftScalable Python with Docker, Kubernetes, OpenShift
Scalable Python with Docker, Kubernetes, OpenShift
Aarno Aukia2.6K views
Achieving Cost and Resource Efficiency through Docker, OpenShift and Kubernetes by Dean Delamont
Achieving Cost and Resource Efficiency through Docker, OpenShift and KubernetesAchieving Cost and Resource Efficiency through Docker, OpenShift and Kubernetes
Achieving Cost and Resource Efficiency through Docker, OpenShift and Kubernetes
Dean Delamont1.3K views
開発モデルの作り方(守破離の破) by Arata Fujimura
開発モデルの作り方(守破離の破)開発モデルの作り方(守破離の破)
開発モデルの作り方(守破離の破)
Arata Fujimura11.5K views
OpenShift Enterprise 3.1 vs kubernetes by Samuel Terburg
OpenShift Enterprise 3.1 vs kubernetesOpenShift Enterprise 3.1 vs kubernetes
OpenShift Enterprise 3.1 vs kubernetes
Samuel Terburg42.1K views
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd... by Kai Wähner
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Microservices, Containers, Docker and a Cloud-Native Architecture in the Midd...
Kai Wähner63.3K views
Prophet入門【Python編】Facebookの時系列予測ツール by hoxo_m
Prophet入門【Python編】Facebookの時系列予測ツールProphet入門【Python編】Facebookの時系列予測ツール
Prophet入門【Python編】Facebookの時系列予測ツール
hoxo_m64.6K views
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author by Vivian S. Zhang
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Vivian S. Zhang89.2K views
How to assess & hire Java developers accurately? by HackerEarth
How to assess & hire Java developers accurately?How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?
HackerEarth699 views
Tda presentation by HJ van Veen
Tda presentationTda presentation
Tda presentation
HJ van Veen2.9K views

Similar to Feature Engineering

30334823 my sql-cluster-performance-tuning-best-practices by
30334823 my sql-cluster-performance-tuning-best-practices30334823 my sql-cluster-performance-tuning-best-practices
30334823 my sql-cluster-performance-tuning-best-practicesDavid Dhavan
307 views40 slides
What’s eating python performance by
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
734 views47 slides
Machine Learning with ML.NET and Azure - Andy Cross by
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossAndrew Flatters
301 views120 slides
Boosting the Performance of your Rails Apps by
Boosting the Performance of your Rails AppsBoosting the Performance of your Rails Apps
Boosting the Performance of your Rails AppsMatt Kuklinski
13.8K views31 slides
PraveenBOUT++ by
PraveenBOUT++PraveenBOUT++
PraveenBOUT++Praveen Narayanan
101 views23 slides
Moving Toward Deep Learning Algorithms on HPCC Systems by
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsHPCC Systems
269 views80 slides

Similar to Feature Engineering(20)

30334823 my sql-cluster-performance-tuning-best-practices by David Dhavan
30334823 my sql-cluster-performance-tuning-best-practices30334823 my sql-cluster-performance-tuning-best-practices
30334823 my sql-cluster-performance-tuning-best-practices
David Dhavan307 views
What’s eating python performance by Piotr Przymus
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
Piotr Przymus734 views
Machine Learning with ML.NET and Azure - Andy Cross by Andrew Flatters
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
Andrew Flatters301 views
Boosting the Performance of your Rails Apps by Matt Kuklinski
Boosting the Performance of your Rails AppsBoosting the Performance of your Rails Apps
Boosting the Performance of your Rails Apps
Matt Kuklinski13.8K views
Moving Toward Deep Learning Algorithms on HPCC Systems by HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC Systems
HPCC Systems269 views
How to Achieve Scale with MongoDB by MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
MongoDB11.2K views
Validation and-design-in-a-small-team-environment by Obsidian Software
Validation and-design-in-a-small-team-environmentValidation and-design-in-a-small-team-environment
Validation and-design-in-a-small-team-environment
Obsidian Software531 views
Validation and Design in a Small Team Environment by DVClub
Validation and Design in a Small Team EnvironmentValidation and Design in a Small Team Environment
Validation and Design in a Small Team Environment
DVClub903 views
Machine learning for IoT - unpacking the blackbox by Ivo Andreev
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
Ivo Andreev21.6K views
A Production Quality Sketching Library for the Analysis of Big Data by Databricks
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
Databricks366 views
Oracle Query Optimizer - An Introduction by adryanbub
Oracle Query Optimizer - An IntroductionOracle Query Optimizer - An Introduction
Oracle Query Optimizer - An Introduction
adryanbub1.6K views
Scaling MongoDB by MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
MongoDB3K views
Cassandra Basics, Counters and Time Series Modeling by Vassilis Bekiaris
Cassandra Basics, Counters and Time Series ModelingCassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series Modeling
Vassilis Bekiaris7.7K views
Software Engineering Advice from Google's Jeff Dean for Big, Distributed Systems by adrianionel
Software Engineering Advice from Google's Jeff Dean for Big, Distributed SystemsSoftware Engineering Advice from Google's Jeff Dean for Big, Distributed Systems
Software Engineering Advice from Google's Jeff Dean for Big, Distributed Systems
adrianionel2.7K views
Ehsan parallel accelerator-dec2015 by Christian Peel
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
Christian Peel2.2K views
C# 101: Intro to Programming with C# by Hawkman Academy
C# 101: Intro to Programming with C#C# 101: Intro to Programming with C#
C# 101: Intro to Programming with C#
Hawkman Academy16.6K views

Recently uploaded

2_DVD_ASIC_Design_FLow.pdf by
2_DVD_ASIC_Design_FLow.pdf2_DVD_ASIC_Design_FLow.pdf
2_DVD_ASIC_Design_FLow.pdfUsha Mehta
14 views24 slides
LFA-NPG-Paper.pdf by
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfharinsrikanth
40 views13 slides
SNMPx by
SNMPxSNMPx
SNMPxAmatullahbutt
12 views12 slides
IWISS Catalog 2022 by
IWISS Catalog 2022IWISS Catalog 2022
IWISS Catalog 2022Iwiss Tools Co.,Ltd
24 views66 slides
What is Whirling Hygrometer.pdf by
What is Whirling Hygrometer.pdfWhat is Whirling Hygrometer.pdf
What is Whirling Hygrometer.pdfIIT KHARAGPUR
11 views3 slides
9_DVD_Dynamic_logic_circuits.pdf by
9_DVD_Dynamic_logic_circuits.pdf9_DVD_Dynamic_logic_circuits.pdf
9_DVD_Dynamic_logic_circuits.pdfUsha Mehta
21 views32 slides

Recently uploaded(20)

2_DVD_ASIC_Design_FLow.pdf by Usha Mehta
2_DVD_ASIC_Design_FLow.pdf2_DVD_ASIC_Design_FLow.pdf
2_DVD_ASIC_Design_FLow.pdf
Usha Mehta14 views
What is Whirling Hygrometer.pdf by IIT KHARAGPUR
What is Whirling Hygrometer.pdfWhat is Whirling Hygrometer.pdf
What is Whirling Hygrometer.pdf
IIT KHARAGPUR 11 views
9_DVD_Dynamic_logic_circuits.pdf by Usha Mehta
9_DVD_Dynamic_logic_circuits.pdf9_DVD_Dynamic_logic_circuits.pdf
9_DVD_Dynamic_logic_circuits.pdf
Usha Mehta21 views
_MAKRIADI-FOTEINI_diploma thesis.pptx by fotinimakriadi
_MAKRIADI-FOTEINI_diploma thesis.pptx_MAKRIADI-FOTEINI_diploma thesis.pptx
_MAKRIADI-FOTEINI_diploma thesis.pptx
fotinimakriadi6 views
Multi-objective distributed generation integration in radial distribution sy... by IJECEIAES
Multi-objective distributed generation integration in radial  distribution sy...Multi-objective distributed generation integration in radial  distribution sy...
Multi-objective distributed generation integration in radial distribution sy...
IJECEIAES15 views
NEW SUPPLIERS SUPPLIES (copie).pdf by georgesradjou
NEW SUPPLIERS SUPPLIES (copie).pdfNEW SUPPLIERS SUPPLIES (copie).pdf
NEW SUPPLIERS SUPPLIES (copie).pdf
georgesradjou7 views
7_DVD_Combinational_MOS_Logic_Circuits.pdf by Usha Mehta
7_DVD_Combinational_MOS_Logic_Circuits.pdf7_DVD_Combinational_MOS_Logic_Circuits.pdf
7_DVD_Combinational_MOS_Logic_Circuits.pdf
Usha Mehta50 views
CHI-SQUARE ( χ2) TESTS.pptx by ssusera597c5
CHI-SQUARE ( χ2) TESTS.pptxCHI-SQUARE ( χ2) TESTS.pptx
CHI-SQUARE ( χ2) TESTS.pptx
ssusera597c520 views
MSA Website Slideshow (16).pdf by msaucla
MSA Website Slideshow (16).pdfMSA Website Slideshow (16).pdf
MSA Website Slideshow (16).pdf
msaucla39 views
cloud computing-virtualization.pptx by RajaulKarim20
cloud computing-virtualization.pptxcloud computing-virtualization.pptx
cloud computing-virtualization.pptx
RajaulKarim2082 views
Machine Element II Course outline.pdf by odatadese1
Machine Element II Course outline.pdfMachine Element II Course outline.pdf
Machine Element II Course outline.pdf
odatadese16 views
Update 42 models(Diode/General ) in SPICE PARK(DEC2023) by Tsuyoshi Horigome
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
Informed search algorithms.pptx by Dr.Shweta
Informed search algorithms.pptxInformed search algorithms.pptx
Informed search algorithms.pptx
Dr.Shweta12 views

Feature Engineering

  • 1. FEATURE ENGINEERING HJ van Veen - Data Science - Nubank Brasil
  • 2. –Peter Norvig “More data beats clever algorithms, but better data beats more data.”
  • 3. Feature Engineering • Most creative aspect of Data Science. • Treat like any other creative endeavor, like writing a comedy show: • Hold brainstorming sessions • Create templates / formula’s • Check/revisit what worked before
  • 4. Categorical Features • Nearly always need some treatment • High cardinality can create very sparse data • Difficult to impute missing
  • 5. Onehot encoding • One-of-K encoding on an array of length K. • Basic method: Used with most linear algorithms • Dropping first column avoids collinearity • Sparse format is memory-friendly • Most current implementations don’t gracefully treat missing, unseen variables
  • 6. Onehot encoding Sample: ["BR"] country country=NL country=BR country=US ------- ---------- ---------- ---------- NL => [ 0, 1, 0] BR US Encoded dense: [0, 1, 0] Encoded sparse: 2:1
  • 7. Hash encoding • Does “OneHot-encoding” with arrays of a fixed length. • Avoids extremely sparse data • May introduce collisions • Can repeat with different hash functions and bag result for small bump in accuracy • Collisions usually degrade results, but may improve it. • Gracefully deals with new variables (eg: new user-agents)
  • 8. Hash encoding Sample: ["BR"] hash("BR") => `2` country hash1 hash2 hash3 hash4 hash5 ------- ----- ----- ----- ----- ----- NL => [ 0 1, 0 0, 0] BR US Encoded dense: [0, 1, 0, 0, 0] Encoded sparse: 2:1
  • 9. Label encoding • Give every categorical variable a unique numerical ID • Useful for non-linear tree-based algorithms • Does not increase dimensionality • Randomize the cat_var -> num_id mapping and retrain, average, for small bump in accuracy.
  • 10. Label encoding Sample: ["Queenstown"] city city ----------- ---- Cherbourg 1 Queenstown => 2 Southampton 3 Encoded: [2]
  • 11. Count encoding • Replace categorical variables with their count in the train set • Useful for both linear and non-linear algorithms • Can be sensitive to outliers • May add log-transform, works well with counts • Replace unseen variables with `1` • May give collisions: same encoding, different variables
  • 12. Count encoding Sample: ["A6GHBD78"] teacher_ID teacher_ID ---------- ---------- DEADB33F 4 A6GHBD78 3 DEADB33F 4 FCKGWRHQ => 1 DEADB33F 4 A6GHBD78 3 A6GHBD78 3 DEADB33F 4 encoded: [3]
  • 13. LabelCount encoding • Rank categorical variables by count in train set • Useful for both linear and non-linear algorithms • Not sensitive to outliers • Won’t give same encoding to different variables • Best of both worlds
  • 14. LabelCount encoding tld tld --- --- nl 3 nl 3 nl 3 nl => 3 de 2 de 2 fr 1 fr 1
  • 15. Target encoding • Encode categorical variables by their ratio of target (binary classification or regression) • Be careful to avoid overfit! • Form of stacking: single-variable model which outputs average target • Do in cross-validation manner • Add smoothing to avoid setting variable encodings to 0. • Add random noise to combat overfit • When applied properly: Best encoding for both linear and non-linear
  • 16. Target encoding role y role --------- - ---- manager | 1 0.5 engineer | 1 0.66 scientist | 1 => 1. manager | 0 0.5 engineer | 0 0.66 engineer | 1 0.66
  • 17. Category Embedding • Use a Neural Network to create dense embeddings from categorical variables. • Map categorical variables in a function approximation problem into Euclidean spaces • Faster model training. • Less memory overhead. • Can give better accuracy than 1-hot encoded. • https://arxiv.org/abs/1604.06737
  • 18. Category Embedding role role 3-D embedding --------- —————————————————- manager | [0.05, 0.10, 0.96] engineer | [0.72, 0.66, 0.17] scientist | [0.75, 0.62, 0.15] manager | [0.05, 0.10, 0.96] engineer | [0.72, 0.66, 0.17] engineer | [0.72, 0.66, 0.17]
  • 19. NaN encoding • Give NaN values an explicit encoding instead of ignoring • NaN-values can hold information • Be careful to avoid overfit! • Use only when NaN-values in train and test set are caused by the same, or when local validation proves it holds signal
  • 20. NaN encoding Sample = [NaN] UA UA=mobile UA=tablet UA=NaN ------- --------- --------- ------ mobile 0 0 1 tablet mobile => NaN mobile Encoded = [0, 0, 1]
  • 21. Polynomial encoding • Encode interactions between categorical variables • Linear algorithms without interactions can not solve the XOR problem • A polynomial kernel *can* solve XOR • Explodes the feature space: use FS, hashing and/or VW
  • 22. Polynomial encoding A B y A=1*B=1 A=0*B=1 A=1*B=0 A=0*B=0 y - - - ------- ------- ------- ------- - 1 1 | 1 1 0 0 0 | 1 0 1 | 0 => 0 1 0 0 | 0 1 0 | 0 0 0 1 0 | 0 0 0 | 1 0 0 0 1 | 1
  • 23. Expansion encoding • Create multiple categorical variables from a single variable • Some high cardinality features, like user-agents, hold far more information in them: • is_mobile? • is_latest_version? • Operation_system • Browser_build • Etc.
  • 24. Expansion encoding Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36 | v UA1 UA2 UA3 UA4 UA5 ------ ------------- ------- --- ------- Chrome 53.0.2785.143 Desktop Mac 10_10_4
  • 25. Consolidation encoding • Map different categorical variables to the same variable • Spelling errors, slightly different job descriptions, full names vs. abbreviations • Real data is messy, free text especially so
  • 26. Expansion encoding company_desc desc1 company_desc2 ------------------ ----- ------------- Shell Shell Gas station shel Shell Gas station SHELL Shell Gas station Shell Gasoline Shell Gas station BP => BP Gas station British Petr. BP Gas station B&P BP Gas station BP Gas Station BP Gas station bp BP Gas station Procter&Gamble P&G Manufacturer
  • 27. –Andrew Ng “Coming up with features is difficult, time- consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.”
  • 28. Numerical Features • Can be more readily fed into algorithms • Can constitute floats, counts, numbers • Easier to impute missing data
  • 29. Rounding • Round numerical variables • Form of lossy compression: retain most significant features of the data. • Sometimes too much precision is just noise • Rounded variables can be treated as categorical variables • Can apply log-transform before rounding
  • 30. Rounding age age1 age2 ------- ---- ---- 23.6671 23 2 23.8891 23 2 22.1261 => 22 2 19.5506 19 1 18.2114 18 1
  • 31. Binning • Put numerical variables into a bin and encode with bin-ID • Binning can be set pragmatically, by quantiles, evenly, or use models to find optimal bins • Can work gracefully with variables outside of ranges seen in the train set
  • 32. Binning risk_score rs[-inf,33] rs[33,66] rs[66,inf] ---------- ----------- --------- ---------- 15 1 0 0 77 0 0 1 78 => 0 0 1 55 0 1 0 42 0 1 0
  • 34. Scaling • Scale to numerical variables into a certain range • Standard (Z) Scaling • MinMax Scaling • Root scaling • Log scaling
  • 35. Imputation • Impute missing variables • Hardcoding can be combined with imputation • Mean: Very basic • Median: More robust to outliers • Ignoring: just postpones the problem • Using a model: Can expose algorithmic bias
  • 36. Imputation wage hours gender | y wage hours | gender_y ---- ----- ------ | - ---- ----- | -------- 1600 40 0 | 1 1600 40 | 0 2200 50 1 | 1 2200 50 | 1 1800 36 0 | 0 => 1800 36 | 0 2100 45 1 | 0 2100 45 | ? 2050 60 NaN | 0 2050 60 | ? 1650 36 0 | 1 1650 36 | ?
  • 37. Interactions • Specifically encodes the interactions between numerical variables • Try: Substraction, Addition, Multiplication, Divison • Use: Feature selection by statistical tests, or trained model feature importances • Ignore: Human intuition; weird interactions can give significant improvement!
  • 38. –Pedro Domingos “…some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
  • 39. Non-linear encoding for linear algo’s • Hardcode non-linearities to improve linear algorithms • Polynomial kernel • Leafcoding (random forest embeddings) • Genetic algorithms • Locally Linear Embedding, Spectral Embedding, t- SNE
  • 40. Row statistics • Create statistics on a row of data • Number of NaN’s, • Number of 0’s • Number of negative values • Mean, Max, Min, Skewness, etc.
  • 41. Xavier Conort “The algorithms we used are very standard for Kagglers. […] We spent most of our efforts in feature engineering. [...] We were also very careful to discard features likely to expose us to the risk of over-fitting our model.”
  • 42. Temporal Variables • Temporal variables, like dates, need better local validation schemes (like backtesting) • Easy to make mistakes here • Lots of opportunity for major improvements
  • 43. Projecting to a circle • Turn single features, like day_of_week, into two coordinates on a circle • Ensures that distance between max and min is the same as min and min +1. • Use for day_of_week, day_of_month, hour_of_day, etc.
  • 44. Trendlines • Instead of encoding: total spend, encode things like: Spend in last week, spend in last month, spend in last year. • Gives a trend to the algorithm: two customers with equal spend, can have wildly different behavior — one customer may be starting to spend more, while the other is starting to decline spending.
  • 45. Closeness to major events • Hardcode categorical features like: date_3_days_before_holidays:1 • Try: National holidays, major sport events, weekends, first Saturday of month, etc. • These factors can have major influence on spending behavior.
  • 46. Scott Locklin “feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.”
  • 47. Spatial Variables • Spatial variables are variables that encode a location in space • Examples include: GPS-coordinates, cities, countries, addresses
  • 48. Categorizing location • Kriging • K-means clustering • Raw latitude longitude • Convert cities to latitude longitude • Add zip codes to streetnames
  • 49. Closeness to hubs • Find closeness between a location to a major hub • Small towns inherit some of the culture/context of nearby big cities • Phone location can be mapped to nearby businesses and supermarkets
  • 50. Spatial fraudulent behavior • Location event data can be indicative of suspicious behavior • Impossible travel speed: Multiple simultaneous transactions in different countries • Spending in different town than home or shipping address • Never spending at the same location
  • 51. Shayne Miel “you have to turn your inputs into things the algorithm can understand”
  • 52. Exploration • Data exploration can find data health issues, outliers, noise, feature engineering ideas, feature cleaning ideas. • Can use: Console, Notebook, Pandas • Try simple stats: Min, max • Incorporate the target so find correlation between signal.
  • 53. Iteration / Debugging • Feature engineering is an iterative process: Make your pipelines suitable for fast iteration. • Use sub-linear debugging: Output intermediate information on the process, do spurious logging. • Use tools that allow for fast experimentation • More ideas will fail, than ideas will work
  • 54. Label Engineering • Can treat a label/target/dependent variable as a feature of the data and vice versa. • Log-transform: y -> log(y+1) | exp(y_pred) - 1 • Square-transform • Box-Cox transform • Create a score, to turn binary target in regression. • Train regressor to predict a feature not available in test set.
  • 55. Francois Chollet “Developing good models requires iterating many times on your initial ideas, up until the deadline; you can always improve your models further. Your final models will typically share little in common with the solutions you envisioned when first approaching the problem, because a-priori plans basically never survive confrontation with experimental reality.”
  • 56. Natural Language Processing • Can use the same ideas from categorical features. • Deep learning (automatic feature engineering) increasingly eating this field, but shallow learning with well-engineered features is still competitive. • High sparsity in data introduces you to “curse of dimensionality” • Many opportunities for feature engineering:
  • 57. Natural Language Processing • Lowercasing, • Removing non-alphanumeric, • Repairing, • Encoding punctuation marks, • Tokenizing, • Token-grams, • skipgrams, • char-grams, • Removing stopwords, • Removing rare words • and very common words, • Spelling Correction, • Chopping, • Stemming, • Lemmatization, • Document features, • Entitity Insertion & Extraction • Simplification, • Word2Vec and GloVe / Doc2Vec, • String Similarity, • Reading level, • Nearest Neighbors, • TF*IDF, • BayesSVM, Vectorization, LDA, LSA.
  • 58. Cleaning • Lowercasing: Make tokens independant of capitalisation: “I work at NASA” -> “i work at nasa”. • Unidecode: Convert accented characters to their ascii- counterparts: “Memórias Póstumas de Brás Cubas” -> “Memorias Postumas de Bras Cubas” • Removing non-alphanumeric: Clean text by removing anything not in [a-z] [A-Z] [0-9]. “Breaking! Amsterdam (2009)” -> “Breaking Amsterdam 2009” • Repairing: Fix encoding issues or trim intertoken spaces. “C a s a C a f é” -> “Casa Café”
  • 59. Tokenizing • Encode punctuation marks: Hardcode “!” and “?” as tokens. • Tokenize: Chop sentences up in word tokens. • N-Grams: Encode consecutive tokens as tokens: “I like the Beatles” -> [“I like”, “like the”, “the Beatles”] • Skip-grams: Encode consecutive tokens, but skip a few: “I like the Beatles” -> [“I the”, “like Beatles”] • Char-grams: Same as N-grams, but character level: “Beatles” - > [“Bea”, “eat”, “atl”, “tle”, “les”] • Affixes: Same as char-grams, but only the postfixes and prefixes
  • 60. Removing • Stopwords: Remove words/tokens that appear in stopword lists. • Rare words: Remove words that only appear few times in training set. • Common words: Remove extremely common words that may not be in a stopword list.
  • 61. Roots • Spelling correction: Change tokens to their correct spelling. • Chop: Take only the first n (8) characters of a word. • Stem: Reduce a word/token to its root. “cars” -> “car” • Lemmatize: Find semantic root “never be late” -> “never are late”
  • 62. Enrich • Document features: Count number of spaces, tabs, newlines, characters, tokens, etc. • Entity insertion: Add more general specifications to text “Microsoft releases Windows” -> “Microsoft (company) releases Windows (application)” • Parse Trees: Parse a sentence into logic form: “Alice hits Bill” -> Alice/Noun_subject hits/Verb Bill/Noun_object. • Reading level: Compute the reading level of a document.
  • 63. Similarities • Token similarity: Count number of tokens that appear in two texts. • Compression distance: Look if one text can be compressed better using another text. • Levenshtein/Hamming/Jaccard Distance: Check similarity between two strings, by looking at number of operations needed to transform one in the other. • Word2Vec / Glove: Check cosine similarity between two averaged vectors.
  • 64. TF-IDF • Term Frequency: Reduces bias to long documents. • Inverse Document Frequency: Reduces bias to common tokens. • TF-IDF: Use to identify most important tokens in a document, to remove unimportant tokens, or as a preprocessing step to dimensionality reduction.
  • 65. Dimensionality Reduction • PCA: Reduce text to 50 or 100-dimensional vector. • SVD: Reduce text to 50 or 100-dimensional vector. • LDA: TF-IDF followed by SVD. • LSA: Create topic vectors.
  • 66. External models • Sentiment Analyzers: Get a vector for negative or positive sentiment for any text. • Topic models: Use another dataset to create topic vectors for a new dataset.
  • 67. Hal Daume III “So many papers: feature engineering is hard and time consuming. instead here's 8 pages in which we have to design a new weird neural net to do the same thing”
  • 68. Neural Networks & Deep Learning • Neural networks claim end-to-end automatic feature engineering. • Feature engineering dying field? • No! Moves the focus to architecture engineering • And despite promise: computer vision uses features like: HOG, SIFT, whitening, perturbation, image pyramids, rotation, z-scaling, log-scaling, frame- grams, external semantic data, etc.
  • 69. Leakage / Golden Features • Feature engineering can help exploit leakage. • Reverse engineer: • Reverse MD5 hash with rainbow tables. • Reverse TF-IDF back to Term Frequency • Encode order of samples data set. • Encode file creation dates. • Rule mining: • Find simple rules (and encode these) to help your model.
  • 70. Case Study: Quora Duplicate Questions Dataset • Classify ~440.000 question pairs as duplicate or non-duplicate. • Benchmark 1: 0.79 accuracy (Stacked Siamese Nets) • Benchmark 2: 0.82 accuracy (Neural Bag of Words) • Benchmark 3: 0.88 accuracy (Bilateral Multi- Perspective Matching)
  • 71. Case Study: Quora Duplicate Questions Dataset • First attempt: Simple bag of words with logistic regression. • 0.75 accuracy • Second attempt: Polynomial feature interactions between the tokens in both questions. • 0.80 accuracy
  • 72. Case Study: Quora Duplicate Questions Dataset • Third attempt: Use stemming with SnowballStemmer from NLTK. • 0.805 accuracy • Fourth attempt: Add 2-grams • 0.81 accuracy
  • 73. Case Study: Quora Duplicate Questions Dataset • Fifth attempt: Add manual features • Normalized difference in length between question pairs • Normalized compression distance between question pairs. • Cosine distance between averaged word2vec vectors for the question pairs. • Chargram co-occurence between question pairs. • Token count of words: “which, what, where” • 0.827 Accuracy
  • 74. Case Study: Quora Duplicate Questions Dataset • Can you think of any more features to engineer? • External & pre-trained models? • Search engine models? • Logic based models?
  • 75. Robert J. Bennett This one went unusually smoothly. When I finished it, I remarked to a friend that I felt like an engineer who had designed a machine and then sat back and realized it did everything I’d set out to do. Which made him say, quite emphatically, “No engineer has ever felt this.” Questions?
  • 76. Resources & Further Reading • Kaggle forums & kernels: Far0n, KazAnova, Fchollet, Abhishek, Gilberto Titericz, Leustagos, Owen Zhang, Gert Jacobusse … • Introduction: http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how- to-get-good-at-it/ • Books: • Mastering Feature Engineering (Alice Zheng), • Feature Extraction (Isabelle Guyon et al.) • Blogs: • https://smerity.com/articles/2016/architectures_are_the_new_feature_engineering.html • http://hunch.net/~jl/projects/hash_reps/ • https://blogs.technet.microsoft.com/machinelearning/2014/09/24/online-learning-and-sub-linear-debugging/ • http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/ • http://blog.kaggle.com/2016/08/24/avito-duplicate-ads-detection-winners-interview-1st-place-team-devil-team- stanislav-dmitrii/ • http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost • Data: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs • Software: https://github.com/trevorstephens/gplearn