Feature Engineering
#VSSML16
September 2016
#VSSML16 Feature Engineering September 2016 1 / 50
Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
Should I Drive?
• Building a predictive model to
recommend driving (or not)
• Have data from the beginning
and ending of trip, and
whether or not there are paved
roads between the two points
• Tracked human-made
decisions for several hundred
trips
A Simple Function
• Create a predictive model to
emulate a simple Boolean
function
• Features are Boolean
variables
• Objective is the inputs XOR’d
together (true if the number of
ones is odd and false
otherwise)
What?
• So we should just give up,
yes?
• All the model knows about are
the features we provide for it
• In the cases above the
features are broken
Broken?
• The features we’ve provided aren’t useful for prediction
Latitude and longitude don’t correlate especially well with drivability
Any single feature in the XOR problem doesn’t predict the outcome
(and, in fact, changing any one feature changes the class)
In both cases, the same feature value has different semantics in the
presence of other features
• Machine learning algorithms, in general, rely on some statistically
detectable relationship between the features and the class
• The nature of the relationship is particular to the algorithm
How to Confuse a Machine Learning Algorithm
• Remember that machine learning algorithms are searching for a
classifier in a particular hypothesis space
• Decision Trees
Thresholds on individual features
Are you able to set a meaningful threshold on any of your input
features?
• Logistic Regression
Weighted combinations of features
Can a good model be made on a weighted average of your input
features?
Feature Engineering
• Feature engineering: The
process of transforming raw
input data into
machine-learning-ready data
• Alternatively: Using your
existing features and some
math to make new features
that models will “like”
• Not covered here, but
important: Going out and
getting better information
• “Applied machine learning” is
domain-specific feature
engineering and evaluation!
Some Good Times to Engineer Features
• When the relationship between the feature and the objective is
mathematically unsatisfying
• When the relationship of a function of two or more features is far
more relevant than the original features
• When there is missing data
• When the data is time-series, especially when the previous time
period’s objective is known
• When the data can’t be used for machine learning in the obvious
way (e.g., timestamps, text data)
Rule of thumb: Every bit of work you do in feature engineering is a bit
that the model doesn’t have to figure out.
Aside: Flatline
• Feature engineering is the most important topic in real-world
machine learning
• BigML has its own domain specific language for it, Flatline, and
we’ll use it for our examples here
• Two things to note right off the bat:
Flatline uses lisp-like “prefix notation”
You get the value for a given feature using a function f
So, to create a new column in your dataset with the sum of
feature1 and feature2:
(+ (f "feature1") (f "feature2"))
Statistical Aggregations
• Many times you have a bunch of features that all “mean” the same
thing
Pixels in the wavelet transform of an image
Did or did not make a purchase for day n − 1 to day n − 30
• Easiest thing is sum, average, or count (especially with sparse
data)
• The all and all-but field selectors are helpful here:
(/ (+ (all-but "PurchasesM-0")) (count (all-but "PurchasesM-0")))
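The same aggregation can be sketched in plain Python; the "PurchasesM-*" column names are hypothetical stand-ins for the slide's example:

```python
# Average the historical purchase flags, excluding the current month's
# column ("PurchasesM-0") -- exactly what the Flatline expression does.
row = {"PurchasesM-0": 1, "PurchasesM-1": 0, "PurchasesM-2": 1, "PurchasesM-3": 1}

history = [v for k, v in row.items() if k != "PurchasesM-0"]
avg_purchases = sum(history) / len(history)
```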
Better Categories
• Some categories are helped by collapsing them
Categories with elaborate hierarchies (occupation)
“Numeric” categories with too many levels (good, very good,
amazingly good)
Any category with a natural grouping (country, US state)
• Group categories with cond:
(cond (= "GRUNT" (f "job")) "Worker"
(> (occurrences (f "job") "Chief") 0) "Fancy Person"
"Everyone Else")
• Consider converting them to a numeric if they are ordinal
(cond (= (f "test1") "poor") 0
(= (f "test1") "fair") 1
(= (f "test1") "good") 2
(= (f "test1") "excellent") 3)
Binning or Discretization
• You can also turn a numeric variable into a categorical one
• Main idea is that you make bins and put the values into each one
(e.g., low, middle, high)
• Good to give the model information about potential noise (say,
body temperature)
• There are whole bunch of ways to do it (in the interface):
Quartiles
Deciles
Any generic percentiles
• Note: This includes the objective itself!
Turns a regression problem into a classification problem
Might turn a hard problem into an easy one
Might be more what you actually care about
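A minimal quartile-binning sketch (nearest-rank percentiles on made-up body-temperature readings, not BigML's exact discretization):

```python
# Quantile binning: turn a numeric column into categorical levels
# using the quartiles of the observed values as bin edges.
values = [97.0, 97.5, 98.2, 98.6, 98.7, 99.1, 100.4, 101.2]

def quartile(sorted_vals, q):
    # simple nearest-rank percentile
    idx = min(int(q * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

s = sorted(values)
q1, q2, q3 = quartile(s, 0.25), quartile(s, 0.5), quartile(s, 0.75)

def bin_value(x):
    if x < q1: return "low"
    if x < q2: return "mid-low"
    if x < q3: return "mid-high"
    return "high"

bins = [bin_value(x) for x in values]
```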
Linearization
• Not important for decision trees (transformations that preserve
ordering have no effect)
• Can be important for logistic regression and clustering
• Common and simple cases are exp and log
(log (f "test"))
• Use cases
Monetary amounts (salaries, profits)
Medical tests
Various engagement metrics (e.g., website activity)
In general, hockey stick distributions
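The log transform in Python, on hypothetical salary data; `log1p` is used so a zero survives the transform:

```python
import math

# Log-transform a hockey-stick-distributed feature (e.g., salaries) so
# a linear model sees a more symmetric spread of values.
salaries = [0, 30_000, 45_000, 120_000, 1_500_000]
log_salaries = [math.log1p(s) for s in salaries]
```

Note that the transform preserves ordering (so trees are unaffected) while compressing the long right tail.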
Missing Data: Why?
• Occasionally, a feature value might be missing from one or more
instances in your data
• This could be for a number of diverse reasons:
Random noise (corruption, formatting)
Feature is not computable (mathematical reasons)
Collection errors (network errors, systems down)
Missing for a reason (test not performed, value doesn’t apply)
• Key question: Does the missing value have semantics?
How Machine Learning Algorithms Treat Missing Data
• The standard treatment of
missing values by decision
trees is that they’re just due to
random noise (unless you
choose the BigML “or missing”
splits. Plug: This isn’t available
in a lot of other packages.)
They’re essentially “ignored”
during tree construction (bit
of an oversimplification)
This means that features
that have missing values are
less likely to be chosen for a
split than those that aren’t
• Can we “repair” the features?
Missing Value Replacement
• Simplest thing to do is just replace the missing value with a
common thing
Mean or median - For symmetric distributions
Mode - Especially for features where the “default” value is incredibly
common (e.g., word counts, medical tests)
• Such a common operation that it’s built into the interface
• Also available in Flatline:
(if (missing? "test1") (mean "test1") (f "test1"))
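The same mean replacement, sketched in Python with `None` standing in for a missing value:

```python
# Mean replacement for a numeric feature, mirroring
# (if (missing? "test1") (mean "test1") (f "test1"))
test1 = [4.0, None, 6.0, None, 8.0]

observed = [v for v in test1 if v is not None]
mean = sum(observed) / len(observed)
filled = [mean if v is None else v for v in test1]
```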
Missing Value Induction
• But we can do better than the mean, can’t we? (Spoiler: Yes)
• If only we had an algorithm that could predict a value given the
other feature values for the same instance HEY WAIT A MINUTE
• Train a model to predict your missing values
Training set is all points with the value non-missing
Predict for points that have the training value missing
Remember not to use your objective as part of the missing value
predictor
• Some good news: You probably don’t know or care what your
performance is!
• If you’re modeling with a technique that’s robust to missing values,
you can model every column without getting into a “cycle”
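A toy sketch of the idea with a single predictor and made-up data: fit on the rows where the value is present, predict for the row where it is missing. (Real use would train a full model on all the other features.)

```python
# Impute a missing feature by predicting it from another feature:
# fit y = a*x + b by least squares on rows with y present, fill the gap.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.1), (4.0, None), (5.0, 9.9)]

known = [(x, y) for x, y in rows if y is not None]
n = len(known)
mx = sum(x for x, _ in known) / n
my = sum(y for _, y in known) / n
a = sum((x - mx) * (y - my) for x, y in known) / sum((x - mx) ** 2 for x, _ in known)
b = my - a * mx

imputed = [(x, a * x + b if y is None else y) for x, y in rows]
```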
Constructing Features From Missing Data
• Maybe a more interesting thing to do is to use missing data as a
feature
• Does your missing value have semantics?
Feature is incomputable?
Presence/absence is more telling than actual value
• Easy to make a binary feature
• You could also build a quasi-numeric, multilevel categorical
feature with Flatline’s cond operator:
(cond (missing? "test1") "not performed"
(< (f "test1") 10) "low value"
(< (f "test1") 20) "medium value"
"high value")
Time-series Prediction
• Occasionally a time series
prediction problem comes at
you as a 1-d prediction
problem
• The objective: Predict the
value of the sequence given
history.
• But . . . there aren’t any
features!
Some Very Simple Time-series Data
• Closing Prices for the S&P 500
• A useless objective!
• No features!
• What are we going to do?
(Spoiler: Either drink away our
sorrows or FEATURE
ENGINEERING)
Price
2019.32
2032.43
2015.11
2043.93
2060.50
2085.38
2092.93
A Better Objective: Percent Change
• Going to be really difficult to predict actual closing price. Why?
Price gets larger over long time periods
If we train on historical data, the future price will be out of range
• Predicting the percent change from the previous close is a more
stationary and more relevant objective
• In Flatline, we can get the previous field value by passing an index
to the f function:
(/ (- (f "price") (f "price" -1)) (f "price" -1))
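The Python analogue of that percent-change expression, using the sample closing prices from the slide:

```python
# Percent change from the previous close:
# (price[i] - price[i-1]) / price[i-1]
prices = [2019.32, 2032.43, 2015.11, 2043.93]

pct_change = [
    (prices[i] - prices[i - 1]) / prices[i - 1]
    for i in range(1, len(prices))
]
```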
Features: Delta from Previous Day (Week, Month, . . .)
• Percent change over the last n days
• Remember these are features so don’t include the objective day -
you won’t know it!
(/ (- (f "price" -1) (f "price" -10)) (f "price" -10))
• Note that this could be anything, and exactly what it should be is
domain-specific
Features: Above/Below Moving Average
• The avg-window function makes it easy to compute a moving
average:
(avg-window "price" -50 -1)
• How far are we off the moving average?
(let (ma50 (avg-window "price" -50 -1))
(/ (- (f "price" -1) ma50) ma50))
Features: Recent Volatility
• Start with the squared deviations from the mean of a window:
(let (win-mean (avg-window "price" -10 -1))
(map (square (- _ win-mean)) (window "price" -10 -1)))
• With that, it’s easy to get the standard deviation:
(let (win-mean (avg-window "price" -10 -1)
sq-errs (map (square (- _ win-mean)) (window "price" -10 -1)))
(sqrt (/ (+ sq-errs) (- (count sq-errs) 1))))
• This is a reasonably nice measure of volatility of the objective over
the last n time periods.
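Both window features sketched in Python on the slide's sample prices (window of n = 5 closes, excluding the current day so the objective never leaks in):

```python
# Moving average, windowed sample standard deviation (volatility),
# and distance off the moving average -- over the previous n closes.
prices = [2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93]
n = 5

window = prices[-n - 1:-1]           # the n closes before the last one
ma = sum(window) / n
sq_errs = [(p - ma) ** 2 for p in window]
volatility = (sum(sq_errs) / (n - 1)) ** 0.5
pct_off_ma = (prices[-2] - ma) / ma  # how far yesterday sat off its MA
```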
Go Nuts!
• Not hard to imagine all sorts of interesting features:
Moving average crosses
Breaking out of a range
All with different time parameters
• One of the difficulties of feature engineering is dealing with this
exponential explosion
• Makes it spectacularly easy to keep wasting effort (or losing
money)
Some Caveats
• The regularity in time of the
points has to match your
training data
• You have to keep track of past
points to compute your
windows
• Really easy to get information
leakage by including your
objective in a window
computation (and can be very
hard to detect)!
• Did I mention how awful
information leakage can be
here?
• WHAT ABOUT
INFORMATION LEAKAGE
A Nearly Useless Datatype
• There’s no easy way to include
timestamps in our models
(really just a formatted text
field)
• What about epoch time?
Usually not what we want.
Weather forecasting?
Activity prediction?
Energy usage?
• A date time is really a
collection of features
An Opportunity for Automatic Feature Engineering
• Timestamps are usually found in a fairly small (okay, massive)
number of standard formats
• Once parsed into epoch time, we can automatically extract a
bunch of features:
Date features - month, day, year, day of week
Time features - hour, minute, second, millisecond
• We do this “for free” in BigML
• You can also specify a specific format in Flatline
Useful Features May Be Buried Even More Deeply
• Important to remember that the computer doesn’t have the
information about time that you do
• Example - Working Hours
Need to know if it’s between, say, 9:00 and 17:00
Also need to know if it’s Saturday or Sunday
(let (hour (f "SomeDay.hour")
day (f "SomeDay.day-of-week"))
(and (<= 9 hour 17) (< day 6)))
• Example - Is Daylight
Need to know hour of day
Also need to know day of year
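The working-hours feature in Python, with hypothetical timestamps; note Python's `weekday()` is 0-based Monday through Sunday, unlike the 1-based day-of-week in the Flatline snippet:

```python
from datetime import datetime

# Derive a boolean "working hours" feature from a raw timestamp:
# hour between 9:00 and 17:00 on a weekday (Mon=0 .. Fri=4).
def is_working_hours(ts: datetime) -> bool:
    return 9 <= ts.hour < 17 and ts.weekday() < 5

wed_morning = is_working_hours(datetime(2016, 9, 14, 10, 30))  # weekday, in hours
sat_morning = is_working_hours(datetime(2016, 9, 17, 10, 30))  # weekend
wed_evening = is_working_hours(datetime(2016, 9, 14, 20, 0))   # after hours
```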
Go Nuts!
• Date Features
endOfMonth? - Important feature of lots of clerical work
nationalHoliday? - What it says on the box
duringWorldCup? - Certain behaviors (e.g., TV watching) might be
different during this time
• Time Features
isRushHour? - Between 7 and 9am on a weekday
mightBeSleeping? - From midnight to 6am
mightBeDrinking? - Weekend evenings (or Wednesday at 1am, if
that’s your thing)
• There are a ton of these things that are spectacularly domain
dependent (think contract rolls in futures trading)
Bag of Words
• The standard way that BigML processes text is to create one
feature for each word found for any instance in the text field.
• This is the so-called “bag of words” approach
• Called this because all notion of sequence goes away after
processing
• Any notion of correlation between words also disappears, as the
features are treated as independent
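A minimal bag-of-words sketch on two toy documents: one count feature per vocabulary word, with all sequence information discarded:

```python
from collections import Counter

# Bag of words: each document becomes a vector of per-word counts
# over a shared vocabulary; word order plays no role.
docs = ["the cat sat on the mat", "the dog sat"]

vocab = sorted({w for d in docs for w in d.split()})
features = [[Counter(d.split())[w] for w in vocab] for d in docs]
```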
Tokenization
• Tokenization seems trivial
Except if you have numbers or special characters in the tokens
What about hyphens? Apostrophes?
• Do we want to do n-grams?
• Keep only tokens that occur a certain number of times (not too rare,
not too frequent)
• Note that this is more difficult with languages that don’t have clear
word boundaries
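A regex tokenizer plus bigrams as a sketch; keeping hyphens and apostrophes inside tokens is one of the judgment calls mentioned above:

```python
import re

# Tokenize on alphanumeric runs, allowing internal hyphens/apostrophes,
# then form bigrams (2-grams) from the token stream.
text = "State-of-the-art models don't tokenize themselves."

tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())
bigrams = list(zip(tokens, tokens[1:]))
```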
Word Stemming
• Now we have a list of tokens
• But sometimes we get “forms” of the same term
playing, plays, play
confirm, confirmed, confirmation
• We can use “stemming” to map these different forms back to the
same root
• Most western languages have a reasonable set of rules
Other Textual Features
• Example: Length of text
(length (f "textFeature"))
• Contains certain strings
• Dollar amounts? Dates? Salutations? Please and Thank you?
• Flatline has full regular expression capabilities
Latent Dirichlet Allocation
• Learn word distributions for topics
• Infer topic scores for each document
• Use the topic scores as features to a model
Feature Construction as Projection
• Feature construction means increasing the space in which
learning happens
• Another set of techniques typically replaces the feature space
• Often these techniques are called dimensionality reduction, and
the models that are learned are a new basis for that data.
• Why would you do this?
New, possibly unrelated hypothesis space
Speed
Better visualization
Principal Component Analysis
• Find the axis that preserves the maximum amount of variance
from the data
• Find the axis, orthogonal to the first, that preserves the next
largest amount of variance, and so on
• In spite of this description, this isn’t an iterative algorithm (it can be
solved with a matrix decomposition)
• Projecting the data into the new space is accomplished with a
matrix multiplication
• Resulting features are a linear combination of the old features
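The matrix-decomposition route sketched with NumPy (toy 2-d data): center, take the SVD, and project; the singular values give the variance explained by each new axis:

```python
import numpy as np

# PCA via SVD: no iteration needed, and projection is a single
# matrix multiplication.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                 # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt.T                   # new features: linear combos of old
var_explained = S**2 / np.sum(S**2)
```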
Distance to Cluster Centroids
• Do a k-Means clustering
• Compute the distances from each point to each cluster
centroid
• Ta-da! k new features
• Lots of variations on this theme:
Normalized / Unnormalized, and by what?
Average the class distributions of the resulting clusters
Take the number of points / spread of each cluster into account
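The core computation sketched with NumPy; the centroids here are given directly, standing in for a fitted k-means clustering:

```python
import numpy as np

# k new features per point: Euclidean distance to each of k centroids.
points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Broadcast points against centroids, then norm over the coordinate axis.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
```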
Stacked Generalization: Classifiers as Features
• Idea: Use model scores as input to a “meta-classifier”
• Algorithm:
Split the training data into “base” and “meta” subsets
Learn several different classifiers on the “base” subset
Compute predictions for the “meta” subset with the “base”
classifiers
Use the scores on the “meta” subset as features for a classifier
learned on that subset
• Meta-classifier learns when the predictions of each of the “base”
classifiers are to be preferred.
• Pop quiz: Why do we need to split the data?
Caveats (Again)
• There’s obviously a lot of representational power here
• Surprising sometimes how little it helps (remember, the upstream
algorithms are already trying to do this)
• Really easy to get bogged down in details
• Some thoughts that might help you avoid that:
Something really good is likely to be good with little effort
Think of a decaying exponential where the vertical axis is “chance
of dramatic improvement” and the horizontal one is “amount of
tweaking”
Have faith in both your own intelligence and creativity
If you miss something really important, you’ll probably come back to
it
• Use Flatline as inspiration; it was designed with data scientists in mind
It’s over!
Questions
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
 
Machine Learning Goes Production
Machine Learning Goes ProductionMachine Learning Goes Production
Machine Learning Goes Production
 
Advanced WhizzML Workflows
Advanced WhizzML WorkflowsAdvanced WhizzML Workflows
Advanced WhizzML Workflows
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to Z
 
VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2
 
DutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingDutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision Making
 
Booting into functional programming
Booting into functional programmingBooting into functional programming
Booting into functional programming
 

More from BigML, Inc

Digital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingDigital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingBigML, Inc
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationBigML, Inc
 
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceDutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceBigML, Inc
 
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesDutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesBigML, Inc
 
DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector BigML, Inc
 
DutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionDutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionBigML, Inc
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLBigML, Inc
 
DutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLDutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLBigML, Inc
 
DutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyDutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyBigML, Inc
 
DutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorDutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorBigML, Inc
 
DutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsDutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsBigML, Inc
 
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsDutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsBigML, Inc
 
DutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleDutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleBigML, Inc
 
DutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIDutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIBigML, Inc
 
Democratizing Object Detection
Democratizing Object DetectionDemocratizing Object Detection
Democratizing Object DetectionBigML, Inc
 
BigML Release: Image Processing
BigML Release: Image ProcessingBigML Release: Image Processing
BigML Release: Image ProcessingBigML, Inc
 
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureMachine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureBigML, Inc
 
Machine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorMachine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorBigML, Inc
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotBigML, Inc
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...BigML, Inc
 

More from BigML, Inc (20)

Digital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingDigital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in Manufacturing
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - Automation
 
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceDutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML Compliance
 
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesDutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective Anomalies
 
DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector
 
DutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionDutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly Detection
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
DutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLDutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End ML
 
DutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyDutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven Company
 
DutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorDutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal Sector
 
DutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsDutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe Stadiums
 
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsDutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
 
DutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleDutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at Scale
 
DutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIDutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AI
 
Democratizing Object Detection
Democratizing Object DetectionDemocratizing Object Detection
Democratizing Object Detection
 
BigML Release: Image Processing
BigML Release: Image ProcessingBigML Release: Image Processing
BigML Release: Image Processing
 
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureMachine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
 
Machine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorMachine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail Sector
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
 

Recently uploaded

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 

Recently uploaded (20)

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 

VSSML16 L6. Feature Engineering

detectable relationship between the features and the class
• The nature of the relationship is particular to the algorithm
#VSSML16 Feature Engineering September 2016 8 / 50
How to Confuse a Machine Learning Algorithm
• Remember that machine learning algorithms are searching for a classifier in a particular hypothesis space
• Decision Trees
Thresholds on individual features
Are you able to set a meaningful threshold on any of your input features?
• Logistic Regression
Weighted combinations of features
Can a good model be made from a weighted combination of your input features?
#VSSML16 Feature Engineering September 2016 9 / 50
Feature Engineering
• Feature engineering: the process of transforming raw input data into machine-learning-ready data
• Alternatively: using your existing features and some math to make new features that models will “like”
• Not covered here, but important: going out and getting better information
• “Applied machine learning” is domain-specific feature engineering and evaluation!
#VSSML16 Feature Engineering September 2016 10 / 50
Some Good Times to Engineer Features
• When the relationship between the feature and the objective is mathematically unsatisfying
• When a function of two or more features is far more relevant than the original features
• When there is missing data
• When the data is time-series, especially when the previous time period’s objective is known
• When the data can’t be used for machine learning in the obvious way (e.g., timestamps, text data)
Rule of thumb: every bit of work you do in feature engineering is a bit that the model doesn’t have to figure out.
#VSSML16 Feature Engineering September 2016 11 / 50
Aside: Flatline
• Feature engineering is the most important topic in real-world machine learning
• BigML has its own domain-specific language for it, Flatline, and we’ll use it for our examples here
• Two things to note right off the bat:
Flatline uses lisp-like “prefix notation”
You get the value for a given feature using a function f
• So, to create a new column in your dataset with the sum of feature1 and feature2:
(+ (f "feature1") (f "feature2"))
#VSSML16 Feature Engineering September 2016 12 / 50
Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
#VSSML16 Feature Engineering September 2016 13 / 50
Statistical Aggregations
• Many times you have a bunch of features that all “mean” the same thing
Pixels in the wavelet transform of an image
Did or did not make a purchase for day n − 1 to day n − 30
• Easiest thing is sum, average, or count (especially with sparse data)
• The all and all-but field selectors are helpful here:
(/ (+ (all-but "PurchasesM-0")) (count (all-but "PurchasesM-0")))
#VSSML16 Feature Engineering September 2016 14 / 50
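Flatline is BigML’s DSL, so as an illustrative analogue (not the slides’ code), the same row-wise aggregation can be sketched in plain Python; the `PurchasesM-*` field names are hypothetical stand-ins for a row of monthly purchase counts.

```python
# Mean of every "purchases" field except the current month, mirroring the
# Flatline (/ (+ (all-but ...)) (count (all-but ...))) idiom.
def mean_of_all_but(row, excluded):
    values = [v for k, v in row.items() if k != excluded]
    return sum(values) / len(values)

row = {"PurchasesM-0": 5, "PurchasesM-1": 2, "PurchasesM-2": 4, "PurchasesM-3": 0}
avg = mean_of_all_but(row, "PurchasesM-0")  # (2 + 4 + 0) / 3 = 2.0
```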
Better Categories
• Some categories are helped by collapsing them
Categories with elaborate hierarchies (occupation)
“Numeric” categories with too many levels (good, very good, amazingly good)
Any category with a natural grouping (country, US state)
• Group categories with cond:
(cond (= "GRUNT" (f "job")) "Worker"
(> (occurrences (f "job") "Chief") 0) "Fancy Person"
"Everyone Else")
• Consider converting them to a numeric if they are ordinal:
(cond (= (f "test1") "poor") 0
(= (f "test1") "fair") 1
(= (f "test1") "good") 2
(= (f "test1") "excellent") 3)
#VSSML16 Feature Engineering September 2016 15 / 50
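The two cond expressions above translate directly into ordinary conditionals; a minimal Python sketch of both (the job levels and test levels are the slide’s examples, not a real dataset):

```python
# Collapse a high-cardinality "job" field into three groups, and encode an
# ordinal "test1" category as a number, mirroring the Flatline conds above.
def collapse_job(job):
    if job == "GRUNT":
        return "Worker"
    if "Chief" in job:  # stands in for (occurrences (f "job") "Chief")
        return "Fancy Person"
    return "Everyone Else"

ORDINAL = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}

def encode_test(level):
    return ORDINAL[level]
```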
Binning or Discretization
• You can also turn a numeric variable into a categorical one
• Main idea is that you make bins and put the values into each one (e.g., low, middle, high)
• Good for giving the model information about potential noise (say, body temperature)
• There are a whole bunch of ways to do it (in the interface):
Quartiles
Deciles
Any generic percentiles
• Note: this includes the objective itself!
Turns a regression problem into a classification problem
Might turn a hard problem into an easy one
Might be more what you actually care about
#VSSML16 Feature Engineering September 2016 16 / 50
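A minimal, dependency-free sketch of percentile binning (a hand-rolled nearest-rank percentile; the body-temperature numbers are invented for illustration):

```python
# Discretize a numeric column into low/middle/high at its 25th and 75th
# percentiles, one of the quartile-style binnings the slide mentions.
def percentile(values, p):
    ordered = sorted(values)
    k = round(p / 100 * (len(ordered) - 1))  # nearest-rank index
    return ordered[k]

def bin_value(x, low, high):
    if x < low:
        return "low"
    if x < high:
        return "middle"
    return "high"

temps = [97.9, 98.2, 98.4, 98.6, 98.7, 99.1, 101.3]
low, high = percentile(temps, 25), percentile(temps, 75)
labels = [bin_value(t, low, high) for t in temps]
```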
Linearization
• Not important for decision trees (transformations that preserve ordering have no effect)
• Can be important for logistic regression and clustering
• Common and simple cases are exp and log:
(log (f "test"))
• Use cases:
Monetary amounts (salaries, profits)
Medical tests
Various engagement metrics (e.g., website activity)
In general, hockey-stick distributions
#VSSML16 Feature Engineering September 2016 17 / 50
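As a quick Python analogue of (log (f "test")): log-transforming a hockey-stick feature compresses the long right tail while preserving order (the salary figures are made up for illustration).

```python
import math

# A heavily right-skewed feature (e.g., salary) and its log transform.
# The log keeps ordering (so trees are unaffected) but pulls the outlier
# in, which is what linear models and clustering benefit from.
salaries = [30_000, 45_000, 60_000, 120_000, 1_000_000]
log_salaries = [math.log(s) for s in salaries]
```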
Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
#VSSML16 Feature Engineering September 2016 18 / 50
Missing Data: Why?
• Occasionally, a feature value might be missing from one or more instances in your data
• This could be for a number of diverse reasons:
Random noise (corruption, formatting)
Feature is not computable (mathematical reasons)
Collection errors (network errors, systems down)
Missing for a reason (test not performed, value doesn’t apply)
• Key question: does the missing value have semantics?
#VSSML16 Feature Engineering September 2016 19 / 50
How Machine Learning Algorithms Treat Missing Data
• The standard treatment of missing values by decision trees is that they’re just due to random noise (unless you choose the BigML “or missing” splits. Plug: this isn’t available in a lot of other packages.)
They’re essentially “ignored” during tree construction (a bit of an oversimplification)
This means that features with missing values are less likely to be chosen for a split than those without
• Can we “repair” the features?
#VSSML16 Feature Engineering September 2016 20 / 50
Missing Value Replacement
• Simplest thing to do is just replace the missing value with a common thing
Mean or median, for symmetric distributions
Mode, especially for features where the “default” value is incredibly common (e.g., word counts, medical tests)
• Such a common operation that it’s built into the interface
• Also available in Flatline:
(if (missing? "test1") (mean "test1") (f "test1"))
#VSSML16 Feature Engineering September 2016 21 / 50
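The same mean replacement, sketched in Python with None standing in for a missing cell:

```python
# Replace every missing value in a column with the mean of the present
# values, mirroring (if (missing? "test1") (mean "test1") (f "test1")).
def mean_replace(column):
    present = [v for v in column if v is not None]
    fill = sum(present) / len(present)
    return [fill if v is None else v for v in column]

test1 = [10, None, 14, None, 12]
repaired = mean_replace(test1)  # [10, 12.0, 14, 12.0, 12]
```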
Missing Value Induction
• But we can do better than the mean, can’t we? (Spoiler: yes)
• If only we had an algorithm that could predict a value given the other feature values for the same instance. HEY WAIT A MINUTE
• Train a model to predict your missing values
Training set is all points where the value is non-missing
Predict for points that have the value missing
Remember not to use your objective as part of the missing value predictor
• Some good news: you probably don’t know or care what your performance is!
• If you’re modeling with a technique that’s robust to missing values, you can model every column without getting into a “cycle”
#VSSML16 Feature Engineering September 2016 22 / 50
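The train-on-present, predict-on-missing recipe can be sketched end to end; here a 1-nearest-neighbour lookup on the other features stands in for a real model, and the age/test1 rows are invented:

```python
# Model-based imputation sketch: "train" on rows where the target is
# present, then fill rows where it is missing. The predictor here is a
# trivial 1-NN on the remaining features, not a real tree or regression.
def impute_with_model(rows, target):
    train = [r for r in rows if r[target] is not None]
    for r in rows:
        if r[target] is None:
            nearest = min(
                train,
                key=lambda t: sum((t[k] - r[k]) ** 2 for k in r if k != target),
            )
            r[target] = nearest[target]
    return rows

rows = [
    {"age": 30, "test1": 5},
    {"age": 60, "test1": 9},
    {"age": 58, "test1": None},  # closest training row is age 60
]
impute_with_model(rows, "test1")
```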
Constructing Features From Missing Data
• Maybe a more interesting thing to do is to use missing data as a feature
• Does your missing value have semantics?
Feature is incomputable?
Presence/absence is more telling than the actual value
• Easy to make a binary feature
• You could also make a quasi-numeric feature with a multilevel categorical using Flatline’s cond operator:
(cond (missing? "test1") "not performed"
(< (f "test1") 10) "low value"
(< (f "test1") 20) "medium value"
"high value")
#VSSML16 Feature Engineering September 2016 23 / 50
Outline
1 Some Unfortunate Examples
2 Feature Engineering
3 Mathematical Transformations
4 Missing Values
5 Series Features
6 Datetime Features
7 Text Features
8 Advanced Topics
#VSSML16 Feature Engineering September 2016 24 / 50
Time-series Prediction
• Occasionally a time-series prediction problem comes at you as a 1-d prediction problem
• The objective: predict the value of the sequence given its history
• But . . . there aren’t any features!
#VSSML16 Feature Engineering September 2016 25 / 50
Some Very Simple Time-series Data
• Closing prices for the S&P 500: 2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93
• A useless objective!
• No features!
• What are we going to do? (Spoiler: either drink away our sorrows or FEATURE ENGINEERING)
#VSSML16 Feature Engineering September 2016 26 / 50
  • 27. A Better Objective: Percent Change • Going to be really difficult to predict actual closing price. Why? Price gets larger over long time periods If we train on historical data, the future price will be out of range • Predicting the percent change from the previous close is a more stationary and more relevant objective • In flatline, we can get the previous field value by passing an index to the f function: (/ (- (f "price") (f "price" -1)) (f "price" -1)) #VSSML16 Feature Engineering September 2016 27 / 50
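The same percent-change objective, sketched in Python over the slide's price column:

```python
prices = [2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93]

# percent change from the previous close: (p_t - p_{t-1}) / p_{t-1},
# the analogue of the Flatline expression above
pct_change = [(p - prev) / prev for prev, p in zip(prices, prices[1:])]
```

Unlike the raw price, this stays in roughly the same range no matter how far the series drifts.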
  • 28. Features: Delta from Previous Day (Week, Month, . . .) • Percent change over the last n days • Remember these are features so don’t include the objective day - you won’t know it! (/ (- (f "price" -1) (f "price" -10)) (f "price" -10)) • Note that this could be anything, and exactly what it should be is domain-specific #VSSML16 Feature Engineering September 2016 28 / 50
  • 29. Features: Above/Below Moving Average • The avg-window function makes it easy to compute a moving average: (avg-window "price" -50 -1) • How far are we off the moving average? (let (ma50 (avg-window "price" -50 -1)) (/ (- (f "price" -1) ma50) ma50)) #VSSML16 Feature Engineering September 2016 29 / 50
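The same feature in Python, with a 5-row window standing in for the 50 of the Flatline snippet so the slide's seven prices are enough data:

```python
prices = [2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93]

def ma_offset(prices, t, n):
    # analogue of (avg-window "price" -n -1) at row t, then the relative
    # distance of the previous close from that moving average
    window = prices[t - n:t]        # the n closes before row t
    ma = sum(window) / n
    return (prices[t - 1] - ma) / ma

feature = ma_offset(prices, len(prices) - 1, 5)
```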
  • 30. Features: Recent Volatility • Start with the squared deviations from the window mean: (let (win-mean (avg-window "price" -10 -1)) (map (square (- _ win-mean)) (window "price" -10 -1))) • With that, it’s easy to get the standard deviation: (let (win-mean (avg-window "price" -10 -1) sq-errs (map (square (- _ win-mean)) (window "price" -10 -1))) (sqrt (/ (+ sq-errs) (- (count sq-errs) 1)))) • This is a reasonably nice measure of volatility of the objective over the last n time periods. #VSSML16 Feature Engineering September 2016 30 / 50
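In Python the whole computation collapses to one call: statistics.stdev also divides by n−1, like the Flatline version. Window length 3 here only because the slide's sample series is short.

```python
from statistics import stdev

prices = [2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93]

def window_vol(prices, t, n):
    # sample standard deviation of the n closes before row t; a rolling
    # volatility feature (on real data you might prefer stdev of returns)
    return stdev(prices[t - n:t])

vol = window_vol(prices, len(prices) - 1, 3)
```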
  • 31. Go Nuts! • Not hard to imagine all sorts of interesting features: Moving average crosses Breaking out of a range All with different time parameters • One of the difficulties of feature engineering is dealing with this exponential explosion • Makes it spectacularly easy to keep wasting effort (or losing money) #VSSML16 Feature Engineering September 2016 31 / 50
  • 32. Some Caveats • The regularity in time of the points has to match your training data • You have to keep track of past points to compute your windows • Really easy to get information leakage by including your objective in a window computation (and can be very hard to detect)! • Did I mention how awful information leakage can be here? • WHAT ABOUT INFORMATION LEAKAGE #VSSML16 Feature Engineering September 2016 32 / 50
  • 33. Outline 1 Some Unfortunate Examples 2 Feature Engineering 3 Mathematical Transformations 4 Missing Values 5 Series Features 6 Datetime Features 7 Text Features 8 Advanced Topics #VSSML16 Feature Engineering September 2016 33 / 50
  • 34. A Nearly Useless Datatype • There’s no easy way to include timestamps in our models (really just a formatted text field) • What about epoch time? Usually not what we want. Weather forecasting? Activity prediction? Energy usage? • A date time is really a collection of features #VSSML16 Feature Engineering September 2016 34 / 50
  • 35. An Opportunity for Automatic Feature Engineering • Timestamps are usually found in a fairly small (okay, massive) number of standard formats • Once parsed into epoch time, we can automatically extract a bunch of features: Date features - month, day, year, day of week Time features - hour, minute, second, millisecond • We do this “for free” in BigML • You can also specify a specific format in Flatline #VSSML16 Feature Engineering September 2016 35 / 50
  • 36. Useful Features May Be Buried Even More Deeply • Important to remember that the computer doesn’t have the information about time that you do • Example - Working Hours Need to know if it’s between, say, 9:00 and 17:00 Also need to know if it’s Saturday or Sunday (let (hour (f "SomeDay.hour") day (f "SomeDay.day-of-week")) (and (<= 9 hour 17) (< day 6))) • Example - Is Daylight Need to know hour of day Also need to know day of year #VSSML16 Feature Engineering September 2016 36 / 50
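The working-hours example translates directly to Python's datetime module; note weekday() numbers Monday as 0, so the weekend test differs from the Flatline one:

```python
from datetime import datetime

def is_working_hours(ts: datetime) -> bool:
    # weekday(): Monday = 0 .. Sunday = 6; working hours 9:00-17:00, Mon-Fri
    return ts.weekday() < 5 and 9 <= ts.hour < 17
```

A Wednesday morning such as `datetime(2016, 9, 14, 10, 30)` passes; the same time on the following Saturday does not.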
  • 37. Go Nuts! • Date Features endOfMonth? - Important feature of lots of clerical work nationalHoliday? - What it says on the box duringWorldCup? - Certain behaviors (e.g., TV watching) might be different during this time • Time Features isRushHour? - Between 7 and 9am on a weekday mightBeSleeping? - From midnight to 6am mightBeDrinking? - Weekend evenings (or Wednesday at 1am, if that’s your thing) • There are a ton of these things that are spectacularly domain dependent (think contract rolls in futures trading) #VSSML16 Feature Engineering September 2016 37 / 50
  • 38. Outline 1 Some Unfortunate Examples 2 Feature Engineering 3 Mathematical Transformations 4 Missing Values 5 Series Features 6 Datetime Features 7 Text Features 8 Advanced Topics #VSSML16 Feature Engineering September 2016 38 / 50
  • 39. Bag of Words • The standard way that BigML processes text is to create one feature for each word found for any instance in the text field. • This is the so-called “bag of words” approach • Called this because all notion of sequence goes away after processing • Any notion of correlation between words also disappears, since the features are treated as independent. #VSSML16 Feature Engineering September 2016 39 / 50
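A minimal bag-of-words encoding in Python, one count feature per vocabulary word, with all ordering discarded (toy documents, naive whitespace splitting):

```python
from collections import Counter

docs = ["the cat sat", "the dog sat down"]

# the vocabulary: every word seen in any document becomes one feature
vocab = sorted({w for d in docs for w in d.split()})

# each document becomes a vector of word counts; sequence is gone
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
```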
  • 40. Tokenization • Tokenization seems trivial Except if you have numbers or special characters in the tokens What about hyphens? Apostrophes? • Do we want to do n-grams? • Keep only tokens that occur a certain number of times (not too rare, not too frequent) • Note that this is more difficult with languages that don’t have clear word boundaries #VSSML16 Feature Engineering September 2016 40 / 50
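A crude tokenizer plus bigrams shows where those decisions bite; the regex below keeps apostrophes but splits on hyphens, which is one arbitrary choice among many:

```python
import re

def tokens(text, n=2):
    # lowercase word characters, digits and apostrophes only; hyphens split
    words = re.findall(r"[a-z0-9']+", text.lower())
    # n-grams: adjacent word groups joined with a space
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return words + grams
```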
  • 41. Word Stemming • Now we have a list of tokens • But sometimes we get “forms” of the same term playing, plays, play confirm, confirmed, confirmation • We can use “stemming” to map these different forms back to the same root • Most western languages have a reasonable set of rules #VSSML16 Feature Engineering September 2016 41 / 50
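A toy suffix-stripper makes the idea concrete on the slide's own examples. Real stemmers (e.g. Porter's) have far more rules and exceptions; this only shows the shape.

```python
def stem(word, suffixes=("ation", "ing", "ed", "s")):
    # toy stemmer: strip the first (longest-listed) matching suffix,
    # keeping at least a 3-letter stem; NOT a real stemming algorithm
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```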
  • 42. Other Textual Features • Example: Length of text (length (f "textFeature")) • Contains certain strings • Dollar amounts? Dates? Salutations? Please and Thank you? • Flatline has full regular expression capabilities #VSSML16 Feature Engineering September 2016 42 / 50
  • 43. Latent Dirichlet Allocation • Learn word distributions for topics • Infer topic scores for each document • Use the topic scores as features to a model #VSSML16 Feature Engineering September 2016 43 / 50
  • 44. Outline 1 Some Unfortunate Examples 2 Feature Engineering 3 Mathematical Transformations 4 Missing Values 5 Series Features 6 Datetime Features 7 Text Features 8 Advanced Topics #VSSML16 Feature Engineering September 2016 44 / 50
  • 45. Feature Construction as Projection • Feature construction means increasing the space in which learning happens • Another set of techniques typically replaces the feature space • Often these techniques are called dimensionality reduction, and the models that are learned are a new basis for the data. • Why would you do this? New, possibly unrelated hypothesis space Speed Better visualization #VSSML16 Feature Engineering September 2016 45 / 50
  • 46. Principal Component Analysis • Find the axis that preserves the maximum amount of variance in the data • Find the axis, orthogonal to the first, that preserves the next largest amount of variance, and so on • In spite of this description, this isn’t an iterative algorithm (it can be solved with a matrix decomposition) • Projecting the data into the new space is accomplished with a matrix multiplication • Resulting features are a linear combination of the old features #VSSML16 Feature Engineering September 2016 46 / 50
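For 2-D data the decomposition has a closed form, which makes a self-contained sketch possible; anything real should use a linear-algebra library. The sample points are invented and lie exactly on the line y = x, so the first axis should come out at 45°.

```python
import math

def pca_first_axis(points):
    # closed-form first principal component for 2-D points: eigendecompose
    # the 2x2 covariance matrix [[a, b], [b, c]] by hand
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    lam1 = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)  # top eigenvalue
    v = (lam1 - c, b) if b else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(*v)
    v = (v[0] / norm, v[1] / norm)
    # the new feature: each point projected onto the first principal axis,
    # i.e. a linear combination of the old features
    return [(p[0] - mx) * v[0] + (p[1] - my) * v[1] for p in points], v

projected, axis = pca_first_axis(
    [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (-1.0, -1.0), (-2.0, -2.0)])
```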
  • 47. Distance to Cluster Centroids • Do a k-Means clustering • Compute the distances from each point to each cluster centroid • Ta-da! k new features • Lots of variations on this theme: Normalized / Unnormalized, and by what? Average the class distributions of the resulting clusters Take the number of points / spread of each cluster into account #VSSML16 Feature Engineering September 2016 47 / 50
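The recipe in miniature, pure Python on invented 2-D points (in practice you'd use BigML's clustering rather than this toy Lloyd's loop with naive initialization):

```python
import math

def kmeans(points, k=2, iters=10):
    cents = points[:k]  # naive init: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, cents[j]))
            clusters[j].append(p)
        new = []
        for j, cl in enumerate(clusters):
            new.append(tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else cents[j])
        cents = new
    return cents

points = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (10.2, 9.8)]
cents = kmeans(points)
# ta-da: k new features per point, the distance to each centroid
features = [[math.dist(p, c) for c in cents] for p in points]
```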
  • 48. Stacked Generalization: Classifiers as Features • Idea: Use model scores as input to a “meta-classifier” • Algorithm: Split the training data into “base” and “meta” subsets Learn several different classifiers on the “base” subset Compute predictions for the “meta” subset with the “base” classifiers Use the scores on the “meta” subset as features for a classifier learned on that subset • Meta-classifier learns when the predictions of each of the “base” classifiers are to be preferred. • Pop quiz: Why do we need to split the data? #VSSML16 Feature Engineering September 2016 48 / 50
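The algorithm on this slide, shrunk to a sketch: invented 2-feature data whose label depends only on the first feature, two threshold "base classifiers" (one per feature) trained on the base split, and a lookup-table meta-classifier trained on their predictions over the meta split. All names and data are illustrative.

```python
from collections import defaultdict

def train_threshold(xs, ys):
    # a toy base classifier: threshold at the midpoint of the class means
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    t = (m0 + m1) / 2
    return lambda x: 1 if x > t else 0

# label depends only on the first feature; the second is noise
rows = [((1.0, 9.0), 0), ((2.0, 1.0), 0), ((6.0, 8.0), 1), ((7.0, 2.0), 1),
        ((1.5, 8.5), 0), ((6.5, 1.5), 1), ((2.5, 2.5), 0), ((7.5, 9.5), 1)]
base, meta = rows[:4], rows[4:]   # split so the meta-features are honest

# one base classifier per feature, trained on the "base" subset only
clfs = [train_threshold([r[0][i] for r in base], [r[1] for r in base])
        for i in (0, 1)]

# meta-features: the base classifiers' predictions on the held-out subset
meta_X = [tuple(c(x[i]) for i, c in enumerate(clfs)) for x, _ in meta]
meta_y = [y for _, y in meta]

# trivial meta-classifier: majority label for each prediction pattern
votes = defaultdict(list)
for mx, my in zip(meta_X, meta_y):
    votes[mx].append(my)
meta_clf = {k: max(set(v), key=v.count) for k, v in votes.items()}
```

The learned table follows the first classifier and ignores the noisy one, which is exactly the "whose prediction to prefer" behavior the slide describes; and it answers the pop quiz: without the split, the meta-features would be optimistically accurate on data the base classifiers already memorized.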
  • 49. Caveats (Again) • There’s obviously a lot of representational power here • Surprising sometimes how little it helps (remember, the upstream algorithms are already trying to do this) • Really easy to get bogged down in details • Some thoughts that might help you avoid that: Something really good is likely to be good with little effort Think of a decaying exponential where the vertical axis is “chance of dramatic improvement” and the horizontal one is “amount of tweaking” Have faith in both your own intelligence and creativity If you miss something really important, you’ll probably come back to it • Use Flatline as inspiration; it was designed with data scientists in mind #VSSML16 Feature Engineering September 2016 49 / 50
  • 50. It’s over! Questions #VSSML16 Feature Engineering September 2016 50 / 50