Data Transformation and Feature Engineering
Charles Parker
Allston Trading
Full Disclosure
• Oregon State University (Structured output spaces)

• Music recognition

• Real-time strategy game-playing

• Kodak Research Labs

• Media classification (audio, video)

• Document Classification

• Performance Evaluation

• BigML

• Allston Trading (applying machine learning to market data)
Data Transformation
• But it’s “machine learning”!

• Your data sucks (or at least I hope it does) . . .

• Data is broken

• Data is incomplete

• . . . but you know about it!

• Make the problem easier

• Make the answer more obvious

• Don’t waste time modeling the obvious

• Until you find the right algorithm for it
Your Data Sucks I: Broken Features
• Suppose you have a market data feature called
trade imbalance = (buy - sell) / total volume that
you calculate every five minutes

• Now suppose there are no trades over five minutes

• What to do?

• Point or feature removal

• Easy default
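A minimal sketch of both defaults, assuming hypothetical per-window buy/sell volume arrays: drop the broken points entirely, or fill in an easy default (zero imbalance).

```python
import numpy as np

# Hypothetical per-window buy/sell volumes; the third window has no trades.
buys = np.array([10.0, 7.0, 0.0])
sells = np.array([5.0, 9.0, 0.0])

total = buys + sells
safe_total = np.where(total > 0, total, 1.0)   # avoid dividing by zero
imbalance = np.where(total > 0, (buys - sells) / safe_total, np.nan)

# Option 1: point removal -- drop the broken windows entirely.
kept = imbalance[~np.isnan(imbalance)]

# Option 2: easy default -- call an empty window "balanced".
filled = np.nan_to_num(imbalance, nan=0.0)
```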
Your Data Sucks II: Missing Values
• Suppose you’re building a model
to predict the presence or
absence of cancer

• Each feature is a medical test

• Some are simple (height,
weight, temperature)

• Some are complex (blood
counts, CAT scan)

• Some patients have had all of
these done, some have not. 

• Does the presence or absence of
a CAT scan tell you something?
Should it be a feature?
Height   Weight   Blood Test   Cancer?
179      80       —            No
160      60       2,4          No
150      65       4,5          Yes
155      70       —            No
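One way to act on that question, sketched with pandas on a hypothetical rendering of the table above: keep the test result as one feature and add a presence indicator as another.

```python
import pandas as pd

# Hypothetical version of the table; None marks a test never taken.
df = pd.DataFrame({
    "height": [179, 160, 150, 155],
    "weight": [80, 60, 65, 70],
    "blood_test": [None, 2.4, 4.5, None],
    "cancer": ["No", "No", "Yes", "No"],
})

# The absence of the test may itself carry signal: make it a feature.
df["had_blood_test"] = df["blood_test"].notna().astype(int)
```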
Simplifying Your Problem
• What about the class
variable?

• It’s just another feature, so it
can be engineered

• Change the problem

• Do you need so many
classes?

• Do you need to do a
regression?
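For instance, a regression target can often be collapsed into a few bins that answer the real question. A sketch with hypothetical prices and bin edges:

```python
import pandas as pd

# Hypothetical regression target: sale price in dollars.
prices = pd.Series([95_000, 180_000, 420_000, 1_250_000])

# Maybe you don't need a regression: three coarse classes may answer
# the business question and give the learner an easier problem.
price_band = pd.cut(
    prices,
    bins=[0, 150_000, 500_000, float("inf")],
    labels=["low", "mid", "high"],
)
```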
Feature Engineering: What?
• Your data may be too “raw”
for learning

• Multimedia Data

• Raw text data

• Something must be done to
make the data “learnable”

• Compute edge histograms,
SIFT features

• Do word counts, latent
topic modeling
An Instructive Example
• Build a model to determine if two geo-coordinates
are walking distance from one another
Lat. 1      Long. 1       Lat. 2      Long. 2       Can Walk?
48.871507   2.354350      48.872111   2.354933      Yes
48.872111   2.354933      44.597422   -123.248367   No
48.872232   2.354211      48.872111   2.354933      Yes
44.597422   -123.248367   48.872232   2.354211      No
• Whether two points are
walking distance from
each other is not an
obvious function of the
latitude and longitude

• But it is an obvious
function of the distance
between the two points

• Unfortunately, that
function is quite
complicated

• Fortunately, you know it
already!
An Instructive Example
• Build a model to determine if two geo-coordinates
are walking distance from one another
Lat. 1      Long. 1       Lat. 2      Long. 2       Distance (km)   Can Walk?
48.871507   2.354350      48.872111   2.354933      2               Yes
48.872111   2.354933      44.597422   -123.248367   9059            No
48.872232   2.354211      48.872111   2.354933      5               Yes
44.597422   -123.248367   48.872232   2.354211      9056            No
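The "complicated but already known" function is great-circle distance. A sketch of the haversine formula; the walking-distance threshold at the end is a hypothetical choice:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# First row of the table: two points in central Paris.
d = haversine_km(48.871507, 2.354350, 48.872111, 2.354933)
can_walk = d < 2.0  # hypothetical walking-distance threshold
```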
Feature Engineering
• One of the core (maybe the core)
competencies of a machine learning engineer

• Requires domain understanding

• Requires algorithm understanding

• If you do it really well, you eliminate the need
for machine learning entirely

• Gives you another path to success; you can
often substitute domain knowledge for
modeling expertise

• But what if you don’t have specific domain
knowledge?
Techniques I: Discretization
• Construct meaningful bins for a
continuous feature (two or more)

• Body temperature

• Credit score

• The new features are categorical
features, each category of which
has nice semantics

• Don’t make the algorithm waste
effort modeling things that you
already know about
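A sketch with pandas, using hypothetical clinical bin edges for body temperature:

```python
import pandas as pd

# Hypothetical body temperatures in degrees Celsius.
temps = pd.Series([35.1, 36.8, 37.9, 39.5])

# Bins whose categories have meaningful semantics, so the algorithm
# doesn't waste effort rediscovering what "fever" means.
temp_band = pd.cut(
    temps,
    bins=[0.0, 36.0, 37.5, 38.5, 45.0],
    labels=["hypothermic", "normal", "elevated", "fever"],
)
```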
Techniques II: Delta
• Sometimes, the difference between two features is
the important bit

• As it was in the distance example

• This also shows up a lot in the time domain

• Example: Hiss in speech recognition

• Struggling? Just differentiate! (In all seriousness,
this sometimes works)
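"Just differentiate" is a one-liner in practice; a sketch on a hypothetical sensor trace:

```python
import numpy as np

# Hypothetical sensor readings sampled at regular intervals.
signal = np.array([10.0, 10.5, 11.0, 15.0, 15.2])

# First difference: often the change matters more than the level.
delta = np.diff(signal, prepend=signal[0])   # same length as the input
```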
Techniques III: Windowing
• If points are distributed in time,
previous points in the same
window are often very informative

• Weather

• Stock prices

• Add this to a 1-d sequence of
points to get an instant machine
learning problem!

• Sensor data

• User behavior

• Maybe add some delta features?
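A sketch of turning a 1-d sequence into a supervised table with lagged (and delta) features, using hypothetical daily temperatures:

```python
import pandas as pd

# Hypothetical daily temperatures.
s = pd.Series([21.0, 22.5, 19.8, 20.1, 23.4])

# Previous points in the window become features; a delta rides along.
frame = pd.DataFrame({
    "lag_1": s.shift(1),
    "lag_2": s.shift(2),
    "delta_1": s.shift(1) - s.shift(2),
    "target": s,
}).dropna()
```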
Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one
(subtract the mean and divide by the standard deviation).

• Good for domains with heterogeneous but Gaussian-distributed data sources

• Demographic data

• Medical testing

• Note that this generally has no effect on decision trees!

• Transformation is order preserving

• Decision tree splits rely only on ordering!

• Good for things like k-NN
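The transformation itself, sketched on hypothetical heterogeneous columns:

```python
import numpy as np

# Hypothetical heterogeneous features: age in years, income in dollars.
X = np.array([[25.0, 40_000.0],
              [35.0, 85_000.0],
              [55.0, 60_000.0]])

# Subtract the mean and divide by the standard deviation, per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# Distances are now comparable across features (good for k-NN); a
# decision tree would split in exactly the same places either way.
```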
Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0]
and [1, 1, 1] -> [0.57, 0.57, 0.57])

• Nice for sparse feature spaces like text

• Helps us tell the difference between documents and dictionaries

• We’ll come back to the idea of sparsity

• Note that this will affect decision trees

• Does not necessarily preserve order (co-dependency between
features)

• A lesson against over-generalization of technique!
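A sketch with hypothetical word-count rows: a short and a long document with the same mix of words collapse onto the same unit vector.

```python
import numpy as np

# Hypothetical word counts: same mix of words, very different lengths.
X = np.array([[1.0, 1.0, 0.0],
              [10.0, 10.0, 0.0]])

# Scale each row to unit L2 norm; document length stops masquerading
# as meaning. Row-wise scaling changes values jointly across features,
# which is why it does affect decision trees.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
```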
What Do We Really Want?
• This is nice, but what ever happened to “machine
learning”?

• Construct a feature space in which “learning is
easy”, whatever that means

• The space must preserve “important aspects of the
data”, whatever that means

• Are there general ways of posing this problem?
(Spoiler Alert: Yes)
Aside I: Projection
• A projection is a one-to-one mapping from one feature space to another

• We want a function f(x)
that projects a point x
into a space where a
good classifier is obvious

• The axes (features) in
your new space are
called your new basis
A Hack Projection: Distance to Cluster
• Do clustering on your data

• For each point, compute the
distance to each cluster centroid

• These distances are your new
features

• The new space can be either
higher or lower dimensional than
your original space

• For highly clustered data, this
can be a fairly powerful feature
space
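A sketch with scikit-learn's KMeans, whose transform conveniently returns exactly these centroid distances (the data here is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # hypothetical data

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_new = km.transform(X)         # shape (100, 5): distance to each centroid
```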
Principal Component Analysis
• Find the axis through
the data with the
highest variance

• Repeat for the next
orthogonal axis and so
on, until you run out of
data or dimensions

• Each axis is a feature
PCA is Nice!
• Generally quite fast (matrix decomposition)

• Features are linear combinations of originals (which
means you can project test data into the space)

• Features are linearly independent (great for some
algorithms)

• Data can often be “explained” with just the first few
components (so this can be “dimensionality
reduction”)
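A sketch with scikit-learn, keeping enough components to explain 95% of the variance and projecting held-out data through the same linear map:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))   # hypothetical data
X_test = rng.normal(size=(50, 10))

pca = PCA(n_components=0.95).fit(X_train)  # keep 95% of the variance
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)   # the linear map applies to new data too
```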
Spectral Embeddings
• Two of the seminal ones
are Isomap and LLE

• Generally, compute the
nearest neighbor matrix
and use this to create the
embedding

• Pro: Pretty spectacular
results

• Con: No projection matrix
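Both methods are available in scikit-learn; a sketch on the classic swiss-roll toy data:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Both start from a nearest-neighbor graph over the data.
Z_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
Z_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
# The classic downside: no projection matrix to apply to unseen points.
```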
Combination Methods
• Large Margin Nearest Neighbor, Xing’s Method

• Create an objective function that preserves neighbor
relationships

• Neighbor distances (unsupervised)

• Closest points of the same class (supervised)

• Clever search for a projection matrix that satisfies this
objective (usually an elaborate sort of gradient descent)

• I’ve had some success with these
Aside II: Sparsity
• Machine learning is essentially compression, and
constantly plays at the edges of this idea

• Minimum description length

• Bayesian information criterion

• L1 and L2 regularization

• Sparse representations are easily compressed

• So does that mean they’re more powerful?
Sparsity I: Text Data
• Text data is inherently sparse

• The fact that we choose a small number of words to
use gives a document its semantics

• Text features are incredibly powerful in the grand
scheme of feature spaces

• One or two words allow us to do accurate
classification

• But those one or two words must be sparse
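A sketch of that sparsity with a bag-of-words vectorizer on two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "stock prices fell sharply today"]

vec = CountVectorizer()
X = vec.fit_transform(docs)   # a scipy sparse matrix
print(X.shape, X.nnz)         # many columns, very few nonzero entries
```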
Sparsity II: EigenFaces
• Here are the first few
components of PCA applied to
a collection of face images

• A small number of these
explain a huge part of a huge
number of faces

• First components are like stop
words, last few (sparse)
components make recognition
easy
Sparsity III: The Fourier Transform
• Very complex waveform

• Turns out to be easily
expressible as a
combination of a few
(i.e., sparse) constant
frequency signals

• Such representations
make accurate speech
recognition possible
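A sketch with NumPy: a waveform that looks complex in the time domain is essentially two coefficients in the frequency basis (the frequencies here are hypothetical):

```python
import numpy as np

# Hypothetical "complex" waveform: really just two sinusoids.
t = np.linspace(0.0, 1.0, 1024, endpoint=False)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
top_two = np.argsort(spectrum)[-2:]
print(freqs[top_two])   # ~[120., 50.] Hz: a sparse representation
```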
Sparse Coding
• Iterate

• Choose a basis

• Evaluate that basis based on how well you can use
it to reconstruct the input, and how sparse it is

• Take some sort of gradient step to improve that
evaluation

• Andrew Ng’s efficient sparse coding algorithms and
Hinton’s deep autoencoders are both flavors of this
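scikit-learn's DictionaryLearning is one off-the-shelf instance of this loop; a sketch on hypothetical data:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # hypothetical input signals

# Alternately fit a basis ("dictionary") and sparse reconstruction
# codes, trading reconstruction error against sparsity (alpha).
dl = DictionaryLearning(n_components=15, alpha=1.0, random_state=0)
codes = dl.fit_transform(X)      # mostly-zero coefficients per sample
basis = dl.components_           # the learned basis vectors
```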
The New Basis
• Text: Topics

• Audio: Frequency
Transform

• Visual: Pen Strokes
Another Hack: Totally Random Trees
• Train a bunch of decision trees

• With no objective!

• Each leaf is a feature

• Ta-da! Sparse basis

• This actually works
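scikit-learn ships this trick as RandomTreesEmbedding; a sketch:

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))    # hypothetical data

# Completely random trees, grown with no objective; each leaf a point
# can land in becomes one binary feature -- a sparse basis for free.
rte = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_sparse = rte.fit_transform(X)  # scipy sparse one-hot leaf indicators
```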
And More and More
• There are a ton of variations on these themes

• Dimensionality Reduction

• Metric Learning

• “Coding” or “Encoding”

• Nice canonical implementations can be found at:
http://lvdmaaten.github.io/drtoolbox/
