DutchMLSchool. Automating Decision Making

BigML, Inc #DutchMLSchool
Feature Engineering
Creating Features that Make Machine Learning Work
Poul Petersen
CIO, BigML, Inc

Gaming the ML Performance
• Use ML to improve performance automatically
• OptiML
• Unsupervised Feature Engineering (PCA, Topic Models,
Clustering, Anomaly Detection, etc)
• Automated feature selection
• Use domain knowledge to improve performance manually
• Bespoke features (requires expertise)
• Fusions of models
• Manual feature selection
A Tale of Two Strategies…

What is Feature Engineering?
Feature Engineering: applying domain knowledge of
the data to create new features that allow ML
algorithms to work better, or to work at all.
• This is really, really important - more than algorithm selection!
• In fact, so important that BigML often does it
automatically
• ML Algorithms have no deeper understanding of data
• Numerical: have a natural order, can be scaled, etc
• Categorical: have discrete values, etc.
• The "magic" is the ability to find patterns quickly and efficiently
• ML Algorithms only know what you tell/show them with data
• Medical: kg and m, but BMI = kg/m² is better
• Lending: Debt and Income, but DTI is better
• Intuition can be risky: remember to prove it with an evaluation!
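As a minimal sketch of the medical example above, the BMI feature can be derived from the two raw measurements. The patient values and the function name are hypothetical and only illustrate the arithmetic.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = kg / m^2, the derived feature named in the list above."""
    return weight_kg / height_m ** 2


# Hypothetical patients: (weight in kg, height in m).
patients = [(70.0, 1.75), (95.0, 1.80)]

# One new numeric feature per row, rounded for readability.
bmi_values = [round(body_mass_index(kg, m), 1) for kg, m in patients]
print(bmi_values)
```

Whether the derived feature actually helps should still be checked with an evaluation.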

Built-in Transformations
Date-Time Fields
DATE-TIME: 2013-09-25 10:02
…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
   NUM   CAT    NUM  NUM   NUM
• Date-Time fields have a lot of information "packed" into them
• Splitting out the time components allows ML algorithms to discover time-based patterns.
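The splitting described on this slide happens automatically in BigML. Outside the platform, a rough equivalent might look like the sketch below; the function name and the output keys are assumptions chosen to mirror the columns shown.

```python
from datetime import datetime

# Fixed month abbreviations so the result does not depend on the locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def split_datetime(value: str) -> dict:
    """Expand one date-time string into separate component features."""
    ts = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return {
        "year": ts.year,                 # numeric
        "month": MONTHS[ts.month - 1],   # categorical
        "day": ts.day,                   # numeric
        "hour": ts.hour,                 # numeric
        "minute": ts.minute,             # numeric
    }


features = split_datetime("2013-09-25 10:02")
print(features)
```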

Categorical Fields for Clustering/LR
Original field:
…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …
   CAT
Encoded fields:
…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …
   NUM       NUM     NUM
• Clustering and Logistic Regression require numeric fields for inputs
• Categorical values are transformed to numeric vectors automatically*
• *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
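The table on this slide corresponds to a one-hot encoding. The sketch below reproduces it in plain Python; it is only one possible encoding and is not a description of how BigML implements it internally.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# The three example rows from the slide.
categories, encoded = one_hot(["business", "recreation", "health"])
print(categories)
print(encoded)
```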

Text Fields
TEXT: "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em."
…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
   NUM    NUM     NUM   NUM
• Unstructured text contains a lot of potentially interesting
patterns
• Bag-of-words analysis happens automatically and extracts
the "interesting" tokens in the text
• Another option is Topic Modeling to extract thematic meaning
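A toy bag-of-words count for the quote on this slide is sketched below. The stopword list and the crude "ness" stemmer are assumptions made so that "greatness" counts toward "great", as in the table; a real text analysis uses proper tokenization and stemming.

```python
import re
from collections import Counter

# Hypothetical stopword list, just large enough for this quote.
STOPWORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "em"}


def crude_stem(token: str) -> str:
    """Toy stemmer: strip a trailing 'ness' so 'greatness' becomes 'great'."""
    return token[:-4] if token.endswith("ness") and len(token) > 4 else token


def bag_of_words(text: str) -> Counter:
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)


quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(quote)
print(counts)
```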

Help ML to Work Better
When text is not actually unstructured
TEXT:
{
"url":"cbsnews",
"title":"Breaking News Headlines Business Entertainment World News ",
"body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
Extracted fields:
title           body
Breaking News…  news covering…
…               …
TEXT            TEXT
• In this case, the text field has structure (key/value pairs)
• Extracting the structure as new features may allow the ML algorithm to work better
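When the text field holds JSON like the example above, the key/value structure can be pulled out before modeling. This sketch uses Python's standard json module; the function name and the choice of keys are illustrative assumptions.

```python
import json

raw = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News ", '
       '"body": " news covering all the latest breaking national and world '
       'news headlines, including politics, sports, entertainment, business '
       'and more."}')


def extract_fields(text: str, keys=("title", "body")) -> dict:
    """Parse the JSON text and keep the selected keys as separate text fields."""
    record = json.loads(text)
    return {key: record.get(key, "").strip() for key in keys}


fields = extract_fields(raw)
print(fields["title"])
```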

FE Demo #1

Help ML to Work at all
When the pattern does not exist
Highway Number  Direction    Is Long
2               East-West    FALSE
4               East-West    FALSE
5               North-South  TRUE
8               East-West    FALSE
10              East-West    TRUE
…               …            …
Goal: Predict principal direction from highway number
( = (mod (field "Highway Number") 2) 0)
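The same even/odd feature can be sketched in Python using the five rows shown on the slide. The check at the end only confirms that, for these rows, the new feature lines up with the direction.

```python
# (highway number, direction) pairs from the slide.
highways = [
    (2, "East-West"),
    (4, "East-West"),
    (5, "North-South"),
    (8, "East-West"),
    (10, "East-West"),
]


def is_even(number: int) -> bool:
    """Mirror of the Flatline expression (= (mod n 2) 0)."""
    return number % 2 == 0


# With the engineered feature, the pattern becomes trivial to learn.
feature_matches_direction = all(
    is_even(number) == (direction == "East-West")
    for number, direction in highways
)
print(feature_matches_direction)
```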

FE Demo #2

Feature Engineering
Discretization
Total Spend  Spend Category
7.342,99     Top 33%
304,12       Bottom 33%
4,56         Bottom 33%
345,87       Middle 33%
8.546,32     Top 33%
NUM          CAT
NUM: "Predict will spend $3,521 with error $1,232"
CAT: "Predict customer will be Top 33% in spending"
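A minimal discretization sketch follows. The two cut points are made-up values chosen so the five sample rows land in the categories shown on the slide; in practice they would come from the percentiles of the full dataset.

```python
# Hypothetical cut points; real ones come from the data's 33rd/66th percentiles.
LOW_CUT = 320.0
HIGH_CUT = 1000.0


def spend_category(total_spend: float) -> str:
    """Map a numeric spend to one of three categorical labels."""
    if total_spend < LOW_CUT:
        return "Bottom 33%"
    if total_spend < HIGH_CUT:
        return "Middle 33%"
    return "Top 33%"


total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
categories = [spend_category(value) for value in total_spend]
print(categories)
```

The trade-off is the one on the slide: the numeric objective gives a regression with an error estimate, while the categorical one turns the problem into a classification.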

FE Demo #3

Built-ins for FE
• Discretize: Converts a numeric value to categorical
• Replace missing values: fixed/max/mean/median/etc
• Normalize: Adjust a numeric value to a specific range of
values while preserving the distribution
• Math: Exponentiation, Logarithms, Squares, Roots, etc
• Types: Force a field value to categorical, integer, or real
• Random: Create random values for introducing noise
• Statistics: Mean, Population
• Refresh Fields:
• Types: recomputes field types. Ex: #classes > 1000
• Preferred: recomputes preferred status
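Two of the built-ins listed above, replacing missing values with the mean and normalizing to a range, can be sketched in plain Python as follows. The function names and sample values are assumptions; the BigML built-ins expose their own options.

```python
from statistics import mean


def replace_missing_with_mean(values):
    """Fill None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def normalize(values, low=0.0, high=1.0):
    """Linearly rescale values into [low, high]; relative spacing is preserved."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


raw = [10.0, None, 30.0, 20.0]
filled = replace_missing_with_mean(raw)
scaled = normalize(filled)
print(filled, scaled)
```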

Flatline Add Fields
Computing with Existing Features
Debt    Income   Debt to Income Ratio
10.134  100.000  0,10
85.234  134.000  0,64
8.112   21.500   0,38
0       45.900   0
17.534  52.000   0,34
NUM     NUM      NUM
Debt to Income Ratio = Debt / Income
(/ (field "Debt") (field "Income"))
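For comparison, the same ratio computed in Python on the five rows of the slide. Returning None for a zero income is an assumption of this sketch, not something the slide specifies.

```python
# (debt, income) pairs from the slide, written without thousands separators.
rows = [(10134, 100000), (85234, 134000), (8112, 21500), (0, 45900), (17534, 52000)]


def debt_to_income(debt: float, income: float):
    """Debt divided by income; None when income is zero (assumed handling)."""
    if income == 0:
        return None
    return debt / income


ratios = [round(debt_to_income(debt, income), 2) for debt, income in rows]
print(ratios)
```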

FE Demo #4

What is Flatline?
• DSL:
• Invented by BigML - Programmatic / Optimized for
speed
• Transforms datasets into new datasets
• Adding new fields / Filtering
• Transformations are written in lisp-style syntax
• Feature Engineering
• Computing new fields: (/ (field "Debt") (field "Income"))
• Programmatic Filtering:
• Filtering datasets according to functions that evaluate
to true/false using the row of data as an input.
Flatline: a domain specific language for feature
engineering and programmatic filtering

Flatline
• Lisp style syntax: Operators come first
• Correct: (+ 1 2); not correct: (1 + 2)
• Dataset Fields are first-class citizens
• (field "diabetes pedigree")
• Limited programming language structures
• let, cond, if, map, list operators, */+-, etc.
• Built-in transformations
• statistics, strings, timestamps, windows
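To make the operator-first syntax concrete, here is a toy evaluator for a tiny subset of such expressions, applied to a single row. It is an illustration written for this purpose and is not how Flatline itself is implemented; only the operators in the OPS table are supported.

```python
import operator
import re
from functools import reduce

# Toy operator table; real Flatline has many more functions.
OPS = {
    "+": lambda *args: sum(args),
    "-": lambda first, *rest: -first if not rest else reduce(operator.sub, rest, first),
    "*": lambda *args: reduce(operator.mul, args, 1),
    "/": lambda a, b: a / b,
    "=": lambda a, b: a == b,
    "mod": lambda a, b: a % b,
}


def tokenize(source):
    """Split into parentheses, double-quoted strings, and bare atoms."""
    return re.findall(r'\(|\)|"[^"]*"|[^\s()]+', source)


def parse(tokens):
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # discard the closing parenthesis
        return items
    if token.startswith('"'):
        return ("str", token[1:-1])
    try:
        return float(token) if "." in token else int(token)
    except ValueError:
        return token  # an operator or function name


def evaluate(expr, row):
    if isinstance(expr, tuple):      # string literal
        return expr[1]
    if not isinstance(expr, list):   # number or symbol
        return expr
    head, *args = expr               # the operator always comes first
    values = [evaluate(arg, row) for arg in args]
    if head in ("field", "f"):
        return row[values[0]]        # look up a field of the current row
    return OPS[head](*values)


def run(source, row=None):
    return evaluate(parse(tokenize(source)), row or {})


print(run("(+ 1 2)"))
print(run('(/ (field "Debt") (field "Income"))', {"Debt": 10134, "Income": 100000}))
```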

Flatline s-expressions
Adding Simple Labels to Data
Define "default" as missing three payments in a row
Un-Labelled Data
Name        Month - 3  Month - 2  Month - 1
Joe Schmo   123,23     0          0
Jane Plain  0          0          0
Mary Happy  0          55,22      243,33
Tom Thumb   12,34      8,34       14,56
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
Labelled data
Name        Month - 3  Month - 2  Month - 1  Default
Joe Schmo   123,23     0          0          FALSE
Jane Plain  0          0          0          TRUE
Mary Happy  0          55,22      243,33     FALSE
Tom Thumb   12,34      8,34       14,56      FALSE
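The labeling rule can be sketched in Python with the four customers from the slide: a row is a default when the absolute payments of the last three months sum to zero.

```python
# Payments for Month - 3, Month - 2, Month - 1, as on the slide.
payments = {
    "Joe Schmo": (123.23, 0, 0),
    "Jane Plain": (0, 0, 0),
    "Mary Happy": (0, 55.22, 243.33),
    "Tom Thumb": (12.34, 8.34, 14.56),
}


def is_default(last_three) -> bool:
    """True when all three payments were missed (sum of absolute values is 0)."""
    return sum(abs(amount) for amount in last_three) == 0


labels = {name: is_default(amounts) for name, amounts in payments.items()}
print(labels)
```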

FE Demo #5

Shock: Deviations from a Trend
Shock = (Current - (4-day avg)) / std dev
date  volume  price
1     34353   314
2     44455   315
3     22333   315
4     52322   321
5     28000   320
6     31254   319
7     56544   323
8     44331   324
9     81111   287
10    65422   294
11    59999   300
12    45556   302
13    19899   301
date  day-4  day-3  day-2  day-1  4davg
1     -      -      -      -      -
2     -      -      -      314    -
3     -      -      314    315    -
4     -      314    315    315    -
5     314    315    315    321    316,25
6     315    315    321    320    317,75
7     315    321    320    319    318,75

Shock: Deviations from a Trend
Shock = (Current - (4-day avg)) / std dev
Current: (field "price")
4-day avg: (avg-window "price" -4 -1)
std dev: (standard-deviation "price")
(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
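A Python sketch of the same shock feature on the price column from the previous slide. Two assumptions are made here: rows without four prior days get None, and the standard deviation is the population standard deviation of the whole column, which may differ from what Flatline's standard-deviation computes.

```python
from statistics import pstdev

# Daily prices for dates 1..13 from the slide.
prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301]


def avg_window(values, index, start=-4, end=-1):
    """Mean of values[index+start .. index+end]; None if the window is incomplete."""
    first = index + start
    if first < 0:
        return None
    window = values[first:index + end + 1]
    return sum(window) / len(window)


def shock(values, index):
    """(current - trailing 4-day average) / standard deviation of the column."""
    trailing = avg_window(values, index)
    if trailing is None:
        return None
    return (values[index] - trailing) / pstdev(values)


shocks = [shock(prices, i) for i in range(len(prices))]
print(avg_window(prices, 4), shocks[8])
```

The large negative value at date 9 corresponds to the drop from 324 to 287.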

FE Demo #6

Advanced s-expressions
Highway isEven?
( = (mod (field "Highway Number") 2) 0)

Moon Phase%
( /
  ( mod
    ( -
      ( /
        ( epoch ( field "date-field" ))
        1000
      )
      621300
    )
    2551443
  )
  2551442
)
https://gist.github.com/petersen-poul/0cf5022ed1768837fe13af72b2488329
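The expression above translates almost literally into Python. The sketch keeps the constants exactly as they appear on the slide and assumes that Flatline's epoch returns milliseconds, which the division by 1000 suggests but the slide does not state.

```python
from datetime import datetime, timezone

REFERENCE_OFFSET_S = 621300   # offset subtracted on the slide
CYCLE_MOD_S = 2551443         # modulus used on the slide
CYCLE_DIV_S = 2551442         # divisor used on the slide


def moon_phase_fraction(moment: datetime) -> float:
    """Fraction of the lunar cycle elapsed, following the slide's arithmetic."""
    epoch_ms = moment.timestamp() * 1000   # assumed unit of Flatline's epoch
    seconds = epoch_ms / 1000
    return ((seconds - REFERENCE_OFFSET_S) % CYCLE_MOD_S) / CYCLE_DIV_S


start = datetime.fromtimestamp(REFERENCE_OFFSET_S, tz=timezone.utc)
sample = datetime(2013, 9, 25, 10, 2, tzinfo=timezone.utc)
print(moon_phase_fraction(start), moon_phase_fraction(sample))
```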

Home Price Feature
Worth More
Worth Less

Home Price Feature
LATITUDE   LONGITUDE    REFERENCE LATITUDE  REFERENCE LONGITUDE  Distance (m)
44,583     -123,296775  44,5638             -123,2794            700
44,604414  -123,296129  44,5638             -123,2794            30,4
44,600108  -123,29707   44,5638             -123,2794            19,38
44,603077  -123,295004  44,5638             -123,2794            37,8
44,589587  -123,301154  44,5638             -123,2794            23,39

Haversine Formula
https://en.wikipedia.org/wiki/Haversine_formula

( let
  ( R 6371000
    latA (to-radians {lat-ref})
    latB (to-radians ( field "LATITUDE" ) )
    latD ( - latB latA )
    longD ( to-radians ( - ( field "LONGITUDE" ) {long-ref} ) )
    a ( +
        ( square ( sin ( / latD 2 ) ) )
        ( *
          (cos latA)
          (cos latB)
          (square ( sin ( / longD 2)))
        )
      )
    c ( * 2 ( asin ( min (list 1 (sqrt a)))))
  )
  ( * R c )
)
Distance Lat/Long <=> Ref (Haversine)
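For readers who want to check the Flatline expression, here is the same Haversine calculation as a Python sketch. The variable names follow the expression above; the function name and the sample coordinates in the usage lines are illustrative.

```python
import math

EARTH_RADIUS_M = 6371000  # same R as in the Flatline expression


def haversine_m(lat, lon, lat_ref, lon_ref):
    """Great-circle distance in meters between a point and a reference point."""
    lat_a = math.radians(lat_ref)
    lat_b = math.radians(lat)
    lat_d = lat_b - lat_a
    lon_d = math.radians(lon - lon_ref)
    a = (math.sin(lat_d / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(lon_d / 2) ** 2)
    # min(1, ...) guards against rounding pushing sqrt(a) slightly above 1.
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_M * c


# One degree of latitude along a meridian, and a zero-distance check.
one_degree = haversine_m(1.0, 0.0, 0.0, 0.0)
same_point = haversine_m(44.5638, -123.2794, 44.5638, -123.2794)
print(round(one_degree), same_point)
```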

WhizzML + Flatline
[Diagram: a WHIZZML SCRIPT wraps the HAVERSINE FLATLINE expression; its inputs are the INPUT DATASET, LAT Ref, and LONG Ref, and it produces the OUTPUT DATASET]
https://bigml.com/gallery/scripts

JSON Parser???
• Remember, Flatline is not a full programming language
• No loops
• No accumulated values
• Code executes on one row at a time and has a limited
view into other rows
https://gist.github.com/petersen-poul/504c62ceaace76227cc6d8e0c5f1704b

Feature Engineering
Fix Missing Values in a "Meaningful" Way
[Diagram: Original Dataset → Filter Zeros → Clean Dataset → Model insulin → Predict insulin → Amended Dataset → Select insulin → Fixed Dataset]
( if ( = (field "insulin") 0) (field "predicted insulin") (field "insulin"))
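The final selection step can be sketched in Python as below. The rows and predicted values are hypothetical; the sketch only mirrors the if-expression, which treats a zero insulin value as missing and substitutes the model's prediction.

```python
def select_insulin(row: dict) -> float:
    """Use the predicted value when the recorded insulin is 0 (treated as missing)."""
    if row["insulin"] == 0:
        return row["predicted insulin"]
    return row["insulin"]


# Hypothetical rows from an amended dataset that already holds predictions.
amended = [
    {"insulin": 0, "predicted insulin": 112.5},
    {"insulin": 94, "predicted insulin": 101.0},
]
fixed = [select_insulin(row) for row in amended]
print(fixed)
```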

FE Demo #7

Feature Selection

Feature Selection
• Model Summary
• Field Importance
• Algorithmic
• Best-First Feature Selection
• Boruta
• Leakage
• Tight Correlations (AD, Plot, Correlations)
• Test Data
• Perfect future knowledge
Care must be taken when creating features!

Feature Selection
Leakage
• sales pipeline where step n-1 has no other outcome than step n.
• stock close predicts stock open
• churn retention: the worst rep is actually the best
(correlation != causation)
• cancer prediction where one input is a doctor-ordered test for the condition
• account ID predicts fraud (because only new
accounts are fraudsters)

Summary
• Feature Engineering: what is it / why it is important
• Automatic transformations: date-time, text, etc
• Built-in functions: filtering and feature engineering
• Discretization / Normalization / etc.
• Flatline: programmatic feature engineering / filtering
• Structure
• Examples: Adding fields / filtering
• When building features it is important to watch for leakage

OptiML and Fusions
Automating Machine Learning
Poul Petersen
CIO, BigML, Inc

[Figure: ML algorithms placed by project stage. Axes: "Decreasing Interpretability / Better Representation / Longer Training" and "Increasing Data Size / Complexity". Stages: Early Stage (Rapid Prototyping), Mid Stage (Proven Application), Late Stage (Critical Performance). Algorithms shown: Single Tree Model, Logistic Regression, Decision Forest, Random Decision Forest, Boosted Trees, Deepnets. One region is marked "TOO HARD".]

BigML Deepnets
• The success of a Deepnet is dependent on getting the right
network structure for the dataset
• But, there are too many parameters:
• Nodes, layers, activation function, learning rate, etc…
• And setting them takes significant expert knowledge
• Solution:
• Metalearning (a good initial guess)
• Network search (try a bunch)
Remember this?

OptiML
• Each resource has several parameters that impact quality
• Number of trees, missing splits, nodes, weight
• Rather than trial and error, we can use ML to find ideal parameters
• Why not make the model type (Decision Tree, Boosted Tree, etc.) a parameter as well?
• Similar to Deepnet network search, but finds the optimum machine learning algorithm and parameters for your data automatically
Key Insight: We can solve any parameter selection
problem in a similar way.

The Challenge…
• We will start with a dataset from StumbleUpon
• Train/Test split with seed “bigml”
• Build and Evaluate:
• 1-click Model, LR, Ensemble, Deepnet
• Top model from OptiML output
• Compare the results using the phi coefficient
• Explore other ideas for improving performance further
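For reference, the phi coefficient used to compare the models can be computed from a binary confusion matrix as sketched here. The confusion-matrix counts in the usage line are made up; returning 0.0 for an empty row or column is a convention assumed by this sketch.

```python
import math


def phi_coefficient(tp: int, fp: int, fn: int, tn: int) -> float:
    """Phi (Matthews correlation) for a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical evaluation: 40 TP, 10 FP, 10 FN, 40 TN.
example_phi = phi_coefficient(tp=40, fp=10, fn=10, tn=40)
print(example_phi)
```

Phi ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it a convenient single score for comparing classifiers.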

OptiML Demo

Results…
All scores are phi, evaluated against a holdout
• 1-Click Decision Tree: 0.36
• 1-Click LR: 0.47
• 1-Click Ensemble: 0.58
• Best OptiML Model (LR): 0.66
• 1-Click Deepnet: 0.67
What else can we try?

Fusions Inside
• Fuse any set of models into a new "fusion"
• Must have the same objective type
• Inputs and feature space can differ
• Weights can be added
• Give more importance to individual models
• Fusions can be fused as well
• Especially useful for fusing OptiML models
Key Insight: ML algorithms each have unique
strengths and weaknesses

Performance thru Diversity
[Diagram: a single Dataset feeds three models (Optimized Deepnet, Optimized Ensemble, Optimized Logistic Regression) that are combined; the result is labeled "Better?"]

Fusion Demo #1

Results…
• 1-Click LR: 0.47
• Fusion of top Model Types: 0.68

Fusions: Under the Hood
Classification
Model     Prediction  Probability  Weight
Ensemble  TRUE        56%          1
Deepnet   FALSE       67%          1
Model     TRUE        78%          2
Fusion    TRUE        61%
P(TRUE) = [56 + (100 - 67) + 2*78] / 4
Regression
Model     Prediction  Error  Weight
Ensemble  156,78      12,56  1
Deepnet   139,55      9,88   1
Model     172,10      23,76  2
Fusion    160,13      17,49
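The weighted averages on this slide can be reproduced with the short sketch below, using the slide's own numbers. The function names are illustrative, and the classification helper only covers the binary case shown here.

```python
def fuse_classification(votes, positive="TRUE", negative="FALSE"):
    """votes: (prediction, probability in %, weight). Returns (class, P(positive) in %)."""
    total_weight = sum(weight for _, _, weight in votes)
    weighted = sum(
        weight * (prob if prediction == positive else 100 - prob)
        for prediction, prob, weight in votes
    )
    p_positive = weighted / total_weight
    return (positive if p_positive >= 50 else negative), p_positive


def fuse_regression(predictions):
    """predictions: (value, error, weight). Returns weighted means of value and error."""
    total_weight = sum(weight for _, _, weight in predictions)
    value = sum(v * w for v, _, w in predictions) / total_weight
    error = sum(e * w for _, e, w in predictions) / total_weight
    return value, error


label, p_true = fuse_classification([("TRUE", 56, 1), ("FALSE", 67, 1), ("TRUE", 78, 2)])
value, error = fuse_regression([(156.78, 12.56, 1), (139.55, 9.88, 1), (172.10, 23.76, 2)])
print(label, p_true, round(value, 2), round(error, 2))
```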

Fusions: Like any BigML Model
• Fully accessible thru API and WhizzML
• Bindings have support for local predictions

Decision Boundary Smoothness
Single Tree:
• Outcome changes abruptly near decision boundary
• And not at all parallel to the boundary
• This can be "surprising"
Single Tree + Deepnet:
• Keep the interpretability of the tree
• But with a more nuanced decision boundary

Feature Stability
Feature Importance: Different subsets of features may have similar modeling
performance
Fusing models gives better resilience against missing values as well as
ensuring that all relevant features are utilized.

Weighting over Time
Data significance over time:
• Some data may change significance at different times
• Short-term user behavior versus long-term
• Weights can be set to account for the significance of time
Example weights: 1 Day: w=8, 1 Week: w=4, 1 Month: w=2

Improved Class Separation
Consider a 3-class objective
• Really only care about "yes" versus "not yes"
• A single model may struggle to separate the two negative classes
Classes: Yes, No, Maybe
Models: yes/no/maybe, yes/no, yes/maybe

Feature Space Optimization
Model Skills: Some ML algorithms "generally" do better on some feature types:
• RDF for sparse text vectors
• LR/Deepnets for numeric features
• Trees for categorical features
Feature subsets: Full, Numeric, Text

Fusions Demo #2

Results…
• 1-Click LR: 0.47
• Fusion of top Model Types: 0.68
• Custom Feature Fusion: 0.70

PCA
Principal Component Analysis
Poul Petersen
CIO, BigML

Issues with High Dimensionality
• Implicitly increases model complexity, prone to overfitting
• Requires more observations in order to generalize well
• Contains correlated or useless variables
• Data is difficult to visualize
• Takes a longer time to train models or make predictions
Principal Component Analysis
addresses all of these issues

Other Approaches
MODEL                Pruning, Node threshold
ENSEMBLE             Bagging, Randomization
LOGISTIC REGRESSION  L1 and L2 penalties
DEEPNET              Dropout

Dimensionality Reduction
Feature Selection
• Preserves the original variables and selects a subset
• Often uses recursive methods or statistical thresholds
• Examples: RFE, Chi-Squared Test, Boruta
Feature Extraction
• Transforms original variables into variables better suited for modeling
• Examples: word vectors, clustering
• PCA falls into this category
Manual Approach

When to use PCA
1. You want to reduce the number of variables in your model, but it is not clear which should be eliminated
2. You want to generate variables that are not correlated
3. You are okay with sacrificing some amount of interpretability for potential downstream performance gains

How Does PCA Work?
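Each PC is a linear combination of the original variables, with its own set of weights
PC_1 = w_{1,1} F_1 + w_{1,2} F_2 + w_{1,3} F_3 + … + w_{1,N} F_N
PC_2 = w_{2,1} F_1 + w_{2,2} F_2 + w_{2,3} F_3 + … + w_{2,N} F_N
…
PC_N = w_{N,1} F_1 + w_{N,2} F_2 + w_{N,3} F_3 + … + w_{N,N} F_N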
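To make the linear-combination idea concrete, the sketch below runs PCA on two features only, where the weights can be found in closed form from the 2x2 covariance matrix. The data points are made up, and this is a teaching sketch, not a description of BigML's implementation.

```python
import math


def pca_2d(rows):
    """Return (weights, variances, projected) for two-feature data."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    centered = [(x - mean_x, y - mean_y) for x, y in rows]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    middle = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    variances = [middle + spread, middle - spread]
    if abs(b) < 1e-12:
        # Features already uncorrelated: the original axes are the components.
        weights = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        weights = []
        for eigenvalue in variances:
            norm = math.hypot(b, eigenvalue - a)
            weights.append((b / norm, (eigenvalue - a) / norm))
    # Each PC is w1*F1 + w2*F2 applied to the centered features.
    projected = [tuple(w1 * x + w2 * y for w1, w2 in weights) for x, y in centered]
    return weights, variances, projected


# Hypothetical, strongly correlated two-feature data.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
weights, variances, projected = pca_2d(points)
print(weights, variances)
```

Most of the variance ends up in the first component, and the two projected columns have zero covariance, which is the property that lets later components be dropped.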

PCA Output
These principal components are not correlated

PCA Workflow
[Diagram, built up over four slides: the SOURCE DATASET is split into TRAIN and TEST; a PCA is created; BATCH PROJECTION is applied to both TRAIN and TEST, producing NEW TRAIN FEATURES and NEW TEST FEATURES]

PCA Demo

BigML PCA
• Standard PCA only applies to numerical data
• BigML uses three different data transformation methods in order to handle different data types
• Numeric data: Principal Component Analysis (PCA)
• Categorical data: Multiple Correspondence Analysis (MCA)
• Mixed data: Factorial Analysis of Mixed Data (FAMD)
• BigML will automatically handle numeric, text, items, and categorical
data without needing user input
