3. BigML, Inc ML Crash Course - UI/Algorithms/Feature Engineering
The need for Machine Learning
• Can you find any pattern in this tiny data set?
Talk  Text  Purchases  Data  Age  Churn?
148   72    0          33.6  50   TRUE
85    66    0          26.6  31   FALSE
183   64    0          23.3  32   TRUE
89    66    94         28.1  21   FALSE
115   0     0          35.3  29   FALSE
166   72    175        25.8  51   TRUE
100   0     0          30    32   TRUE
118   84    230        45.8  31   TRUE
171   110   240        45.4  54   TRUE
159   64    0          27.4  40   FALSE
… but this is a simple example
4.
Data Types
numeric: 1, 2.0, 3, -5.4
categorical: true, yes, red, mammal
date-time: 2013-09-25 10:02
Field         Format         Value
YEAR          YYYY-MM-DD     2013
MONTH         YYYY-MM-DD     September
DAY-OF-MONTH  YYYY-MM-DD     25
DAY-OF-WEEK   M-T-W-T-F-S-D  Wednesday
HOUR          HH:MM:SS       10
MINUTE        HH:MM:SS       02
text / items:
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
text tokens:
“great” appears 2 times
“afraid” appears 1 time
“born” appears 1 time
“some” appears 2 times
5.
Text Analysis
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
great: appears 4 times
Bag of Words
6.
Text Analysis
Be not afraid of greatness:
some are born great, some achieve
greatness, and some have greatness
thrust upon 'em.
…  great  afraid  born  achieve  …  …
…  4      1       1     1        …  …
…  …      …       …     …        …  …
Model
The token “great” occurs more than 3 times
The token “afraid” occurs no more than once
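The bag-of-words row above can be reproduced with a short sketch. It assumes a deliberately crude stemmer (`crude_stem`, a hypothetical helper that only strips a trailing "ness") so that "greatness" and "great" count as the same token; a real text analysis pipeline would use a proper stemmer and stop-word list.

```python
import re
from collections import Counter

QUOTE = (
    "Be not afraid of greatness: some are born great, some achieve "
    "greatness, and some have greatness thrust upon 'em."
)

def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: only strips a trailing "ness".
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token

def bag_of_words(text):
    # Lowercase, split into word tokens, stem, and count occurrences.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(crude_stem(token) for token in tokens)

counts = bag_of_words(QUOTE)
row = {term: counts[term] for term in ("great", "afraid", "born", "achieve")}
print(row)

# The two model conditions from the slide, evaluated on this row.
print(counts["great"] > 3, counts["afraid"] <= 1)
```

Each distinct token becomes a numeric field, which is what lets a model split on conditions such as "great occurs more than 3 times".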
7.
Evaluation
DATASET -> TRAIN SET + TEST SET
TEST SET -> PREDICTIONS -> METRICS
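A minimal sketch of this evaluation flow, using the ten churn rows from the earlier table. The model here is an assumption chosen for brevity: a majority-class baseline (`train_majority_model`, a hypothetical helper), not anything BigML trains. The split fraction and seed are also illustrative.

```python
import random
from collections import Counter

# (talk, text, purchases, data, age) -> churn, copied from the churn table.
ROWS = [
    ((148, 72, 0, 33.6, 50), True),
    ((85, 66, 0, 26.6, 31), False),
    ((183, 64, 0, 23.3, 32), True),
    ((89, 66, 94, 28.1, 21), False),
    ((115, 0, 0, 35.3, 29), False),
    ((166, 72, 175, 25.8, 51), True),
    ((100, 0, 0, 30.0, 32), True),
    ((118, 84, 230, 45.8, 31), True),
    ((171, 110, 240, 45.4, 54), True),
    ((159, 64, 0, 27.4, 40), False),
]

def train_test_split(rows, test_fraction=0.3, seed=42):
    # Shuffle a copy, then hold out the first chunk as the test set.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_model(train_rows):
    # Baseline "model": always predict the most common training label.
    majority = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: majority

def accuracy(model, test_rows):
    # Metric: fraction of held-out rows whose prediction matches the label.
    hits = sum(1 for features, label in test_rows if model(features) == label)
    return hits / len(test_rows)

train_set, test_set = train_test_split(ROWS)
model = train_majority_model(train_set)
acc = accuracy(model, test_set)
print(len(train_set), len(test_set), acc)
```

The key point is that the metric is computed only on rows the model never saw during training.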
8.
Ensembles
Diameter  Color  Shape  Fruit
4         red    round  plum
5         red    round  apple
5         red    round  apple
6         red    round  plum
7         red    round  apple
What is a round, red 6cm fruit?
All Data: “plum”
Bagging!
Sample 1: “plum”
Sample 2: “apple”
Sample 3: “apple”
Samples 1-3 combined: “apple”
Random Decision Forest!
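The bagging idea can be sketched as follows: draw bootstrap samples, fit one model per sample, and combine the votes. The base learner here (`nearest_diameter_model`) is a hypothetical stand-in that predicts the label of the closest diameter; the slide does not say which learner is used, and the sampled votes depend on the seed.

```python
import random
from collections import Counter

# (diameter, fruit); color and shape are constant in the table, so omitted.
FRUIT = [(4, "plum"), (5, "apple"), (5, "apple"), (6, "plum"), (7, "apple")]

def bootstrap_sample(rows, rng):
    # Sample with replacement, same size as the original data.
    return [rng.choice(rows) for _ in rows]

def nearest_diameter_model(rows):
    # Hypothetical base learner: label of the row with the closest diameter.
    def predict(diameter):
        return min(rows, key=lambda row: abs(row[0] - diameter))[1]
    return predict

def majority_vote(votes):
    return Counter(votes).most_common(1)[0][0]

def bagged_predict(rows, diameter, n_models=3, seed=0):
    rng = random.Random(seed)
    models = [nearest_diameter_model(bootstrap_sample(rows, rng))
              for _ in range(n_models)]
    votes = [model(diameter) for model in models]
    return majority_vote(votes), votes

# Combining the three sample votes shown on the slide:
print(majority_vote(["plum", "apple", "apple"]))  # apple

# A single model on all data memorizes the 6cm plum:
print(nearest_diameter_model(FRUIT)(6))  # plum

prediction, votes = bagged_predict(FRUIT, 6)
print(prediction, votes)
```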
21.
Isolation Forest
Grow a random decision tree until each instance is in its own leaf.
Depth: some instances are “easy” to isolate, others are “hard” to isolate.
Now repeat the process several times and use the average depth to compute
an anomaly score: 0 (similar) -> 1 (dissimilar)
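A rough sketch of the procedure, under stated assumptions: instead of growing whole trees, it follows only the random path that isolates one point (the depth has the same distribution), and it maps average depth to a 0-1 score with the normalization from the original isolation forest paper, which the slide does not spell out. The data and parameter values are made up for illustration.

```python
import math
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    # Follow random splits until `point` is alone (or cannot be split further).
    if len(data) <= 1 or depth >= max_depth:
        return depth
    spans = [(min(r[f] for r in data), max(r[f] for r in data))
             for f in range(len(point))]
    splittable = [f for f, (lo, hi) in enumerate(spans) if lo < hi]
    if not splittable:
        return depth
    feature = rng.choice(splittable)
    lo, hi = spans[feature]
    split = rng.uniform(lo, hi)
    goes_left = point[feature] < split
    side = [r for r in data if (r[feature] < split) == goes_left]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def expected_depth(n):
    # Assumed normalizer: average path length for n instances.
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def anomaly_scores(data, n_trees=100, seed=0):
    # Assumes at least two instances, so the normalizer is non-zero.
    rng = random.Random(seed)
    norm = expected_depth(len(data))
    scores = []
    for point in data:
        total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
        scores.append(2.0 ** (-(total / n_trees) / norm))
    return scores

# Six similar points and one that is far from the rest.
DATA = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 0.95),
        (1.05, 1.05), (0.95, 1.0), (8.0, 9.0)]
scores = anomaly_scores(DATA)
for point, score in zip(DATA, scores):
    print(point, round(score, 3))
```

The far-away point is isolated after very few splits, so its average depth is small and its score is closest to 1.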
22.
Model Competence
At Training Time: DATASET -> MODEL + ANOMALY DETECTOR
At Prediction Time:
Prediction     T       T
Confidence     86%     84%
Anomaly Score  0.5367  0.7124
Competent?     Y       N
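One way to read this slide is as a simple gate: a prediction is trusted only when the input does not look anomalous relative to the training data. The sketch below assumes a cutoff of 0.6, a hypothetical value chosen only because it separates the two examples shown; the slide does not state the threshold.

```python
def is_competent(anomaly_score, threshold=0.6):
    # Assumed rule: the model is "competent" when the input's anomaly
    # score is below the (hypothetical) threshold.
    return anomaly_score < threshold

# (prediction, confidence, anomaly score) for the two cases on the slide.
cases = [("T", 0.86, 0.5367), ("T", 0.84, 0.7124)]
for prediction, confidence, score in cases:
    verdict = "Y" if is_competent(score) else "N"
    print(prediction, confidence, score, verdict)
```

Note that the two predictions have nearly the same confidence; only the anomaly score distinguishes them.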
23.
Association Rules
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
Rules:
Antecedent                        ->  Consequent
{class = gas}                     ->  amount < 100
{customer = Bob, account = 3421}  ->  zip = 46140
24.
Association Metrics
Coverage
Percentage of instances which match antecedent “A”.
25.
Association Metrics
Support
Percentage of instances which match antecedent “A” and consequent “C”.
26.
Association Metrics
Confidence
Percentage of instances in the antecedent which also contain the
consequent, i.e. Support / Coverage.
27.
Association Metrics
Confidence
0%: A never implies C
between 0% and 100%: A sometimes implies C
100%: A always implies C
28.
Association Metrics
Lift
Ratio of observed support to support if A and C were statistically
independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
29.
Association Metrics
Lift
Lift < 1: Negative Correlation
Lift = 1: No Association
Lift > 1: Positive Correlation
30.
Association Metrics
Leverage
Difference of observed support and support if A and C were statistically
independent.
Leverage = Support - [ p(A) * p(C) ]
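To make the five metrics concrete, the sketch below computes them for the two rules on the seven transactions from the Association Rules slide. It is an illustration of the definitions only, not of how rules are discovered; `rule_metrics` is a hypothetical helper, and exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction

# Only the fields the two rules need, copied from the transactions table.
COLUMNS = ("customer", "account", "class", "zip", "amount")
TRANSACTIONS = [dict(zip(COLUMNS, row)) for row in [
    ("Bob", 3421, "clothes", 46140, 135),
    ("Bob", 3421, "food", 46140, 401),
    ("Alice", 2456, "food", 12222, 234),
    ("Sally", 6788, "gas", 26339, 94),
    ("Bob", 3421, "tech", 21350, 2459),
    ("Bob", 3421, "gas", 46140, 83),
    ("Sally", 6788, "food", 26339, 51),
]]

def rule_metrics(transactions, antecedent, consequent):
    # Assumes the antecedent and consequent each match at least one instance.
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent(t))
    n_c = sum(1 for t in transactions if consequent(t))
    n_ac = sum(1 for t in transactions if antecedent(t) and consequent(t))
    coverage = Fraction(n_a, n)   # p(A)
    p_c = Fraction(n_c, n)        # p(C)
    support = Fraction(n_ac, n)   # p(A and C)
    return {
        "coverage": coverage,
        "support": support,
        "confidence": support / coverage,
        "lift": support / (coverage * p_c),
        "leverage": support - coverage * p_c,
    }

gas = rule_metrics(
    TRANSACTIONS,
    antecedent=lambda t: t["class"] == "gas",
    consequent=lambda t: t["amount"] < 100,
)
bob = rule_metrics(
    TRANSACTIONS,
    antecedent=lambda t: t["customer"] == "Bob" and t["account"] == 3421,
    consequent=lambda t: t["zip"] == 46140,
)
for name, metrics in (("gas", gas), ("bob", bob)):
    print(name, {key: str(value) for key, value in metrics.items()})
```

For the gas rule, confidence is 1 (both gas purchases are under 100) and lift is 7/3, above 1, so the association is positive. The Bob rule has confidence 3/4 because one of Bob's four purchases used a different zip.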
31.
Association Metrics
Leverage
Leverage < 0: Negative Correlation
Leverage = 0: No Association
Leverage > 0: Positive Correlation
-1…
32.
Machine Learning Secret
“…the largest improvements in accuracy often came from
quick experiments, feature engineering, and model tuning
rather than applying fundamentally different algorithms.”
Facebook FBLearner 2016
Feature Engineering: applying domain knowledge of
the data to create features that make machine
learning algorithms work better or at all.
33.
Feature Engineering
2013-09-25 10:02
DATE-TIME
Automatic Date Transformation
…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
   NUM   CAT    NUM  NUM   NUM
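The date expansion above can be sketched with the standard library. The function name `expand_datetime`, the input format string, and the abbreviated month labels are assumptions made for this example; they only mirror the columns on the slide.

```python
from datetime import datetime

# Explicit labels avoid any dependence on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def expand_datetime(text, fmt="%Y-%m-%d %H:%M"):
    # Parse one date-time value and split it into separate fields.
    moment = datetime.strptime(text, fmt)
    return {
        "year": moment.year,                  # numeric
        "month": MONTHS[moment.month - 1],    # categorical label
        "day": moment.day,                    # numeric
        "hour": moment.hour,                  # numeric
        "minute": moment.minute,              # numeric
    }

row = expand_datetime("2013-09-25 10:02")
print(row)
```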
34.
Feature Engineering
Automatic Categorical Transformation
…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …
   CAT
…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …
   NUM       NUM     NUM
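This categorical-to-numeric expansion is commonly called one-hot encoding. A minimal sketch, assuming the output columns are simply the sorted distinct categories (`one_hot` is a hypothetical helper):

```python
def one_hot(values):
    # One 0/1 column per distinct category, in sorted order.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

categories, rows = one_hot(["business", "recreation", "health"])
print(categories)
for row in rows:
    print(row)
```

With the three values from the slide, the columns come out as business, health, recreation, and each row contains exactly one 1.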
35.
Feature Engineering
Be not afraid of greatness:
some are born great, some achieve
greatness, and some have greatness
thrust upon 'em.
TEXT
Automatic Text Transformation
…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
   NUM    NUM     NUM   NUM
36.
Feature Engineering
{
"url":"cbsnews",
"title":"Breaking News Headlines Business Entertainment World News ",
"body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
TEXT
Better representation
title           body
Breaking News…  news covering…
…               …
TEXT            TEXT
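The "better representation" amounts to parsing the JSON text and promoting selected keys to their own text fields. A small sketch using the standard `json` module; `split_fields` and its default key list are assumptions for this example.

```python
import json

RAW = (
    '{"url": "cbsnews", '
    '"title": "Breaking News Headlines Business Entertainment World News ", '
    '"body": " news covering all the latest breaking national and world news '
    'headlines, including politics, sports, entertainment, business and more."}'
)

def split_fields(raw, keys=("title", "body")):
    # Parse the JSON text and keep each requested key as its own field.
    record = json.loads(raw)
    return {key: record.get(key, "").strip() for key in keys}

fields = split_fields(RAW)
print(fields["title"])
print(fields["body"])
```

Once split, each field can get its own bag-of-words transformation, so a word in the title is no longer mixed with the same word in the body.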
37.
Feature Engineering
Discretization
Total Spend (NUM)  Spend Category (CAT)
7,342.99           Top 33%
304.12             Middle 33%
4.56               Bottom 33%
345.87             Middle 33%
8,546.32           Top 33%
NUM: “Predict customer will spend $3,521 with error $1,232”
CAT: “Predict customer will be Top 33% in spending”
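A minimal sketch of discretization by rank, assuming the category is decided by each value's percentile rank within the column (bottom, middle, or top third). The exact boundary rule is an assumption; with the five values from the slide it happens to reproduce the categories shown.

```python
def spend_category(values):
    # Rank values from smallest (rank 1) to largest, then bucket by thirds.
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, index in enumerate(order, start=1):
        if 3 * rank <= n:            # rank / n <= 1/3
            labels[index] = "Bottom 33%"
        elif 3 * rank <= 2 * n:      # rank / n <= 2/3
            labels[index] = "Middle 33%"
        else:
            labels[index] = "Top 33%"
    return labels

spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
labels = spend_category(spend)
for amount, label in zip(spend, labels):
    print(amount, label)
```

Turning the numeric target into a category changes the task from regression to classification, which is often easier to act on.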
38.
Feature Engineering
Combinations of Multiple Features
Kg (NUM)  M2 (NUM)  BMI (NUM)
101.4     3.24      31.17
85.2      2.8       30.4
56.2      2.9       19.38
136.1     3.6       37.8
95.9      4.1       23.39
BMI = Kg / M2
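Combining two fields into a ratio can be sketched in a few lines. `add_ratio_feature` is a hypothetical helper; the field names follow the table, and rounding to two decimals is an assumption.

```python
def add_ratio_feature(rows, numerator, denominator, name):
    # Return new rows with an extra field: numerator / denominator.
    return [
        dict(row, **{name: round(row[numerator] / row[denominator], 2)})
        for row in rows
    ]

people = [
    {"Kg": 101.4, "M2": 3.24},
    {"Kg": 85.2, "M2": 2.8},
    {"Kg": 56.2, "M2": 2.9},
    {"Kg": 136.1, "M2": 3.6},
    {"Kg": 95.9, "M2": 4.1},
]
with_bmi = add_ratio_feature(people, "Kg", "M2", "BMI")
for row in with_bmi:
    print(row)
```

Most rows match the slide up to rounding; the first row works out to about 31.30, which differs slightly from the 31.17 shown.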
39.
Feature Engineering
Flatline
• BigML’s Domain-Specific Language (DSL) for Transforming Datasets
• Limited programming language structures
  • let, cond, if, maps, list operators, */+-
• Dataset Fields are first-class citizens
  • (field "diabetes pedigree")
• Built-in transformations
  • statistics, strings, timestamps, windows