VSSML18. Introduction to Machine Learning and the BigML Platform

Valencian Summer School in Machine Learning
4th edition
September 13-14, 2018

Intro to BigML
Making Machine Learning Beautifully Simple
Poul Petersen
CIO, BigML, Inc

Sampling the Audience
Expert: Published papers at KDD, ICML, NIPS, etc or
developed own ML algorithms used at large scale
Aficionado: Understands pros/cons of different
techniques and/or can tweak algorithms as needed
Practitioner: Very familiar with ML packages (Weka,
Scikit, BigML, etc.)
Newbie: Just taking Coursera ML class or reading an
introductory book to ML
Absolute beginner: ML sounds like science fiction

What is Machine Learning?
Finding patterns in data that can be used to
make inference
predictive models

[Figure: COMPLEXITY OF TASKS (- to +) plotted against TIME, 20th century to 21st century]
A NEW PROGRAMMING PARADIGM

Traditional Programming
LOST BAGGAGE POLICY

Programming with ML
PREDICT FLIGHT DELAYS

Programming with ML

Modeling Churn

What just happened?
Churn
Data
How many
support calls?
Model Prediction:
Churn=yes

Some Terminology…
Churn
Data
Model Prediction:
Churn=yes
Training
Data
• Modeling
• Clustering
• Anomaly Detection
• Association Discovery
ML
Resource
ML
Platform
“Consume” the model
or
“put into production”
• Dashboard
• Custom Application
• Wearable / Edge device
• Batch Process

A Brief History of BigML
• BigML Mission: To make Machine
Learning Beautifully Simple
• BigML Founded in Corvallis,
Oregon in 2011 - long before ML
was "cool"
• You’ve never heard of it?
• Most innovative city in the United
States!

A Brief History of BigML

BigML Platform
[Architecture diagram]
Web-based Frontend: Visualizations
Distributed Machine Learning Backend: SOURCE, DATASET, MODEL, PREDICTION, EVALUATION, SAMPLE, and WHIZZML servers
Tools - https://bigml.com/tools
REST API - https://bigml.com/api
Smart Infrastructure (auto-deployable, auto-scalable): servers, events, Gearman queue, message queue, run-queue scaler, busy scaler, AWS costs; desired, auto, and actual topology

BigML Platform
[Architecture diagram repeated from the previous slide, now marked On-Premises]

Machine Learning Tasks
Question | ML Resource
Based on previous loan outcomes, will this customer default on a loan? | Models / Ensembles / Deepnets / LR (Logistic Regression)
How many customers will apply for a loan next month? | Models, Ensembles, Deepnets
Based on trends, how much of this product will I sell next month? | Time Series
Is the consumption of this product unusual? | Anomaly Detection
Is the behavior of these customers similar? | Clusters
Are these products purchased together? | Association Discovery
What are the thematic contents of these documents? | Topic Models

Lending Club Loan Lifecycle
[Diagram: loan statuses, each marked “Open” or “Closed”: Current, In Grace Period, Late 16-30 Days, Late 31-120 Days, Default, Charged Off, Fully Paid]
Flatline: (if (= (field "loan_status") "Fully Paid") "good" "bad")
Data: s3://bigml-public/csv/lc_sample.csv.gz
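For readers who do not know Flatline, the labeling rule above can be mirrored in plain Python. This is only an illustrative sketch: the function name `loan_quality` and the sample statuses are assumptions, and the platform itself evaluates the Flatline expression, not this code.

```python
def loan_quality(loan_status: str) -> str:
    """Mirror of the Flatline rule: only fully paid loans count as "good"."""
    return "good" if loan_status == "Fully Paid" else "bad"


# Hypothetical closed-loan statuses, used only to exercise the rule.
statuses = ["Fully Paid", "Charged Off", "Default"]
labels = [loan_quality(status) for status in statuses]
print(labels)
```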

Modeling Default

What just happened?
• We started with loan data from Lending Club pulled from S3
• At the Source step, we fixed some datatypes
• At the Dataset step, we used Flatline to filter and create the
loan "quality" feature
• When configuring the Model, we set some advanced settings:
• removed correlated features: int_rate
• enabled objective balancing (wait… why?)

Weighting
Instance | Rate | Payment | Status | Predict | Confidence
1 | 23% | 134 | Paid | Paid | 20%
2 | 23% | 134 | Paid | Paid | 25%
3 | 23% | 134 | Paid | Paid | 30%
... | ... | ... | ... | ... | ...
1000 | 23% | 134 | Paid | Paid | 99.5%
1001 | 23% | 134 | Default | Paid | 99.4%
Problem: Default is “more important” but occurs less often than Paid
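One common way to give a rare class equal consideration is to weight each instance inversely to its class frequency. The sketch below assumes that scheme and uses a hypothetical helper, `balancing_weights`; it is not necessarily how the platform implements objective balancing.

```python
from collections import Counter


def balancing_weights(labels):
    """Weight each class so that all classes carry the same total weight.

    Assumption: weight = total / (number of classes * class count).
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# The table above: 1000 Paid instances and a single Default.
labels = ["Paid"] * 1000 + ["Default"]
weights = balancing_weights(labels)
print(weights)
```

With these weights, the single Default instance counts as much as all 1000 Paid instances together.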

What just happened?
• We started with loan data from Lending Club pulled from S3
• At the Source step, we fixed some datatypes
• At the Dataset step, we used Flatline to filter and create the
loan "quality" feature
• When configuring the Model, we set some advanced settings:
• removed correlated features: int_rate
• enabled objective balancing
• We explored the Model to see what factors predict default.
• We deployed the Model into a voice interface to make
Predictions.
Question: Can we trust this model?

Evaluations
[Diagram: DATASET split into TRAIN SET and TEST SET; PREDICTIONS for the TEST SET yield METRICS]

Evaluation Metrics
• Imagine we have a model that can predict a person’s dominant
hand; that is, for any individual it predicts left / right
• Define the positive class
• This selection is arbitrary
• It is the class you are interested in!
• The negative class is the “other” class (or others)
• For this example, we choose: left

Evaluation Metrics
• We choose the positive class: left
• True Positive (TP)
• We predicted left and the correct answer was left
• True Negative (TN)
• We predicted right and the correct answer was right
• False Positive (FP)
• We predicted left but the correct answer was right
• False Negative (FN)
• We predicted right but the correct answer was left

Evaluation Metrics
True Positive: Correctly predicted the positive class
True Negative: Correctly predicted the negative class
False Positive: Incorrectly predicted the positive class
False Negative: Incorrectly predicted the negative class
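The four definitions above can be sketched as a small counting routine. The function name `confusion_counts` and the ten example people are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, TN, FP, FN) for the chosen positive class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1  # predicted positive, and it was positive
            else:
                fp += 1  # predicted positive, but it was negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, but it was positive
            else:
                tn += 1  # predicted negative, and it was negative
    return tp, tn, fp, fn


# Hypothetical sample: 3 left-handed and 7 right-handed people.
actual = ["left"] * 3 + ["right"] * 7
predicted = ["left", "left", "right"] + ["left", "left"] + ["right"] * 5
counts = confusion_counts(actual, predicted, positive="left")
print(counts)
```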

Accuracy
$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Ex: 90% of people are right-handed and 10% are left
• A silly model which always predicts right handed is
90% accurate
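The unbalanced-class warning can be checked with a few lines of Python. The population of 100 people below is hypothetical and simply matches the 90% / 10% split in the example.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the correct answer."""
    correct = sum(1 for truth, guess in zip(actual, predicted) if truth == guess)
    return correct / len(actual)


# Hypothetical population: 10 left-handed and 90 right-handed people.
actual = ["left"] * 10 + ["right"] * 90
# The "silly" model always predicts right.
predicted = ["right"] * len(actual)

acc = accuracy(actual, predicted)
left_found = sum(
    1 for truth, guess in zip(actual, predicted) if truth == "left" and guess == "left"
)
print(acc, left_found)
```

The accuracy looks good even though the model never identifies a single left-hander.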

Accuracy
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 0, FP = 0
Classified as Right Handed: TN = 7, FN = 3
$$\text{Accuracy} = \frac{TP + TN}{\text{Total}} = \frac{0 + 7}{10} = 70\%$$

Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$
• “accuracy” or “purity” of positive class
• How well you did separating the positive class from the
negative class
• If Precision = 1 then no FP.
• You may have missed some left handers, but of the
ones you identified, all are left handed. No mistakes.
• If Precision = 0 then no TP
• None of the left handers you identified are actually left
handed. All mistakes.

Precision
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
$$\text{Precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 2} = 50\%$$

Recall
$$\text{Recall} = \frac{TP}{TP + FN}$$
• percentage of positive class correctly identified
• A measure of how well you identified all of the positive
class examples
• If Recall = 1 then no FN → All left handers identified
• There may be FP, so precision could be <1
• If Recall = 0 then no TP → No left handers identified

Recall
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 1} \approx 66.7\%$$

f-Measure
$$f = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
• harmonic mean of Recall & Precision
• If f-measure = 1 then Recall == Precision == 1
• If Precision OR Recall is small then the f-measure is small

f-Measure
Positive Class = Left, Negative Class = Right
Classified as Left Handed / Classified as Right Handed
R = 66.7%
P = 50%
f = 57%
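The precision, recall, and f-measure figures for the handedness example can be reproduced with a short sketch. The function names are illustrative, and the counts are the TP = 2, FP = 2, TN = 5, FN = 1 values from the Precision and Recall slides.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct (undefined if TP + FP = 0)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found (undefined if TP + FN = 0)."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * r * p / (r + p)


tp, fp, tn, fn = 2, 2, 5, 1
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(p, r)
print(round(p, 3), round(r, 3), round(f, 3))
```

Because the harmonic mean is pulled toward the smaller value, the f-measure stays closer to the 50% precision than a simple average would.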

Phi Coefficient
$$\phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
• Returns a value between -1 and 1
• If -1 then predictions are opposite reality
• =0 no correlation between predictions and reality
• =1 then predictions are always correct

Phi Coefficient
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
Phi = 0.356
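As a check on the Phi value above, here is a minimal sketch of the formula. The function name `phi_coefficient` is an assumption, and the sketch does not guard against a zero denominator.

```python
import math


def phi_coefficient(tp, tn, fp, fn):
    """Correlation between predictions and reality, between -1 and 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator


phi = phi_coefficient(tp=2, tn=5, fp=2, fn=1)
print(round(phi, 3))
```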

Evaluations

What just happened?
• We split the Lending Club data into training and test Datasets
• We created a Model and Evaluation
• Looking at the Accuracy, the Model seemed to be
performing well, but only because of unbalanced classes
• The resulting Model did well at predicting good loans
• But bad loans are "more important"
• We tried different weights to increase the Recall of bad loans:
• objective balancing: equal consideration
• class weights: bad = 1000, good = 1
• Finally, we explored the impact of changing the probability
threshold

Evaluation
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
• Don’t forget that accuracy can be misleading!
• Mostly useless with unbalanced classes (left/right?)
• Use weighting, operating points, other tricks…

What else can we try?
• Rather than build a single model…
• Combine the output of several typically
“weaker” models into a powerful ensemble…
• How do we create unique models from the
same training dataset?
Ensembles!

Decision Forest
[Diagram: DATASET → SAMPLE 1 to 4 → MODEL 1 to 4 → PREDICTION 1 to 4 → COMBINER → PREDICTION]

Random Decision Forest
[Diagram: DATASET → SAMPLE 1 to 4 → MODEL 1 to 4 → PREDICTION 1 to 4 → COMBINER → PREDICTION, with an additional SAMPLE step shown for the models]
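The sample, model, and combiner steps in the forest diagrams can be sketched in a few lines. Everything here is an assumption for illustration: the one-feature data, the very weak "stump" model, and the majority-vote combiner. A random decision forest would additionally consider only a random subset of features at each split, which a single-feature sketch cannot show.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one SAMPLE in the diagram)."""
    return [rng.choice(rows) for _ in rows]


def train_stump(sample):
    """Weak model: a single threshold on x; predicts "bad" at or above it."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in sample}):
        err = sum(1 for x, y in sample if ("bad" if x >= t else "good") != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x: "bad" if x >= best_t else "good"


def majority_vote(predictions):
    """COMBINER: return the most common predicted class."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical rows: (numeric feature, loan quality).
rows = [(5, "good"), (8, "good"), (10, "good"), (12, "good"),
        (15, "good"), (22, "bad"), (25, "bad"), (30, "bad")]
rng = random.Random(42)
models = [train_stump(bootstrap_sample(rows, rng)) for _ in range(25)]


def ensemble_predict(x):
    """Ask every model for a prediction and combine the answers."""
    return majority_vote([model(x) for model in models])


print(ensemble_predict(6), ensemble_predict(28))
```

Each model sees a different sample, so the models differ even though they come from the same training dataset.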

Boosting
[Diagram: Iteration 1: DATASET → MODEL 1 → PREDICTION 1. The ERROR of each prediction forms the next dataset: DATASET 2 → MODEL 2 → PREDICTION 2 (Iteration 2), DATASET 3 → MODEL 3 → PREDICTION 3 (Iteration 3), DATASET 4 → MODEL 4 (Iteration 4), etc… The PREDICTIONs are combined with SUM]
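The iteration loop in the boosting diagram can be sketched for a regression problem: each new model is fit to the errors left by the sum of the previous models. The stump learner, the toy data, and the function names are assumptions, not the platform's Boosted Trees implementation.

```python
def fit_stump(xs, residuals):
    """Best single split: returns (threshold, left_mean, right_mean)."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def boost(xs, ys, iterations):
    """Fit each new stump to the ERROR of the current summed prediction."""
    models, predictions = [], [0.0] * len(xs)
    for _ in range(iterations):
        errors = [y - p for y, p in zip(ys, predictions)]
        t, lm, rm = fit_stump(xs, errors)
        models.append((t, lm, rm))
        predictions = [p + (lm if x < t else rm) for x, p in zip(xs, predictions)]
    return models


def predict(models, x):
    """SUM of the individual model predictions."""
    return sum(lm if x < t else rm for t, lm, rm in models)


def training_mse(models, xs, ys):
    return sum((y - predict(models, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 2, 3, 4, 5, 6, 7, 8]
for n in (1, 2, 4):
    print(n, "iterations, training MSE:", round(training_mse(boost(xs, ys, n), xs, ys), 3))
```

The training error shrinks with each iteration, which is also why boosted models should be evaluated on held-out data.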

Ensembles

Logistic Regression

Logistic Regression
????

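Logistic Regression
• Fits logistic function to probability of output class
• Result is a set of coefficients, one for each feature * class
Logistic Function:
$$l(x) = \frac{1}{1 + e^{-x}}$$
As $x \to -\infty$, $l(x) \to 0$; as $x \to \infty$, $l(x) \to 1$
Goal: regions where $P \approx 0$, $0 < P < 1$, and $P \approx 1$
$$f(X) = \beta_0 + \beta \cdot X = \beta_0 + \beta_1 x_1 + \dots + \beta_i x_i$$
$$P(X) = \frac{1}{1 + e^{-f(X)}}$$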
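A minimal sketch of the two formulas on this slide follows. The intercept and coefficients are made-up values for illustration, not the output of a trained model, and the plain `math.exp` call will overflow for very large negative inputs.

```python
import math


def logistic(x):
    """l(x) = 1 / (1 + e^-x): squashes any real number into (0, 1)."""
    return 1 / (1 + math.exp(-x))


def probability(features, intercept, coefficients):
    """P(X) = l(f(X)) with f(X) = b0 + b1*x1 + ... + bi*xi."""
    f = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logistic(f)


# Hypothetical coefficients for two features.
p = probability(features=[2.0, 1.0], intercept=-1.0, coefficients=[0.5, 0.25])
print(logistic(-10), logistic(0), logistic(10), p)
```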

Logistic Regression

Logistic Level Up
Outputs
Inputs

Logistic Level Up
wi
Class “a”, logistic(w, b)

Deepnets
Outputs
Inputs
Hidden layer

Deepnets
n hidden nodes?

Deepnets
n hidden layers?

BigML Deepnets
• The success of a Deepnet is dependent on getting the right
network structure for the dataset
• But, there are too many parameters:
• Nodes, layers, activation function, learning rate, etc…
• And setting them takes significant expert knowledge
• Solution: Metalearning (a good initial guess)
• Solution: Network search (try a bunch)

Deepnets

OptiML
• Each resource has several parameters that impact quality
• Number of trees, missing splits, nodes, weight
• Rather than trial and error, we can use ML to find ideal
parameters
• Why not make the model type (Decision Tree, Boosted Tree,
etc.) a parameter as well?
• Similar to Deepnet network search, but finds the optimum
machine learning algorithm and parameters for your data
automatically

Fusions
• Similar to an Ensemble, but we can mix different model types
• Logistic Regression, plus a Deepnet for example
• You can also create a fusion with different training sets!
• Last week, plus last month data, etc
• Or a Fusion of OptiML models
• Combines the “best of the best”

OptiML & Fusions

What just happened?
• We applied several different classification methods to the
Lending Club training data
• Decision Forest
• Random Decision Forest: Examined effect of threshold
• Boosted Trees
• Logistic Regression
• Deepnets
• OptiML: Optimized for Recall of bad
• Fusion: Created from top OptiML models
• Then we created an evaluation of each one of the methods
using the test dataset
• We compared the evaluations using a ROC curve

Supervised Learning
animal state … proximity action
tiger hungry … close run
elephant happy … far take picture
… … … … …
Classification
animal state … proximity min_kmh
tiger hungry … close 70
hippo angry … far 10
… …. … … …
Regression
label
We need different Evaluation Metrics…

Regression - Fitting a Line
Data Points
Model

Mean Absolute Error
[Figure: errors e1 … e7 between the data points and the model line]
$$\text{MAE} = \frac{|e_1| + |e_2| + \dots + |e_n|}{n}$$

Mean Squared Error
[Figure: errors e1 … e7 between the data points and the model line]
$$\text{MSE} = \frac{(e_1)^2 + (e_2)^2 + \dots + (e_n)^2}{n}$$

MSE versus MAE
• For both MAE & MSE: Smaller is better, but values are
unbounded
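• MSE is not always larger than MAE: when errors are smaller than 1, squaring shrinks them. The square root of MSE (RMSE) is always larger than or equal to MAE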
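Both metrics are easy to compute by hand, and doing so shows how their relative size depends on the scale of the errors. The error lists below are invented for illustration.

```python
import math


def mean_absolute_error(errors):
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    return sum(e ** 2 for e in errors) / len(errors)


large = [1.0, -2.0, 3.0]   # hypothetical errors larger than 1
small = [0.5, -0.5]        # hypothetical errors smaller than 1

for errors in (large, small):
    mae = mean_absolute_error(errors)
    mse = mean_squared_error(errors)
    print(errors, "MAE:", mae, "MSE:", round(mse, 3), "RMSE:", round(math.sqrt(mse), 3))
```

Squaring makes MSE more sensitive to a few large errors, while its square root stays comparable to MAE in the units of the objective.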

R-Squared Error
Data Points
Model
Mean

R-Squared Error
[Figure: distances v1 … v7 between the data points and the Mean]

R-Squared Error
[Figure: model errors e1 … e7 and distances from the Mean v1 … v7]
$$\text{RSE} = 1 - \frac{\text{MSE}_{\text{model}}}{\text{MSE}_{\text{mean}}}$$

R-Squared Error
• RSE: measure of how much better the model is than
always predicting the mean
• < 0 model is worse than the mean
• MSEmodel > MSEmean
• = 0 model is no better than the mean
• MSEmodel = MSEmean
• ➞ 1 model fits the data “perfectly”
• MSEmodel = 0 (or MSEmean >> MSEmodel)
$$\text{RSE} = 1 - \frac{\text{MSE}_{\text{model}}}{\text{MSE}_{\text{mean}}}$$
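The three cases above can be reproduced with a small sketch. The function name `r_squared` and the five data points are assumptions chosen so that each case is easy to verify by hand.

```python
def r_squared(actual, predicted):
    """1 - MSE(model) / MSE(always predicting the mean of the actual values)."""
    n = len(actual)
    mean = sum(actual) / n
    mse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mse_mean = sum((a - mean) ** 2 for a in actual) / n
    return 1 - mse_model / mse_mean


actual = [1, 2, 3, 4, 5]
perfect = r_squared(actual, [1, 2, 3, 4, 5])      # model matches every point
as_mean = r_squared(actual, [3, 3, 3, 3, 3])      # model always predicts the mean
reversed_fit = r_squared(actual, [5, 4, 3, 2, 1]) # model is worse than the mean
print(perfect, as_mean, reversed_fit)
```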

Regression Evaluation

What just happened?
• We started with the open loans in the Lending Club dataset
• Performed a train/test split
• Built a model to predict the int_rate without grade/sub-grade
• Evaluated this model

Time Series
Year | Pineapple Harvest (Tons)
1986 | 50.74
1987 | 22.03
1988 | 50.69
1989 | 40.38
1990 | 29.80
1991 | 9.90
1992 | 73.93
1993 | 22.95
1994 | 139.09
1995 | 115.17
1996 | 193.88
1997 | 175.31
1998 | 223.41
1999 | 295.03
2000 | 450.53
[Chart: Pineapple Harvest in Tons (0 to 500) by Year, 1986 to 2000, with a Trend line]

Exponential Smoothing
Idea: Each new value in the series depends on all previous
values with a decaying weight
For training values $x_t$
Smoothing Factor $0 < \alpha < 1$
Predicted values $s_t$
$$s_t = \alpha \cdot x_t + (1 - \alpha) \cdot s_{t-1}$$
[Chart: Weight, from 0 to 0.2, across values 1 to 13]
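The smoothing recurrence can be sketched directly. The slide does not say how the first smoothed value is chosen, so this sketch assumes it equals the first training value; the smoothing factor of 0.5 is likewise an arbitrary choice, and the input is the first four harvest values from the Time Series table.

```python
def exponential_smoothing(values, alpha):
    """Apply s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    smoothed = [values[0]]  # assumption: the first smoothed value is x_0
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


harvest = [50.74, 22.03, 50.69, 40.38]
alpha = 0.5
smoothed = exponential_smoothing(harvest, alpha)

# Unrolling the recurrence: a value k steps in the past gets weight alpha * (1 - alpha)^k.
weights = [alpha * (1 - alpha) ** k for k in range(4)]
print([round(s, 4) for s in smoothed], weights)
```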

Time Series

Supervised Learning
Label
Training

Unsupervised Learning
1. Training data provides “examples” - no specific “outcome”
2. The machine tries to find “interesting” patterns in the data
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51

Thr Sally 6788 sign food 26339 51
Anomaly Detection
unusual

Thr Sally 6788 sign food 26339 51
Clustering
similar

Thr Sally 6788 sign food 26339 51
zip = 46140
amount < 100
{customer = Bob, account = 3421}
{class = gas}
Association Discovery

Let’s build a recommender
Typical way to shop for a home…

Recommender Idea
[Diagram: Sample → Preference Data → Preference Model]
… then use the Preference Model to
filter all the homes on the market (All Homes For Sale)

Recommender Problem #1
What if there are really unusual homes in the data?
• A mansion with 20 bathrooms
• A home with no bedrooms
• A lot size that is smaller than the home?
We don’t want to show these as suggestions
because they are unusual….
How do we detect anomalies?

Isolation Forest
Grow a random decision tree until
each instance is in its own leaf
“easy” to isolate
“hard” to isolate
Depth
Now repeat the process several times and
use average Depth to compute anomaly
score: 0 (similar) -> 1 (dissimilar)
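The "easy to isolate" idea can be sketched with a single numeric feature. This is a simplified illustration, not a full Isolation Forest: it measures only the average number of random splits needed to isolate a value and does not convert that depth into a 0 to 1 score. The square-footage values are hypothetical.

```python
import random


def isolation_depth(values, target, rng):
    """Count random splits until `target` is alone on its side."""
    depth = 0
    current = list(values)
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:  # identical values cannot be separated
            break
        split = rng.uniform(lo, hi)
        # Keep only the values on the same side of the split as the target.
        current = [v for v in current if (v < split) == (target < split)]
        depth += 1
    return depth


def average_depth(values, target, trees=200, seed=0):
    """Repeat the random isolation several times and average the depth."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trees)) / trees


# Hypothetical home sizes with one very unusual value.
sqft = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 25000]
print("unusual home:", average_depth(sqft, 25000))
print("typical home:", average_depth(sqft, 2100))
```

A smaller average depth means the instance is easier to isolate and therefore more anomalous.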

Anomaly Detection

What just happened?
• We wanted to find and remove unusual houses.
• We created an Anomaly Detector and examined
the top anomalies.
• We found some unusual houses to remove and
discovered bad data (missing values) that we want
to fix.

A clever way to fix missing data
Let’s use Machine Learning…
SQFT | PRICE | BEDS | BATHS
3,125 | $530,000 | 5 | 3
2,100 | $460,000 | (missing, predicted: 4) | 2
1,200 | $250,000 | 3 | (missing, predicted: 1.5)
3,950 | $610,000 | 6 | 4
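The idea of filling a missing value with a prediction can be sketched as follows. A least-squares line on SQFT stands in for whatever model is actually trained, and the sketch assumes that the 2,100 sq ft home is the one missing its BEDS value; both are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


homes = [
    {"sqft": 3125, "beds": 5},
    {"sqft": 2100, "beds": None},  # missing value to fill in
    {"sqft": 1200, "beds": 3},
    {"sqft": 3950, "beds": 6},
]

# Train only on the rows where the target is known...
known = [h for h in homes if h["beds"] is not None]
slope, intercept = fit_line([h["sqft"] for h in known], [h["beds"] for h in known])

# ...then predict the target for the rows where it is missing.
for home in homes:
    if home["beds"] is None:
        home["beds"] = round(slope * home["sqft"] + intercept)

print(homes)
```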

WhizzML

What just happened?
• We had a Dataset with missing values.
• We wanted to apply an algorithm to fix the missing
values with Machine Learning
• Rather than write the algorithm, we found what we
needed in the WhizzML public gallery.
• Now that we have cloned the Script we can use it
again and again.
• We can write new ones too!

• How can we avoid showing essentially the
same house over and over?
[Diagram: All Homes → Sample, labeled “Modern”]

• How can we avoid showing essentially the
same house over and over?
[Diagram: All Homes grouped into “Modern” and “Lots of Land”, with a sample taken from each group]
• Great! What if we don’t know how to group
them? Or how many groups?

Clustering

What just happened?
• Since we don’t know how many groups of homes
there should be, we used G-means Clustering to find
the optimum number of groups of homes
• Our recommender will use these groups to create a
better sampling for user preference
• We also tried to understand the home clusters using
“model clusters” but the models were difficult to
interpret

Understanding Clusters Better
What if we could get rules like…
If SQFT >= 3,125 THEN “Cluster 1”
SQFT | PRICE | BEDS | BATHS | CLUSTER
3,125 | $530,000 | 5 | 3 | Cluster 1
2,100 | $460,000 | 4 | 2 | Cluster 3
1,200 | $250,000 | 3 | 1.5 | Cluster 5
3,950 | $610,000 | 6 | 4 | Cluster 1

Association Discovery

What just happened?
• We used a Batch Centroid to add the Cluster
assignment of each home as a feature to the Dataset
• We used Association Discovery to find “interesting”
relationships between the features including the Cluster
assignment

There is much more interesting information than just the
number of BEDS, BATHS, etc.
• Unfortunately, these "remarks" are not available in the
Redfin download
• Adding them to our dataset requires crawling the
website
• Like most ML projects, preparing the data is 80% of
the difficulty (fortunately I already did it!)

Topic Modeling

What just happened?
• We extended the home dataset with the syndicated
remarks text field
• We built a model to predict sale price and explored how
key words discovered in the remarks impacted price
• We used topic modeling to create a deeper thematic
understanding of the remarks
• Homes that are "in-town" or "out-of-town"
• We extended the dataset with fields that represent, for
each home, how related it is to each of these topics
• This will allow our clustering to group homes by a deeper
meaning than just BEDS, BATHS, etc.

Recommender Idea
[Diagram: samples from groups such as “Modern”, “Lots of Land”, and “Small” → Preference Data → Preference Model]
House Recommender
