Valencian Summer School in Machine Learning
4th edition
September 13-14, 2018
BigML, Inc 2
Intro to BigML
Making Machine Learning Beautifully Simple
Poul Petersen
CIO, BigML, Inc
Introduction to ML and BigML Platform
Sampling the Audience
Expert: Published papers at KDD, ICML, NIPS, etc or
developed own ML algorithms used at large scale
Acionado: Understands pros/cons of different
techniques and/or can tweak algorithms as needed
Practitioner: Very familiar with ML packages (Weka,
Scikit, BigML, etc.)
Newbie: Just taking Coursera ML class or reading an
introductory book to ML
Absolute beginner: ML sounds like science fiction
What is Machine Learning?
Finding patterns in data that can be used to
make inferences and build predictive models
A NEW PROGRAMMING PARADIGM
[Figure: complexity of tasks (- to +) plotted against time, 20th century to 21st century, with BigML marked]
Traditional Programming
LOST BAGGAGE POLICY
Programming with ML
PREDICT FLIGHT DELAYS
Modeling Churn
What just happened?
[Diagram: Churn Data → Model (asks, for example, “How many support calls?”) → Prediction: Churn=yes]
Some Terminology…
[Diagram: Training Data (Churn Data) → ML Platform → ML Resource (Model) → Prediction: Churn=yes]
The ML Platform creates ML Resources through:
• Modeling
• Clustering
• Anomaly Detection
• Association Discovery
“Consume” the model or “put into production”:
• Dashboard
• Custom Application
• Wearable / Edge device
• Batch Process
A Brief History of BigML
• BigML Mission: To make Machine
Learning Beautifully Simple
• BigML Founded in Corvallis,
Oregon in 2011 - long before ML
was "cool"
• You’ve never heard of it?
• Most innovative city in the United
States!
BigML Platform
[Architecture diagram]
• Web-based Frontend with Visualizations
• Tools - https://bigml.com/tools
• REST API - https://bigml.com/api
• Distributed Machine Learning Backend: SOURCE, DATASET, MODEL, PREDICTION, EVALUATION, SAMPLE, and WHIZZML servers
• Smart Infrastructure (auto-deployable, auto-scalable): SERVERS, EVENTS, GEARMAN QUEUE, MESSAGE QUEUE, RUNQUEUE SCALER, BUSY SCALER, AWS COSTS, and DESIRED / AUTO / ACTUAL TOPOLOGY
BigML Platform
[The same architecture diagram, with one added label: On-Premises]
Machine Learning Tasks
Question | ML Resource
Based on previous loan outcomes, will this customer default on a loan? | Models / Ensembles / Deepnets / LR
How many customers will apply for a loan next month? | Models, Ensembles, Deepnets
Based on trends, how much of this product will I sell next month? | Time Series
Is the consumption of this product unusual? | Anomaly Detection
Is the behavior of these customers similar? | Clusters
Are these products purchased together? | Association Discovery
What are the thematic contents of these documents? | Topic Models
Lending Club Loan Lifecycle
[Diagram: loan statuses grouped as “Open” or “Closed”: Current, In Grace Period, Late 16-30 Days, Late 31-120 Days, Default, Charged Off, Fully Paid]
(if (= (field "loan_status") "Fully Paid") "good" "bad")
s3://bigml-public/csv/lc_sample.csv.gz
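The Flatline rule on this slide can be mirrored in plain Python. This is only a minimal sketch: the function name `loan_quality` and the sample rows are made up for illustration and are not part of BigML's API.

```python
def loan_quality(loan_status: str) -> str:
    """Mirror of the Flatline rule: "Fully Paid" is good, anything else is bad."""
    return "good" if loan_status == "Fully Paid" else "bad"


# Hypothetical rows; only the loan_status field is used.
loans = [
    {"loan_status": "Fully Paid"},
    {"loan_status": "Charged Off"},
    {"loan_status": "Default"},
]

labels = [loan_quality(row["loan_status"]) for row in loans]
print(labels)
```

Applied to closed loans only, this yields the "quality" feature that the later slides use as the objective.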
Modeling Default
What just happened?
• We started with loan data from Lending Club pulled from S3
• At the Source step, we fixed some datatypes
• At the Dataset step, we used Flatline to filter and create the
loan "quality" feature
• When configuring the Model, we set some advanced settings:
• removed correlated features: int_rate
• enabled objective balancing (wait… why?)
Weighting
Instance | Rate | Payment | Status | Predict | Confidence
1 | 23% | 134 | Paid | Paid | 20%
2 | 23% | 134 | Paid | Paid | 25%
3 | 23% | 134 | Paid | Paid | 30%
... | ... | ... | ... | ... | ...
1000 | 23% | 134 | Paid | Paid | 99.5%
1001 | 23% | 134 | Default | Paid | 99.4%
Problem: Default is “more important” but occurs less often than Paid
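One common way to give a rare class equal consideration is to weight each class inversely to its frequency. The sketch below shows that idea for the 1000 Paid / 1 Default table; the helper name `balanced_weights` is hypothetical, and whether BigML's objective balancing computes exactly these values is an assumption.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class inversely to its frequency so that every class
    contributes the same total weight to training."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {cls: total / (n_classes * count) for cls, count in counts.items()}


# The situation in the table: 1000 Paid instances and a single Default.
statuses = ["Paid"] * 1000 + ["Default"]
weights = balanced_weights(statuses)
print(weights)
```

With these weights the single Default instance counts as much as all 1000 Paid instances together.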
What just happened?
• We started with loan data from Lending Club pulled from S3
• At the Source step, we fixed some datatypes
• At the Dataset step, we used Flatline to filter and create the
loan "quality" feature
• When configuring the Model, we set some advanced settings:
• removed correlated features: int_rate
• enabled objective balancing
• We explored the Model to see what factors predict default.
• We deployed the Model into a voice interface to make
Predictions.
Question: Can we trust this model?
Evaluations
[Diagram: DATASET is split into a TRAIN SET and a TEST SET; PREDICTIONS for the TEST SET are compared with the actual values to compute METRICS]
Evaluation Metrics
• Imagine we have a model that can predict a person’s dominant
hand, that is for any individual it predicts left / right
• Define the positive class
• This selection is arbitrary
• It is the class you are interested in!
• The negative class is the “other” class (or others)
• For this example, we choose: left
Evaluation Metrics
• We choose the positive class: left
• True Positive (TP)
• We predicted left and the correct answer was left
• True Negative (TN)
• We predicted right and the correct answer was right
• False Positive (FP)
• Predicted left but the correct answer was right
• False Negative (FN)
• Predicted right but the correct answer was left
Evaluation Metrics
True Positive: Correctly predicted the positive class
True Negative: Correctly predicted the negative class
False Positive: Incorrectly predicted the positive class
False Negative: Incorrectly predicted the negative class
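The four definitions can be sketched as a small counting function. The names and the ten sample predictions are illustrative assumptions, chosen so that the counts match the left/right example used on the following slides.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP, FN for one chosen positive class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1  # predicted positive, and it was positive
            else:
                fp += 1  # predicted positive, but it was negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, but it was positive
            else:
                tn += 1  # predicted negative, and it was negative
    return tp, tn, fp, fn


# Hypothetical sample: 3 left-handed and 7 right-handed people.
actual = ["left"] * 3 + ["right"] * 7
predicted = ["left", "left", "right", "left", "left",
             "right", "right", "right", "right", "right"]

tp, tn, fp, fn = confusion_counts(actual, predicted, positive="left")
print(tp, tn, fp, fn)
```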
Accuracy
$\mathrm{Accuracy} = \frac{TP + TN}{\mathrm{Total}}$
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Ex: 90% of people are right-handed and 10% are left-handed
• A silly model which always predicts right handed is
90% accurate
Accuracy
Positive class = Left, negative class = Right
Classified as Left Handed: TP = 0, FP = 0
Classified as Right Handed: TN = 7, FN = 3
$\mathrm{Accuracy} = \frac{TP + TN}{\mathrm{Total}} = \frac{0 + 7}{10} = 70\%$
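As a quick check of the numbers on this slide, accuracy can be computed directly from the four counts. The function name is an illustrative choice.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# The "always predict right handed" model from the slide:
# it never finds a left hander (TP = 0) yet is still right 7 times out of 10.
always_right = accuracy(tp=0, tn=7, fp=0, fn=3)
print(f"{always_right:.0%}")
```

The score looks respectable even though the model never identifies the positive class, which is why accuracy alone can mislead on unbalanced data.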
Precision
$\mathrm{Precision} = \frac{TP}{TP + FP}$
• “accuracy” or “purity” of positive class
• How well you did separating the positive class from the
negative class
• If Precision = 1 then no FP.
• You may have missed some left handers, but of the
ones you identied, all are left handed. No mistakes.
• If Precision = 0 then no TP
• None of the left handers you identified are actually left
handed. All mistakes.
Precision
Positive class = Left, negative class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
$\mathrm{Precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 2} = 50\%$
Recall
$\mathrm{Recall} = \frac{TP}{TP + FN}$
• percentage of positive class correctly identified
• A measure of how well you identified all of the positive
class examples
• If Recall = 1 then no FN → All left handers identified
• There may be FP, so precision could be <1
• If Recall = 0 then no TP → No left handers identified
Recall
Positive class = Left, negative class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
$\mathrm{Recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 1} \approx 67\%$
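Precision and recall for the example counts (TP = 2, FP = 2, FN = 1) can be sketched as follows. Returning 0.0 when the denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that really was positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of everything really positive, the fraction that was found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


p = precision(tp=2, fp=2)
r = recall(tp=2, fn=1)
print(f"precision={p:.3f} recall={r:.3f}")
```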
f-Measure
$f = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$
• harmonic mean of Recall & Precision
• If f-measure = 1 then Recall == Precision == 1
• If Precision OR Recall is small then the f-measure is small
f-Measure
Positive class = Left, negative class = Right
Classified as Left Handed / Classified as Right Handed
R = 67%, P = 50%, f = 57%
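The f-measure for this example follows from the precision and recall values. A minimal sketch, with the zero-denominator fallback as an assumed convention:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Precision = 2/4 and Recall = 2/3 from the left/right example.
f = f_measure(precision=0.5, recall=2 / 3)
print(f"{f:.3f}")
```

Because it is a harmonic mean, the result is pulled toward the smaller of the two inputs.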
Phi Coefficient
$\phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
• Returns a value between -1 and 1
• = -1: predictions are the opposite of reality
• = 0: no correlation between predictions and reality
• = 1: predictions are always correct
Phi Coefficient
Positive class = Left, negative class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
Phi = 0.356
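The Phi value on this slide can be reproduced from the four counts. This is a sketch; returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def phi_coefficient(tp, tn, fp, fn):
    """Correlation between predictions and reality, from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


phi = phi_coefficient(tp=2, tn=5, fp=2, fn=1)
print(round(phi, 3))
```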
Evaluations
What just happened?
• We split the Lending Club data into training and test Datasets
• We created a Model and Evaluation
• Looking at the Accuracy, we saw that the Model appeared to be
performing well, but only because of unbalanced classes
• The resulting Model did well at predicting good loans
• But bad loans are "more important"
• We tried different weights to increase the Recall of bad loans:
• objective balancing: equal consideration
• class weights: bad = 1000, good = 1
• Finally, we explored the impact of changing the probability
threshold
Evaluation
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
• Don’t forget that accuracy can be misleading!
• Mostly useless with unbalanced classes (left/right?)
• Use weighting, operating points, other tricks…
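Repeating the train/test split can be sketched as k-fold cross-validation, where every row is used for testing exactly once. The function below is an illustrative stand-alone version with hypothetical names, not BigML's implementation.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled row indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_rows=10, k=5))
for train, test in splits:
    print(sorted(test), "held out;", len(train), "rows for training")
```

A model would be trained and evaluated once per split and the metrics averaged, which reduces the chance of judging the model by one "lucky" split.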
What else can we try?
• Rather than build a single model…
• Combine the output of several typically
“weaker” models into a powerful ensemble…
• How do we create unique models from the
same training dataset?
Ensembles!
Decision Forest
[Diagram: DATASET → SAMPLE 1-4 → MODEL 1-4 → PREDICTION 1-4 → COMBINER → PREDICTION]
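The flow in this diagram (sample the dataset, train one model per sample, combine the predictions) can be sketched as follows. The one-split "stump" stands in for a full decision tree, and the data, names, and majority-vote combiner are assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(rows):
    """Weak model: the single threshold on x that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for below, above in (("good", "bad"), ("bad", "good")):
            errors = sum((below if x < threshold else above) != y for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return lambda x: below if x < threshold else above


def train_forest(rows, n_models=4, seed=0):
    """Train each model on its own random sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]
        models.append(train_stump(sample))
    return models


def predict(models, x):
    """Combiner: majority vote over the individual predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical data: (interest rate, loan quality).
rows = [(5, "good"), (7, "good"), (9, "good"), (11, "good"),
        (18, "bad"), (21, "bad"), (24, "bad"), (27, "bad")]
forest = train_forest(rows)
print(predict(forest, 1), predict(forest, 30))
```

Because each model sees a different sample, the models differ from one another even though they come from the same training dataset.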
Random Decision Forest
[Diagram: the same flow as the Decision Forest (DATASET → SAMPLE 1-4 → MODEL 1-4 → PREDICTION 1-4 → COMBINER → PREDICTION), with SAMPLE 1 shown a second time]
Boosting
Iteration 1: DATASET → MODEL 1 → PREDICTION 1 → ERROR
Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2 → ERROR
Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3 → ERROR
Iteration 4: DATASET 4 → MODEL 4 → …
etc…
The final PREDICTION is the SUM of the individual predictions
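The iteration scheme in this diagram can be sketched for a numeric objective: each new model is trained on the errors left by the models before it, and the final prediction is the sum. The stump learner, the `learning_rate` shrinkage factor, and the data are assumptions for illustration rather than BigML's exact algorithm.

```python
def fit_stump(xs, residuals):
    """Weak regressor: one split on x, predicting the mean residual on each side."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def boost(xs, ys, iterations, learning_rate=0.5):
    """Each iteration fits a new model to the errors of the current prediction."""
    models = []
    predictions = [0.0] * len(xs)
    for _ in range(iterations):
        errors = [y - p for y, p in zip(ys, predictions)]
        model = fit_stump(xs, errors)
        models.append(model)
        predictions = [p + learning_rate * model(x) for x, p in zip(xs, predictions)]
    # Final prediction: the (shrunken) sum of all the models' outputs.
    return lambda x: sum(learning_rate * model(x) for model in models)


def squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
error_after_1 = squared_error(boost(xs, ys, iterations=1), xs, ys)
error_after_4 = squared_error(boost(xs, ys, iterations=4), xs, ys)
print(error_after_1, error_after_4)
```

On the training data the error shrinks with each added iteration, which is the point of correcting the previous models' mistakes.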
Ensembles
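Logistic Regression
Goal
• Fits logistic function to probability of output class
• Result is a set of coefficients, one for each feature * class
Logistic Function
$l(x) = \frac{1}{1 + e^{-x}}$
$x \to -\infty$: $l(x) \to 0$
$x \to \infty$: $l(x) \to 1$
$f(X) = \beta_0 + \boldsymbol{\beta} \cdot X = \beta_0 + \beta_1 x_1 + \dots + \beta_i x_i$
$P(X) = \frac{1}{1 + e^{-f(X)}}$
[Figure regions: $P \approx 0$, $0 < P < 1$, $P \approx 1$]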
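The two formulas on this slide translate directly into code. In this sketch the intercept and coefficients are made-up numbers; a real model would learn them from training data.

```python
import math


def logistic(x):
    """The logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_probability(features, intercept, coefficients):
    """P(X) = logistic(f(X)), with f(X) = beta_0 + beta_1*x_1 + ... + beta_i*x_i."""
    f = intercept + sum(beta * x for beta, x in zip(coefficients, features))
    return logistic(f)


# Hypothetical coefficients chosen so that f(X) = -1 + 0.5*2 + 0*1 = 0.
p = predict_probability([2.0, 1.0], intercept=-1.0, coefficients=[0.5, 0.0])
print(p, logistic(-30), logistic(30))
```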
Logistic Level Up
Outputs
Inputs
wi
Class “a”, logistic(w, b)
Deepnets
Outputs
Inputs
Hidden layer
n hidden nodes?
n hidden layers?
BigML Deepnets
• The success of a Deepnet is dependent on getting the right
network structure for the dataset
• But, there are too many parameters:
• Nodes, layers, activation function, learning rate, etc…
• And setting them takes significant expert knowledge
• Solution: Metalearning (a good initial guess)
• Solution: Network search (try a bunch)
Deepnets
OptiML
• Each resource has several parameters that impact quality
• Number of trees, missing splits, nodes, weight
• Rather than trial and error, we can use ML to find ideal
parameters
• Why not make the model type, Decision Tree, Boosted Tree,
etc, a parameter as well?
• Similar to Deepnet network search, but finds the optimum
machine learning algorithm and parameters for your data
automatically
Fusions
• Similar to an Ensemble, but we can mix different model types
• Logistic Regression, plus a Deepnet for example
• You can also create a fusion with different training sets!
• Last week, plus last month data, etc
• Or a Fusion of OptiML models
• Combines the “best of the best”
OptiML & Fusions
What just happened?
• We applied several different classification methods to the
Lending Club training data
• Decision Forest
• Random Decision Forest: Examined effect of threshold
• Boosted Trees
• Logistic Regression
• Deepnets
• OptiML: Optimized for Recall of bad
• Fusion: Created from top OptiML models
• Then we created an evaluation of each one of the methods
using the test dataset
• We compared the evaluations using a ROC curve
Supervised Learning
Classification (label: action)
animal | state | … | proximity | action
tiger | hungry | … | close | run
elephant | happy | … | far | take picture
… | … | … | … | …
Regression (label: min_kmh)
animal | state | … | proximity | min_kmh
tiger | hungry | … | close | 70
hippo | angry | … | far | 10
… | … | … | … | …
We need different Evaluation Metrics…
Regression - Fitting a Line
Data Points
Model
Mean Absolute Error
[Figure: errors e1 … e7 between the Data Points and the Model line]
$\mathrm{MAE} = \frac{|e_1| + |e_2| + \dots + |e_n|}{n}$
Mean Squared Error
[Figure: errors e1 … e7 between the Data Points and the Model line]
$\mathrm{MSE} = \frac{e_1^2 + e_2^2 + \dots + e_n^2}{n}$
MSE versus MAE
• For both MAE & MSE: Smaller is better, but values are
unbounded
• The square root of MSE (RMSE) is always larger than or equal to MAE;
MSE itself can be smaller than MAE when the errors are below 1
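Computing both metrics on a set of small errors and a set of large errors shows how they relate. Note that MSE falls below MAE when the errors are smaller than 1, while the square root of MSE never does. The error values are made up for illustration.

```python
import math


def mae(errors):
    """Mean Absolute Error."""
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    """Mean Squared Error."""
    return sum(e * e for e in errors) / len(errors)


small = [0.1, -0.2, 0.3]   # errors below 1: squaring shrinks them
large = [1.0, -2.0, 3.0]   # errors of 1 or more: squaring grows them

for name, errors in (("small", small), ("large", large)):
    print(name, "MAE:", mae(errors), "MSE:", mse(errors),
          "RMSE:", math.sqrt(mse(errors)))
```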
R-Squared Error
[Figure: Data Points, the Model line, and the Mean; errors e1 … e7 are measured from the Model, deviations v1 … v7 from the Mean]
$RSE = 1 - \frac{MSE_{model}}{MSE_{mean}}$
R-Squared Error
$RSE = 1 - \frac{MSE_{model}}{MSE_{mean}}$
• RSE: measure of how much better the model is than
always predicting the mean
• < 0: model is worse than the mean ($MSE_{model} > MSE_{mean}$)
• = 0: model is no better than the mean ($MSE_{model} = MSE_{mean}$)
• → 1: model fits the data “perfectly” ($MSE_{model} = 0$, or $MSE_{mean} \gg MSE_{model}$)
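The three cases listed on this slide can be checked with a few lines of code. The function name and sample values are illustrative.

```python
def r_squared(actual, predicted):
    """1 - MSE_model / MSE_mean, where MSE_mean is the error of always
    predicting the mean of the actual values."""
    n = len(actual)
    mean = sum(actual) / n
    mse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mse_mean = sum((a - mean) ** 2 for a in actual) / n
    return 1 - mse_model / mse_mean


actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, actual)                 # model error is zero
baseline = r_squared(actual, [2.5] * 4)             # always predicts the mean
worse = r_squared(actual, [4.0, 3.0, 2.0, 1.0])     # predictions are reversed
print(perfect, baseline, worse)
```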
Regression Evaluation
What just happened?
• We started with the open loans in the Lending Club dataset
• Performed a train/test split
• Built a model to predict the int_rate without grade/sub-grade
• Evaluated this model
Time Series
Year | Pineapple Harvest (tons)
1986 | 50.74
1987 | 22.03
1988 | 50.69
1989 | 40.38
1990 | 29.80
1991 | 9.90
1992 | 73.93
1993 | 22.95
1994 | 139.09
1995 | 115.17
1996 | 193.88
1997 | 175.31
1998 | 223.41
1999 | 295.03
2000 | 450.53
[Chart: Pineapple Harvest in tons (0 to 500) by Year (1986 to 2000), with a Trend line]
Exponential Smoothing
Idea: Each new value in the series depends on all previous
values with a decaying weight
For training values $x_t$
Smoothing factor $0 < \alpha < 1$
Predicted values $s_t$
$s_t = \alpha x_t + (1 - \alpha) s_{t-1}$
[Chart: Weight (0 to 0.2) against lag (1 to 13), decaying]
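The smoothing recurrence can be sketched in a few lines. Starting the series with the first observed value is an assumption made here, since the slide does not say how the first smoothed value is chosen; the input is the first four pineapple harvest values.

```python
def exponential_smoothing(values, alpha):
    """Apply s_t = alpha * x_t + (1 - alpha) * s_(t-1) along the series."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    smoothed = [values[0]]  # assumption: the first smoothed value is the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


harvest = [50.74, 22.03, 50.69, 40.38]
smoothed = exponential_smoothing(harvest, alpha=0.5)
print(smoothed)
```

Unrolling the recurrence shows that an observation k steps in the past carries weight alpha * (1 - alpha)^k, which is the decaying weight the slide describes.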
Time Series
Supervised Learning
Label
Training
Unsupervised Learning
1. Training data provides “examples” - no specific “outcome”
2. The machine tries to find “interesting” patterns in the data
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thu Sally 6788 sign food 26339 51
Unsupervised Learning
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thu Sally 6788 sign food 26339 51
Anomaly Detection
unusual
Unsupervised Learning
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thu Sally 6788 sign food 26339 51
Clustering
similar
Unsupervised Learning
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thu Sally 6788 sign food 26339 51
Association Discovery
{customer = Bob, account = 3421} → zip = 46140
{class = gas} → amount < 100
Let’s build a recommender
Typical way to shop for a home…
Recommender Idea
[Diagram: a Sample of All Homes For Sale is rated by the user (?) to produce Preference Data, which trains a Preference Model]
… then use the Preference Model to
filter all the homes on the market
Recommender Problem #1
What if there are really unusual homes in the data?
• A mansion with 20 bathrooms
• A home with no bedrooms
• A lot size that is smaller than the home?
We don’t want to show these as suggestions
because they are unusual….
How do we detect anomalies?
Isolation Forest
Grow a random decision tree until
each instance is in its own leaf
[Figure: Depth at which an instance is isolated; a shallow Depth means “easy” to isolate, a large Depth means “hard” to isolate]
Now repeat the process several times and
use average Depth to compute anomaly
score: 0 (similar) -> 1 (dissimilar)
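The idea can be sketched on a single numeric field. Each "tree" here is just a sequence of random splits that follows one value until it is alone. The score normalization is the one from the original Isolation Forest paper; whether BigML uses exactly this formula is an assumption, and the bathroom counts are made up.

```python
import math
import random


def isolation_depth(value, values, rng, depth=0):
    """Number of random splits needed before `value` is separated from the rest."""
    if len(values) <= 1:
        return depth
    low, high = min(values), max(values)
    if low == high:  # only identical values remain; they cannot be split further
        return depth
    split = rng.uniform(low, high)
    same_side = [v for v in values if (v < split) == (value < split)]
    return isolation_depth(value, same_side, rng, depth + 1)


def average_depth(value, values, n_trees=200, seed=0):
    """Repeat the random isolation several times and average the Depth."""
    rng = random.Random(seed)
    return sum(isolation_depth(value, values, rng) for _ in range(n_trees)) / n_trees


def anomaly_score(avg_depth, n):
    """Map average depth to (0, 1); shallow depths score closer to 1."""
    c = 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    return 2 ** (-avg_depth / c)


# Hypothetical bathroom counts, including one mansion with 20 bathrooms.
bathrooms = [1, 1.5, 2, 2, 2.5, 3, 3, 3.5, 20]
depth_mansion = average_depth(20, bathrooms)
depth_typical = average_depth(2, bathrooms)
score_mansion = anomaly_score(depth_mansion, len(bathrooms))
score_typical = anomaly_score(depth_typical, len(bathrooms))
print(score_mansion, score_typical)
```

The mansion is usually cut off by the very first split, so its average Depth is small and its score is high.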
Anomaly Detection
What just happened?
• We wanted to find and remove unusual houses.
• We created an Anomaly Detector and examined
the top anomalies.
• We found some unusual houses to remove and
discovered bad data (missing values) that we want
to x.
BigML, Inc 84Introduction to ML and BigML Platform
A clever way to x missing data
Let’s use Machine Learning…
BEDS BATHS
SQFT PRICE BEDS BATHS
3.125 $530.000 5 3
2.100 $460.000 2
1.200 $250.000 3
3.950 $610.000 6 4
4
1.5
WhizzML
What just happened?
• We had a Dataset with missing values.
• We wanted to apply an algorithm to fix the missing
values with Machine Learning
• Rather than write the algorithm, we found what we
needed in the WhizzML public gallery.
• Now that we have cloned the Script we can use it
again and again.
• We can write new ones too!
Recommender Problem #2
• How can we avoid showing essentially the
same house over and over?
[Diagram: All Homes divided into groups such as “Modern” and “Lots of Land”, with a sample (?) taken from each group]
• Great! What if we don’t know how to group
them? Or how many groups?
Clustering
What just happened?
• Since we don’t know how many groups of homes
there should be, we used G-means Clustering to find
the optimum number of groups of homes
• Our recommender will use these groups to create a
better sampling for user preference
• We also tried to understand the home clusters using
“model clusters” but the models were difficult to
interpret
Understanding Clusters Better
What if we could get rules like…
If SQFT >= 3,125 THEN “Cluster 1”
SQFT | PRICE | BEDS | BATHS | CLUSTER
3,125 | $530,000 | 5 | 3 | Cluster 1
2,100 | $460,000 | 4 | 2 | Cluster 3
1,200 | $250,000 | 3 | 1.5 | Cluster 5
3,950 | $610,000 | 6 | 4 | Cluster 1
Association Discovery
What just happened?
• We used a Batch Centroid to add the Cluster
assignment of each home as a feature to the Dataset
• We used Association Discovery to find “interesting”
relationships between the features including the Cluster
assignment
Recommender Problem #3
There is much more interesting information than just the
number of BEDS, BATHS, etc.
• Unfortunately, these "remarks" are not available in the
Redn download
• Adding them to our dataset requires crawling the
website
• Like most ML projects, preparing the data is 80% of
the difculty (fortunately I already did it!)
Topic Modeling
What just happened?
• We extended the home dataset with the syndicated
remarks text field
• We built a model to predict sale price and explored how
key words discovered in the remarks impacted price
• We used topic modeling to create a deeper thematic
understanding of the remarks
• Homes that are "in-town" or "out-of-town"
• We extended the dataset with fields that represent for
each home how related they are to each of these topics
• This will allow our clustering to group homes by a deeper
meaning than just BEDS, BATHS, etc
Recommender Idea
[Diagram: homes sampled from each cluster (for example “Modern”, “Lots of Land”, “Small”) are rated by the user (?) to produce Preference Data, which trains a Preference Model]
House Recommender
VSSML18. Introduction to Machine Learning and the BigML Platform

VSSML18. Introduction to Machine Learning and the BigML Platform

  • 1.
    Valencian Summer Schoolin Machine Learning 4th edition September 13-14, 2018
  • 2.
    BigML, Inc 2 Introto BigML Making Machine Learning Beautifully Simple <AUTHOR> <TITLE>, BigML, Inc Poul Petersen CIO, BigML, Inc
  • 3.
    BigML, Inc 3Introductionto ML and BigML Platform Sampling the Audience Expert: Published papers at KDD, ICML, NIPS, etc or developed own ML algorithms used at large scale Acionado: Understands pros/cons of different techniques and/or can tweak algorithms as needed Practitioner: Very familiar with ML packages (Weka, Scikit, BigML, etc.) Newbie: Just taking Coursera ML class or reading an introductory book to ML Absolute beginner: ML sounds like science ction
  • 4.
    BigML, Inc 4Introductionto ML and BigML Platform What is Machine Learning? Finding patterns in data that can be used to make inference predictive models
  • 5.
    BigML, Inc 5Introductionto ML and BigML Platform BigMLCOMPLEXITYOFTASKS TIME20th century 21st century - + A NEW PROGRAMMING PARADIGM
  • 6.
    BigML, Inc 6Introductionto ML and BigML Platform Traditional Programming LOST BAGGAGE POLICY
  • 7.
    BigML, Inc 7Introductionto ML and BigML Platform Programming with ML PREDICT FLIGHT DELAYS
  • 8.
    BigML, Inc 8Introductionto ML and BigML Platform Programming with ML
  • 9.
    BigML, Inc 9Introductionto ML and BigML Platform Modeling Churn
  • 10.
    BigML, Inc 10Introductionto ML and BigML Platform What just happened? Churn Data How many support calls? Model Prediction: Churn=yes
  • 11.
    BigML, Inc 11Introductionto ML and BigML Platform Some Terminology… Churn Data Model Prediction: Churn=yes Training Data • Modeling • Clustering • Anomaly Detection • Association Discovery ML Resource ML Platform “Consume” the model or “put into production” • Dashboard • Custom Application • Wearable / Edge device • Batch Process
  • 12.
    BigML, Inc 12Introductionto ML and BigML Platform A Brief History of BigML • BigML Mission: To make Machine Learning Beautifully Simple • BigML Founded in Corvallis, Oregon in 2011 - long before ML was "cool" • You’ve never heard of it? • Most innovative city in the United States!
  • 13.
    BigML, Inc 13Introductionto ML and BigML Platform A Brief History of BigML
  • 14.
    BigML, Inc 14Introductionto ML and BigML Platform BigML Platform Web-based Frontend Visualizations Distributed Machine Learning Backend SOURCE SERVER DATASET SERVER MODEL SERVER PREDICTION SERVER EVALUATION SERVER SAMPLE SERVER WHIZZML SERVER Tools - https://bigml.com/tools REST API - https://bigml.com/api Smart Infrastructure (auto-deployable, auto-scalable) SERVERS EVENTS GEARMAN QUEUE DESIRED TOPOLOGY AWS COSTS RUNQUEUE SCALER BUSY SCALER AUTO TOPOLOGY AUTO TOPOLOGY AUTO TOPOLOGY AUTO TOPOLOGY ACTUAL TOPOLOGY MESSAGE QUEUE
  • 15.
    BigML, Inc 15Introductionto ML and BigML Platform BigML Platform Web-based Frontend Visualizations Distributed Machine Learning Backend SOURCE SERVER DATASET SERVER MODEL SERVER PREDICTION SERVER EVALUATION SERVER SAMPLE SERVER WHIZZML SERVER Tools - https://bigml.com/tools REST API - https://bigml.com/api Smart Infrastructure (auto-deployable, auto-scalable) SERVERS EVENTS GEARMAN QUEUE DESIRED TOPOLOGY AWS COSTS RUNQUEUE SCALER BUSY SCALER AUTO TOPOLOGY AUTO TOPOLOGY AUTO TOPOLOGY AUTO TOPOLOGY ACTUAL TOPOLOGY MESSAGE QUEUE On-Premises
  • 16.
    BigML, Inc 16Introductionto ML and BigML Platform Machine Learning Tasks Question ML Resource Based on previous loan outcomes, Will this customer default on a loan? Models / Ensembles
 Deepnets / LR How many customers will apply for a loan next month? Models, Ensembles, Deepnets Based on trends, how much of this product will I sell next month? Time Series Is the consumption of this product unusual? Anomaly Detection Is the behavior of these customers similar? Clusters Are these products purchased together? Association Discovery What are the thematic contents of these documents? Topic Models
  • 17.
    BigML, Inc 17Introductionto ML and BigML Platform Lending Club Loan Lifecycle “Closed”“Open” Fully Paid Late 16-30 Days Late 31-120 Days Charged Off Default Current In Grace Period ( if ( = ( field "loan_status" ) "Fully Paid" ) "good", "bad" ) s3://bigml-public/csv/lc_sample.csv.gz
  • 18.
    BigML, Inc 18Introductionto ML and BigML Platform Modeling Default
  • 19.
    BigML, Inc 19Introductionto ML and BigML Platform What just happened? • We started with loan data from Lending Club pulled from S3 • At the Source step, we fixed some datatypes • At the Dataset step, we used Flatline to filter and create the loan "quality" feature • When configuring the Model, we set some advanced settings: • removed correlated features: int_rate • enabled objective balancing (wait… why?)
  • 20.
    BigML, Inc 20Introductionto ML and BigML Platform Weighting Instance Rate Payment Status Predict Confidence 1 23 % 134 Paid Paid 20 % 2 23 % 134 Paid Paid 25 % 3 23 % 134 Paid Paid 30 % ... ... ... ... ... 1000 23 % 134 Paid Paid 99,5 % 1001 23 % 134 Default Paid 99,4 % Problem: Default is “more important” but occurs less often than Paid
  • 21.
    BigML, Inc 21Introductionto ML and BigML Platform What just happened? • We started with loan data from Lending Club pulled from S3 • At the Source step, we fixed some datatypes • At the Dataset step, we used Flatline to filter and create the loan "quality" feature • When configuring the Model, we set some advanced settings: • removed correlated features: int_rate • enabled objective balancing • We explored the Model to see what factors predict default. • We deployed the Model into a voice interface to make Predictions. Question: Can we trust this model?
  • 22.
    BigML, Inc 22Introductionto ML and BigML Platform Evaluations DATASET TRAIN SET TEST SET PREDICTIONS METRICS ? ? ? ? ? ?
  • 23.
    BigML, Inc 23Introductionto ML and BigML Platform Evaluation Metrics • Imagine we have a model that can predict a person’s dominant hand, that is for any individual it predicts left / right • Define the positive class • This selection is arbitrary • It is the class you are interested in! • The negative class is the “other” class (or others) • For this example, we choose : left
  • 24.
    BigML, Inc 24Introductionto ML and BigML Platform Evaluation Metrics • We choose the positive class: left • True Positive (TP) • We predicted left and the correct answer was left • True Negative (TN) • We predicted right and the correct answer was right • False Positive (FP) • Predicted left but the correct answer was right • False Negative (FN) • Predict right but the correct answer was left
  • 25.
    BigML, Inc 25Introductionto ML and BigML Platform Evaluation Metrics True Positive: Correctly predicted the positive class True Negative: Correctly predicted the negative class False Positive: Incorrectly predicted the positive class False Negative: Incorrectly predicted the negative class
  • 26.
    BigML, Inc 26Introductionto ML and BigML Platform Accuracy TP + TN Total • “Percentage correct” - like an exam • If Accuracy = 1 then no mistakes • If Accuracy = 0 then all mistakes • Intuitive but not always useful • Watch out for unbalanced classes! • Ex: 90% of people are right-handed and 10% are left • A silly model which always predicts right handed is 90% accurate
  • 27.
    BigML, Inc 27Introductionto ML and BigML Platform Accuracy Classied as Left Handed Classied as Right Handed TP = 0 FP = 0 TN = 7 FN = 3 = Left = RightPositive Class Negative Class TP + TN Total = 70%
  • 28.
    BigML, Inc 28Introductionto ML and BigML Platform Precision TP TP + FP • “accuracy” or “purity” of positive class • How well you did separating the positive class from the negative class • If Precision = 1 then no FP. • You may have missed some left handers, but of the ones you identified, all are left handed. No mistakes. • If Precision = 0 then no TP • None of the left handers you identified are actually left handed. All mistakes.
  • 29.
    BigML, Inc 29Introductionto ML and BigML Platform Precision Classied as Left Handed Classied as Right Handed TP = 2 FP = 2 TN = 5 FN = 1 Positive Class Negative Class = Left = Right TP TP + FP = 50%
  • 30.
    BigML, Inc 30Introductionto ML and BigML Platform Recall TP TP + FN • percentage of positive class correctly identified • A measure of how well you identified all of the positive class examples • If Recall = 1 then no FN → All left handers identified • There may be FP, so precision could be <1 • If Recall = 0 then no TP → No left handers identified
  • 31.
    Recall (example)
    Positive class = Left, negative class = Right
    • Classified as Left Handed: TP = 2, FP = 2
    • Classified as Right Handed: TN = 5, FN = 1
    Recall = TP / (TP + FN) = 2 / 3 ≈ 66.7%
  • 32.
    f-Measure
    f-measure = 2 * Recall * Precision / (Recall + Precision)
    • Harmonic mean of Recall and Precision
    • If f-measure = 1 then Recall = Precision = 1
    • If Precision OR Recall is small then the f-measure is small
  • 33.
    f-Measure (example)
    Positive class = Left, negative class = Right
    • Columns: Classified as Left Handed, Classified as Right Handed
    R ≈ 66.7%, P = 50%, f = 2 * R * P / (R + P) ≈ 57%
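The accuracy, precision, recall and f-measure figures in these examples can be reproduced from the four counts. This is a minimal sketch in plain Python; the function names are assumptions, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Undefined when nothing was predicted positive; 0.0 is returned by convention here.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Worked example: TP = 2, FP = 2, TN = 5, FN = 1
p = precision(tp=2, fp=2)
r = recall(tp=2, fn=1)
f = f_measure(p, r)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")

# A model that always predicts "right": TP = 0, FP = 0, TN = 7, FN = 3
silly_accuracy = accuracy(tp=0, tn=7, fp=0, fn=3)
silly_recall = recall(tp=0, fn=3)
print(f"accuracy={silly_accuracy:.2f} recall={silly_recall:.2f}")
```

The second case shows why accuracy alone can mislead: the always-right model scores 70% accuracy while finding none of the left-handers.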
  • 34.
    Phi Coefficient
    Phi = (TP*TN - FP*FN) / SQRT[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
    • Returns a value between -1 and 1
    • = -1: predictions are the opposite of reality
    • = 0: no correlation between predictions and reality
    • = 1: predictions are always correct
  • 35.
    Phi Coefficient (example)
    Positive class = Left, negative class = Right
    • Classified as Left Handed: TP = 2, FP = 2
    • Classified as Right Handed: TN = 5, FN = 1
    Phi = (2*5 - 2*1) / SQRT[4 * 3 * 7 * 6] ≈ 0.356
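The Phi value for these counts can be checked with a few lines of Python. The function name is an assumption, and returning 0.0 when the denominator is zero is a convention chosen for this sketch.

```python
import math

def phi_coefficient(tp, tn, fp, fn):
    """Phi = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # When a whole row or column of the confusion matrix is empty the
    # denominator is zero; 0.0 ("no correlation") is returned by convention here.
    return (tp * tn - fp * fn) / denom if denom else 0.0

phi = phi_coefficient(tp=2, tn=5, fp=2, fn=1)
print(round(phi, 3))
```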
  • 36.
    Evaluations
  • 37.
    What just happened?
    • We split the Lending Club data into training and test Datasets
    • We created a Model and an Evaluation
    • Looking at the Accuracy, the Model appeared to perform well, but only because of unbalanced classes
    • The resulting Model did well at predicting good loans
    • But bad loans are "more important"
    • We tried different weights to increase the Recall of bad loans:
    • objective balancing: equal consideration
    • class weights: bad = 1000, good = 1
    • Finally, we explored the impact of changing the probability threshold
  • 38.
    Evaluation
    • Never evaluate with the training data!
    • Many models are able to “memorize” the training data
    • This will result in overly optimistic evaluations!
    • If you only have one Dataset, use a train/test split
    • Even a train/test split may not be enough!
    • You might get a “lucky” split
    • The solution is to repeat the split several times (formally, to cross-validate)
    • Don’t forget that accuracy can be misleading!
    • Mostly useless with unbalanced classes (left/right?)
    • Use weighting, operating points, other tricks…
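Repeating the train/test split can be sketched as k-fold cross-validation: every row is used for testing exactly once and for training in all the other rounds. The helper below only produces row indices; its name and parameters are assumptions for illustration, and it is not how BigML implements cross-validation.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n rows."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    # Train a model on `train`, evaluate it on `test`, then average the k evaluations.
    print(sorted(test), len(train))
```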
  • 39.
    What else can we try?
    • Rather than build a single model…
    • Combine the output of several, typically “weaker”, models into a powerful ensemble…
    • How do we create unique models from the same training dataset?
    Ensembles!
  • 40.
    Decision Forest
    [Diagram: DATASET → SAMPLE 1 to 4 → MODEL 1 to 4 → PREDICTION 1 to 4 → COMBINER → PREDICTION]
  • 41.
    Random Decision Forest
    [Diagram: DATASET → SAMPLE 1 to 4 → MODEL 1 to 4 → PREDICTION 1 to 4 → COMBINER → PREDICTION, with an additional SAMPLE 1 step shown at the models]
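The two forest diagrams can be read as three small operations: draw a different sample of the rows for each model, optionally restrict the candidate features at random (the usual meaning of a random decision forest), and combine the individual predictions. The sketch below shows only those operations, not tree building; all function names are assumptions, and a majority vote is just one possible combiner.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Sample rows with replacement, so each model sees a slightly different dataset."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, k, rng):
    """Pick k candidate features at random (random forests do this at each split)."""
    return rng.sample(features, k)

def combine_by_vote(predictions):
    """Combiner for classification: the most common prediction wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(8))                         # stand-ins for training rows
features = ["amount", "term", "income", "purpose"]

samples = [bootstrap_sample(rows, rng) for _ in range(4)]
candidates = random_feature_subset(features, 2, rng)
final = combine_by_vote(["good", "bad", "good", "good"])
print(samples[0], candidates, final)
```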
  • 42.
    Boosting
    [Diagram: Iteration 1: DATASET → MODEL 1 → PREDICTION 1 → ERROR; Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2 → ERROR; Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3 → ERROR; Iteration 4: DATASET 4 → MODEL 4; etc… The individual predictions are added together (SUM) to give the final PREDICTION]
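The boosting loop in the diagram can be sketched for a one-feature regression problem: each iteration fits a very weak model (a single-split "stump") to the errors left by the previous iterations, and the final prediction is the sum of all the models. Everything here is an assumption made for illustration (the stump learner, the learning rate of 0.5, the toy data); BigML's Boosted Trees are not implemented this way in detail.

```python
def fit_stump(xs, residuals):
    """Fit a one-split model to the residuals (needs at least two distinct x values)."""
    best = None
    for t in sorted(set(xs))[:-1]:            # candidate thresholds
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, iterations=4, learning_rate=0.5):
    """Return (predict function, remaining errors) after the given number of iterations."""
    models = []
    residuals = list(ys)                      # iteration 1 models the raw target
    for _ in range(iterations):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        # The next iteration's "dataset" is the error left by this one.
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return (lambda x: sum(learning_rate * m(x) for m in models)), residuals

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 5.0, 6.0, 6.0]
predict, errors = boost(xs, ys)
print([round(predict(x), 2) for x in xs])
```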
  • 43.
    Ensembles
  • 44.
    Logistic Regression
  • 45.
    Logistic Regression ????
  • 46.
    Logistic Regression
    Goal
    • Fits a logistic function to the probability of the output class
    • The result is a set of coefficients, one for each feature * class
    Logistic Function
    l(x) = 1 / (1 + e^(-x))
    • x → -∞: l(x) → 0 (P ≈ 0)
    • x → ∞: l(x) → 1 (P ≈ 1)
    • In between: 0 < P < 1
    f(X) = β0 + β·X = β0 + β1·x1 + ⋯ + βi·xi
    P(X) = 1 / (1 + e^(-f(X)))
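The logistic function and the resulting probability P(X) translate almost literally into code. In this sketch the intercept and coefficient values are made up for illustration; in practice they are what the fitting procedure produces.

```python
import math

def logistic(x):
    """l(x) = 1 / (1 + e^(-x)): close to 0 for very negative x, close to 1 for very positive x."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_probability(features, intercept, coefficients):
    """P(X) = l(f(X)) with f(X) = b0 + b1*x1 + ... + bi*xi."""
    f = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logistic(f)

# Hypothetical coefficients for two features.
probability = predict_probability([1.0, 2.0], intercept=0.0, coefficients=[1.0, -0.5])
print(logistic(-10), logistic(0), logistic(10), probability)
```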
  • 47.
    Logistic Regression
  • 48.
    Logistic Level Up
    [Diagram: Inputs → Outputs]
  • 49.
    Logistic Level Up
    [Diagram labels: wi; Class “a”, logistic(w, b)]
  • 50.
    Deepnets
    [Diagram: Inputs → Hidden layer → Outputs]
  • 51.
    Deepnets: n hidden nodes?
  • 52.
    Deepnets: n hidden layers?
  • 53.
    BigML Deepnets
    • The success of a Deepnet depends on getting the right network structure for the dataset
    • But there are too many parameters:
    • Nodes, layers, activation function, learning rate, etc…
    • And setting them takes significant expert knowledge
    • Solution: Metalearning (a good initial guess)
    • Solution: Network search (try a bunch)
  • 54.
    Deepnets
  • 55.
    OptiML
    • Each resource has several parameters that impact quality
    • Number of trees, missing splits, nodes, weight
    • Rather than trial and error, we can use ML to find ideal parameters
    • Why not make the model type (Decision Tree, Boosted Tree, etc.) a parameter as well?
    • Similar to Deepnet network search, but finds the optimum machine learning algorithm and parameters for your data automatically
  • 56.
    Fusions
    • Similar to an Ensemble, but we can mix different model types
    • Logistic Regression plus a Deepnet, for example
    • You can also create a Fusion with different training sets!
    • Last week's data, plus last month's data, etc.
    • Or a Fusion of OptiML models
    • Combines the “best of the best”
  • 57.
    OptiML & Fusions
  • 58.
    What just happened?
    • We applied several different classification methods to the Lending Club training data
    • Decision Forest
    • Random Decision Forest: examined the effect of the threshold
    • Boosted Trees
    • Logistic Regression
    • Deepnets
    • OptiML: optimized for Recall of bad
    • Fusion: created from the top OptiML models
    • Then we created an evaluation of each one of the methods using the test dataset
    • We compared the evaluations using a ROC curve
  • 59.
    Supervised Learning
    Classification (label: action)
    animal    state   …  proximity  action
    tiger     hungry  …  close      run
    elephant  happy   …  far        take picture
    …         …       …  …          …
    Regression (label: min_kmh)
    animal  state   …  proximity  min_kmh
    tiger   hungry  …  close      70
    hippo   angry   …  far        10
    …       …       …  …          …
    We need different Evaluation Metrics…
  • 60.
    Regression - Fitting a Line
    [Plot: Data Points and the fitted Model line]
  • 61.
    Mean Absolute Error
    [Plot: errors e1 to e7 between the data points and the model line]
    MAE = (|e1| + |e2| + … + |en|) / n
  • 62.
    Mean Squared Error
    [Plot: errors e1 to e7 between the data points and the model line]
    MSE = ((e1)^2 + (e2)^2 + … + (en)^2) / n
  • 63.
    MSE versus MAE
    • For both MAE & MSE: smaller is better, but values are unbounded
    • The square root of MSE (RMSE) is always larger than or equal to MAE; MSE itself can be smaller than MAE when the errors are smaller than 1
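A short numeric check makes the relationship between the two metrics concrete. The function names and toy values below are assumptions for illustration. Note that MSE is only guaranteed to be at least as large as MAE once errors are 1 or more; what always holds is that the square root of MSE is at least MAE.

```python
import math

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0]

# Every error is 0.5: squaring shrinks the errors, so MSE < MAE.
small = [1.5, 2.5, 3.5]
mae_small = mean_absolute_error(actual, small)
mse_small = mean_squared_error(actual, small)

# Every error is 2: squaring grows the errors, so MSE > MAE.
large = [3.0, 4.0, 5.0]
mae_large = mean_absolute_error(actual, large)
mse_large = mean_squared_error(actual, large)

print(mae_small, mse_small, mae_large, mse_large)
print(math.sqrt(mse_small) >= mae_small, math.sqrt(mse_large) >= mae_large)
```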
  • 64.
    R-Squared Error
    [Plot: Data Points, the Model line, and the Mean line]
  • 65.
    R-Squared Error
    [Plot: distances v1 to v7 between the data points and the Mean line]
  • 66.
    R-Squared Error
    [Plot: errors e1 to e7 against the model line, and distances v1 to v7 against the Mean line]
    RSE = 1 - MSEmodel / MSEmean
  • 67.
    R-Squared Error
    RSE = 1 - MSEmodel / MSEmean
    • RSE: a measure of how much better the model is than always predicting the mean
    • < 0: the model is worse than the mean (MSEmodel > MSEmean)
    • = 0: the model is no better than the mean (MSEmodel = MSEmean)
    • ➞ 1: the model fits the data “perfectly” (MSEmodel = 0, or MSEmean >> MSEmodel)
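The three cases can be reproduced with a direct implementation of the formula. The function name and the toy values are assumptions for illustration.

```python
def r_squared(actual, predicted):
    """RSE = 1 - MSEmodel / MSEmean, where MSEmean is the error of always predicting the mean."""
    n = len(actual)
    mean = sum(actual) / n
    mse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mse_mean = sum((a - mean) ** 2 for a in actual) / n
    return 1 - mse_model / mse_mean

actual = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(actual, actual)                         # model fits the data exactly
as_mean = r_squared(actual, [3.0] * 5)                      # model always predicts the mean
worse = r_squared(actual, [5.0, 4.0, 3.0, 2.0, 1.0])        # model is worse than the mean
print(perfect, as_mean, worse)
```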
  • 68.
    Regression Evaluation
  • 69.
    What just happened?
    • We started with the open loans in the Lending Club dataset
    • Performed a train/test split
    • Built a model to predict the int_rate without grade/sub-grade
    • Evaluated this model
  • 70.
    Time Series
    Year  Pineapple Harvest (Tons)
    1986  50.74
    1987  22.03
    1988  50.69
    1989  40.38
    1990  29.80
    1991  9.90
    1992  73.93
    1993  22.95
    1994  139.09
    1995  115.17
    1996  193.88
    1997  175.31
    1998  223.41
    1999  295.03
    2000  450.53
    [Plot: Pineapple Harvest (Tons, 0 to 500) by Year (1986 to 2000), with the Trend line]
  • 71.
    Exponential Smoothing
    • Training values: x_t
    • Smoothing factor: 0 < α < 1
    • Predicted values: s_t = α·x_t + (1-α)·s_(t-1)
    • Idea: each new value in the series depends on all previous values with a decaying weight
    [Plot: Weight (0 to 0.2), decaying across positions 1 to 13]
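The smoothing recurrence and the decaying weights it implies can be sketched as follows. Starting the series with s_0 = x_0 and using α = 0.2 are assumptions made for this example; the slide does not state either choice. The input is the first five harvest values from the pineapple table.

```python
def exponential_smoothing(values, alpha):
    """Return s_t = alpha * x_t + (1 - alpha) * s_(t-1), starting from s_0 = x_0."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def decay_weights(alpha, count):
    """Weight of the value k steps back in time: alpha * (1 - alpha) ** k."""
    return [alpha * (1 - alpha) ** k for k in range(count)]

harvest = [50.74, 22.03, 50.69, 40.38, 29.80]
smoothed = exponential_smoothing(harvest, alpha=0.2)
weights = decay_weights(alpha=0.2, count=5)
print([round(s, 2) for s in smoothed])
print([round(w, 3) for w in weights])
```

Unrolling the recurrence shows where the weights come from: the most recent value gets weight α, the one before it α(1-α), then α(1-α)², and so on.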
  • 72.
    Time Series
  • 73.
    Supervised Learning
    [Diagram labels: Label, Training]
  • 74.
    Unsupervised Learning
    1. Training data provides “examples”, with no specific “outcome”
    2. The machine tries to find “interesting” patterns in the data
    date  customer  account  auth  class    zip    amount
    Mon   Bob       3421     pin   clothes  46140  135
    Tue   Bob       3421     sign  food     46140  401
    Tue   Alice     2456     pin   food     12222  234
    Wed   Sally     6788     pin   gas      26339  94
    Wed   Bob       3421     pin   tech     21350  2459
    Wed   Bob       3421     pin   gas      46140  83
    Thu   Sally     6788     sign  food     26339  51
  • 75.
    Unsupervised Learning: Anomaly Detection
    [The same transaction table, with the unusual instance highlighted]
  • 76.
    Unsupervised Learning: Clustering
    [The same transaction table, with similar instances highlighted]
  • 77.
    Unsupervised Learning: Association Discovery
    [The same transaction table, with related items highlighted]
    • zip = 46140
    • amount < 100
    • {customer = Bob, account = 3421}
    • {class = gas}
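One common way to quantify an association between items is with support (how often the items occur together) and confidence (how often the consequent holds when the antecedent does). These two measures are standard in association rule mining but are not named on the slide, and pairing the highlighted items into the two rules below is an assumption made for illustration. The rows are the transaction table from the slides, reduced to the fields used here.

```python
rows = [
    {"customer": "Bob",   "class": "clothes", "zip": 46140, "amount": 135},
    {"customer": "Bob",   "class": "food",    "zip": 46140, "amount": 401},
    {"customer": "Alice", "class": "food",    "zip": 12222, "amount": 234},
    {"customer": "Sally", "class": "gas",     "zip": 26339, "amount": 94},
    {"customer": "Bob",   "class": "tech",    "zip": 21350, "amount": 2459},
    {"customer": "Bob",   "class": "gas",     "zip": 46140, "amount": 83},
    {"customer": "Sally", "class": "food",    "zip": 26339, "amount": 51},
]

def support(rows, condition):
    """Fraction of all rows that satisfy the condition."""
    return sum(1 for r in rows if condition(r)) / len(rows)

def confidence(rows, antecedent, consequent):
    """Of the rows matching the antecedent, the fraction that also match the consequent."""
    matching = [r for r in rows if antecedent(r)]
    return sum(1 for r in matching if consequent(r)) / len(matching)

is_gas = lambda r: r["class"] == "gas"
is_cheap = lambda r: r["amount"] < 100
is_bob = lambda r: r["customer"] == "Bob"
in_46140 = lambda r: r["zip"] == 46140

gas_cheap_conf = confidence(rows, is_gas, is_cheap)
gas_cheap_support = support(rows, lambda r: is_gas(r) and is_cheap(r))
bob_zip_conf = confidence(rows, is_bob, in_46140)
print(gas_cheap_conf, round(gas_cheap_support, 3), bob_zip_conf)
```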
  • 78.
    Let’s build a recommender
    Typical way to shop for a home…
  • 79.
    Recommender Idea
    [Diagram: All Homes For Sale → Sample → Preference Data (? ? ? ?) → Preference Model]
    … then use the Preference Model to filter all the homes on the market
  • 80.
    Recommender Problem #1
    What if there are really unusual homes in the data?
    • A mansion with 20 bathrooms
    • A home with no bedrooms
    • A lot size that is smaller than the home?
    We don’t want to show these as suggestions because they are unusual…
    How do we detect anomalies?
  • 81.
    Isolation Forest
    • Grow a random decision tree until each instance is in its own leaf
    • Some instances are “easy” to isolate (small Depth); others are “hard” to isolate (large Depth)
    • Now repeat the process several times and use the average Depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
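The idea can be sketched in a few lines. Instead of growing whole trees, the sketch follows only the branch containing the instance of interest, which gives the same depth for that instance. The normalization that turns average depth into a 0 to 1 score follows the original Isolation Forest paper and is an assumption here, since the slide does not say how BigML computes its score; the toy homes (beds, baths) are also made up.

```python
import math
import random

def isolation_depth(points, target, rng, depth=0, max_depth=50):
    """Number of random splits needed before `target` is alone (target must be in points)."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary among the remaining points can be split.
    splittable = [d for d in range(len(target))
                  if min(p[d] for p in points) < max(p[d] for p in points)]
    if not splittable:
        return depth
    dim = rng.choice(splittable)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as the target.
    same_side = [p for p in points if (p[dim] < split) == (target[dim] < split)]
    return isolation_depth(same_side, target, rng, depth + 1, max_depth)

def anomaly_score(points, target, trees=200, seed=0):
    """Average depth over many random trees, mapped so that higher means more unusual."""
    rng = random.Random(seed)
    avg_depth = sum(isolation_depth(points, target, rng) for _ in range(trees)) / trees
    n = len(points)
    # Expected depth for n points (assumes n > 2), used as the normalizing constant.
    c = 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    return 2 ** (-avg_depth / c)

# Toy homes as (beds, baths): nine ordinary homes plus a mansion with 20 bathrooms.
homes = [(beds, baths) for beds in (2, 3, 4) for baths in (1, 2, 3)] + [(5, 20)]
mansion_score = anomaly_score(homes, (5, 20))
typical_score = anomaly_score(homes, (3, 2))
print(mansion_score > typical_score)
```

The mansion is usually cut off by the first or second random split, so its average depth is small and its score is high, while an ordinary home needs several splits to be separated from its neighbours.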
  • 82.
    Anomaly Detection
  • 83.
    What just happened?
    • We wanted to find and remove unusual houses.
    • We created an Anomaly Detector and examined the top anomalies.
    • We found some unusual houses to remove and discovered bad data (missing values) that we want to fix.
  • 84.
    A clever way to fix missing data
    Let’s use Machine Learning…
    SQFT   PRICE     BEDS       BATHS
    3,125  $530,000  5          3
    2,100  $460,000  (missing)  2
    1,200  $250,000  3          (missing)
    3,950  $610,000  6          4
    Predicted values for the missing fields: BEDS = 4, BATHS = 1.5
  • 85.
    WhizzML
  • 86.
    What just happened?
    • We had a Dataset with missing values.
    • We wanted to apply an algorithm to fix the missing values with Machine Learning.
    • Rather than write the algorithm, we found what we needed in the WhizzML public gallery.
    • Now that we have cloned the Script, we can use it again and again.
    • We can write new ones too!
  • 87.
    Recommender Problem #2
    • How can we avoid showing essentially the same house over and over?
    [Diagram: All Homes → Sample (? ? ?), where the sampled homes are all Modern]
  • 88.
    Recommender Problem #2
    • How can we avoid showing essentially the same house over and over?
    [Diagram: All Homes split into groups (Modern, Lots of Land), with one sample (?) taken from each group]
    • Great! What if we don’t know how to group them? Or how many groups?
  • 89.
    Clustering
  • 90.
    What just happened?
    • Since we don’t know how many groups of homes there should be, we used G-means Clustering to find the optimum number of groups of homes
    • Our recommender will use these groups to create a better sampling for user preference
    • We also tried to understand the home clusters using “model clusters”, but the models were difficult to interpret
  • 91.
    Understanding Clusters Better
    SQFT   PRICE     BEDS  BATHS  CLUSTER
    3,125  $530,000  5     3      Cluster 1
    2,100  $460,000  4     2      Cluster 3
    1,200  $250,000  3     1.5    Cluster 5
    3,950  $610,000  6     4      Cluster 1
    What if we could get rules like…
    IF SQFT >= 3,125 THEN “Cluster 1”
  • 92.
    Association Discovery
  • 93.
    What just happened?
    • We used a Batch Centroid to add the Cluster assignment of each home as a feature to the Dataset
    • We used Association Discovery to find “interesting” relationships between the features, including the Cluster assignment
  • 94.
    Recommender Problem #3
    There is much more interesting information than just the number of BEDS, BATHS, etc.
    • Unfortunately, these "remarks" are not available in the Redfin download
    • Adding them to our dataset requires crawling the website
    • Like most ML projects, preparing the data is 80% of the difficulty (fortunately I already did it!)
  • 95.
    Topic Modeling
  • 96.
    What just happened?
    • We extended the home dataset with the syndicated remarks text field
    • We built a model to predict sale price and explored how key words discovered in the remarks impacted price
    • We used topic modeling to create a deeper thematic understanding of the remarks
    • Homes that are "in-town" or "out-of-town"
    • We extended the dataset with fields that represent, for each home, how related it is to each of these topics
    • This will allow our clustering to group homes by a deeper meaning than just BEDS, BATHS, etc.
  • 97.
    Recommender Idea
    [Diagram: samples (?) drawn from the Modern, Lots of Land and Small groups → Preference Data → Preference Model]
  • 98.
    House Recommender