Applied Machine Learning
(In)convenient truths about applied machine learning
Max Pagels, Machine Learning Partner
Job: Fourkind
Education: BSc, MSc comp. sci, University of Helsinki
Background: CS researcher, full-stack dev, front-end dev,
data scientist
Interests: Immediate-reward RL, ML reductions,
incremental/online learning, generative design
Some industries: maritime, insurance, ecommerce, gaming,
telecommunications, transportation, media, education,
logistics
Preface
Machine learning is sometimes at odds with business in general
Machine learning is perhaps the best example of applying
the scientific method:
“It involves formulating hypotheses, via induction, based on
such observations; experimental and measurement-based
testing of deductions drawn from the hypotheses; and
refinement (or elimination) of the hypotheses based on the
experimental findings”
All machine learning projects are, in effect, a series of
experiments, where outcomes are uncertain.
A key tenet of the scientific method is that “failed”
experiments don’t equal failure.
Failed experiments add to the body of knowledge, and allow
us to do better in the future.
(Photo: Alexander Fleming, year unknown)
Unfortunately, business rarely looks at failed projects in the
same way scientists do. This can be hard to reconcile.
Project A: “Let’s build a new webshop for our product”
Project B: “We lose 2 million each year because of wasted
inventory. Let’s solve that using ML”
How do we reconcile the scientific method with the business world?
There’s no silver bullet. But by studying the experiences of others, and bringing ML
closer to what businesses care about, we can avoid some mistakes.
What follows are some observations. Some seem very obvious,
some not, but all still pose a challenge in practice.
Disclaimer: all of the following examples are based on personal experience, personal
failures, or personal opinion. Please consume with a healthy grain of salt.
In many cases, you don’t need machine learning in order to solve a problem
Data Scientist: “we built a model for predicting which channel customers will contact us through”
PO: “awesome, let’s take this to production!”
Data Scientist: “great, I’ll work with our engineers to make it
happen”
(development continues)
Engineer: “why don’t we just collect the correct channel
information when someone calls or emails us?”
PO: “...”
Data Scientist: “...”
“Rule #1: Don’t be afraid to launch a product without
machine learning.
Machine learning is cool, but it requires data. Theoretically,
you can take data from a different problem and then tweak
the model for a new product, but this will likely
underperform basic heuristics. If you think that machine
learning will give you a 100% boost, then a heuristic will get
you 50% of the way there.”
Rules of Machine Learning: Best Practices for ML Engineering (Martin Zinkevich et al),
http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
Sometimes, the data you already have is useless
Client: “we want to be able to predict who is most likely to be our
customer in the future”
Data Scientist: “OK, for whom would you like to be able to predict
that?”
Client: “for all people that aren’t already our customers”
Not understanding technical constraints can make a machine learning project fail
Business: “let’s use machine learning to automatically assign
tickets to the proper technician”
Data Scientist: “sounds plausible, I’ll get to work”
(development continues)
Data Scientist: “here’s the best model I could make. In
simulation, it’s only wrong 0.1% of the time”
Business: “that’s unacceptable – it can’t assign work to the
wrong technician”
Data Scientist: “but it’s function approximation...by
definition, it can’t–”
Business: “no exceptions”
Data Scientist: “...”
Data Scientist: “I’ve made a non-parametric model for a
recommendation engine and now we need to deploy it to
production”
Engineer: “OK, where’s the data you need at prediction
time?”
Data Scientist: “Oh, some of it is in two data warehouses and
the rest is in S3”
Engineer: “We have to make that data accessible in an
operational DB. How much data are we talking about?”
Data Scientist: “Around 2 billion rows”
Engineer: “...”
Data Scientist: “Oh, and since the model is non-parametric
and in-memory, it needs 50GB of RAM to run and doesn’t
scale horizontally”
Not understanding technical constraints may also encourage overly complex solutions.
In machine learning, domain expertise means less than you might think
Predicting customer churn in an eCommerce business
Data Scientist: “OK, I’ll start with these features, gridsearch a good XGBClassifier and iterate from there”
Predicting if heavy machinery is likely to break down within
the next day
Data Scientist: “OK, I’ll start with these features, gridsearch a good XGBClassifier and iterate from there”
Just about any image recognition task, regardless of industry:
Data Scientist: “I’ll use a convnet”
During planning
Business owner(s): “the model should take a,b,c,d,e,f & g into
account when making a decision”
Data Scientist: “OK”
During modelling
Data Scientist:
“let me use a,b,c,d,e,f & g and make a baseline model”
“hmm, these results aren’t great. I’ll add h,i,j & k”
“hmm. a, b, c & d have no predictive power – I’ll drop those”
Data Scientist: “all done!”
Business owner(s): “this model takes into account the stuff we
talked about, right?”
Data Scientist: “sure.”
● Consult with stakeholders to understand the problem and get an idea of what types of data might be useful (include everyone’s ideas)
● Figure out what data is viable to get/use as features
● Through modelling, learn what type of data is actually useful
Business heuristics? (Possibly) add them as features, never as hard-coded logic. Let learning algorithms figure out if they are useful (see the sketch below).
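As a minimal sketch of the “heuristics as features, never hard-coded logic” idea (the column names, the heuristic rule and the synthetic data are all invented for illustration): the business rule becomes one more input column, and the learner decides how much weight, if any, it deserves.

```python
# Hypothetical sketch: encode a business heuristic as a feature and let the
# model decide whether it is useful, instead of wrapping the model in if-statements.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
basket_value = rng.gamma(2.0, 50.0, n)       # made-up raw features
days_since_visit = rng.integers(0, 90, n)
y = (rng.random(n) < 1 / (1 + np.exp(-(0.01 * basket_value - 0.05 * days_since_visit)))).astype(int)

# The heuristic ("big basket AND recent visit means likely to convert")
# is just another column in the feature matrix.
heuristic_flag = ((basket_value > 120) & (days_since_visit < 7)).astype(float)

X = np.column_stack([basket_value, days_since_visit, heuristic_flag])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print("feature importances [basket, recency, heuristic]:", model.feature_importances_)
```

If the heuristic carries real signal, its importance will show it; if not, it quietly gets ignored instead of overriding the model.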
Always make a proof-of-concept
Hidden Technical Debt in Machine Learning (D. Sculley, Gary Holt, Daniel Golovin et al),
http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Machine Learning: The High-Interest Credit Card of Technical Debt (D. Sculley, Gary Holt, Daniel
Golovin, Eugene Davydov et al),
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf
Machine learning projects can, and will, fail from time to time. To
start, make the simplest model possible, and test its effectiveness
using the simplest possible process. Adding surrounding
infrastructure without validating the approach first is asking for
trouble.
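One possible reading of “simplest model, simplest process”, sketched below with invented synthetic data: before building any surrounding infrastructure, compare a trivial baseline against a simple model using plain cross-validation, and only invest further if the gap justifies it.

```python
# Hypothetical proof-of-concept harness: a dummy baseline vs. a simple model,
# evaluated with nothing fancier than cross-validation.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

baseline = DummyClassifier(strategy="most_frequent")
candidate = LogisticRegression(max_iter=1_000)

for name, model in [("baseline", baseline), ("candidate", candidate)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f} (+/- {scores.std():.3f})")
```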
Always do a proper test to establish causality - or why you need to take a financial risk
The gold standard for establishing causality is a randomised
controlled experiment (A/B-test), though other useful causal
inference methods also exist for situations where A/B-testing isn’t
possible.
During a controlled experiment, you are invariably taking a
financial risk to determine the effectiveness of a machine learning
model.
Sometimes, it is surprisingly difficult to convince everyone that you
have to take a risk.
Example: predicting customer churn
“Can’t we just log churn risks without actually acting upon them,
and then follow up on how many people churned?”
Problem 1: data scientists already do these counterfactual tests as
part of modelling (testing accuracy on new data)
Problem 2: the treatment action may itself influence future
behaviour
Problem 3: if we did a “risk-free” run, and the model worked well,
we’d still need a real A/B test, effectively doubling time spent
testing
Moral of the story: don’t shy away from real experimentation.
Mitigate risks during modelling and/or by varying treatment group
sizes (Bayesian methods handle the latter naturally)
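To make the note about Bayesian methods and treatment group sizes concrete, here is a small Thompson sampling sketch over two variants. The conversion rates are invented, and this is only one of several ways to vary group sizes adaptively.

```python
# Sketch of Thompson sampling over two variants (control vs. ML model).
# The "true" conversion rates are made up; in production you only observe outcomes.
import numpy as np

rng = np.random.default_rng(42)
true_rate = {"control": 0.050, "model": 0.058}
alpha = {"control": 1.0, "model": 1.0}   # Beta(1, 1) priors
beta = {"control": 1.0, "model": 1.0}
shown = {"control": 0, "model": 0}

for _ in range(20_000):                  # each iteration = one visitor
    # Sample a plausible conversion rate for each arm and pick the best sample.
    arm = max(alpha, key=lambda a: rng.beta(alpha[a], beta[a]))
    shown[arm] += 1
    converted = rng.random() < true_rate[arm]
    alpha[arm] += converted              # posterior update
    beta[arm] += 1 - converted

print("traffic share:", {a: shown[a] / sum(shown.values()) for a in shown})
```

Because traffic drifts toward whichever variant the posterior currently favours, the financial downside of testing a weak model is naturally capped.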
Framing learning problems isn’t as easy as it seems, and it’s mostly because of lousy metrics
Let’s say we are tasked with building a recommender system for a
news site.
Do we build a model that:
● Predicts clicks/non-clicks?
● Predicts read time?
● Predicts conversion rates?
● Predicts explicit ratings?
● Predicts implicit ratings?
● Predicts something else?
Side note: all of the above have been used for recommendations in the past.
Let’s say we are tasked with building a recommender system for a
news site.
Do we use:
● A regression algorithm?
● A binary classification algorithm?
● A pairwise classification algorithm?
● A ranking algorithm?
● A multiclass classification algorithm?
● A multilabel classification algorithm?
● A matrix factorization algorithm?
● A non-parametric similarity algorithm?
● A reinforcement learning algorithm?
● ...
● A hybrid approach?
Side note: all of the above can be used for recommendations.
Rule of thumb: first choose a good metric, then experiment with
different learning algorithms.
Problem: most metrics used in business range from bad to
terrible.
On the Theory of Scales of Measurement (S.S. Stevens), Science 103(2684), 1946
Avg. rating, “The website has a friendly user interface”: 3.5/5
Strictly speaking, taking the mean of ordinal (Likert-scale) ratings like this isn’t allowed, yet we do it all the time. Why?
Good metrics are:
+ Measurable
+ Objective and unhackable*
+ Derived from strategy
+ Descriptive of what you want and need to know
+ Usable in every-day work
+ Understood by and accessible to everyone
+ Validated regularly
Bad metrics are:
- Unmeasurable
- Subjective and/or hackable
- Derived from coffee table conversation
- Chosen because they were easily available
- Too big to have an impact on, or too narrow to describe different cases
- Unknown to other stakeholders, and in the worst case even to you
- Not trusted or fully understood
Credit: Jan Hiekkaranta
Theoretically, the best way to apply ML in business is to optimise
directly against critical business KPIs, such as profit.
In practice, this is extremely difficult, because so many other things
can influence highest-level KPIs.
The solution? Derive a good proxy metric.
(Spectrum of possible optimisation targets, from closer to your problem to farther from it: read time → engagement → customer value → EBIT)
Q: How do you know your proxy metric is good?
A: Validate that it tracks well with higher-level metrics. This can
even be done statistically, e.g. using IEEE’s standards for software
measurement (IEEE Standard for a Software Quality Metrics
Methodology. Technical report, December 1998, ISBN
1-55937-529-9). The standards aren’t made for this purpose, but
work quite well!
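As a rough illustration of that kind of check (the weekly aggregates below are fabricated), one simple starting point is to look at how strongly period-level movements in the proxy track the higher-level KPI.

```python
# Sketch: does the proxy metric (e.g. read time) track the KPI (e.g. revenue)?
# The weekly aggregates below are invented purely for illustration.
import numpy as np
from scipy.stats import spearmanr

weekly_read_time = np.array([310, 325, 298, 340, 360, 355, 372, 390, 401, 395])
weekly_revenue = np.array([98, 103, 95, 108, 115, 112, 118, 124, 130, 127])

rho, p_value = spearmanr(weekly_read_time, weekly_revenue)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A strong, stable rank correlation over many periods is (weak) evidence the
# proxy is worth optimising; a low or unstable one is a warning sign.
```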
Statistical validation aside, thoughtful reasoning is still valuable.
Consider recommender systems that predict click-through-rates
(CTR):
● Does a click really mean I’m interested?
● Who would really care about CTRs if I can improve total
minutes spent with our system?
○ Conversely, who would care if CTRs were high but
read times lousy?
● What biases are at play here?
● ...
● Where does money change hands?
To business developers: set out well-designed, validated KPIs &
proxy metrics and require that ML projects target those. Data
Scientists can help with metric designs.
Data Scientists should optimise a model against real costs & returns, but often can’t
Use case: predicting fraud
True positives: 11,854     False positives: 582
False negatives: 134       True negatives: 300,297
F1-score: 0.9707, Recall: 0.9888, Precision: 0.9532
Use case: detecting malignant tumours
True positives: 11,854     False positives: 23
False negatives: 1,333     True negatives: 300,297
F1-score: 0.9451, Recall: 0.8989, Precision: 0.9981
All classification problems are cost-sensitive
classification problems.
The quantity to optimise is the expected cost in €, not a generic accuracy metric.
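A small sketch of what that means for the two confusion matrices above; the per-error costs below are invented, and the point is that which model looks better flips when they change.

```python
# Sketch: price the confusion matrices above with made-up per-error costs.
# Which model is "better" depends entirely on what a false positive and a
# false negative actually cost, not on F1 alone.
def expected_cost(tp, fp, fn, tn, cost_fp, cost_fn):
    n = tp + fp + fn + tn
    return (fp * cost_fp + fn * cost_fn) / n   # average cost per decision, in euros

model_a = dict(tp=11_854, fp=582, fn=134, tn=300_297)    # F1 = 0.9707
model_b = dict(tp=11_854, fp=23, fn=1_333, tn=300_297)   # F1 = 0.9451

# Two invented cost regimes: one where misses are expensive (fraud slipping
# through), one where false alarms are expensive (needless interventions).
for cost_fp, cost_fn in [(5.0, 200.0), (200.0, 5.0)]:
    a = expected_cost(**model_a, cost_fp=cost_fp, cost_fn=cost_fn)
    b = expected_cost(**model_b, cost_fp=cost_fp, cost_fn=cost_fn)
    print(f"cost_fp={cost_fp}, cost_fn={cost_fn}: A={a:.3f}, B={b:.3f} euros per case")
```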
Strategies for cost-sensitive classification (one of them sketched below):
● Upsampling
● Downsampling
● Rejection sampling
● Importance weighting
● Using a native cost-sensitive classification algorithm
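Of the strategies listed, importance weighting is often the easiest to try first. A hedged scikit-learn sketch on synthetic data, assuming purely for illustration that a false negative costs ten times as much as a false positive:

```python
# Sketch: cost-sensitive classification via importance weighting.
# The 10:1 cost ratio is an assumption made up for this example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# Weight each training example by the cost of getting it wrong.
cost_fn, cost_fp = 10.0, 1.0
weights = np.where(y_tr == 1, cost_fn, cost_fp)
weighted = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr, sample_weight=weights)

for name, model in [("unweighted", plain), ("cost-weighted", weighted)]:
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    print(f"{name}: expected cost = {(fp * cost_fp + fn * cost_fn) / len(y_te):.3f} per case")
```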
Data Scientist: “on validation data, the accuracy is 98% with
an F1-score of 94%. This is a 19% improvement over our
baseline”
Data Scientist: “we estimate 3.4 euros more per month per
user if we put this model into production”
Predicting customer churn
Data Scientist: “What’s the expected cost to the company if we fail
to keep a customer from leaving?”
PO: “Well, the expected lifetime value of a customer is around 350
euros”
Business Manager A: “100 euros”
Software Engineer B: “1210 euros”
Accountant C: “420 euros”
Another Data Scientist: “It depends”
Existing business processes can severely restrict the potential of machine learning
“We already have a logic-based system for flagging critical alarms,
but some still slip through. We’d like to replace the entire system
with ML”
Data Scientist: “OK, where’s the control group data?”
“We want to forecast the number of customer service chats each
day, for resource allocation purposes. We’ve got data on all the calls
our reps take”
Data Scientist: “Are the incoming chat attempts recorded
somewhere? Is the customer service number closed during
evenings/weekends?”
Existing opinions can severely restrict the potential of machine learning
“The prices the algorithm suggests are sometimes too low, so we disregard those”
“On Fridays, we don’t use the recommendation engine because our
content creators want to promote something else”
“We can’t release this to production; the recommendations I got
were pretty bad”
All of the above are strawman examples. Edge cases that are truly
suboptimal should be addressed on the algorithm level, not by
slapping opinions on top.
If you aren’t ready to let machine learning do its thing, don’t
use it. The less you override it, the better it works.
When the machine learning part of a machine learning project fails, it’s because of bad features/feature engineering
Garbage In, Garbage Out.
When a model fails to predict something, it’s because the
information used to train it lacks predictive power.
This, in turn, is because either the information used is wrong, or
not engineered into useful features.
There are no exceptions to this rule. Applied machine learning
is basically an exercise in feature engineering (note: feature
engineering is hard).
Good feature engineering + a naïve learning algorithm trumps bad
engineering + a sophisticated learning algorithm 99% of the time.
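A toy illustration of that claim, on entirely synthetic data: a linear model gets nowhere on raw coordinates when the decision boundary is a circle, but one engineered feature makes the same problem almost trivial.

```python
# Toy example: a circular decision boundary. Raw (x, y) coordinates give a
# linear model nothing to work with; one engineered feature (distance from
# the origin) makes the problem easy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(4_000, 2))
labels = (np.hypot(xy[:, 0], xy[:, 1]) < 0.6).astype(int)   # inside vs. outside a circle

raw_score = cross_val_score(LogisticRegression(max_iter=1_000), xy, labels, cv=5).mean()

engineered = np.column_stack([xy, np.hypot(xy[:, 0], xy[:, 1])])  # add a radius feature
eng_score = cross_val_score(LogisticRegression(max_iter=1_000), engineered, labels, cv=5).mean()

print(f"raw features:        {raw_score:.2f} accuracy")  # roughly the majority-class rate
print(f"+ engineered radius: {eng_score:.2f} accuracy")
```

The engineered feature does the work a more sophisticated learning algorithm would otherwise have to approximate.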
At the end of the day, some machine learning projects succeed and
some fail. What makes the difference? Easily the most important
factor is the features used. If you have many independent features
that each correlate well with the class, learning is easy. On the other
hand, if the class is a very complex function of the features, you
may not be able to learn it. Often, the raw data is not in a form that
is amenable to learning, but you can construct features from it that
are. This is typically where most of the effort in a machine learning
project goes. It is often also one of the most interesting parts, where
intuition, creativity and “black art” are as important as the
technical stuff.
A Few Useful Things to Know about Machine Learning (Pedro Domingos),
https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Machine learning can’t really be called “intelligent” unless you allow for exploration
Direct Feedback Loops. A model may directly influence the
selection of its own future training data. It is common practice to
use standard supervised algorithms, although the theoretically
correct solution would be to use bandit algorithms. The problem
here is that bandit algorithms (such as contextual bandits [9]) do
not necessarily scale well to the size of action spaces typically
required for real-world problems. It is possible to mitigate these
effects by using some amount of randomization [3], or by isolating
certain parts of data from being influenced by a given model.
Hidden Technical Debt in Machine Learning (D. Sculley, Gary Holt, Daniel Golovin et al),
http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
(Diagram) Almost all production machine learning systems: a closed loop of Learn, Log and Deploy, with no exploration step.
(Diagram) A fundamentally correct machine learning system: the same loop of Learn, Log and Deploy, plus an Explore step.
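A minimal sketch of the “some amount of randomization” mentioned in the excerpt above, in the style of an epsilon-greedy policy (the action names and the epsilon value are arbitrary): most traffic follows the model, a small slice explores, and both the action and the probability it was shown with are logged for later off-policy learning.

```python
# Sketch: epsilon-greedy exploration around a deployed model's choice.
# Logging the action *and* its propensity is what makes later learning honest.
import random

ACTIONS = ["article_a", "article_b", "article_c"]   # hypothetical recommendations
EPSILON = 0.05                                       # share of traffic used to explore

def choose(model_pick: str, epsilon: float = EPSILON):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = model_pick
    # Overall probability this action was shown, given the model's pick.
    propensity = epsilon / len(ACTIONS) + (1 - epsilon) * (action == model_pick)
    return action, propensity            # log both, alongside the observed reward

action, p = choose("article_b")
print(action, p)
```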
Having company-wide control groups is a non-negotiable part of data-driven decision making & modelling
Some things I’ve seen happen:
- Random uniform choices working better than
human opinions (including my own)
- Machine learning models tested only against other
machine learning models
- “Controlled” experiments run without control groups
- A/B tests failing due to other treatments happening at the
same time
Programming languages for Data Science aren’t all that great
R: made for data science, with other stuff added later
Python: built for general purpose computing, with data science
stuff added later
Both: too slow in many cases
Others: not always viable because of meagre ecosystems
The tool & service ecosystem for machine learning is fragmented, non-standardised, and fragile
(Figure: current status of model exchange formats)
Sometimes, Data Scientists make good models using learning algorithms they don’t fully understand
Me: “neural networks learn through backpropagation, which
adjusts weights based on the chain rule and the partial derivative of
the loss function with respect to the weights in each layer.
Initialisation must however be symmetry-breaking...”
Me: “gradient boosted trees learn using a set of weak learners”
Me: “Random Forests are made up of trees”
Me: “what’s an SVM?”
Sunk costs are almost always taken into account when productionising machine learning projects, but they shouldn’t be
“The license for this platform cost us 1.2 M€, so it should be our
primary platform going forward.”
Internal thinking: “developing this model & A/B test took 4
months, so we’re definitely taking it to production”
In 1968 Knox and Inkster,[2] in what is perhaps the classic sunk
cost experiment, approached 141 horse bettors: 72 of the people
had just finished placing a $2.00 bet within the past 30 seconds, and
69 people were about to place a $2.00 bet in the next 30 seconds.
Their hypothesis was that people who had just committed
themselves to a course of action (betting $2.00) would reduce
post-decision dissonance by believing more strongly than ever that
they had picked a winner. Knox and Inkster asked the bettors to
rate their horse's chances of winning on a 7-point scale. What they
found was that people who were about to place a bet rated the
chance that their horse would win at an average of 3.48 which
corresponded to a "fair chance of winning" whereas people who
had just finished betting gave an average rating of 4.81 which
corresponded to a "good chance of winning".
Though from a different domain, adapting the Markov property
is a good rule of thumb.
“The future should be independent of the past given the present”
Fortune still favours the brave.
Why can’t we use machine learning to optimise an airport?
Optimal aircraft parking & people transportation using linear
programming. Live at Kittilä Airport.
Why can’t we use machine learning to generate a single malt whisky?
Machine-generated single malt whisky recipes, curated by
Mackmyra’s Master Blender. A mix of old & new learning
algorithms, including generator/discriminators. Full reveal at
The Next Web 2019.
Machine learning isn’t easy. But it’s worth it.
Thank you! Questions?
A special thanks to Jarno Kartela & Jan Hiekkaranta for their
contributions.
max.pagels@fourkind.com
www.fourkind.com
@fourkindnow
