Applied Machine Learning
(In)convenient truths about applied machine learning
Max Pagels, Machine Learning Partner
Job: Fourkind
Education: BSc, MSc comp. sci, University of Helsinki
Background: CS researcher, full-stack dev, front-end dev,
data scientist
Interests: Immediate-reward RL, ML reductions,
incremental/online learning, generative design
Some industries: maritime, insurance, ecommerce, gaming,
telecommunications, transportation, media, education,
logistics
Preface
Machine learning is sometimes at odds with business in general
Machine learning is perhaps the best example of applying
the scientific method:
“It involves formulating hypotheses, via induction, based on
such observations; experimental and measurement-based
testing of deductions drawn from the hypotheses; and
refinement (or elimination) of the hypotheses based on the
experimental findings”
All machine learning projects are, in effect, a series of
experiments, where outcomes are uncertain.
A key tenet of the scientific method is that “failed”
experiments don’t equal failure.
Failed experiments add to the body of knowledge, and allow
us to do better in the future.
(Photo: Alexander Fleming, year unknown)
Unfortunately, business rarely looks at failed projects in the
same way scientists do. This can be hard to reconcile.
Project A: “Let’s build a new webshop for our product”
Project B: “We lose 2 million each year because of wasted
inventory. Let’s solve that using ML”
How do we reconcile the scientific method with the business world?
There’s no silver bullet. But by studying the experiences of others, and bringing ML
closer to what businesses care about, we can avoid some mistakes.
What follows are some observations. Some seem very obvious,
some not, but all still pose a challenge in practice.
Disclaimer: all of the following examples are based on personal experience, personal
failures, or personal opinion. Please consume with a healthy grain of salt.
In many cases, you don’t need machine learning in order to solve a problem
Data Scientist: “we built a model for predicting which channel customers will contact us through”
PO: “awesome, let’s take this to production!”
Data Scientist: “great, I’ll work with our engineers to make it
happen”
(development continues)
Engineer: “why don’t we just collect the correct channel
information when someone calls or emails us?”
PO: “...”
Data Scientist: “...”
“Rule #1: Don’t be afraid to launch a product without
machine learning.
Machine learning is cool, but it requires data. Theoretically,
you can take data from a different problem and then tweak
the model for a new product, but this will likely
underperform basic heuristics. If you think that machine
learning will give you a 100% boost, then a heuristic will get
you 50% of the way there.”
Rules of Machine Learning: Best Practices for ML Engineering (Martin Zinkevich et al),
http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
Sometimes, the data you already have is useless
Client: “we want to be able to predict who is most likely to be our
customer in the future”
Data Scientist: “OK, for whom would you like to be able to predict
that?”
Client: “for all people that aren’t already our customers”
Not understanding technical constraints can make a machine learning project fail
Business: “let’s use machine learning to automatically assign
tickets to the proper technician”
Data Scientist: “sounds plausible, I’ll get to work”
(development continues)
Data Scientist: “here’s the best model I could make. In
simulation, it’s only wrong 0.1% of the time”
Business: “that’s unacceptable – it can’t assign work to the
wrong technician”
Data Scientist: “but it’s function approximation...by
definition, it can’t–”
Business: “no exceptions”
Data Scientist: “...”
Data Scientist: “I’ve made a non-parametric model for a
recommendation engine and now we need to deploy it to
production”
Engineer: “OK, where’s the data you need at prediction
time?”
Data Scientist: “Oh, some of it is in two data warehouses and
the rest is in S3”
Engineer: “We have to make that data accessible in an
operational DB. How much data are we talking about?”
Data Scientist: “Around 2 billion rows”
Engineer: “...”
Data Scientist: “Oh, and since the model is non-parametric
and in-memory, it needs 50GB of RAM to run and doesn’t
scale horizontally”
Not understanding technical constraints may also encourage overly complex solutions.
In machine learning, domain expertise means less than you might think
Predicting customer churn in an eCommerce business
Data Scientist: “OK, I’ll start with these features, gridsearch a good XGBClassifier and iterate from there”
Predicting if heavy machinery is likely to break down within
the next day
Data Scientist: “OK, I’ll start with these features, gridsearch a good XGBClassifier and iterate from there”
Just about any image recognition task, regardless of industry:
Data Scientist: “I’ll use a convnet”
During planning
Business owner(s): “the model should take a,b,c,d,e,f & g into
account when making a decision”
Data Scientist: “OK”
During modelling
Data Scientist:
“let me use a,b,c,d,e,f & g and make a baseline model”
“hmm, these results aren’t great. I’ll add h,i,j & k”
“hmm. a, b, c & d have no predictive power – I’ll drop those”
Data Scientist: “all done!”
Business owner(s): “this model takes into account the stuff we
talked about, right?”
Data Scientist: “sure.”
● Consult with stakeholders to understand the problem and get an idea of what types of data might be useful (include everyone’s ideas)
● Figure out what data is viable to get/use as features
● Through modelling, learn what type of data is actually useful
Business heuristics? (Possibly) add them as features, never as hard-coded logic. Let learning algorithms figure out if they are useful (see the sketch below).
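As a minimal sketch of the “heuristics as features, never hard-coded logic” idea (the column names, the heuristic rule and the synthetic data are all invented for illustration): the business rule becomes one more input column, and the learner decides how much weight, if any, it deserves.

```python
# Hypothetical sketch: encode a business heuristic as a feature and let the
# model decide whether it is useful, instead of wrapping the model in if-statements.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
basket_value = rng.gamma(2.0, 50.0, n)       # made-up raw features
days_since_visit = rng.integers(0, 90, n)
y = (rng.random(n) < 1 / (1 + np.exp(-(0.01 * basket_value - 0.05 * days_since_visit)))).astype(int)

# The heuristic ("big basket AND recent visit means likely to convert")
# is just another column in the feature matrix.
heuristic_flag = ((basket_value > 120) & (days_since_visit < 7)).astype(float)

X = np.column_stack([basket_value, days_since_visit, heuristic_flag])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print("feature importances [basket, recency, heuristic]:", model.feature_importances_)
```

If the heuristic carries real signal, its importance will show it; if not, it quietly gets ignored instead of overriding the model.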
Always make a proof-of-concept
Hidden Technical Debt in Machine Learning (D. Sculley, Gary Holt, Daniel Golovin et al),
http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Machine Learning: The High-Interest Credit Card of Technical Debt (D. Sculley, Gary Holt, Daniel
Golovin, Eugene Davydov et al),
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf
Machine learning projects can, and will, fail from time to time. To
start, make the simplest model possible, and test its effectiveness
using the simplest possible process. Adding surrounding
infrastructure without validating the approach first is asking for
trouble.
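One possible reading of “simplest model, simplest process”, sketched below with invented synthetic data: before building any surrounding infrastructure, compare a trivial baseline against a simple model using plain cross-validation, and only invest further if the gap justifies it.

```python
# Hypothetical proof-of-concept harness: a dummy baseline vs. a simple model,
# evaluated with nothing fancier than cross-validation.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

baseline = DummyClassifier(strategy="most_frequent")
candidate = LogisticRegression(max_iter=1_000)

for name, model in [("baseline", baseline), ("candidate", candidate)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f} (+/- {scores.std():.3f})")
```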
Always do a proper test to establish causality - or why you need to take a financial risk
The gold standard for establishing causality is a randomised
controlled experiment (A/B-test), though other useful causal
inference methods also exist for situations where A/B-testing isn’t
possible.
During a controlled experiment, you are invariably taking a
financial risk to determine the effectiveness of a machine learning
model.
Sometimes, it is surprisingly difficult to convince everyone that you
have to take a risk.
Example: predicting customer churn
“Can’t we just log churn risks without actually acting upon them,
and then follow up on how many people churned?”
Problem 1: data scientists already do these counterfactual tests as
part of modelling (testing accuracy on new data)
Problem 2: the treatment action may itself influence future
behaviour
Problem 3: if we did a “risk-free” run, and the model worked well,
we’d still need a real A/B test, effectively doubling time spent
testing
Moral of the story: don’t shy away from real experimentation.
Mitigate risks during modelling and/or by varying treatment group
sizes (Bayesian methods handle the latter naturally)
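To make the note about Bayesian methods and treatment group sizes concrete, here is a small Thompson sampling sketch over two variants. The conversion rates are invented, and this is only one of several ways to vary group sizes adaptively.

```python
# Sketch of Thompson sampling over two variants (control vs. ML model).
# The "true" conversion rates are made up; in production you only observe outcomes.
import numpy as np

rng = np.random.default_rng(42)
true_rate = {"control": 0.050, "model": 0.058}
alpha = {"control": 1.0, "model": 1.0}   # Beta(1, 1) priors
beta = {"control": 1.0, "model": 1.0}
shown = {"control": 0, "model": 0}

for _ in range(20_000):                  # each iteration = one visitor
    # Sample a plausible conversion rate for each arm and pick the best sample.
    arm = max(alpha, key=lambda a: rng.beta(alpha[a], beta[a]))
    shown[arm] += 1
    converted = rng.random() < true_rate[arm]
    alpha[arm] += converted              # posterior update
    beta[arm] += 1 - converted

print("traffic share:", {a: shown[a] / sum(shown.values()) for a in shown})
```

Because traffic drifts toward whichever variant the posterior currently favours, the financial downside of testing a weak model is naturally capped.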
Framing learning problems isn’t as easy as it seems, and it’s mostly because of lousy metrics
Let’s say we are tasked with building a recommender system for a
news site.
Do we build a model that:
● Predicts clicks/non-clicks?
● Predicts read time?
● Predicts conversion rates?
● Predicts explicit ratings?
● Predicts implicit ratings?
● Predicts something else?
Side note: all of the above have been used for recommendations in the past.
Let’s say we are tasked with building a recommender system for a
news site.
Do we use:
● A regression algorithm?
● A binary classification algorithm?
● A pairwise classification algorithm?
● A ranking algorithm?
● A multiclass classification algorithm?
● A multilabel classification algorithm?
● A matrix factorization algorithm?
● A non-parametric similarity algorithm?
● A reinforcement learning algorithm?
● ...
● A hybrid approach?
Side note: all of the above can be used for recommendations.
Rule of thumb: first choose a good metric, then experiment with
different learning algorithms.
Problem: most metrics used in business range from bad to
terrible.
On the Theory of Scales of Measurement (S.S. Stevens), Science 103(2684), 1946
Avg. rating, “The website has a friendly user interface”: 3.5/5
Strictly speaking, taking the mean of ordinal (Likert-scale) ratings like this isn’t allowed, yet we do it all the time. Why?
Good metrics are:
+ Measurable
+ Objective and unhackable*
+ Derived from strategy
+ Descriptive of what you want and need to know
+ Usable in every-day work
+ Understood by and accessible to everyone
+ Validated regularly
Bad metrics are:
- Unmeasurable
- Subjective and/or hackable
- Derived from coffee table conversation
- Chosen because they were easily available
- Too big to have an impact on, or too narrow to describe different cases
- Unknown to other stakeholders, and in the worst case even to you
- Not trusted or fully understood
Credit: Jan Hiekkaranta
Theoretically, the best way to apply ML in business is to optimise
directly against critical business KPIs, such as profit.
In practice, this is extremely difficult, because so many other things
can influence highest-level KPIs.
The solution? Derive a good proxy metric.
(Spectrum of possible optimisation targets, from closer to your problem to farther from it: read time → engagement → customer value → EBIT)
Q: How do you know your proxy metric is good?
A: Validate that it tracks well with higher-level metrics. This can
even be done statistically, e.g. using IEEE’s standards for software
measurement (IEEE Standard for a Software Quality Metrics
Methodology. Technical report, December 1998, ISBN
1-55937-529-9). The standards aren’t made for this purpose, but
work quite well!
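As a rough illustration of that kind of check (the weekly aggregates below are fabricated), one simple starting point is to look at how strongly period-level movements in the proxy track the higher-level KPI.

```python
# Sketch: does the proxy metric (e.g. read time) track the KPI (e.g. revenue)?
# The weekly aggregates below are invented purely for illustration.
import numpy as np
from scipy.stats import spearmanr

weekly_read_time = np.array([310, 325, 298, 340, 360, 355, 372, 390, 401, 395])
weekly_revenue = np.array([98, 103, 95, 108, 115, 112, 118, 124, 130, 127])

rho, p_value = spearmanr(weekly_read_time, weekly_revenue)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A strong, stable rank correlation over many periods is (weak) evidence the
# proxy is worth optimising; a low or unstable one is a warning sign.
```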
Statistical validation aside, thoughtful reasoning is still valuable.
Consider recommender systems that predict click-through-rates
(CTR):
● Does a click really mean I’m interested?
● Who would really care about CTRs if I can improve total
minutes spent with our system?
○ Conversely, who would care if CTRs were high but
read times lousy?
● What biases are at play here?
● ...
● Where does money change hands?
To business developers: set out well-designed, validated KPIs &
proxy metrics and require that ML projects target those. Data
Scientists can help with metric designs.
Data Scientists should optimise a model against real costs & returns, but often can’t
Use case: predicting fraud
True positives: 11,854     False positives: 582
False negatives: 134       True negatives: 300,297
F1-score: 0.9707, Recall: 0.9888, Precision: 0.9532
Use case: detecting malignant tumours
True positives: 11,854     False positives: 23
False negatives: 1,333     True negatives: 300,297
F1-score: 0.9451, Recall: 0.8989, Precision: 0.9981
All classification problems are cost-sensitive
classification problems.
The quantity to optimise is the expected cost in €, not a generic accuracy metric.
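A small sketch of what that means for the two confusion matrices above; the per-error costs below are invented, and the point is that which model looks better flips when they change.

```python
# Sketch: price the confusion matrices above with made-up per-error costs.
# Which model is "better" depends entirely on what a false positive and a
# false negative actually cost, not on F1 alone.
def expected_cost(tp, fp, fn, tn, cost_fp, cost_fn):
    n = tp + fp + fn + tn
    return (fp * cost_fp + fn * cost_fn) / n   # average cost per decision, in euros

model_a = dict(tp=11_854, fp=582, fn=134, tn=300_297)    # F1 = 0.9707
model_b = dict(tp=11_854, fp=23, fn=1_333, tn=300_297)   # F1 = 0.9451

# Two invented cost regimes: one where misses are expensive (fraud slipping
# through), one where false alarms are expensive (needless interventions).
for cost_fp, cost_fn in [(5.0, 200.0), (200.0, 5.0)]:
    a = expected_cost(**model_a, cost_fp=cost_fp, cost_fn=cost_fn)
    b = expected_cost(**model_b, cost_fp=cost_fp, cost_fn=cost_fn)
    print(f"cost_fp={cost_fp}, cost_fn={cost_fn}: A={a:.3f}, B={b:.3f} euros per case")
```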
Strategies for cost-sensitive classification (one of them sketched below):
● Upsampling
● Downsampling
● Rejection sampling
● Importance weighting
● Using a native cost-sensitive classification algorithm
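Of the strategies listed, importance weighting is often the easiest to try first. A hedged scikit-learn sketch on synthetic data, assuming purely for illustration that a false negative costs ten times as much as a false positive:

```python
# Sketch: cost-sensitive classification via importance weighting.
# The 10:1 cost ratio is an assumption made up for this example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# Weight each training example by the cost of getting it wrong.
cost_fn, cost_fp = 10.0, 1.0
weights = np.where(y_tr == 1, cost_fn, cost_fp)
weighted = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr, sample_weight=weights)

for name, model in [("unweighted", plain), ("cost-weighted", weighted)]:
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    print(f"{name}: expected cost = {(fp * cost_fp + fn * cost_fn) / len(y_te):.3f} per case")
```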
Data Scientist: “on validation data, the accuracy is 98% with
an F1-score of 94%. This is a 19% improvement over our
baseline”
Data Scientist: “we estimate 3.4 euros more per month per
user if we put this model into production”
Predicting customer churn
Data Scientist: “What’s the expected cost to the company if we fail
to keep a customer from leaving?”
PO: “Well, the expected lifetime value of a customer is around 350
euros”
Business Manager A: “100 euros”
Software Engineer B: “1210 euros”
Accountant C: “420 euros”
Another Data Scientist: “It depends”
Existing business processes can severely restrict the potential of machine learning
“We already have a logic-based system for flagging critical alarms,
but some still slip through. We’d like to replace the entire system
with ML”
Data Scientist: “OK, where’s the control group data?”
“We want to forecast the number of customer service chats each
day, for resource allocation purposes. We’ve got data on all the calls
our reps take”
Data Scientist: “Are the incoming chat attempts recorded
somewhere? Is the customer service number closed during
evenings/weekends?”
Existing opinions can severely restrict the potential of machine learning
“The prices the algorithm suggests are sometimes too low, so we disregard those”
“On Fridays, we don’t use the recommendation engine because our
content creators want to promote something else”
“We can’t release this to production; the recommendations I got
were pretty bad”
All of the above are strawman examples. Edge cases that are truly
suboptimal should be addressed on the algorithm level, not by
slapping opinions on top.
If you aren’t ready to let machine learning do its thing, don’t
use it. The less you override it, the better it works.
When the machine learning part of a machine learning project fails, it’s because of bad features/feature engineering
Garbage In, Garbage Out.
When a model fails to predict something, it’s because the
information used to train it lacks predictive power.
This, in turn, is because either the information used is wrong, or
not engineered into useful features.
There are no exceptions to this rule. Applied machine learning
is basically an exercise in feature engineering (note: feature
engineering is hard).
Good feature engineering + a naïve learning algorithm trumps bad
engineering + a sophisticated learning algorithm 99% of the time.
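A toy illustration of that claim, on entirely synthetic data: a linear model gets nowhere on raw coordinates when the decision boundary is a circle, but one engineered feature makes the same problem almost trivial.

```python
# Toy example: a circular decision boundary. Raw (x, y) coordinates give a
# linear model nothing to work with; one engineered feature (distance from
# the origin) makes the problem easy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(4_000, 2))
labels = (np.hypot(xy[:, 0], xy[:, 1]) < 0.6).astype(int)   # inside vs. outside a circle

raw_score = cross_val_score(LogisticRegression(max_iter=1_000), xy, labels, cv=5).mean()

engineered = np.column_stack([xy, np.hypot(xy[:, 0], xy[:, 1])])  # add a radius feature
eng_score = cross_val_score(LogisticRegression(max_iter=1_000), engineered, labels, cv=5).mean()

print(f"raw features:        {raw_score:.2f} accuracy")  # roughly the majority-class rate
print(f"+ engineered radius: {eng_score:.2f} accuracy")
```

The engineered feature does the work a more sophisticated learning algorithm would otherwise have to approximate.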
At the end of the day, some machine learning projects succeed and
some fail. What makes the difference? Easily the most important
factor is the features used. If you have many independent features
that each correlate well with the class, learning is easy. On the other
hand, if the class is a very complex function of the features, you
may not be able to learn it. Often, the raw data is not in a form that
is amenable to learning, but you can construct features from it that
are. This is typically where most of the effort in a machine learning
project goes. It is often also one of the most interesting parts, where
intuition, creativity and “black art” are as important as the
technical stuff.
A Few Useful Things to Know about Machine Learning (Pedro Domingos),
https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Machine learning can’t really be called “intelligent” unless you allow for exploration
Direct Feedback Loops. A model may directly influence the
selection of its own future training data. It is common practice to
use standard supervised algorithms, although the theoretically
correct solution would be to use bandit algorithms. The problem
here is that bandit algorithms (such as contextual bandits [9]) do
not necessarily scale well to the size of action spaces typically
required for real-world problems. It is possible to mitigate these
effects by using some amount of randomization [3], or by isolating
certain parts of data from being influenced by a given model.
Hidden Technical Debt in Machine Learning (D. Sculley, Gary Holt, Daniel Golovin et al),
http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
(Diagram) Almost all production machine learning systems: a closed loop of Learn, Log and Deploy, with no exploration step.
(Diagram) A fundamentally correct machine learning system: the same loop of Learn, Log and Deploy, plus an Explore step.
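A minimal sketch of the “some amount of randomization” mentioned in the excerpt above, in the style of an epsilon-greedy policy (the action names and the epsilon value are arbitrary): most traffic follows the model, a small slice explores, and both the action and the probability it was shown with are logged for later off-policy learning.

```python
# Sketch: epsilon-greedy exploration around a deployed model's choice.
# Logging the action *and* its propensity is what makes later learning honest.
import random

ACTIONS = ["article_a", "article_b", "article_c"]   # hypothetical recommendations
EPSILON = 0.05                                       # share of traffic used to explore

def choose(model_pick: str, epsilon: float = EPSILON):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = model_pick
    # Overall probability this action was shown, given the model's pick.
    propensity = epsilon / len(ACTIONS) + (1 - epsilon) * (action == model_pick)
    return action, propensity            # log both, alongside the observed reward

action, p = choose("article_b")
print(action, p)
```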
Having company-wide control groups is a non-negotiable part of data-driven decision making & modelling
Some things I’ve seen happen:
- Random uniform choices working better than
human opinions (including my own)
- Machine learning models tested only against other
machine learning models
- “Controlled” experiments run without control groups
- A/B tests failing due to other treatments happening at the
same time
Programming languages for Data Science aren’t all that great
R: made for data science, with other stuff added later
Python: built for general purpose computing, with data science
stuff added later
Both: too slow in many cases
Others: not always viable because of meagre ecosystems
The tool & service ecosystem for machine learning is fragmented, non-standardised, and fragile
(Figure: current status of model exchange formats)
Sometimes, Data Scientists make good models using learning algorithms they don’t fully understand
Me: “neural networks learn through backpropagation, which
adjusts weights based on the chain rule and the partial derivative of
the loss function with respect to the weights in each layer.
Initialisation must however be symmetry-breaking...”
Me: “gradient boosted trees learn using a set of weak learners”
Me: “Random Forests are made up of trees”
Me: “what’s an SVM?”
Sunk costs are almost always taken into account when productionising machine learning projects, but they shouldn’t be
“The license for this platform cost us 1.2 M€, so it should be our
primary platform going forward.”
Internal thinking: “developing this model & A/B test took 4
months, so we’re definitely taking it to production”
In 1968 Knox and Inkster,[2] in what is perhaps the classic sunk
cost experiment, approached 141 horse bettors: 72 of the people
had just finished placing a $2.00 bet within the past 30 seconds, and
69 people were about to place a $2.00 bet in the next 30 seconds.
Their hypothesis was that people who had just committed
themselves to a course of action (betting $2.00) would reduce
post-decision dissonance by believing more strongly than ever that
they had picked a winner. Knox and Inkster asked the bettors to
rate their horse's chances of winning on a 7-point scale. What they
found was that people who were about to place a bet rated the
chance that their horse would win at an average of 3.48 which
corresponded to a "fair chance of winning" whereas people who
had just finished betting gave an average rating of 4.81 which
corresponded to a "good chance of winning".
Though from a different domain, adapting the Markov property
is a good rule of thumb.
“The future should be independent of the past given the present”
Fortune still favours the brave.
Why can’t we use machine learning to optimise an airport?
Optimal aircraft parking & people transportation using linear
programming. Live at Kittilä Airport.
Why can’t we use machine learning to generate a single malt whisky?
Machine-generated single malt whisky recipes, curated by
Mackmyra’s Master Blender. A mix of old & new learning
algorithms, including generator/discriminators. Full reveal at
The Next Web 2019.
Machine learning isn’t easy. But it’s worth it.
Thank you! Questions?
A special thanks to Jarno Kartela & Jan Hiekkaranta for their
contributions.
max.pagels@fourkind.com
www.fourkind.com
@fourkindnow
