Technical debt in ML
by Jaroslaw Szymczak, Senior Data Scientist @ OLX Group
2
Agenda
● Introduction
● Basic concepts
● Components of technical debt in machine learning...
● … and how we can tackle them
● Q&A
3
A few words about OLX
350+ million
active users
40+
countries
60+ million new
listings per month
Market leader
in 35 countries
4
● M.Sc. in computer science with a specialization in machine learning
● Senior Data Scientist @ OLX Group
● Focusing on content quality as well as trust and safety topics
● Responsible for the full ML project lifecycle - research,
productization, development and maintenance
● Experienced in delivering anti-fraud solutions for tier-one
European insurance companies as well as retail and investment
banks
● Worked on external data science projects for churn prediction,
sales estimation and predictive maintenance for various
companies in Germany and Poland
And a few words about me
jaroslaw.szymczak@olx.com
Basic concepts
What is technical debt? How does this concept apply
to ML?
6
Technical debt
Technical debt is a concept in software development that reflects the implied cost of additional rework
caused by choosing an easy solution now instead of using a better approach that would take longer.
(by Wikipedia)
● Main components increasing technical debt:
○ lack of testing
○ inadequate system monitoring
○ code complexity
○ “dependency jungle”
○ lack of documentation (especially painful in connection with high staff turnover)
● Main reasons for technical debt:
○ time pressure
○ using prototypes as the base for production systems
○ lack of technical debt awareness
7
Technical debt in machine learning
Technical debt in machine learning is the usual technical debt, plus more. ML has the unique capability of
increasing the debt at an extremely fast pace.
Image source: https://c.pxhere.com/photos/29/fb/money_euro_coin_coins_bank_note_calculator_budget_save-1021960.jpg!d (labelled for reuse)
Components of technical debt
in ML
Or at least some of them
9
(Hidden) feedback loops
Image source: https://upload.wikimedia.org/wikipedia/commons/thumb/5/50/General_closed_loop_feedback_system.svg/400px-General_closed_loop_feedback_system.svg.png
When the system is retrained on the data it has provided… and, what’s worse, when you
measure performance on such data only...
10
Breaking the loop
Sampling:
● should be seen as a necessity, not as an optional feature
● make sure to make it unbiased
● or take the bias into account
● cannot be used everywhere, but A/B tests can
Model retraining:
● very often the majority of training data will come from the feedback loop
● try to account for this, e.g. with weighting (see the sketch below)
● establish a process for frequent retraining
Testing & monitoring:
● think of it as part of your MVP product
● as without it you’re just guessing that things work
● use real distributions for offline testing
● and ensure it is aligned with what you see live
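A minimal sketch of the retraining idea above, not OLX production code: feedback-loop examples are down-weighted during training and a purely random sample is held out for unbiased offline evaluation. The column names ("from_feedback_loop", "label") and the weight value are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def retrain_with_feedback_weighting(df: pd.DataFrame, feature_cols, loop_weight=0.3):
    # Hold out a purely random, unbiased sample for offline evaluation.
    eval_df = df.sample(frac=0.1, random_state=42)
    train_df = df.drop(eval_df.index)

    # Examples produced by the system's own decisions get a lower weight,
    # so the unbiased part of the data dominates what the model learns.
    weights = np.where(train_df["from_feedback_loop"], loop_weight, 1.0)

    model = LogisticRegression(max_iter=1000)
    model.fit(train_df[feature_cols], train_df["label"], sample_weight=weights)

    # Evaluate on the random hold-out, i.e. on the real distribution.
    auc = roc_auc_score(eval_df["label"],
                        model.predict_proba(eval_df[feature_cols])[:, 1])
    return model, auc
```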
11
Undeclared consumers
Image source: https://www.flickr.com/photos/texese/106442115 (labeled for reuse)
When your system is so great
that everyone wants to use it,
not necessarily letting you know about it…
And then you improve it - are they happy?
12
OLX way of handling the data
Diagram: a Data lake built on Amazon S3, AWS Glue and Amazon EMR, with a Catalog and multiple Reservoirs.
13
Data dependencies
Image source:https://de.wikipedia.org/wiki/Datei:Data.jpg
ML model = algorithm + data
What happens when Google
changes its ranking algorithm?
What will happen to our models
when the incoming data
changes?
Do we really need all these
features?
14
Robust feature encoding example
Before (on raw data):
● one-hot encoded on id
● hierarchy not encoded
● extremely sensitive to any
changes
After (on enhanced data):
● encoding the hierarchy
● using names to have
meaningful features
● still data dependent
(as ML will always be)
● should survive our
challenges though (see the sketch below)
Challenges - what will happen:
● when we split a large category into more sub-categories?
● when we merge sub-categories?
● when we make some other changes in the hierarchy?
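A minimal sketch of hierarchy-aware category features, assuming each listing carries a category path like "electronics > phones > smartphones" (the paths and feature names here are illustrative). Encoding every level by name, instead of one-hot encoding a category id, means a split or merge of leaf categories only changes the deepest feature while the higher levels keep their meaning.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

def hierarchy_features(category_path: str, max_depth: int = 3) -> dict:
    # Turn "a > b > c" into {"category_l1": "a", "category_l2": "b", "category_l3": "c"}.
    levels = [part.strip() for part in category_path.split(">")]
    return {f"category_l{i + 1}": name for i, name in enumerate(levels[:max_depth])}

paths = pd.Series([
    "electronics > phones > smartphones",
    "electronics > phones > accessories",
    "home & garden > furniture",
])

# DictVectorizer turns each "level = name" pair into its own indicator feature.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(paths.map(hierarchy_features))
print(vectorizer.get_feature_names_out())
```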
15
Feature consistency in online and offline setting
Diagram: offline feature extraction (database record → aggregation → information extracted after the event occurred) and an online feature extraction example (user / service live request with data → aggregation → information extracted at event time), both feeding features into the model.
Goal:
Offline (for training)
and online (for live prediction)
feature extraction processes
end up with the same feature value
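One way to work towards that goal, shown here as a minimal sketch with illustrative field names: keep the feature logic in a single shared function and have both the batch training job and the live prediction service call it.

```python
from dataclasses import dataclass

@dataclass
class ListingEvent:
    price: float
    category: str
    description: str

def extract_features(event: ListingEvent) -> dict:
    # The only place where feature logic lives; any change here affects
    # training data generation and live scoring in exactly the same way.
    return {
        "price": event.price,
        "description_length": float(len(event.description)),
        "category": event.category,
    }

# Offline: replay stored events exactly as they looked at event time.
def build_training_rows(stored_events):
    return [extract_features(ListingEvent(**e)) for e in stored_events]

# Online: the service builds the same features from the live request payload.
def score_request(model, vectorizer, payload: dict):
    features = extract_features(ListingEvent(**payload))
    return model.predict_proba(vectorizer.transform([features]))[:, 1]
```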
16
Decision cascades
Image source: https://www.maxpixel.net/Cascade-By-The-Sautadet-Cascade-Gard-567383 (labeled for reuse)
Rules are everywhere...
And sometimes it really makes
sense to use them (or another
model) in combination with your
model
But then, why was this automatic
decision made? Which part
of the system is responsible for it?
Oh, we have very bad automatic
decisions affecting our clients -
how can we fix them?
17
Zoo of rules - how do we manage it?
Image source: https://c1.staticflickr.com/9/8044/8445978554_1d1716447b_b.jpg (marked for reuse)
● define clear decision logic in a single
component of your system
● make it very transparent, allow for partial
decisions and decisions on incomplete input
(because you will need it)
● audit every partial decision on every version of
your input
● do not use thresholding inside the rules, make
your component responsible for that (see the sketch below)
● be careful with machine learning models that
can have a different output distribution after
model updates
● same for rules, be aware of what concept they
represent
● allow running simulations of how the system would
behave with various settings, including past and
potential future ones
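A minimal sketch of such a single decision component, under the assumption that rules and models both emit partial scores rather than final verdicts; the component applies the thresholds and keeps an audit trail of every partial decision. Class and field names are illustrative, not an existing OLX API.

```python
from dataclasses import dataclass, field

@dataclass
class PartialDecision:
    source: str          # which rule or model produced it
    score: float         # raw score or vote, no thresholding inside the rule
    input_version: str   # version of the input it was computed on

@dataclass
class DecisionEngine:
    block_threshold: float = 0.8
    audit_log: list = field(default_factory=list)

    def decide(self, partials: list) -> str:
        # Missing signals are allowed: we decide on whatever partials we have,
        # and every partial decision is recorded for later auditing.
        self.audit_log.extend(partials)
        combined = max((p.score for p in partials), default=0.0)
        return "block" if combined >= self.block_threshold else "allow"

engine = DecisionEngine()
verdict = engine.decide([
    PartialDecision("duplicate_rule", 0.2, "v3"),
    PartialDecision("fraud_model", 0.91, "v3"),
])
print(verdict, engine.audit_log)
```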
18
Thresholds in ML models
Photo information: a screenshot from a toy example of model evaluation in Amazon Machine Learning
Key facts to remember:
● every time we retrain the model, the scores differ
● by evaluating on a proper sample we can calibrate the threshold (see the sketch below)
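A minimal sketch of re-deriving the decision threshold after each retraining, assuming the goal is to keep a fixed flag rate (e.g. flag the top 5% of scores) measured on a representative evaluation sample instead of hard-coding a threshold such as 0.5. The beta-distributed scores stand in for real evaluation scores.

```python
import numpy as np

def threshold_for_flag_rate(scores: np.ndarray, flag_rate: float = 0.05) -> float:
    # Scores shift between model versions, so the threshold must be re-derived
    # from the new score distribution every time the model is retrained.
    return float(np.quantile(scores, 1.0 - flag_rate))

new_model_scores = np.random.beta(2, 5, size=10_000)  # stand-in for eval scores
threshold = threshold_for_flag_rate(new_model_scores, flag_rate=0.05)
print(f"flag listings with score >= {threshold:.3f}")
```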