This document summarizes a talk on building machine learning systems that continuously learn from feedback to improve over time.
It describes how models trained on static datasets degrade over time when applied to non-stationary domains such as customer support conversations. A two-model approach is proposed: a global model trained on a large dataset, plus per-brand local models that learn from feedback to adapt to each brand.
The local models use online learning algorithms to incorporate feedback on their predictions and update continuously. By adapting to each brand's definitions, this approach improved accuracy by ~7% on a customer support test set, and combining the global and local model scores provided further gains. Future work focuses on improving adaptation and reducing bias from feedback.
Building ML Systems that Continuously Learn from Mistakes
1. Continuous Learning Systems:
Building ML systems that learn
from their mistakes
Anuj Gupta
(Intuit)
Saurabh Arora, Satyam Saxena, Navaneethan Santhanam
This work was done when the authors were at Freshworks
2. Agenda
1. Understanding the Problem Statement
● Background
● Metrics that matter
● Observations
2. Solution v1.0
3. Issues
4. Solution v2.0
a. Building the feedback loop
b. Global + local
5. Results
6. Conclusions and Way Forward
3. Background
● Customer support on social media is now a must for all B2C brands.
● Examples: @AppleSupport, @AmazonHelp, @BofA_Help.
● Twitter and Facebook have launched dedicated features for this.
● Most CRM suites support customer service on social media.
4. Metrics that matter
● Owing to the public nature of conversations, brands care about two things:
a. Reply fast
b. Reply well
Both contribute to how a brand is perceived.
● To measure (a), two key metrics are:
a. Average First Response Time (AFRT)
b. Average Response Time (ART)
5. ● Many of our customers (the CS teams of brands) had pretty high AFRT/ART.
● Ask: Reduce AFRT/ART.
● Traffic on a brand's social channel is not just questions or requests. It's a lot more than that!
7. Observations
● The average number of replies sent per agent per day was relatively low (~12-15), yet ART/FRT were pretty high.
● Of the total inbound traffic on support handles, only a fraction of tickets were being replied to: typically ~5%-40%.
● Between two messages that were responded to, there were many messages (~3-30) that were not.
● Most of the time was going into finding actionable conversations.
8. Solution v1.0
• A noise filter for CS@social.
• Model it as a (binary) classification problem: Actionable vs. Noise/Spam.
• Acquire a good-quality dataset.
• Engineer features; there are some very good indicators.
• Train-test-tune to ~75% accuracy. Deploy.
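The v1.0 pipeline can be sketched as a standard text-classification setup. The talk does not specify the features or model used, so the TF-IDF + linear classifier below (and the toy examples) are purely illustrative:

```python
# Minimal sketch of a v1.0-style noise filter: a binary text classifier.
# The vectorizer, model, and examples are assumptions for illustration,
# not the production feature set from the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples: 1 = actionable, 0 = noise/spam
tweets = [
    "My order #1234 never arrived, please help",
    "Your app crashes every time I open it",
    "Good morning everyone!",
    "RT to win a free t-shirt",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, labels)

print(clf.predict(["The app keeps crashing on login"]))
```

In practice the dataset would be far larger and the engineered features (mentions, question marks, links, etc.) would replace or augment the plain bag-of-words.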
9. Issues
*within couple of weeks of deployment
● Performance varied across brands.
● While for some brands the model worked very well, for some it did very badly.
● As time* went by even the models that performed well, started doing badly.
10. Behind the Scenes
• Our data was changing: non-stationary distributions.
• A stationary process is time-independent; its averages remain more or less constant.
11. Behind the Scenes (contd.)
• The world of CS@social is not just black (noise) and white (actionable).
• It also has a spectrum of grey in between:
a. "Hi", "Hello", "Good morning"
b. "Any new offers today?"
c. "The recent ad you launched is very good. Keep it up."
d. Quizzes, engagement posts
• Some brands respond to such traffic; some do not.
• Noise and actionable are merely the two extremes of this spectrum.
• The definition of noise and actionable was not consistent across brands.
• The boundary (in the grey region) separating noise from actionable varies from brand to brand.
• A single common classifier for all is doomed to fail!
12. In a Nutshell
• Given the last few slides, the degradation in model performance shouldn't come as a surprise.
• One model to fit all is not going to work.
• Non-stationary distributions are not specific to Twitter data; they appear in other domains as well:
o Monitoring & anomaly detection (one-class classification) in adversarial settings
o Recommendations (where user preferences are continuously changing; evolving labels)
o Stock market predictions (concept drift; evolving distributions)
13. Towards a Solution: Exploration
• Build a per-brand model for brand-specific learning.
• Learn from mistakes: in our system, by looking at which messages are replied to and which are not, we know (with a small delay) whether the classification done by the system was right or wrong.
• The model was not utilizing these signals to improve.
• If feedback is utilized well, the model can:
• Adapt, over time, to the brand's definition of noise and actionable.
• Adapt to variations/changes in features.
14. Incorporating Feedback
• Option 1: Frequently retrain the model on the updated data and redeploy it.
o Training, testing, and fine-tuning ~45K models: compute-heavy; doesn't scale at all.
o Loses all old learnings.
• Option 2: Keep learning from feedback: the model adapts to the new incoming data.
15. What Worked for Us
● Two models: Global + Local.
● The global model is common to all brands:
○ Batch-trained on a large corpus.
○ No short-term updates.
● One local model per brand:
○ A fast learner with short-term updates.
16. Local
• Goals
o Improve with feedback.
• Desired properties
o Fast learner (light compute).
▪ Incorporates most feedback successfully.
(After a model update, if the same data point is presented again, the model must predict its class label correctly.)
o Avoids catastrophic forgetting.
(After a model update, if the last N data points are presented, the model should predict their class labels with high accuracy.)
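The "incorporates feedback successfully" property can be demonstrated with a Passive-Aggressive classifier (one of the online learners cited in the references). This is an illustrative check on synthetic 2-D data, not the production local model:

```python
# Sketch: after a Passive-Aggressive update on a misclassified point,
# presenting the same point again should yield the correct label.
# The seed data and feature vectors here are assumed for illustration.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

model = PassiveAggressiveClassifier()
X_seed = np.array([[1.0, 0.0], [0.0, 1.0]])
model.partial_fit(X_seed, [0, 1], classes=[0, 1])

x = np.array([[2.0, 0.5]])  # a point carrying feedback label y_true
y_true = 1
if model.predict(x)[0] != y_true:
    model.partial_fit(x, [y_true])  # update on the mistake

# Property check: the same point is now classified correctly.
print(model.predict(x)[0] == y_true)
```

The Passive-Aggressive update is "aggressive" on mistakes: unless the update is clipped by the aggressiveness parameter C, it moves the decision boundary just far enough that the offending point is classified with margin, which is exactly the property the slide asks for.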
18. Possible Approaches to Incorporating Feedback
• Mini-batches: works fine if the velocity of feedback data is high (you don't have to wait long to accumulate a mini-batch of feedback). Many applications don't have high velocity.
• Instant feedback, tiny batches: very few data points can skew the model.
19. Building the Feedback Loop
• We model a feedback point <Tweet, YT> as a data point presented to the local model in an online setting.
• Thus, a batch of feedback = an incoming data stream.
• We used online learning:
o Data is modeled as a stream.
o The model makes a prediction (YP) when presented with a data point (X).
o The environment reveals the correct class label (YT).
o If YP ≠ YT, update the model with <X, YT>.
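The online protocol above can be sketched as a simple loop. Here the stream is synthetic and the learner is scikit-learn's Passive-Aggressive classifier (the talk's references point to this family of algorithms, though the exact local model is not specified):

```python
# Minimal sketch of the online learning loop: predict, observe the true
# label, and update only on mistakes. Stream data is simulated.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)
# Simulated stream: 2-D feature vectors, linearly separable labels.
X_stream = rng.normal(size=(500, 2))
y_stream = (X_stream[:, 0] + X_stream[:, 1] > 0).astype(int)

model = PassiveAggressiveClassifier()
model.partial_fit(X_stream[:1], y_stream[:1], classes=[0, 1])

mistakes = 0
for x, y_true in zip(X_stream[1:], y_stream[1:]):
    y_pred = model.predict(x.reshape(1, -1))[0]      # model predicts YP
    if y_pred != y_true:                             # environment reveals YT
        mistakes += 1
        model.partial_fit(x.reshape(1, -1), [y_true])  # update with <X, YT>

print(f"mistakes on stream: {mistakes} / {len(X_stream) - 1}")
```

In the production setting, X would be the engineered tweet features and YT would arrive (with a small delay) from observing whether an agent replied to the message.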
21. Results of Local
• Dataset: 150K tweets, time-sequenced.
• Feedback incorporation improves accuracy:
o Trained the model (offline, batch mode) on the first 100K data points.
o On the test set (the last 50K data points), it gave 75% accuracy (offline batch mode).
o Then ran the model on the test data (50K data points) in an online fashion:
- The model made a total of 9,028 mistakes.
- These mistakes were instantaneously fed back into the local model as feedback.
- This gives an accuracy of ~82% across the test set.
o We gained ~7% accuracy by incorporating feedback.
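The evaluation above (batch-train on the earlier portion of a time-sequenced dataset, then replay the rest as a stream with feedback) can be sketched as follows. The data here is synthetic with an artificial concept drift, so the accuracy numbers will not match the talk's 75%/82%:

```python
# Sketch of the evaluation protocol: compare a static batch-trained model
# against the same model updated online on its mistakes, on a drifting stream.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 2))
# Concept drift: the labeling rule changes partway through the stream.
y = np.where(np.arange(1500) < 1000,
             X[:, 0] > 0,              # old concept
             X[:, 1] > 0).astype(int)  # new concept

train_X, train_y = X[:1000], y[:1000]
test_X, test_y = X[1000:], y[1000:]

# Static model: batch-trained once, never updated.
static = PassiveAggressiveClassifier().partial_fit(train_X, train_y, classes=[0, 1])
static_acc = (static.predict(test_X) == test_y).mean()

# Online model: same starting point, but updated on every mistake.
online = PassiveAggressiveClassifier().partial_fit(train_X, train_y, classes=[0, 1])
correct = 0
for x, y_true in zip(test_X, test_y):
    y_pred = online.predict(x.reshape(1, -1))[0]
    correct += int(y_pred == y_true)
    if y_pred != y_true:
        online.partial_fit(x.reshape(1, -1), [y_true])
online_acc = correct / len(test_y)

print(f"static accuracy: {static_acc:.2f}, online accuracy: {online_acc:.2f}")
```

Because the online model adapts to the new concept while the static model cannot, the online accuracy ends up higher; this mirrors the ~7% gain reported on the real dataset.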
22. [Plot: running accuracy improving with the number of test points]
We also tested the local model by feeding it wrong feedback.
23. Combining Global and Local
• Scores from both the global and local models are combined into a single score, and a threshold is applied to arrive at a prediction.
• We got an accuracy of ~82%.
[Diagram: Global and Local model scores feed into a combined score]
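One simple way to realize this combination is a weighted average of the two scores followed by a threshold. The talk does not specify the combination function, so the weight and threshold values below are illustrative assumptions:

```python
# Hedged sketch of combining global and local model scores.
# w_local and threshold are illustrative, not the talk's actual values.
def combine(global_score: float, local_score: float,
            w_local: float = 0.5, threshold: float = 0.5) -> int:
    """Blend the two scores and threshold into a 0/1 prediction."""
    score = (1 - w_local) * global_score + w_local * local_score
    return int(score >= threshold)

print(combine(0.9, 0.8))  # both models agree: actionable
print(combine(0.1, 0.2))  # both models agree: noise
```

Raising `w_local` lets the brand-specific model dominate as it accumulates feedback; the weight could itself be tuned per brand.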
25. Pros:
• Improved running accuracy.
• Personalization: the notion of spam varies from brand to brand. Some brands treat 'Hi' and 'Hello' as spam, while others treat them as actionable. By learning from feedback, the model adapts to the brand's notions.
• The local model is lightweight and fast, and thus easy to bootstrap, deploy, and scale.
Cons:
● The local model can overfit to feedback and thus become biased.
● Need to monitor for bias.
● Reset the local model when it becomes biased.
26. Future Work
• Instead of a single global model, have vertical-specific global models.
• Try other online algorithms.
• Handle drift.
• Don't incorporate every feedback point; update only on the most important ones.
27. References
1. "Online Passive-Aggressive Algorithms" - Crammer et al., JMLR 2006
2. "The Learning Behind Gmail Priority Inbox" - Aberdeen et al., LCCC: NIPS Workshop 2010
3. "Learning with Drift Detection" - Gama et al., SBIA 2004
4. "Adaptive Regularization of Weight Vectors" - Crammer et al., NIPS 2009
5. LIBOL - A Library for Online Learning Algorithms. https://github.com/LIBOL/LIBOL
28. Thank You
Please feel free to reach out post this talk or on the interwebs.
@anujgupta82
Anuj Gupta