This document summarizes a talk on building machine learning systems that continuously learn from feedback to improve over time.
It describes how models trained on static datasets degrade over time when applied to non-stationary domains such as customer support conversations. A two-model approach is proposed: a global model trained on a large dataset, plus per-brand local models that learn from feedback to adapt to each brand.
The local models use online learning algorithms to incorporate feedback on their predictions and update continuously. By adapting to each brand's definitions, this approach improved accuracy by ~7% on a customer support test set, and combining the global and local model scores provided further gains. Future work focuses on improving adaptation and reducing bias from feedback.
Building ML Systems that Continuously Learn from Mistakes
1. Continuous Learning Systems:
Building ML systems that learn
from their mistakes
Anuj Gupta
(Intuit)
Saurabh Arora, Satyam Saxena, Navaneethan Santhanam
This work was done when the authors were at Freshworks
2. Agenda
1. Understanding the Problem Statement
● Background
● Metrics that matter
● Observations
2. Solution v1.0
3. Issues
4. Solution v2.0
a. Building the feedback loop
b. Global + local
5. Results
6. Conclusions and Way Forward
3. Background
● Customer support on social media is now a must for all B2C brands.
● Examples: @AppleSupport, @AmazonHelp, @BofA_Help.
● Twitter and Facebook have launched dedicated features for this.
● Most CRM suites support customer service on social media.
4. Metrics that matter
● Owing to the public nature of conversations, brands care about two things:
a. Reply fast
b. Reply well
Both contribute to how a brand is perceived.
● To measure (a), two key metrics are:
a. Average First Response Time (AFRT)
b. Average Response Time (ART)
5. ● Many of our customers (the CS teams of brands) had pretty high AFRT/ART.
● Ask: Reduce AFRT/ART.
● Traffic on a brand's social channel is not just questions or requests. It's a lot more than that!
7. Observations
● The average number of replies sent per agent per day was relatively low (~12-15), yet ART/FRT were pretty high.
● Of the total inbound traffic on support handles, only a fraction of tickets were being replied to: typically ~5%-40%.
● Between two messages that were responded to, there were many messages (~3-30) that were not.
● Most of the time was going into finding actionable conversations.
8. Solution v1.0
• A noise filter for CS@social.
• Model it as a (binary) classification problem: Actionable vs. Noise/Spam.
• Acquire a good-quality dataset.
• Engineer features; there are some very good indicators.
• Train-test-tune to ~75% accuracy. Deploy.
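The v1.0 pipeline can be sketched as a standard text-classification setup. The talk does not specify the features or model used, so the TF-IDF + linear classifier below (and the toy examples) are purely illustrative:

```python
# Minimal sketch of a v1.0-style noise filter: a binary text classifier.
# The vectorizer, model, and examples are assumptions for illustration,
# not the production feature set from the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples: 1 = actionable, 0 = noise/spam
tweets = [
    "My order #1234 never arrived, please help",
    "Your app crashes every time I open it",
    "Good morning everyone!",
    "RT to win a free t-shirt",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, labels)

print(clf.predict(["The app keeps crashing on login"]))
```

In practice the dataset would be far larger and the engineered features (mentions, question marks, links, etc.) would replace or augment the plain bag-of-words.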
9. Issues
*within couple of weeks of deployment
● Performance varied across brands.
● While for some brands the model worked very well, for some it did very badly.
● As time* went by even the models that performed well, started doing badly.
10. Behind the Scenes
• Our data was changing: non-stationary distributions.
• A stationary process is time-independent; its averages remain more or less constant.
11. Behind the Scenes (contd.)
• The world of CS@social is not just black (noise) and white (actionable).
• It also has a spectrum of grey in between:
a. "Hi", "Hello", "Good morning"
b. "Any new offers today?"
c. "The recent ad you launched is very good. Keep it up."
d. Quizzes, engagement posts
• Some brands respond to such traffic; some do not.
• Noise and actionable are merely the two extremes of this spectrum.
• The definition of noise and actionable was not consistent across brands.
• The boundary (in the grey region) separating noise from actionable varies from brand to brand.
• A single common classifier for all is doomed to fail!
12. In a Nutshell
• Given the last few slides, the degradation in model performance shouldn't come as a surprise.
• One model to fit all is not going to work.
• Non-stationary distributions are not specific to Twitter data; they appear in other domains as well:
o Monitoring & anomaly detection (one-class classification) in adversarial settings
o Recommendations (where user preferences are continuously changing; evolving labels)
o Stock market predictions (concept drift; evolving distributions)
13. Towards a Solution: Exploration
• Build a per-brand model for brand-specific learning.
• Learn from mistakes: in our system, by looking at which messages are replied to and which are not, we know (with a small delay) whether the classification done by the system was right or wrong.
• The model was not utilizing these signals to improve.
• If feedback is utilized well, the model can:
• Adapt, over time, to the brand's definition of noise and actionable.
• Adapt to variations/changes in features.
14. Incorporating Feedback
• Option 1: Frequently retrain the model on the updated data and redeploy it.
o Training, testing, and fine-tuning ~45K models: compute-heavy; doesn't scale at all.
o Loses all old learnings.
• Option 2: Keep learning from feedback: the model adapts to the new incoming data.
15. What Worked for Us
● Two models: Global + Local.
● The global model is common to all brands:
○ Batch-trained on a large corpus.
○ No short-term updates.
● One local model per brand:
○ A fast learner with short-term updates.
16. Local
• Goals
o Improve with feedback.
• Desired properties
o Fast learner (light compute).
▪ Incorporates most feedback successfully.
(After a model update, if the same data point is presented again, the model must predict its class label correctly.)
o Avoids catastrophic forgetting.
(After a model update, if the last N data points are presented, the model should predict their class labels with high accuracy.)
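The "incorporates feedback successfully" property can be demonstrated with a Passive-Aggressive classifier (one of the online learners cited in the references). This is an illustrative check on synthetic 2-D data, not the production local model:

```python
# Sketch: after a Passive-Aggressive update on a misclassified point,
# presenting the same point again should yield the correct label.
# The seed data and feature vectors here are assumed for illustration.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

model = PassiveAggressiveClassifier()
X_seed = np.array([[1.0, 0.0], [0.0, 1.0]])
model.partial_fit(X_seed, [0, 1], classes=[0, 1])

x = np.array([[2.0, 0.5]])  # a point carrying feedback label y_true
y_true = 1
if model.predict(x)[0] != y_true:
    model.partial_fit(x, [y_true])  # update on the mistake

# Property check: the same point is now classified correctly.
print(model.predict(x)[0] == y_true)
```

The Passive-Aggressive update is "aggressive" on mistakes: unless the update is clipped by the aggressiveness parameter C, it moves the decision boundary just far enough that the offending point is classified with margin, which is exactly the property the slide asks for.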
18. Possible Approaches to Incorporating Feedback
• Mini-batches: works fine if the velocity of feedback data is high (you don't have to wait long to accumulate a mini-batch of feedback). Many applications don't have high velocity.
• Instant feedback, tiny batches: very few data points can skew the model.
19. Building the Feedback Loop
• We model a feedback point <Tweet, YT> as a data point presented to the local model in an online setting.
• Thus, a batch of feedback = an incoming data stream.
• We used online learning:
o Data is modeled as a stream.
o The model makes a prediction (YP) when presented with a data point (X).
o The environment reveals the correct class label (YT).
o If YP ≠ YT, update the model with <X, YT>.
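The online protocol above can be sketched as a simple loop. Here the stream is synthetic and the learner is scikit-learn's Passive-Aggressive classifier (the talk's references point to this family of algorithms, though the exact local model is not specified):

```python
# Minimal sketch of the online learning loop: predict, observe the true
# label, and update only on mistakes. Stream data is simulated.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)
# Simulated stream: 2-D feature vectors, linearly separable labels.
X_stream = rng.normal(size=(500, 2))
y_stream = (X_stream[:, 0] + X_stream[:, 1] > 0).astype(int)

model = PassiveAggressiveClassifier()
model.partial_fit(X_stream[:1], y_stream[:1], classes=[0, 1])

mistakes = 0
for x, y_true in zip(X_stream[1:], y_stream[1:]):
    y_pred = model.predict(x.reshape(1, -1))[0]      # model predicts YP
    if y_pred != y_true:                             # environment reveals YT
        mistakes += 1
        model.partial_fit(x.reshape(1, -1), [y_true])  # update with <X, YT>

print(f"mistakes on stream: {mistakes} / {len(X_stream) - 1}")
```

In the production setting, X would be the engineered tweet features and YT would arrive (with a small delay) from observing whether an agent replied to the message.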
21. Results of Local
• Dataset: 150K tweets, time-sequenced.
• Feedback incorporation improves accuracy:
o Trained the model (offline, batch mode) on the first 100K data points.
o On the test set (the last 50K data points), it gave 75% accuracy (offline batch mode).
o Then ran the model on the test data (50K data points) in an online fashion:
- The model made a total of 9,028 mistakes.
- These mistakes were instantaneously fed back into the local model as feedback.
- This gives an accuracy of ~82% across the test set.
o We gained ~7% accuracy by incorporating feedback.
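The evaluation above (batch-train on the earlier portion of a time-sequenced dataset, then replay the rest as a stream with feedback) can be sketched as follows. The data here is synthetic with an artificial concept drift, so the accuracy numbers will not match the talk's 75%/82%:

```python
# Sketch of the evaluation protocol: compare a static batch-trained model
# against the same model updated online on its mistakes, on a drifting stream.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 2))
# Concept drift: the labeling rule changes partway through the stream.
y = np.where(np.arange(1500) < 1000,
             X[:, 0] > 0,              # old concept
             X[:, 1] > 0).astype(int)  # new concept

train_X, train_y = X[:1000], y[:1000]
test_X, test_y = X[1000:], y[1000:]

# Static model: batch-trained once, never updated.
static = PassiveAggressiveClassifier().partial_fit(train_X, train_y, classes=[0, 1])
static_acc = (static.predict(test_X) == test_y).mean()

# Online model: same starting point, but updated on every mistake.
online = PassiveAggressiveClassifier().partial_fit(train_X, train_y, classes=[0, 1])
correct = 0
for x, y_true in zip(test_X, test_y):
    y_pred = online.predict(x.reshape(1, -1))[0]
    correct += int(y_pred == y_true)
    if y_pred != y_true:
        online.partial_fit(x.reshape(1, -1), [y_true])
online_acc = correct / len(test_y)

print(f"static accuracy: {static_acc:.2f}, online accuracy: {online_acc:.2f}")
```

Because the online model adapts to the new concept while the static model cannot, the online accuracy ends up higher; this mirrors the ~7% gain reported on the real dataset.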
22. [Plot: running accuracy improving with the number of test points]
We also tested the local model by feeding it wrong feedback.
23. Combining Global and Local
• Scores from both the global and local models are combined into a single score, and a threshold is applied to arrive at a prediction.
• We got an accuracy of ~82%.
[Diagram: Global and Local model scores feed into a combined score]
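One simple way to realize this combination is a weighted average of the two scores followed by a threshold. The talk does not specify the combination function, so the weight and threshold values below are illustrative assumptions:

```python
# Hedged sketch of combining global and local model scores.
# w_local and threshold are illustrative, not the talk's actual values.
def combine(global_score: float, local_score: float,
            w_local: float = 0.5, threshold: float = 0.5) -> int:
    """Blend the two scores and threshold into a 0/1 prediction."""
    score = (1 - w_local) * global_score + w_local * local_score
    return int(score >= threshold)

print(combine(0.9, 0.8))  # both models agree: actionable
print(combine(0.1, 0.2))  # both models agree: noise
```

Raising `w_local` lets the brand-specific model dominate as it accumulates feedback; the weight could itself be tuned per brand.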
25. Pros:
• Improved running accuracy.
• Personalization: the notion of spam varies from brand to brand. Some brands treat 'Hi' and 'Hello' as spam, while others treat them as actionable. By learning from feedback, the model adapts to the brand's notions.
• The local model is lightweight and fast, and thus easy to bootstrap, deploy, and scale.
Cons:
● The local model can overfit to feedback and thus become biased.
● Need to monitor for bias.
● Reset the local model when it becomes biased.
26. Future Work
• Instead of a single global model, have vertical-specific global models.
• Try other online algorithms.
• Handle drift.
• Don't incorporate every feedback point; update only on the most important ones.
27. References
1. "Online Passive-Aggressive Algorithms" - Crammer et al., JMLR 2006
2. "The Learning Behind Gmail Priority Inbox" - Aberdeen et al., LCCC: NIPS Workshop 2010
3. "Learning with Drift Detection" - Gama et al., SBIA 2004
4. "Adaptive Regularization of Weight Vectors" - Crammer et al., NIPS 2009
5. LIBOL - A Library for Online Learning Algorithms. https://github.com/LIBOL/LIBOL
28. Thank You
Please feel free to reach out post this talk or on the interwebs.
@anujgupta82
Anuj Gupta