AUC is, and has long been, an extremely powerful lens through which machine learning practitioners evaluate and compare model performance. But is “my curve is better than your curve” the right threshold for publishing a new paper or pushing a new model into production? In this talk, I will demonstrate the ways in which we at Remitly are thinking outside the box (and the area under the curve) to challenge whether AUC is the right metric for a range of applications.

Price and cost are fundamental components of economic modeling and quintessential aspects of an economist’s education and way of thinking, yet they are foreign concepts for many machine learning practitioners. Remitly’s Data Science team manages, and thinks deeply about, a number of classification tasks such as risk management and fraud detection. For many of these tasks, misclassification is extremely costly compared to the gains of a correct classification. We are willing to sacrifice AUC in order to incorporate the costs of classification and misclassification into our loss functions. By incorporating the notion of “indifference curves” (i.e., level sets), we show that by choosing models whose ROC curves cross our indifference-curve thresholds, we can aim for models that give us the best bang for our buck.
3. Introduction
• Model selection: data and algorithms aren’t the only knobs
• Problems with typical model selection strategies
• Review of model evaluation metrics
• Augmenting these metrics to address practical problems
• Why this matters to Remitly
Agenda
4. You may think that in order to solve all of your machine
learning problems, you only need to have…
5.
6.
7. ... but you need to think carefully about model selection.
8. Why is model selection important?
• Big data is not enough:
• Not everyone has it. Or maybe the big data you have isn’t
useful.
• Fancy algorithms are not enough:
• No Free Lunch Theorem (Wolpert, 1997). There isn’t a “one-size-fits-all” model class. Deep learning is not a silver bullet.
• Inadequate coverage in the literature:
• This is a practical problem, it’s hard, and it matters.
• Problems such as class imbalance and inclusion of economic
constraints.
Model Selection
9. ML + Economics
• Loss matrices inadequate:
• Penalty of misclassification may vary per instance.
• E.g., size of transaction. Not all misclassifications result in the same penalty, even when the instances come from the same class.
• Indifference curves good for post-training selection:
• We can compare tradeoffs of selecting different
classification thresholds.
• EXTREMELY IMPORTANT when costs of false positives
and false negatives are very, very different.
Economics: including costs/revenue into model selection
10. Classic machine learning
• Test positive and test negative (prediction outcomes)
• Condition positive and condition negative (actual values)
• True positive: condition positive and test positive
• True negative: condition negative and test negative
• False positive (Type I error): condition negative and test
positive
• False negative (Type II error): condition positive and test
negative
Confusion matrix
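As a minimal sketch of the definitions above (assuming binary labels encoded as 0/1; the function name is ours, not a standard API):

```python
import numpy as np

def confusion_quadrants(y_true, y_pred):
    """Count the four confusion-matrix quadrants for binary 0/1 labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # condition positive, test positive
    tn = np.sum(~y_true & ~y_pred)  # condition negative, test negative
    fp = np.sum(~y_true & y_pred)   # Type I error
    fn = np.sum(y_true & ~y_pred)   # Type II error
    return tp, tn, fp, fn

# Six labeled instances and one model's hard predictions.
print(confusion_quadrants([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))  # (2, 2, 1, 1)
```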
11. Radar in WWII
• Classic approach: measure the area under the receiver operating characteristic (ROC) curve
• Pros:
• Standard in the literature
• Descriptive of predictive power across thresholds
• Cons:
• Ignores class imbalances
• Ignores constraints such as costs of FP vs. FN
My curve is better than your curve
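A quick sketch of how one ROC curve and its AUC might be computed (hypothetical labels and scores; scikit-learn assumed available):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and model scores (propensity of the positive class).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# One (FPR, TPR) point per candidate threshold; AUC summarizes them all.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))  # 0.875
```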
12. Metrics affected by class imbalance
• X axis is recall == tpr == TP / (TP + FN)
• I.e., of the total positive instances, what proportion did
our model classify as positive?
• Y axis is precision == TP / (TP + FP).
• I.e., of the positive classifications, what proportion were
positive instances?
• Class imbalance affects this: all else equal, a smaller positive class shifts PR curves down.
• There exists a one-to-one mapping from ROC space to PR
space. But optimizing ROC AUC != optimizing PR AUC.
Precision and Recall curves
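A small sketch of the imbalance effect, reusing the hypothetical data from the earlier sketch: inflating the negative class leaves ROC AUC untouched but pulls the PR curve down.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Replicate the negatives 10x (same score distribution, relatively smaller
# positive class): ROC AUC is identical, average precision drops.
y_imb = np.concatenate([y_true] + [y_true[y_true == 0]] * 9)
s_imb = np.concatenate([scores] + [scores[y_true == 0]] * 9)
print(roc_auc_score(y_true, scores), roc_auc_score(y_imb, s_imb))  # equal
print(average_precision_score(y_true, scores),
      average_precision_score(y_imb, s_imb))                       # lower
```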
13. Inclusion of costs in ROC Space
• Indifference Curve:
• Level set that defines, e.g., where your classifier implies
business profitability vs. loss.
• Defined via constrained optimization (e.g., the costs of the quadrants in your confusion matrix).
• Points above this curve satisfy the constraint and are
good. Points below == bad.
• Why we care:
• Orange model doesn’t have a threshold that crosses
your indifference curve, even if its AUC is larger. No
threshold for orange model can satisfy your constraint.
Cost curves in ROC Space
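Here is a minimal sketch of such a (linear) indifference curve, assuming made-up per-quadrant dollar values and fraud prevalence; none of these numbers are Remitly's:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical dollar value of each confusion-matrix quadrant, per instance
# (positive = gain, negative = loss), plus positive-class prevalence.
V_TP, V_TN, V_FP, V_FN = 0.0, 1.0, -5.0, -50.0
PI = 0.02

def expected_profit(fpr, tpr):
    """Expected profit per scored instance at one (FPR, TPR) point."""
    pos = PI * (tpr * V_TP + (1.0 - tpr) * V_FN)
    neg = (1.0 - PI) * ((1.0 - fpr) * V_TN + fpr * V_FP)
    return pos + neg

def crosses_indifference(y_true, scores):
    """True if any threshold puts the model above break-even profit.

    The break-even set expected_profit == 0 is the indifference curve;
    a model with no ROC point above it cannot satisfy the constraint,
    however large its AUC.
    """
    fpr, tpr, _ = roc_curve(y_true, scores)
    return bool(np.any(expected_profit(fpr, tpr) > 0.0))
```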
14. How do I pick the right threshold?
• Threshold choices:
• Find point with maximum distance from indifference
curve.
• Of your threshold choices, this point maximizes your
utility.
• Technically, you’re at a higher indifference curve.
• Other things to consider:
• Changes in your constraints: costs change, and therefore your indifference curve can change.
• Update models and thresholds subject to such changes.
Picking the right classifier threshold
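A sketch of that selection rule; the profit function is passed in (e.g., the expected_profit sketch above), and the data here are made up:

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, scores, profit_fn):
    """Pick the threshold whose ROC point maximizes expected profit.

    Equivalently, pick the ROC point on the highest indifference curve
    (level set of profit_fn) the model can reach.
    """
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    profits = profit_fn(fpr, tpr)
    i = int(np.argmax(profits))
    return thresholds[i], profits[i]

# Hypothetical usage with a toy linear utility: TPR - 5 * FPR.
y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.2, 0.4, 0.5, 0.9, 0.6, 0.7])
print(best_threshold(y, s, lambda fpr, tpr: tpr - 5.0 * fpr))
```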
15. Citing our sources
Bibliography
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (pp. 233–240). ACM.
Raghavan, V., Bollmann, P., & Jung, G. S. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7, 205–229.
Provost, F., Fawcett, T., & Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. Proceedings of the 15th International Conference on Machine Learning (pp. 445–453). Morgan Kaufmann.
Drummond, C., & Holte, R. (2000). Explicitly representing expected cost: An alternative to ROC representation. Proceedings of Knowledge Discovery and Data Mining (pp. 198–207).
Drummond, C., & Holte, R. C. (2004). What ROC curves can’t do (and cost curves can). ROCAI (pp. 19–26).
Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145–1159.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4), 283–298. WB Saunders.
Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), e0118432.
Steingold, S. (2016). Information theoretic metrics for multi-class predictor evaluation. Accessed 23 June 2016. http://www.slideshare.net/SessionsEvents/sam-steingold-lead-data-scientist-magnetic-media-online-at-mlconf-sea-5201
Datacratic (2016). Machine learning meets economics. Accessed 23 June 2016. http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/
16. What we talked about
• Model selection: data and algorithms aren’t the only knobs
• Problems with typical model selection strategies
• Review of model evaluation metrics
• Augmenting these metrics to address practical problems
• Why this matters to Remitly
Summary
17. Remitly’s Data Science team uses ML for a variety of purposes.
ML applications are core to our business – therefore our business must be core to our ML applications.
Machine learning at Remitly
Hi everyone.
My name is Alex Korbonits, and I am a data scientist at Remitly.
This talk is broadly about evaluating and comparing machine learning models.
Before we dive in, here’s a little bit about Remitly and me.
Remitly was founded in 2011 to forever change the way people send money to their loved ones.
Worldwide, remittances represent over 600 billion dollars annually, roughly 4x the amount of foreign aid.
We’re now the largest independent digital remittance company in the U.S.
We’re sending nearly 2 billion dollars annually and growing quickly.
Our CEO, Matt Oppenheimer, was just named one of Ernst and Young’s 2016 Entrepreneurs of the Year.
I'm Remitly's first data scientist, and our team is growing.
Right now my principal focus is FRAUD CLASSIFICATION.
Previously, I was a data scientist at a startup called Nuiku, focusing on NLP.
Model selection is crucial for delivering successful data science projects in industry.
The inclusion of economic constraints and class imbalance issues into this process is often overlooked, for example, if you’re simply maximizing area under the ROC curve.
Industrial settings require thinking beyond status quo model evaluation metrics: today we’ll consider tying model selection to business costs and impact.
That makes sense, and dollars and cents.
For us, w.r.t. fraud classification, there is a real penalty of being incorrect. We need to address the economic impact of model selection head-on.
So, you may think that in order to solve all of your machine learning problems, you only need to have…
BIG DATA
Or maybe you think all of your problems will be solved with…
DEEP LEARNING AND NEURAL NETWORKS
Even the TV show Silicon Valley mentioned neural networks and machine learning in several episodes this season.
Please stop fanning the flames of AI hype before another AI winter sets in. THANKS.
It is not the case that BIG DATA or FANCY ALGORITHMS can solve all of your machine learning problems!
How do you evaluate YOUR model or compare models?
Today we’re just going to focus on model evaluation in a supervised classification setting.
The No Free Lunch Theorem tells us there isn’t a one-size-fits-all model class.
So how do we do model selection? What do we need to incorporate that other approaches don't?
It’s not just about cross-validation, hyperparameter tuning, etc. We want to tie models into our business objectives.
You may have a problem where the penalty of misclassification varies PER INSTANCE.
Loss matrices, which weight training misclassifications by class, won’t work for us here, since the penalty of misclassifying one transaction worth $1,000 IS VERY DIFFERENT from the penalty of misclassifying one transaction worth $100.
In this talk we do not explore weighting individual points differently during training (which only works for some models), nor do we explore resampling methods.
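Here’s a rough sketch of what post-training, per-instance cost evaluation can look like. The loss assumptions are illustrative, not our production numbers:

```python
import numpy as np

def dollar_loss(y_true, scores, amounts, threshold, fp_review_cost=5.0):
    """Dollar loss at one threshold, with per-instance penalties.

    Illustrative assumption: a missed fraud (false negative) costs the
    full transaction amount; a false positive costs a flat review fee.
    """
    y_true = np.asarray(y_true, dtype=bool)
    amounts = np.asarray(amounts, dtype=float)
    flagged = np.asarray(scores, dtype=float) >= threshold
    fn_cost = amounts[y_true & ~flagged].sum()            # fraud let through
    fp_cost = fp_review_cost * np.sum(~y_true & flagged)  # needless reviews
    return fn_cost + fp_cost

# Two false negatives from the same class now carry different penalties:
# missing a $1,000 transaction costs 10x missing a $100 one.
```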
In economics, we have budget constraints and utility functions, i.e., constraint optimization and Lagrangians. Oh so many Lagrangians. Rational individuals maximize their utility subject to their budget constraints. Level sets of their utility functions represent curves of equivalently achievable utility.
At Remitly, we have transactions, revenue associated with completing them, costs of reviewing them, and costs of losing money due to fraud and chargebacks. Like a rational individual, as a price-taking firm in a competitive industry, we want to maximize our own utility subject to our constraints.
Models that have to predict a propensity score, such as logistic regression, have tradeoffs.
It’s not really one classifier per se.
It’s a continuum of classifiers as you vary your classification threshold from 0 to 1.
Each threshold represents one confusion matrix.
Selecting the right model will give us our optimal confusion matrix.
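Here’s a tiny sketch of that continuum, with made-up scores: each threshold yields its own classifier and its own confusion matrix.

```python
import numpy as np

# Hypothetical propensity scores from one trained model, plus true labels.
scores = np.array([0.05, 0.20, 0.40, 0.65, 0.90])
y_true = np.array([0, 0, 1, 0, 1])

for t in (0.1, 0.5, 0.8):        # each threshold is a distinct classifier...
    y_pred = scores >= t         # ...with its own confusion matrix
    tp = np.sum((y_true == 1) & y_pred)
    fp = np.sum((y_true == 0) & y_pred)
    fn = np.sum((y_true == 1) & ~y_pred)
    tn = np.sum((y_true == 0) & ~y_pred)
    print(f"t={t}: TP={tp} FP={fp} FN={fn} TN={tn}")
```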
Fraud is extremely expensive when it occurs, and it’s painful for customers to be put into review too easily, so you care a lot, and differently, about Type I and Type II errors.
The receiver operating characteristic, or ROC curve, was first developed during World War II for detecting enemy objects in battlefields.
This curve is useful because it offers a description of the predictive power of a model or set of classifiers across different thresholds, so it gives an indication of the kinds of tradeoffs you can expect to make by choosing a particular threshold.
The precision and recall curve is another popular and important metric.
Precision and recall are affected by class imbalance, unlike ROC! Story for another time.
PR is just as useful for comparing models as ROC but there are some important differences, which we won't go into here...
Now let’s talk about costs. Here’s an example in ROC space.
What’s an indifference curve? It’s called “indifference” because all points along this line are equivalent – it’s a level set of tradeoffs in ROC space.
Here we have two curves for two models, one whose area under the curve is greater than the other.
The green classifier with WORSE area under the curve satisfies our constraint.
This model, at thresholds to the right of where it crosses our indifference curve, is economically viable for us. We can make a profit. Success!
Each quadrant of your confusion matrix may have different costs. Your costs will define the slope and intercept of this curve in ROC and PR space.
Note that this need not be linear.
You need to have a model with a set of thresholds that crosses this curve for your model to make business sense to put into production.
Incorporating business sense into our model selection helps us choose between these two models. In isolation, the model with higher AUC seems more attractive, but when considering additional constraints, we see that in fact the model with lower AUC is more attractive.
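For the linear case, a short derivation shows exactly how the quadrant values set the slope and intercept. Using the hypothetical per-quadrant values v_TP, v_TN, v_FP, v_FN and positive-class prevalence π from the earlier sketch:

```latex
\begin{align*}
\mathbb{E}[\text{profit}]
  &= \pi\bigl[\mathrm{TPR}\,v_{TP} + (1-\mathrm{TPR})\,v_{FN}\bigr]
   + (1-\pi)\bigl[(1-\mathrm{FPR})\,v_{TN} + \mathrm{FPR}\,v_{FP}\bigr].\\[4pt]
\text{Setting } \mathbb{E}[\text{profit}] = 0 &\text{ and solving for TPR gives the indifference line:}\\[4pt]
\mathrm{TPR}
  &= \underbrace{\frac{(1-\pi)(v_{TN}-v_{FP})}{\pi\,(v_{TP}-v_{FN})}}_{\text{slope}}\;\mathrm{FPR}
   \;-\; \underbrace{\frac{\pi\,v_{FN} + (1-\pi)\,v_{TN}}{\pi\,(v_{TP}-v_{FN})}}_{\text{intercept}}.
\end{align*}
```

With a rare positive class (small π), the slope is steep: break-even demands a high TPR at a very low FPR.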
Back to PR space.
Now that we have a model that is economically viable, how do we choose a threshold?
One way to pick a threshold: among the points your classifier can achieve, find the one farthest from the indifference curve in the perpendicular direction.
This is actually at a higher indifference curve, specifically, your maximally achievable level set.
Here we have two models that satisfy our constraints. It looks like, given our indifference curve in this example, your curve actually wins out in the bottom right-hand corner. Even though the area under the PR curve is significantly greater for my curve, your curve has a set of points farther from our indifference curve and thus picking a threshold on your curve to use for classification will be better for our task.
I.e., the economic constraints weight the cost of false negatives so heavily that extremely high recall is required for viability.
Remember that your constraints may be non-linear and may change with time. Be sure to re-evaluate your choices in thresholds when business logic changes. You may miss out on some big wins.
In summary, model selection is IMPORTANT.
Maximizing area under the ROC curve in an industrial setting may be inadequate.
For us, there is a real penalty of being incorrect. We HAVE to incorporate costs into model selection.
We are just getting started. We are not done with this analysis.
We're doing post-training analysis with costs and different model metrics such as ROC and PR. We’re looking into incorporating business objectives into the loss functions of our learners during training.
What does machine learning at Remitly look like?
Understanding:
Fraud classification
Anomaly detection
Customer behavior
Market forces
We're hiring!
Email me at alex@remitly.com.
That’s all, folks!
THANKS