Building better models faster using active learning
Nick Gaylord
Data Day Seattle
July 23, 2016
3
A couple questions to start
● How many people here have heard of CrowdFlower?
● How many people here have done work in ML?
6
At the beginning of an ML project...
Customer Support Tickets: Classify by Urgency
News Articles: Classify by Relevance
Social Media Posts: Classify by Topic / Sentiment
Images: Classify by Scene Type / Content
7
You start to build your model...
9
After a while...
11
Why do we get diminishing returns?
● Maybe we should try a fancier model!
12
Why do we get diminishing returns?
● Real data from a recent Kaggle competition:
14
Why do we get diminishing returns?
● More specifically:
  ● Additional examples add progressively less information
  ● Performance can actually decrease for underrepresented labels
● How can we work to improve the model without wasting our time?
15
Active Learning
Iteratively select new examples that help the model the most
● Efficient – less time spent labeling
● Balanced – tends to favor examples of less-represented labels
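The iterative selection described above can be sketched as a small pool-based loop. This is a minimal illustration, not the method used in the talk: the 1-D logistic model, the `oracle` function standing in for a human labeler, the toy pool, and the `rounds` and `batch` settings are all assumptions made for the example. Each round trains on the labeled items, then asks for labels on the items the model is least confident about.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(labeled):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent.
    `labeled` maps pool index -> (x, y)."""
    w, b = 0.0, 0.0
    n = len(labeled)
    for _ in range(200):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in labeled.values()) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in labeled.values()) / n
        w -= 0.5 * grad_w
        b -= 0.5 * grad_b
    return w, b

def confidence(model, x):
    """Probability the model assigns to its own predicted class."""
    w, b = model
    p = sigmoid(w * x + b)
    return max(p, 1.0 - p)

def active_learning(pool, oracle, seed, rounds=5, batch=2):
    """Each round: train, rank unlabeled items by confidence, label the lowest."""
    labeled = {i: (pool[i], oracle(pool[i])) for i in seed}
    for _ in range(rounds):
        model = fit(labeled)
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        unlabeled.sort(key=lambda i: confidence(model, pool[i]))
        for i in unlabeled[:batch]:
            labeled[i] = (pool[i], oracle(pool[i]))
    return fit(labeled), labeled

# Hypothetical toy pool: 61 points from -3.0 to 3.0; true label is 1 when x > 0.5.
pool = [i / 10 - 3 for i in range(61)]

def oracle(x):
    # Stand-in for a human labeler.
    return 1 if x > 0.5 else 0

model, labeled = active_learning(pool, oracle, seed=[0, 60])
print(len(labeled), "items labeled out of", len(pool))
```

Only the seed items plus `rounds * batch` queried items are ever labeled; the rest of the pool never needs a human label.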
23
Efficiency
[Figure: scatter plot of data points (x)]
Items closer to the decision boundary convey more information and have lower model confidence
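To make the link between distance and confidence concrete, here is a small sketch for a linear (logistic) classifier. The weights, bias, and points are hypothetical values chosen for illustration. For this kind of model, confidence is a monotone function of the distance to the boundary, so the items nearest the boundary are the ones the model is least sure about.

```python
import math

def distance_and_confidence(weights, bias, point):
    """Return (distance to the decision boundary, model confidence) for a
    logistic model whose boundary is the line weights . point + bias = 0."""
    score = sum(w * v for w, v in zip(weights, point)) + bias
    distance = abs(score) / math.sqrt(sum(w * w for w in weights))
    p = 1.0 / (1.0 + math.exp(-score))
    return distance, max(p, 1.0 - p)

# Hypothetical 2-D model and three points at increasing distance from the boundary.
weights, bias = (3.0, 4.0), -5.0
points = [(1.0, 0.5), (1.0, 1.0), (3.0, 4.0)]
results = [distance_and_confidence(weights, bias, p) for p in points]

for point, (distance, conf) in zip(points, results):
    print(f"point={point} distance={distance:.2f} confidence={conf:.3f}")
```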
24
How important is efficiency?
● Data isn't the bottleneck it used to be
● CrowdFlower makes getting human labels at scale more feasible
● Still, potential for significant cost and time savings via active learning
● But there's an even more important reason to do it
25
Balance
Labels in your data probably aren't equally frequent.
And you often care more about the rare ones.
26
Balance
● Imbalanced training data is a problem
  ● Slow to accumulate examples of rare labels via random selection
  ● Overrepresented labels can actually hurt accuracy on rarer ones
27
Balance
Confidences on these rarer labels will
generally be lower. Active learning will
select more examples of these, helping
to create a more balanced training set.
28
Recap
● Active learning helps solve two big challenges
  ● More useful training data for the model
  ● Works in favor of creating more balanced training data
● Iterative process focusing on high information items
29
Balance
Minimum confidence threshold correlates
with model accuracy (r=0.978).
30
Why doesn't everybody do this?
● Idea has been around for a long time
● Adequate data volume and infrastructure are more recent
● Lots of discussion on “best” way to sample
● Success requires attention to lots of variables
● Iterative development means ongoing commitment
Sure would be nice to automate this to work at scale!
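As one illustration of why the "best" way to sample is debated, here are three commonly used uncertainty measures applied to the same predictions. The item names and class probabilities are made up for the example, and the talk does not say which measure it uses. Note that the measures do not always pick the same item.

```python
import math

def least_confidence(probs):
    """1 minus the top class probability; larger means more uncertain."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes; smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy of the predicted distribution; larger means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical class probabilities for three unlabeled items.
predictions = {
    "item_a": [0.90, 0.05, 0.05],
    "item_b": [0.50, 0.49, 0.01],
    "item_c": [0.40, 0.30, 0.30],
}

pick_least_confidence = max(predictions, key=lambda k: least_confidence(predictions[k]))
pick_margin = min(predictions, key=lambda k: margin(predictions[k]))
pick_entropy = max(predictions, key=lambda k: entropy(predictions[k]))

print(pick_least_confidence, pick_margin, pick_entropy)
```

Here margin sampling prefers the item with two nearly tied classes, while least confidence and entropy prefer the item whose probability is spread across all three.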
31
CrowdFlower: A brief introduction
35
CrowdFlower AI
Contributor job(s) Classification model
42
CrowdFlower AI: Workflow
[Diagram labels: AI Model, CrowdFlower, High Confidence, Low Confidence, Accurate Classifications, Additional Training Data]
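One plausible reading of the workflow diagram, sketched in code: predictions the model is confident about are accepted as classifications, and low-confidence rows go to human contributors, whose answers become additional training data. This reading is an assumption. The function names, the 0.8 threshold, and the ticket data are hypothetical, and none of this is the CrowdFlower API.

```python
def route_by_confidence(predictions, threshold=0.8):
    """Split (row_id, label, confidence) predictions into rows accepted
    automatically and row ids that need a human label."""
    auto_classified, needs_human = [], []
    for row_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_classified.append((row_id, label))
        else:
            needs_human.append(row_id)
    return auto_classified, needs_human

def collect_training_data(needs_human, ask_contributors):
    """Get a human label for each low-confidence row; the result can be
    added to the model's training set for the next round."""
    return {row_id: ask_contributors(row_id) for row_id in needs_human}

# Hypothetical model output for three support tickets.
predictions = [
    ("ticket-1", "urgent", 0.95),
    ("ticket-2", "not urgent", 0.55),
    ("ticket-3", "not urgent", 0.80),
]

auto_classified, needs_human = route_by_confidence(predictions)
# The lambda stands in for a human labeling job.
new_training_data = collect_training_data(needs_human, lambda row_id: "urgent")

print(auto_classified)
print(new_training_data)
```

Raising the threshold sends more rows to people and fewer to automation, so it trades labeling cost against the accuracy of the automated classifications.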
43
CrowdFlower AI: Benefits
● Reduced human labeling costs
  ● At least some rows can be automated very quickly
  ● Savings increase over time as model improves
● Increased throughput
  ● Classify large data sets in minutes to hours, not days
● Adaptive model
  ● Capture changes in the data you collect, as they occur
● Expand your scope
  ● Free up resources to scale out project or pursue new efforts
44
Inspirational Quote
“I believe these challenges are leading to innovations which draw us
closer to methods for effective interactive learning systems.”
(Settles, B. 2011. Proceedings of JMLR 16)
45
Thank you!
nick.gaylord@crowdflower.com
@texastacos
