This document discusses using machine learning models for efficient and reliable query processing over unstructured data. Directly using ML models for queries is challenging because the models are unreliable and expensive to run. The author's work addresses these challenges with two key ideas: (1) proxy models that generate cheap approximations to reduce calls to an expensive oracle model, and (2) sampling techniques that provide statistical guarantees on query accuracy while minimizing cost. The techniques apply to several query types, including selection, aggregation, and limit queries. Evaluation shows the methods outperform baselines, achieving accuracy targets with fewer oracle model evaluations. The work also aims to improve ML models by allowing users to specify when errors may be occurring.
2. Unstructured data is ubiquitous and cheap
» High-quality sensors are incredibly cheap (<$0.70 for a webcam)
» London alone has 500K CCTVs
» Autonomous vehicles produce large volumes of data
3. Many research/commercial applications would benefit from being able to analyze this data
"Find 95% of hummingbirds in the video for ecological analysis"
"[…] to have the [$1.5 million] also here in Ghana no later than end of this week"
"Find bribes in emails with 95% precision so SEC lawyers can investigate"
"How many cars passed by on Monday?"
4. ML models can perform well on a range of benchmark tasks
» DNNs have made strides in language understanding
» CNNs perform well on image classification benchmarks
5. Query processing case study: finding hummingbirds for ecological analysis
Goal: match field readings with hummingbird visits
Only ~0.1% of the video contains hummingbirds!
Query: find 95% of the instances of hummingbirds
6. Can we deploy ML to help?
Ideal case:
1. Find off-the-shelf model
2. Execute over data
3. Find all the hummingbirds!
7. Challenge 1: ML models are unreliable
Small shifts in distribution can substantially harm even state-of-the-art DNNs!
8. Challenge 2: Deploying ML is expensive
» Labeling datasets can cost $100k+
» State-of-the-art DNNs run as slow as 3 fps
10. My work: how can we use unreliable and expensive ML models in query processing?
1. Query processing techniques to efficiently deploy fixed ML models with statistical guarantees [VLDB '18, VLDB '20, VLDB '20, VLDB '21]
2. Methods of improving ML models for better query accuracy/speed by allowing users to specify when errors may be occurring [MLSys '20, MLSys '21 (under review)]
12. Answering queries over unstructured data with ML: overview of the naïve method
[Diagram: query over a video relation, answered by an oracle (e.g., a DNN)]
Oracles are used in higher-level queries but can be incredibly expensive:
» Selection
» Aggregation
» Limit queries
» …
13. Two key ideas: sampling and proxy scores
» Sampling can reduce the number of records considered
» We can generate cheap approximations with proxy models
[Plot: number of cars over time; the proxy signal a(t) approximates the oracle signal m(t)]
Can we combine them?
14. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
15. Query type one: select instances of rare events
[Diagram: data records scored by a proxy (e.g., 0.94, 0.92, 0.30, …); an oracle (a human or a complex model) provides true labels]
» "Find 95% of the hummingbirds"
» "Find 95% of the mentions of bribes"
» …
16. Common technique: proxy models to reduce the cost of selection
[Diagram: the proxy model is evaluated on all data; the "oracle" is consulted only on a sample of the data]
Widely explored for video analytics [NoScope VLDB '17, Probabilistic Predicates SIGMOD '18, …]
18. Many queries require statistical guarantees on accuracy
» "Find 95% of the hummingbird frames with failure probability at most 5%"
» "Find e-mails referencing bribes with at least 95% precision with failure probability at most 5%"
Scientists require a high probability of success to publish
19. Prior work using proxies fails to achieve statistical guarantees on failure probability!
Prior work does not achieve the target accuracy over half the time
[Plot: precision over multiple runs]
20. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
21. Problem statement: recall target
SELECT * FROM bush_video
WHERE HUMMINGBIRD_PRESENT(frame)
ORACLE LIMIT 10,000
USING MASK_RCNN(frame)
RECALL TARGET 90%
WITH PROBABILITY 95%
Return a set of records that achieves 90% recall with probability at least 95%, using at most 10,000 oracle samples and with precision as high as possible
22. Example query: finding hummingbirds with high recall
SELECT * FROM hummingbird_video
WHERE HUMMINGBIRD_PRESENT(frame) = TRUE
ORACLE LIMIT 1,000
USING MASK_RCNN(frame) = 'hummingbird'
RECALL TARGET 90%
WITH PROBABILITY 95%
23. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
24. Algorithm overview
0. Order records by proxy score
1. Sample oracle labels
2. Choose a selection threshold based on the sample labels
3. Return all records above the threshold
[Diagram: the universe of records, the labeled sample, and the returned set of records]
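The four steps above can be sketched as follows. This is a minimal illustration with made-up function names, not the paper's algorithm: it thresholds on the sample's point estimate of recall with no confidence-interval correction, which is exactly the naive method the next slides improve on.

```python
import random

def select_with_recall_target(records, proxy_scores, oracle, n_samples, recall_target):
    """Illustrative sketch of proxy-ordered selection (no CI correction)."""
    # 0. Order records by proxy score (descending).
    order = sorted(range(len(records)), key=lambda i: -proxy_scores[i])
    # 1. Sample oracle labels (the expensive calls).
    sampled = random.sample(order, n_samples)
    labels = {i: oracle(records[i]) for i in sampled}
    # 2. Choose a selection threshold: walk down the proxy ordering until
    #    the sampled positives seen so far cover the recall target.
    total_pos = sum(labels.values())
    cutoff, seen_pos = len(order), 0
    for rank, i in enumerate(order):
        if i in labels:
            seen_pos += labels[i]
            if total_pos > 0 and seen_pos >= recall_target * total_pos:
                cutoff = rank + 1
                break
    # 3. Return all records above the threshold.
    return [records[i] for i in order[:cutoff]]
```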
25. Querying for 60% recall: naive method
1. Sample oracle labels uniformly at random
2. Choose a selection threshold based on the sample labels
The naïve method fails to achieve 60% recall!
26. Key idea: use confidence intervals as a "buffer" to ensure a high probability of success
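The "buffer" can be sketched with a standard lower confidence bound: instead of picking the threshold where the point estimate of recall crosses the target, pick it where a lower confidence bound on recall does. A minimal sketch, using a normal-approximation binomial bound (the paper's algorithms use tighter, finite-sample intervals):

```python
import math

def recall_lower_bound(positives_found, positives_total, z=1.96):
    """Normal-approximation lower confidence bound on recall, given
    positives_found of positives_total sampled positives fall above the
    candidate threshold. Illustrative only."""
    p = positives_found / positives_total
    half_width = z * math.sqrt(p * (1 - p) / positives_total)
    return max(0.0, p - half_width)
```

A threshold whose point-estimate recall is exactly 60% would be rejected, since its lower bound falls below 60%; the threshold keeps being lowered until the lower bound itself clears the target.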
27. Querying for 60% recall: uniform method
1. Sample oracle labels uniformly at random
2. Choose a selection threshold based on the sample labels
Uniform sampling results in low precision (20%)!
28. Querying for 60% recall: SUPG
1. Sample with importance weights
2. Choose a selection threshold based on the sample labels
Importance sampling can give higher precision (42%)!
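Importance sampling here means labeling high-proxy-score records more often and reweighting so estimates stay unbiased. A minimal sketch with weights proportional to the proxy score (illustrative only; SUPG derives the weighting that minimizes variance for the target at hand, and the function name is made up):

```python
import random

def estimate_positive_count(proxy_scores, oracle, n_samples):
    """Importance-sampled estimate of the number of positive records."""
    total = sum(proxy_scores)
    probs = [s / total for s in proxy_scores]
    # Sample record indices with replacement, biased towards high scores.
    idx = random.choices(range(len(proxy_scores)), weights=probs, k=n_samples)
    # Horvitz-Thompson-style reweighting: each label is scaled by
    # 1 / (n * p_i), so the expectation equals the true positive count.
    return sum(oracle(i) / (n_samples * probs[i]) for i in idx)
```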
29. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
30. Evaluation setting
Dataset      | Modality | Task                    | Proxy    | Oracle
ImageNet     | Images   | Find hummingbirds       | ResNet   | Human
night-street | Video    | Find cars               | ResNet   | Mask R-CNN
OntoNotes    | Text     | Find employee relations | LSTM     | Human
TACRED       | Text     | Find city relations     | SpanBERT | Human
Real-world datasets span text, images, and videos with a variety of proxy and oracle models
31. SUPG query costs are cheap relative to exhaustive labeling
SUPG queries are substantially cheaper than exhaustive labeling
32. Prior work fails to respect error bounds: recall target
» SUPG achieves the target recall with high probability
» Naïve methods without correction fail >50% of the time
33. SUPG outperforms uniform sampling: recall target
» Uniform sampling is sample-inefficient
» Importance sampling outperforms it in all settings
34. Many other experiments in the paper
» Precision and joint-target settings: experiments and algorithms
» Our algorithms are not sensitive to the choice of parameters
» Our algorithms are not sensitive to the choice of confidence interval method
» …
https://github.com/stanford-futuredata/supg
35. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
36. Query type two: aggregation
Query: "what is the average number of cars per frame?"
SELECT COUNT(*)
FROM taipei
WHERE class = 'car'
ERROR WITHIN 0.1
AT CONFIDENCE 95%
Queries for understanding bulk properties
37. Can we use proxy models for aggregation?
Binary detection: "Is there a car in this frame?" Yes
Prior work on binary detection*:
1. Does not help for busy videos
2. Does not help count
3. Does not provide statistical guarantees
* NoScope VLDB '17, Probabilistic Predicates SIGMOD '18, …
38. Optimizing approximate aggregation: sampling
Query: "What is the average number of cars per frame?"
Two ideas:
1. Sampling reduces the number of frames considered
2. Proxy models provide a noisy signal (e.g., proxy: 17 cars; ground truth: 20 cars)
39. Optimizing approximate aggregation: proxy models as control variates
Query: "What is the average number of cars per frame?"
[Plot: number of cars over time; the proxy a(t) tracks the oracle m(t)]
m*(t) = m(t) + c · (a(t) − A)
Var(m*) = (1 − ρ(a,m)²) · Var(m)
The variance of the estimator decreases with the correlation between the proxy and the target
40. Optimizing approximate aggregation: EBS stopping
Query: "What is the average number of cars per frame?"
Intuition: always-valid stopping based on the sample variance
Lower variance terminates earlier!
41. BlazeIt outperforms materializing rows and random sampling for aggregation
1. Naïve is extremely slow
2. Knowing whether a car is in the frame doesn't help for busy videos!
3. Sampling dramatically improves performance
4. BlazeIt can forego object detection
5. Even faster with caching
42. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
43. Query type three: limit queries
Query: "find clips of at least one bus and at least five cars"
SELECT timestamp
FROM taipei
GROUP BY timestamp
HAVING SUM(class='bus')>=1
AND SUM(class='car')>=5
LIMIT 10
Queries for manual inspection of rare events
44. Optimizing limit queries: proxy models for ranking
Query: "Find frames of at least five cars, at least one bus"
Proxy model (one frame): P(>= 5 cars) = 0.98, P(>= 1 bus) = 0.83
Proxy model (another frame): P(>= 5 cars) = 0.13, P(>= 1 bus) = 0.89
45. Optimizing limit queries: proxy models for ranking
Key idea: bias the search towards high-confidence areas (evaluate the highest-confidence frame first)
46. BlazeIt can sample many fewer frames than naïve approaches for limit queries
Key takeaway: don't call the expensive model!
47. Many more results and code available!
» Selection [VLDB '17, '20], aggregation / LIMIT queries [VLDB '20]
» Generating proxy scores [TASTI, MLSys '21 (under review)]
» Fast inference for visual analytics [VLDB '21 (to appear)]
https://github.com/stanford-futuredata/tasti
49. ML models make errors that affect downstream analytics
» Boxes of cars should not highly overlap
» Cars should not flicker in and out of video
Key insight: ML models make systematic errors
50. Human labels from leading commercial vendors contain errors
Errors in "ground truth" labels can cause downstream safety risks!
51. "As the [automated driving system] changed the classification of the pedestrian several times — alternating between vehicle, bicycle, and an other — the system was unable to correctly predict the path of the detected object," the board's report states.
52. Evaluating model quality after retraining: qualitative improvement
[Images: original SSD vs. best retrained SSD detections]
54. Conclusion
» Analytics queries over unstructured data are important for commercial and research applications
» ML models are unreliable and expensive!
» We can use cheap proxies to answer downstream queries efficiently, with statistical guarantees
https://github.com/stanford-futuredata/tasti
ddkang@stanford.edu