This document discusses using machine learning models for efficient and reliable query processing over unstructured data. Directly using ML models for queries is challenging because the models are unreliable and expensive to run. The author's work addresses these challenges with two key ideas: (1) proxy models that generate cheap approximations to reduce calls to an expensive oracle model, and (2) sampling techniques that provide statistical guarantees on query accuracy while minimizing cost. The techniques apply to several query types, including selection, aggregation, and limit queries. Evaluation shows the methods outperform baselines, achieving accuracy targets with fewer oracle model evaluations. The work also aims to improve ML models by allowing users to specify when errors may be occurring.
2. Unstructured data is ubiquitous and cheap
» High-quality sensors are incredibly cheap (<$0.70 for a webcam)
» London alone has 500K CCTVs
» Autonomous vehicles produce large volumes of data
3. Many research/commercial applications would benefit from being able to analyze this data
"Find 95% of hummingbirds in the video for ecological analysis"
"[…] to have the [$1.5 million] also here in Ghana no later than end of this week"
"Find bribes in emails with 95% precision so SEC lawyers can investigate"
"How many cars passed by on Monday?"
4. ML models can perform well on a range of benchmark tasks
» DNNs have made strides in language understanding
» CNNs perform well on image classification benchmarks
5. Query processing case study: finding hummingbirds for ecological analysis
Goal: match field readings with hummingbird visits
Only ~0.1% of the video contains hummingbirds!
Query: find 95% of the instances of hummingbirds
6. Can we deploy ML to help?
Ideal case:
1. Find off-the-shelf model
2. Execute over data
3. Find all the hummingbirds!
7. Challenge 1: ML models are unreliable
Small shifts in distribution can substantially harm even state-of-the-art DNNs!
8. Challenge 2: Deploying ML is expensive
» Labeling datasets can cost $100k+
» State-of-the-art DNNs run as slow as 3 fps
10. My work: how can we use unreliable and expensive ML models in query processing?
1. Query processing techniques to efficiently deploy fixed ML models with statistical guarantees [VLDB '18, VLDB '20, VLDB '20, VLDB '21]
2. Methods of improving ML models for better query accuracy/speed by allowing users to specify when errors may be occurring [MLSys '20, MLSys '21 (under review)]
12. Answering queries over unstructured data with ML: overview of the naïve method
[Diagram: query over a video relation, answered by an oracle (e.g., a DNN)]
Oracles are used in higher-level queries but can be incredibly expensive:
» Selection
» Aggregation
» Limit queries
» …
13. Two key ideas: sampling and proxy scores
» Sampling can reduce the number of records considered
» We can generate cheap approximations with proxy models
[Plot: number of cars over time; the proxy signal a(t) approximates the oracle signal m(t)]
Can we combine them?
14. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
15. Query type one: select instances of rare events
[Diagram: data records scored by a proxy (e.g., 0.94, 0.92, 0.30, …); an oracle (a human or a complex model) provides true labels]
» "Find 95% of the hummingbirds"
» "Find 95% of the mentions of bribes"
» …
16. Common technique: proxy models to reduce the cost of selection
[Diagram: the proxy model is evaluated on all data; the "oracle" is consulted only on a sample of the data]
Widely explored for video analytics [NoScope VLDB '17, Probabilistic Predicates SIGMOD '18, …]
18. Many queries require statistical guarantees on accuracy
» "Find 95% of the hummingbird frames with failure probability at most 5%"
» "Find e-mails referencing bribes with at least 95% precision with failure probability at most 5%"
Scientists require a high probability of success to publish
19. Prior work using proxies fails to achieve statistical guarantees on failure probability!
Prior work does not achieve the target accuracy over half the time
[Plot: precision over multiple runs]
20. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
21. Problem statement: recall target
SELECT * FROM bush_video
WHERE HUMMINGBIRD_PRESENT(frame)
ORACLE LIMIT 10,000
USING MASK_RCNN(frame)
RECALL TARGET 90%
WITH PROBABILITY 95%
Return a set of records that achieves 90% recall with probability at least 95%, using at most 10,000 oracle samples and with precision as high as possible
22. Example query: finding hummingbirds with high recall
SELECT * FROM hummingbird_video
WHERE HUMMINGBIRD_PRESENT(frame) = TRUE
ORACLE LIMIT 1,000
USING MASK_RCNN(frame) = 'hummingbird'
RECALL TARGET 90%
WITH PROBABILITY 95%
23. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
24. Algorithm overview
0. Order records by proxy score
1. Sample oracle labels
2. Choose a selection threshold based on the sample labels
3. Return all records above the threshold
[Diagram: the universe of records, the labeled sample, and the returned set of records]
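The four steps above can be sketched as follows. This is a minimal illustration with made-up function names, not the paper's algorithm: it thresholds on the sample's point estimate of recall with no confidence-interval correction, which is exactly the naive method the next slides improve on.

```python
import random

def select_with_recall_target(records, proxy_scores, oracle, n_samples, recall_target):
    """Illustrative sketch of proxy-ordered selection (no CI correction)."""
    # 0. Order records by proxy score (descending).
    order = sorted(range(len(records)), key=lambda i: -proxy_scores[i])
    # 1. Sample oracle labels (the expensive calls).
    sampled = random.sample(order, n_samples)
    labels = {i: oracle(records[i]) for i in sampled}
    # 2. Choose a selection threshold: walk down the proxy ordering until
    #    the sampled positives seen so far cover the recall target.
    total_pos = sum(labels.values())
    cutoff, seen_pos = len(order), 0
    for rank, i in enumerate(order):
        if i in labels:
            seen_pos += labels[i]
            if total_pos > 0 and seen_pos >= recall_target * total_pos:
                cutoff = rank + 1
                break
    # 3. Return all records above the threshold.
    return [records[i] for i in order[:cutoff]]
```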
25. Querying for 60% recall: naive method
1. Sample oracle labels uniformly at random
2. Choose a selection threshold based on the sample labels
The naïve method fails to achieve 60% recall!
26. Key idea: use confidence intervals as a "buffer" to ensure a high probability of success
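The "buffer" can be sketched with a standard lower confidence bound: instead of picking the threshold where the point estimate of recall crosses the target, pick it where a lower confidence bound on recall does. A minimal sketch, using a normal-approximation binomial bound (the paper's algorithms use tighter, finite-sample intervals):

```python
import math

def recall_lower_bound(positives_found, positives_total, z=1.96):
    """Normal-approximation lower confidence bound on recall, given
    positives_found of positives_total sampled positives fall above the
    candidate threshold. Illustrative only."""
    p = positives_found / positives_total
    half_width = z * math.sqrt(p * (1 - p) / positives_total)
    return max(0.0, p - half_width)
```

A threshold whose point-estimate recall is exactly 60% would be rejected, since its lower bound falls below 60%; the threshold keeps being lowered until the lower bound itself clears the target.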
27. Querying for 60% recall: uniform method
1. Sample oracle labels uniformly at random
2. Choose a selection threshold based on the sample labels
Uniform sampling results in low precision (20%)!
28. Querying for 60% recall: SUPG
1. Sample with importance weights
2. Choose a selection threshold based on the sample labels
Importance sampling can give higher precision (42%)!
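Importance sampling here means labeling high-proxy-score records more often and reweighting so estimates stay unbiased. A minimal sketch with weights proportional to the proxy score (illustrative only; SUPG derives the weighting that minimizes variance for the target at hand, and the function name is made up):

```python
import random

def estimate_positive_count(proxy_scores, oracle, n_samples):
    """Importance-sampled estimate of the number of positive records."""
    total = sum(proxy_scores)
    probs = [s / total for s in proxy_scores]
    # Sample record indices with replacement, biased towards high scores.
    idx = random.choices(range(len(proxy_scores)), weights=probs, k=n_samples)
    # Horvitz-Thompson-style reweighting: each label is scaled by
    # 1 / (n * p_i), so the expectation equals the true positive count.
    return sum(oracle(i) / (n_samples * probs[i]) for i in idx)
```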
29. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
30. Evaluation setting
Dataset      | Modality | Task                    | Proxy    | Oracle
ImageNet     | Images   | Find hummingbirds       | ResNet   | Human
night-street | Video    | Find cars               | ResNet   | Mask R-CNN
OntoNotes    | Text     | Find employee relations | LSTM     | Human
TACRED       | Text     | Find city relations     | SpanBERT | Human
Real-world datasets span text, images, and videos with a variety of proxy and oracle models
31. SUPG query costs are cheap relative to exhaustive labeling
SUPG queries are substantially cheaper than exhaustive labeling
32. Prior work fails to respect error bounds: recall target
» SUPG achieves the target recall with high probability
» Naïve methods without correction fail >50% of the time
33. SUPG outperforms uniform sampling: recall target
» Uniform sampling is sample-inefficient
» Importance sampling outperforms it in all settings
34. Many other experiments in the paper
» Precision and joint-target settings: experiments and algorithms
» Our algorithms are not sensitive to the choice of parameters
» Our algorithms are not sensitive to the choice of confidence interval method
» …
https://github.com/stanford-futuredata/supg
35. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
36. Query type two: aggregation
Query: "what is the average number of cars per frame?"
SELECT COUNT(*)
FROM taipei
WHERE class = 'car'
ERROR WITHIN 0.1
AT CONFIDENCE 95%
Queries for understanding bulk properties
37. Can we use proxy models for aggregation?
Binary detection: "Is there a car in this frame?" Yes
Prior work on binary detection*:
1. Does not help for busy videos
2. Does not help count
3. Does not provide statistical guarantees
* NoScope VLDB '17, Probabilistic Predicates SIGMOD '18, …
38. Optimizing approximate aggregation: sampling
Query: "What is the average number of cars per frame?"
Two ideas:
1. Sampling reduces the number of frames considered
2. Proxy models provide a noisy signal (e.g., proxy: 17 cars; ground truth: 20 cars)
39. Optimizing approximate aggregation: proxy models as control variates
Query: "What is the average number of cars per frame?"
[Plot: number of cars over time; the proxy a(t) tracks the oracle m(t)]
m*(t) = m(t) + c · (a(t) − A)
Var(m*) = (1 − ρ(a,m)²) · Var(m)
The variance of the estimator decreases with the correlation between the proxy and the target
40. Optimizing approximate aggregation: EBS stopping
Query: "What is the average number of cars per frame?"
Intuition: always-valid stopping based on the sample variance
Lower variance terminates earlier!
41. BlazeIt outperforms materializing rows and random sampling for aggregation
1. Naïve is extremely slow
2. Knowing whether a car is in the frame doesn't help for busy videos!
3. Sampling dramatically improves performance
4. BlazeIt can forego object detection
5. Even faster with caching
42. Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
43. Query type three: limit queries
Query: "find clips of at least one bus and at least five cars"
SELECT timestamp
FROM taipei
GROUP BY timestamp
HAVING SUM(class='bus')>=1
AND SUM(class='car')>=5
LIMIT 10
Queries for manual inspection of rare events
44. Optimizing limit queries: proxy models for ranking
Query: "Find frames of at least five cars, at least one bus"
Proxy model (one frame): P(>= 5 cars) = 0.98, P(>= 1 bus) = 0.83
Proxy model (another frame): P(>= 5 cars) = 0.13, P(>= 1 bus) = 0.89
45. Optimizing limit queries: proxy models for ranking
Key idea: bias the search towards high-confidence areas (evaluate the highest-confidence frame first)
46. BlazeIt can sample many fewer frames than naïve approaches for limit queries
Key takeaway: don't call the expensive model!
47. Many more results and code available!
» Selection [VLDB '17, '20], aggregation / LIMIT queries [VLDB '20]
» Generating proxy scores [TASTI, MLSys '21 (under review)]
» Fast inference for visual analytics [VLDB '21 (to appear)]
https://github.com/stanford-futuredata/tasti
49. ML models make errors that affect downstream analytics
» Boxes of cars should not highly overlap
» Cars should not flicker in and out of video
Key insight: ML models make systematic errors
50. Human labels from leading commercial vendors contain errors
Errors in "ground truth" labels can cause downstream safety risks!
51. "As the [automated driving system] changed the classification of the pedestrian several times — alternating between vehicle, bicycle, and an other — the system was unable to correctly predict the path of the detected object," the board's report states.
52. Evaluating model quality after retraining: qualitative improvement
[Images: original SSD vs. best retrained SSD detections]
54. Conclusion
» Analytics queries over unstructured data are important for commercial and research applications
» ML models are unreliable and expensive!
» We can use cheap proxies to answer downstream queries efficiently, with statistical guarantees
https://github.com/stanford-futuredata/tasti
ddkang@stanford.edu