Efficient and Reliable Query
Processing using Machine Learning
Daniel Kang
1
Unstructured data is ubiquitous and cheap
2
» High quality sensors are
incredibly cheap (<$0.70
for a webcam)
» London alone has 500K
CCTVs
» Autonomous vehicles
produce large volumes
of data
Many research/commercial applications would
benefit from being able to analyze this data
3
“Find 95% of
hummingbirds in the video
for ecological analysis”
“[…] to have the [$1.5
million] also here in
Ghana no later than
end of this week”
“Find bribes in emails with
95% precision so SEC
lawyers can investigate”
“How many cars passed
by on Monday?”
4
DNNs have made strides in
language understanding
CNNs perform well on image
classification benchmarks
ML models can perform well on a range of
benchmark tasks
Query processing case study: finding
hummingbirds for ecological analysis
5
Goal: match field readings
with hummingbird visits
Only ~0.1% of the video
contains hummingbirds!
Query: find 95% of the
instances of hummingbirds
Can we deploy ML to help?
6
Ideal case:
1. Find off-the-shelf model
2. Execute over data
3. Find all the hummingbirds!
Challenge 1: ML models are unreliable
7
Small shifts in distribution can substantially harm
even state-of-the-art DNNs!
Challenge 2: Deploying ML is expensive
8
Labeling datasets can cost $100k+
State-of-the-art DNNs run as slow as 3 fps
Fundamental trade-off between accuracy
and speed/cost
9
Log scale!
Can we
get here?
My work: how can we use unreliable and
expensive ML models in query processing?
10
1. Query processing techniques to efficiently deploy fixed
ML models with statistical guarantees [VLDB ’18, VLDB
’20, VLDB ’20, VLDB ’21]
2. Methods of improving ML models for better query
accuracy/speed by allowing users to specify when errors
may be occurring [MLSys ’20, MLSys ’21 (under review)]
Outline
» Motivation
» Efficient query processing with ML models
» Improving ML models by finding errors
» Ongoing work
11
Answering queries over unstructured data
with ML: overview of the naïve method
12
Query
Oracle
(e.g., DNN)
Video Relation
Oracles are used in higher level queries
but can be incredibly expensive
» Selection
» Aggregation
» Limit queries
» …
Two key ideas: sampling and proxy scores
13
Can we combine them?
Sampling can reduce the number
of records considered
[Figure: plot of a(t) and m(t); x-axis: Time (t), y-axis: Number of cars]
We can generate cheap
approximations
Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
14
Query type one: select instances of rare
events
15
[Figure: table of data records with columns Data Records, Oracle (naive), and Proxy Scores (e.g., 0.30, 0.94, 0.92, 0.12, …, 0.75, 0.71). The oracle is a human or complex model.]
» “Find 95% of the
hummingbirds”
» “Find 95% of the
mentions of bribes”
» …
Common technique: proxy models to
reduce the cost of selection
16
[Figure: table of data records with columns Data Records, Oracle (naive), and Proxy Scores (e.g., 0.30, 0.94, 0.92, 0.12, …, 0.75, 0.71); some oracle labels are unknown (“?”)]
Proxy model
evaluated on all data
“Oracle” consulted
on a sample of data
Widely explored for video analytics
[NoScope VLDB ’17, Probabilistic Predicates SIGMOD ’18, …]
Example: ecological analysis
17
Find 95% of the hummingbirds
using Mask R-CNN as a proxy
with human labels as ground truth
Many queries require statistical guarantees
on accuracy
18
» “Find 95% of the hummingbird
frames with failure probability at
most 5%”
» “Find e-mails referencing bribes
with at least 95% precision with
failure probability at most 5%”
Scientists require high probability of success to publish
Prior work using proxies fails to achieve
statistical guarantees on failure probability!
19
Prior work does not achieve the target
accuracy over half the time
Precision over
multiple runs
Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
20
Problem statement: recall target
21
SELECT * FROM bush_video
WHERE HUMMINGBIRD_PRESENT(frame)
ORACLE LIMIT 10,000
USING MASK_RCNN(frame)
RECALL TARGET 90%
WITH PROBABILITY 95%
Return a set of records that achieves at least 90% recall
with probability at least 95%, using at most 10,000 oracle
samples, with precision as high as possible
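The two metrics in the problem statement can be written down directly. This is a minimal sketch for illustration: `precision_recall` is a hypothetical helper, not part of any released code, and the convention of returning 1.0 for an empty denominator is an assumption.

```python
def precision_recall(returned, positives):
    """Precision and recall of a returned record set against the true positive set."""
    returned, positives = set(returned), set(positives)
    true_pos = len(returned & positives)
    # Assumption: an empty denominator counts as a perfect score.
    precision = true_pos / len(returned) if returned else 1.0
    recall = true_pos / len(positives) if positives else 1.0
    return precision, recall
```

A recall-target query asks for recall of at least the target (here 90%) with high probability, and then tries to maximize precision.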
Example query: finding hummingbirds with
high recall
22
SELECT * FROM hummingbird_video
WHERE HUMMINGBIRD_PRESENT(frame) = TRUE
ORACLE LIMIT 1,000
USING MASK_RCNN(frame) = 'hummingbird'
RECALL TARGET 90%
WITH PROBABILITY 95%
Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
23
Algorithm overview
24
0. Order records by proxy score
Universe
2. Choose a selection threshold based on sample labels
3. Return all records above threshold
Returned set of records
Sample
1. Sample oracle labels
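The four steps above can be sketched as follows. This is the uncorrected (naive) variant with uniform sampling, under assumptions: `select_by_proxy_threshold` and the `oracle(i)` callback are hypothetical names, and returning everything when no positives are sampled is a fallback chosen for illustration.

```python
import numpy as np

def select_by_proxy_threshold(proxy_scores, oracle, recall_target, budget, rng):
    """Naive (uncorrected) threshold selection over proxy scores."""
    proxy_scores = np.asarray(proxy_scores, dtype=float)
    n = len(proxy_scores)
    # 1. Sample records uniformly at random and ask the oracle for their labels.
    sample_idx = rng.choice(n, size=min(budget, n), replace=False)
    labels = np.array([oracle(i) for i in sample_idx], dtype=bool)
    # 0. Order the sampled positives by proxy score, highest first.
    pos_scores = np.sort(proxy_scores[sample_idx][labels])[::-1]
    if len(pos_scores) == 0:
        return np.arange(n)  # assumption: no positives sampled, so return everything
    # 2. Highest threshold that still keeps recall_target of the sampled positives.
    k = max(1, int(np.ceil(recall_target * len(pos_scores))))
    threshold = pos_scores[k - 1]
    # 3. Return every record whose proxy score is at or above the threshold.
    return np.flatnonzero(proxy_scores >= threshold)
```

Because the threshold is fit to the sample alone, the recall on the full dataset can land below the target, which is the failure the later slides address.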
Querying for 60% recall: Naive Method
25
1. Sample oracle labels uniformly at random
Universe
Sample
Naïve method fails to achieve 60% recall!
2. Choose a selection threshold based on sample labels
Sample
Key idea: use confidence intervals to have a
“buffer” to ensure high probability of success
26
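One way to picture the buffer is to inflate the recall target by a confidence margin before choosing the threshold. The sketch below uses a Hoeffding margin over uniformly sampled positives; this is an illustrative stand-in and not necessarily the confidence-interval method used in SUPG. `threshold_with_buffer` is a hypothetical name.

```python
import math
import numpy as np

def threshold_with_buffer(sample_scores, sample_labels, recall_target, delta):
    """Pick a threshold whose sample recall exceeds the target by a Hoeffding margin."""
    scores = np.asarray(sample_scores, dtype=float)
    labels = np.asarray(sample_labels, dtype=bool)
    pos = np.sort(scores[labels])[::-1]  # sampled positives, highest score first
    m = len(pos)
    if m == 0:
        return -math.inf  # no sampled positives: return every record
    # With probability >= 1 - delta, sample recall at a fixed threshold
    # overestimates the true recall by less than eps.
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    inflated = recall_target + eps
    if inflated >= 1.0:
        return -math.inf  # too few positives to certify the target: return every record
    k = max(1, math.ceil(inflated * m))
    return float(pos[k - 1])
```

With 100 sampled positives, a 60% target and delta = 0.05, the margin is about 0.12, so the threshold is set where roughly 73% of sampled positives are kept.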
Querying for 60% recall: Uniform Method
27
1. Sample oracle labels uniformly at random
Universe
Sample
Uniform sampling results in low precision (20%)!
2. Choose a selection threshold based on sample labels
Sample
Querying for 60% recall: SUPG
28
Importance sampling can give higher precision (42%)!
1. Sample with importance weights
Universe
Sample
2. Choose a selection threshold based on sample labels
Sample
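A rough sketch of importance sampling for threshold selection, under stated assumptions: records are drawn with probability proportional to the square root of the proxy score, mixed 90/10 with the uniform distribution, and recall is estimated with reweighted samples. The exact weights, the mixing ratio, and the function names are illustrative choices, and the confidence-interval correction is omitted here.

```python
import numpy as np

def importance_sample(proxy_scores, budget, rng):
    """Draw record indices with probability roughly proportional to sqrt(proxy score)."""
    proxy_scores = np.asarray(proxy_scores, dtype=float)
    n = len(proxy_scores)
    weights = np.sqrt(np.clip(proxy_scores, 0.0, 1.0))
    total = weights.sum()
    base = weights / total if total > 0 else np.full(n, 1.0 / n)
    # Assumption: mix with uniform so that no record has zero sampling probability.
    probs = 0.9 * base + 0.1 / n
    idx = rng.choice(n, size=budget, replace=True, p=probs)
    # Reweighting factor: uniform probability divided by sampling probability.
    return idx, (1.0 / n) / probs[idx]

def estimate_recall(scores, labels, factors, threshold):
    """Reweighted recall estimate for returning all records with score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    pos_mass = np.asarray(factors, dtype=float) * np.asarray(labels, dtype=bool)
    total = pos_mass.sum()
    if total == 0:
        return 0.0  # no sampled positives, so recall cannot be estimated
    return float(pos_mass[scores >= threshold].sum() / total)
```

Sampling more heavily where the proxy is confident spends the oracle budget on records that are likely positives, which is why the threshold estimate can be tighter than under uniform sampling.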
Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Problem statement
» Algorithms
» Evaluation
29
Evaluation setting
30
Dataset | Modality | Task | Proxy | Oracle
ImageNet | Images | Find hummingbirds | ResNet | Human
night-street | Video | Find cars | ResNet | Mask R-CNN
OntoNotes | Text | Find employee relations | LSTM | Human
TACRED | Text | Find city relations | SpanBERT | Human
Real world datasets span text, images, and videos
with a variety of proxy and oracle models
SUPG query costs are cheap relative to
exhaustive labeling
31
SUPG queries are substantially cheaper than exhaustive labeling
Prior work fails to respect error bounds:
recall target
32
SUPG achieves target recall
with high probability
Naïve methods without correction
fail >50% of the time
Naïve
SUPG
SUPG outperforms uniform sampling:
recall target
33
Uniform sampling is
sample inefficient
Importance sampling
outperforms on all settings
Many other experiments in paper
34
» Precision, joint target setting experiment and algorithms
» Our algorithms are not sensitive to choice of parameters
» Our algorithms are not sensitive to choice of confidence
interval method
» …
https://github.com/stanford-futuredata/supg
Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
35
Query type two: aggregation
Query: “what is the average number of cars per frame?”
SELECT COUNT(*)
FROM taipei
WHERE class = 'car'
ERROR WITHIN 0.1
AT CONFIDENCE 95%
Queries for
understanding bulk
properties
Can we use proxy models for
aggregation?
Is there a car in this frame?
Binary detection: Yes
Prior work on binary detection*:
1. Does not help for busy
videos
2. Does not help count
3. Does not provide statistical
guarantees
* NoScope VLDB ’17, Probabilistic Predicates SIGMOD ’18, …
Optimizing approximate aggregation:
sampling
Query: “What is the average number of cars per frame?”
Two ideas:
1. Sampling reduces number
of frames considered
2. Proxy models provide a
noisy signal
Proxy model: 17 cars
Ground truth: 20 cars
Optimizing approximate aggregation:
proxy models as control variates
Query: “What is the average number of cars per frame?”
[Figure: plot of a(t) and m(t); x-axis: Time (t), y-axis: Number of cars]
$$m^*(t) = m(t) + c \cdot (a(t) - A)$$
$$\mathrm{Var}(m^*) = (1 - \rho_{m,a}^2) \cdot \mathrm{Var}(m)$$
Variance of m* decreases with the correlation
between the proxy and target
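A minimal sketch of the control-variate estimator, assuming m(t) is the expensive target count, a(t) is the cheap proxy count, and A is the proxy's mean over all frames (computable exactly because the proxy is cheap). The coefficient c = -Cov(m, a) / Var(a) is the standard variance-minimizing choice; `control_variate_mean` is a hypothetical name.

```python
import numpy as np

def control_variate_mean(m_sample, a_sample, A):
    """Estimate E[m] from sampled (m, a) pairs, using a (with known mean A) as a control variate."""
    m_sample = np.asarray(m_sample, dtype=float)
    a_sample = np.asarray(a_sample, dtype=float)
    var_a = a_sample.var(ddof=1)
    if var_a == 0:
        return float(m_sample.mean())  # the proxy carries no signal in this sample
    # Variance-minimizing coefficient, estimated from the sample.
    c = -np.cov(m_sample, a_sample, ddof=1)[0, 1] / var_a
    # m*(t) = m(t) + c * (a(t) - A)
    m_star = m_sample + c * (a_sample - A)
    return float(m_star.mean())
```

When the proxy tracks the target closely, the corrected samples m*(t) vary much less than m(t), so fewer oracle calls are needed for the same error bound.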
Optimizing approximate aggregation:
EBS stopping
Query: “What is the average number of cars per frame?”
Intuition: always valid
stopping based on sample
variance
Lower variance terminates
earlier!
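The stopping rule can be sketched with an empirical Bernstein bound and a union bound over time steps. This is a simplified absolute-error variant written for illustration, not the exact EBS procedure; `sample_until_within`, `draw`, and `value_range` are hypothetical names, and values are assumed to lie in an interval of width `value_range`.

```python
import math
import numpy as np

def sample_until_within(draw, value_range, eps, delta, max_samples):
    """Sample until an empirical Bernstein bound certifies |estimate - true mean| <= eps."""
    values = []
    for t in range(1, max_samples + 1):
        values.append(draw())
        if t < 2:
            continue
        # Spend delta / (t * (t + 1)) of the failure budget at step t;
        # these sum to at most delta, so the bound holds at every step.
        log_term = math.log(3.0 * t * (t + 1) / delta)
        std = float(np.std(values))
        half_width = std * math.sqrt(2.0 * log_term / t) + 3.0 * value_range * log_term / t
        if half_width <= eps:
            break
    return float(np.mean(values)), len(values)
```

The half-width shrinks with the sample standard deviation, so a lower-variance estimator (for example, one using a control variate) reaches the error target after fewer oracle calls.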
BlazeIt outperforms materializing rows and
random sampling for aggregation
1. Naïve is extremely slow
2. Knowing if a car is in the frame
doesn’t help for busy videos!
3. Sampling dramatically improves
performance
4. BlazeIt can forgo object
detection
5. Even faster with caching
Outline
» Motivation
» Efficient query processing with ML models
» Selection queries with guarantees
» Aggregation queries
» Limit queries
» Improving ML models by finding errors
» Ongoing work
42
Query type three: limit queries
Query: “find clips of at least one bus and at least five cars”
SELECT timestamp
FROM taipei
GROUP BY timestamp
HAVING SUM(class='bus')>=1
AND SUM(class='car')>=5
LIMIT 10
Queries for manual
inspection of rare
events
Optimizing limit queries:
proxy models for ranking
Query: “Find frames of at least five cars, at least one bus”
Proxy model:
P(>= 5 cars) = 0.98
P(>= 1 bus) = 0.83
Proxy model:
P(>= 5 cars) = 0.13
P(>= 1 bus) = 0.89
Optimizing limit queries:
proxy models for ranking
Query: “Find frames of at least five cars, at least one bus”
Key idea:
Bias search towards
high-confidence areas
Evaluate this frame first
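The ranking idea can be sketched as: score every frame with the proxy, then call the expensive oracle in descending order of confidence until enough matches are found. `limit_query` and `oracle(i)` are hypothetical names, and multiplying the two proxy probabilities into one confidence score is an assumption made for this example.

```python
def limit_query(proxy_confidence, oracle, limit):
    """Return up to `limit` oracle-confirmed frame indices, checking the likeliest frames first."""
    order = sorted(range(len(proxy_confidence)),
                   key=lambda i: proxy_confidence[i], reverse=True)
    matches, oracle_calls = [], 0
    for i in order:
        if len(matches) >= limit:
            break
        oracle_calls += 1  # each check is one expensive model invocation
        if oracle(i):
            matches.append(i)
    return matches, oracle_calls

# Assumption: combine P(>= 5 cars) and P(>= 1 bus) by multiplying them.
confidence = [0.13 * 0.89, 0.98 * 0.83, 0.5]
truth = [False, True, True]
matches, calls = limit_query(confidence, lambda i: truth[i], limit=1)
```

With the two example frames from the slide, the frame scored 0.98 and 0.83 is evaluated first, and the query stops after a single oracle call if that frame matches.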
BlazeIt can sample many fewer frames
than naïve approaches for limit queries
Key takeaway: don’t call
expensive model!
Many more results and code available!
» Selection [VLDB ’17, ’20], Aggregation / LIMIT queries
[VLDB ’20]
» Generating proxy scores [TASTI, MLSys ’21 (under
review)]
» Fast inference for visual analytics [VLDB ’21 (to appear)]
47
https://github.com/stanford-futuredata/tasti
Outline
» Motivation
» Efficient query processing with ML models
» Improving ML models by finding errors
» Ongoing work
48
ML models make errors that affect
downstream analytics
49
Boxes of cars should not
highly overlap
Cars should not flicker in and
out of video
Key insight: ML models make systematic errors
Human labels from leading commercial
vendors contain errors
50
Errors in “ground truth labels” can cause
downstream safety risks!
51
“As the [automated driving system] changed the classification of the
pedestrian several times—alternating between vehicle, bicycle, and
an other — the system was unable to correctly predict the path of the
detected object,” the board’s report states.
Evaluating Model Quality after Retraining:
Qualitative Improvement
[Figure: side-by-side detections; panel titles: Best Retrained SSD, Original SSD]
52
Conclusion
» Analytics queries over unstructured data are important for
commercial and research applications
» ML models are unreliable and expensive!
» We can use cheap proxies to get statistical guarantees on
downstream queries efficiently
54
https://github.com/stanford-futuredata/tasti
ddkang@stanford.edu

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

Efficient Query Processing Using Machine Learning

  • 1. Efficient and Reliable Query Processing using Machine Learning Daniel Kang 1
  • 2. Unstructured data is ubiquitous and cheap 2 » High quality sensors are incredibly cheap (<$0.70 for a webcam) » London alone has 500K CCTVs » Autonomous vehicles produce large volumes of data
  • 3. Many research/commercial applications would benefit from being able to analyze this data 3 “Find 95% of hummingbirds in the video for ecological analysis” “[…] to have the [$1.5 million] also here in Ghana no later than end of this week” “Find bribes in emails with 95% precision so SEC lawyers can investigate” “How many cars passed by on Monday?”
  • 4. 4 DNNs have made strides in language understanding CNNs perform well on image classification benchmarks ML models can perform well on a range of benchmark tasks
  • 5. Query processing case study: finding hummingbirds for ecological analysis 5 Goal: match field readings with hummingbird visits Only ~0.1% of the video contains hummingbirds! Query: find 95% of the instances of hummingbirds
  • 6. Can we deploy ML to help? 6 Ideal case: 1. Find off-the-shelf model 2. Execute over data 3. Find all the hummingbirds!
  • 7. Challenge 1: ML models are unreliable 7 Small shifts in distribution can substantially harm even state-of-the-art DNNs!
  • 8. Challenge 2: Deploying ML is expensive 8 Labeling datasets can cost $100k+ State-of-the-art DNNs run as slow as 3 fps
  • 9. Fundamental trade-off between accuracy and speed/cost 9 Log scale! Can we get here?
  • 10. My work: how can we use unreliable and expensive ML models in query processing? 10 1. Query processing techniques to efficiently deploy fixed ML models with statistical guarantees [VLDB ’18, VLDB ’20, VLDB ’20, VLDB ’21] 2. Methods of improving ML models for better query accuracy/speed by allowing users to specify when errors may be occurring [MLSys ’20, MLSys ’21 (under review)]
  • 11. Outline » Motivation » Efficient query processing with ML models » Improving ML models by finding errors » Ongoing work 11
  • 12. Answering queries over unstructured data with ML: overview of the naïve method 12 Query Oracle (e.g., DNN) Video Relation Oracles are used in higher level queries but can be incredibly expensive » Selection » Aggregation » Limit queries » ..
  • 13. Two key ideas: sampling and proxy scores 13 Can we combine them? Sampling can reduce the number of records considered [Plot: a(t) and m(t) against Time (t); y-axis: Number of cars] We can generate cheap approximations
  • 14. Outline » Motivation » Efficient query processing with ML models » Selection queries with guarantees » Aggregation queries » Limit queries » Improving ML models by finding errors » Ongoing work 14
  • 15. Query type one: select instances of rare events 15 Data Records Oracle (naive) Proxy Scores x 0.30 0.94 0.92 0.12 … … 0.75 x ? ? x 0.71 Human or complex model » “Find 95% of the hummingbirds” » “Find 95% of the mentions of bribes” » …
  • 16. Common technique: proxy models to reduce the cost of selection 16 Data Records Oracle (naive) Proxy Scores x 0.30 0.94 0.92 0.12 … … 0.75 x ? ? x 0.71 Proxy model evaluated on all data “Oracle” consulted on a sample of data Widely explored for video analytics [NoScope VLDB ‘17, Probabilistic Predicates SIGMOD ‘18…]
  • 17. Example: ecological analysis 17 Find 95% of the hummingbirds using Mask R-CNN as a proxy with human labels as ground truth
  • 18. Many queries require statistical guarantees on accuracy 18 » “Find 95% of the hummingbird frames with failure probability at most 5%” » “Find e-mails referencing bribes with at least 95% precision with failure probability at most 5%” Scientists require high probability of success to publish
  • 19. Prior work using proxies fails to achieve statistical guarantees on failure probability! 19 Prior work does not achieve the target accuracy over half the time Precision over multiple runs
  • 20. Outline » Motivation » Efficient query processing with ML models » Selection queries with guarantees » Problem statement » Algorithms » Evaluation 20
  • 21. Problem statement: recall target 21 SELECT * FROM bush_video WHERE HUMMINGBIRD_PRESENT(frame) ORACLE LIMIT 10,000 USING MASK_RCNN(frame) RECALL TARGET 90% WITH PROBABILITY 95% Return a set of records that achieves 90% recall with probability at least 95%, using at most 10,000 oracle samples, with precision as high as possible
  • 22. Example query: finding hummingbirds with high recall 22 SELECT * FROM hummingbird_video WHERE HUMMINGBIRD_PRESENT(frame) = TRUE ORACLE LIMIT 1,000 USING MASK_RCNN(frame) = ‘hummingbird’ RECALL TARGET 90% WITH PROBABILITY 95%
  • 23. Outline » Motivation » Efficient query processing with ML models » Selection queries with guarantees » Problem statement » Algorithms » Evaluation 23
  • 24. Algorithm overview 24 0. Order records by proxy score Universe 2. Choose a selection threshold based on sample labels 3. Return all records above threshold Returned set of records Sample 1. Sample oracle labels
  • 25. Querying for 60% recall: Naive Method 25 1. Sample oracle labels uniformly at random Universe Sample Naïve method fails to achieve 60% recall! 2. Choose a selection threshold based on sample labels Sample
  • 26. Key idea: use confidence intervals to have a “buffer” to ensure high probability of success 26
  • 27. Querying for 60% recall: Uniform Method 27 1. Sample oracle labels uniformly at random Universe Sample Uniform sampling results in low precision (20%)! 2. Choose a selection threshold based on sample labels Sample
  • 28. Querying for 60% recall: SUPG 28 Importance sampling can give higher precision (42%)! 1. Sample with importance weights Universe Sample 2. Choose a selection threshold based on sample labels Sample
  • 29. Outline » Motivation » Efficient query processing with ML models » Selection queries with guarantees » Problem statement » Algorithms » Evaluation 29
  • 30. Evaluation setting 30 Real-world datasets span text, images, and videos with a variety of proxy and oracle models
      Dataset | Modality | Task | Proxy | Oracle
      ImageNet | Images | Find hummingbirds | ResNet | Human
      night-street | Video | Find cars | ResNet | Mask R-CNN
      OntoNotes | Text | Find employee relations | LSTM | Human
      TACRED | Text | Find city relations | SpanBERT | Human
  • 31. SUPG query costs are cheap relative to exhaustive labeling 31 SUPG queries are substantially cheaper than exhaustive labeling
  • 32. Prior work fails to respect error bounds: recall target 32 SUPG achieves target recall with high probability Naïve methods without correction fail >50% of the time Naïve SUPG
  • 33. SUPG outperforms uniform sampling: recall target 33 Uniform sampling is sample inefficient Importance sampling outperforms on all settings
  • 34. Many other experiments in paper 34 » Precision, joint target setting experiment and algorithms » Our algorithms are not sensitive to choice of parameters » Our algorithms are not sensitive to choice of confidence interval method » … https://github.com/stanford-futuredata/supg
  • 35. Outline » Motivation » Efficient query processing with ML models » Selection queries with guarantees » Aggregation queries » Limit queries » Improving ML models by finding errors » Ongoing work 35
  • 36. Query type two: aggregation Query: “what is the average number of cars per frame?” SELECT COUNT(*) FROM taipei WHERE class = 'car' ERROR WITHIN 0.1 AT CONFIDENCE 95% Queries for understanding bulk properties
  • 37. Can we use proxy models for aggregation? Is there a car in this frame? Binary detection: Yes Prior work on binary detection*: 1. Does not help for busy videos 2. Does not help count 3. Does not provide statistical guarantees * NoScope VLDB ‘17, Probabilistic Predicates SIGMOD ‘18, …
  • 38. Optimizing approximate aggregation: sampling Query: “What is the average number of cars per frame?” Two ideas: 1. Sampling reduces number of frames considered 2. Proxy models provide a noisy signal Proxy model: 17 cars Ground truth: 20 cars
  • 39. Optimizing approximate aggregation: proxy models as control variates Query: “What is the average number of cars per frame?” [Plot: a(t) and m(t) against Time (t); y-axis: Number of cars] $m^*(t) = m(t) + c \cdot (a(t) - A)$, $\mathrm{Var}(m^*) = (1 - \rho_{m,a}^2) \cdot \mathrm{Var}(m)$. The variance of $m^*$ decreases with the correlation between the proxy and target
  • 40. Optimizing approximate aggregation: EBS stopping Query: “What is the average number of cars per frame?” Intuition: always valid stopping based on sample variance Lower variance terminates earlier!
  • 41. BlazeIt outperforms materializing rows and random sampling for aggregation 1. Naïve is extremely slow 2. Knowing if a car is in the frame doesn’t help for busy videos! 3. Sampling dramatically improves performance 4. BlazeIt can forego object detection 5. Even faster with caching
  • 42. Outline » Motivation » Efficient query processing with ML models » Selection queries with guarantees » Aggregation queries » Limit queries » Improving ML models by finding errors » Ongoing work 42
  • 43. Query type three: limit queries Query: “find clips of at least one bus and at least five cars” SELECT timestamp FROM taipei GROUP BY timestamp HAVING SUM(class='bus')>=1 AND SUM(class='car')>=5 LIMIT 10 Queries for manual inspection of rare events
  • 44. Optimizing limit queries: proxy models for ranking Query: “Find frames of at least five cars, at least one bus” Proxy model: P(>= 5 cars) = 0.98 P(>= 1 bus) = 0.83 Proxy model: P(>= 5 cars) = 0.13 P(>= 1 bus) = 0.89
  • 45. Optimizing limit queries: proxy models for ranking Query: “Find frames of at least five cars, at least one bus” Key idea: Bias search towards high-confidence areas Evaluate this frame first
  • 46. BlazeIt can sample many fewer frames than naïve approaches for limit queries Key takeaway: don’t call expensive model!
  • 47. Many more results and code available! » Selection [VLDB ’17, ’20], Aggregation / LIMIT queries [VLDB ’20] » Generating proxy scores [TASTI, MLSys ’21 (under review)] » Fast inference for visual analytics [VLDB ’21 (to appear)] 47 https://github.com/stanford-futuredata/tasti
  • 48. Outline » Motivation » Efficient query processing with ML models » Improving ML models by finding errors » Ongoing work 48
  • 49. ML models make errors that affect downstream analytics 49 Boxes of cars should not highly overlap Cars should not flicker in and out of video Key insight: ML models make systematic errors
  • 50. Human labels from leading commercial vendors contain errors 50 Errors in “ground truth labels” can cause downstream safety risks!
  • 51. 51 “As the [automated driving system] changed the classification of the pedestrian several times—alternating between vehicle, bicycle, and an other — the system was unable to correctly predict the path of the detected object,” the board’s report states.
  • 52. Evaluating Model Quality after Retraining: Qualitative Improvement [Images: Best Retrained SSD, Original SSD] 52
  • 53. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 54. Conclusion » Analytics queries over unstructured data are important for commercial and research applications » ML models are unreliable and expensive! » We can use cheap proxies to get statistical guarantees on downstream queries efficiently 54 https://github.com/stanford-futuredata/tasti ddkang@stanford.edu
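
The control-variate idea from slide 39, m*(t) = m(t) + c · (a(t) − A), can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: the frame counts are synthetic, the proxy is the true count plus Gaussian noise, A is taken to be the proxy's mean over all frames (cheap to compute because the proxy runs on every frame), and c is estimated from the sample as −Cov(m, a) / Var(a), the standard variance-minimizing choice. The function names are hypothetical and this is not BlazeIt's implementation.

```python
import random
import statistics


def control_variate_mean(m_sample, a_sample, proxy_mean):
    """Estimate the mean of the expensive values m using proxy values a.

    m_sample[i] and a_sample[i] come from the same sampled frame;
    proxy_mean is the proxy's mean over ALL frames (A on the slide).
    """
    m_bar = statistics.fmean(m_sample)
    a_bar = statistics.fmean(a_sample)
    cov = sum((m - m_bar) * (a - a_bar) for m, a in zip(m_sample, a_sample))
    var = sum((a - a_bar) ** 2 for a in a_sample)
    # Assumed coefficient: c = -Cov(m, a) / Var(a), estimated from the sample.
    c = -cov / var if var > 0 else 0.0
    adjusted = [m + c * (a - proxy_mean) for m, a in zip(m_sample, a_sample)]
    return statistics.fmean(adjusted)


def simulate(trials=200, n_frames=10_000, sample_size=100, seed=0):
    """Compare plain sampling with the control-variate estimator.

    Returns (true mean, variance of plain estimates,
    variance of control-variate estimates) across repeated trials.
    """
    rng = random.Random(seed)
    # Synthetic data (assumption): true car counts and a noisy proxy.
    truth = [rng.randint(0, 20) for _ in range(n_frames)]
    proxy = [m + rng.gauss(0, 2) for m in truth]
    proxy_mean = statistics.fmean(proxy)

    plain, adjusted = [], []
    for _ in range(trials):
        idx = rng.sample(range(n_frames), sample_size)
        m_s = [truth[i] for i in idx]  # "oracle" calls on the sample only
        a_s = [proxy[i] for i in idx]
        plain.append(statistics.fmean(m_s))
        adjusted.append(control_variate_mean(m_s, a_s, proxy_mean))
    return (statistics.fmean(truth),
            statistics.variance(plain),
            statistics.variance(adjusted))


if __name__ == "__main__":
    true_mean, var_plain, var_cv = simulate()
    print(f"true mean: {true_mean:.3f}")
    print(f"variance of plain sample mean: {var_plain:.4f}")
    print(f"variance of control-variate estimate: {var_cv:.4f}")
```

Because the proxy in this toy setup is strongly correlated with the true count, the control-variate estimates should vary far less across trials than the plain sample means, which is the effect the slide's variance formula describes. A weaker proxy (larger noise) would shrink that gap.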