ML-Based SQL Query
Resource Usage Prediction
Chunxu Tang (Twitter)
Beinan Wang (Alluxio)
September 2022
Roadmap
1. Presto System
2. Forecasting SQL Query Resource Usage
3. Conclusion & Takeaways
Presto System
A long path pursuing high scalability and availability
Use Cases
❏ Interactive analytics
❏ Business Intelligence (BI)
❏ Dashboard reporting
❏ Program Debugging
❏ ETL/Scheduled jobs
Hybrid-Cloud SQL Federation System
Issues Without Query Utilization Prediction
❏ Query scheduling requires an estimate of the immediate workload
in the SQL system.
❏ Data system customers would like an estimate of the resource
consumption of their queries.
❏ Elastic scaling needs query resource usage forecasting.
Pre-existing DBMS approaches usually use query plans generated from
SQL engines.
Forecasting SQL Query Cost
How to use machine learning to predict the resource
utilization of a SQL query?
Design
Query Resource Usage Prediction System
❏ Request logs
❏ ~1.2M records in 3 months
❏ Training cluster
❏ CPU time model
❏ Peak memory bytes model
❏ Model repository
❏ Serving cluster
Data Preprocessing
Data Discretization
Transforming the continuous data to discrete data
❏ Customers do not need an exact predicted value
❏ Query scheduling/scaling does not require an exact value
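
A minimal sketch of the discretization step, assuming CPU time is bucketed into a handful of half-open ranges. The slides do not state the actual bucket boundaries, so `CPU_TIME_BOUNDS_S` and `CPU_TIME_LABELS` below are hypothetical values chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries for CPU time, in seconds; the slides do not
# state the boundaries actually used.
CPU_TIME_BOUNDS_S = [30, 3600, 18000]
CPU_TIME_LABELS = ["<30s", "30s-1h", "1h-5h", ">=5h"]


def discretize(value, bounds, labels):
    """Map a continuous value to the label of the half-open bucket containing it."""
    if len(labels) != len(bounds) + 1:
        raise ValueError("need exactly one more label than boundaries")
    # bisect_right puts a value equal to a boundary into the upper bucket,
    # so every bucket is [lower, upper).
    return labels[bisect_right(bounds, value)]


cpu_times = [4.2, 30, 1250.0, 7200, 90000]
cpu_classes = [discretize(t, CPU_TIME_BOUNDS_S, CPU_TIME_LABELS) for t in cpu_times]
```

Turning the target into a small set of classes is what lets the problem be treated as classification rather than regression.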
Feature Extraction
Word frequency vs. TF-IDF
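
To make the contrast concrete, here is a small sketch of both feature types computed over SQL text. It assumes a simple regex tokenizer and the textbook TF-IDF formula, (count / query length) * log(N / document frequency). The tokenizer, the sample queries, and the function names are illustrative assumptions, and library implementations often use smoothed variants of this formula.

```python
import math
import re
from collections import Counter


def tokenize(sql):
    """Lowercase a SQL string and split it into word-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", sql.lower())


def word_frequency(queries):
    """Raw token counts per query."""
    return [Counter(tokenize(q)) for q in queries]


def tf_idf(queries):
    """Textbook TF-IDF: (count / query length) * log(N / document frequency)."""
    counts = word_frequency(queries)
    n = len(queries)
    # Iterating a Counter yields each distinct token once, so this counts
    # the number of queries that contain each token.
    df = Counter(tok for c in counts for tok in c)
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append({tok: (k / total) * math.log(n / df[tok]) for tok, k in c.items()})
    return weighted


# Made-up sample queries, for illustration only.
queries = [
    "SELECT user_id FROM events",
    "SELECT COUNT(*) FROM events GROUP BY user_id",
    "SELECT * FROM orders JOIN users ON orders.user_id = users.id",
]
freq = word_frequency(queries)
weights = tf_idf(queries)
```

With plain word frequency, `select` and `from` count as much as any other token. With TF-IDF they receive zero weight here because they occur in every query, while rarer tokens such as `join` keep a positive weight.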
Model Training & Evaluation
❏ Random Forest
❏ XGBoost
❏ Logistic Regression
Model Evaluation (Accuracy)
Model Evaluation (Precision & Recall)
XGBoost + TF-IDF
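
For reference, per-class precision and recall can be computed from predicted and actual class labels as sketched below. The labels and sample predictions are made up for illustration and are not results from the talk.

```python
from collections import Counter


def precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1  # predicted this class, but it was wrong
            fn[actual] += 1     # missed the class it really belonged to
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        n_predicted = tp[label] + fp[label]
        n_actual = tp[label] + fn[label]
        precision = tp[label] / n_predicted if n_predicted else 0.0
        recall = tp[label] / n_actual if n_actual else 0.0
        report[label] = (precision, recall)
    return report


# Made-up labels, for illustration only.
y_true = ["low", "low", "low", "low", "high", "high"]
y_pred = ["low", "low", "low", "high", "high", "low"]
report = precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy sample the `high` class has noticeably lower precision and recall than the overall accuracy suggests, which is why per-class metrics are reported alongside accuracy.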
Model Serving
❏ Real-time model serving for CPU time
and peak memory bytes prediction
❏ Model re-training may be required
because of concept drift
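
One possible way to decide when re-training is due is to track accuracy over a sliding window of recent predictions once the real resource usage is known. The sketch below is an assumption about how this could be done, not the mechanism described in the talk; `DriftMonitor`, `window`, and `min_accuracy` are hypothetical names and values.

```python
from collections import deque


class DriftMonitor:
    """Track accuracy over the most recent predictions and flag a re-train."""

    def __init__(self, window=1000, min_accuracy=0.85):
        # window and min_accuracy are hypothetical tuning knobs.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one served prediction matched the observed class."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        # Only decide once the window is full, to avoid noisy early signals.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.min_accuracy


monitor = DriftMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [("low", "low"), ("low", "low"), ("high", "low"), ("low", "high")]:
    monitor.record(predicted, actual)
```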
Open-Source Project
❏ Codebase
❏ https://github.com/prestodb/presto-query-predictor
❏ Docs
❏ https://chunxutang.github.io/presto-query-predictor-docs/ (temporary)
❏ https://github.com/prestodb/presto-query-predictor/tree/main/docs
❏ Paper
❏ Published at IC2E 2021 https://arxiv.org/pdf/2204.05529.pdf
Can we apply the ML-based
approach to other query systems?
BigQuery Slots Prediction
❏ A BigQuery slot is a virtual CPU used by BigQuery to execute SQL
queries.
❏ The processing of each SQL query generates a log record containing a
field “total_slot_ms”.
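
A rough sketch of turning log records into training labels, using the one-hour boundary that appears in the evaluation. The flat JSON record shape and the `query` key are assumptions for illustration; only the `total_slot_ms` field name comes from the slides, and real BigQuery logs are structured differently.

```python
import json

ONE_HOUR_MS = 60 * 60 * 1000  # one-hour class boundary used in the evaluation


def slot_time_class(total_slot_ms):
    """Label a query by total slot time: under one hour, or one hour and above."""
    return "<1h" if total_slot_ms < ONE_HOUR_MS else ">=1h"


# Hypothetical, simplified log records; the "query" key and the flat layout
# are assumptions made for this example.
raw_logs = [
    '{"query": "SELECT 1", "total_slot_ms": 120}',
    '{"query": "SELECT * FROM big_table", "total_slot_ms": 5400000}',
]
examples = [
    (record["query"], slot_time_class(int(record["total_slot_ms"])))
    for record in map(json.loads, raw_logs)
]
```

Each `(query text, class)` pair can then go through the same feature extraction and classification steps used for the Presto models.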
Evaluation
❏ Dataset
❏ ~400k BigQuery SQL records in one Google Cloud project
❏ Training cluster
❏ Total slot time model
❏ XGBoost + TF-IDF
❏ Overall accuracy
❏ 92.4%
Total slot time    Precision    Recall
[0, 1h)            0.94         0.80
[1h, ∞)            0.92         0.98
Conclusion & Takeaways
What we have learned
❏ Machine learning can be useful
❏ Some basic models can give quite good results
❏ Logs are reliable starting points
❏ Machine learning requires data for training
❏ Be careful! Models should be evaluated on various metrics
Future Work
❏ More sophisticated data discretization
❏ More state-of-the-art classification algorithms
❏ Model decay optimization
❏ Model interpretation
❏ ...
Q & A
Alluxio Slack: @Chunxu Tang @Beinan Wang
PrestoDB Slack: @Chunxu Tang @beinan
