Spark’s Catalyst Optimizer uses cost-based optimization (CBO) to pick the best execution plan for a SparkSQL query. The CBO can choose which join strategy to use (e.g., a broadcast join versus repartitioned join), which table to use as the build side for the hash-join, which join order to use in a multi-way join query, which filter to push down, and others. To get its decisions right, the CBO makes a number of assumptions including availability of up-to-date statistics about the data, accurate estimation of result sizes, and availability of accurate models to estimate query costs.
These assumptions may not hold in real-life settings such as multi-tenant clusters and agile cloud environments; unfortunately, causing the CBO to pick suboptimal execution plans. Dissatisfied users then have to step in and tune the queries manually. In this talk, we will describe how we built Elfino, a Self-Driving Query Optimizer. Elfino tracks each and every query over time—before, during, and after execution—and uses machine learning algorithms to learn from mistakes made by the CBO in estimating properties of the input datasets, intermediate query result sizes, speed of the underlying hardware, and query costs. Elfino can further use an AI algorithm (modeled on multi-armed bandits with expert advice) to “explore and experiment” in a safe and nonintrusive manner using otherwise idle cluster resources. This algorithm enables Elfino to learn about execution plans that the CBO will not consider otherwise.
Our talk will cover how these algorithms help guide the CBO towards better plans in real-life settings while reducing its reliance on assumptions and manual steps like query tuning, setting configuration parameters by trial-and-error, and detecting when statistics are stale. We will also share our experiences with evaluating Elfino in multiple environments and highlight interesting avenues for future work.
24. Extracting*Input*Features*from*Logs
java.lang.OutOfMemoryError: Java heap space
at
scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:114)
at
scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:112)
at …
• Extracting+stack+traces+and+error+messages
• Tokenize+by+class+names+and+words
Tokens*example:
java.lang.OutOfmemoryError Java heap space at
scala.reflect.ManifestFactory$$anon$9.newArray(Manife
st.scala:114)
• Create+a+vocabulary+of+words+from+all+words+collected
24#Exp8SAIS