Tuning a Spark ML model with cross-validation can be extremely computationally expensive. As the number of hyperparameter combinations grows, so does the number of models that must be trained and evaluated. By default, Spark trains and evaluates these models one at a time to select the best performer. When there are many models and training and evaluating a single model does not fully utilize the available cluster resources, that idle capacity is wasted again for every model, which leads to long run times.
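To make the growth concrete, here is a minimal sketch of the arithmetic, assuming k-fold cross-validation fits one model per hyperparameter combination per fold. The grid values and the helper `count_cv_models` are hypothetical and chosen only for illustration; the parameter names merely resemble common Spark ML settings.

```python
from itertools import product


def count_cv_models(param_grid, num_folds):
    """Return how many models k-fold cross-validation fits for a grid.

    Assumes one fit per (combination, fold) pair. The final refit of the
    best combination on the full dataset is not counted.
    """
    combinations = list(product(*param_grid.values()))
    return len(combinations) * num_folds


# Hypothetical grid: 3 x 4 x 2 = 24 combinations.
grid = {
    "regParam": [0.01, 0.1, 1.0],
    "maxIter": [10, 50, 100, 200],
    "elasticNetParam": [0.0, 0.5],
}

print(count_cv_models(grid, num_folds=3))  # 24 combinations x 3 folds = 72
```

Adding a single extra value to any one parameter multiplies the total rather than adding to it, which is why serial evaluation becomes slow quickly.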
Model parallelism in Spark cross-validation, available since Spark 2.3, allows more than one model to be trained and evaluated at the same time, making better use of cluster resources. We will go over how to enable this setting in Spark, what effect it has on an example ML pipeline, and best practices to keep in mind when using this feature.
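In Spark itself, this setting is exposed as the `parallelism` parameter on `CrossValidator` (default 1, meaning serial evaluation). The sketch below does not call Spark. It is a framework-free illustration of the idea, assuming each model fit mostly waits on work done elsewhere, so a small thread pool can keep several fits in flight. The functions `fit_and_score` and `evaluate_grid` and the scoring formula are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_and_score(params):
    """Hypothetical stand-in for training and evaluating one model."""
    time.sleep(0.05)  # simulates waiting on a distributed job
    score = 1.0 / (1.0 + abs(params["regParam"] - 0.1))  # made-up metric
    return params, score


def evaluate_grid(param_maps, parallelism=1):
    """Score every parameter map, running up to `parallelism` fits at once."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map returns results in the same order as param_maps.
        return list(pool.map(fit_and_score, param_maps))


param_maps = [{"regParam": r} for r in (0.001, 0.01, 0.1, 1.0)]

serial = evaluate_grid(param_maps, parallelism=1)
parallel = evaluate_grid(param_maps, parallelism=4)

# Both runs produce identical scores; only the wall-clock time differs.
best_params, best_score = max(parallel, key=lambda result: result[1])
print(best_params)
```

The parallel run selects the same best parameters as the serial one, since concurrency changes only when each fit runs. Setting the level higher than the cluster can absorb would make the fits compete for the same resources, so the value is worth choosing with the cluster size in mind.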
Additionally, we will discuss work in progress to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines.