Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.