In recent years Apache Spark has received a lot of hype in the Big Data community. It is seen as a silver bullet for all problems related to gathering, processing and analysing massive datasets. Due to its rapid evolution (do not forget that Spark is one of the most active open source projects), some of the ideas behind it seem to be unclear and require digging into different blog posts and presentations. During this talk we will dive into the internals of Spark SQL, look at how our queries are translated into the actual code executed on the nodes, and find different ways to debug and optimize them.
1. Spark SQL under the hood
Mikołaj Kromka, VirtusLab
mkromka@virtuslab.com
DataKRK meetup
Kraków, 06.09.2017
2. Bio
● Software engineer at VirtusLab and Spark trainer at Virtusity
● Focused mostly on the Scala ecosystem
● Currently developing a new Analytics Platform for Tesco
3. Brief (and selective) history of structuring data
● Codd's relational model (1969 - 50th anniversary in two years!)
● SQL
○ one of the first commercial implementations at IBM (early 1970s)
○ SQL-based RDBMS developed at Relational Software, Inc (now Oracle Corporation) in the late 1970s
● Apache Hive bringing SQL-like capabilities to the Big Data world (open sourced 2008)
● Shark (Hive on Spark, the predecessor of Spark SQL)
● Spark SQL (2014)
4. Apache Spark: why the fuss?
● General engine for large-scale data processing
● Resilient Distributed Datasets
● Automatically generated graph (DAG) of computations
● Scala, Java, Python and R APIs
● A lot of libraries on top of it (SQL, ML, GraphX, Streaming)
● One of the most active open source projects
Source: https://spark.apache.org/docs/latest/cluster-overview.html
6. Do we need anything else?
YES
● Data is usually structured - but RDDs contain arbitrary Java/Python objects, and transformations of RDDs contain arbitrary code
● Analysts know SQL/Hive
● Large SQL/HiveQL codebases that we would like to reuse
● Connecting to different data sources with (semi-)structured datasets
● Applying advanced and complex algorithms (such as ML)
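This is exactly what the DataFrame/SQL API gives us. A minimal sketch of the idea, assuming a local session and an illustrative JSON file and column name (neither comes from the slides): a DataFrameReader loads the semi-structured data, and analysts can query it with plain SQL.

import org.apache.spark.sql.SparkSession

// Illustrative sketch: the path /data/events.json and the event_type column
// are made up for this example.
val spark = SparkSession.builder()
  .appName("structured-data-sketch")
  .master("local[*]")
  .getOrCreate()

// DataFrameReader infers a schema from the semi-structured JSON input
val events = spark.read.json("/data/events.json")

// Register a view so the dataset can be queried with plain SQL
events.createOrReplaceTempView("events")

val counts = spark.sql(
  """SELECT event_type, count(*) AS cnt
    |FROM events
    |GROUP BY event_type""".stripMargin)

counts.show()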
15. Code generation
● Why do we need it?
○ without it, simple expressions such as (x + y) + 1 would be interpreted from scratch for every row in the dataset
● Newer versions of Spark SQL support Whole-Stage Code Generation (not only expression-level code generation)
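To see what the generated code actually looks like, Spark ships debugging helpers in org.apache.spark.sql.execution.debug. A minimal sketch, assuming a local session and made-up column names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

// Illustrative sketch: session setup and the x, y columns are made up.
val spark = SparkSession.builder()
  .appName("codegen-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("x", "y")
  .selectExpr("(x + y) + 1 AS result")

// Prints the Java source produced by whole-stage code generation for each
// WholeStageCodegen subtree of the physical plan
df.debugCodegen()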
18. Some advice
● Don't stick to the Dataset API blindly - typed operations that take arbitrary lambdas cannot be inlined during codegen and will be slower
● Don't assume Spark SQL has all the features of a traditional RDBMS; if you don't handle large amounts of data, Postgres will be enough
● If possible, don't create DataFrames from RDDs with the .toDF() method; use a specific DataFrameReader instead
● Analyse the plans generated by Catalyst to see whether any optimizations were missed or there is room for improvement (see the sketch after this list)
● The Spark UI is always useful
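A minimal sketch tying the last points together, with an illustrative CSV path and columns (not from the slides): read through a DataFrameReader rather than .toDF(), then print the plans Catalyst produced.

import org.apache.spark.sql.SparkSession

// Illustrative sketch: /data/users.csv and the id/age columns are made up.
val spark = SparkSession.builder()
  .appName("plan-inspection-sketch")
  .master("local[*]")
  .getOrCreate()

// A DataFrameReader carries the schema and source-specific options,
// unlike building a DataFrame from an RDD with .toDF()
val users = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/users.csv")

val adults = users.filter("age >= 18").select("id", "age")

// Prints the parsed, analyzed and optimized logical plans plus the physical
// plan, so you can check for missed optimizations (e.g. filters not pushed
// down, unexpected shuffles)
adults.explain(extended = true)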