Migrating to Spark 2.0 - Part 2
Moving to next-generation Spark
https://github.com/phatak-dev/spark-two-migration
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● What’s New in Spark 2.0
● Recap of Part 1
● Sub Queries
● Catalog API
● Hive Catalog
● Refresh Table
● Checkpoint for Iteration
● References
What’s new in 2.0?
● Dataset is the new single user-facing abstraction
● The RDD abstraction is used only at runtime
● Higher performance with whole-stage code generation
● Significant changes to the streaming abstraction with Spark Structured Streaming
● Incorporates learnings from 4 years of production use
● Spark ML is replacing MLlib as the de facto ML library
● Breaks API compatibility for better APIs and features
Need for Migration
● A lot of real-world code is written against the 1.x series of Spark
● As the fundamental abstractions have changed, all this code needs to be migrated to get the performance and API benefits
● More and more ecosystem projects require Spark 2.0
● The 1.x series will be out of maintenance mode very soon, so no more bug fixes
Recap of Part 1
● Choosing Scala Version
● New Connectors
● SparkSession Entry Point
● Built-in CSV Connector
● Moving from the DataFrame/RDD APIs to Dataset
● Cross Joins
● Custom ML Transforms
SubQueries in Spark
SubQueries
● A query inside another query is known as a subquery
● Standard feature of SQL
● Example from MySQL:
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● Highly useful, as they allow us to combine multiple different types of aggregation in one query
Types of SubQuery
● In the select clause (scalar)
SELECT employee_id,
age,
(SELECT MAX(age) FROM employee) max_age
FROM employee
● In the from clause (derived tables)
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● In the where clause (predicate)
SELECT * FROM t1 WHERE column1 = (SELECT column1 FROM t2);
SubQuery support in Spark 1.x
● Subquery support in Spark 1.x mimics the support available in Hive 0.12
● Hive only supported subqueries in the from clause, so Spark only supported the same
● Subqueries in the from clause are fairly limited in what they can do
● To support advanced querying in Spark SQL, the other kinds of subqueries needed to be added in 2.0
SubQuery support in Spark 2.x
● Spark has greatly improved its SQL dialect support in the 2.0 version
● It has added most of the features of the SQL-92 standard
● Full-fledged SQL parser, no more dependence on Hive
● Runs all 99 TPC-DS queries natively
● Makes Spark a full-fledged OLAP query engine
Scalar SubQueries
● Scalar subqueries are subqueries that return a single (scalar) result
● There are two kinds of scalar subqueries
○ Uncorrelated subqueries
The ones that do not depend on the outer query
○ Correlated subqueries
The ones that depend on the outer query
Uncorrelated Scalar SubQueries
● Add the maximum sales amount to each row of the sales data
● This normally helps us understand how far a given transaction is from the maximum sale we have made
● In Spark 1.x
sparkone.SubQueries
● In Spark 2.x (see the sketch below)
sparktwo.SubQueries
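A minimal sketch of an uncorrelated scalar subquery running on Spark 2.x (not the repo's sparktwo.SubQueries code); the sales data and column names are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("uncorrelated-scalar-subquery")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative sales data: (item_id, category, amount)
Seq(("i1", "books", 100.0), ("i2", "books", 250.0), ("i3", "toys", 75.0))
  .toDF("item_id", "category", "amount")
  .createOrReplaceTempView("sales")

// The inner query does not refer to the outer row, so it is uncorrelated;
// Spark 1.x rejects a scalar subquery in the select clause, Spark 2.x runs it.
spark.sql(
  """SELECT item_id, amount,
    |       (SELECT MAX(amount) FROM sales) AS max_amount
    |FROM sales""".stripMargin).show()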
Correlated SubQueries
● Add the maximum sales amount to each row of the sales data within each item category
● This normally helps us understand how far a given transaction is from the maximum sale we have made in that category
● In Spark 1.x
sparkone.SubQueries
● In Spark 2.x (see the sketch below)
sparktwo.SubQueries
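A minimal sketch of the correlated variant, reusing the illustrative sales view from the previous sketch; in Spark 2.x the inner aggregate may refer to the outer row through an equality condition:

// Correlated scalar subquery: the inner query refers to the outer row's category.
spark.sql(
  """SELECT item_id, category, amount,
    |       (SELECT MAX(s2.amount) FROM sales s2
    |        WHERE s2.category = s1.category) AS max_in_category
    |FROM sales s1""".stripMargin).show()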
Catalog API
Catalog in SQL
● A catalog is a metadata store that contains the metadata for all the objects in a SQL system
● Typical contents of a catalog are
○ Databases
○ Tables
○ Table Metadata
○ Functions
○ Partitions
○ Buckets
Catalog in Spark 1.x
● By default, Spark uses an in-memory catalog which keeps track of Spark temp tables
● It is not persistent
● For any persistent operations, Spark advocated use of the Hive metastore
● There was no standard API to query metadata information from the in-memory / Hive metastore
● Ad hoc functions were added to SQLContext over time to fix this
Need of Catalog API
● Many interactive applications, like notebook systems, often need an API to query the metastore to show relevant information to the user
● Whenever we integrate with Hive, without a catalog API we have to resort to running HQL queries and parsing their output to get the metadata
● Cannot manipulate the Hive metastore directly from the Spark API
● Needs to evolve to support more metastores in the future
Catalog API in 2.x
● Spark 2.0 has added a full-fledged catalog API to SparkSession
● It lives in the sparkSession.catalog namespace
● This catalog has APIs to create, read and delete elements from the in-memory catalog and also from the Hive metastore
● Having this standard API to interact with the catalog makes developers' lives much easier than before
● If we were using the non-standard APIs before, it's time to migrate
Catalog API Migration
● Migrate from the sqlContext APIs to the sparkSession.catalog API
● Use SparkSession rather than HiveContext to get access to these operations
● Spark 1.x
sparkone.CatalogExample
● Spark 2.x (see the sketch below)
sparktwo.CatalogExample
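A minimal sketch of the sparkSession.catalog API in Spark 2.x (not the repo's sparktwo.CatalogExample); the table name is illustrative:

// List databases, tables and the columns of one table; each call returns a
// Dataset, so the usual show()/collect() operations apply.
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
spark.catalog.listColumns("sales").show()   // assumes a `sales` table or temp view exists

// Temp views can be dropped through the same namespace.
spark.catalog.dropTempView("sales")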
Hive Integration
Hive Integration in Spark 1.x
● Spark SQL has had native support for Hive from the beginning
● In the beginning, Spark SQL used the Hive query parser for parsing and the Hive metastore for persistent storage
● To integrate with Hive, one had to create a HiveContext, which is separate from SQLContext
● Some APIs were only available on SQLContext and some Hive-specific ones only on HiveContext
● No support for manipulating the Hive metastore
Hive Integration in 2.x
● No more separate HiveContext
● SparkSession has an enableHiveSupport API to enable Hive support
● This makes both the Spark SQL and Hive APIs consistent
● The Spark 2.0 catalog API also supports the Hive metastore
● Example (see the sketch below)
sparktwo.CatalogHiveExample
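A minimal sketch of enabling Hive support on SparkSession (not the repo's sparktwo.CatalogHiveExample); it assumes the spark-hive dependency is on the classpath:

import org.apache.spark.sql.SparkSession

// enableHiveSupport wires the session to the Hive metastore, replacing
// the separate HiveContext of Spark 1.x.
val spark = SparkSession.builder()
  .appName("hive-integration")
  .enableHiveSupport()
  .getOrCreate()

// The same catalog API now also reflects Hive databases and tables.
spark.catalog.listTables("default").show()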
Refresh Table API
Need of Refresh Table API
● In Spark, we cache a table (dataset) for performance reasons
● Spark caches the metadata in its metadata store and the actual data in the block manager
● If the underlying file/table changes, there was no direct API in Spark to force a table refresh
● If you just uncache/recache, it only reflects the change in the data, not the metadata
● So we need a standard way to refresh the table
Refresh Table and By Path
● Spark 2.0 provides two APIs for refreshing datasets in Spark
● The refreshTable API, carried over from HiveContext, is used for registered temp tables or Hive tables
● The refreshByPath API is used for refreshing datasets without having to register them as tables beforehand
● Spark 1.x
sparkone.RefreshExample
● Spark 2.x (see the sketch below)
sparktwo.RefreshExample
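A minimal sketch of the two refresh APIs (not the repo's sparktwo.RefreshExample); the table name and path are illustrative:

// Refresh a registered temp table or Hive table after the underlying data changes.
spark.catalog.refreshTable("sales")

// Refresh any cached datasets that were read from the given path,
// without registering them as tables first.
spark.catalog.refreshByPath("/data/sales")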
CheckPoint API
Iterative Programming in Spark
● Spark is one of the first big data frameworks to natively support iterative programming well
● Iterative programs go over the data again and again to compute some results
● Spark ML is one of the iterative frameworks in Spark
● Even though caching and the RDD mechanisms worked great for iterative programming, moving to DataFrame has created new challenges
Spark iterative processing
Iteration in Dataframe API
● As every step of the iteration creates a new DataFrame, the logical plan keeps growing
● As Spark needs to keep the complete query plan for recovery, the overhead of analysing the plan increases as the number of iterations increases
● This overhead is compute-bound and incurred on the driver (master)
● As this overhead grows, it makes iteration very slow
● Ex : sparkone.CheckPoint
Solution to Query Plan Issue
● To solve the ever-growing query plan (lineage), we need to truncate it to make iteration faster
● Whenever we truncate the query plan, we lose the ability to recover
● To avoid that, we need to store the intermediate data before we truncate the query plan
● Saving the intermediate data and truncating the query plan results in faster performance
Dataset Persistence API
● In Spark 2.1, there is a new checkpoint API on Dataset
● It is analogous to the RDD checkpoint API
● For an RDD, checkpoint persists the RDD and then truncates its lineage
● Similarly, for a Dataset, checkpoint persists the dataset and then truncates its query plan
● Make sure the checkpointing time is much lower than the query-planning overhead you are facing
● Ex : sparktwo.CheckPoint (see the sketch below)
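A minimal sketch of Dataset checkpointing inside an iterative loop (Spark 2.1+, not the repo's sparktwo.CheckPoint code); the column name, checkpoint directory and iteration counts are illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// checkpoint() writes the data here, so the truncated plan stays recoverable.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

var df: DataFrame = spark.range(0, 1000000).toDF("value")
for (i <- 1 to 10) {
  df = df.withColumn("value", col("value") + 1)
  if (i % 3 == 0) {
    // Persist the data and truncate the query plan, keeping planning overhead bounded.
    df = df.checkpoint()
  }
}
df.count()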
Migrating Best Practices
Best Practice Migration
● As the fundamental abstractions have changed in Spark 2.0, we need to rethink our best practices
● Many best practices were centered around the RDD abstraction, which is no longer the central abstraction
● Also, with the Catalyst optimiser, much of the optimisation is now done for us by the platform
● So let's look at some best practices of 1.x and see how they change
Choice of Serializer
● Use the Kryo serializer over Java serialization and register classes with Kryo
● This best practice was devised for efficient caching and transfer of RDD data
● But in Spark 2.0, Dataset uses a custom code-generated serialization framework for most code and data
● So unless there is heavy use of RDDs in your project, you don't need to worry about the serializer in 2.0
Cache Format
● RDD uses MEMORY_ONLY as the default, and it is the most efficient caching level for RDDs
● DataFrame/Dataset uses MEMORY_AND_DISK rather than MEMORY_ONLY
● Recomputing a Dataset and converting it to the custom serialization format is often costly
● So use MEMORY_AND_DISK as the default format over MEMORY_ONLY (see the sketch below)
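A minimal sketch of explicitly caching with MEMORY_AND_DISK; salesDf stands for any existing DataFrame:

import org.apache.spark.storage.StorageLevel

// Partitions that do not fit in memory spill to disk instead of being
// recomputed (recomputation would also redo the costly encoding step).
val cached = salesDf.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()   // materialise the cache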
Use of BroadCast variables
● Use broadcast variables for optimising lookups and joins
● Broadcast variables played an important role in making joins efficient in the RDD world
● These variables don't have much scope in Dataset API land
● By configuring the broadcast threshold, Spark SQL will do the broadcasting automatically (see the sketch below)
● Don't use them unless there is a reason
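A minimal sketch of letting Spark SQL handle the broadcast in a join; sales and items stand for any two DataFrames sharing an item_id column:

import org.apache.spark.sql.functions.broadcast

// Tables smaller than spark.sql.autoBroadcastJoinThreshold are broadcast
// automatically; the broadcast() hint forces it for a specific join.
val joined = sales.join(broadcast(items), "item_id")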
Choice of Clusters
● Use YARN/Mesos for production. Standalone is mostly for simple apps
● Users were encouraged to use a dedicated cluster manager over the standalone manager that ships with Spark
● With Databricks Cloud putting its weight behind the standalone cluster manager, it has become production-grade
● Many companies run their Spark applications on a standalone cluster today
● Choose standalone if you run only Spark applications
Use of HiveContext
● Use HiveContext over SQLContext for using Spark SQL
● In Spark 1.x, Spark SQL was simplistic and heavily depended on Hive for query parsing
● In Spark 2.0, Spark SQL has been enriched and is now more powerful than Hive itself
● Most of the Hive UDFs are now rewritten in Spark SQL and code-generated
● Unless you want to use the Hive metastore, use SparkSession without Hive support
References
● http://blog.madhukaraphatak.com/categories/spark-two-migration-series/
● http://www.spark.tc/migrating-applications-to-apache-spark-2-0-2/
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.youtube.com/watch?v=jyXEUXCYGwo
● https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
