
Migrating to Spark 2.0 - Part 2


Moving to next generation Spark

Published in: Data & Analytics


  1. Migrating to Spark 2.0 - Part 2: Moving to next generation Spark
  2. ● Madhukara Phatak ● Technical Lead at Tellius ● Consultant and Trainer ● Consults in Hadoop, Spark and Scala
  3. Agenda ● What’s New in Spark 2.0 ● Recap of Part 1 ● Sub Queries ● Catalog API ● Hive Catalog ● Refresh Table ● Check Point for Iteration ● References
  4. What’s new in 2.0? ● Dataset is the new user-facing single abstraction ● The RDD abstraction is used only at runtime ● Higher performance with whole-stage code generation ● Significant changes to the streaming abstraction with Spark structured streaming ● Incorporates learnings from 4 years of production use ● Spark ML is replacing MLlib as the de facto ML library ● Breaks API compatibility for better APIs and features
  5. Need for Migration ● A lot of real-world code is written in the 1.x series of Spark ● As the fundamental abstractions have changed, all this code needs to migrate to make use of the performance and API benefits ● More and more ecosystem projects require Spark 2.0 ● The 1.x series will be out of maintenance mode very soon, so no more bug fixes
  6. Recap of Part 1 ● Choosing Scala Version ● New Connectors ● Spark Session Entry Point ● Built-in Csv Connector ● Moving from DF/RDD API to Dataset ● Cross Joins ● Custom ML Transforms
  7. SubQueries in Spark
  8. SubQueries ● A query inside another query is known as a subquery ● Standard feature of SQL ● Example from MySQL: SELECT AVG(sum_column1) FROM (SELECT SUM(column1) AS sum_column1 FROM t1 GROUP BY column1) AS t1; ● Highly useful, as they allow us to combine multiple different types of aggregation in one query
  9. Types of SubQuery ● In select clause (Scalar): SELECT employee_id, age, (SELECT MAX(age) FROM employee) max_age FROM employee ● In from clause (Derived Tables): SELECT AVG(sum_column1) FROM (SELECT SUM(column1) AS sum_column1 FROM t1 GROUP BY column1) AS t1; ● In where clause (Predicate): SELECT * FROM t1 WHERE column1 = (SELECT column1 FROM t2);
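The shapes above can be tried directly through spark.sql in 2.0. A minimal sketch, assuming a local SparkSession and a hypothetical employee view created only to make the scalar case runnable:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("subquery-types")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// hypothetical data, only so the query has something to run against
Seq((1, 25), (2, 32), (3, 41)).toDF("employee_id", "age")
  .createOrReplaceTempView("employee")

// scalar subquery in the select clause, supported natively in Spark 2.0+
spark.sql(
  """SELECT employee_id, age,
    |       (SELECT MAX(age) FROM employee) AS max_age
    |FROM employee""".stripMargin).show()
```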
  10. SubQuery support in Spark 1.x ● SubQuery support in Spark 1.x mimics the support available in Hive 0.12 ● Hive only supported subqueries in the from clause, so Spark only supported the same ● Subqueries in the from clause are fairly limited in what they are capable of doing ● To support advanced querying in Spark SQL, the other kinds of subqueries needed to be added in 2.0
  11. SubQuery support in Spark 2.x ● Spark has greatly improved its support for the SQL dialect in the 2.0 version ● Most of the standard features of the SQL-92 standard have been added ● Full-fledged SQL parser, no more depending on Hive ● Runs all 99 TPC-DS queries natively ● Makes Spark a full-fledged OLAP query engine
  12. Scalar SubQueries ● Scalar subqueries are the subqueries which return a single (scalar) result ● There are two kinds of scalar subqueries ○ Uncorrelated subqueries: the ones which don’t depend upon the outer query ○ Correlated subqueries: the ones which depend upon the outer query
  13. Uncorrelated Scalar SubQueries ● Add the maximum sales amount to each row of the sales data ● This normally helps us understand how far a given transaction is from the maximum sale we have made ● In Spark 1.x: sparkone.SubQueries ● In Spark 2.x: sparktwo.SubQueries
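A sketch of what the 2.x version of this might look like; the sales view and its columns are assumptions for illustration, and sparktwo.SubQueries refers to the author's own example, not this code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sales data
Seq(("pen", 10.0), ("book", 45.0), ("pencil", 20.0))
  .toDF("item", "amount")
  .createOrReplaceTempView("sales")

// the inner query does not reference the outer row: uncorrelated
spark.sql(
  """SELECT item, amount,
    |       (SELECT MAX(amount) FROM sales) AS max_amount
    |FROM sales""".stripMargin).show()
```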
  14. Correlated SubQueries ● Add the maximum sales amount to each row of the sales data within each item category ● This normally helps us understand how far a given transaction is from the maximum sale in its category ● In Spark 1.x: sparkone.SubQueries ● In Spark 2.x: sparktwo.SubQueries
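The correlated variant can be sketched similarly; here the inner query references the outer row's category. The schema is again a hypothetical one chosen for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sales data with an item category
Seq(("pen", "stationery", 10.0), ("book", "media", 45.0),
    ("pencil", "stationery", 20.0))
  .toDF("item", "category", "amount")
  .createOrReplaceTempView("sales")

// the inner query depends on the outer row (s1.category): correlated
spark.sql(
  """SELECT item, category, amount,
    |       (SELECT MAX(s2.amount) FROM sales s2
    |        WHERE s2.category = s1.category) AS max_in_category
    |FROM sales s1""".stripMargin).show()
```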
  15. Catalog API
  16. Catalog in SQL ● A catalog is a metadata store which contains all the metadata of a SQL system ● Typical contents of a catalog are ○ Databases ○ Tables ○ Table Metadata ○ Functions ○ Partitions ○ Buckets
  17. Catalog in Spark 1.x ● By default, Spark uses an in-memory catalog which keeps track of Spark temp tables ● It is not persistent ● For any persistent operations, Spark advocated use of the Hive metastore ● There was no standard API to query metadata information from the in-memory / Hive metadata store ● Ad hoc functions were added to SQLContext over time to fix this
  18. Need of Catalog API ● Many interactive applications, like notebook systems, often need an API to query the metastore to show relevant information to the user ● Whenever we integrate with Hive, without a catalog API we have to resort to running HQL queries and parsing their output to get the data ● We cannot manipulate the Hive metastore directly from the Spark API ● The API needs to evolve to support more metastores in future
  19. Catalog API in 2.x ● Spark 2.0 has added a full-fledged catalog API to the Spark session ● It lives in the sparkSession.catalog namespace ● This catalog has APIs to create, read and delete elements from the in-memory catalog and also from the Hive metastore ● Having this standard API to interact with the catalog makes life much easier for developers than before ● If we were using non-standard APIs before, it’s time to migrate
  20. Catalog API Migration ● Migrate from the sqlContext APIs to the sparkSession.catalog API ● Use sparkSession rather than HiveContext to have access to the special operations ● Spark 1.x: sparkone.CatalogExample ● Spark 2.x: sparktwo.CatalogExample
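A sketch of the 2.x side of this migration, assuming a local SparkSession and a hypothetical "sales" temp view; these catalog calls replace the ad hoc SQLContext methods of 1.x:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

Seq(("pen", 10.0)).toDF("item", "amount").createOrReplaceTempView("sales")

// Spark 2.x: one standard entry point for metadata queries
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()          // includes temp views like "sales"
spark.catalog.listColumns("sales").show()
spark.catalog.dropTempView("sales")
```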
  21. Hive Integration
  22. Hive Integration in Spark 1.x ● Spark SQL had native support for Hive from the beginning ● Initially, Spark SQL used the Hive query parser for parsing and the metastore for persistent storage ● To integrate with Hive, one had to create a HiveContext, which is separate from SQLContext ● Some APIs were only available on SQLContext and some Hive-specific ones only on HiveContext ● No support for manipulating the Hive metastore
  23. Hive Integration in 2.x ● No more separate HiveContext ● SparkSession has an enableHiveSupport API to enable Hive support ● This makes both the Spark SQL and Hive APIs consistent ● The Spark 2.0 catalog API also supports the Hive metastore ● Example: sparktwo.CatalogHiveExample
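A minimal sketch of the new entry point; it assumes the Hive dependencies are on the classpath, and the table name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport replaces the separate HiveContext of 1.x
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// both SQL and the catalog API now talk to the Hive metastore
spark.sql("CREATE TABLE IF NOT EXISTS hive_sales(item STRING, amount DOUBLE)")
spark.catalog.listTables().show()
```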
  24. Refresh Table API
  25. Need of Refresh Table API ● In Spark, we cache a table (dataset) for performance reasons ● Spark caches the metadata in its metadata store and the actual data in the block manager ● If the underlying file/table changes, there was no direct API in Spark to force a table refresh ● If you just uncache/recache, it only reflects the change in the data, not the metadata ● So we need a standard way to refresh the table
  26. Refresh Table and By Path ● Spark 2.0 provides two APIs for refreshing datasets in Spark ● The refreshTable API, which was imported from HiveContext, is used for registered temp tables or Hive tables ● The refreshByPath API is used for refreshing datasets without having to register them as tables beforehand ● Spark 1.x: sparkone.RefreshExample ● Spark 2.x: sparktwo.RefreshExample
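The two calls side by side, as a sketch; "sales" and the parquet path are hypothetical names, and refreshTable assumes such a table already exists in the catalog:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// for a table or temp view registered in the catalog
spark.catalog.refreshTable("sales")

// for a dataset read straight from a path, never registered as a table
spark.catalog.refreshByPath("/tmp/sales_parquet")
```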
  27. CheckPoint API
  28. Iterative Programming in Spark ● Spark is one of the first big data frameworks to have great native support for iterative programming ● Iterative programs go over the data again and again to compute some result ● Spark ML is one of the iterative frameworks in Spark ● Even though caching and the RDD mechanisms worked great with iterative programming, moving to the dataframe abstraction has created new challenges
  29. Spark iterative processing
  30. Iteration in Dataframe API ● As every step of the iteration creates a new DF, the logical plan keeps growing ● As Spark needs to keep the complete query plan for recovery, the overhead of analysing the plan increases as the number of iterations increases ● This overhead is compute bound and incurred at the master ● As this overhead increases, it makes iteration very slow ● Ex: sparkone.CheckPoint
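The plan growth can be seen in a toy loop; this is an illustrative sketch, not the author's sparkone.CheckPoint example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

var df = spark.range(0, 100).toDF("value")
for (_ <- 1 to 10) {
  // each pass wraps the previous plan in a new projection,
  // so the logical plan grows with every iteration
  df = df.withColumn("value", $"value" + 1)
}
df.explain(true)  // the printed plan reflects all ten iterations
```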
  31. Solution to Query Plan Issue ● To solve the ever-growing query plan (lineage), we need to truncate it to make iteration faster ● Whenever we truncate the query plan, we lose the ability to recover ● To avoid that, we need to store the intermediate data before we truncate the query plan ● Saving intermediate data along with truncating the query plan results in faster performance
  32. Dataset Checkpoint API ● In Spark 2.1, there is a new checkpoint API on Dataset ● It’s analogous to the RDD checkpoint API ● In RDD, checkpoint persists the RDD and then truncates its lineage ● Similarly, in the case of Dataset, checkpoint persists the dataset and then truncates its query plan ● Make sure that checkpoint times are much lower than the overhead you are facing ● Ex: sparktwo.CheckPoint
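A sketch of periodic checkpointing inside an iterative loop; the checkpoint directory, loop body and interval are illustrative choices, not the author's sparktwo.CheckPoint code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// checkpoint needs a reliable directory for the intermediate data
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

var df = spark.range(0, 100).toDF("value")
for (i <- 1 to 10) {
  df = df.withColumn("value", $"value" + 1)
  if (i % 3 == 0) {
    // materialises the data and truncates the query plan (Spark 2.1+)
    df = df.checkpoint()
  }
}
```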
  33. Migrating Best Practices
  34. Best Practice Migration ● As the fundamental abstractions have changed in Spark 2.0, we need to rethink our best practices ● Many best practices were centred around the RDD abstraction, which is no longer the central abstraction ● Also, with the many optimisations in catalyst, much of the optimisation is done for us by the platform ● So let’s look at some best practices of 1.x and see how they can change
  35. Choice of Serializer ● Use the kryo serializer over java serialization and register classes with kryo ● This best practice was devised for efficient caching and transfer of RDD data ● But in Spark 2.0, Dataset uses a custom code-generated serialization framework for most of the code and data ● So unless there is heavy use of RDDs in your project, you don’t need to worry about the serializer in 2.0
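For the RDD-heavy case, the setting can still be applied at session construction; a minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// only worth setting if the job still uses RDDs heavily; Datasets use
// their own code-generated encoders regardless of this setting
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```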
  36. Cache Format ● RDD uses MEMORY_ONLY as the default, and it’s the most efficient caching for RDDs ● DataFrame/Dataset uses MEMORY_AND_DISK rather than MEMORY_ONLY ● Recomputing a Dataset and converting it to the custom serialization format is often costly ● So use MEMORY_AND_DISK as the default format over MEMORY_ONLY
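Being explicit about the storage level documents the intent; a sketch over hypothetical data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("pen", 10.0), ("book", 45.0)).toDF("item", "amount")

// Dataset.cache() already defaults to MEMORY_AND_DISK in 2.x;
// spelling it out makes the choice visible in the code
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  // action to materialise the cache
```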
  37. Use of Broadcast Variables ● Use broadcast variables for optimising lookups and joins ● Broadcast variables played an important role in making joins efficient in the RDD world ● These variables don’t have much scope in Dataset API land ● By configuring the broadcast threshold, Spark SQL will do the broadcasting automatically ● Don’t use them unless there is a reason
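Both the automatic and the explicit routes can be sketched as follows; the dataframes and the 10 MB value are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val big   = Seq(("pen", 10.0), ("book", 45.0)).toDF("item", "amount")
val small = Seq(("pen", "stationery")).toDF("item", "category")

// automatic: relations below this threshold are broadcast in joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// explicit: hint one side of a single join
big.join(broadcast(small), "item").show()
```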
  38. Choice of Clusters ● Use YARN/Mesos for production; standalone is mostly for simple apps ● Users were encouraged to use a dedicated cluster manager over the standalone one shipped with Spark ● With Databricks cloud putting its weight behind the standalone cluster manager, it has become production grade ● Many companies run their Spark applications on standalone clusters today ● Choose standalone if you run only Spark applications
  39. Use of HiveContext ● Use HiveContext over SQLContext for using Spark SQL ● In Spark 1.x, Spark SQL was simplistic and depended heavily on Hive for query parsing ● In Spark 2.0, Spark SQL is enriched and is now more powerful than Hive itself ● Most of the Hive UDFs are now rewritten in Spark SQL and code generated ● Unless you want to use the Hive metastore, use SparkSession without Hive
  40. References