1. Migrating to Spark 2.0 - Part 1
Moving to next-generation Spark
https://github.com/phatak-dev/spark-two-migration
2. ● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
3. Agenda
● What’s New in Spark 2.0
● Choosing the Right Scala Version
● External Connectors
● New Entry Point
● Built-in CSV Connector
● RDD to Dataset
● Cross Joins
● Custom ML Transformers
● Testing
4. What’s new in 2.0?
● Dataset is the new single user-facing abstraction
● The RDD abstraction is used only at runtime
● Higher performance with whole-stage code generation
● Significant changes to the streaming abstraction with Spark
Structured Streaming
● Incorporates learnings from 4 years of production use
● Spark ML is replacing MLlib as the de facto ML library
● Breaks API compatibility for better APIs and features
5. Need for Migration
● A lot of real-world code is written against the 1.x series of Spark
● As the fundamental abstractions have changed, all this code
needs to be migrated to get the performance and API benefits
● More and more ecosystem projects require Spark 2.0
● The 1.x series will be out of maintenance mode very soon,
which means no more bug fixes
7. Spark and Scala Version
● Spark 1.x was built primarily with Scala 2.10 and cross-built
for 2.11
● So most ecosystem libraries targeted 2.10, and only some
offered 2.11 versions
● Scala 2.10 was released 7 years ago and 2.11 4 years ago
● From Spark 2.x, Spark is built with 2.11 and remains
compatible with 2.10 for some time
● By 2.3.0, Scala 2.10 support will be phased out
8. Scala and Binary Compatibility
● Unlike Java, major versions of Scala are not binary
compatible
● Every library in a project has to be compiled against the
same Scala major version in order for them to coexist
● If a library is not compiled against the new version of Scala,
you cannot use it, even if it needs no new features
● This binary incompatibility is mostly to do with how Scala
maps its features to JVM bytecode
● This is the reason for the %% operator in build.sbt (see the
sketch below)
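A minimal build.sbt sketch of why %% matters; the version numbers are illustrative:

// "%%" appends the Scala binary version to the artifact name,
// so the right binary is picked automatically:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
// resolves to the artifact spark-core_2.11 when scalaVersion := "2.11.8"

// with plain "%" you must hard-code the Scala version yourself:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"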
9. Spark 2.x with Scala 2.10
● You can use the 2.1.x and 2.2.x versions of Spark with Scala
2.10 as of now
● Spark binaries are built for 2.11; you need to build from
source if you need 2.10
● Most of the built-in and third-party connectors still
support Scala 2.10
● But Scala 2.10 support is more of a grace period until
everyone upgrades to 2.11
● So it’s not advisable to use 2.10 unless there is a strong
reason
10. Challenges to move to Scala 2.11
● Make sure all libraries you use, both Spark and
non-Spark, have 2.11 versions
● 2.11 has modularised xml and other parts, so you may
need to add extra dependencies that were part of the Scala
standard library earlier (see the sketch below)
● Update the sbt version to one that supports the new version
of Scala
● Expect more performance tuning, as 2.11 comes with a new
backend code generator
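For example, code that uses scala.xml needs the scala-xml module as an explicit dependency on 2.11; the version below is illustrative:

// scala.xml moved out of the standard library in Scala 2.11
libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.6"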
11. For Java Users
● You can update the dependency to Scala 2.11 without
much effort
● Java 8 is needed for Spark 2.x, as support for prior
versions of Java is deprecated
● With Java 8, you can use the Spark APIs in a much more
streamlined way using the new FP abstractions in Java 8
● Replace all instances of _2.10 with _2.11 in your
dependencies
12. Migration Activity
● Update build.sbt to reflect the new version of Scala
● Use the %% operator wherever possible so that the Scala
version is inferred automatically
● Set the version of Spark to 2.1.0 (or the latest stable
release of the 2.x series)
● Verify all the libraries are good with the new version of Scala
● Now your dependencies are ready
● Ex: build.sbt (a sketch follows)
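A minimal sketch of a migrated build.sbt, assuming a project that only depends on Spark core and SQL; the project name and versions are illustrative:

name := "spark-two-migration"

// move to the Scala version Spark 2.x is built with
scalaVersion := "2.11.8"

// pick the latest stable 2.x release
val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)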
14. API Compatibility for Connectors
● All the 1.x connectors are source compatible with Spark
2.x
● The data source API has not changed much in 2.x
● You need to recompile the code against the new version of
Spark and the new version of Scala
● Most of the external connectors, like elasticsearch and
mongodb, already support 2.x
15. Apache Bahir
● Spark has removed many earlier built-in connectors from
its repo to slim down the core project
● Most of them are from streaming, e.g. twitter and zeromq
● These are now part of the open source project Apache
Bahir, led by IBM
● So if you are using any of these, you need to change your
dependencies to Apache Bahir (see the sketch below)
● You also need to change code to reflect the new package
names if needed
● Ex: ZeroMQWordCount.scala
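A sketch of the dependency change for the ZeroMQ connector; the versions are illustrative, so check Bahir’s documentation for current coordinates:

// before (Spark 1.x): the connector shipped with Spark itself
// libraryDependencies += "org.apache.spark" %% "spark-streaming-zeromq" % "1.6.3"

// after (Spark 2.x): the connector moved to Apache Bahir
libraryDependencies += "org.apache.bahir" %% "spark-streaming-zeromq" % "2.1.0"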
16. New Connectors
● While some connectors were removed, a few built-in
ones were added
● CSV is one of the new built-in connectors added in 2.x
● Also, a few third-party libraries have a separate version for
2.0
Ex: elasticsearch-hadoop is now elasticsearch-spark in 2.x
● So find the right connectors, as older ones may have moved
under a different name or version
17. Migration Activity
● Add all deleted connectors from Apache Bahir and
update the code to reflect the same
● Prefer built-in connectors over third-party libraries, e.g. for
csv
● Compile all custom sources against the new versions of
Spark and Scala
19. Contexts in Spark
● In Spark, most code starts with a context
● In the first version of Spark, SparkContext was the entry
point to the RDD world
● As new APIs were added, Spark added more contexts
(a 1.x-style sketch follows the list)
○ StreamingContext - DStream API
○ SQLContext - DataFrame API
○ HiveContext - SQL with Hive support
○ Custom contexts for libraries, like CassandraContext,
MemsqlContext etc.
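A minimal sketch of how a 1.x application ends up juggling several contexts; the master URL and app name are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("contexts")
val sc = new SparkContext(conf)                 // entry point for RDDs
val sqlContext = new SQLContext(sc)             // entry point for DataFrames
val hiveContext = new HiveContext(sc)           // DataFrames with Hive support
val ssc = new StreamingContext(sc, Seconds(10)) // entry point for DStreams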
20. Challenges with Contexts
● In Spark 1.x, as you use more APIs, you need to
maintain multiple contexts
● Having multiple entry points makes code more
complicated and less maintainable
● There are inconsistencies between the contexts
○ the getOrCreate API is available on SQLContext but not on
HiveContext
● It makes API unification harder
21. Spark Session
● A single entry point for all APIs (see the sketch below)
● Primarily targeted at replacing all structured contexts like
SQLContext, HiveContext, custom contexts etc.
● Wraps a SparkContext for all execution-related specifics
● All APIs are copied from SQLContext, so it can be used
as a drop-in replacement for SQLContext or HiveContext
● It will replace StreamingContext once Structured Streaming
comes of age
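A minimal sketch of the new entry point, with Hive support enabled on demand; the master URL and app name are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("migration")
  .enableHiveSupport()   // opt in to Hive; no separate HiveContext needed
  .getOrCreate()

// the wrapped SparkContext is available when RDDs are still needed
val sc = spark.sparkContext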
22. Migration Activity
● Replace SQLContext and HiveContext with the
SparkSession API
● You can add custom optimisation rules to a SparkSession
rather than using custom contexts for connectors
● Enable Hive support on demand without changing the
entry point
● Use the wrapped SparkContext rather than creating one
from scratch
● Ex: CsvLoad.scala
24. Why CSV matters?
● Spark 1.x came with built-in support for json rather than
csv
● Csv was supported by the spark-csv library from Databricks
● But the Spark team soon realised csv is the de facto
standard in data science communities and major enterprises
● So in 2.x, the spark-csv code was improved and built into
the core of Spark
● So from 2.x, you don’t need to use the spark-csv library
anymore
● Ex: CsvLoad.scala (a sketch follows)
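A minimal sketch of the built-in CSV reader, assuming a SparkSession named spark; the file path is illustrative:

// built-in reader in 2.x; no spark-csv dependency needed
val sales = spark.read
  .option("header", "true")       // first line has column names
  .option("inferSchema", "true")  // let Spark infer column types
  .csv("src/main/resources/sales.csv")

sales.printSchema()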
25. Advantages of the built-in connector
● No dependency on a third-party library
● Easy to experiment with in spark-shell or notebook systems
● Better performance for
○ Schema Inference
○ Joins
● Ex: CsvJoin.scala (a sketch follows)
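A minimal sketch of joining two CSV-backed DataFrames; the file paths and column names are illustrative:

val sales = spark.read.option("header", "true").csv("data/sales.csv")
val customers = spark.read.option("header", "true").csv("data/customers.csv")

// an explicit join condition; see the cross join discussion later
val joined = sales.join(customers, sales("customerId") === customers("id"))
joined.show()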
27. DataFrame Abstraction in 1.x vs 2.x
[Diagram: in 1.x, DataFrame sits on Spark Catalyst over RDD;
in 2.x, DataFrame sits on Dataset, which sits on Spark Catalyst
over RDD]
28. DataFrame functional APIs
● In Spark 1.x, DataFrame exposed structural APIs through
the DataFrame DSL or the Spark SQL interface
● Whenever a developer needed functional APIs like map or
flatMap, Spark automatically fell back to the RDD abstraction
● This movement from DataFrame to RDD and back made
sure the developer could choose the right tool for the job
● But it also came with the cost of unoptimised RDD code
● Ex: DFMapExample.scala
29. Dataset functional APIs
● Dataset borrows its APIs from both RDD and
DataFrame
● So when you call functional APIs on a DataFrame in 2.x,
it no longer returns an RDD; it returns a Dataset
● This bridges the performance gap between the structured
APIs and the functional APIs
● But it may also break your code if it expects an RDD
● So you need to migrate all such code to use .rdd explicitly
or to use the Dataset functional APIs
● Ex: DFMapExample.scala (a sketch follows)
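A minimal sketch of the behavioural change, assuming a SparkSession named spark; the sample data is illustrative:

import spark.implicits._  // encoders for the Dataset API

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// 1.x: df.map(...) returned an RDD and left the optimizer behind
// 2.x: df.map(...) returns a Dataset and stays on Catalyst
val doubled = df.map(row => row.getInt(1) * 2) // Dataset[Int]

// if downstream code really expects an RDD, ask for it explicitly
val asRdd = doubled.rdd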
31. What is a Cross Join?
● When we join two dataframes without any join condition,
we run into a cross join
● Also known as a cross product
● Most of the time it is induced by accident, or created when
the join condition matches columns of the wrong data type
● It carries a huge performance penalty
● It should be caught in the planning stage rather than the
execution phase
32. Cross Joins in 1.x vs 2.x
● In Spark 1.x, there was no check for when a cross join
happened, which resulted in poor performance on
large data
● In 2.x, Spark has added a check in the logical plan to avoid
accidental cross joins
● If users want a cross join, they have to ask for it explicitly
● So joins that are effectively cross joins now result in
exceptions
● Ex: CrossJoin.scala (a sketch follows)
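A minimal sketch of the 2.x behaviour, assuming a SparkSession named spark with spark.implicits._ imported; the sample data is illustrative:

val left = Seq(1, 2).toDF("a")
val right = Seq(3, 4).toDF("b")

// throws AnalysisException in 2.x: an implicit cross join is rejected
// left.join(right).show()

// works: the cross join is requested explicitly
left.crossJoin(right).show()

// alternatively, re-enable the old behaviour globally (not advised):
// spark.conf.set("spark.sql.crossJoin.enabled", "true")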
34. ML Transformers
● ML uses DataFrame-based pipelines to build complex
machine learning models
● Transformer is the API that represents the data
pre-processing needed for a learning algorithm
● From Spark 2.x, the ML pipeline uses Dataset as the
abstraction rather than DataFrame
● So if you have custom transformers in your code, you
need to update them to the new API to support the Dataset
abstraction
● Ex: CustomMLTransformer.scala (a sketch follows)
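A minimal sketch of a custom transformer against the 2.x API; the class name and the hardcoded "text" column are illustrative. The key change is that transform now takes a Dataset[_] instead of a DataFrame:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, lower}
import org.apache.spark.sql.types.StructType

// hypothetical transformer that lower-cases a "text" column
class LowerCaseTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("lowercase"))

  // 1.x signature was: def transform(dataset: DataFrame): DataFrame
  // 2.x takes Dataset[_] and still returns a DataFrame
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("text", lower(col("text")))

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}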