1. Migrating to Spark 2.0 - Part 1
Moving to next-generation Spark
https://github.com/phatak-dev/spark-two-migration
2. ● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
3. Agenda
● What’s New in Spark 2.0
● Choosing the Right Scala Version
● External Connectors
● New Entry Point
● Built-in CSV Connector
● RDD to Dataset
● Cross Joins
● Custom ML Transformers
● Testing
4. What’s new in 2.0?
● Dataset is the new single user-facing abstraction
● The RDD abstraction is used only at runtime
● Higher performance with whole-stage code generation
● Significant changes to the streaming abstraction with Spark
Structured Streaming
● Incorporates learnings from 4 years of production use
● Spark ML is replacing MLlib as the de facto ML library
● Breaks API compatibility for better APIs and features
5. Need for Migration
● A lot of real-world code is written against the 1.x series of Spark
● As the fundamental abstractions have changed, all this code
needs to be migrated to get the performance and API benefits
● More and more ecosystem projects require Spark 2.0
● The 1.x series will be out of maintenance mode very soon,
which means no more bug fixes
7. Spark and Scala Version
● Spark 1.x was built primarily with Scala 2.10 and cross-built
for 2.11
● So most ecosystem libraries targeted 2.10, and only some
offered 2.11 versions
● Scala 2.10 was released 7 years ago and 2.11 4 years ago
● From Spark 2.x, Spark is built with 2.11 and remains
compatible with 2.10 for some time
● By 2.3.0, Scala 2.10 support will be phased out
8. Scala and Binary Compatibility
● Unlike Java, major versions of Scala are not binary
compatible
● Every library in a project has to be compiled against the
same Scala major version in order for them to coexist
● If a library is not compiled against the new version of Scala,
you cannot use it, even if it needs no new features
● This binary incompatibility is mostly to do with how Scala
maps its features to JVM bytecode
● This is the reason for the %% operator in build.sbt (see the
sketch below)
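A minimal build.sbt sketch of why %% matters; the version numbers are illustrative:

// "%%" appends the Scala binary version to the artifact name,
// so the right binary is picked automatically:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
// resolves to the artifact spark-core_2.11 when scalaVersion := "2.11.8"

// with plain "%" you must hard-code the Scala version yourself:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"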
9. Spark 2.x with Scala 2.10
● You can use the 2.1.x and 2.2.x versions of Spark with Scala
2.10 as of now
● Spark binaries are built for 2.11; you need to build from
source if you need 2.10
● Most of the built-in and third-party connectors still
support Scala 2.10
● But Scala 2.10 support is more of a grace period until
everyone upgrades to 2.11
● So it’s not advisable to use 2.10 unless there is a strong
reason
10. Challenges to move to Scala 2.11
● Make sure all libraries you use, both Spark and
non-Spark, have 2.11 versions
● 2.11 has modularised xml and other parts, so you may
need to add extra dependencies that were part of the Scala
standard library earlier (see the sketch below)
● Update the sbt version to one that supports the new version
of Scala
● Expect more performance tuning, as 2.11 comes with a new
backend code generator
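For example, code that uses scala.xml needs the scala-xml module as an explicit dependency on 2.11; the version below is illustrative:

// scala.xml moved out of the standard library in Scala 2.11
libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.6"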
11. For Java Users
● You can update the dependency to Scala 2.11 without
much effort
● Java 8 is needed for Spark 2.x, as support for prior
versions of Java is deprecated
● With Java 8, you can use the Spark APIs in a much more
streamlined way using the new FP abstractions in Java 8
● Replace all instances of _2.10 with _2.11 in your
dependencies
12. Migration Activity
● Update build.sbt to reflect the new version of Scala
● Use the %% operator wherever possible so that the Scala
version is inferred automatically
● Set the version of Spark to 2.1.0 (or the latest stable
release of the 2.x series)
● Verify all the libraries are good with the new version of Scala
● Now your dependencies are ready
● Ex: build.sbt (a sketch follows)
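A minimal sketch of a migrated build.sbt, assuming a project that only depends on Spark core and SQL; the project name and versions are illustrative:

name := "spark-two-migration"

// move to the Scala version Spark 2.x is built with
scalaVersion := "2.11.8"

// pick the latest stable 2.x release
val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)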
14. API Compatibility for Connectors
● All the 1.x connectors are source compatible with Spark
2.x
● The data source API has not changed much in 2.x
● You need to recompile the code against the new version of
Spark and the new version of Scala
● Most of the external connectors, like elasticsearch and
mongodb, already support 2.x
15. Apache Bahir
● Spark has removed many earlier built-in connectors from
its repo to slim down the core project
● Most of them are from streaming, e.g. twitter and zeromq
● These are now part of the open source project Apache
Bahir, led by IBM
● So if you are using any of these, you need to change your
dependencies to Apache Bahir (see the sketch below)
● You also need to change code to reflect the new package
names if needed
● Ex: ZeroMQWordCount.scala
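A sketch of the dependency change for the ZeroMQ connector; the versions are illustrative, so check Bahir’s documentation for current coordinates:

// before (Spark 1.x): the connector shipped with Spark itself
// libraryDependencies += "org.apache.spark" %% "spark-streaming-zeromq" % "1.6.3"

// after (Spark 2.x): the connector moved to Apache Bahir
libraryDependencies += "org.apache.bahir" %% "spark-streaming-zeromq" % "2.1.0"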
16. New Connectors
● While some connectors were removed, a few built-in
ones were added
● CSV is one of the new built-in connectors added in 2.x
● Also, a few third-party libraries have a separate version for
2.0
Ex: elasticsearch-hadoop is now elasticsearch-spark in 2.x
● So find the right connectors, as older ones may have moved
under a different name or version
17. Migration Activity
● Add all deleted connectors from Apache Bahir and
update the code to reflect the same
● Prefer built-in connectors over third-party libraries, e.g. for
csv
● Compile all custom sources against the new versions of
Spark and Scala
19. Contexts in Spark
● In Spark, most code starts with a context
● In the first version of Spark, SparkContext was the entry
point to the RDD world
● As new APIs were added, Spark added more contexts
(a 1.x-style sketch follows the list)
○ StreamingContext - DStream API
○ SQLContext - DataFrame API
○ HiveContext - SQL with Hive support
○ Custom contexts for libraries, like CassandraContext,
MemsqlContext etc.
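A minimal sketch of how a 1.x application ends up juggling several contexts; the master URL and app name are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("contexts")
val sc = new SparkContext(conf)                 // entry point for RDDs
val sqlContext = new SQLContext(sc)             // entry point for DataFrames
val hiveContext = new HiveContext(sc)           // DataFrames with Hive support
val ssc = new StreamingContext(sc, Seconds(10)) // entry point for DStreams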
20. Challenges with Contexts
● In Spark 1.x, as you use more APIs, you need to
maintain multiple contexts
● Having multiple entry points makes code more
complicated and less maintainable
● There are inconsistencies between the contexts
○ the getOrCreate API is available on SQLContext but not on
HiveContext
● It makes API unification harder
21. Spark Session
● A single entry point for all APIs (see the sketch below)
● Primarily targeted at replacing all structured contexts like
SQLContext, HiveContext, custom contexts etc.
● Wraps a SparkContext for all execution-related specifics
● All APIs are copied from SQLContext, so it can be used
as a drop-in replacement for SQLContext or HiveContext
● It will replace StreamingContext once Structured Streaming
comes of age
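A minimal sketch of the new entry point, with Hive support enabled on demand; the master URL and app name are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("migration")
  .enableHiveSupport()   // opt in to Hive; no separate HiveContext needed
  .getOrCreate()

// the wrapped SparkContext is available when RDDs are still needed
val sc = spark.sparkContext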
22. Migration Activity
● Replace SQLContext and HiveContext with the
SparkSession API
● You can add custom optimisation rules to a SparkSession
rather than using custom contexts for connectors
● Enable Hive support on demand without changing the
entry point
● Use the wrapped SparkContext rather than creating one
from scratch
● Ex: CsvLoad.scala
24. Why CSV matters?
● Spark 1.x came with built-in support for json rather than
csv
● Csv was supported by the spark-csv library from Databricks
● But the Spark team soon realised csv is the de facto
standard in data science communities and major enterprises
● So in 2.x, the spark-csv code was improved and built into
the core of Spark
● So from 2.x, you don’t need to use the spark-csv library
anymore
● Ex: CsvLoad.scala (a sketch follows)
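A minimal sketch of the built-in CSV reader, assuming a SparkSession named spark; the file path is illustrative:

// built-in reader in 2.x; no spark-csv dependency needed
val sales = spark.read
  .option("header", "true")       // first line has column names
  .option("inferSchema", "true")  // let Spark infer column types
  .csv("src/main/resources/sales.csv")

sales.printSchema()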
25. Advantages of the built-in connector
● No dependency on a third-party library
● Easy to experiment with in spark-shell or notebook systems
● Better performance for
○ Schema Inference
○ Joins
● Ex: CsvJoin.scala (a sketch follows)
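A minimal sketch of joining two CSV-backed DataFrames; the file paths and column names are illustrative:

val sales = spark.read.option("header", "true").csv("data/sales.csv")
val customers = spark.read.option("header", "true").csv("data/customers.csv")

// an explicit join condition; see the cross join discussion later
val joined = sales.join(customers, sales("customerId") === customers("id"))
joined.show()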
27. DataFrame Abstraction in 1.x vs 2.x
[Diagram: in 1.x, DataFrame sits on Spark Catalyst over RDD;
in 2.x, DataFrame sits on Dataset, which sits on Spark Catalyst
over RDD]
28. DataFrame functional APIs
● In Spark 1.x, DataFrame exposed structural APIs through
the DataFrame DSL or the Spark SQL interface
● Whenever a developer needed functional APIs like map or
flatMap, Spark automatically fell back to the RDD abstraction
● This movement from DataFrame to RDD and back made
sure the developer could choose the right tool for the job
● But it also came with the cost of unoptimised RDD code
● Ex: DFMapExample.scala
29. Dataset functional APIs
● Dataset borrows its APIs from both RDD and
DataFrame
● So when you call functional APIs on a DataFrame in 2.x,
it no longer returns an RDD; it returns a Dataset
● This bridges the performance gap between the structured
APIs and the functional APIs
● But it may also break your code if it expects an RDD
● So you need to migrate all such code to use .rdd explicitly
or to use the Dataset functional APIs
● Ex: DFMapExample.scala (a sketch follows)
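A minimal sketch of the behavioural change, assuming a SparkSession named spark; the sample data is illustrative:

import spark.implicits._  // encoders for the Dataset API

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// 1.x: df.map(...) returned an RDD and left the optimizer behind
// 2.x: df.map(...) returns a Dataset and stays on Catalyst
val doubled = df.map(row => row.getInt(1) * 2) // Dataset[Int]

// if downstream code really expects an RDD, ask for it explicitly
val asRdd = doubled.rdd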
31. What is a Cross Join?
● When we join two dataframes without any join condition,
we run into a cross join
● Also known as a cross product
● Most of the time it is induced by accident, or created when
the join condition matches columns of the wrong data type
● It carries a huge performance penalty
● It should be caught in the planning stage rather than the
execution phase
32. Cross Joins in 1.x vs 2.x
● In Spark 1.x, there was no check for when a cross join
happened, which resulted in poor performance on
large data
● In 2.x, Spark has added a check in the logical plan to avoid
accidental cross joins
● If users want a cross join, they have to ask for it explicitly
● So joins that are effectively cross joins now result in
exceptions
● Ex: CrossJoin.scala (a sketch follows)
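A minimal sketch of the 2.x behaviour, assuming a SparkSession named spark with spark.implicits._ imported; the sample data is illustrative:

val left = Seq(1, 2).toDF("a")
val right = Seq(3, 4).toDF("b")

// throws AnalysisException in 2.x: an implicit cross join is rejected
// left.join(right).show()

// works: the cross join is requested explicitly
left.crossJoin(right).show()

// alternatively, re-enable the old behaviour globally (not advised):
// spark.conf.set("spark.sql.crossJoin.enabled", "true")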
34. ML Transformers
● ML uses DataFrame-based pipelines to build complex
machine learning models
● Transformer is the API that represents the data
pre-processing needed for a learning algorithm
● From Spark 2.x, the ML pipeline uses Dataset as the
abstraction rather than DataFrame
● So if you have custom transformers in your code, you
need to update them to the new API to support the Dataset
abstraction
● Ex: CustomMLTransformer.scala (a sketch follows)
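A minimal sketch of a custom transformer against the 2.x API; the class name and the hardcoded "text" column are illustrative. The key change is that transform now takes a Dataset[_] instead of a DataFrame:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, lower}
import org.apache.spark.sql.types.StructType

// hypothetical transformer that lower-cases a "text" column
class LowerCaseTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("lowercase"))

  // 1.x signature was: def transform(dataset: DataFrame): DataFrame
  // 2.x takes Dataset[_] and still returns a DataFrame
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("text", lower(col("text")))

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}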