Migrating to Spark 2.0 -
Part 1
Moving to the next generation of Spark
https://github.com/phatak-dev/spark-two-migration
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consults on Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● What’s New in Spark 2.0
● Choosing the Right Scala Version
● External Connectors
● New Entry Point
● Built-in CSV Connector
● RDD to Dataset
● Cross Joins
● Custom ML Transformers
● Testing
What’s new in 2.0?
● Dataset is the new single user-facing abstraction
● The RDD abstraction is used only at runtime
● Higher performance with whole-stage code generation
● Significant changes to the streaming abstraction with Spark Structured Streaming
● Incorporates learnings from 4 years of production use
● Spark ML is replacing MLlib as the de facto machine learning library
● Breaks API compatibility in exchange for better APIs and features
Need for Migration
● A lot of real-world code is written against the 1.x series of Spark
● As the fundamental abstractions have changed, all this code needs to be migrated to take advantage of the performance and API benefits
● More and more ecosystem projects require Spark 2.0
● The 1.x series will go out of maintenance very soon, which means no more bug fixes
Choosing the Right Scala Version
Spark and Scala Version
● Spark 1.x was built primarily against Scala 2.10 and cross-built for 2.11
● So most of the ecosystem libraries were using 2.10, with some offering 2.11 versions
● Scala 2.10 was released 7 years ago, and 2.11 four years back
● From Spark 2.x, Spark is built against Scala 2.11, remaining backward compatible with 2.10 for some time
● By 2.3.0, Scala 2.10 support will be phased out
Scala and Binary Compatibility
● Unlike Java, major versions of Scala are not binary compatible
● Every library in a project has to be compiled against the same Scala major version for them to co-exist
● If a library is not compiled against the new version of Scala, you cannot use it, even if nothing in it has changed
● This binary incompatibility has mostly to do with how Scala maps its features to JVM bytecode
● This is the reason for the %% operator in build.sbt
Spark 2.x with Scala 2.10
● You can use the 2.1.x and 2.2.x versions of Spark with Scala 2.10 as of now
● Spark binaries are built for 2.11; you need to build from source if you need 2.10
● Most of the built-in and third-party connectors also support Scala 2.10
● But Scala 2.10 support is just a grace period until everyone upgrades to 2.11
● So it’s not advisable to use 2.10 unless there is a strong reason
Challenges of moving to Scala 2.11
● Make sure all the libraries you use, both Spark and non-Spark, have 2.11 versions
● 2.11 has modularised XML and other parts, so you may need to add extra dependencies that were part of the Scala standard library earlier
● Update the sbt version to support the new version of Scala
● Expect some performance tuning, as 2.11 comes with a new backend code generator
For Java Users
● You can update the dependency to Scala 2.11 without much effort
● Java 8 is needed for Spark 2.x, as support for prior versions of Java is deprecated
● With Java 8, you can use the Spark APIs in a much more streamlined way using its new functional programming abstractions
● Replace all instances of _2.10 with _2.11 in dependencies
Migration Activity
● Update build.sbt to reflect the new version of Scala
● Use the %% operator wherever possible so that the Scala version is inferred automatically
● Set the Spark version to 2.1.0 (or the latest stable release of the 2.x series)
● Verify that all the libraries work with the new version of Scala
● Now your dependencies are ready
● Ex : build.sbt
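The deck’s build.sbt is not reproduced here; a minimal sketch along these lines covers the points above (project name and library versions are illustrative, not from the original):

```scala
// build.sbt - minimal sketch; versions shown are illustrative
name := "spark-two-migration"

scalaVersion := "2.11.8"

val sparkVersion = "2.1.0" // or the latest stable 2.x release

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.11) automatically
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  // scala-xml was modularised out of the standard library in 2.11
  "org.scala-lang.modules" %% "scala-xml" % "1.0.6"
)
```

With `%%`, bumping `scalaVersion` re-resolves every dependency against the matching Scala binary version, which is exactly why hard-coded `_2.10` suffixes should be removed.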
External Connectors
API Compatibility for Connectors
● All the 1.x connectors are source compatible with Spark 2.x
● The Data Source API has not changed much in 2.x
● You need to recompile the code against the new versions of Spark and Scala
● Most of the external connectors, like Elasticsearch and MongoDB, already support 2.x
Apache Bahir
● Spark has removed many of the earlier built-in connectors from its repo to slim down the core project
● Most of them are streaming connectors, e.g. Twitter, ZeroMQ
● These are now part of the open source project Apache Bahir, led by IBM
● So if you are using any of these, you need to change the dependencies to Apache Bahir
● You may also need to change code to reflect the new package names
● Ex : ZeroMQWordCount.scala
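The dependency change is typically a one-liner; a sketch for the ZeroMQ case (the artifact name follows Bahir’s convention and the version is illustrative, so check the current Bahir release):

```scala
// build.sbt - swap the removed built-in connector for its Apache Bahir equivalent
libraryDependencies += "org.apache.bahir" %% "spark-streaming-zeromq" % "2.1.0"
```

After changing the dependency, recompile and fix any imports that no longer resolve, since some classes moved packages when they left the Spark repo.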
New Connectors
● Not only have some connectors been removed, a few new built-in ones have been added
● CSV is one of the new built-in connectors added in 2.x
● Also, a few third-party libraries have a separate version for 2.0; e.g. elasticsearch-hadoop is now elasticsearch-spark in 2.x
● So find the right connectors, as older ones may have moved under a different name or version
Migration Activity
● Pull all the removed connectors from Apache Bahir and update the code to reflect the same
● Prefer built-in connectors over third-party libraries, e.g. for CSV
● Compile all custom sources against the new versions of Spark and Scala
New Entry Point
Contexts in Spark
● In Spark, most code starts with a context
● In the first version of Spark, SparkContext was the entry point to the RDD world
● As new APIs were added, Spark added more contexts
○ StreamingContext - DStream API
○ SQLContext - DataFrame API
○ HiveContext - SQL with Hive support
○ Custom contexts for libraries, like CassandraContext, MemSQLContext etc.
Challenges with Contexts
● In Spark 1.x, as you use one or more APIs, you need to maintain multiple contexts
● Having multiple different entry points makes code more complicated and less maintainable
● Inconsistencies between the contexts
○ The getOrCreate API is available on SQLContext but not on HiveContext
● Makes API unification much harder
Spark Session
● Single entry point for all the APIs
● Primarily targeted at replacing all the structured contexts like SQLContext, HiveContext, custom contexts etc.
● Wraps SparkContext for all execution-related specifics
● All the APIs are copied from SQLContext, so it can be used as a drop-in replacement for SQLContext or HiveContext
● Will replace StreamingContext once Structured Streaming comes of age
Migration Activity
● Replace SQLContext and HiveContext with the SparkSession API
● You can add custom optimisation rules to the SparkSession rather than using custom contexts for connectors
● Enable Hive support on demand without changing the entry point
● Use the wrapped SparkContext rather than creating one from scratch
● Ex: CsvLoad.scala
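The migration steps above can be sketched as follows (a minimal local-mode example, not the deck’s CsvLoad.scala; the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// 1.x: val sqlContext = new SQLContext(sc); val hiveContext = new HiveContext(sc)
// 2.x: a single SparkSession replaces both
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("migration-example")
  // Hive support becomes opt-in, without changing the entry point:
  // .enableHiveSupport()
  .getOrCreate()

// Reuse the wrapped SparkContext instead of creating one from scratch
val sc = spark.sparkContext

// SparkSession carries the SQLContext APIs, so it is a drop-in replacement
val df = spark.range(5).toDF("id")
```

`SparkSession.builder().getOrCreate()` also mirrors the `getOrCreate` semantics that SQLContext had but HiveContext lacked, removing that inconsistency.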
Built-in CSV Connector
Why CSV matters?
● Spark 1.x came with built-in support for JSON rather than CSV
● CSV was supported by the spark-csv library from Databricks
● But the Spark team soon realised CSV is the de facto standard in data science communities and major enterprises
● So in 2.x, the spark-csv code was improved and built into the core of Spark
● So from 2.x, you don’t need to use the spark-csv library anymore
● Ex : CsvLoad.scala
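A self-contained sketch of the built-in connector (it writes its own tiny CSV file first; the file contents and paths are made up for illustration):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-example").getOrCreate()

// Write a tiny CSV file so the example is self-contained
val csvPath = Files.createTempDirectory("csv-demo").resolve("people.csv")
Files.write(csvPath, "name,age\nalice,30\nbob,25\n".getBytes("UTF-8"))

// 1.x needed the external package: spark.read.format("com.databricks.spark.csv")
// 2.x: csv is a first-class built-in source
val people = spark.read
  .option("header", "true")      // first line is the header
  .option("inferSchema", "true") // age becomes an integer column
  .csv(csvPath.toString)
```

Because no external package is involved, the same two lines work unchanged in spark-shell or a notebook.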
Advantages of the built-in connector
● No dependency on a third-party library
● Easy to experiment with in spark-shell or notebook systems
● Better performance for
○ Schema inference
○ Joins
● Ex : CsvJoin.scala
Moving RDD to Dataset
DataFrame Abstraction in 1.x vs 2.x
[Diagram] 1.x: DataFrame → Spark Catalyst → RDD
2.x: Dataset → Spark Catalyst → RDD, with DataFrame now an alias for Dataset[Row]
DataFrame functional APIs
● In Spark 1.x, DataFrame exposed structured APIs through the DataFrame DSL or Spark SQL interfaces
● Whenever a developer needed functional APIs like map or flatMap, Spark automatically fell back to the RDD abstraction
● This movement from DataFrame to RDD and back made sure developers could choose the right tool for the job
● But it also came with the cost of non-optimised RDD code
● Ex : DFMapExample.scala
Dataset functional APIs
● Dataset borrows its APIs both from RDD and DataFrame
● So when you call functional APIs on a DataFrame in 2.x, it no longer returns an RDD but a Dataset
● This bridges the performance gap between the structured APIs and the functional APIs
● But it may also break your code if it expects an RDD
● So you need to migrate all such code to either use .rdd explicitly or use the Dataset functional APIs
● Ex : DFMapExample.scala
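A small sketch of this behaviour change (not the deck’s DFMapExample.scala; the sample data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("df-map").getOrCreate()
import spark.implicits._ // brings in the Encoders that Dataset.map needs

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// 1.x: df.map(...) fell back to an RDD[Int]
// 2.x: the same call returns Dataset[Int], staying inside Catalyst
val doubled = df.map(row => row.getInt(1) * 2)

// Code that still expects an RDD must now ask for it explicitly
val doubledRdd = doubled.rdd
```

The `import spark.implicits._` line is the usual stumbling block: without an implicit Encoder in scope, `map` on a Dataset will not compile.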
Cross Joins
What is a Cross Join?
● When we join two DataFrames without any join condition, we run into a cross join
● Also known as a cross product
● Most of the time it is induced by accident, or created when the join condition matches columns of the wrong data type
● Huge performance penalty
● Should be caught in the planning stage rather than the execution phase
Cross Joins in 1.x vs 2.x
● In Spark 1.x, there was no check for when a cross join happens, which resulted in poor performance on large data
● In 2.x, Spark has added a check in the logical plan to avoid cross joins
● If users want a cross join, they have to ask for it explicitly
● So joins that turn out to be cross joins now result in exceptions
● CrossJoin.scala
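A sketch of the new behaviour (not the deck’s CrossJoin.scala; the toy DataFrames are made up):

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cross-join").getOrCreate()
import spark.implicits._

val left  = Seq(1, 2).toDF("l")
val right = Seq("a", "b").toDF("r")

// 1.x: left.join(right) silently produced a cartesian product
// 2.x: the same plan is rejected at planning time unless the intent is explicit
val accidental = Try(left.join(right).count())

// Asking for the cross join explicitly works
val explicit = left.crossJoin(right)

// Alternatively, opt in globally (not recommended):
// spark.conf.set("spark.sql.crossJoin.enabled", "true")
```

The exception surfaces only when an action runs the plan, which is why `Try` wraps the `count()` rather than the `join()`.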
Custom ML Transformers
ML Transformers
● ML uses DataFrame as the pipeline mechanism to build complex machine learning models
● Transformer is an API to represent the data pre-processing needed for a learning algorithm
● From Spark 2.x, the ML pipeline uses Dataset as the abstraction rather than DataFrame
● So if you have custom transformers in your code, you need to update them to the new API to support the Dataset abstraction
● Ex: CustomMLTransformer.scala
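A minimal hypothetical transformer showing the signature change (this is not the deck’s CustomMLTransformer.scala; the class, column name, and sample data are made up for illustration):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.StructType

// Hypothetical transformer that upper-cases a "text" column
class UpperCaseTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upperCase"))

  // 2.x signature takes Dataset[_]; in 1.x this was transform(df: DataFrame)
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("text", upper(dataset("text")))

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
}

val spark = SparkSession.builder().master("local[*]").appName("ml-transformer").getOrCreate()
import spark.implicits._

val input  = Seq("hello", "spark").toDF("text")
val output = new UpperCaseTransformer().transform(input)
```

Since `DataFrame` is `Dataset[Row]` in 2.x, callers that pass a DataFrame keep working; only the override signature in your custom transformer needs updating.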
Thank You
References
● http://blog.madhukaraphatak.com/categories/spark-two-migration-series/
● http://www.spark.tc/migrating-applications-to-apache-spark-2-0-2/
● http://blog.madhukaraphatak.com/categories/spark-two/

 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

Migrating to spark 2.0

● This binary incompatibility is mostly due to how Scala maps its features to JVM bytecode
● This is the reason for the %% operator in build.sbt
Spark 2.x with Scala 2.10
● You can use the 2.1.x and 2.2.x versions of Spark with Scala 2.10 as of now
● The official Spark binaries are built for 2.11; you need to build from source if you need 2.10
● Most of the built-in and third-party connectors support Scala 2.10
● But Scala 2.10 support is more of a leeway until everyone upgrades to 2.11
● So it's not advisable to use 2.10 unless there is a strong reason
Challenges to move to Scala 2.11
● Make sure all libraries you use, both Spark and non-Spark, have a 2.11 version
● 2.11 has modularised XML and other parts, so you may need to add additional dependencies that were part of the Scala standard library earlier
● Update the sbt version to support the new version of Scala
● Expect more performance tuning, as 2.11 comes with a new backend code generator
For Java Users
● You can update the dependency to Scala 2.11 without much effort
● Java 8 is needed for Spark 2.x, as support for prior versions of Java is deprecated
● With Java 8, you can use the Spark APIs in a much more streamlined way using the new functional abstractions in Java 8
● Replace all instances of _2.10 with _2.11 in dependencies
Migration Activity
● Update build.sbt to reflect the new version of Scala
● Use the %% operator whenever possible so that the Scala version is automatically inferred
● Set the version of Spark to 2.1.0 (or the latest stable release of the 2.x series)
● Verify that all the libraries work with the new version of Scala
● Now your dependencies are ready
● Ex: build.sbt
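The steps above can be sketched as a minimal build.sbt; the library list and versions here are illustrative, and the actual build file in the companion repository may differ:

```scala
// build.sbt -- minimal sketch for a Spark 2.x project on Scala 2.11
scalaVersion := "2.11.8"

val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.11) to the artifact name,
  // so the matching build is resolved automatically
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)
```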
API Compatibility for Connectors
● All the 1.x connectors are source compatible with Spark 2.x
● The data source API has not changed much in 2.x
● You need to recompile the code against the new version of Spark and the new version of Scala
● Most of the external connectors, like Elasticsearch and MongoDB, already support 2.x
Apache Bahir
● Spark has removed many of the earlier built-in connectors from its repo to slim down the core project
● Most of them are from streaming, e.g. Twitter and ZeroMQ
● These are now part of the open source project Apache Bahir, led by IBM
● So if you are using any of these, you need to change the dependencies to Apache Bahir
● You need to change the code to reflect the new package names if needed
● Ex: ZeroMQWordCount.scala
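Moving a removed streaming connector to Bahir is mostly a dependency change. A hedged sbt sketch for ZeroMQ (the artifact name is from the Bahir project; pick the Bahir release matching your Spark version):

```scala
// Before (Spark 1.x): the ZeroMQ receiver shipped with Spark itself
// "org.apache.spark" %% "spark-streaming-zeromq" % "1.6.3"

// After (Spark 2.x): the same connector now lives in Apache Bahir
libraryDependencies += "org.apache.bahir" %% "spark-streaming-zeromq" % "2.1.0"

// The package name stays under org.apache.spark.streaming.zeromq,
// so most imports keep working after the dependency change
```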
New Connectors
● Not only have some connectors been removed, a few built-in ones have been added
● CSV is one of the new built-in connectors added in 2.x
● Also, a few third-party libraries have a separate version for 2.0. Ex: elasticsearch-hadoop is now elasticsearch-spark in 2.x
● So find the right connectors, as the older ones may have moved under a different name or version
Migration Activity
● Add all the removed connectors from Apache Bahir and update the code to reflect the same
● Prefer built-in connectors over third-party libraries, as with CSV
● Compile all custom sources against the new version of Spark and Scala
Contexts in Spark
● In Spark, most of the code starts with contexts
● In the first version of Spark, SparkContext was the entry point to the RDD world
● As new APIs were added, Spark added more contexts
○ StreamingContext - DStream API
○ SQLContext - Dataframe API
○ HiveContext - SQL with Hive support
○ Custom contexts for libraries, like CassandraContext, MemsqlContext etc
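The 1.x entry points above can be sketched as follows; this is the pre-2.0 style, where each API needs its own context object (the app name and batch interval are illustrative):

```scala
// Spark 1.x style: one context per API (sketch)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("contexts-1x").setMaster("local[*]")

val sc = new SparkContext(conf)                  // RDD API
val sqlContext = new SQLContext(sc)              // DataFrame API
val ssc = new StreamingContext(sc, Seconds(10))  // DStream API

val total = sc.parallelize(1 to 3).count()
```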
Challenges with Contexts
● In Spark 1.x, as you use one or more APIs, you need to maintain multiple contexts
● Having multiple different entry points makes code more complicated and less maintainable
● There are inconsistencies between the contexts
○ The getOrCreate API is available on SQLContext but not on HiveContext
● This makes API unification harder
Spark Session
● Single entry point for all APIs
● Primarily targeted to replace all structured contexts like SQLContext, HiveContext, custom contexts etc
● Wraps SparkContext for all execution-related specifics
● All APIs are copied from SQLContext, so it can be used as a drop-in replacement for SQLContext or HiveContext
● Will replace StreamingContext when structured streaming comes of age
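A minimal sketch of the 2.x entry point (the app name is illustrative; enableHiveSupport is shown commented out since it needs the Hive classes on the classpath):

```scala
// Spark 2.x: one SparkSession replaces SQLContext/HiveContext (sketch)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("migration-example")
  .master("local[*]")
  // .enableHiveSupport()  // opt in to Hive on demand, no separate HiveContext
  .getOrCreate()

// The underlying SparkContext is wrapped, not created from scratch
val sc = spark.sparkContext
val df = spark.range(5).toDF("id")
```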
Migration Activity
● Replace SQLContext and HiveContext with the SparkSession API
● You can add custom optimisation rules to the SparkSession rather than using custom contexts for connectors
● Enable Hive support on demand without changing the entry point
● Use the wrapped SparkContext rather than creating one from scratch
● Ex: CsvLoad.scala
Built in CSV Connector
Why CSV matters?
● Spark 1.x came with built-in support for JSON rather than CSV
● CSV was supported by the spark-csv library from Databricks
● But the Spark team soon realised CSV is the de facto standard in data science communities and major enterprises
● So in 2.x, the spark-csv code was improved and built into the core of Spark
● So from 2.x, you don't need to use the spark-csv library anymore
● Ex: CsvLoad.scala
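A self-contained sketch of the built-in connector; a tiny CSV file is generated on the fly for the demo, and the 1.x style is shown in comments for contrast:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-example").master("local[*]").getOrCreate()

// Create a tiny csv file just for this demo
val path = Files.createTempFile("people", ".csv")
Files.write(path, "name,age\nalice,30\nbob,25\n".getBytes)

// Spark 1.x needed the external spark-csv package:
//   sqlContext.read.format("com.databricks.spark.csv")
//     .option("header", "true").load(...)

// Spark 2.x: csv is a built-in format
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path.toString)

df.printSchema()
```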
Advantage of built in connector
● No dependency on a third-party library
● Easy to experiment on the spark-shell or notebook systems
● Better performance for
○ Schema inference
○ Joins
● Ex: CsvJoin.scala
Moving RDD to Dataset
DataFrame Abstraction in 1.x vs 2.x
[Diagram: in 1.x, DataFrame sits on Spark Catalyst over RDD; in 2.x, DataFrame is unified with Dataset, which sits on Spark Catalyst over RDD]
Dataframe functional APIs
● In Spark 1.x, DataFrame exposed structured APIs using the DataFrame DSL or Spark SQL interfaces
● Whenever a developer needed functional APIs like map or flatMap, Spark automatically fell back to the RDD abstraction
● This movement between DataFrame and RDD and back made sure that the developer could choose the right tool for the work
● But it also came with the cost of non-optimised RDD code
● Ex: DFMapExample.scala
Dataset functional APIs
● Dataset borrows its APIs both from RDD and DataFrame
● So when you call functional APIs on a DataFrame in 2.x, it no longer returns an RDD but a Dataset
● This bridges the performance gap between the structured APIs and the functional APIs
● But it also may break your code if it expects an RDD
● So you need to migrate all such code to use .rdd explicitly or use the Dataset functional APIs
● Ex: DFMapExample.scala
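The return-type change can be sketched as below; the 1.x behaviour is shown in a comment, and `.rdd` is the explicit escape hatch when downstream code still expects an RDD:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("map-example").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(3).toDF("id")

// Spark 1.x: df.map(...) fell back to the RDD API and returned an RDD
// Spark 2.x: map on a DataFrame returns a Dataset, staying on Catalyst
val doubled = df.map(row => row.getLong(0) * 2)  // Dataset[Long], not RDD[Long]

// If downstream code really expects an RDD, ask for it explicitly
val asRdd = doubled.rdd
```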
What is Cross Join?
● When we join two dataframes without any join condition, we run into a cross join
● Also known as a cross product
● Most of the time it is induced by accident, or created when the join condition matches columns of the wrong data type
● It carries a huge performance penalty
● It should be caught in the planning stage rather than in the execution phase
Cross Joins in 1.x vs 2.x
● In Spark 1.x, there was no check for when a cross join happens, which resulted in poor performance with large data
● In 2.x, Spark has added a check in the logical plan to avoid cross joins
● If users want a cross join, they have to ask for it explicitly
● So if you have joins which are cross joins, they now result in exceptions
● Ex: CrossJoin.scala
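A sketch of asking for the cross join explicitly in 2.x; the global config switch that restores the 1.x behaviour is shown in a comment:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cross-join").master("local[*]").getOrCreate()
import spark.implicits._

val left  = Seq(1, 2).toDF("a")
val right = Seq("x", "y").toDF("b")

// In 2.x, a join with no condition fails analysis unless
// the cross join is requested explicitly:
val product = left.crossJoin(right)  // cartesian product, 4 rows

// Alternatively, restore the 1.x behaviour globally (not recommended):
// spark.conf.set("spark.sql.crossJoin.enabled", "true")
```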
ML Transformers
● ML uses DataFrame as the pipeline mechanism to build complex machine learning models
● A Transformer is an API to represent the data preprocessing needed for a learning algorithm
● From Spark 2.x, the ML pipeline uses Dataset as the abstraction rather than DataFrame
● So if you have custom transformers in your code, you need to update them to the new API to support the Dataset abstraction
● Ex: CustomMLTransformer.scala
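A hypothetical custom transformer illustrating the signature change; the key point is that in 2.x `transform` takes `Dataset[_]` where the 1.x version took `DataFrame`:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.StructType

// Hypothetical transformer that upper-cases a "text" column
class UpperCaseTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upper"))

  // 1.x signature was:  def transform(df: DataFrame): DataFrame
  // 2.x signature is:   def transform(ds: Dataset[_]): DataFrame
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("text", upper(col("text")))

  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}

val spark = SparkSession.builder()
  .appName("ml-transformer").master("local[*]").getOrCreate()
import spark.implicits._

val out = new UpperCaseTransformer().transform(Seq("hello").toDF("text"))
```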