Introduction to Data Source V2 API
Next Generation Datasource API for Spark 2.0
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Technical Lead at Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
● Structured Data Processing
● Shortcomings of Datasource API
● Goals of Datasource V2
● Anatomy of Datasource V2
● In-Memory Datasource
● MySQL Datasource
Structured Data Processing
Spark SQL Architecture
[Diagram: CSV, JSON and JDBC sources plug into the Data Source API, which feeds the DataFrame API; on top sit the DataFrame DSL and Spark SQL/HQL.]
Spark SQL Components
● Data source API
Universal API for loading/saving structured data
● DataFrame API
Higher level representation for structured data
● SQL interpreter and optimizer
Express data transformation in SQL
● SQL service
Hive thrift server
Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third party integration through spark-packages
● Support for smart sources
● Introduced in Spark 1.3 version along with DataFrame
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra (in the works)
○ etc.
Datasource V1 API
● Introduced in Spark 1.3 version
● Spark has evolved quite a bit after 1.3
○ Custom Memory Management
○ Dataset abstraction
○ Structured Streaming
● The Datasource API has not evolved along with new versions of Spark
Difficulty in evolving DataSource API
● The data source is the lowest level abstraction in structured processing, the layer that talks to data sources directly
● Datasources are often written by third party vendors to connect to different sources
● They are not updated as often as the Spark codebase, which makes changing the API challenging
● So the data source API remained the same while the rest of Spark changed quite a bit
Shortcomings of V1 API
Dependency on High Level API
● Data sources are the lowest level abstraction in the stack
● The Data Source V1 API depended on high level user facing abstractions like SQLContext and DataFrame
def createRelation(sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation
● As Spark evolved, these abstractions got deprecated and replaced by better ones
Dependency on High Level API
● But as the data source API was not changed, it stayed stuck with the old abstractions
● Having low level abstractions depend upon high level abstractions is not a good idea
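To make this concrete, here is a minimal sketch of a V1-style source (the class names and data are illustrative, not the actual built-in code): even a trivial relation has to carry a SQLContext and hand back an RDD[Row].

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Minimal V1-style source: the entry point receives a SQLContext and the
// relation must expose it, tying the low level source to high level APIs.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new SimpleRelation(sqlContext)
}

class SimpleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("value", IntegerType)))

  // buildScan is row-only, so there is no columnar read path in V1.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2), Row(3)))
}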
Lack of Support for Columnar Read
def buildScan(): RDD[Row]
● From the above API, it's apparent that the API reads the data in row format
● But many analytics sources today are columnar in nature
● So if the underlying columnar source is read into row format, it loses all the performance benefits
Lack of Partition and Sorting Info
● Many data sources distribute the data using partitioning
over multiple machines
● In the Datasource V1 API there was no way to share partition locality information with the Spark engine
● This resulted in random reads from the Spark engine with a lot of network traffic
● Spark's built-in sources solved this issue using internal APIs
● But all third party connectors suffered from it
No Transaction Support in Write
● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS
● The interface did not have any transactional support, which is needed for sophisticated sources like databases
● For example, when data is partially written to a database and the job aborts, it will not clean up those rows
● It's not an issue in HDFS, because unsuccessful writes are tracked using the _SUCCESS marker file
Limited Extendability
● Data source v1 API only supports filter push down and
column pruning
● But many smart sources, i.e. data sources with processing power, offer more capabilities than that
● These sources can do sorting and aggregation at the source level itself
● Currently the data source API doesn't have a good mechanism to push more Catalyst expressions down to the underlying source
Introduction to Datasource V2 API
V2 API
● Datasource V2 is a new API introduced in Spark 2.3 to
address the shortcomings of V1 API
● The V2 API mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1
● Currently it's in beta; it will become GA in a future release
● The V1 API will then be deprecated
● No user facing code changes are needed to use V2 data sources, as the sketch below shows
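For instance, a minimal usage sketch, assuming a local SparkSession and a placeholder implementation class name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("datasource-v2-demo")
  .master("local[*]")
  .getOrCreate()

// Same read API as for any V1 source: format() takes the implementation class
// (or its registered short name); "com.example.SimpleDataSource" is a placeholder.
val df = spark.read
  .format("com.example.SimpleDataSource")
  .load()

df.show()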
Goals
● No dependency on high level abstractions like SparkContext, DataFrame etc.
● Java Friendly
● Support for Filter Pushdown and Pruning
● Support for Partition locality
● Support for both columnar and row based reading
● Transaction support in write interface
● Get rid of internal API’s like HadoopFsRelation
Anatomy of V2 Read API
Java API
● Being Java friendly is one of the goals of the V2 API
● So all the base interfaces for the V2 API are defined as
Java interfaces
● This makes interoperability with Java much easier compared to the V1 API
● It is a little painful in Scala to deal with the Java models
● The next slides cover all the different interfaces in the read path
Interfaces
[Diagram: user code on the master creates the DataSourceReader, which produces DataReaderFactory instances; each factory is sent to a worker, where it creates a DataReader.]
DataSourceReader Interface
● Entry Point to the Data Source
● Created via the ReadSupport interface's createReader method
● Has two methods
○ def readSchema(): StructType
○ def createDataReaderFactories(): List[DataReaderFactory]
● Responsible for schema inference and for creating data reader factories
● What might the list indicate here? (Each factory corresponds to one partition)
DataReaderFactory Interface
● Follows the factory design pattern to create the actual data reader
● The factory is created on the master and can hold common setup, like JDBC connection details, shared across the data readers
● It has one method to create the data reader
def createDataReader(): DataReader
DataReader Interface
● Interface responsible for the actual reading of data
● Runs in worker nodes
● Should be serializable
● Methods exposed are
def next(): Boolean
def get(): T
● Looks very similar to the Iterator interface
Observations from API
● The API doesn’t use any high level API’s like
SparkContext,DataFrame etc
● The return type of DataReader is T which suggests that
it can support both row and columnar read
● Reader interface is quite simple which can be used for
wide variety of sources
● No more fiddling with RDD in Data source layer.
In-memory Data Source
Simple In-Memory Interface
● A simple source which reads the data from an in-memory array
● Implements all the interfaces discussed earlier
● Has a single partition
● Shows how all the different APIs come together to build a full fledged source
● Ex : SimpleDataSource.scala (sketch below)
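A minimal sketch of how such a source could be wired together against the Spark 2.3 V2 interfaces; the class names and the hard-coded array are illustrative, and the actual SimpleDataSource.scala in the linked repository may differ.

import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Entry point: Spark instantiates this class and calls createReader on the driver.
class SimpleDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new SimpleDataSourceReader
}

// Declares the schema and returns a single factory, i.e. a single partition.
class SimpleDataSourceReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("value", IntegerType)))

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    List[DataReaderFactory[Row]](new SimpleDataReaderFactory).asJava
}

// Serializable factory; shipped to the worker, where it creates the reader.
class SimpleDataReaderFactory extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new SimpleDataReader
}

// Iterator-style reader over a hard-coded in-memory array.
class SimpleDataReader extends DataReader[Row] {
  private val values = Array(1, 2, 3, 4, 5)
  private var index = -1

  override def next(): Boolean = { index += 1; index < values.length }
  override def get(): Row = Row(values(index))
  override def close(): Unit = ()
}

Reading it with spark.read.format(classOf[SimpleDataSource].getName).load() then yields a single-column DataFrame with one partition.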
Multiple Partitions
● In this example, we extend our simple in-memory data
source to have multiple partitions
● In the code, we will have multiple data reader factories compared to one in the earlier example
● In the data reader code, we track the partition using its start and end
● Mimics the HDFS InputFormat
● Ex : SimpleMultiDataSource.scala (sketch below)
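A sketch of the multi-partition variant under the same assumptions (class names and ranges are illustrative):

import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class SimpleMultiDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new SimpleMultiDataSourceReader
}

// Returning two factories makes the resulting DataFrame have two partitions.
class SimpleMultiDataSourceReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("value", IntegerType)))

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    List[DataReaderFactory[Row]](
      new RangeDataReaderFactory(0, 5),
      new RangeDataReaderFactory(5, 10)
    ).asJava
}

// Each factory carries the start/end of its partition, much like an
// HDFS InputSplit carries its offset and length.
class RangeDataReaderFactory(start: Int, end: Int) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new RangeDataReader(start, end)
}

class RangeDataReader(start: Int, end: Int) extends DataReader[Row] {
  private var current = start - 1

  override def next(): Boolean = { current += 1; current < end }
  override def get(): Row = Row(current)
  override def close(): Unit = ()
}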
Filter Pushdown
Filter Pushdown with MySQL Datasource
● Filter pushdown allows Spark SQL filters to be pushed down to the data source
● Smart sources, like relational databases, have the capability to filter efficiently at the source
● In this example, filter pushdown is implemented for a MySQL source
● This source uses the JDBC interface to communicate with MySQL
● Ex : SimpleMysqlDataSource.scala (sketch below)
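A sketch of how the pushdown plumbing could look using the Spark 2.3 SupportsPushDownFilters mix-in; the table, columns, connection settings and class names are all illustrative, and the actual SimpleMysqlDataSource.scala may differ.

import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Reader for a hypothetical MySQL table with (id INT, name VARCHAR) columns.
class SimpleMysqlDataSourceReader(table: String) extends DataSourceReader
    with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def readSchema(): StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // Spark hands over the query's filters; we keep the ones we can translate
  // to SQL and return the rest for Spark to evaluate itself.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: EqualTo | _: GreaterThan => true
      case _                           => false
    }
    pushed = supported
    unsupported
  }

  override def pushedFilters(): Array[Filter] = pushed

  // Pushed filters become a WHERE clause in the JDBC query (simplified:
  // real code should use parameterized queries, not string interpolation).
  private def whereClause: String = {
    val predicates = pushed.collect {
      case EqualTo(attr, value)     => s"$attr = '$value'"
      case GreaterThan(attr, value) => s"$attr > '$value'"
    }
    if (predicates.isEmpty) "" else predicates.mkString(" WHERE ", " AND ", "")
  }

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
    val query = s"SELECT id, name FROM $table" + whereClause
    List[DataReaderFactory[Row]](new MysqlDataReaderFactory(query)).asJava
  }
}

class MysqlDataReaderFactory(query: String) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new MysqlDataReader(query)
}

// Runs the (filtered) query over JDBC on the worker; connection settings are placeholders.
class MysqlDataReader(query: String) extends DataReader[Row] {
  private lazy val connection =
    java.sql.DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
  private lazy val resultSet = connection.createStatement().executeQuery(query)

  override def next(): Boolean = resultSet.next()
  override def get(): Row = Row(resultSet.getInt("id"), resultSet.getString("name"))
  override def close(): Unit = connection.close()
}

The ReadSupport entry class would follow the same pattern as in the in-memory sketch, reading the table name from DataSourceOptions and passing it to the reader.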
References
● http://blog.madhukaraphatak.com/categories/datasource-v2-series
● https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html
● https://issues.apache.org/jira/browse/SPARK-15689
