Introduction to Data Source V2 API
Next Generation Datasource API for Spark 2.0
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Technical Lead at Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
● Structured Data Processing
● Shortcomings of Datasource API
● Goals of Datasource V2
● Anatomy of Datasource V2
● In-Memory Datasource
● MySQL Datasource
Structured Data Processing
Spark SQL Architecture
[Diagram: CSV, JSON and JDBC sources plug into the Data Source API, which feeds the DataFrame API; on top sit the DataFrame DSL and Spark SQL/HQL.]
Spark SQL Components
● Data source API
Universal API for loading/saving structured data
● DataFrame API
Higher level representation for structured data
● SQL interpreter and optimizer
Express data transformation in SQL
● SQL service
Hive thrift server
Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third party integration through spark-packages
● Support for smart sources
● Introduced in Spark 1.3 version along with DataFrame
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra (in the works)
○ etc.
Datasource V1 API
● Introduced in Spark 1.3 version
● Spark has evolved quite a bit after 1.3
○ Custom Memory Management
○ Dataset abstraction
○ Structured Streaming
● The Datasource API has not evolved along with new versions of Spark
Difficulty in evolving DataSource API
● The data source is the lowest level abstraction in structured processing, the layer that talks to data sources directly
● Datasources are often written by third party vendors to connect to different sources
● They are not updated as often as the Spark codebase, which makes changing the API challenging
● So the data source API remained the same while the rest of Spark changed quite a bit
Shortcomings of V1 API
Dependency on High Level API
● Data sources are the lowest level abstraction in the stack
● The Data Source V1 API depended on high level user facing abstractions like SQLContext and DataFrame
def createRelation(sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation
● As Spark evolved, these abstractions got deprecated and replaced by better ones
Dependency on High Level API
● But as the data source API was not changed, it stayed stuck with the old abstractions
● Having low level abstractions depend upon high level abstractions is not a good idea
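To make this concrete, here is a minimal sketch of a V1-style source (the class names and data are illustrative, not the actual built-in code): even a trivial relation has to carry a SQLContext and hand back an RDD[Row].

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Minimal V1-style source: the entry point receives a SQLContext and the
// relation must expose it, tying the low level source to high level APIs.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new SimpleRelation(sqlContext)
}

class SimpleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("value", IntegerType)))

  // buildScan is row-only, so there is no columnar read path in V1.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2), Row(3)))
}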
Lack of Support for Columnar Read
def buildScan(): RDD[Row]
● From the above API, it's apparent that the API reads the data in row format
● But many analytics sources today are columnar in nature
● So if the underlying columnar source is read into row format, it loses all the performance benefits
Lack of Partition and Sorting Info
● Many data sources distribute the data using partitioning
over multiple machines
● In the Datasource V1 API there was no way to share partition locality information with the Spark engine
● This resulted in random reads from the Spark engine with a lot of network traffic
● Spark's built-in sources solved this issue using internal APIs
● But all third party connectors suffered from it
No Transaction Support in Write
● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS
● The interface did not have any transactional support, which is needed for sophisticated sources like databases
● For example, when data is partially written to a database and the job aborts, it will not clean up those rows
● It's not an issue in HDFS, because unsuccessful writes are tracked using the _SUCCESS marker file
Limited Extendability
● Data source v1 API only supports filter push down and
column pruning
● But many smart sources, i.e. data sources with processing power, offer more capabilities than that
● These sources can do sorting and aggregation at the source level itself
● Currently the data source API doesn't have a good mechanism to push more Catalyst expressions down to the underlying source
Introduction to Datasource V2 API
V2 API
● Datasource V2 is a new API introduced in Spark 2.3 to
address the shortcomings of V1 API
● The V2 API mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1
● Currently it's in beta; it will become GA in a future release
● The V1 API will then be deprecated
● No user facing code changes are needed to use V2 data sources, as the sketch below shows
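For instance, a minimal usage sketch, assuming a local SparkSession and a placeholder implementation class name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("datasource-v2-demo")
  .master("local[*]")
  .getOrCreate()

// Same read API as for any V1 source: format() takes the implementation class
// (or its registered short name); "com.example.SimpleDataSource" is a placeholder.
val df = spark.read
  .format("com.example.SimpleDataSource")
  .load()

df.show()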
Goals
● No dependency on high level abstractions like SparkContext, DataFrame etc.
● Java Friendly
● Support for Filter Pushdown and Pruning
● Support for Partition locality
● Support for both columnar and row based reading
● Transaction support in write interface
● Get rid of internal API’s like HadoopFsRelation
Anatomy of V2 Read API
Java API
● Being Java friendly is one of the goals of the V2 API
● So all the base interfaces for the V2 API are defined as
Java interfaces
● This makes interoperability with Java much easier compared to the V1 API
● It is a little painful in Scala to deal with the Java models
● The next slides cover all the different interfaces in the read path
Interfaces
[Diagram: user code on the master creates the DataSourceReader, which produces DataReaderFactory instances; each factory is sent to a worker, where it creates a DataReader.]
DataSourceReader Interface
● Entry Point to the Data Source
● Created via the ReadSupport interface's createReader method
● Has two methods
○ def readSchema(): StructType
○ def createDataReaderFactories(): List[DataReaderFactory]
● Responsible for schema inference and for creating data reader factories
● What might the list indicate here? (Each factory corresponds to one partition)
DataReaderFactory Interface
● Follows the factory design pattern to create the actual data reader
● The factory is created on the master and can hold common setup, like JDBC connection details, shared across the data readers
● It has one method to create the data reader
def createDataReader(): DataReader
DataReader Interface
● Interface responsible for the actual reading of data
● Runs in worker nodes
● Should be serializable
● Methods exposed are
def next(): Boolean
def get(): T
● Looks very similar to the Iterator interface
Observations from API
● The API doesn’t use any high level API’s like
SparkContext,DataFrame etc
● The return type of DataReader is T which suggests that
it can support both row and columnar read
● Reader interface is quite simple which can be used for
wide variety of sources
● No more fiddling with RDD in Data source layer.
In-memory Data Source
Simple In-Memory Interface
● A simple source which reads the data from an in-memory array
● Implements all the interfaces discussed earlier
● Has a single partition
● Shows how all the different APIs come together to build a full fledged source
● Ex : SimpleDataSource.scala (sketch below)
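A minimal sketch of how such a source could be wired together against the Spark 2.3 V2 interfaces; the class names and the hard-coded array are illustrative, and the actual SimpleDataSource.scala in the linked repository may differ.

import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Entry point: Spark instantiates this class and calls createReader on the driver.
class SimpleDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new SimpleDataSourceReader
}

// Declares the schema and returns a single factory, i.e. a single partition.
class SimpleDataSourceReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("value", IntegerType)))

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    List[DataReaderFactory[Row]](new SimpleDataReaderFactory).asJava
}

// Serializable factory; shipped to the worker, where it creates the reader.
class SimpleDataReaderFactory extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new SimpleDataReader
}

// Iterator-style reader over a hard-coded in-memory array.
class SimpleDataReader extends DataReader[Row] {
  private val values = Array(1, 2, 3, 4, 5)
  private var index = -1

  override def next(): Boolean = { index += 1; index < values.length }
  override def get(): Row = Row(values(index))
  override def close(): Unit = ()
}

Reading it with spark.read.format(classOf[SimpleDataSource].getName).load() then yields a single-column DataFrame with one partition.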
Multiple Partitions
● In this example, we extend our simple in-memory data
source to have multiple partitions
● In the code, we will have multiple data reader factories compared to one in the earlier example
● In the data reader code, we track the partition using its start and end
● Mimics the HDFS InputFormat
● Ex : SimpleMultiDataSource.scala (sketch below)
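A sketch of the multi-partition variant under the same assumptions (class names and ranges are illustrative):

import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class SimpleMultiDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new SimpleMultiDataSourceReader
}

// Returning two factories makes the resulting DataFrame have two partitions.
class SimpleMultiDataSourceReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("value", IntegerType)))

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    List[DataReaderFactory[Row]](
      new RangeDataReaderFactory(0, 5),
      new RangeDataReaderFactory(5, 10)
    ).asJava
}

// Each factory carries the start/end of its partition, much like an
// HDFS InputSplit carries its offset and length.
class RangeDataReaderFactory(start: Int, end: Int) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new RangeDataReader(start, end)
}

class RangeDataReader(start: Int, end: Int) extends DataReader[Row] {
  private var current = start - 1

  override def next(): Boolean = { current += 1; current < end }
  override def get(): Row = Row(current)
  override def close(): Unit = ()
}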
Filter Pushdown
Filter Pushdown with MySQL Datasource
● Filter pushdown allows Spark SQL filters to be pushed down to the data source
● Smart sources, like relational databases, have the capability to filter efficiently at the source
● In this example, filter pushdown is implemented for a MySQL source
● This source uses the JDBC interface to communicate with MySQL
● Ex : SimpleMysqlDataSource.scala (sketch below)
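A sketch of how the pushdown plumbing could look using the Spark 2.3 SupportsPushDownFilters mix-in; the table, columns, connection settings and class names are all illustrative, and the actual SimpleMysqlDataSource.scala may differ.

import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Reader for a hypothetical MySQL table with (id INT, name VARCHAR) columns.
class SimpleMysqlDataSourceReader(table: String) extends DataSourceReader
    with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def readSchema(): StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // Spark hands over the query's filters; we keep the ones we can translate
  // to SQL and return the rest for Spark to evaluate itself.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: EqualTo | _: GreaterThan => true
      case _                           => false
    }
    pushed = supported
    unsupported
  }

  override def pushedFilters(): Array[Filter] = pushed

  // Pushed filters become a WHERE clause in the JDBC query (simplified:
  // real code should use parameterized queries, not string interpolation).
  private def whereClause: String = {
    val predicates = pushed.collect {
      case EqualTo(attr, value)     => s"$attr = '$value'"
      case GreaterThan(attr, value) => s"$attr > '$value'"
    }
    if (predicates.isEmpty) "" else predicates.mkString(" WHERE ", " AND ", "")
  }

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
    val query = s"SELECT id, name FROM $table" + whereClause
    List[DataReaderFactory[Row]](new MysqlDataReaderFactory(query)).asJava
  }
}

class MysqlDataReaderFactory(query: String) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new MysqlDataReader(query)
}

// Runs the (filtered) query over JDBC on the worker; connection settings are placeholders.
class MysqlDataReader(query: String) extends DataReader[Row] {
  private lazy val connection =
    java.sql.DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
  private lazy val resultSet = connection.createStatement().executeQuery(query)

  override def next(): Boolean = resultSet.next()
  override def get(): Row = Row(resultSet.getInt("id"), resultSet.getString("name"))
  override def close(): Unit = connection.close()
}

The ReadSupport entry class would follow the same pattern as in the in-memory sketch, reading the table name from DataSourceOptions and passing it to the reader.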
References
● http://blog.madhukaraphatak.com/categories/datasource-v2-series
● https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html
● https://issues.apache.org/jira/browse/SPARK-15689
