Transactional Writes in Datasource V2
Next Generation Datasource API for Spark 2.0
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Director of Engineering, Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
● Introduction to Data Source V2
● Shortcomings of Datasource Write API
● Anatomy of Datasource V2 Write API
● Per Partition Transaction
● Source Level Transaction
● Partition Affinity
Structured Data Processing
Spark SQL Architecture
[Architecture diagram: Spark SQL and HQL / Dataframe DSL sit on the Data Frame API, which builds on the Data Source API, which connects to sources such as CSV, JSON and JDBC]
Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Introduced in Spark 1.3 along with DataFrame
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra
etc.
Shortcomings of V1 API
● Introduced in 1.3 but has not evolved in step with other parts of Spark
● Depends on high-level APIs like DataFrame, SparkContext etc.
● Lack of support for columnar reads
● Lack of partition awareness
● No transaction support in the write API
● Lack of extensibility
Introduction to Datasource V2 API
V2 API
● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API
● V2 API mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1
● Currently it's in beta; it will become GA in a future release
● The V1 API will then be deprecated
● No user-facing code changes are needed to use V2 data sources
Shortcomings of V1 Write API
No Transaction Support in Write
● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS
● The interface had no transactional support, which sophisticated sources like databases need
● For example, when data has been written partially to a database and the job aborts, those rows are not cleaned up
● This is not an issue in HDFS, because incomplete output can be detected by the absence of the _SUCCESS marker file
Anatomy of V2 Write API
Interfaces
[Diagram: on the master, user code calls WriteSupport, which creates a DataSourceWriter; the DataSourceWriter creates a DataWriterFactory, which creates a DataWriter on each worker]
WriteSupport Interface
● Entry point to the data source
● Has one method
def createWriter(jobId: String, schema: StructType, mode: SaveMode, options: DataSourceOptions): Optional[DataSourceWriter]
● SaveMode and schema are the same as in the V1 API
● Returns Optional so that read-only sources can return empty (a sketch follows below)
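A minimal sketch of how this entry point might look in Scala, assuming the Spark 2.3 interfaces; SimpleMysqlDataSource and SimpleMysqlDataSourceWriter are illustrative names, with the writer sketched on the following slides:

import java.util.Optional

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
import org.apache.spark.sql.types.StructType

class SimpleMysqlDataSource extends DataSourceV2 with WriteSupport {
  // Called once on the driver for every write job
  override def createWriter(jobId: String, schema: StructType,
      mode: SaveMode, options: DataSourceOptions): Optional[DataSourceWriter] =
    Optional.of(new SimpleMysqlDataSourceWriter)
}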
DataSourceWriter Interface
● Entry point to the writer
● Has three methods
○ def createWriterFactory(): DataWriterFactory[Row]
○ def commit(messages: Array[WriterCommitMessage])
○ def abort(messages: Array[WriterCommitMessage])
● Responsible for creating the writer factory
● WriterCommitMessage is the interface used to communicate results back to the driver
● Transactional support is visible throughout the API
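A matching sketch of the driver-side writer, again with hypothetical class names; a real source would use the commit messages and abort() for job-level cleanup:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

class SimpleMysqlDataSourceWriter extends DataSourceWriter {
  // Called on the driver; the factory is serialized and shipped to the executors
  override def createWriterFactory(): DataWriterFactory[Row] = new SimpleMysqlDataWriterFactory

  // Called once on the driver after every partition has committed successfully
  override def commit(messages: Array[WriterCommitMessage]): Unit = {}

  // Called on the driver if the job fails; job-level cleanup would go here
  override def abort(messages: Array[WriterCommitMessage]): Unit = {}
}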
DataWriterFactory Interface
● Follows the classic Java factory design pattern to create the actual data writers
● Creating writers here allows each partition to be uniquely identified
● It has one method to create a data writer
def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row]
● attemptNumber is used when tasks are retried
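A minimal factory sketch, reusing the hypothetical names from the previous slides; the factory is serializable, and createDataWriter runs on the executors once per partition and attempt:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory}

class SimpleMysqlDataWriterFactory extends DataWriterFactory[Row] {
  // Runs on the executor for each partition (and again for each retried attempt)
  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
    new SimpleMysqlDataWriter(partitionId, attemptNumber)
}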
DataWriter Interface
● Interface responsible for the actual writing of data
● Runs on the worker nodes
● Methods exposed are
● def write(record: Row)
● def commit(): WriterCommitMessage
● def abort()
● Looks very similar to the Hadoop write interface
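A sketch of a JDBC-backed writer along these lines; the connection URL, credentials, table and single string column are placeholder assumptions, not the deck's actual example:

import java.sql.DriverManager

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

// Marker message sent back to the driver's commit()/abort()
case class WrittenPartition(partitionId: Int) extends WriterCommitMessage

class SimpleMysqlDataWriter(partitionId: Int, attemptNumber: Int) extends DataWriter[Row] {
  // Connection details are placeholders; a real source would read them from DataSourceOptions
  private val connection = DriverManager.getConnection("jdbc:mysql://localhost/test", "root", "")
  private val statement = connection.prepareStatement("insert into user values (?)")

  override def write(record: Row): Unit = {
    statement.setString(1, record.getString(0))
    statement.executeUpdate()
  }

  override def commit(): WriterCommitMessage = {
    connection.close()
    WrittenPartition(partitionId)
  }

  override def abort(): Unit = connection.close()
}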
Observations from API
● The API doesn't use any high-level APIs like SparkContext, DataFrame etc.
● Transaction support throughout the API
● The write interface is simple enough to serve a wide variety of sources
● No more fiddling with RDDs in the data source layer
Simple Mysql Data Source
Mysql Source
● The Mysql source is responsible for writing data using the JDBC API
● Implements all the interfaces discussed earlier
● Has a single partition
● Shows how all the different APIs come together to build a full-fledged source
● Ex : SimpleMysqlWriter.scala (a usage sketch follows below)
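A hypothetical usage sketch; the fully qualified class name passed to format() is an assumption, not the deck's actual package:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("mysql write").master("local[2]").getOrCreate()
import spark.implicits._

// Toy DataFrame with a single string column, matching the one-column writer above
val df = Seq("john", "jane").toDF("name")

df.write
  .format("com.example.sources.SimpleMysqlDataSource") // hypothetical class name of the V2 source
  .mode(SaveMode.Append)
  .save()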
Transactional Writes
Distributed Writes
● Distributed writing is hard
● There are many reasons a write can fail
○ The connection is dropped
○ An error occurs while writing data for a partition
○ Data is duplicated because of task retries
● Many of these issues crop up frequently in Spark applications
● Ex : MysqlTransactionExample.scala
Transactional Support
● The Datasource V2 API has good support for transactions
● Transactions can be implemented at
○ Partition level
○ Job level
● This transaction support helps handle errors from partially written data
● Ex : MysqlWithTransaction (a partition-level sketch follows below)
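One possible partition-level transaction sketch: the writer opens a JDBC transaction per partition, commits it in commit() and rolls it back in abort(); connection details and the commit-message class are illustrative assumptions:

import java.sql.DriverManager

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

case class CommittedPartition(partitionId: Int) extends WriterCommitMessage

class MysqlTransactionDataWriter(partitionId: Int, attemptNumber: Int) extends DataWriter[Row] {
  // Connection details are placeholders; a real source would read them from DataSourceOptions
  private val connection = DriverManager.getConnection("jdbc:mysql://localhost/test", "root", "")
  connection.setAutoCommit(false) // one database transaction per partition
  private val statement = connection.prepareStatement("insert into user values (?)")

  override def write(record: Row): Unit = {
    statement.setString(1, record.getString(0))
    statement.executeUpdate() // rows stay invisible until the transaction commits
  }

  override def commit(): WriterCommitMessage = {
    connection.commit() // make this partition's rows visible atomically
    connection.close()
    CommittedPartition(partitionId)
  }

  override def abort(): Unit = {
    connection.rollback() // discard the partial writes of a failed task
    connection.close()
  }
}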
Partition Affinity
Partition Locations
● Many data sources today provide native support for partitioning
● These partitions can be distributed over a cluster of machines
● Making Spark aware of the partitioning scheme makes reading much faster
● Works best for co-located data sources
Preferred Locations
● DataReaderFactory exposes the preferredLocations API to pass partitioning information to Spark
● This API returns the hostnames of the machines where the partition is available
● Spark uses it only as a hint; it may not honour it
● If we return a hostname that Spark does not recognise, it simply ignores it
● Spark stores this information in the underlying RDD
● Ex : SimpleDataSourceWithPartitionAffinity.scala (a sketch follows below)
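A hedged sketch of how a read-side factory might expose the hint, assuming the Spark 2.3 DataReaderFactory interface; the class name, host list and toy data are illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

// Hypothetical factory: the host list would come from the source's own partition metadata
class AffinityAwareReaderFactory(hosts: Array[String]) extends DataReaderFactory[Row] {

  // Scheduling hint for Spark; it may ignore it, and unknown hostnames are silently dropped
  override def preferredLocations(): Array[String] = hosts

  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private val values = Iterator("a", "b", "c") // toy data for the sketch
    private var current: String = _
    override def next(): Boolean =
      if (values.hasNext) { current = values.next(); true } else false
    override def get(): Row = Row(current)
    override def close(): Unit = ()
  }
}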
References
● http://blog.madhukaraphatak.com/categories/datasource-v2-series
● https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html
● https://issues.apache.org/jira/browse/SPARK-15689
