Transactional Writes in Datasource V2
Next Generation Datasource API for Spark 2.0
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Director of Engineering, Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
● Introduction to Data Source V2
● Shortcomings of Datasource Write API
● Anatomy of Datasource V2 Write API
● Per Partition Transaction
● Source Level Transaction
● Partition Affinity
Structured Data Processing
Spark SQL Architecture
[Architecture diagram: Spark SQL and HQL / Dataframe DSL sit on the Data Frame API, which builds on the Data Source API, which connects to sources such as CSV, JSON and JDBC]
Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Introduced in Spark 1.3 along with DataFrame
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra
etc.
Shortcomings of V1 API
● Introduced in 1.3 but has not evolved in step with other parts of Spark
● Depends on high-level APIs like DataFrame, SparkContext etc.
● Lack of support for columnar reads
● Lack of partition awareness
● No transaction support in the write API
● Lack of extensibility
Introduction to Datasource V2 API
V2 API
● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API
● V2 API mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1
● Currently it's in beta; it will become GA in a future release
● The V1 API will then be deprecated
● No user-facing code changes are needed to use V2 data sources
Shortcomings of V1 Write API
No Transaction Support in Write
● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS
● The interface had no transactional support, which sophisticated sources like databases need
● For example, when data has been written partially to a database and the job aborts, those rows are not cleaned up
● This is not an issue in HDFS, because incomplete output can be detected by the absence of the _SUCCESS marker file
Anatomy of V2 Write API
Interfaces
[Diagram: on the master, user code calls WriteSupport, which creates a DataSourceWriter; the DataSourceWriter creates a DataWriterFactory, which creates a DataWriter on each worker]
WriteSupport Interface
● Entry point to the data source
● Has one method
def createWriter(jobId: String, schema: StructType, mode: SaveMode, options: DataSourceOptions): Optional[DataSourceWriter]
● SaveMode and schema are the same as in the V1 API
● Returns Optional so that read-only sources can return empty (a sketch follows below)
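A minimal sketch of how this entry point might look in Scala, assuming the Spark 2.3 interfaces; SimpleMysqlDataSource and SimpleMysqlDataSourceWriter are illustrative names, with the writer sketched on the following slides:

import java.util.Optional

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
import org.apache.spark.sql.types.StructType

class SimpleMysqlDataSource extends DataSourceV2 with WriteSupport {
  // Called once on the driver for every write job
  override def createWriter(jobId: String, schema: StructType,
      mode: SaveMode, options: DataSourceOptions): Optional[DataSourceWriter] =
    Optional.of(new SimpleMysqlDataSourceWriter)
}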
DataSourceWriter Interface
● Entry point to the writer
● Has three methods
○ def createWriterFactory(): DataWriterFactory[Row]
○ def commit(messages: Array[WriterCommitMessage])
○ def abort(messages: Array[WriterCommitMessage])
● Responsible for creating the writer factory
● WriterCommitMessage is the interface used to communicate results back to the driver
● Transactional support is visible throughout the API
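A matching sketch of the driver-side writer, again with hypothetical class names; a real source would use the commit messages and abort() for job-level cleanup:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

class SimpleMysqlDataSourceWriter extends DataSourceWriter {
  // Called on the driver; the factory is serialized and shipped to the executors
  override def createWriterFactory(): DataWriterFactory[Row] = new SimpleMysqlDataWriterFactory

  // Called once on the driver after every partition has committed successfully
  override def commit(messages: Array[WriterCommitMessage]): Unit = {}

  // Called on the driver if the job fails; job-level cleanup would go here
  override def abort(messages: Array[WriterCommitMessage]): Unit = {}
}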
DataWriterFactory Interface
● Follows the classic Java factory design pattern to create the actual data writers
● Creating writers here allows each partition to be uniquely identified
● It has one method to create a data writer
def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row]
● attemptNumber is used when tasks are retried
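A minimal factory sketch, reusing the hypothetical names from the previous slides; the factory is serializable, and createDataWriter runs on the executors once per partition and attempt:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory}

class SimpleMysqlDataWriterFactory extends DataWriterFactory[Row] {
  // Runs on the executor for each partition (and again for each retried attempt)
  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
    new SimpleMysqlDataWriter(partitionId, attemptNumber)
}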
DataWriter Interface
● Interface responsible for the actual writing of data
● Runs on the worker nodes
● Methods exposed are
● def write(record: Row)
● def commit(): WriterCommitMessage
● def abort()
● Looks very similar to the Hadoop write interface
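A sketch of a JDBC-backed writer along these lines; the connection URL, credentials, table and single string column are placeholder assumptions, not the deck's actual example:

import java.sql.DriverManager

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

// Marker message sent back to the driver's commit()/abort()
case class WrittenPartition(partitionId: Int) extends WriterCommitMessage

class SimpleMysqlDataWriter(partitionId: Int, attemptNumber: Int) extends DataWriter[Row] {
  // Connection details are placeholders; a real source would read them from DataSourceOptions
  private val connection = DriverManager.getConnection("jdbc:mysql://localhost/test", "root", "")
  private val statement = connection.prepareStatement("insert into user values (?)")

  override def write(record: Row): Unit = {
    statement.setString(1, record.getString(0))
    statement.executeUpdate()
  }

  override def commit(): WriterCommitMessage = {
    connection.close()
    WrittenPartition(partitionId)
  }

  override def abort(): Unit = connection.close()
}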
Observations from API
● The API doesn't use any high-level APIs like SparkContext, DataFrame etc.
● Transaction support throughout the API
● The write interface is simple enough to serve a wide variety of sources
● No more fiddling with RDDs in the data source layer
Simple Mysql Data Source
Mysql Source
● The Mysql source is responsible for writing data using the JDBC API
● Implements all the interfaces discussed earlier
● Has a single partition
● Shows how all the different APIs come together to build a full-fledged source
● Ex : SimpleMysqlWriter.scala (a usage sketch follows below)
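A hypothetical usage sketch; the fully qualified class name passed to format() is an assumption, not the deck's actual package:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("mysql write").master("local[2]").getOrCreate()
import spark.implicits._

// Toy DataFrame with a single string column, matching the one-column writer above
val df = Seq("john", "jane").toDF("name")

df.write
  .format("com.example.sources.SimpleMysqlDataSource") // hypothetical class name of the V2 source
  .mode(SaveMode.Append)
  .save()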
Transactional Writes
Distributed Writes
● Distributed writing is hard
● There are many reasons a write can fail
○ The connection is dropped
○ An error occurs while writing data for a partition
○ Data is duplicated because of task retries
● Many of these issues crop up frequently in Spark applications
● Ex : MysqlTransactionExample.scala
Transactional Support
● The Datasource V2 API has good support for transactions
● Transactions can be implemented at
○ Partition level
○ Job level
● This transaction support helps handle errors from partially written data
● Ex : MysqlWithTransaction (a partition-level sketch follows below)
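One possible partition-level transaction sketch: the writer opens a JDBC transaction per partition, commits it in commit() and rolls it back in abort(); connection details and the commit-message class are illustrative assumptions:

import java.sql.DriverManager

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

case class CommittedPartition(partitionId: Int) extends WriterCommitMessage

class MysqlTransactionDataWriter(partitionId: Int, attemptNumber: Int) extends DataWriter[Row] {
  // Connection details are placeholders; a real source would read them from DataSourceOptions
  private val connection = DriverManager.getConnection("jdbc:mysql://localhost/test", "root", "")
  connection.setAutoCommit(false) // one database transaction per partition
  private val statement = connection.prepareStatement("insert into user values (?)")

  override def write(record: Row): Unit = {
    statement.setString(1, record.getString(0))
    statement.executeUpdate() // rows stay invisible until the transaction commits
  }

  override def commit(): WriterCommitMessage = {
    connection.commit() // make this partition's rows visible atomically
    connection.close()
    CommittedPartition(partitionId)
  }

  override def abort(): Unit = {
    connection.rollback() // discard the partial writes of a failed task
    connection.close()
  }
}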
Partition Affinity
Partition Locations
● Many data sources today provide native support for partitioning
● These partitions can be distributed over a cluster of machines
● Making Spark aware of the partitioning scheme makes reading much faster
● Works best for co-located data sources
Preferred Locations
● DataReaderFactory exposes the preferredLocations API to pass partitioning information to Spark
● This API returns the hostnames of the machines where the partition is available
● Spark uses it only as a hint; it may not honour it
● If we return a hostname that Spark does not recognise, it simply ignores it
● Spark stores this information in the underlying RDD
● Ex : SimpleDataSourceWithPartitionAffinity.scala (a sketch follows below)
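A hedged sketch of how a read-side factory might expose the hint, assuming the Spark 2.3 DataReaderFactory interface; the class name, host list and toy data are illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

// Hypothetical factory: the host list would come from the source's own partition metadata
class AffinityAwareReaderFactory(hosts: Array[String]) extends DataReaderFactory[Row] {

  // Scheduling hint for Spark; it may ignore it, and unknown hostnames are silently dropped
  override def preferredLocations(): Array[String] = hosts

  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private val values = Iterator("a", "b", "c") // toy data for the sketch
    private var current: String = _
    override def next(): Boolean =
      if (values.hasNext) { current = values.next(); true } else false
    override def get(): Row = Row(current)
    override def close(): Unit = ()
  }
}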
References
● http://blog.madhukaraphatak.com/categories/datasource-v2-series
● https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html
● https://issues.apache.org/jira/browse/SPARK-15689
