
Understanding transactional writes in datasource v2



Next Generation Datasource API for Spark 2.0



  1. Transactional Writes in Datasource V2: Next Generation Datasource API for Spark 2.0
  2. ● Madhukara Phatak ● Director of Engineering, Tellius ● Works on Hadoop, Spark, ML and Scala
  3. Agenda ● Introduction to Data Source V2 ● Shortcomings of Datasource Write API ● Anatomy of Datasource V2 Write API ● Per Partition Transaction ● Source Level Transaction ● Partition Affinity
  4. Structured Data Processing
  5. Spark SQL Architecture [diagram: CSV, JSON and JDBC sources feed the Data Source API; the DataFrame API sits on top of it; Spark SQL/HQL and the DataFrame DSL sit above the DataFrame API]
  6. Data Source API ● Universal API for loading/saving structured data ● Built-in support for Hive, Avro, JSON, JDBC and Parquet ● Third-party integration through spark-packages ● Support for smart sources ● Introduced in Spark 1.3 along with DataFrame ● Third-party sources already available ○ CSV ○ MongoDB ○ Cassandra etc.
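     A minimal usage sketch (not from the slides) of the data source API through the DataFrame reader/writer; the format names are built-in, the file paths are placeholders:
        import org.apache.spark.sql.SparkSession

        object DataSourceApiUsage {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("datasource-api-usage")
              .master("local[*]")
              .getOrCreate()

            // Load through a built-in source; the same call shape works for spark-packages sources.
            val df = spark.read
              .format("csv")
              .option("header", "true")
              .load("data/input.csv")        // placeholder path

            // Save through another built-in source.
            df.write
              .format("parquet")
              .mode("overwrite")
              .save("data/output_parquet")   // placeholder path

            spark.stop()
          }
        }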
  7. Shortcomings of V1 API ● Introduced in 1.3 but has not evolved compared to other parts of Spark ● Dependency on high-level APIs like DataFrame, SparkContext etc. ● Lack of support for columnar reads ● Lack of partition awareness ● No transaction support in the write API ● Lack of extensibility
  8. Introduction to Datasource V2 API
  9. V2 API ● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API ● It mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1 ● Currently it is in beta; it will become GA in future releases ● The V1 API will be deprecated ● No user-facing code change is needed to use V2 data sources
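     To illustrate the last point, a hedged spark-shell style sketch: a V2 source is still driven through the same DataFrameWriter API, typically by passing its fully qualified class name to format(); the class name and connection option below are placeholders:
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("v2-write").master("local[*]").getOrCreate()
        import spark.implicits._
        val df = Seq(("a", 1), ("b", 2)).toDF("name", "amount")

        // The only thing that changes for a V2 source is the format string.
        df.write
          .format("com.example.sources.SimpleMysqlSource")   // placeholder V2 implementation
          .option("url", "jdbc:mysql://localhost:3306/test") // placeholder connection option
          .mode("append")
          .save()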
  10. Shortcomings of V1 Write API
  11. No Transaction Support in Write ● The V1 API only supported a generic write interface which was primarily meant for write-once sources like HDFS ● The interface did not have any transactional support, which is needed for sophisticated sources like databases ● For example, when data is partially written to a database and the job aborts, those rows are not cleaned up ● This is not an issue in HDFS because unsuccessful writes are tracked using the _SUCCESS marker file
  12. Anatomy of V2 Write API
  13. Interfaces [diagram: on the master, user code implements WriterSupport, which creates a DataSourceWriter and a DataWriterFactory; the factory creates a DataWriter on each worker]
  14. WriterSupport Interface ● Entry point to the data source ● Has one method: def createWriter(jobId: String, schema: StructType, mode: SaveMode, options: DataSourceOptions): Optional[DataSourceWriter] ● SaveMode and schema are the same as in the V1 API ● Returns Optional so that read-only sources can decline to create a writer
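     A rough sketch of implementing this entry point, assuming the Spark 2.3 interfaces (where the trait is WriteSupport in org.apache.spark.sql.sources.v2); SimpleMysqlDataSourceWriter is the writer sketched under the next slide, and the default URL is a placeholder:
        import java.util.Optional

        import org.apache.spark.sql.SaveMode
        import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
        import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
        import org.apache.spark.sql.types.StructType

        // Entry point of a hypothetical V2 source; Spark calls createWriter once per write job.
        class SimpleMysqlSource extends DataSourceV2 with WriteSupport {

          override def createWriter(jobId: String,
                                    schema: StructType,
                                    mode: SaveMode,
                                    options: DataSourceOptions): Optional[DataSourceWriter] = {
            // A read-only source would return Optional.empty(); this one always supports writes.
            val url = options.get("url").orElse("jdbc:mysql://localhost:3306/test")
            Optional.of(new SimpleMysqlDataSourceWriter(url, schema))
          }
        }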
  15. DataSourceWriter Interface ● Entry point to the writer ● Has three methods ○ def createWriterFactory(): DataWriterFactory[Row] ○ def commit(messages: Array[WriterCommitMessage]) ○ def abort(messages: Array[WriterCommitMessage]) ● Responsible for creating the writer factory ● WriterCommitMessage is the interface for communication between the writers and the driver ● Transactional support is visible throughout the API
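     A matching driver-side sketch under the same assumptions; the factory and writer it refers to are sketched under the following slides, and the staging-table idea in the comments is one possible strategy, not the talk's actual code:
        import org.apache.spark.sql.Row
        import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}
        import org.apache.spark.sql.types.StructType

        // Runs on the driver: creates the factory and decides the global commit or abort.
        class SimpleMysqlDataSourceWriter(url: String, schema: StructType) extends DataSourceWriter {

          override def createWriterFactory(): DataWriterFactory[Row] =
            new SimpleMysqlDataWriterFactory(url, schema)

          // Called once after every partition has reported success, with their commit messages.
          override def commit(messages: Array[WriterCommitMessage]): Unit = {
            // e.g. move rows from a staging table into the final table in one statement
          }

          // Called if any task or the job as a whole fails.
          override def abort(messages: Array[WriterCommitMessage]): Unit = {
            // e.g. drop the staging table so no partial data is visible
          }
        }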
  16. DataWriterFactory Interface ● Follows the factory design pattern to create the actual data writers ● Responsible for creating writers that uniquely identify the different partitions ● It has one method to create a data writer: def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] ● attemptNumber distinguishes retried tasks
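     A sketch of the factory under the same assumptions; the interface is serializable because Spark ships it to the executors, and the writer it creates is sketched under the next slide:
        import org.apache.spark.sql.Row
        import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory}
        import org.apache.spark.sql.types.StructType

        // One DataWriter is created per (partition, attempt); retried tasks get a new attemptNumber.
        class SimpleMysqlDataWriterFactory(url: String, schema: StructType)
            extends DataWriterFactory[Row] {

          override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
            new SimpleMysqlDataWriter(url, partitionId, attemptNumber)
        }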
  17. DataWriter Interface ● Interface responsible for the actual writing of data ● Runs on the worker nodes ● Methods exposed are ○ def write(record: Row) ○ def commit(): WriterCommitMessage ○ def abort() ● Looks very similar to the Hadoop write interface
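     A minimal executor-side sketch under the same assumptions, writing each row immediately over JDBC; the table and column types are placeholders, and there is deliberately no transaction yet (see the transactional variant later in the deck):
        import java.sql.DriverManager

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

        // Message sent back to the driver's commit()/abort(); can carry per-partition details.
        case class PartitionCommit(partitionId: Int, rowCount: Long) extends WriterCommitMessage

        // Runs on a worker and performs the actual writes for one partition.
        class SimpleMysqlDataWriter(url: String, partitionId: Int, attemptNumber: Int)
            extends DataWriter[Row] {

          private val connection = DriverManager.getConnection(url)
          private var rowCount = 0L

          override def write(record: Row): Unit = {
            // "sales" is a placeholder table matching a two-column (String, Int) schema.
            val stmt = connection.prepareStatement("INSERT INTO sales VALUES (?, ?)")
            stmt.setString(1, record.getString(0))
            stmt.setInt(2, record.getInt(1))
            stmt.executeUpdate()
            stmt.close()
            rowCount += 1
          }

          override def commit(): WriterCommitMessage = {
            connection.close()
            PartitionCommit(partitionId, rowCount)
          }

          override def abort(): Unit = {
            // Nothing to roll back: rows were auto-committed, which is exactly the weakness
            // the transactional variant addresses.
            connection.close()
          }
        }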
  18. Observations from the API ● The API doesn't use any high-level APIs like SparkContext, DataFrame etc. ● Transaction support throughout the API ● The write interface is quite simple and can be used for a wide variety of sources ● No more fiddling with RDDs in the data source layer
  19. Simple MySQL Data Source
  20. MySQL Source ● The MySQL source is responsible for writing data using the JDBC API ● Implements all the interfaces discussed earlier ● Has a single partition ● Shows how all the different APIs come together to build a full-fledged source ● Ex: SimpleMysqlWriter.scala
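     One optional finishing touch for such a source (a hedged sketch building on the SimpleMysqlSource sketched earlier, not the talk's actual SimpleMysqlWriter.scala) is registering a short alias via the DataSourceRegister trait:
        import org.apache.spark.sql.sources.DataSourceRegister

        // Registers a short alias (the class must also be listed in
        // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister) so callers can
        // write .format("simple-mysql") instead of the fully qualified class name.
        class SimpleMysqlSourceWithAlias extends SimpleMysqlSource with DataSourceRegister {
          override def shortName(): String = "simple-mysql"
        }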
  21. Transactional Writes
  22. Distributed Writes ● Distributed writing is hard ● There are many reasons a write can fail ○ The connection is dropped ○ Error while writing data for a partition ○ Duplicate data because of retries ● Many of these issues crop up very frequently in Spark applications ● Ex: MysqlTransactionExample.scala
  23. Transactional Support ● The Datasource V2 API has good support for transactions ● Transactions can be implemented at ○ Partition level ○ Job level ● This transaction support helps handle errors from partially written data ● Ex: MysqlWithTransaction
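     A hedged sketch of the partition-level transaction in the spirit of the MysqlWithTransaction example (not its actual code): rows are buffered inside a JDBC transaction and only become visible in commit(), while abort() rolls them back; the staging table name is a placeholder. The job-level half would live in DataSourceWriter.commit()/abort(), e.g. promoting the staging table to the final table:
        import java.sql.DriverManager

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

        case class TxPartitionCommit(partitionId: Int) extends WriterCommitMessage

        // Partition-level transaction: nothing written by this task is visible until commit().
        class TransactionalMysqlDataWriter(url: String, partitionId: Int, attemptNumber: Int)
            extends DataWriter[Row] {

          private val connection = DriverManager.getConnection(url)
          connection.setAutoCommit(false)   // one JDBC transaction for the whole partition

          override def write(record: Row): Unit = {
            val stmt = connection.prepareStatement("INSERT INTO sales_staging VALUES (?, ?)")
            stmt.setString(1, record.getString(0))
            stmt.setInt(2, record.getInt(1))
            stmt.executeUpdate()
            stmt.close()
          }

          // Task succeeded: publish this partition's rows atomically.
          override def commit(): WriterCommitMessage = {
            connection.commit()
            connection.close()
            TxPartitionCommit(partitionId)
          }

          // Task failed or was retried: discard everything this attempt wrote.
          override def abort(): Unit = {
            connection.rollback()
            connection.close()
          }
        }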
  24. Partition Affinity
  25. Partition Locations ● Many data sources today provide native support for partitioning ● These partitions can be distributed over a cluster of machines ● Making Spark aware of this partitioning scheme makes reading much faster ● Works best for co-located data sources
  26. Preferred Locations ● DataReaderFactory exposes the preferredLocations API to send partitioning information to Spark ● This API returns the host names of the machines where the partition is available ● Spark uses it only as a hint; it may not honour it ● If we return a hostname that Spark does not recognise, it just ignores it ● Spark stores this information in the underlying RDD ● Ex: SimpleDataSourceWithPartitionAffinity.scala
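     A hedged read-side sketch, assuming the Spark 2.3 reader interfaces (DataReaderFactory/DataReader in org.apache.spark.sql.sources.v2.reader); the host name and the dummy rows are placeholders:
        import org.apache.spark.sql.Row
        import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

        // Factory for one partition of a co-located source.
        class AffinityAwareReaderFactory(partitionId: Int) extends DataReaderFactory[Row] {

          // Scheduling hint only: Spark tries to run the task on one of these hosts but may not.
          override def preferredLocations(): Array[String] =
            Array(s"worker-$partitionId.example.internal")   // placeholder host name

          override def createDataReader(): DataReader[Row] = new DataReader[Row] {
            private var remaining = 3                         // pretend the partition has three rows
            override def next(): Boolean = remaining > 0
            override def get(): Row = { remaining -= 1; Row(partitionId, remaining) }
            override def close(): Unit = ()
          }
        }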
  27. References ● /datasource-v2-series ● ducing-apache-spark-2-3.html ● -15689