Introduction to Datasource V2 API

A brief introduction to the Datasource V2 API in Spark 2.3.0 and a comparison with the previous Datasource API.

  1. Introduction to Data Source V2 API: Next Generation Datasource API for Spark 2.0. https://github.com/phatak-dev/spark2.0-examples
  2. ● Madhukara Phatak ● Technical Lead at Tellius ● Works on Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  3. Agenda ● Structured Data Processing ● Shortcomings of Datasource API ● Goals of Datasource V2 ● Anatomy of Datasource V2 ● In-Memory Datasource ● Mysql Data Source
  4. Structured Data Processing
  5. Spark SQL Architecture (diagram): CSV, JSON, and JDBC sources sit below the Data Source API, which feeds the DataFrame API; on top sit Spark SQL/HQL and the DataFrame DSL.
  6. Spark SQL Components ● Data Source API: universal API for loading/saving structured data ● DataFrame API: higher-level representation for structured data ● SQL interpreter and optimizer: express data transformations in SQL ● SQL service: Hive thrift server
  7. Data Source API ● Universal API for loading/saving structured data ● Built-in support for Hive, Avro, JSON, JDBC, Parquet ● Third-party integration through spark-packages ● Support for smart sources ● Introduced in Spark 1.3 along with DataFrame ● Third parties already supporting: ○ CSV ○ MongoDB ○ Cassandra (in the works), etc.
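  For illustration, a minimal sketch of what the universal load/save entry points look like in user code; the file paths and app name are placeholders:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("datasource-v1-usage")
        .getOrCreate()

      // load structured data through the Data Source API (built-in JSON source)
      val df = spark.read.format("json").load("/tmp/input.json")

      // save it back out through a different built-in source
      df.write.format("parquet").save("/tmp/output.parquet")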
  8. Datasource V1 API ● Introduced in Spark 1.3 ● Spark has evolved quite a bit since 1.3: ○ Custom memory management ○ Dataset abstraction ○ Structured Streaming ● The Datasource API has not evolved along with new versions of Spark
  9. Difficulty in Evolving the Datasource API ● The data source is the lowest-level abstraction in structured processing and talks to the data sources directly ● Datasources are often written by third-party vendors to connect to different sources ● They are not updated as often as Spark's own code, which makes changing the API challenging ● So the data source API remained the same while the rest of Spark changed quite a bit
  10. Shortcomings of V1 API
  11. Dependency on High-Level API ● Data sources are the lowest-level abstraction in the stack ● The Datasource V1 API depended on high-level, user-facing abstractions like SQLContext and DataFrame: def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation ● As Spark evolved, these abstractions got deprecated and replaced by better ones
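  A minimal sketch of a V1 source, assuming the standard RelationProvider and TableScan contracts from org.apache.spark.sql.sources; the class name SimpleV1Provider and the hard-coded rows are illustrative only. Note how the entry point is handed a high-level SQLContext and how the scan is forced to return RDD[Row]:

      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{Row, SQLContext}
      import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
      import org.apache.spark.sql.types.{StringType, StructField, StructType}

      // hypothetical V1 source: the entry point receives a high-level SQLContext
      class SimpleV1Provider extends RelationProvider {
        override def createRelation(
            ctx: SQLContext, // the SQLContext parameter from the V1 contract
            parameters: Map[String, String]): BaseRelation =
          new BaseRelation with TableScan {
            override def sqlContext: SQLContext = ctx
            override def schema: StructType =
              StructType(Seq(StructField("value", StringType)))
            // buildScan must return RDD[Row], i.e. row-oriented reads only
            override def buildScan(): RDD[Row] =
              ctx.sparkContext.parallelize(Seq(Row("a"), Row("b")))
          }
      }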
  12. Dependency on High-Level API ● But as the data source API was not changed, it was still stuck with the old abstractions ● Having low-level abstractions depend on high-level abstractions is not a good idea
  13. Lack of Support for Columnar Read def buildScan(): RDD[Row] ● From the above API, it is apparent that the API reads data in row format ● But many analytics sources today are columnar in nature ● If an underlying columnar source is read into row format, it loses all the performance benefits
  14. Lack of Partition and Sorting Info ● Many data sources distribute data using partitioning over multiple machines ● In the Datasource V1 API there was no way to share partition locality information with the Spark engine ● This resulted in random reads from the Spark engine with a lot of network traffic ● Spark's built-in sources solved this issue using internal APIs ● But all third-party connectors suffered from it
  15. No Transaction Support in Write ● The V1 API only supported a generic write interface which was primarily meant for write-once sources like HDFS ● The interface did not have any transactional support, which is needed for sophisticated sources like databases ● For example, when data is partially written to a database and the job aborts, those rows are not cleaned up ● This is not an issue in HDFS because successful writes are marked with a _SUCCESS file, so partial output can be detected
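  To make the gap concrete, here is a hedged sketch of the V1 write hook, CreatableRelationProvider; the class name and body are illustrative. There is a single createRelation call and no commit or abort callback, so the source never hears about a failed job:

      import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
      import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
      import org.apache.spark.sql.types.StructType

      // hypothetical V1 write provider, shown only to illustrate the missing hooks
      class SimpleV1WriteProvider extends CreatableRelationProvider {
        override def createRelation(
            ctx: SQLContext,
            mode: SaveMode,
            parameters: Map[String, String],
            data: DataFrame): BaseRelation = {
          // the rows of `data` would be written out here; if a task dies midway,
          // nothing calls back into this source to clean up the partial output
          new BaseRelation {
            override def sqlContext: SQLContext = ctx
            override def schema: StructType = data.schema
          }
        }
      }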
  16. Limited Extensibility ● The Datasource V1 API only supports filter pushdown and column pruning ● But many smart sources, i.e. data sources with processing power of their own, have more capabilities than that ● These sources can do sorting, aggregation, etc. at the source level itself ● Currently the data source API doesn't have a good mechanism to push more Catalyst expressions down to the underlying source
  17. Introduction to Datasource V2 API
  18. V2 API ● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API ● The V2 API mimics the simplicity of the Hadoop input/output layers while still keeping all the powerful features of V1 ● Currently it is in beta; it will become GA in a future release ● The V1 API will then be deprecated ● No user-facing code changes are needed to use V2 data sources
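  For example, reading from a V2 source uses the same DataFrameReader call as before; the fully qualified class name below is a placeholder for whatever V2 implementation is on the classpath:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("v2-read")
        .getOrCreate()

      // same spark.read entry point; only the format string points at a V2 class
      val df = spark.read
        .format("com.example.datasourcev2.SimpleDataSource") // placeholder class name
        .load()

      df.show()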
  19. Goals ● No dependency on high-level abstractions like SparkContext, DataFrame, etc. ● Java friendly ● Support for filter pushdown and pruning ● Support for partition locality ● Support for both columnar and row-based reading ● Transaction support in the write interface ● Get rid of internal APIs like HadoopFsRelation
  20. Anatomy of the V2 Read API
  21. Java API ● Being Java friendly is one of the goals of the V2 API ● So all the base interfaces of the V2 API are defined as Java interfaces ● This makes interoperability with Java much easier compared to the V1 API ● It is a little painful in Scala to deal with the Java models ● The next slides cover all the different interfaces in the read path
  22. Interfaces (diagram): user code talks to the DataSourceReader on the master, which creates DataReaderFactory instances; each worker runs its own DataReader.
  23. DataSourceReader Interface ● Entry point to the data source ● Created through the ReadSupport interface ● Has two methods: ○ def readSchema(): StructType ○ def createDataReaderFactories(): List[DataReaderFactory] ● Responsible for schema inference and for creating the data reader factories ● What might the list indicate here?
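  A sketch of a DataSourceReader for a single-partition in-memory source, assuming the Spark 2.3 interfaces in org.apache.spark.sql.sources.v2.reader; SimpleDataReaderFactory is sketched under the next slide. The list answers the question above: one factory per partition of the data.

      import java.util.{Arrays, List => JList}

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
      import org.apache.spark.sql.types.{StringType, StructField, StructType}

      class SimpleDataSourceReader extends DataSourceReader {
        // schema inference: hard-coded here to a single string column
        override def readSchema(): StructType =
          StructType(Seq(StructField("value", StringType)))

        // one DataReaderFactory per partition; a single-element list means one partition
        override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
          Arrays.asList[DataReaderFactory[Row]](new SimpleDataReaderFactory)
      }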
  24. DataReaderFactory Interface ● Follows the factory design pattern to create the actual data reader ● This code runs on the master and may be responsible for code common across the data readers, such as JDBC connection handling ● It has one method to create a data reader: def createDataReader(): DataReader
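  Continuing the sketch, the factory pairs with the reader above; it is created on the driver and shipped to the executors, so it has to be serializable. The preferredLocations override is an assumption about the 2.3 interface, included only to show where partition locality would be advertised:

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

      class SimpleDataReaderFactory extends DataReaderFactory[Row] {
        // advertise partition locality to the scheduler (empty: no preference)
        override def preferredLocations(): Array[String] = Array.empty

        // called on the executor to create the reader that does the actual work
        override def createDataReader(): DataReader[Row] = new SimpleDataReader
      }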
  25. DataReader Interface ● The interface responsible for the actual reading of data ● Runs on the worker nodes ● Should be serializable ● The methods exposed are: def next(): Boolean def get(): T ● Looks very similar to the Iterator interface
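  And the reader itself, completing the chain; a sketch that runs on the worker and walks a small in-memory array with the iterator-like next()/get() pair:

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.sources.v2.reader.DataReader

      class SimpleDataReader extends DataReader[Row] {
        private val values = Array("spark", "datasource", "v2")
        private var index = 0

        // next() says whether there is more data; get() returns the current record
        override def next(): Boolean = index < values.length

        override def get(): Row = {
          val row = Row(values(index))
          index += 1
          row
        }

        override def close(): Unit = () // nothing to release for an in-memory array
      }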
  26. Observations from the API ● The API doesn't use any high-level abstractions like SparkContext, DataFrame, etc. ● The return type of DataReader is T, which suggests it can support both row and columnar reads ● The reader interface is quite simple and can be used for a wide variety of sources ● No more fiddling with RDDs in the data source layer
  27. In-Memory Data Source
  28. Simple In-Memory Data Source ● An in-memory source that reads the data from an in-memory array ● Implements all the interfaces discussed earlier ● Has a single partition ● Shows how all the different APIs come together to build a full-fledged source ● Ex: SimpleDataSource.scala
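  The missing piece is the entry point class that ties the three sketches above together; a sketch in the spirit of SimpleDataSource.scala, not a copy of it:

      import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
      import org.apache.spark.sql.sources.v2.reader.DataSourceReader

      // the DataSourceV2 marker plus ReadSupport makes the class loadable
      // through spark.read.format(...)
      class SimpleDataSource extends DataSourceV2 with ReadSupport {
        override def createReader(options: DataSourceOptions): DataSourceReader =
          new SimpleDataSourceReader
      }

  It would then be loaded with spark.read.format(classOf[SimpleDataSource].getName).load().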
  29. Multiple Partitions ● In this example, we extend our simple in-memory data source to have multiple partitions ● In the code, we have multiple data reader factories compared to the single one in the earlier example ● In the data reader code, we track each partition using its start and end ● Mimics the HDFS InputFormat ● Ex: SimpleMultiDataSource.scala
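  A sketch of how the factory list changes for multiple partitions, with each reader tracking its own start and end index; names and ranges are illustrative, not taken from SimpleMultiDataSource.scala:

      import java.util.{Arrays, List => JList}

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
      import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

      class SimpleMultiDataSourceReader extends DataSourceReader {
        override def readSchema(): StructType =
          StructType(Seq(StructField("value", IntegerType)))

        // two partitions: [0, 5) and [5, 10), mimicking HDFS input splits
        override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
          Arrays.asList[DataReaderFactory[Row]](
            new RangeReaderFactory(0, 5),
            new RangeReaderFactory(5, 10))
      }

      class RangeReaderFactory(start: Int, end: Int) extends DataReaderFactory[Row] {
        override def createDataReader(): DataReader[Row] = new RangeDataReader(start, end)
      }

      // each reader only walks its own [start, end) slice
      class RangeDataReader(start: Int, end: Int) extends DataReader[Row] {
        private var current = start
        override def next(): Boolean = current < end
        override def get(): Row = { val row = Row(current); current += 1; row }
        override def close(): Unit = ()
      }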
  30. Filter Push
  31. Filter Push with Mysql Datasource ● Filter pushdown allows Spark SQL filters to be pushed down to the data source ● Smart sources, like relational databases, have the capability to filter efficiently themselves ● In this example, filter pushdown is implemented for a MySQL source ● The source uses the JDBC interface to communicate with MySQL ● Ex: SimpleMysqlDataSource.scala
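  A sketch of the pushdown hook only, assuming the SupportsPushDownFilters mix-in from the 2.3 reader package; the JDBC and query-building parts of SimpleMysqlDataSource.scala are left out, and only equality filters are claimed here for illustration:

      import org.apache.spark.sql.sources.{EqualTo, Filter}
      import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters

      // mixed into a DataSourceReader implementation for the MySQL source
      trait MysqlFilterPushdown extends SupportsPushDownFilters {
        private var pushed: Array[Filter] = Array.empty

        override def pushFilters(filters: Array[Filter]): Array[Filter] = {
          // keep the filters we can translate into a WHERE clause (equality only here);
          // return the rest so Spark still applies them after the scan
          val (supported, unsupported) = filters.partition {
            case EqualTo(_, _) => true
            case _             => false
          }
          pushed = supported
          unsupported
        }

        // Spark asks which filters were actually pushed, for query planning/explain
        override def pushedFilters(): Array[Filter] = pushed
      }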
  32. References ● http://blog.madhukaraphatak.com/categories/datasource-v2-series ● https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html ● https://issues.apache.org/jira/browse/SPARK-15689
