As a general-purpose computing engine, Spark can process data from a variety of data management and storage systems, including HDFS, Hive, Cassandra, and Kafka. For flexibility and high throughput, Spark defines the Data Source API, an abstraction of the storage layer. The Data Source API has two requirements:
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features introduced in Spark 2.3. This talk dives into the design and implementation of Data Source API V2 and compares it with Data Source API V1. We also demonstrate how to implement a file-based data source using Data Source API V2, showing its generality and flexibility.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
1. Data Source API V2
Wenchen Fan
2018-6-6 | SF | Spark + AI Summit
2. Databricks’ Unified Analytics Platform
[Platform overview diagram: Databricks Runtime (Delta, SQL, Streaming) powered by Apache Spark, collaborative notebooks, and a cloud-native service for data engineers and data scientists]
• Unifies data engineers and data scientists
• Unifies data and AI technologies
• Eliminates infrastructure complexity
4. What is Data Source API?
• Hadoop: InputFormat/OutputFormat
• Hive: SerDe
• Presto: Connector
…….
Defines how to read/write data from/to a storage system.
5. Ancient Age: Custom RDD
HadoopRDD/CassandraRDD/HBaseRDD/…
rdd.mapPartitions { iter =>
  // custom logic to write to external storage
}
This worked well in the ancient ages, when users wrote Spark applications directly with the RDD API.
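For example, a hand-rolled write with this pattern might look roughly like the sketch below (StoreClient and ExternalStore are hypothetical stand-ins for a real client library such as an HBase or Cassandra driver):

  // Hypothetical client API, standing in for a real storage connector.
  trait StoreClient { def put(key: String, value: String): Unit; def close(): Unit }
  object ExternalStore { def connect(): StoreClient = ??? }

  import org.apache.spark.rdd.RDD

  def writeToStore(rdd: RDD[(String, String)]): Unit = {
    rdd.foreachPartition { iter =>
      val client = ExternalStore.connect()   // one connection per partition
      try {
        iter.foreach { case (key, value) => client.put(key, value) }
      } finally {
        client.close()
      }
    }
  }

Every data source had to reinvent this plumbing, and the logic lived in the application rather than behind a common API.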
7. How to read data?
• How to read data concurrently and in a distributed way? (the custom-RDD approach only satisfies this one)
• How to skip reading data by filters?
• How to speed up certain operations? (aggregate, limit, etc.)
• How to convert data using Spark’s data encoding?
• How to report extra information to Spark? (data statistics, data partitioning, etc.)
• Structured Streaming support
• …
8. How to write data?
• How to write data concurrently and in a distributed way? (the custom-RDD approach only satisfies this one)
• How to make the write operation atomic?
• How to clean up if a write fails?
• How to convert data using Spark’s data encoding?
• Structured Streaming support
• …
16. Data Source API V1
buildScan(limit)
buildScan(limit, requiredCols)
buildScan(limit, filters)
buildScan(limit, requiredCols, filters)
...
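For context, the V1 scan interfaces in org.apache.spark.sql.sources already follow this one-trait-per-combination pattern; a simplified sketch (not the complete API):

  abstract class BaseRelation {
    def sqlContext: SQLContext
    def schema: StructType
  }

  trait TableScan {
    def buildScan(): RDD[Row]                                                          // full scan
  }

  trait PrunedScan {
    def buildScan(requiredColumns: Array[String]): RDD[Row]                            // column pruning
  }

  trait PrunedFilteredScan {
    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]    // pruning + filter pushdown
  }

Supporting one more pushdown, such as limit, would multiply the number of buildScan variants again, which is exactly the explosion shown above.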
17. Data Source API V1
Cons:
• Coupled with other APIs. (SQLContext, RDD, DataFrame)
• Hard to push down other operators.
• Hard to add different data encoding. (columnar scan)
19. Data Source API V1
Cons:
• Coupled with other APIs. (SQLContext, RDD, DataFrame)
• Hard to push down other operators.
• Hard to add different data encoding. (columnar scan)
• Hard to implement writing.
21. Data Source API V1
Cons:
• Coupled with other APIs. (SQLContext, RDD, DataFrame)
• Hard to push down other operators.
• Hard to add different data encoding. (columnar scan)
• Hard to implement writing.
• No streaming support
22. How to read data?
• How to read data concurrently and distributedly?
• How to skip reading data by filters?
• How to speed up certain operations?
• How to convert data using Spark’s data encoding?
• How to report extra information to Spark?
• Structured Streaming support
23. How to write data?
• How to write data concurrently and distributedly?
• How to make the write operation atomic?
• How to clean up if a write fails?
• How to convert data using Spark’s data encoding?
• Structured Streaming support
30. Read Process
[Diagram] On the Spark driver, the query plan generated by the user has a leaf data scan node, which creates a DataSourceReader; the Spark executors then read the data from the external storage.
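As a reference point, the reader-side interfaces look roughly like this in the Spark 2.3/2.4 revision of the API (a simplified sketch based on the org.apache.spark.sql.sources.v2.reader package; exact signatures vary by release):

  class MyDataSource extends DataSourceV2 with ReadSupport {
    // Called on the driver when the scan node is planned.
    override def createReader(options: DataSourceOptions): DataSourceReader = ...
  }

  class MyReader extends DataSourceReader {
    // The schema of the data this source returns.
    override def readSchema(): StructType = ...
    // One InputPartition per Spark task; each is serialized and sent to an executor.
    override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] = ...
  }

  class MyInputPartition extends InputPartition[InternalRow] {
    // Runs on the executor: creates the reader for this split.
    override def createPartitionReader(): InputPartitionReader[InternalRow] = ...
  }

Optional mix-in traits such as SupportsPushDownFilters and SupportsPushDownRequiredColumns (used later in this deck) extend DataSourceReader with pushdown capabilities.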
48. Write Process
[Diagram] Spark driver, Spark executors, and external storage. On each executor a DataWriter writes its records; if it succeeds, it commits its task and sends a CommitMessage back to the driver, which then commits the whole write.
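The writer-side interfaces follow the same split between driver and executors; a simplified sketch of the Spark 2.3/2.4 API (details omitted):

  class MyWriter extends DataSourceWriter {
    // Driver side: create a serializable factory that is shipped to every task.
    override def createWriterFactory(): DataWriterFactory[InternalRow] = ...
    // Driver side: called once all tasks have succeeded, with one commit message per task.
    override def commit(messages: Array[WriterCommitMessage]): Unit = ...
    // Driver side: called if the job failed, to clean up whatever the tasks produced.
    override def abort(messages: Array[WriterCommitMessage]): Unit = ...
  }

  class MyDataWriter extends DataWriter[InternalRow] {
    override def write(record: InternalRow): Unit = ...   // executor side, called per record
    override def commit(): WriterCommitMessage = ...      // task succeeded: report back to the driver
    override def abort(): Unit = ...                      // task failed: clean up this task's output
  }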
50. Write Process
[Diagram] If a DataWriter hits an exception, it aborts its task and the exception is propagated to the driver, which then aborts the whole write and cleans up.
56. Streaming Data Source API V2
Structured Streaming Deep Dive:
https://tinyurl.com/y9bze7ae
Continuous Processing in Structured Streaming:
https://tinyurl.com/ydbdhxbz
57. Ongoing Improvements
• Catalog support: standardize the DDL logical plans, proxy DDL commands to the data source, integrate the data source catalog. (SPARK-24252)
• More operator pushdown: limit pushdown, aggregate
pushdown, join pushdown, etc. (SPARK-22388, SPARK-22390,
SPARK-24130, ...)
61. Databricks’ Unified Analytics Platform
[Platform overview diagram: Databricks Runtime (Delta, SQL, Streaming) powered by Apache Spark, collaborative notebooks, and a cloud-native service for data engineers and data scientists]
• Unifies data engineers and data scientists
• Unifies data and AI technologies
• Eliminates infrastructure complexity
62. About this talk
• Part II of the Apache Spark Data Source V2 session.
• See Wenchen’s talk above for the background and design details.
• How to implement a Parquet data source with the V2 API.
70. Prune partition columns
[Diagram] The Events dataset is a directory partitioned by year (year=2018, year=2017, year=2016, year=2015), each partition directory containing Parquet files. Inside a Parquet file the rows are split into row groups (0..N); each row group stores the columns city, timestamp, OS, browser, and so on, and the file footer holds the metadata.

  spark
    .read
    .parquet("/data/events")
    .where("city = 'San Francisco' and year = 2018")
    .select("timestamp")
    .collect()

Because year is a partition column, the filter year = 2018 prunes away every other year=... directory before any file is opened.
71. Skip row groups
[Diagram] Same layout as above. The Parquet footer also records per-row-group column statistics (such as min/max values), so row groups whose city range cannot contain 'San Francisco' are skipped without being read.

  spark
    .read
    .parquet("/data/events")
    .where("city = 'San Francisco' and year = 2018")
    .select("timestamp")
    .collect()
72. pseudo-code
class ParquetDataSource extends DataSourceReader with SupportsPushDownFilters {
  var partitionFilters: Array[Filter] = Array.empty
  var dataFilters: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // A filter is a partition filter if it references only partition columns (e.g. "year").
    val (pf, df) = filters.partition(_.references.toSet.subsetOf(partitionColumns))
    partitionFilters = pf
    dataFilters = df
    // Data filters only skip whole row groups, so for the selected row groups
    // Spark still needs to evaluate them; return them as not fully pushed.
    dataFilters
  }

  override def pushedFilters(): Array[Filter] = partitionFilters ++ dataFilters
}
// To be continued in #planInputPartitions
74. pseudo-code
class ParquetDataSource extends DataSourceReader with SupportsPushDownFilters
    with SupportsPushDownRequiredColumns {

  // Spark tells us which columns the query actually needs (e.g. only "timestamp"),
  // so the reader can skip the other column chunks entirely.
  var requiredSchema: StructType = _

  override def pruneColumns(requiredSchema: StructType): Unit = {
    this.requiredSchema = requiredSchema
  }
}
// To be continued in #planInputPartitions
75. Goal
• Understand data and skip unneeded data
• Split files into partitions for parallel read
76. Partitions of same size
[Diagram] The files on HDFS (File 0, File 1, File 2) are split into Spark partitions of roughly equal size (Partition 0, Partition 1, Partition 2); partition boundaries do not have to line up with file boundaries.
77. Driver: plan input partitions
[Diagram] Step 1: on the Spark driver, the selected files are split into input partitions of roughly equal size (Partition 0, Partition 1, Partition 2).
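Continuing the pseudo-code, this planning step could look roughly like the sketch below (listFiles, splitFiles, maxSplitBytes, and ParquetInputPartition are illustrative helpers inside ParquetDataSource, not the real implementation):

  import scala.collection.JavaConverters._   // for .asJava

  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] = {
    // 1. Prune partition directories using the pushed partition filters (e.g. year = 2018).
    val selectedFiles = listFiles(partitionFilters)
    // 2. Split the remaining files into chunks of roughly equal size, one chunk per Spark task.
    val chunks = splitFiles(selectedFiles, maxSplitBytes)
    // 3. Each input partition carries the pruned schema and the data filters,
    //    so the executor-side reader can skip columns and row groups.
    chunks.map { chunk =>
      new ParquetInputPartition(chunk, requiredSchema, dataFilters): InputPartition[InternalRow]
    }.asJava
  }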
83. Query example
data = spark.read.parquet("/data/events")
.where("city = 'San Francisco' and year = 2018")
.select("timestamp")
data.write.parquet("/data/results")
88. Files should be isolated between jobs
[Diagram] Directory layout: results/_temporary/<job id>/ — each job writes under its own job-id directory inside _temporary.
89. Task output is also temporary
[Diagram] Directory layout: results/_temporary/<job id>/_temporary/
90. Files should be isolated between tasks
[Diagram] Directory layout: results/_temporary/<job id>/_temporary/<task attempt id>/<parquet files> — each task attempt writes its Parquet files under its own task-attempt directory.
92. File layout
[Diagram] Under results/_temporary/<job id>/:
• _temporary/<task attempt id>/<parquet files> — in-progress task attempts
• <task id>/<parquet files> — committed task output
94. File layout
[Diagram] Same layout as above. On task abort, the task attempt's output path under _temporary/ is deleted.
98. Almost transactional
Spark stages the output files in a temporary location.
• On commit: move the staged files to their final locations.
• On abort: delete the staged files.
The window of failure is small.
See Eric Liang’s talk at Spark Summit 2017.
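Putting the layout and the V2 write API together, a task-level writer could be sketched roughly as follows (a hedged sketch only: the path scheme is simplified from the slides and the helpers are not the real Spark Parquet committer):

  import org.apache.hadoop.fs.{FileSystem, Path}

  class ParquetTaskWriter(fs: FileSystem, jobPath: Path, taskId: String, attemptId: String)
      extends DataWriter[InternalRow] {

    // Every record is written into this attempt's private staging directory.
    private val stagingDir   = new Path(jobPath, s"_temporary/$taskId-$attemptId")
    private val committedDir = new Path(jobPath, taskId)

    override def write(record: InternalRow): Unit = {
      // append the record to an open Parquet file under stagingDir (omitted)
    }

    override def commit(): WriterCommitMessage = {
      // Task succeeded: promote the staging directory to the committed task directory.
      fs.rename(stagingDir, committedDir)
      new WriterCommitMessage {}   // carries whatever the driver-side commit needs
    }

    override def abort(): Unit = {
      // Task failed: delete everything this attempt staged.
      fs.delete(stagingDir, true)
    }
  }

On the driver, DataSourceWriter.commit then moves the committed task directories out of results/_temporary/<job id>/ into results/, and DataSourceWriter.abort deletes the whole job directory; only the final moves fall outside this protection, which is why the scheme is "almost transactional".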