Flight SQL is a revolutionary new open database protocol designed for modern architectures. Its key features include a column-oriented design and native support for parallel processing of data partitions. This talk covers how these features can push SQL query throughput beyond existing standards such as ODBC.
3. Introduction to Apache Arrow
■ A columnar, in-memory data format and supporting libraries
■ Supported in many languages including C++, Java, Python, Go
■ Data is strongly typed. Every column has a fixed type, so all rows share the same schema.
■ Includes libraries for working with the format:
● Computation engine (Acero) utilizing SIMD operations for vectorized data analysis.
● Interprocess communication.
● Serialization / deserialization from file formats.
■ Fully open source with a permissive license.
4. Arrow powers dozens of open source & commercial technologies
■ 10+ programming languages supported
6. Why is Arrow Flight Needed?
■ An open protocol that the community can support.
■ Designed for data in the modern world
● Older protocols are row-oriented and geared towards large numbers of columns and low numbers of rows.
● Arrow’s columnar format is oriented towards high compressibility and large numbers of rows.
■ Supports distributed computing as a client-side concept
● A data request can return multiple endpoints to a client.
● The client can retrieve from each endpoint in parallel.
7. Arrow Flight
■ Protocol for serialization-free transport of Arrow data
● This is particularly efficient if the client application will just work with Arrow data directly.
Diagram, Status Quo (JDBC/ODBC Connector): serializing/deserializing data at each step. DATABASE (Column Based) → Convert → Row Based → Convert → CLIENT (Column Based)
Diagram, Arrow Way (Arrow Flight Connector transporting data in Arrow format): data is sent, transported and received in the Arrow format. DATABASE (Column Based) → Column Based → CLIENT (Column Based)
8. Distributed Computing: Single Node with Arrow Flight
Diagram: a CLIENT and a single Coordinator / Executor node exchange three messages.
1 - GetFlightInfo(<query>) (client to server)
2 - FlightInfo<Schema, Endpoints> (server to client)
3 - DoGet(<ticket>) (client to server)
Endpoint = {location, ticket}
9. Distributed Computing: Multiple Nodes with Arrow Flight
Diagram: the CLIENT sends a separate DoGet(<ticket>) to each of Node 1, Node 2, ... Node N.
Omitting GetFlightInfo call...
10. Arrow Flight as a Development Framework
■ Includes a fully-built client library
■ Includes a high-performance, scalable server
● Built on top of Google’s gRPC technology and compatible with existing tooling.
● Server implementation details such as thread-pooling, asynchronous IO, request cancellation are already implemented.
■ Server deployment is a matter of implementing a few RPC request handlers.
12. Why Extend Arrow Flight?
■ Client sends a byte stream, server sends a result
● The content of the byte stream is opaque in the interface.
● It only has meaning for a particular server.
● Example - Dremio interprets the byte stream to be a UTF-8 encoded SQL query.
■ Catalog information is not part of Arrow Flight’s design
● There is no RPC call to describe how to build the byte stream the client sends.
● Generic tools cannot be built.
■ Arrow Flight is meant to serve any tabular data from any source.
■ ODBC/JDBC standardize query execution and catalog access, but have drawbacks.
■ Enter Arrow Flight SQL.
13. What is Arrow Flight SQL?
■ Initiative to allow databases to use Arrow Flight as the transport protocol
● Leverage the performance of Arrow and Flight for database access.
■ Extended set of RPC calls to standardize a SQL interface on Flight
● Query execution
● Prepared statements
● Database catalog metadata
● SQL syntax capabilities
■ Generic client libraries
● A Flight SQL application can be used against any Flight SQL server without code changes.
● ODBC and JDBC clients provided on top.
14. Common Tool Workflow
Diagram: a CLIENT and a SERVER exchange the following messages.
Listing tables
1 - GetFlightInfo(GetTables)
2 - FlightInfo<Schema, Endpoints>
3 - DoGet(<ticket>)
4 - Arrow record batches
Retrieving query data
5 - GetFlightInfo(StatementExecute)
6 - FlightInfo<Schema, Endpoints>
7 - DoGet(<ticket>)
15. Flight SQL vs. Legacy
Legacy (ODBC / JDBC)
■ Each database vendor must implement, maintain, and distribute a driver.
■ Each database vendor must implement their entire server.
■ Implementation details may be closed source.
■ Protocol is usually proprietary.
Flight SQL
■ Single client that works against any Flight SQL server.
■ Server implementation is part of Flight. Only RPC handlers need to be implemented.
■ Flight and Arrow components are open and the community is actively improving them.
■ Protocol is open and integrates with gRPC and Arrow tooling.
16. Flight SQL Status
■ Initial version released with Arrow 7.0.0
● Includes support for C++ and Java clients and servers
■ Enhancements to column and data type metadata have been accepted into more recent versions of Arrow.
■ Support for transactions and query cancellation has been accepted.
■ Open for contributions
● Support for additional languages (Python, Go, C#, etc.).
● More features such as small result enhancements.
17. Flight SQL Status
■ JDBC Driver
● Connect legacy JDBC applications to databases with the Flight SQL protocol with no code changes.
● Examples: DBeaver, DbVisualizer
● Merged into the Apache Arrow master branch. To be released in Arrow 10.0.0
■ ODBC Driver
● Released by Dremio.
● Connect ODBC applications such as Tableau, pyodbc, Power BI to Flight SQL-enabled databases.
19. Practical Example: pyodbc vs. PyArrow
● PyArrow is columnar
■ Consume columnar data returned via Arrow Flight without deserialization costs.
● pyodbc is row-oriented
■ All data values must be converted to scalars to expose to the Python application.
■ This process incurs significant deserialization costs.
20. Practical Example: pyodbc vs. PyArrow
● Comparison: 500,000 rows queried from a remote server. (No parallelism).
■ pyodbc: 8.00 s. PyArrow: 0.900 s (roughly 9x faster).
21. Query Execution: pyodbc vs. PyArrow
pyodbc (ODBC)
# assumes `connection` is an open pyodbc connection and `sql` is a query string
cursor = connection.cursor()
cursor.execute(sql)
data = cursor.fetchall()
PyArrow (Arrow Flight SQL)
from pyarrow import flight
# assumes `client` is a connected flight.FlightClient and `headers` holds the call headers
options = flight.FlightCallOptions(headers=headers)
descriptor = flight.FlightDescriptor.for_command(sql)
flight_info = client.get_flight_info(descriptor, options)
reader = client.do_get(flight_info.endpoints[0].ticket, options)
data = reader.read_all()  # read every record batch from this endpoint into one table
■ ODBC requires all data to be retrieved from a single entry point (the cursor in the above example).
■ Arrow Flight lets the server expose multiple endpoints that host separate partitions of the data. Data can be retrieved in parallel and even from separate processes or client nodes.
22. Arrow Client Design Tips
■ Minimize copying of data.
■ Avoid manual calculations on data.
● Prefer library calls using the Compute library to analyze data (for example, arithmetic or aggregation on Arrow data).
● Arrow libraries use SIMD instructions for high-performance calculations!
■ Arrow provides fast file serialization to JSON, CSV, Parquet, ORC, and uncompressed Arrow files. Avoid serializing Arrow data by hand.