Brought to you by
High-speed Database Throughput Using Apache Arrow Flight SQL
Kyle Porter
Architect at Dremio
James Duong
Architect at Dremio
Introduction to Arrow Flight
Introduction to Apache Arrow
■ A columnar, in-memory data format and supporting libraries
■ Supported in many languages including C++, Java, Python, Go
■ Data is strongly typed. Each row has the same schema.
■ Includes libraries for working with the format:
● Computation engine (Acero) utilizing SIMD operations for vectorized data analysis.
● Interprocess communication.
● Serialization / deserialization from file formats.
■ Fully open source with a permissive license.
Apache Arrow Adoption
■ Arrow powers dozens of open source & commercial technologies
■ 10+ programming languages supported
■ >70M downloads per month
Why is Arrow Flight Needed?
■ An open protocol that the community can support.
■ Designed for data in the modern world
● Older protocols are row-oriented and geared towards large numbers of columns and small numbers of rows.
● Arrow’s columnar format is oriented towards high compressibility and large numbers of rows.
■ Supports distributed computing as a client-side concept
● A data request can return multiple endpoints to a client.
● The client can retrieve from each endpoint in parallel.
Arrow Flight
■ Protocol for serialization-free transport of Arrow data
● This is particularly efficient if the client application will just work with Arrow data directly.
[Diagram] Status Quo: Serializing/deserializing data at each step. A column-based database converts its data to a row-based form for the JDBC/ODBC connector, and the column-based client converts it back.
[Diagram] Arrow Way: Data is sent, transported, and received in the Arrow format. An Arrow Flight connector carries column-based data from the database to the client with no conversion step.
Distributed Computing: Single Node with Arrow Flight
[Diagram: a client and a single coordinator/executor node, each with CPU and memory]
1 - GetFlightInfo(<query>)
2 - FlightInfo<Schema, Endpoints>
3 - DoGet(<ticket>)
Endpoint = {location, ticket}
Distributed Computing: Multiple Nodes with Arrow Flight
[Diagram: a client and Node 1, Node 2, ..., Node N, each with CPU and memory; the GetFlightInfo call is omitted]
The client issues one DoGet(<ticket>) per node.
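The request flow above (one GetFlightInfo call, then one DoGet per endpoint) can be sketched with the standard library alone. This is a minimal illustration, not the Arrow Flight API: `Endpoint`, `PARTITIONS`, `get_flight_info`, `do_get`, `fetch_all`, the node names, and the port number are all hypothetical stand-ins, and the "cluster" is an in-memory dictionary. A real client would make the corresponding calls through a Flight client library.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Endpoint:
    location: str  # where the partition can be fetched from
    ticket: bytes  # opaque token identifying the partition


# Hypothetical stand-in for data spread across three nodes: ticket -> values.
PARTITIONS: Dict[bytes, List[int]] = {
    b"part-0": [1, 2, 3],
    b"part-1": [4, 5],
    b"part-2": [6, 7, 8, 9],
}


def get_flight_info(query: str) -> List[Endpoint]:
    """Stand-in for GetFlightInfo: return one endpoint per partition."""
    return [
        Endpoint(location=f"grpc://node-{i}:32010", ticket=ticket)
        for i, ticket in enumerate(sorted(PARTITIONS), start=1)
    ]


def do_get(endpoint: Endpoint) -> List[int]:
    """Stand-in for DoGet: return the partition named by the ticket."""
    return PARTITIONS[endpoint.ticket]


def fetch_all(query: str) -> List[int]:
    endpoints = get_flight_info(query)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # The calls run concurrently; pool.map still yields results
        # in endpoint order, so the partitions are stitched back in order.
        chunks = list(pool.map(do_get, endpoints))
    return [value for chunk in chunks for value in chunk]


rows = fetch_all("SELECT x FROM t")
print(rows)
```

Because each endpoint carries its own location and ticket, the fetches are independent of one another, which is what lets a client spread them across threads, processes, or machines.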
Arrow Flight as a Development Framework
■ Includes a fully-built client library
■ Includes a high-performance, scalable server
● Built on top of Google’s gRPC technology and compatible with existing tooling.
● Server implementation details such as thread-pooling, asynchronous IO, request cancellation
are already implemented.
■ Server deployment is a matter of implementing a few RPC request handlers.
Flight SQL Enhancements for Arrow Flight
Why Extend Arrow Flight?
■ The client sends a byte stream and the server sends back a result
● The content of the byte stream is opaque in the interface.
● It only has meaning for a particular server.
● Example - Dremio interprets the byte stream to be a UTF-8 encoded SQL query.
■ Catalog information is not part of Arrow Flight’s design
● There is no RPC call to describe how to build the byte stream the client sends.
● Generic tools cannot be built.
■ Arrow Flight is meant to serve any tabular data from any source.
■ ODBC/JDBC standardize query execution and catalog access, but have
drawbacks.
■ Enter Arrow Flight SQL.
What is Arrow Flight SQL?
■ Initiative to allow databases to use Arrow Flight as the transport protocol
● Leverage the performance of Arrow and Flight for database access.
■ Extended set of RPC calls to standardize a SQL interface on Flight
● Query execution
● Prepared statements
● Database catalog metadata
● SQL syntax capabilities
■ Generic client libraries
● A Flight SQL application can be used against any Flight SQL server without code changes.
● ODBC and JDBC clients provided on top.
Common Tool Workflow
[Diagram: a client and a server, each with CPU and memory; the server answers the calls below through its GetTables, Execute, and DoGet handlers]
Listing tables
1 - GetFlightInfo(GetTables)
2 - FlightInfo<Schema, Endpoints>
3 - DoGet(<ticket>)
4 - Arrow record batches
Retrieving query data
5 - GetFlightInfo(StatementExecute)
6 - FlightInfo<Schema, Endpoints>
7 - DoGet(<ticket>)
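The workflow above uses the same pair of calls for catalog requests and for queries, which is what allows a generic tool to work against any server. The sketch below illustrates that shape with the standard library only. Everything in it is an assumption made for illustration: `ToyServer`, `run`, and the `GetTables` and `StatementExecute` classes simply mirror the labels on the slide and are not the Flight SQL message types or API, and the toy "SQL" handling understands nothing beyond `SELECT * FROM <table>`.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class GetTables:
    """Catalog request (step 1 on the slide)."""


@dataclass(frozen=True)
class StatementExecute:
    """Query request (step 5 on the slide)."""
    sql: str


Command = Union[GetTables, StatementExecute]
Columns = Dict[str, list]  # column name -> column values


class ToyServer:
    """Hypothetical in-memory server; only the RPC handlers are written."""

    def __init__(self, tables: Dict[str, Columns]) -> None:
        self._tables = tables
        self._results: Dict[bytes, Columns] = {}

    def get_flight_info(self, command: Command) -> bytes:
        """Run the command, park the columnar result, return a ticket."""
        if isinstance(command, GetTables):
            result: Columns = {"table_name": sorted(self._tables)}
        elif isinstance(command, StatementExecute):
            # Toy parsing: treat the last word of the query as the table name.
            result = self._tables[command.sql.rsplit(" ", 1)[-1]]
        else:
            raise TypeError(f"unsupported command: {command!r}")
        ticket = f"ticket-{len(self._results)}".encode()
        self._results[ticket] = result
        return ticket

    def do_get(self, ticket: bytes) -> Columns:
        """Hand back the result that the ticket refers to."""
        return self._results.pop(ticket)


def run(server: ToyServer, command: Command) -> Columns:
    # The same two calls serve catalog and query requests alike.
    return server.do_get(server.get_flight_info(command))


server = ToyServer({"orders": {"id": [1, 2], "total": [9.5, 20.0]}})
tables: List[str] = run(server, GetTables())["table_name"]
data = run(server, StatementExecute(f"SELECT * FROM {tables[0]}"))
print(tables, data)
```

The client code in `run` never needs to know which kind of command it is sending, so a tool can list tables and then query one of them without any server-specific logic.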
Flight SQL vs. Legacy
Legacy (ODBC / JDBC)
■ Each database vendor must implement,
maintain, and distribute a driver.
■ Each database vendor must implement their
entire server.
■ Implementation details may be closed source.
■ Protocol is usually proprietary.
Flight SQL
■ Single client that works against any Flight SQL
server.
■ Server implementation is part of Flight. Only
RPC handlers need to be implemented.
■ Flight and Arrow components are open and the
community is actively improving them.
■ Protocol is open and integrates with gRPC and
Arrow tooling.
Flight SQL Status
■ Initial version released with Arrow 7.0.0
● Includes support for C++ and Java clients and servers
■ Enhancements to column and data type metadata have been accepted into
more recent versions of Arrow.
■ Support for transactions and query cancellation has been accepted.
■ Open for contributions
● Support for additional languages (Python, Go, C#, etc.).
● More features such as small result enhancements.
Flight SQL Status
■ JDBC Driver
● Connect legacy JDBC applications to databases with the Flight SQL protocol with no code changes.
● Examples: DBeaver, DbVisualizer
● Merged into the apache/arrow master branch. To be released in Arrow 10.0.0
■ ODBC Driver
● Released by Dremio.
● Connect ODBC applications such as Tableau, pyodbc, and Power BI to Flight SQL-enabled databases.
Performance
Practical Example: pyodbc vs. PyArrow
● PyArrow is columnar
■ Consume columnar data returned using Arrow Flight without deserialization costs.
● pyodbc is row-oriented
■ All data values must be converted to scalars to expose to the Python application.
■ This process incurs significant deserialization costs.
Practical Example: pyodbc vs. PyArrow
● Comparison: 500,000 rows queried from a remote server. (No parallelism).
■ pyodbc: 8.00s. PyArrow: 0.900s.
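For a sense of scale, the two timings above can be turned into throughput and a speedup factor. The snippet below is plain arithmetic on the figures quoted on the slide; the variable names are illustrative only.

```python
ROWS = 500_000
timings_s = {"pyodbc": 8.00, "PyArrow": 0.900}  # figures quoted on the slide

# Rows delivered per second by each client.
throughput = {name: ROWS / seconds for name, seconds in timings_s.items()}

# How many times faster the PyArrow run finished.
speedup = timings_s["pyodbc"] / timings_s["PyArrow"]

for name, rows_per_s in throughput.items():
    print(f"{name}: {rows_per_s:,.0f} rows/s")
print(f"speedup: {speedup:.1f}x")
```

That works out to 62,500 rows/s for pyodbc against roughly 555,556 rows/s for PyArrow, a speedup of about 8.9x on a single connection.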
Query Execution: pyodbc vs. PyArrow
pyodbc (ODBC)
# connection: an open pyodbc connection; sql: the query text
cursor = connection.cursor()
cursor.execute(sql)
data = cursor.fetchall()
PyArrow (Arrow Flight SQL)
from pyarrow import flight
# client: a connected flight.FlightClient; headers: per-call headers such as authentication
options = flight.FlightCallOptions(headers=headers)
descriptor = flight.FlightDescriptor.for_command(sql)
flight_info = client.get_flight_info(descriptor, options)
reader = client.do_get(flight_info.endpoints[0].ticket, options)
data = reader.read_all()  # read_chunk() would return only the next record batch
■ ODBC requires all data to be retrieved from a single entry point (the cursor in the above example).
■ Arrow Flight lets the server expose multiple endpoints that host separate partitions of the data. Data
can be retrieved in parallel and even from separate processes or client nodes.
Arrow Client Design Tips
■ Minimize copying of data.
■ Avoid manual calculations on data.
● Prefer library calls using the Compute library to analyze data (for
example, arithmetic or aggregation on Arrow data).
● Arrow libraries use SIMD instructions for high-performance calculations!
■ Arrow provides fast file serialization to JSON, CSV, Parquet, ORC, and
uncompressed Arrow files. Avoid serializing Arrow data by hand.
References
■ Arrow Flight SQL Announcement:
https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/
■ Arrow Flight SQL ODBC Driver: https://github.com/dremio/flightsql-odbc and
https://github.com/dremio/warpdrive
■ Arrow Flight SQL JDBC Driver:
https://github.com/apache/arrow/tree/master/java/flight/flight-sql-jdbc-driver
■ Arrow Flight SQL JDBC Driver Improvements:
https://issues.apache.org/jira/browse/ARROW-17729
Brought to you by
Kyle Porter
kporter@dremio.com
James Duong
jduong@dremio.com

More Related Content

What's hot

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 

What's hot (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 

Similar to High-speed Database Throughput Using Apache Arrow Flight SQL

The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightDatabricks
 
Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...
Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...
Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...Anant Corporation
 
Hands on with CoAP and Californium
Hands on with CoAP and CaliforniumHands on with CoAP and Californium
Hands on with CoAP and CaliforniumJulien Vermillard
 
Module 1: ConfD Technical Introduction
Module 1: ConfD Technical IntroductionModule 1: ConfD Technical Introduction
Module 1: ConfD Technical IntroductionTail-f Systems
 
Building Your First Apache Apex Application
Building Your First Apache Apex ApplicationBuilding Your First Apache Apex Application
Building Your First Apache Apex ApplicationApache Apex
 
Building your first aplication using Apache Apex
Building your first aplication using Apache ApexBuilding your first aplication using Apache Apex
Building your first aplication using Apache ApexYogi Devendra Vyavahare
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
 
LCU14 310- Cisco ODP v2
LCU14 310- Cisco ODP v2LCU14 310- Cisco ODP v2
LCU14 310- Cisco ODP v2Linaro
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight OverviewJacques Nadeau
 
Asp.net and .Net Framework ppt presentation
Asp.net and .Net Framework ppt presentationAsp.net and .Net Framework ppt presentation
Asp.net and .Net Framework ppt presentationabhishek singh
 
20180503 kube con eu kubernetes metrics deep dive
20180503 kube con eu   kubernetes metrics deep dive20180503 kube con eu   kubernetes metrics deep dive
20180503 kube con eu kubernetes metrics deep diveBob Cotton
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Cockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with ElixirCockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with ElixirHideki Takase
 
Byte Ordering - Unit 2.pptx
Byte Ordering - Unit 2.pptxByte Ordering - Unit 2.pptx
Byte Ordering - Unit 2.pptxRockyBhai46825
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Introduction to Backend Engineering
Introduction to Backend EngineeringIntroduction to Backend Engineering
Introduction to Backend EngineeringUdayYadav90
 
.NET Core Today and Tomorrow
.NET Core Today and Tomorrow.NET Core Today and Tomorrow
.NET Core Today and TomorrowJon Galloway
 

Similar to High-speed Database Throughput Using Apache Arrow Flight SQL (20)

The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...
Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...
Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for ...
 
Hands on with CoAP and Californium
Hands on with CoAP and CaliforniumHands on with CoAP and Californium
Hands on with CoAP and Californium
 
Module 1: ConfD Technical Introduction
Module 1: ConfD Technical IntroductionModule 1: ConfD Technical Introduction
Module 1: ConfD Technical Introduction
 
Building Your First Apache Apex Application
Building Your First Apache Apex ApplicationBuilding Your First Apache Apex Application
Building Your First Apache Apex Application
 
Building your first aplication using Apache Apex
Building your first aplication using Apache ApexBuilding your first aplication using Apache Apex
Building your first aplication using Apache Apex
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
LCU14 310- Cisco ODP v2
LCU14 310- Cisco ODP v2LCU14 310- Cisco ODP v2
LCU14 310- Cisco ODP v2
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Asp.net and .Net Framework ppt presentation
Asp.net and .Net Framework ppt presentationAsp.net and .Net Framework ppt presentation
Asp.net and .Net Framework ppt presentation
 
20180503 kube con eu kubernetes metrics deep dive
20180503 kube con eu   kubernetes metrics deep dive20180503 kube con eu   kubernetes metrics deep dive
20180503 kube con eu kubernetes metrics deep dive
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Cockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with ElixirCockatrice: A Hardware Design Environment with Elixir
Cockatrice: A Hardware Design Environment with Elixir
 
Byte Ordering - Unit 2.pptx
Byte Ordering - Unit 2.pptxByte Ordering - Unit 2.pptx
Byte Ordering - Unit 2.pptx
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Introduction to Backend Engineering
Introduction to Backend EngineeringIntroduction to Backend Engineering
Introduction to Backend Engineering
 
.NET Core Today and Tomorrow
.NET Core Today and Tomorrow.NET Core Today and Tomorrow
.NET Core Today and Tomorrow
 

More from ScyllaDB

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDBScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsScyllaDB
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101ScyllaDB
 

More from ScyllaDB (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
High-speed Database Throughput Using Apache Arrow Flight SQL

  • 1. Brought to you by High-speed Database Throughput Using Apache Arrow Flight SQL Kyle Porter Architect at Dremio James Duong Architect at Dremio
  • 3. Introduction to Apache Arrow ■ A columnar, in-memory data format and supporting libraries ■ Supported in many languages including C++, Java, Python, Go ■ Data is strongly typed. Each row has the same schema. ■ Includes libraries for working with the format: ● Computation engine (Acero) utilizing SIMD operations for vectorized data analysis. ● Interprocess communication. ● Serialization / deserialization from file formats. ■ Fully open source with a permissive license.
  • 4. Arrow powers dozens of open source & commercial technologies 10+ programming languages supported
  • 6. Why is Arrow Flight Needed? ■ An open protocol that the community can support. ■ Designed for data in the modern world ● Older protocols are row oriented and geared towards large numbers of columns and low numbers of rows. ● Arrow’s columnar format is oriented towards high compressibility and large numbers of rows. ■ Supports distributed computing as a client-side concept ● A data request can return multiple endpoints to a client. ● The client can retrieve from each endpoint in parallel.
  • 7. Arrow Way: Data is sent, transported and received in the Arrow format Arrow Flight ■ Protocol for serialization-free transport of Arrow data ● This is particularly efficient if the client application will just work with Arrow data directly. DATABASE Column Based DATABASE Column Based Convert CLIENT Column Based Convert CLIENT Column Based JDBC/ODBC Connector Arrow Flight Connector transporting data in Arrow Format Status Quo: Serializing/Deserializing data at each step Row Based Column Based
  • 8. Distributed Computing: Single Node with Arrow Flight Coordinator / Executor CLIENT CPU memory 1 - GetFlightInfo(<query>) 2 - FlightInfo<Schema, Endpoints> 3 - DoGet(<ticket>) Endpoint = {location, ticket} CPU memory
  • 9. Distributed Computing: Multiple Nodes with Arrow Flight CLIENT Node 2 Node N Node 1 CPU memory CPU memory CPU memory CPU memory DoGet(<ticket>) DoGet(<ticket>) DoGet(<ticket>) Omitting GetFlightInfo call...
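The parallel retrieval described in slides 8 and 9 can be sketched as follows. This is a minimal illustration, not code from the talk: `fetch_endpoints_in_parallel` and its `fetch_one` parameter are hypothetical names, and the sketch assumes each endpoint can be read independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_endpoints_in_parallel(endpoints, fetch_one, max_workers=4):
    """Fetch every endpoint concurrently and return results in endpoint order.

    `fetch_one` is a caller-supplied (hypothetical) function that performs
    the DoGet call for one endpoint and returns whatever it read.
    """
    if not endpoints:
        return []
    workers = min(max_workers, len(endpoints))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, even when the
        # underlying calls finish out of order.
        return list(pool.map(fetch_one, endpoints))
```

With PyArrow, `endpoints` would typically be `flight_info.endpoints`, and `fetch_one` could be something like `lambda ep: client.do_get(ep.ticket).read_all()`, assuming a single client can reach every endpoint. When endpoints advertise different locations, `fetch_one` would need to connect to the endpoint's own location first.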
  • 10. Arrow Flight as a Development Framework ■ Includes a fully-built client library ■ Includes a high-performance, scalable server ● Built on top of Google’s gRPC technology and compatible with existing tooling. ● Server implementation details such as thread-pooling, asynchronous IO, request cancellation are already implemented. ■ Server deployment is a matter of implementing a few RPC request handlers.
  • 12. Why Extend Arrow Flight? ■ Client sends a byte stream, server sends a result ● The content of the byte stream is opaque in the interface. ● It only has meaning for a particular server. ● Example - Dremio interprets the byte stream to be a UTF-8 encoded SQL query. ■ Catalog information is not part of Arrow Flight’s design ● There is no RPC call to describe how to build the byte stream the client sends. ● Generic tools cannot be built. ■ Arrow Flight is meant to serve any tabular data from any source. ■ ODBC/JDBC standardize query execution and catalog access, but have drawbacks. ■ Enter Arrow Flight SQL.
  • 13. What is Arrow Flight SQL? ■ Initiative to allow databases to use Arrow Flight as the transport protocol ● Leverage the performance of Arrow and Flight for database access. ■ Extended set of RPC calls to standardize a SQL interface on Flight ● Query execution ● Prepared statements ● Database catalog metadata ● SQL syntax capabilities ■ Generic client libraries ● A Flight SQL application can be used against any Flight SQL server without code changes. ● ODBC and JDBC clients provided on top.
  • 14. Common Tool Workflow SERVER 2 - FlightInfo<Schema, Endpoints> 1 - GetFlightInfo(GetTables) GetTables 4 - Arrow record batches 3 - DoGet(<ticket>) DoGet 6 - FlightInfo<Schema, Endpoints> 5 - GetFlightInfo(StatementExecute) Execute 7 - DoGet(<ticket>) DoGet CPU memory Listing tables Retrieving query data CLIENT CPU memory
  • 15. Flight SQL vs. Legacy Legacy (ODBC / JDBC) ■ Each database vendor must implement, maintain, and distribute a driver. ■ Each database vendor must implement their entire server. ■ Implementation details may be closed source. ■ Protocol is usually proprietary. Flight SQL ■ Single client that works against any Flight SQL server. ■ Server implementation is part of Flight. Only RPC handlers need to be implemented. ■ Flight and Arrow components are open and the community is actively improving them. ■ Protocol is open and integrates with gRPC and Arrow tooling.
  • 16. Flight SQL Status ■ Initial version released with Arrow 7.0.0 ● Includes support for C++ and Java clients and servers ■ Enhancements to column and data type metadata have been accepted into more recent versions of Arrow. ■ Support for transactions and query cancellation has been accepted. ■ Open for contributions ● Support for additional languages (Python, Go, C#, etc.). ● More features such as small result enhancements.
  • 17. Flight SQL Status ■ JDBC Driver ● Connect legacy JDBC applications to databases with the Flight SQL protocol with no code changes. ■ Examples: DBeaver, DbVisualizer ● Merged into Apache/master. To be released in Arrow 10.0.0 ■ ODBC Driver ● Released by Dremio. ● Connect ODBC applications such as Tableau, pyodbc, Power BI to Flight SQL-enabled databases.
  • 19. Practical Example: pyodbc vs. PyArrow ● PyArrow is columnar ■ Consume columnar data returned using Arrow Flight without deserialization costs. ● pyodbc is row-oriented ■ All data values must be converted to scalars to expose them to the Python application. ■ This process incurs significant deserialization costs.
  • 20. Practical Example: pyodbc vs. PyArrow ● Comparison: 500,000 rows queried from a remote server. (No parallelism). ■ pyodbc: 8.00s. PyArrow: 0.900s.
  • 21. Query Execution: pyodbc vs. PyArrow
      pyodbc (ODBC):
          cursor = connection.cursor()
          cursor.execute(sql)
          data = cursor.fetchall()
      PyArrow (Arrow Flight SQL):
          options = flight.FlightCallOptions(headers=headers)
          descriptor = flight.FlightDescriptor.for_command(sql)
          flight_info = client.get_flight_info(descriptor, options)
          reader = client.do_get(flight_info.endpoints[0].ticket, options)
          data = reader.read_chunk()
      ■ ODBC requires all data to be retrieved from a single entry point (the cursor in the above example).
      ■ Arrow Flight lets the server expose multiple endpoints that host separate partitions of the data. Data can be retrieved in parallel and even from separate processes or client nodes.
  • 22. Arrow Client Design Tips ■ Minimize copying of data. ■ Avoid manual calculations on data. ● Prefer library calls using the Compute library to analyze data (for example, arithmetic or aggregation on Arrow data). ● Arrow libraries use SIMD instructions for high-performance calculations! ■ Arrow provides fast file serialization to JSON, CSV, Parquet, ORC, and uncompressed Arrow files. Avoid serializing Arrow data by hand.
  • 23. References ■ Arrow Flight SQL Announcement: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/ ■ Arrow Flight SQL ODBC Driver: https://github.com/dremio/flightsql-odbc and https://github.com/dremio/warpdrive ■ Arrow Flight SQL JDBC Driver: https://github.com/apache/arrow/tree/master/java/flight/flight-sql-jdbc-driver ■ Arrow Flight SQL JDBC Driver Improvements: https://issues.apache.org/jira/browse/ARROW-17729
  • 24. Brought to you by Kyle Porter kporter@dremio.com James Duong jduong@dremio.com