Successfully reported this slideshow.
Your SlideShare is downloading. ×

Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for High-Performance Data Transfers from Data

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 26 Ad

Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for High-Performance Data Transfers from Data

Download to read offline

This lunch covers why ODBC & JDBC don’t cut it in today’s data world and the problems solved by Arrow, Arrow Flight, and Arrow Flight SQL. Alex will go through how each of these building blocks works as well as an overview of universal ODBC & JDBC drivers built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.

Sign Up For Our Newsletter: http://eepurl.com/grdMkn

Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers...

Cassandra.Link:
https://cassandra.link/

Follow Us and Reach Us At:

Anant:
https://www.anant.us/

Awesome Cassandra:
https://github.com/Anant/awesome-cass...

Email:
solutions@anant.us

LinkedIn:
https://www.linkedin.com/company/anant/

Twitter:
https://twitter.com/anantcorp

Eventbrite:
https://www.eventbrite.com/o/anant-10...

Facebook:
https://www.facebook.com/AnantCorp/

Join The Anant Team:
https://www.careers.anant.us

This lunch covers why ODBC & JDBC don’t cut it in today’s data world and the problems solved by Arrow, Arrow Flight, and Arrow Flight SQL. Alex will go through how each of these building blocks works as well as an overview of universal ODBC & JDBC drivers built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.

Sign Up For Our Newsletter: http://eepurl.com/grdMkn

Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers...

Cassandra.Link:
https://cassandra.link/

Follow Us and Reach Us At:

Anant:
https://www.anant.us/

Awesome Cassandra:
https://github.com/Anant/awesome-cass...

Email:
solutions@anant.us

LinkedIn:
https://www.linkedin.com/company/anant/

Twitter:
https://twitter.com/anantcorp

Eventbrite:
https://www.eventbrite.com/o/anant-10...

Facebook:
https://www.facebook.com/AnantCorp/

Join The Anant Team:
https://www.careers.anant.us

Advertisement
Advertisement

More Related Content

More from Anant Corporation (20)

Advertisement

Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for High-Performance Data Transfers from Data

  1. 1. Confidential - Do Not Share or Distribute Apache Arrow Flight SQL A universal standard for high performance data transfers from databases Alex Merced Developer Advocate @ Dremio
  2. 2. Confidential - Do Not Share or Distribute Agenda ● What if we could have for our data transfers… ● What options do we have? ● What is Apache Arrow? Is it enough? ● What is Arrow Flight? Is it enough? ● What is Arrow Flight SQL? Is it enough? ● What are the Arrow Flight SQL ODBC & JDBC Drivers?
  3. 3. Confidential - Do Not Share or Distribute What if we could have for our data transfers… ✓ Standardized APIs ✓ No need for many different drivers ✓ High performance transfers ✓ Parallel data transfers (1:many, many:many) ✓ Easy to implement on server side, easy to use on client side
  4. 4. Confidential - Do Not Share or Distribute ● Built in 1992 & 1997, respectively ○ Inherently row-based ○ Prevented the many:many of client-to-database-APIs ○ But resulted in one:many of API-to-database-protocols Why don’t ODBC & JDBC fit in today’s analytics world? Today Yesterday ● Today: heterogeneous big data workloads ● Most tools involved with OLAP workloads today are columnar ○ But there is no columnar transfer standard
  5. 5. Confidential - Do Not Share or Distribute Increasingly, columnar → columnar is the standard Yesterday Today
  6. 6. Confidential - Do Not Share or Distribute ● Custom protocols ○ One per server ○ Implemented individually by each client What options do we have to keep things columnar?
  7. 7. Confidential - Do Not Share or Distribute Pick your poison Enter: Apache Arrow Flight JDBC/ODBC Custom protocol Standardize client interface ✅ ❌ Standardize client database access interface ✅ ❌ Standardize server interface ❌ ❌ Prevent unnecessary serialization ❌ ✅
  8. 8. Confidential - Do Not Share or Distribute What if? Enter: Arrow Flight
  9. 9. Confidential - Do Not Share or Distribute Before we go there: Apache Arrow
  10. 10. Confidential - Do Not Share or Distribute Introduction to Apache Arrow ● An in-memory columnar format ○ Includes libraries for working with the format ■ E.g., computation engine, IPC, serialization / deserialization from file formats. ○ Support for many languages (12 currently) ● Co-created by Dremio and Wes McKinney ● Rapid adoption, popularity, and capability growth
  11. 11. Confidential - Do Not Share or Distribute ● Arrow is primarily geared towards operations on data in a single-node ○ No specification for cross-network transfers ■ Within a cluster ■ Client-server 11 Isn’t Arrow enough? Enter: Arrow Flight
  12. 12. Confidential - Do Not Share or Distribute 12 What is Arrow Flight? ● A general-purpose client-server framework to simplify high performance transport of large datasets over network interfaces ● Designed for data in the modern world ○ Column-oriented, rather than row-oriented ○ Arrow’s columnar format is designed for high compressibility and large numbers of rows ● Makes including the client-side in distributed computing easier ● Enables parallel data transfers
  13. 13. Confidential - Do Not Share or Distribute 13 Parallel data transfers with Arrow Flight Client Node 2 Node N Node 1 CPU memory CPU memory CPU memory CPU memory DoGet(<ticket>) DoGet(<ticket>) DoGet(<ticket>)
  14. 14. Confidential - Do Not Share or Distribute Isn’t Arrow Flight Enough? JDBC/ODBC Custom protocols Arrow Flight Standardize client interface ✅ ❌ ✅ Standardize client database access interface ✅ ❌ ❌ Standardize server interface ❌ ❌ ✅ Prevent unnecessary serialization ❌ ✅ ✅
  15. 15. Confidential - Do Not Share or Distribute Isn’t Arrow Flight Enough? ● Client sends an arbitrary byte stream, server sends a result ○ The byte stream only has meaning for a particular server that’s chosen to support it ○ Example – Dremio interprets the byte stream to be a UTF-8 encoded SQL query string ● Flight is meant to serve any tabular data, not databases in particular ○ Catalog information is not part of Arrow Flight’s design ● Haven’t we seen these problems before? ○ ODBC and JDBC standardize database activities (e.g., query execution, catalog access) Enter: Arrow Flight SQL
  16. 16. Confidential - Do Not Share or Distribute 16 ● Client-server protocol for interacting with SQL databases built on Arrow Flight ○ Allows databases to use Arrow Flight as the transport protocol ○ Leverage the performance of Arrow and Flight for database access ● Set of commands to standardize a SQL interface on Flight: ○ Query execution ○ Prepared statements ○ Database catalog metadata (tables, columns, data types). ○ SQL syntax capabilities ● Language and database-agnostic ○ A single set of client libraries to connect to any Flight SQL server ○ Clients can talk to any database implementing the protocol ○ No need for a client-side driver per database What is Arrow Flight SQL?
  17. 17. Confidential - Do Not Share or Distribute Client Database 2 - FlightInfo<Schema, Endpoints> 1 - GetFlightInfo(GetTables) 4 - Arrow record batches 3 - DoGet(<ticket>) 6 - FlightInfo<Schema, Endpoints> 5 - GetFlightInfo(CommandStatementQuery) 7 - DoGet(<ticket>) Memory Listing Tables: Querying a table: Memory 8 - Arrow record batches
  18. 18. Confidential - Do Not Share or Distribute Queries: ● CommandStatementQuery ● CommandStatementUpdate Prepared statements: ● ActionClosePreparedStatementRequest ● ActionCreatePreparedStatementRequest ● CommandPreparedStatementQuery ● CommandPreparedStatementUpdate Metadata requests: ● CommandGetCatalogs ● CommandGetCrossReference ● CommandGetDbSchemas ● CommandGetExportedKeys ● CommandGetImportedKeys ● CommandGetPrimaryKeys ● CommandGetSqlInfo ● CommandGetTables ● CommandGetTableTypes Arrow Flight SQL Commands
  19. 19. Confidential - Do Not Share or Distribute Isn’t Arrow Flight SQL Enough? JDBC/ODBC Custom protocols Arrow Flight Arrow Flight SQL Standardize client interface ✅ ❌ ✅ ✅ Standardize client database access interface ✅ ❌ ❌ ✅ Standardize server interface ❌ ❌ ✅ ✅ Prevent unnecessary serialization ❌ ✅ ✅ ✅
  20. 20. Confidential - Do Not Share or Distribute ● Arrow Flight SQL is fairly new, will take time to be widely adopted by existing client tools (especially the enterprise BI ones) ○ Most BI tools already support ODBC or JDBC Isn’t Arrow Flight SQL Enough? Enter: Arrow Flight SQL ODBC & JDBC Drivers
  21. 21. Confidential - Do Not Share or Distribute 21 Arrow Flight SQL ODBC & JDBC Drivers ● ODBC & JDBC Drivers built on top of Flight SQL client libraries ○ A single driver to connect to any Flight SQL server ○ Supports arbitrary server-side options as connection properties ● The drivers will work with any ODBC/JDBC client tool without any code changes ● Completely open source
  22. 22. Confidential - Do Not Share or Distribute 22 Traditional ODBC/JDBC vs Arrow Flight SQL ODBC/JDBC Traditional ODBC/JDBC Arrow Flight SQL ODBC/JDBC Driver Management Performance install and manage a driver for each database row-based transfer, so not great for large amounts of data column-based transfer, so benefit from compression of data over the wire install and manage a single driver for all databases
  23. 23. Confidential - Do Not Share or Distribute When wouldn’t I use Arrow Flight ODBC/JDBC Drivers? ● Generally preferable to have native Flight SQL applications ○ Easier to harness future features like multiple endpoints ○ Better performance if your client is column-based Client is row-based Client is column-based Network gRPC Arrow Arrow Flight Arrow Flight SQL JDBC Legacy Client SQL Native Client Non-SQL Client
  24. 24. Confidential - Do Not Share or Distribute What if we could have for our data transfers… ✓ Standardized APIs ○ Arrow Flight for arbitrary tabular transfers, Arrow Flight SQL for transfers from databases ✓ No need for many different drivers ○ Both client and server side APIs are standardized ✓ High performance transfers ○ Transfers via highly optimized Arrow format ✓ Parallel data transfers (1:many, many:many) ○ Lays the foundation for parallel data transfers ✓ Easy to implement on server side, easy to use on client side ○ Server-side implementation is as easy as implementing a few RPC request handlers ○ A single set of client libraries to connect to any Flight SQL server
  25. 25. Confidential - Do Not Share or Distribute Q&A
  26. 26. Confidential - Do Not Share or Distribute Thanks! jason@dremio.com

×