Apache Arrow Flight Overview

Apache Arrow
Apache Arrow Flight
By Jacques Nadeau, PMC Apache Arrow
Why Arrow Flight: Arrow Promises Interoperability
• But its primary medium is in-memory
• Some work to support shared memory in-process
• But not all systems can be collocated
– Especially in a modern K8s/containerized deployment
• Shared memory has other problems:
– Reference management and security are complex
– Different requirements for long-term datasets versus ephemeral datasets
Arrow Needs an RPC layer to simplify the creation of Data Applications
Arrow Messaging Paradigm: Batch Streams
Primary Communication:
• A stream of Arrow Record Batches
• Bulk transfer targeting efficient movement
• Effectively Peer to Peer
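[Diagram: Client and Server. Put: the client sends Header, Data, Data, Data, end and the server replies Thanks. Get: the client sends a Descriptor and the server returns Header, Data, Data, Data, end.]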
Specific Methods:
• Put Stream: Client sends a stream to server
• Get Stream: Server sends a stream to client
• Both Initiated by Client
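
The Put and Get exchanges above can be sketched as a toy, in-process model. This is not the Flight API: the message kinds ("header", "data", "end"), the `ToyServer` class, and the use of a plain string as the descriptor are all assumptions made for illustration, and lists of tuples stand in for real Arrow record batches.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Toy stand-ins: a "schema" is a list of column names, a "batch" is a list of rows.
Schema = List[str]
Batch = List[tuple]
Message = Tuple[str, object]  # ("header", schema) | ("data", batch) | ("end", None)


def encode_stream(schema: Schema, batches: Iterable[Batch]) -> Iterator[Message]:
    """Frame a stream as one header, any number of data messages, one end marker."""
    yield ("header", schema)
    for batch in batches:
        yield ("data", batch)
    yield ("end", None)


def decode_stream(messages: Iterable[Message]) -> Tuple[Schema, List[Batch]]:
    """Consume a framed stream, checking that the framing order is respected."""
    it = iter(messages)
    kind, schema = next(it)
    if kind != "header":
        raise ValueError("stream must start with a header")
    batches = []
    for kind, payload in it:
        if kind == "end":
            return schema, batches
        if kind != "data":
            raise ValueError(f"unexpected message kind: {kind}")
        batches.append(payload)
    raise ValueError("stream ended without an end marker")


class ToyServer:
    """Hypothetical in-process server that stores streams by descriptor."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[Schema, List[Batch]]] = {}

    def put(self, descriptor: str, messages: Iterable[Message]) -> str:
        # Put Stream: the client pushes header + data + end; the server acknowledges.
        self._store[descriptor] = decode_stream(messages)
        return "thanks"

    def get(self, descriptor: str) -> Iterator[Message]:
        # Get Stream: the client sends a descriptor; the server streams data back.
        schema, batches = self._store[descriptor]
        return encode_stream(schema, batches)


# Both calls are initiated by the client side of the exchange.
server = ToyServer()
ack = server.put("sales", encode_stream(["a", "b"], [[(1, 2)], [(3, 4), (5, 6)]]))
schema, batches = decode_stream(server.get("sales"))
```

Generators keep the sketch streaming in spirit: the receiver handles each batch as it arrives rather than waiting for the whole dataset.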
Arrow Messaging Paradigm: Stream Management
[Diagram: a Flight made up of several Streams spread across Location 1 and Location 2; each Endpoint is retrieved with a Ticket.]
• Parallel consumption and locality awareness
– A flight is composed of streams
– Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
– Systems can take advantage of location information to improve data locality
• Flights have two reference systems:
– Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
– Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
• Support for Stream Listing
– ListFlights(Criteria)
– GetFlightInfo(FlightDescriptor)
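
One way to picture the relationship between descriptors, endpoints, and parallel consumption is the following toy model. The class names mirror the terms on this slide, but the fields, the `CATALOG` and `STREAMS` dictionaries, the location strings, and the `fetch` helper are hypothetical; lists of integers stand in for Arrow data.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FlightDescriptor:
    """A flight is referenced either by a dotted path or by an opaque command."""
    path: Optional[Tuple[str, ...]] = None
    command: Optional[bytes] = None

    def __post_init__(self) -> None:
        if (self.path is None) == (self.command is None):
            raise ValueError("set exactly one of path or command")


@dataclass(frozen=True)
class FlightEndpoint:
    ticket: bytes   # opaque to the client; only the server interprets it
    location: str   # where this stream can be consumed


@dataclass(frozen=True)
class FlightInfo:
    descriptor: FlightDescriptor
    endpoints: List[FlightEndpoint]


# The two reference systems from the slide.
SALES = FlightDescriptor(path=("marketing", "yesterday", "sales"))
QUERY = FlightDescriptor(command=b"select a,b from foo where c > 10")

# Hypothetical catalog and per-location stream storage for the toy service.
CATALOG: Dict[FlightDescriptor, FlightInfo] = {
    SALES: FlightInfo(SALES, [
        FlightEndpoint(b"t-0", "node-1:47470"),
        FlightEndpoint(b"t-1", "node-2:47470"),
    ]),
}
STREAMS: Dict[Tuple[str, bytes], List[int]] = {
    ("node-1:47470", b"t-0"): [1, 2, 3],
    ("node-2:47470", b"t-1"): [4, 5],
}


def get_flight_info(descriptor: FlightDescriptor) -> FlightInfo:
    return CATALOG[descriptor]


def fetch(endpoint: FlightEndpoint) -> List[int]:
    # A real client would connect to endpoint.location and redeem the ticket there.
    return STREAMS[(endpoint.location, endpoint.ticket)]


info = get_flight_info(SALES)
with ThreadPoolExecutor(max_workers=len(info.endpoints)) as pool:
    # One worker per endpoint: streams of the same flight are read in parallel.
    parts = list(pool.map(fetch, info.endpoints))
rows = [value for part in parts for value in part]
```

Because each endpoint carries its own location, a scheduler could also place each `fetch` on the machine nearest to that location instead of pulling everything through one client.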
Arrow Messaging Paradigm: Data as a Service Customization
• Arrow Flight also supports a simple generic messaging framework
– Support Customization and Extensibility within the Arrow Flight context
• ListActions()
– Each Data Service can expose actions along with descriptions about what they support
– Each action should describe how to structure the action and corresponding result
– Normal HTTP/2 exceptions can be used to manage error states
• DoAction(Action) => Result
– Generic containers that carry Data Service-specific operations to execute
– Examples might include: forget stream, load stream from disk
• Actions and Results each have:
– ActionType String token
– Body: JSON body of instruction
• Arrow Flight Clients can be written without knowledge of custom Actions/Results
– Lightweight wrappers can be built for Data Services as needed
– Or simply use existing JSON tooling on top of the generic API
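
The ListActions / DoAction pattern can be sketched as a small dispatch table. This is an illustration only: the `ToyDataService` class, the `forget_stream` token, and the shape of its JSON body are assumptions, not part of any published Flight service.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Action:
    type: str    # ActionType string token
    body: bytes  # JSON-encoded instruction


@dataclass(frozen=True)
class Result:
    type: str
    body: bytes  # JSON-encoded outcome


class ToyDataService:
    """Hypothetical service exposing custom actions through a generic API."""

    def __init__(self) -> None:
        self.streams = {"sales": [1, 2, 3]}
        # Each handler takes the decoded JSON body and returns a JSON-able dict.
        self._handlers: Dict[str, Callable[[dict], dict]] = {
            "forget_stream": self._forget_stream,
        }
        self._descriptions = {
            "forget_stream": 'Body: {"name": <stream>}. Result: {"forgotten": <bool>}.',
        }

    def _forget_stream(self, args: dict) -> dict:
        return {"forgotten": self.streams.pop(args["name"], None) is not None}

    def list_actions(self) -> List[dict]:
        # Advertise each action along with how to structure its body and result.
        return [{"type": t, "description": d} for t, d in self._descriptions.items()]

    def do_action(self, action: Action) -> Result:
        handler = self._handlers.get(action.type)
        if handler is None:
            raise KeyError(f"unknown action: {action.type}")
        outcome = handler(json.loads(action.body))
        return Result(action.type, json.dumps(outcome).encode("utf-8"))


service = ToyDataService()
advertised = [a["type"] for a in service.list_actions()]
request = Action("forget_stream", json.dumps({"name": "sales"}).encode("utf-8"))
reply = json.loads(service.do_action(request).body)
```

The client side above needs nothing beyond a JSON library and the action's string token, which is the point of the last bullet: a generic client can drive service-specific operations without a custom wrapper.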
But How? gRPC as a Foundation
• Generic RPC generation framework
• Built on HTTP/2 Standard
• Many language bindings
• Supports security and compression
• Uses Protobuf as primary format
• Designed primarily for application messaging
Extend gRPC To Better Work With Arrow Streams
• Streams are valid Protobuf objects, so systems that don’t have custom processing can still consume Arrow streams
– The entirety of the Arrow RecordBatch is a single length-delimited Protobuf “bytes” field.
• For high-performance situations, do direct byte encoding and one-copy reads/zero-copy writes to avoid extra copies/overhead
– Java Flight implementation cuts through multiple layers to achieve this using currently released gRPC (despite no formal support for it).
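
To make the "single length-delimited bytes field" point concrete, here is a minimal sketch of how Protobuf frames such a field on the wire: a varint tag (field number and wire type 2), a varint length, then the raw payload. The field number 1 and the placeholder payload are assumptions for illustration; the actual Flight message layout is not shown in these slides.

```python
from typing import Tuple

WIRE_TYPE_LEN = 2        # Protobuf wire type for length-delimited fields
BODY_FIELD_NUMBER = 1    # assumed field number, for illustration only


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next position) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_bytes_field(field_number: int, payload: bytes) -> bytes:
    tag = encode_varint((field_number << 3) | WIRE_TYPE_LEN)
    return tag + encode_varint(len(payload)) + payload


def decode_bytes_field(buf: bytes) -> Tuple[int, memoryview]:
    tag, pos = decode_varint(buf, 0)
    if tag & 0x07 != WIRE_TYPE_LEN:
        raise ValueError("not a length-delimited field")
    length, pos = decode_varint(buf, pos)
    # memoryview slices the payload without copying it.
    return tag >> 3, memoryview(buf)[pos:pos + length]


record_batch = bytes(range(200))  # placeholder for serialized record batch bytes
message = encode_bytes_field(BODY_FIELD_NUMBER, record_batch)
field_number, body = decode_bytes_field(message)
```

A generic Protobuf reader sees an ordinary bytes field here, while a specialized reader can locate the payload offset from the tag and length and hand that region to Arrow directly, which is the idea behind avoiding extra copies.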
Check it out
• Arrow Flight Proposal
– https://github.com/jacques-n/arrow
• Example Usage in Dremio Formation
– https://github.com/jacques-n/formation
