Apache Arrow
Apache Arrow Flight
By Jacques Nadeau, PMC Apache Arrow
Why Arrow Flight: Arrow Promises Interoperability
• But its primary medium is in-memory
• Some work to support shared memory in-process
• But not all systems can be collocated
– Especially in a modern K8s/containerized deployment
• Shared memory has other problems:
– Reference management and security are complex
– Different requirements for long-term datasets versus ephemeral datasets
Arrow Needs an RPC layer to simplify the creation of Data Applications
Arrow Messaging Paradigm: Batch Streams
Primary Communication:
• A Stream of Arrow Record Batches
• Bulk transfer targeting efficient movement
• Effectively Peer to Peer
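[Diagram: Client and Server. Put: the client sends Header, Data, Data, Data, end and the server replies "Thanks". Get: the client sends a Descriptor and the server streams back Header, Data, Data, Data, end.]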
Specific Methods:
• Put Stream: Client sends a stream to server
• Get Stream: Server sends a stream to client
• Both Initiated by Client
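
The Put and Get exchanges above can be sketched in a few lines of Python. This is a toy in-memory model, not the Arrow Flight API: `make_stream`, `ToyServer`, and the `(kind, payload)` message shape are hypothetical names chosen for illustration, and plain lists stand in for Arrow Record Batches.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical message shape for this sketch: (kind, payload),
# where kind is "header", "data", or "end".
Message = Tuple[str, object]


def make_stream(schema: List[str], batches: List[list]) -> Iterator[Message]:
    """Frame a batch stream: one header, then the data batches, then an end marker."""
    yield ("header", schema)
    for batch in batches:
        yield ("data", batch)
    yield ("end", None)


class ToyServer:
    """In-memory stand-in for a Flight server (not a real Arrow API)."""

    def __init__(self) -> None:
        self.store: Dict[str, Tuple[List[str], List[list]]] = {}

    def put(self, name: str, stream: Iterable[Message]) -> str:
        """Put Stream: the client pushes a stream; the server acknowledges it."""
        schema: List[str] = []
        batches: List[list] = []
        for kind, payload in stream:
            if kind == "header":
                schema = payload
            elif kind == "data":
                batches.append(payload)
            elif kind == "end":
                break
        self.store[name] = (schema, batches)
        return "thanks"

    def get(self, name: str) -> Iterator[Message]:
        """Get Stream: the client names a stream; the server sends it back."""
        schema, batches = self.store[name]
        return make_stream(schema, batches)


# Both calls are initiated by the client.
server = ToyServer()
ack = server.put("sales", make_stream(["a", "b"], [[1, 2], [3, 4]]))
received = list(server.get("sales"))
```

In both directions the stream has the same framing; only the party that produces it changes.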
Arrow Messaging Paradigm: Stream Management
[Diagram: a Flight whose streams are spread across Location 1 and Location 2; each endpoint is retrieved with its ticket.]
• Parallel consumption and locality awareness
– A flight is composed of streams
– Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
– Systems can take advantage of location information to improve data locality
• Flights have two reference systems:
– Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
– Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
• Support for Stream Listing
– ListFlights(Criteria)
– GetFlightInfo(FlightDescriptor)
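
A minimal sketch of how a client might use this, assuming a toy in-memory catalog rather than the real Flight API. `Descriptor`, `Endpoint`, `FLIGHTS`, `STREAMS`, and the function names are hypothetical; the sketch only illustrates that one descriptor resolves to several endpoints, each with an opaque ticket and a location, which can then be consumed in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Descriptor:
    """Names a flight either by dotted path or by an opaque binary command."""
    path: Optional[Tuple[str, ...]] = None
    command: Optional[bytes] = None


@dataclass(frozen=True)
class Endpoint:
    """One stream of a flight: an opaque ticket plus where to consume it."""
    ticket: bytes
    location: str


# Hypothetical catalog: which endpoints make up each flight.
FLIGHTS: Dict[Descriptor, List[Endpoint]] = {
    Descriptor(path=("marketing", "yesterday", "sales")): [
        Endpoint(ticket=b"t-0", location="location-1"),
        Endpoint(ticket=b"t-1", location="location-2"),
    ],
    Descriptor(command=b"select a,b from foo where c > 10"): [
        Endpoint(ticket=b"t-2", location="location-1"),
    ],
}

# Hypothetical stream contents, keyed by ticket (lists stand in for batches).
STREAMS: Dict[bytes, List[int]] = {b"t-0": [1, 2], b"t-1": [3, 4], b"t-2": [5]}


def get_flight_info(descriptor: Descriptor) -> List[Endpoint]:
    """Resolve a descriptor to the endpoints of its flight."""
    return FLIGHTS[descriptor]


def get_stream(endpoint: Endpoint) -> List[int]:
    """Redeem a ticket; a real client would connect to endpoint.location."""
    return STREAMS[endpoint.ticket]


def read_flight(descriptor: Descriptor) -> List[int]:
    """Fetch every stream of a flight in parallel and concatenate the results."""
    endpoints = get_flight_info(descriptor)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        parts = list(pool.map(get_stream, endpoints))  # map preserves endpoint order
    return [row for part in parts for row in part]


rows = read_flight(Descriptor(path=("marketing", "yesterday", "sales")))
```

A locality-aware client could also group endpoints by `location` and schedule each read on a worker close to that location.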
Arrow Messaging Paradigm: Data as a Service Customization
• Arrow Flight also supports a simple Generic Messaging Framework
– Support Customization and Extensibility within the Arrow Flight context
• ListActions()
– Each Data Service can expose actions along with descriptions about what they support
– Each action should describe how to structure the action and corresponding result
– Normal HTTP/2 exceptions can be used to manage error states
• DoAction(Action) => Result
– Generic containers that carry Data Service-specific operations to execute
– Examples might include: forget stream, load stream from disk
• Actions and Results each have:
– ActionType String token
– Body: JSON body of instruction
• Arrow Flight Clients can be written without knowledge of custom Actions/Results
– Lightweight wrappers can be built for Data Services as needed
– Or simply use existing JSON tooling on top of the generic API
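
The ListActions/DoAction pattern above can be sketched as follows. This is an illustrative toy, not the Flight API: `ToyDataService`, the `forget_stream` action name, and the JSON field names are all assumptions. It shows why a generic client needs only a type token and a JSON body to drive a service-specific operation.

```python
import json
from typing import Dict, List


class ToyDataService:
    """Hypothetical data service exposing one custom action."""

    def __init__(self) -> None:
        # Lists stand in for stored streams.
        self.streams: Dict[str, List[int]] = {"sales": [1, 2, 3]}

    def list_actions(self) -> List[Dict[str, str]]:
        """ListActions(): describe each action and how to structure its body."""
        return [
            {
                "type": "forget_stream",
                "description": 'Drop a stream. Body: {"name": <stream name>}. '
                               'Result: {"removed": <bool>}.',
            }
        ]

    def do_action(self, action_type: str, body: bytes) -> bytes:
        """DoAction(Action) => Result: both sides carry a JSON body."""
        request = json.loads(body)
        if action_type == "forget_stream":
            removed = self.streams.pop(request["name"], None) is not None
            return json.dumps({"removed": removed}).encode("utf-8")
        raise ValueError(f"unknown action type: {action_type}")


# A generic client only needs the type token and ordinary JSON tooling.
service = ToyDataService()
request = json.dumps({"name": "sales"}).encode("utf-8")
result = json.loads(service.do_action("forget_stream", request))
```

A thin wrapper such as a `forget_stream(name)` helper could hide the JSON handling for services that are used often.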
But How? gRPC as a Foundation
• Generic RPC generation framework
• Built on HTTP/2 Standard
• Many language bindings
• Supports security & compression
• Uses Protobuf as primary format
• Designed primarily for application messaging
Extend gRPC To Better Work With Arrow Streams
• Streams are valid Protobuf objects, so systems that don’t have custom processing can still consume Arrow streams
– The entirety of the Arrow RecordBatch is a single length-delimited Protobuf “bytes” field.
• For high-performance situations, do direct byte encoding and one-copy reads/zero-copy writes to avoid extra copies/overhead
– Java Flight implementation cuts through multiple layers to achieve this using currently released gRPC (despite no formal support for it).
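
To make the "single length-delimited bytes field" point concrete, here is a small sketch of the standard Protobuf wire encoding for such a field: a varint tag, a varint length, then the raw payload. The field number 1 and the 300-byte stand-in payload are assumptions for illustration; the actual Flight message layout is not shown in these slides.

```python
from typing import Tuple

WIRE_TYPE_LEN = 2  # Protobuf wire type for length-delimited fields


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def encode_bytes_field(field_number: int, payload: bytes) -> bytes:
    """Frame payload as one length-delimited field: tag, length, raw bytes."""
    tag = (field_number << 3) | WIRE_TYPE_LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def decode_bytes_field(buf: bytes) -> Tuple[int, memoryview]:
    """Return (field number, payload view); slicing a memoryview does not copy."""
    tag, pos = decode_varint(buf, 0)
    if (tag & 0x7) != WIRE_TYPE_LEN:
        raise ValueError("not a length-delimited field")
    length, pos = decode_varint(buf, pos)
    return tag >> 3, memoryview(buf)[pos:pos + length]


# Stand-in for a serialized Arrow RecordBatch (assumed 300 bytes here).
batch_bytes = b"\x00" * 300
framed = encode_bytes_field(1, batch_bytes)
field_number, payload = decode_bytes_field(framed)
```

Because the framing is only a short prefix in front of the payload, a specialized reader can skip generic Protobuf parsing and hand the payload bytes to Arrow directly, while a generic Protobuf consumer still sees an ordinary `bytes` field.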
Check it out
• Arrow Flight Proposal
– https://github.com/jacques-n/arrow
• Example Usage in Dremio Formation
– https://github.com/jacques-n/formation
