2. Apache Arrow
Why Arrow Flight: Arrow Promises Interoperability
• But its primary medium is in-memory
• Some work to support shared memory in-process
• But not all systems can be collocated
– Especially in a modern K8s/containerized deployment
• Shared memory has other problems:
– Reference management and security are complex
– Different requirements for long-term datasets versus
ephemeral datasets
Arrow Needs an RPC layer to simplify the creation of Data Applications
3. Apache Arrow
Arrow Messaging Paradigm: Batch Streams
Primary Communication:
• A stream of Arrow Record Batches
• Bulk transfer targeting efficient movement
• Effectively peer-to-peer
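[Diagram: Client/Server exchange. Put: the client sends Header, Data, Data, Data, end, and the server replies Thanks. Get: the client sends a Descriptor, and the server returns Header, Data, Data, Data, end.]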
Specific Methods:
• Put Stream: Client sends a stream to server
• Get Stream: Server sends a stream to client
• Both Initiated by Client
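The Put/Get exchange above can be sketched with a small in-memory model. This is a minimal illustration, not the Arrow Flight API: `ToyServer`, `put_stream`, `get_stream`, and the tuple-based "schema" and "batch" types are hypothetical stand-ins, and real record batches are columnar buffers rather than lists of rows.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-ins: a "schema" is a tuple of column names and a
# "record batch" is a list of rows. Real Arrow batches are columnar buffers.
Schema = Tuple[str, ...]
Batch = List[tuple]


@dataclass
class ToyServer:
    """In-memory model of a server that stores and serves batch streams."""
    streams: Dict[str, Tuple[Schema, List[Batch]]] = field(default_factory=dict)

    def put_stream(self, descriptor: str, schema: Schema,
                   batches: Iterable[Batch]) -> str:
        # Put: the header (schema) arrives first, then data batches until
        # the client's iterable is exhausted, which models "end".
        self.streams[descriptor] = (schema, list(batches))
        return "Thanks"  # acknowledgement sent after the stream ends

    def get_stream(self, descriptor: str) -> Iterator[Tuple[str, object]]:
        # Get: the client names a stream; the server replies with
        # header, data..., end.
        schema, batches = self.streams[descriptor]
        yield ("header", schema)
        for batch in batches:
            yield ("data", batch)
        yield ("end", None)


# Both calls are initiated by the client side of this sketch.
server = ToyServer()
ack = server.put_stream("sales", ("id", "amount"),
                        [[(1, 9.5)], [(2, 3.0), (3, 7.25)]])
messages = list(server.get_stream("sales"))
rows = [row for kind, payload in messages if kind == "data" for row in payload]
```

Here `messages` holds one header, two data messages, and an end marker, and `rows` flattens the data messages back into the three rows that were put.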
4. Apache Arrow
Arrow Messaging Paradigm: Stream Management
[Diagram: a Flight made up of Streams spread across Location 1 and Location 2; each Endpoint is retrieved with a Ticket]
• Parallel consumption and locality awareness
– A flight is composed of streams
– Each stream has a FlightEndpoint: an opaque stream
ticket along with a consumption location
– Systems can take advantage of location information to
improve data locality
• Flights have two reference systems:
– Dotted path namespace for simple services (e.g.
marketing.yesterday.sales)
– Arbitrary binary command descriptor (e.g. “select a,b
from foo where c > 10”)
• Support for Stream Listing
– ListFlights(Criteria)
– GetFlightInfo(FlightDescriptor)
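One way to picture how descriptors, endpoints, and the listing calls fit together is the toy catalog below. It is a sketch under stated assumptions, not the Flight implementation: every class here is a hypothetical model, the ticket and location values are made up, and treating `criteria` as a dotted-path prefix is an assumption chosen only to make the filter concrete.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FlightDescriptor:
    """A flight is referenced by a dotted path or by an opaque command."""
    path: Optional[Tuple[str, ...]] = None
    command: Optional[bytes] = None

    @staticmethod
    def for_path(dotted: str) -> "FlightDescriptor":
        return FlightDescriptor(path=tuple(dotted.split(".")))

    @staticmethod
    def for_command(command: bytes) -> "FlightDescriptor":
        return FlightDescriptor(command=command)


@dataclass(frozen=True)
class FlightEndpoint:
    ticket: bytes    # opaque to the client; redeemed to fetch one stream
    location: str    # where that stream can be consumed


@dataclass(frozen=True)
class FlightInfo:
    descriptor: FlightDescriptor
    endpoints: Tuple[FlightEndpoint, ...]


class ToyCatalog:
    def __init__(self) -> None:
        self._flights: Dict[FlightDescriptor, FlightInfo] = {}

    def register(self, info: FlightInfo) -> None:
        self._flights[info.descriptor] = info

    def list_flights(self, criteria: str = "") -> List[FlightInfo]:
        # Assumption: criteria is a prefix of the dotted path. Flights that
        # are referenced by command only match the empty criteria.
        result = []
        for info in self._flights.values():
            dotted = ".".join(info.descriptor.path or ())
            if dotted.startswith(criteria):
                result.append(info)
        return result

    def get_flight_info(self, descriptor: FlightDescriptor) -> FlightInfo:
        return self._flights[descriptor]


catalog = ToyCatalog()
sales = FlightDescriptor.for_path("marketing.yesterday.sales")
catalog.register(FlightInfo(sales, (
    FlightEndpoint(b"ticket-0", "location-1"),
    FlightEndpoint(b"ticket-1", "location-1"),
    FlightEndpoint(b"ticket-2", "location-2"),
    FlightEndpoint(b"ticket-3", "location-2"),
)))
query = FlightDescriptor.for_command(b"select a,b from foo where c > 10")
catalog.register(FlightInfo(query, (FlightEndpoint(b"ticket-q", "location-1"),)))

# Group tickets by location so each consumer can read the streams nearest it.
info = catalog.get_flight_info(sales)
by_location: Dict[str, List[bytes]] = {}
for endpoint in info.endpoints:
    by_location.setdefault(endpoint.location, []).append(endpoint.ticket)
```

Grouping tickets by location is what lets a client fan out one reader per location and consume the streams of a flight in parallel.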
5. Apache Arrow
Arrow Messaging Paradigm: Data as a Service Customization
• Arrow Flight also supports a simple generic messaging framework
– Supports customization and extensibility within the Arrow Flight context
• ListActions()
– Each Data Service can expose actions along with descriptions about what they support
– Each action should describe how to structure the action and corresponding result
– Normal HTTP/2 exceptions can be used to manage error states
• DoAction(Action) => Result
– Generic containers that can carry and execute Data Service-specific operations
– Examples might include: forget stream, load stream from disk
• Actions and Results each have:
– ActionType String token
– Body: JSON body of instruction
• Arrow Flight Clients can be written without knowledge of custom Actions/Results
– Lightweight wrappers can be built for Data Services as needed
– Or simply use existing JSON tooling on top of the generic API
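The ListActions/DoAction pattern above can be sketched as a small dispatcher. This is an illustrative model only: `ToyDataService`, the `forget_stream` action name, and its JSON fields are hypothetical, and a plain `KeyError` stands in for the transport-level error that a real service would surface.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    type: str    # ActionType string token
    body: bytes  # JSON-encoded instruction


@dataclass(frozen=True)
class Result:
    type: str
    body: bytes  # JSON-encoded reply


class ToyDataService:
    def __init__(self) -> None:
        self.loaded = {"sales", "clicks"}
        # Maps action type -> (description, handler). The description tells
        # clients how to structure the body and what result to expect.
        self._actions = {
            "forget_stream": (
                'body: {"stream": name}; result: {"stream": name, "forgotten": bool}',
                self._forget,
            ),
        }

    def list_actions(self) -> List[Tuple[str, str]]:
        return [(name, desc) for name, (desc, _) in sorted(self._actions.items())]

    def do_action(self, action: Action) -> Result:
        if action.type not in self._actions:
            # Stand-in for an error reported through the RPC layer.
            raise KeyError(f"unknown action: {action.type}")
        _, handler = self._actions[action.type]
        reply = handler(json.loads(action.body))
        return Result(action.type, json.dumps(reply).encode("utf-8"))

    def _forget(self, args: dict) -> dict:
        name = args["stream"]
        existed = name in self.loaded
        self.loaded.discard(name)
        return {"stream": name, "forgotten": existed}


# A generic client needs only JSON tooling, not service-specific classes.
service = ToyDataService()
request = Action("forget_stream", json.dumps({"stream": "sales"}).encode("utf-8"))
result = service.do_action(request)
reply = json.loads(result.body)
```

Because the client only builds and parses JSON, the same generic code can call any action that a service advertises through `list_actions`.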
6. Apache Arrow
But How? gRPC as a Foundation
• Generic RPC generation framework
• Built on HTTP/2 Standard
• Many language bindings (see right)
• Supports security & compression
• Uses Protobuf as primary format
• Designed primarily for application messaging
7. Apache Arrow
Extend gRPC To Better Work With Arrow Streams
• Streams are valid Protobuf Objects so systems that don’t
have custom processing can still consume Arrow streams
– The entirety of the Arrow RecordBatch is a single length
delimited Protobuf “bytes” field.
• For high performance situations, do direct byte encoding
and one-copy reads/zero-copy writes to avoid extra
copies/overhead
– Java Flight implementation cuts through multiple layers to
achieve this using currently released gRPC (despite no formal
support for it).
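To make the "single length-delimited bytes field" idea concrete, the sketch below hand-encodes one such field using the standard Protobuf wire format (varint tag, varint length, then the raw payload). It is a minimal illustration, not the Flight implementation: the field number, the helper names, and the stand-in payload are assumptions, and the `memoryview` slice only hints at how a reader can avoid copying the body.

```python
from typing import Tuple

WIRE_TYPE_LEN = 2       # Protobuf wire type for length-delimited fields
BODY_FIELD_NUMBER = 2   # hypothetical field number, chosen for illustration


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def encode_bytes_field(field_number: int, payload: bytes) -> bytes:
    tag = (field_number << 3) | WIRE_TYPE_LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def decode_bytes_field(buf: bytes) -> Tuple[int, memoryview]:
    tag, pos = decode_varint(buf, 0)
    if (tag & 0x7) != WIRE_TYPE_LEN:
        raise ValueError("not a length-delimited field")
    length, pos = decode_varint(buf, pos)
    # The memoryview slice references the original buffer without copying.
    return tag >> 3, memoryview(buf)[pos:pos + length]


batch_bytes = bytes(range(200))  # stand-in for a serialized record batch
message = encode_bytes_field(BODY_FIELD_NUMBER, batch_bytes)
field_number, body = decode_bytes_field(message)
```

Any generic Protobuf reader sees `message` as an ordinary bytes field, while a specialized reader can skip the tag and length prefix and hand the body to Arrow directly.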
8. Apache Arrow
Check it out
• Arrow Flight Proposal
– https://github.com/jacques-n/arrow
• Example Usage in Dremio Formation
– https://github.com/jacques-n/formation