Jayjeet Chakraborty
Towards an Arrow-Native Storage System
SkyhookDM
Mentored by: Carlos Maltzahn, Ivo Jimenez, Jeff LeFevre
1
Who am I?
• Incoming Grad Student at UC Santa Cruz

• CS Graduate from NIT Durgapur, India

• IRIS-HEP Fellow Summer 2020

• Twitter: @heyjc25

• Github: JayjeetAtGithub

• LinkedIn: https://www.linkedin.com/in/jayjeet-chakraborty-077579162/

• E-Mail: jchakra1@ucsc.edu
2
Problem
• CPU is the new bottleneck with high-speed network and storage devices.

• Client-side processing of data from highly efficient storage formats like
Parquet and ORC exhausts the CPUs.

• Scalability is severely hampered.

Our Solution
• Offload computation from the client to the storage layer.

• Take advantage of the idle CPUs of storage systems for increased processing
rates and faster queries.

• Results in less data movement and network traffic.
3
Introduction to Ceph
1. Provides 3 types of storage interfaces:
File, Object, and Block.

2. No central point of failure. Uses
CRUSH maps that contain the
object-to-OSD mapping. Each client
holds a CRUSH map and talks
directly to OSDs.

3. Highly extensible object storage layer
via the Ceph Object Classes SDK.
4
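The decentralized placement idea above can be sketched in a few lines. This is NOT the real CRUSH algorithm, just a simplified rendezvous-hashing illustration of the key property: every client can compute an object's OSD locally, so no central directory is needed (the `locate_osd` helper and OSD names are hypothetical).

```python
import hashlib

# Simplified illustration (not actual CRUSH): each client ranks OSDs per
# object with a deterministic hash and picks the highest score, so the
# object -> OSD mapping is computable anywhere without a central server.
def locate_osd(object_name: str, osds: list) -> str:
    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()
        return int(digest, 16)
    return max(osds, key=score)

osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
# The same object always maps to the same OSD on every client.
assert locate_osd("obj.A", osds) == locate_osd("obj.A", osds)
```

Rendezvous hashing is only one way to get this property; CRUSH additionally accounts for failure domains and weighted devices.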
• Language-independent columnar memory format for flat and hierarchical data,
organised for efficient analytic operations on modern hardware.

• Share data between processes without serialization overhead.

[Figure: data sharing before Arrow vs. after Arrow]
5
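Why a columnar layout helps analytics can be shown with a plain-Python sketch (this uses stdlib `array`, not actual Arrow buffers): scanning one field touches a single contiguous buffer instead of walking every whole row.

```python
from array import array

# Conceptual sketch only (not Arrow itself): columnar = struct of arrays.
row_layout = [(1, 10.0, "a"), (2, 20.0, "b"), (3, 30.0, "c")]  # array of structs
col_layout = {                                                  # struct of arrays
    "id":    array("q", [1, 2, 3]),
    "price": array("d", [10.0, 20.0, 30.0]),
    "tag":   ["a", "b", "c"],
}

# Summing "price" reads one contiguous buffer in the columnar layout...
col_sum = sum(col_layout["price"])
# ...but must visit every full row tuple in the row layout.
row_sum = sum(r[1] for r in row_layout)
assert col_sum == row_sum == 60.0
```

Arrow additionally fixes the byte-level layout of these buffers, which is what lets processes share them without serialization.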
Components of Arrow
[Figure: Arrow components used by SkyhookDM]
6
Design Paradigm
• Extend the client and storage layers of
programmable storage systems
with data access libraries.

• Embed a FS shim inside storage
nodes to provide a file-like view over
objects.

• Allow direct interaction with objects
in an object store while bypassing
the filesystem layer, utilising FS
metadata.
7
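The FS-shim idea above can be sketched minimally: wrap an object behind a file-like handle so scanning code can seek and read as if it were a plain file, with no real filesystem underneath. Everything here (the in-memory `object_store`, the `ObjectFile` class, the fake payload) is hypothetical, for illustration only.

```python
import io

# Hypothetical in-memory stand-in for a RADOS-style object store.
object_store = {"dataset.0.parquet": b"PAR1...payload...PAR1"}

class ObjectFile(io.RawIOBase):
    """File-like view over a single object (the FS-shim idea, simplified)."""
    def __init__(self, name: str):
        self._buf = io.BytesIO(object_store[name])
    def read(self, size=-1):
        return self._buf.read(size)
    def seek(self, pos, whence=io.SEEK_SET):
        return self._buf.seek(pos, whence)
    def seekable(self):
        return True

f = ObjectFile("dataset.0.parquet")
assert f.read(4) == b"PAR1"  # Parquet files begin with the "PAR1" magic bytes
```

A scanner written against file handles can then run unchanged against objects, which is what lets the Arrow file-scanning code operate inside the OSD.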
Architecture
• Arrow data access libraries are embedded inside Ceph OSDs to allow file fragment scanning inside the
storage layer.

• Expose the functionality through the Arrow Dataset API by creating a new file format abstraction,
“RadosParquetFileFormat”.
8
File Layout Design
• Large multi-gigabyte Parquet files are split into smaller ~128 MB Parquet files.

• Each Parquet file is stored in a single RADOS object for SkyhookDM to access.
9
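The layout arithmetic above is simple to sketch. The `plan_objects` helper and the object-naming scheme are made up for illustration; only the one-object-per-~128 MB-file rule comes from the slide.

```python
# Hedged sketch of the layout math only (naming scheme is hypothetical):
# a large dataset is split into ~128 MB Parquet files, each stored in a
# single RADOS object.
OBJECT_SIZE = 128 * 1024 * 1024  # ~128 MB per Parquet file / RADOS object

def plan_objects(total_bytes: int, prefix: str = "dataset") -> list:
    """Return one object name per ~128 MB chunk of the dataset."""
    num_objects = -(-total_bytes // OBJECT_SIZE)  # ceiling division
    return [f"{prefix}.{i}.parquet" for i in range(num_objects)]

# A 1 GiB dataset yields 8 objects of 128 MiB each.
objects = plan_objects(1024 * 1024 * 1024)
assert len(objects) == 8
```

Keeping each file within one object means a scan request for a file fragment can be dispatched to exactly one OSD.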
Experiments: Latency
• Offloading makes queries with higher
selectivity faster, as less data is
moved around the system. Also, less
time goes into data (de)serialization and
more into processing.

• LZ4-compressed Arrow IPC files
(bottom) make SkyhookDM perform better
than Parquet files (top), since
they are faster to read and write.

[Figures: query latency with Parquet on disk (top) and LZ4 IPC on disk (bottom)]
10
Experiments: CPU Usage
• SkyhookDM offloads CPU usage from the client layer to the storage layer, as
shown below with 4 OSDs and 100% selectivity.

[Figures: CPU usage without SkyhookDM vs. with SkyhookDM]
11
Experiments: Network Traffic
• SkyhookDM saves network
bandwidth by transferring only
the data that is requested by the
client.

• We end up transferring a little
more data in the 100% selectivity case, as
LZ4-compressed Arrow is larger
than Parquet binary data.

[Figures: network traffic at 1%, 10%, and 100% selectivity]
12
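The trade-off on this slide reduces to simple arithmetic, sketched below with made-up sizes (the 1 GB Parquet size and the ~15% IPC inflation factor are hypothetical, not measured numbers): offloading moves only the selected fraction of the data, but ships it as compressed Arrow IPC, which can exceed the equivalent Parquet bytes at 100% selectivity.

```python
# Back-of-the-envelope sketch with hypothetical numbers.
parquet_bytes = 1_000_000_000  # assumed on-disk Parquet size
ipc_inflation = 1.15           # assume compressed IPC is ~15% larger than Parquet

def bytes_transferred(selectivity: float, offload: bool) -> float:
    if not offload:
        return parquet_bytes   # without offloading, whole files cross the network
    return selectivity * parquet_bytes * ipc_inflation

# At low selectivity, offloading moves far less data...
assert bytes_transferred(0.01, offload=True) < bytes_transferred(0.01, offload=False)
# ...but at 100% selectivity it can move slightly more.
assert bytes_transferred(1.0, offload=True) > bytes_transferred(1.0, offload=False)
```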
Experiments: Crash Recovery
• In SkyhookDM, since processing is colocated with storage nodes, the crash recovery
and consistency semantics of the storage layer apply naturally to query processing.
[Figure: query processing timeline with the crash point marked]
13
Coffea + SkyhookDM
• Implemented a run_parquet_job executor method in Coffea to be able to read from
Parquet files using the Arrow Dataset API. This in turn allowed integrating Coffea with
SkyhookDM seamlessly.
14
Ongoing Work
• Arrow’s memory layout requires internal memory copies to serialize it to a
contiguous on-the-wire format, and this has a very high overhead.

[Figure: per-stage time breakdown.
Sending uncompressed IPC: [6] Serialize Result Table 41.5%, [5] Scan Parquet Data 30.5%,
[7] Result Transfer 24.6%, [4] Disk I/O 3.34%, [3] Deserialize Scan Request 0.103%,
[1] Stat Fragment 0.0324%, [8] Deserialize Result Table 0.00855%,
[2] Serialize Scan Request 0.00511%.
Sending LZ4-compressed IPC: [5] Scan Parquet Data 48.3%, [6] Serialize Result Table 29.5%,
[7] Result Transfer 11.7%, [8] Deserialize Result Table 5.37%, [4] Disk I/O 5.11%,
[3] Deserialize Scan Request 0.0513%, [1] Stat Fragment 0.0304%,
[2] Serialize Scan Request 0.00771%.]

• Collaborating with the ServiceX and Coffea teams to integrate SkyhookDM into the
larger analysis facility ecosystem.
15
Check out our work
• Github Repository: https://github.com/uccross/skyhookdm-arrow

• Docker containers: https://github.com/uccross/skyhookdm-arrow-docker

• ArXiv Paper: https://arxiv.org/pdf/2105.09894.pdf

• Coffea Skyhook Plugin: https://github.com/CoffeaTeam/coffea/tree/master/docker/coffea_rados_parquet

• Several bugs found and reported in Apache Arrow: ARROW-13161,
ARROW-13126, ARROW-13088.
16
Thank You


Questions?


17
