DAOS (Distributed Asynchronous Object Storage) is a high-performance storage architecture and software stack that delivers scalable object storage capabilities. It uses Intel Optane persistent memory and NVMe SSDs to provide high-IOPS, high-bandwidth, and low-latency storage. DAOS supports various data models and interfaces such as POSIX, HDF5, Spark, and Python, and it lets applications access storage through library calls instead of system calls for high performance.
Intel Corporation, Cloud & Enterprise Solutions Group
DAOS Storage Architecture
[Diagram: the DAOS Storage Engine spans two media tiers: Intel® Optane persistent memory, accessed through the PMDK memory interface for metadata, low-latency I/Os & indexing/query, and 3D-NAND/Optane SSDs, accessed through the SPDK NVMe interface for bulk data; no HDD tier.]
[Diagram: an AI/analytics/simulation workflow on compute nodes links directly with the DAOS library, which exposes POSIX I/O, MPI-I/O, HDF5, Spark, Python, …]
Libfabric
• Low-latency & high-message-rate communications
• Native support for RDMA & scalable collective operations
• Support for iWARP, RoCE, InfiniBand, OPA, Slingshot, …
• DAOS library directly linked with the applications
• No need for dedicated cores
• Low memory/CPU footprint
• End-to-end OS bypass
• Non-blocking, lockless, snapshot support, …
• Fine-grained I/O with media selection strategy
• Only application data on SSD to maximize throughput
• Small I/Os aggregated in pmem & migrated to SSD in large chunks
• Full userspace model with no system calls on I/O path
• Built-in storage management infrastructure (control plane)
• NFSv4-like ACL
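The media-selection bullets above can be sketched as a toy write buffer: small writes are absorbed by a fast tier (standing in for pmem) and flushed to a bulk tier (standing in for SSD) in large chunks once a threshold is reached. This is an illustrative model, not the DAOS implementation; the size limits are assumed values.

```python
# Illustrative sketch of DAOS-style media selection: small I/Os are
# buffered in a fast tier (pmem stand-in) and migrated to a bulk tier
# (SSD stand-in) in large chunks. Not the real DAOS code.

SMALL_IO_LIMIT = 4096        # writes at or below this size go to "pmem" (assumed)
MIGRATION_THRESHOLD = 16384  # flush "pmem" to "ssd" once this much is buffered

class TieredStore:
    def __init__(self):
        self.pmem = []   # small buffered writes
        self.ssd = []    # large chunks, each written in one shot

    def write(self, data: bytes):
        if len(data) > SMALL_IO_LIMIT:
            # Large application data bypasses pmem and goes straight to SSD.
            self.ssd.append(data)
        else:
            self.pmem.append(data)
            if sum(len(d) for d in self.pmem) >= MIGRATION_THRESHOLD:
                self._migrate()

    def _migrate(self):
        # Aggregate buffered small I/Os into one large chunk, freeing pmem.
        self.ssd.append(b"".join(self.pmem))
        self.pmem.clear()

store = TieredStore()
store.write(b"x" * 8192)       # large write: straight to SSD
for _ in range(4):
    store.write(b"y" * 4096)   # small writes: buffered, then migrated
print(len(store.ssd), len(store.pmem))  # -> 2 0
```

The key point the model captures is that only large, aggregated writes ever reach the SSD, which is what maximizes its throughput.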
Delivers high-IOPS, high-bandwidth, and low-latency storage with advanced features in a single tier
Storage Nodes
Storage Containers

Aggregate related datasets into manageable and coherent entities
• Distributed consistency & automated recovery
• Full versioning
• Simplified data management
• Snapshot
• Cross-tier migration
• Indexing
[Diagram: example DAOS containers: an encapsulated POSIX namespace (a root directory tree of dirs, files & data), a file-per-process layout, an HDF5 « file » (groups & datasets), a key-value store, a graph of linked nodes, and a columnar database (keys mapped to columns of values).]
DAOS Objects

Native support for structured, semi-structured & unstructured data models
• Built on top of DCPMM
• Unconstrained by POSIX serialization
• Custom attributes
• Data access times orders of magnitude faster (µs)
• Scalable concurrent updates & high IOPS
• Non-blocking
• Enables in-storage computing
[Diagram: DAOS object types layered over the DAOS Storage Engine (open source, Apache 2.0 license) and NVMe SSDs: arrays, KV stores, and multi-level KV stores, in which a tree of keys references values that may be stored inline or continued in separate extents; applications access them through data model libraries/frameworks.]
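The multi-level key-value layout can be sketched as a two-level mapping, loosely mirroring DAOS's dkey/akey scheme; the class and method names below are illustrative, not the libdaos API.

```python
# Toy two-level key-value object modeling the multi-level KV layout
# (DAOS uses a dkey/akey scheme; names here are illustrative only).

class MultiLevelKV:
    def __init__(self):
        self.root = {}  # dkey -> {akey -> value}

    def put(self, dkey, akey, value):
        self.root.setdefault(dkey, {})[akey] = value

    def get(self, dkey, akey):
        return self.root[dkey][akey]

    def list_akeys(self, dkey):
        # Second-level keys under a given first-level key.
        return sorted(self.root.get(dkey, {}))

obj = MultiLevelKV()
obj.put("row1", "colA", b"val1")
obj.put("row1", "colB", b"val2")
obj.put("row2", "colA", b"val3")
print(obj.list_akeys("row1"))  # -> ['colA', 'colB']
```

Grouping values under a first-level key like this is what lets distinct data models (arrays, tables, POSIX files) be expressed over the same object primitive.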
POSIX I/O Support

The DAOS API is new and very flexible
• Multi-level keys
• Different value types supported
• All data models / I/O middleware can be built on top of it

Most applications are still based on POSIX
• Need a smooth migration path with little to no application changes
• Quick path to realize the performance of DCPMM & DAOS

POSIX is implemented as middleware instead of being the building block of all data models.
[Diagram: the application/framework, DAOS library (libdaos), DAOS File System (libdfs), interception library, and dfuse all run in a single process address space; RPC and RDMA to the DAOS Storage Engine are end-to-end user space with no system calls; bulk data lands on Intel® QLC 3D-NAND SSDs.]
MPI-IO Driver for DAOS
The DAOS MPI-IO driver is implemented within the I/O library in MPICH (ROMIO).
• Added as an ADIO driver
• Portable to Open MPI, Intel MPI, etc.
• https://github.com/pmodels/mpich

MPI files use the same DFS mapping to the DAOS object model
• MPI files can be accessed through the DFS API
• MPI files can be accessed through regular POSIX with a dfuse mount over the container

Applications work seamlessly; the driver is selected simply by prefixing the path with "daos:".
MPI-IO ROMIO driver: https://github.com/pmodels/mpich/tree/master/src/mpi/romio/adio/ad_daos
POSIX / MPI-IO file → DAOS byte-array object
Special DAOS object:
• 1-level key
• 1-byte records
• Configurable chunk size
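The mapping above (1-level key, 1-byte records, configurable chunk size) can be sketched as follows: a file offset splits into a chunk index, which serves as the key, and an offset within that chunk. This is a simplified illustration of the DFS-style layout, not the actual ad_daos driver code; the default chunk size is an assumption.

```python
# Sketch of how a POSIX/MPI-IO file offset maps onto a chunked DAOS
# byte-array object: the chunk index is the (1-level) key and the
# records are single bytes. Simplified illustration, not ad_daos code.

CHUNK_SIZE = 1 << 20  # configurable chunk size; 1 MiB is an assumed default

def offset_to_chunk(offset: int, chunk_size: int = CHUNK_SIZE):
    """Return (chunk_key, offset_within_chunk) for a file offset."""
    return divmod(offset, chunk_size)

def chunks_for_extent(offset: int, length: int, chunk_size: int = CHUNK_SIZE):
    """List the chunk keys an I/O of `length` bytes at `offset` touches."""
    first, _ = offset_to_chunk(offset, chunk_size)
    last, _ = offset_to_chunk(offset + length - 1, chunk_size)
    return list(range(first, last + 1))

print(offset_to_chunk(3 * CHUNK_SIZE + 42))    # -> (3, 42)
print(chunks_for_extent(CHUNK_SIZE - 10, 20))  # -> [0, 1] (straddles a boundary)
```

Because each chunk is an independent key, extents in different chunks can be read or written by many clients concurrently without file-level serialization.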
HDF5 VOL Architecture
Three main components:
• HDF5 Library
• DAOS VOL Connector
• (External) HDF5 Test Suite
[Diagram: the HDF5 API sits on the VOL layer; the native VOL (enhanced component) goes through the VFD layer (SEC2, MPIO, …) to a file system via the POSIX API or through MPI I/O, while the new DAOS VOL connector (external VOL connector) talks to DAOS via the DAOS API; the core HDF5 library, HDF5 tools, and an external test suite sit above both paths.]
HDF5 DAOS VOL Connector – Current Status

No longer requires a separate version of HDF5
• Compatible with the main develop branch of HDF5
• Compatible with the 1.12.x release series of HDF5 with VOL support

Currently supported features
• New H5M map API to expose a K/V interface to HDF5 users
• Variable-length datatypes are now also supported
• Chunking is the recommended storage layout to get the most out of DAOS performance
• Tools support (h5dump, h5ls, h5repack, etc.)

Coming by end of the year
• Independent metadata writes (= independent object creation)
• Asynchronous I/O

Available from: https://git.hdfgroup.org/projects/HDF5VOL/repos/daos-vol/
• See the user's guide for a more detailed list of supported features
Unified Namespace
The DAOS object store addresses pools and containers with UUIDs, and objects with 128-bit IDs.
Applications and users are accustomed to accessing files and directories in a traditional namespace.
The Unified Namespace lets users create links between a file/directory in a system namespace and DAOS pools & containers:

daos container create --path=/mnt/project1/userA/NS1 --pool=uuid --type=POSIX/HDF5/etc.

The path created above becomes a special file or directory (depending on the container type) carrying an extended attribute with the pool and container information.
Accessing that path from DAOS-aware middleware makes the link on the fly via the DAOS UNS library.
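The lookup described above can be sketched with an in-memory stand-in for the extended attribute: the special path carries an attribute encoding the pool and container, and DAOS-aware middleware parses it to make the link. The attribute name ("user.daos") and the encoding below are illustrative assumptions, not the real UNS on-disk format.

```python
# Sketch of Unified Namespace resolution: a path in the system namespace
# carries an extended attribute naming the DAOS pool and container.
# The attribute name and encoding are illustrative assumptions; a dict
# stands in for the filesystem's xattr store so the example runs anywhere.

xattrs = {}  # path -> {attr_name: value}, stand-in for os.{set,get}xattr

def create_uns_link(path, pool_uuid, cont_uuid, cont_type):
    """Tag `path` with pool/container info, as `daos container create --path` would."""
    xattrs.setdefault(path, {})["user.daos"] = \
        f"type={cont_type};pool={pool_uuid};cont={cont_uuid}"

def resolve_uns_link(path):
    """Parse the attribute back into a dict, as DAOS-aware middleware would."""
    raw = xattrs[path]["user.daos"]
    return dict(field.split("=", 1) for field in raw.split(";"))

create_uns_link("/mnt/project1/userA/NS1", "pool-uuid", "cont-uuid", "POSIX")
info = resolve_uns_link("/mnt/project1/userA/NS1")
print(info["type"], info["pool"])  # -> POSIX pool-uuid
```

The design point is that the system namespace stores only a small attribute, so resolution costs one metadata read and no data ever flows through the kernel filesystem.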
Spark Input/Output Support
[Diagram: current Spark scans/parses data (CSV, Parquet, ORC, etc.) through the Hadoop FileSystem abstract class, backed either by HDFS on disks or by the implemented DAOS FileSystem (a Java wrapper over the DAOS POSIX API, using the DAOS library with DPDK/PMDK/SPDK, kernel bypass, and zero copy); planned paths add a DAOS/Arrow wrapper as an Apache Arrow data source and a DAOS API wrapper as a native DAOS data source (Parquet, ORC, etc.), feeding project, join, ML, etc.]
Python Support

Pythonic bindings called pydaos
• Export key-value store objects
• Integrated with Python dictionaries
• Support Python iterators, direct assignments, …
• Scalable & performant
• Bulk insert/retrieve
• Core written in C
• Python 2.7 & 3 support

Python integration
• dbm
• pyprob

TODO
• Expose snapshots
• Integration with NumPy, …
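The dictionary-style interface pydaos exposes can be illustrated with an in-memory mock: a KV object supporting item assignment, iteration, and bulk insert/retrieve. The class and method names below are a stand-in to show the usage pattern, not the pydaos API itself.

```python
# In-memory mock of a pydaos-style KV object: dictionary semantics
# (item assignment, iteration) plus bulk put/get. Illustrative only;
# the real pydaos classes and method names may differ.

class MockKV:
    def __init__(self):
        self._store = {}

    def __setitem__(self, key, value):  # kv["key"] = b"value"
        self._store[key] = value

    def __getitem__(self, key):
        return self._store[key]

    def __iter__(self):                 # for key in kv: ...
        return iter(self._store)

    def bput(self, mapping):            # bulk insert
        self._store.update(mapping)

    def bget(self, keys):               # bulk retrieve
        return {k: self._store[k] for k in keys}

kv = MockKV()
kv["hello"] = b"world"                  # direct assignment
kv.bput({"a": b"1", "b": b"2"})         # bulk insert
print(sorted(kv), kv.bget(["a", "b"]))
```

The point of the pattern is that users who are not familiar with C can work with storage objects through ordinary Python idioms.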
Resources
Source code on GitHub
• https://github.com/daos-stack/daos
Documentation
• http://daos.io
Community mailing list
• https://daos.groups.io
DAOS solution brief
• https://www.intel.com/content/www/us/en/high-performance-computing/
Overall DAOS architecture
Distributed object store built from the ground up for NVM technology.
Includes two types of storage:
SSDs with different media (3D-NAND, Optane), available with an NVMe interface, providing fast block-based storage. Intel is developing a new user-space NVMe driver, SPDK.
Persistent memory (Optane pmem): fine-grained storage that is directly memory-mapped, so you can do direct load/store to the storage with no block abstraction. Intel is developing a new storage stack, PMDK, to manage transactional updates to pmem.
We don't have any object store that leverages these technologies. Lustre and GPFS are built into the kernel; Ceph still uses kernel I/O by going through the Linux block layer. Here, everything is in userspace.
DAOS is a new object store built on top of PMDK and SPDK: high IOPS, high bandwidth.
It provides a more advanced storage API: not just byte arrays, but KV stores, etc.
We still need to look at communication. We reuse several decades of effort spent optimizing latency and bandwidth for MPI, by reusing the same communication middleware as MPI (libfabric).
The client side is very lightweight, with support for many middleware layers.
All metadata is stored in pmem. Small I/Os land in pmem and are later aggregated into SSD in the background to free space in pmem. Large application data goes directly to the SSD.
The baseline API is based on storage containers: object address spaces inside a pool.
A container can be an encapsulated POSIX namespace, or just a flat one.
HDF5 has its own representation for a single HDF5 file. A KV store can be a container too, as can a database or a graph.
The container provides a baseline API over objects. The unit of storage management is the container, not a file.
When you have millions of files, data migration has to touch all of them. We simplified that: there are far fewer containers than files, so we can index them.
Data migration operates on an entire container, which is also the unit of snapshot. Snapshots are not typically available traditionally (they used to exist only at the file-system level), but are now at the container level and exposed to the user.
POSIX is not the foundation any more. A new key-array store API is provided by DAOS with advanced features (ad-hoc concurrency control, snapshot support, asynchronous API, …).
The DAOS API can be directly integrated with rich data models for better performance and extended capabilities. POSIX (i.e. file) support is built as middleware over the DAOS API.
DAOS has a native KV API, which means the wire protocol is KV fetch/lookup. On the server, the tree structure is maintained in DCPM, and values are stored either in DCPM or on SSD depending on size. No file serialization is required. A client can directly look up or insert a key/value pair over the wire, and this can happen from many clients concurrently. Keys can be ordered, so we support range queries. KV stores are partitioned/sharded/replicated/erasure-coded over multiple servers for performance & resilience. DAOS can also support complex queries, since it understands the key-value structure. The native indexing can also be augmented with ad-hoc application-driven indexes.
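The range-query point above can be illustrated with a sorted key index: because keys are kept ordered, a server can answer "all keys between lo and hi" with a binary search instead of a full scan. Below is a minimal sketch using Python's bisect; it is a model of the idea, not the DAOS server's DCPM-resident tree.

```python
# Sketch of range queries over ordered keys: binary search on a sorted
# key list bounds the scan to the requested range. Illustrative model
# only, not DAOS server code.

import bisect

class OrderedKV:
    def __init__(self):
        self.keys = []    # kept sorted at all times
        self.store = {}

    def insert(self, key, value):
        if key not in self.store:
            bisect.insort(self.keys, key)  # O(n) insert, O(log n) search
        self.store[key] = value

    def range_query(self, lo, hi):
        """Return (key, value) pairs with lo <= key <= hi, in key order."""
        start = bisect.bisect_left(self.keys, lo)
        end = bisect.bisect_right(self.keys, hi)
        return [(k, self.store[k]) for k in self.keys[start:end]]

kv = OrderedKV()
for k in ["apple", "banana", "cherry", "date"]:
    kv.insert(k, k.upper())
print(kv.range_query("b", "d"))  # -> [('banana', 'BANANA'), ('cherry', 'CHERRY')]
```

A real server would use a persistent tree rather than a sorted list, but the bound-then-scan access pattern is the same.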
Spark integration with DAOS (a different team at Intel):
We have written a Hadoop filesystem abstraction class for DAOS, integrated in the DAOS source code.
This Hadoop connector lets Spark use DAOS in user space through libdfs transparently.
More value/acceleration comes to Spark by integrating DAOS as a data source through Parquet / Apache Arrow, or through the native DAOS API.
Native Python bindings (pydaos) integrate KV store objects directly with Python dictionaries.
The API is very Pythonic: you get iterators, direct assignments, and bulk insert/retrieve.
This allows people who are not very familiar with C to work in a high-level language.
There is some integration with dbm and pyprob; more work remains to be done.