Big Data Uses with Distributed Asynchronous Object Storage
Di Wang, Extreme Storage Architecture & Development (ESAD), Intel
SPDK, PMDK & VTune™ Summit
Agenda
• DAOS (Distributed Asynchronous Object Storage) Overview
• DAOS Architecture & features
• DAOS Storage Model
• DAOS with PMDK & SPDK
• Current Performance & Resources
Storage revolution
[Chart: latency from the application (µs) for a 4 kB read from NAND SSD, a 4 kB read from Intel® Optane™ SSD, and a 64 B read from Intel® Optane™ NVDIMMs; each bar is broken down into NVM media read, PCIe & NVMe protocol, and software (file system, OS, driver).]
DAOS overview
[Diagram: 3rd-party applications and workflows use rich data models (POSIX I/O, HDF5, SQL, …) on top of the storage platform, the open-source (Apache 2.0 license) DAOS Storage Engine, which runs over the storage media layer (Intel® QLC 3D NAND SSD, HDD).]
Lightweight I/O
Mercury userspace function shipping
§ MPI-equivalent communication latency
§ Built over libfabric
Applications link directly with the DAOS library
§ Direct call, no context switch
§ Small memory footprint
§ No locking, caching, or data copies
Userspace DAOS server
§ Memory-maps non-volatile memory via PMDK
§ NVMe access through SPDK/Blobstore
[Diagram: AI/analytics/simulation workflows (HDF5, file, (No)SQL, …) link with the DAOS library; RPCs and bulk transfers flow over Mercury/libfabric to the DAOS service, which reaches SCM through PMDK and NVMe SSDs through SPDK.]
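A minimal sketch of the "memory-map non-volatile memory via PMDK" point above: opening (or creating) a pmemobj pool memory-maps the pool file on persistent memory into the server's address space, so the service then accesses NVM with plain loads and stores. The path handling, layout name, pool size, and fallback-to-create logic below are illustrative assumptions, not DAOS's actual settings.

```c
#include <libpmemobj.h>

#define POOL_LAYOUT  "daos_example"        /* hypothetical layout name */
#define POOL_SIZE    ((size_t)1 << 30)     /* 1 GiB, for illustration  */

/* Open an existing pmemobj pool, or create one if it does not exist yet.
 * The returned pool is memory-mapped: the server reads/writes NVM directly. */
static PMEMobjpool *
open_scm_pool(const char *path)
{
	PMEMobjpool *pop = pmemobj_open(path, POOL_LAYOUT);

	if (pop == NULL)
		pop = pmemobj_create(path, POOL_LAYOUT, POOL_SIZE, 0600);
	return pop;    /* NULL if both open and create failed */
}
```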
Storage Model
DAOS provides a rich storage API
§ New scalable storage model suitable for both structured & unstructured data
– Key-value stores, multi-dimensional arrays, columnar databases, …
– Accelerates data analytics/AI frameworks
§ Non-blocking data & metadata operations
§ Ad-hoc concurrency control mechanism
Pool
§ Reservation of distributed storage
§ Predictable/extendable performance/capacity
Container
§ Aggregates related datasets into a manageable entity
§ Unit of snapshot/transaction
Object
§ Key-array store with its own distribution/resilience schema
§ Multi-level key for fine-grained control over colocation of related data
Record
§ Arbitrary binary blob, from a single byte to several megabytes
[Diagram: Pool → Container → Object → Record hierarchy.]
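As a rough illustration of the object model above, the sketch below writes one record under a distribution key (dkey) and attribute key (akey) through the DAOS client API. It assumes an already-open object handle and the daos_obj_update() prototype from recent DAOS releases (the flags argument is newer than the 2019-era API); pool/container connection, object open, and error handling are omitted.

```c
#include <daos.h>
#include <string.h>

/* Write one single-value record addressed by (dkey, akey) into an open object.
 * Passing a daos_event_t instead of the final NULL would make the call
 * non-blocking, matching the asynchronous API described above. */
static int
put_record(daos_handle_t oh, const char *dkey_str, const char *akey_str,
           void *buf, daos_size_t len)
{
	daos_key_t  dkey;
	daos_iod_t  iod = {0};
	d_sg_list_t sgl = {0};
	d_iov_t     sg_iov;

	/* Distribution key: controls which targets hold the record. */
	d_iov_set(&dkey, (void *)dkey_str, strlen(dkey_str));

	/* Attribute key + descriptor for a single value of 'len' bytes. */
	d_iov_set(&iod.iod_name, (void *)akey_str, strlen(akey_str));
	iod.iod_type = DAOS_IOD_SINGLE;
	iod.iod_size = len;
	iod.iod_nr   = 1;

	/* Scatter/gather list pointing at the caller's buffer. */
	d_iov_set(&sg_iov, buf, len);
	sgl.sg_nr   = 1;
	sgl.sg_iovs = &sg_iov;

	return daos_obj_update(oh, DAOS_TX_NONE, 0, &dkey, 1, &iod, &sgl, NULL);
}
```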
Fine-grained I/O
Mix of storage technologies
§ Storage Class Memory
– DAOS metadata & application metadata
– Byte-granular application data
§ NVMe SSD (NAND)
– Cheaper storage for bulk data (e.g. checkpoints)
– Multi-KB granularity
I/Os are logged & inserted into a persistent index
§ Non-destructive writes & consistent reads
§ No alignment constraints
§ No read-modify-write
[Diagram: writes v1, v2, v3 are logged into the server-side index; a read@v3 gathers the visible extents through bulk-descriptor segments into the application buffer.]
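The following toy, in-memory model illustrates the logged, non-destructive writes and versioned reads described above. It is not DAOS's persistent index (VOS keeps a tree in SCM); it only sketches the semantics: every write appends an extent tagged with a version, and a read at version v composes each byte from the newest extent whose version does not exceed v.

```c
#include <stdint.h>
#include <string.h>

#define MAX_LOG 64

struct extent {                 /* one logged write */
	uint64_t       ver;     /* version (epoch) of the write  */
	uint64_t       off;     /* byte offset within the record */
	uint64_t       len;     /* extent length in bytes        */
	const uint8_t *data;    /* payload (kept by the caller)  */
};

static struct extent ext_log[MAX_LOG];
static unsigned      ext_nr;

/* Non-destructive write: append to the log, never overwrite in place. */
static void
write_at(uint64_t ver, uint64_t off, const uint8_t *data, uint64_t len)
{
	if (ext_nr < MAX_LOG)
		ext_log[ext_nr++] = (struct extent){ ver, off, len, data };
}

/* Consistent read at 'ver': each byte comes from the newest extent whose
 * version does not exceed 'ver' and which covers that byte. */
static void
read_at(uint64_t ver, uint64_t off, uint8_t *buf, uint64_t len)
{
	memset(buf, 0, len);               /* unwritten bytes read as zero */
	for (uint64_t b = 0; b < len; b++) {
		uint64_t best = 0;
		for (unsigned i = 0; i < ext_nr; i++) {
			const struct extent *e = &ext_log[i];

			if (e->ver > ver || e->ver < best)
				continue;
			if (off + b >= e->off && off + b < e->off + e->len) {
				buf[b] = e->data[off + b - e->off];
				best   = e->ver;
			}
		}
	}
}
```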
Data Management
Data Security & Reduction
§ Online real-time data encryption & compression
§ Hardware acceleration
Data Distribution
§ Algorithmic placement
Data Protection
§ Declustered replication & erasure code
§ Fault-domain aware placement
§ Self-healing
§ End-to-end data integrity
[Diagram: objects are placed by hashing object.dkey; replicas are kept in separate fault domains.]
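A toy illustration of hash-based, fault-domain-aware placement in the spirit of the diagram above: the object ID and dkey are hashed, and each replica is mapped to a target in a distinct fault domain. The hash function, domain/target counts, and walk order are illustrative assumptions; DAOS's actual placement algorithm over the pool map is more sophisticated.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_DOMAINS          8    /* fault domains, e.g. servers or racks */
#define TARGETS_PER_DOMAIN 16    /* storage targets within each domain   */

/* FNV-1a: a stand-in for the hash applied to object.dkey. */
static uint64_t
hash64(const void *buf, size_t len, uint64_t seed)
{
	const uint8_t *p = buf;
	uint64_t       h = 14695981039346656037ULL ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Place nr_replicas shards of (oid, dkey) so that no two replicas share a
 * fault domain: pick a starting domain from the hash, then walk the domains. */
static void
place(uint64_t oid, const char *dkey, int nr_replicas, int targets[])
{
	uint64_t h = hash64(dkey, strlen(dkey), oid);

	for (int r = 0; r < nr_replicas && r < NR_DOMAINS; r++) {
		int domain = (int)((h + r) % NR_DOMAINS);
		int tgt    = (int)(hash64(&domain, sizeof(domain), h)
				   % TARGETS_PER_DOMAIN);

		targets[r] = domain * TARGETS_PER_DOMAIN + tgt;
	}
}
```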
Pool Storage on DAOS Server
[Diagram: the DAOS service runs Argobots xstreams; each storage target pairs a PMDK pmemobj pool on SCM with an SPDK blob on the NVMe SSD, and the NVMe block-allocation info is kept alongside the pmemobj pool in SCM.]
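A hedged sketch of how a DAOS xstream might push bulk data into its per-pool SPDK blob. Blobstore initialization on the NVMe bdev, blob creation, and the Argobots integration that suspends and resumes the ULT are omitted; the blob, I/O channel, and DMA-safe buffer are assumed to already exist, and offsets/lengths are expressed in blobstore I/O units rather than bytes.

```c
#include "spdk/blob.h"

struct write_ctx {
	volatile int done;   /* completion flag polled by the xstream */
	int          rc;     /* bserrno reported by the blobstore     */
};

/* Completion callback invoked on the SPDK thread once the write finishes;
 * in DAOS this is where the waiting ULT would be woken up. */
static void
write_done(void *cb_arg, int bserrno)
{
	struct write_ctx *ctx = cb_arg;

	ctx->rc   = bserrno;
	ctx->done = 1;
}

/* Queue an asynchronous write of n_io_units I/O units at 'offset' within the
 * blob. 'dma_buf' must be DMA-safe memory (e.g. from spdk_dma_malloc()); the
 * caller keeps progressing the SPDK thread until ctx->done is set. */
static void
blob_write_async(struct spdk_blob *blob, struct spdk_io_channel *channel,
                 void *dma_buf, uint64_t offset, uint64_t n_io_units,
                 struct write_ctx *ctx)
{
	ctx->done = 0;
	ctx->rc   = 0;
	spdk_blob_io_write(blob, channel, dma_buf, offset, n_io_units,
	                   write_done, ctx);
}
```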
DAOS I/O over PMDK/SPDK
Write path on a DAOS xstream (data lands in SCM or on NVMe SSD):
§ Reserve a new buffer
– Either reserve in SCM with pmemobj_reserve()
– Or reserve space on the NVMe SSD
§ Start the RDMA transfer into the newly reserved buffer
– Either transfer directly into persistent memory
– Or transfer into a DMA buffer, then write it to the NVMe SSD
§ Start a pmemobj transaction
§ Modify the index to insert the new extent
§ Publish the reserved space
– Either with pmemobj_tx_publish() for SCM
– Or by publishing the reserved space on the NVMe SSD
§ Commit the pmemobj transaction and reply to the client
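Below is a hedged sketch of the SCM branch of this write path using libpmemobj's reserve/publish API. The extent-index layout and type number are hypothetical, and a memcpy plus pmemobj_persist stands in for the RDMA transfer into the reserved buffer; the real DAOS/VOS code differs.

```c
#include <errno.h>
#include <string.h>
#include <libpmemobj.h>

#define TYPE_DATA   1       /* hypothetical type number for data extents */
#define MAX_EXTENTS 128

struct extent_index {       /* hypothetical persistent index layout */
	uint32_t nr;
	struct { PMEMoid oid; uint64_t len; } extents[MAX_EXTENTS];
};

static int
scm_write(PMEMobjpool *pop, PMEMoid index_oid, const void *data, size_t len)
{
	struct pobj_action   act;
	struct extent_index *idx;
	PMEMoid              oid;
	volatile int         published = 0;
	volatile int         rc = 0;

	/* 1. Reserve a new buffer in SCM; it stays invisible until published. */
	oid = pmemobj_reserve(pop, &act, len, TYPE_DATA);
	if (OID_IS_NULL(oid))
		return -1;

	/* 2. Land the payload in the reserved buffer and flush it; in DAOS
	 *    this step is the RDMA transfer straight into persistent memory. */
	memcpy(pmemobj_direct(oid), data, len);
	pmemobj_persist(pop, pmemobj_direct(oid), len);

	/* 3.-6. Transaction: insert the new extent into the index, publish
	 *       the reservation, and commit; on success the server replies. */
	TX_BEGIN(pop) {
		pmemobj_tx_add_range(index_oid, 0, sizeof(*idx));
		idx = pmemobj_direct(index_oid);
		if (idx->nr == MAX_EXTENTS)
			pmemobj_tx_abort(ENOSPC);
		idx->extents[idx->nr].oid = oid;
		idx->extents[idx->nr].len = len;
		idx->nr++;
		if (pmemobj_tx_publish(&act, 1) == 0)
			published = 1;
	} TX_ONABORT {
		if (!published)                 /* drop the unpublished buffer */
			pmemobj_cancel(pop, &act, 1);
		rc = -1;
	} TX_END

	return rc;
}
```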
DAOS Performance
IOR IOPS vs. number of clients, 1024 B I/O size:

Clients        1       8      16      32      64     128      256
Write      34996  188782  282017  407431  469666  472509   502516
Read       62392  326432  434839  829526  875873  773290  1019720
• IOR runs on remote clients sending I/O requests to a single DAOS server over the fabric
• Intel Omni-Path Host Adapter 100HFA016LS
• Using the DAOS MPI-IO driver with the full DAOS stack (client, network, server)
• Cascade Lake CPUs, 6 × 512 GB AEP DIMMs (NMA1XBD512GQSE)
DAOS Community Roadmap
All information provided in this roadmap is subject to change without notice.
Timeline (1Q19 through 3Q22): pre-1.0 releases & RCs, followed by releases 1.0, 1.2, 1.4, 2.0, 2.2, and 2.4.
DAOS:
- Replication with self-healing
- Persistent Memory support
- NVMe SSD support
- Self monitoring & bootstrap
- Initial control plane
- python/golang API bindings
I/O Middleware:
- MPI-IO driver
- HDF5 DAOS Connector (proto)
- POSIX I/O (proto)
DAOS:
- Per-pool ACL
- Lustre integration
I/O Middleware:
- HDF5 DAOS Connector
- POSIX I/O support
- Spark
DAOS:
- End-to-end data integrity
- Per-container ACL
- SmartNICs & accelerators
- Improved control plane
DAOS:
- Online server addition
- Advanced control plane
I/O Middleware:
- POSIX data mover
- Async HDF5 operations over DAOS
DAOS:
- Erasure code
- Telemetry & per-job statistics
- Multi OFI provider support
I/O Middleware:
- Advanced POSIX I/O support
- Advanced data mover
Partner engagement & PoCs
DAOS:
- Progressive layout / GIGA+
- Placement optimizations
- Checksum scrubbing
I/O Middleware:
- Apache Arrow (not POR)
DAOS:
- Catastrophic recovery tools
Resources
Source code on GitHub
https://github.com/daos-stack/daos
Community mailing list on Groups.io
daos@daos.groups.io or https://daos.groups.io/g/daos
Wiki
http://daos.io or https://wiki.hpdd.intel.com
Bug tracker
https://jira.hpdd.intel.com