Zoned Storage
- 1. Delivering Zoned Storage Across the Storage Ecosystem
Dave Landsman
Director Industry Standards
System and Software Technologies
22-July-2020
- 2. Why Zoned Block Devices?
SMR HDD and NAND Media Both Require Sequential Write Within Zones
• SMR HDDs consist of regions (zones) in which the tracks are overlapped
  – Within each zone, sequential write only
• NAND die are composed of erase blocks, consisting of many pages
  – Within each erase block, sequential write only
[Figure: SMR zone layout; NAND die divided into erase blocks]
- 3. What are Zoned Block Devices?
• The LBA range is divided into zones
• Writes within a zone must be sequential
• Each zone has a write pointer that keeps track of the position for the next write
• Data in a zone cannot be overwritten; the zone must be erased (reset) before it can be rewritten
[Figure: LBA range from LBA 0 to LBA n divided into Zone 0, Zone 1, Zone 2, … Zone X; written data fills each zone up to the write pointer position; WRITE commands advance the write pointer, ZONE RESET rewinds it]
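To make the zone, write pointer, and reset concepts above concrete, here is a minimal sketch using the Linux zoned block device ioctls from <linux/blkzoned.h>. The device path is an example, a zoned-capable kernel is assumed, and the struct blk_zone field layout can vary slightly across kernel versions.

/* Minimal sketch: report the first zones of a zoned block device and
 * reset the first one. Device path is an example; requires a kernel
 * with BLKREPORTZONE/BLKRESETZONE support and appropriate privileges. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
    int fd = open("/dev/sdb", O_RDWR);      /* example zoned device */
    if (fd < 0) { perror("open"); return 1; }

    unsigned int nr = 16;                    /* report up to 16 zones */
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
    rep->sector = 0;                         /* start of the LBA range */
    rep->nr_zones = nr;

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    for (unsigned int i = 0; i < rep->nr_zones; i++)
        printf("zone %u: start=%llu len=%llu wp=%llu cond=0x%x\n", i,
               (unsigned long long)rep->zones[i].start,
               (unsigned long long)rep->zones[i].len,
               (unsigned long long)rep->zones[i].wp,
               rep->zones[i].cond);

    /* Rewind the write pointer of the first zone (discards its data) */
    struct blk_zone_range range = {
        .sector = rep->zones[0].start,
        .nr_sectors = rep->zones[0].len,
    };
    if (ioctl(fd, BLKRESETZONE, &range) < 0)
        perror("BLKRESETZONE");

    free(rep);
    close(fd);
    return 0;
}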
- 4. Zoned Block Devices need a new contract with the host
• Host and device cooperate on data placement in zones
• Benefits for HDD
  – Areal density growth with SMR
  – Reduced background activities
• Benefits for SSD
  – Reduced write amplification and overprovisioning (OP)
  – Reduced DRAM
  – Improved latency outliers and throughput
• But to realize the benefits, we needed
  – Zoned Block Device Interface Standards
  – Zone-Enabled Storage Stack and Apps
[Figure: Traditional storage – a traditional storage stack and apps place Application 1/2/3 data anywhere in a traditional device's LBA space. Zoned storage – a zone-enabled storage stack and apps, on top of zoned block device interface standards, place each application's data into zones of a zoned block device's LBA space.]
- 5. Zoned Block Device Interface Standards
• Command set interfaces standardized for SMR HDDs
  – Zoned Block Commands (ZBC) for SCSI
  – Zoned ATA Commands (ZAC) for ATA
• ZNS (Zoned Namespaces) recently released for NVMe SSDs
  – Available on www.nvmexpress.org (NVMe 1.4 Ratified TPs)
• ZAC/ZBC and ZNS share the same basic storage model
[Figure: ZAC/ZBC and ZNS zone state machines]
- 6. Zoned Block Enabled SW
• The foundation is the Zoned Block Device (ZBD) interface
  – Block layer generic interface for ZBC, ZAC and ZNS commands and protocol
• Optimizations at different levels enable different paths/results through the SW stack
  – Legacy App -> Unmodified FS -> ZBD
  – Legacy App -> Modified FS -> ZBD
    • FS knows which LBAs belong to which file
  – Modified App -> ZBD
    • App knows the most about its data; e.g., may use zones as containers for objects
Synergy between ZAC/ZBC HDD and ZNS SSD
[Figure: Linux zoned storage stack – in user space, legacy applications (POSIX behavior) use legacy file systems (ext4, xfs) on top of log block device mappers (dm-zoned) or ZBD file systems (f2fs, btrfs), while ZBD-compliant applications use the zone ioctl() interface (libzbd) or passthrough (libzbc, libnvme); the sequential write constraint is exposed to these apps. In the Linux kernel, the block layer ZBD interface (with ZNS support) feeds the SCSI mid-layer and SCSI/ATA low-level drivers for SATA ZAC and SAS ZBC HDDs, and the NVMe driver with ZNS for NVMe ZNS SSDs.]
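As a small discovery aside, an application or administrator can tell whether any of these zoned paths apply by checking if a block device is zoned at all. A minimal sketch reading the sysfs queue attribute follows; the device name is an example, and on zoned-aware kernels the attribute reports none, host-aware, or host-managed.

/* Sketch: query the zoned model of a block device from sysfs.
 * The device name is an example; on zoned-aware kernels,
 * /sys/block/<dev>/queue/zoned is "none", "host-aware" or "host-managed". */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char model[32] = "";
    FILE *f = fopen("/sys/block/nvme0n1/queue/zoned", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fgets(model, sizeof(model), f))
        model[strcspn(model, "\n")] = '\0';
    fclose(f);

    printf("zoned model: %s\n", model);
    /* Zoned devices also expose queue/nr_zones and queue/chunk_sectors
     * (the zone size in 512-byte sectors) on recent kernels. */
    return 0;
}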
- 7. Zoned Storage and the Developer
• Sequential write constraint handled by lower layers
  – POSIX-compliant file systems with native ZBD support, or regular block device emulation
  – Complexity is hidden from the application developer, BUT a performance overhead may exist
    • Zone garbage collection (GC)
• Application writes sequentially
  – Requires a well-designed write path (see the sketch below)
    • Serialization with a mutual exclusion lock, one writer thread per zone, etc.
    • Difficulty often depends on the application's existing design
  – Can be very effective, but can be more complex to implement
What does “keeping it sequential” really mean?
[Figure: User-space access paths – legacy applications use file access or block access with POSIX behavior; ZBD-compliant applications use file access, zoned block access (zone ioctl(), libzbd), or direct device access (passthrough via libzbc, libnvme), with the sequential write constraint exposed to users.]
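For the "application writes sequentially" option above, a minimal sketch of one possible write path; the struct and helper are illustrative, not from any library. A per-zone mutex serializes writers and a software copy of the write pointer picks the next offset.

/* Illustrative only: serialize concurrent writers to one zone so that
 * all I/O within the zone stays sequential. For O_DIRECT, 'buf' and
 * 'len' must also be aligned to the device block size. */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

struct zone_writer {
    int             fd;        /* zoned block device, opened for writing */
    uint64_t        start;     /* zone start offset in bytes */
    uint64_t        capacity;  /* writable bytes in the zone */
    uint64_t        wp;        /* software copy of the zone write pointer */
    pthread_mutex_t lock;      /* one lock per zone serializes its writers */
};

/* Returns bytes written, 0 if the zone is full, or -1 on error. */
ssize_t zone_sequential_write(struct zone_writer *z, const void *buf, size_t len)
{
    ssize_t ret = 0;

    pthread_mutex_lock(&z->lock);
    if (z->wp + len <= z->start + z->capacity) {
        /* Writing at the tracked write pointer keeps the zone sequential */
        ret = pwrite(z->fd, buf, len, (off_t)z->wp);
        if (ret > 0)
            z->wp += (uint64_t)ret;
    }
    pthread_mutex_unlock(&z->lock);

    return ret;
}

An alternative that avoids the lock entirely is the one-writer-thread-per-zone design mentioned in the bullet above, where ordering falls out of the single-threaded write path.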
- 8. App Example: RocksDB
• LSM-tree based key-value store
  – LSM-trees generate mostly sequential writes
• ZenFS – a new storage backend for RocksDB
  – Maps SSTables to zones
• ~1x device write amplification with ZenFS vs. 3-6x WA measured on a standard SSD
  – Files/tables don't align to the device's physical structure on a standard SSD
• Patches posted upstream, review in progress
  – https://github.com/facebook/rocksdb/pull/6961
End-to-End Integration of Zones
[Figure: LSM-tree structure – in-memory hot data covering key range A–Z is synced to append-only persisted levels and moved down by compaction (levels of e.g. 64 MB, 128 MB, 1 GB, 100 GB, each level growing ~10x), from most updated to least updated (cold data).]
- 9. Summary – Zoned Storage
• Builds on SMR HDD enhancements already delivered to the industry
• Extending to SSDs
  – Reduced wear
  – Reduced overprovisioning
  – Reduced SSD DRAM
• ZNS specification released by NVMe in June 2020
• ZNS Linux kernel SW enabling is under way, creating synergy with the ZAC/ZBC SMR HDD ecosystem
• Application enabling is also under way
  – Ceph, RocksDB, Hadoop, …
• See zonedstorage.io for technical documentation on zoned storage software, the kernel interface, etc.
Making the world safe for Sequential IO!!
- 12. Ceph
• Ceph BlueStore removes the local file system
  – The BlueStore backend writes data directly to the block device and can handle the sequential write constraint
    • No need to implement all the file-system-specific operations
  – RocksDB uses LSM-trees that naturally generate no/few random updates and can easily be stored on zoned block devices
• Zones or groups of zones can be mapped to failure domains that may be smaller than the whole device
  – Reduced network utilization and I/O workload for recovery
• Implementation is on-going (Abutalib)
  – https://github.com/ceph/ceph/pull/35111
Ceph as part of the data infrastructure for zoned storage
Source: https://ceph.com/community/new-luminous-bluestore/
- 13. Hadoop / HDFS
Internal prototype working
• Internal prototype working in a single-server environment with SMR disks
  – Tests on multiple servers are starting
• Support implemented using zonefs to avoid language bindings for the Linux zoned block device interface
  – Zoned block device support from Java!
  – Simplifies development by keeping most of the POSIX semantics used with regular files
  – zonefs use will also naturally enable support of an efficient zone append write path where possible
[Figure: HDFS cluster – a client reaches the Name Node and Data Nodes through a switch. Inside a Data Node, the Hadoop system runs in a Java VM in user space and accesses zones as POSIX files through zonefs: file append uses direct I/O in sector-size units, while file read and truncate use regular POSIX file access. In kernel space, zonefs sits on the block I/O layer and scheduler, then the SCSI layer/driver or NVMe driver, down to ZAC/ZBC SMR disks and NVMe ZNS devices.]
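A minimal sketch of the zonefs append path described above, with assumed example paths: zonefs exposes sequential zones as files (here assumed under the seq directory of an example mount point), and appends must use direct I/O issued at the current file size, which tracks the zone write pointer.

/* Sketch: append one block to a zonefs sequential zone file.
 * Paths and block size are examples; zonefs requires direct I/O on
 * sequential zone files and writes issued at the end of the file. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLOCK_SIZE 4096   /* should match the device's block size */

int main(void)
{
    int fd = open("/mnt/zonefs/seq/0", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE)) { close(fd); return 1; }
    memset(buf, 0xab, BLOCK_SIZE);

    /* The file size is the zone write pointer; append exactly there */
    if (pwrite(fd, buf, BLOCK_SIZE, st.st_size) < 0)
        perror("pwrite");

    free(buf);
    close(fd);
    return 0;
}

Reads and truncate go through regular POSIX file access, as the figure above notes, which is what keeps the Java-side code close to ordinary file handling.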
- 14. Zoned Namespaces TP
• ZNS model is similar to ZAC/ZBC
  – States: Empty, Full, Implicit Open, Explicit Open, Closed, Read Only, Offline
  – State changes: Write, Zone Management commands (Open, Close, Finish, Reset), device resets
• Zone Size vs. Zone Capacity (NEW)
  – Zone Size is fixed
  – Zone Capacity is variable
[Figure: ZNS state machine; zones X-1, X, X+1, each with a Zone Start LBA, a fixed Zone Size (e.g., 512 MB) and a Zone Capacity (e.g., 500 MB) that may be smaller than the size]
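A small illustration of the size/capacity distinction, assuming the struct blk_zone reported by recent Linux kernels (the capacity field requires a kernel new enough to expose it): only sectors below start + capacity are writable, even though the zone occupies len sectors of LBA space.

#include <linux/blkzoned.h>

/* A zone spans 'len' 512-byte sectors of LBA space, but only 'capacity'
 * of them can hold data; the gap between capacity and len is never
 * writable. (Assumes a kernel whose struct blk_zone has 'capacity'.) */
static int sector_is_writable(const struct blk_zone *z, __u64 sector)
{
    return sector >= z->start && sector < z->start + z->capacity;
}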
- 15. Zoned Namespaces TP
• ZAC/ZBC requires strict write ordering
  – Limits write performance, increases host overhead
• Low scalability with multiple writers to a zone
  – One writer per zone -> good performance
  – Multiple writers per zone -> lock contention
    • Performance improves somewhat by writing to multiple zones
• With Zone Append, we scale
  – Append data to a zone with an implicit write pointer
  – Drive returns the LBA where the data was written in the zone
Zone Append
[Figure: KIOPS vs. number of writers (1–4) on bare metal (i7-6700K, 4K, QD1, RW, libaio), comparing writes to 1 zone, writes to 4 zones, and Zone Append]
- 16. Zoned Namespaces TP
How does Zone Append work?
• With regular zone writes, the host serializes I/O, which forces a low queue depth
  – Insignificant lock contention when using HDDs
  – Significant lock contention when using SSDs
• With Zone Append, there is no host serialization, so a higher queue depth is possible
  – Scalable for HDDs and SSDs
[Figure: Zone Write example (queue depth = 1) – a 4K Write0, an 8K Write1 and a 16K Write2 are issued to the zone one at a time, the write pointer advancing after each write. Zone Append example (queue depth = 3) – the same three appends are in flight concurrently, and the write pointer ends up in the same place after all writes complete.]
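A host-side sketch of the difference, using hypothetical wrappers: zone_write() and zone_append() below stand in for whatever submission path is used and are not a real API. With a regular write the host picks the LBA and must serialize; with Zone Append the device picks the LBA at its write pointer and returns it in the completion.

/* Illustrative only: zone_write()/zone_append() are hypothetical
 * submission primitives, assumed to be implemented elsewhere. */
#include <stdint.h>

extern void     zone_write(uint64_t lba, const void *buf, uint32_t nlb);
extern uint64_t zone_append(uint64_t zslba, const void *buf, uint32_t nlb);

/* Regular zone write: the host chooses the LBA, so it must serialize
 * writers and track the write pointer itself (queue depth 1 per zone). */
uint64_t write_at_wp(uint64_t *wp, const void *buf, uint32_t nlb)
{
    uint64_t lba = *wp;          /* host-chosen LBA == current write pointer */
    zone_write(lba, buf, nlb);
    *wp += nlb;                  /* host advances its copy of the pointer */
    return lba;
}

/* Zone Append: the host only names the zone (by its start LBA); the device
 * writes at its own write pointer and returns the LBA it used, so many
 * appends can be in flight concurrently without host-side locking. */
uint64_t append_to_zone(uint64_t zslba, const void *buf, uint32_t nlb)
{
    return zone_append(zslba, buf, nlb);
}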
- 17. Zoned Namespaces TP
Attributes: Zone Excursions & Variable Capacity
• For NVMe™ devices that implement the Zoned Command Set, there is optional support for:
  – Variable Capacity
    • The completion of a Reset Zone command may result in a notification that the zone capacity has changed
  – Zone Excursions
    • The device can transition a zone to Full before writes reach the Zone Capacity. The host will receive an AEN, and a write failure if it writes after the transition
• If the device implements these, the host shall implement them as well
  – The state model is incoherent otherwise – software must be specifically written to know that zone capacity can change, or writes may suddenly fail
[Figure: Zone X with its Zone Start LBA, Zone Size (e.g., 512 MB) and Zone Capacity (e.g., 500 MB); a zone excursion transitions the zone to Full before the write pointer reaches the zone capacity]
- 18. The Zone Storage Model
• “Sequential Write Required”
  – Write operations must be issued in order to a zone.
  – A zone has a write pointer that communicates where the next write must be issued.
• A zone has an associated state machine:
  – It controls how a zone is accessed, e.g.:
    • Empty or Open -> write operations are allowed.
    • Full -> write operations fail.
• The state machine and other zone attributes are maintained in Zone Descriptors, which are accessed using the Zone Management Receive command.
  – Active Resources and Open Resources restrict how many zones can be in specific states.
• A zone's state can be manipulated by the host using the Zone Management Send command.
  – E.g., Open Zone, Close Zone, Finish Zone, Reset Zone, …
[Figure: Zone state machine (an arrow to or from a shaded area indicates transitions to or from all states in that area). LBA space from 0 to LBA X-1 divided into Zone 0 … Zone Y; a partially written zone (Zone 1) holds written data up to the write pointer position; write operations advance the write pointer and a Zone Reset rewinds it.]
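As a rough illustration of the state model above, here is a simplified host-side sketch, not the normative specification behavior; Read Only, Offline and the active/open resource limits are omitted.

/* Simplified sketch of the zone state machine described above.
 * Only the common transitions are modeled; error cases, active/open
 * resource limits, Read Only and Offline states are left out. */
enum zone_state {
    ZONE_EMPTY,
    ZONE_IMPLICIT_OPEN,   /* opened by a write */
    ZONE_EXPLICIT_OPEN,   /* opened by an Open Zone command */
    ZONE_CLOSED,
    ZONE_FULL,
};

enum zone_op { OP_WRITE, OP_OPEN, OP_CLOSE, OP_FINISH, OP_RESET };

/* Returns the next state, or -1 if the operation is not allowed. */
int zone_next_state(enum zone_state s, enum zone_op op, int reaches_capacity)
{
    switch (op) {
    case OP_WRITE:
        if (s == ZONE_FULL)
            return -1;                       /* writes to a Full zone fail */
        if (reaches_capacity)
            return ZONE_FULL;                /* last write fills the zone */
        return (s == ZONE_EXPLICIT_OPEN) ? ZONE_EXPLICIT_OPEN
                                         : ZONE_IMPLICIT_OPEN;
    case OP_OPEN:   return (s == ZONE_FULL) ? -1 : ZONE_EXPLICIT_OPEN;
    case OP_CLOSE:  return (s == ZONE_IMPLICIT_OPEN ||
                            s == ZONE_EXPLICIT_OPEN) ? ZONE_CLOSED : -1;
    case OP_FINISH: return ZONE_FULL;        /* Finish fills the zone */
    case OP_RESET:  return ZONE_EMPTY;       /* Reset rewinds the write pointer */
    }
    return -1;
}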