© 2020 Western Digital Corporation or its affiliates. All rights reserved. 7/21/20
Delivering Zoned Storage Across
the Storage Ecosystem
Dave Landsman
Director Industry Standards
System and Software Technologies
22-July-2020
Why Zoned Block Devices?
SMR HDD and NAND Media Both Require Sequential Write Within Zones
• SMR HDDs consist of regions (zones) in which the tracks are overlapped
– Within each zone, sequential write only
• NAND die are composed of erase blocks, consisting of many pages
– Within each erase block, sequential write only
[Figure: an SMR HDD zone; a NAND die and one of its erase blocks]
What are Zoned Block Devices?
• LBA range is divided into Zones
• Writes within a Zone must be sequential
• Each Zone has a write pointer that keeps
track of the position for the next write
• Data in a Zone cannot be overwritten
• A Zone must be reset (erased) before it can be rewritten
[Figure: the LBA space from LBA 0 to LBA n divided into Zone 0, Zone 1, Zone 2, ... Zone X. Inside a zone, written data ends at the write pointer position; WRITE commands advance the write pointer, and the ZONE RESET command rewinds it.]
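The rules on this slide can be sketched as a small in-memory model. This is an illustration only, not a device interface: the `Zone` class, its method names, and the block counts are hypothetical and chosen for the example.

```python
class Zone:
    """Toy model of one zone: a start LBA, a length, and a write pointer."""

    def __init__(self, start_lba: int, num_blocks: int):
        self.start_lba = start_lba
        self.num_blocks = num_blocks
        self.write_pointer = start_lba  # next LBA that may be written

    def write(self, lba: int, nblocks: int) -> None:
        # A write is accepted only if it starts exactly at the write pointer.
        if lba != self.write_pointer:
            raise ValueError(f"unaligned write: lba={lba}, wp={self.write_pointer}")
        if lba + nblocks > self.start_lba + self.num_blocks:
            raise ValueError("write crosses the end of the zone")
        self.write_pointer += nblocks  # a WRITE advances the write pointer

    def reset(self) -> None:
        # A ZONE RESET rewinds the write pointer to the start of the zone.
        self.write_pointer = self.start_lba


zone = Zone(start_lba=0, num_blocks=8)
zone.write(0, 4)
zone.write(4, 2)
try:
    zone.write(0, 1)  # attempt to overwrite data behind the write pointer
except ValueError as err:
    print("rejected:", err)
zone.reset()
zone.write(0, 1)      # accepted again after the reset
```

The overwrite attempt is rejected because it does not start at the write pointer; only the reset makes LBA 0 writable again.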
Zoned Block Devices need a new contract with the host
• Host and device cooperate on data placement in Zones
• Benefits for HDD
– Areal density growth with SMR
– Reduced background activities
• Benefits for SSD
– Reduces write amplification and over-provisioning (OP)
– Reduces DRAM
– Improves latency outliers and throughput
• But to realize the benefits, we needed
– Zoned Block Device Interface Standards
– Zone-Enabled Storage Stack and Apps
[Figure: Traditional Storage vs. Zoned Storage. Traditional Storage: Applications 1, 2 and 3 write data through the Traditional Storage Stack and Apps to the LBA space of a Traditional Device. Zoned Storage: Applications 1, 2 and 3 write data through the Zone Enabled Storage Stack and Apps, over the Zoned Block Device Interface Standards, to zones in the LBA space of a Zoned Block Device.]
Zoned Block Device Interface Standards
• Command set interfaces standardized for SMR HDDs
– Zoned Block Commands (ZBC) for SCSI
– Zoned Device ATA Command Set (ZAC) for ATA
• ZNS (Zoned Namespaces) recently released for NVMe SSDs
– Available on www.nvmexpress.org (NVMe 1.4 Ratified TPs)
• ZAC/ZBC and ZNS share the same basic storage model
[Figure: the ZNS state machine shown next to the ZAC/ZBC state machine]
Zoned Block Enabled SW
• Foundation is Zoned Block Device (ZBD)
– Block layer generic interface for ZBC, ZAC and
ZNS commands and protocol
• Optimizations at different levels enable
different paths/results through SW stack
– Legacy App -> Unmodified FS -> ZBD
– Legacy App -> Modified FS -> ZBD
• FS knows which LBAs belong to which file
– Modified App -> ZBD
• App knows the most about its data; e.g., it may use zones as containers for objects
Synergy between ZAC/ZBC HDD and ZNS SSD
[Figure: Linux zoned storage software stack.
User Space: Legacy Applications (POSIX behavior); ZBD Compliant Applications (sequential write constraint exposed to apps) using zone ioctl() (libzbd) or passthrough (libzbc, libnvme).
Linux Kernel: Legacy File Systems (ext4, xfs) over Log Blk Dev Mappers (dm-zoned); ZBD File Systems (f2fs, btrfs); Block Layer with ZBD Interface + ZNS; SCSI mid-layer and SCSI/ATA low level drivers; NVMe driver + ZNS.
Devices: SATA ZAC HDD, SAS ZBC HDD, NVMe ZNS SSD.]
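As one illustration of how a ZBD-aware application might discover zone geometry before choosing a path through this stack, the sketch below reads block-layer attributes from sysfs. The attribute names `queue/zoned`, `queue/chunk_sectors` and `queue/nr_zones` are not taken from the slides; they reflect my understanding of the Linux block layer and should be checked against the kernel documentation for the kernel in use. The function name and the `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path


def read_zone_info(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the zone model, zone size and zone count reported for a block device."""
    queue = Path(sysfs_root) / device / "queue"
    # Assumed values: "none", "host-aware" or "host-managed".
    model = (queue / "zoned").read_text().strip()
    # Assumed meaning: zone size expressed in 512-byte sectors.
    chunk_sectors = int((queue / "chunk_sectors").read_text())
    # nr_zones may be missing on older kernels, so treat it as optional.
    nr_zones_file = queue / "nr_zones"
    nr_zones = int(nr_zones_file.read_text()) if nr_zones_file.exists() else None
    return {
        "model": model,
        "zone_size_bytes": chunk_sectors * 512,
        "nr_zones": nr_zones,
    }
```

Calling `read_zone_info("sdb")` on a host-managed SMR disk would be the typical use; the `sysfs_root` argument exists only so the function can be exercised against a fake directory tree.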
Zoned Storage and the Developer
• Sequential write constraint must be handled by lower layers
– POSIX-compliant file systems with native ZBD support, or regular block device emulation
• Complexity is hidden from the application developer, BUT performance overhead may exist
– Zone garbage collection (GC)
• Application writes sequentially
– Requires a well-designed write path
• Serialization with a mutual exclusion lock, one writer thread per zone, etc.
• Difficulty often depends on the application's existing design
• Can be very effective
– But can be more complex to implement
What does “keeping it sequential” really mean?
[Figure: user-space access paths. Legacy Applications see POSIX behavior; for ZBD Compliant Applications the sequential write constraint is exposed to users. Access paths shown: File Access, Block Access, Zoned Block Access (zone ioctl(), libzbd), and Direct Device Access (passthrough; libzbc, libnvme).]
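The "serialization with a mutual exclusion lock" option for an application that writes sequentially can be sketched as follows. This is a toy model under stated assumptions: a `bytearray` stands in for the zone, and `LockedZoneWriter`, `worker` and the sizes are hypothetical names and values picked for the example.

```python
import threading


class LockedZoneWriter:
    """Serializes writers to one zone so data lands in write-pointer order."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = bytearray()       # stands in for the zone's media
        self.lock = threading.Lock()  # one lock per zone

    def write(self, payload: bytes) -> int:
        # Holding the lock across "read the write pointer" and "issue the write"
        # keeps two threads from targeting the same offset.
        with self.lock:
            offset = len(self.data)   # current write pointer
            if offset + len(payload) > self.capacity:
                raise IOError("zone full")
            self.data += payload
            return offset


zone = LockedZoneWriter(capacity=4096)
offsets = []


def worker(tag: bytes) -> None:
    for _ in range(16):
        offsets.append(zone.write(tag * 8))


threads = [threading.Thread(target=worker, args=(t,)) for t in (b"a", b"b", b"c", b"d")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(offsets) == list(range(0, 512, 8)))
```

Every write lands exactly at the write pointer, but only one thread makes progress at a time, which is the cost the slide alludes to with "one writer thread per zone".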
App Example: RocksDB
• LSM-tree based key-value store
– LSM-trees generate mostly sequential writes
• ZenFS – a new storage backend for RocksDB
– Maps SSTables to Zones
• ~1X device write amplification (WA) with ZenFS vs. 3-6X WA measured on a standard SSD
– Files/tables don’t align to the device's physical structure on a standard SSD
• Patches posted upstream, review in progress
– https://github.com/facebook/rocksdb/pull/6961
End-to-End Integration of Zones
[Figure: LSM-tree levels of sorted key ranges (A to Z). In-memory hot data (most updated) is synced to persisted, append-only levels; compaction moves data into successively larger levels of colder data (least updated), e.g. 64MB, 128MB, 1G, 100G, labeled "Grows 10X".]
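For readers unfamiliar with the metric, device write amplification is the ratio of bytes written to the media to bytes written by the host. The sketch below only illustrates the arithmetic; the byte counters are hypothetical values chosen to fall inside the ranges quoted on the slide, not measurements.

```python
def write_amplification(host_bytes: int, media_bytes: int) -> float:
    """Bytes written to the media per byte written by the host."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return media_bytes / host_bytes


GIB = 1024 ** 3
# Hypothetical counters: the host writes 100 GiB in both cases.
conventional = write_amplification(100 * GIB, 400 * GIB)  # device GC rewrites data
zoned = write_amplification(100 * GIB, 100 * GIB)         # no device-side rewrites
print(f"conventional WA = {conventional:.1f}x, zoned WA = {zoned:.1f}x")
```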
Summary – Zoned Storage
• Builds on SMR HDD enhancements already delivered to the industry
• Extending to SSDs
– Reduced wear
– Reduced overprovisioning
– Reduced SSD DRAM
• ZNS Specification released by NVMe in June 2020
• ZNS Linux kernel SW enabling under way, creating synergy with the ZAC/ZBC SMR HDD ecosystem
• Application enabling also under way
– Ceph, RocksDB, Hadoop, …
• See zonedstorage.io for technical documentation on zoned storage software, kernel interface, etc.
Making the world safe for Sequential IO!!
BACKUP
Ceph
• Ceph BlueStore removes the local file system
– BlueStore backend writes data directly to the block device and can handle the sequential write constraint
• No need to implement all the file-system-specific operations
– RocksDB uses LSM-trees that naturally generate few or no random updates and can easily be stored on zoned block devices
• Zones or groups of zones can be mapped to failure domains that may be smaller than the whole device
– Reduced network utilization and I/O workload for recovery
• Implementation is ongoing (Abutalib)
– https://github.com/ceph/ceph/pull/35111
Ceph as part of the data infrastructure for zoned storage
Source: https://ceph.com/community/new-luminous-bluestore/
Hadoop / HDFS
Internal prototype working
• Internal prototype working in a single-server environment with SMR disks
– Tests on multiple servers starting
• Support implemented using zonefs to avoid language bindings for the Linux zoned block device interface
– Zoned block device support from Java!
– Simplifies development by keeping most of the POSIX semantics used with regular files
– zonefs use will also naturally enable support of an efficient zone append write path where possible
[Figure: Hadoop cluster with client, switch, Name Node, and Data Nodes. Data Node stack: in User Space, the Hadoop system runs in a Java VM and uses POSIX files (file append: direct I/O by sector size; file read, truncate: regular POSIX file access). Kernel Space: zonefs; Block I/O Layer, Scheduler; SCSI Layer/Driver, NVMe Driver. HW: ZAC/ZBC SMR disks, NVMe ZNS Devices.]
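The figure notes that file appends use direct I/O in units of the sector size. A minimal sketch of what that implies for a writer is rounding each append up to a sector multiple. The helper name, the zero-byte padding, and the default sector size are assumptions for illustration; the slides do not describe how the prototype handles partial sectors.

```python
def pad_to_sector(payload: bytes, sector_size: int = 4096) -> bytes:
    """Pad payload with zero bytes up to the next multiple of sector_size."""
    remainder = len(payload) % sector_size
    if remainder == 0:
        return payload
    return payload + b"\0" * (sector_size - remainder)


record = pad_to_sector(b"block-0001", sector_size=512)
print(len(record))
```

A writer that pads this way has to remember the logical length of each record elsewhere, since the padding becomes part of the file contents.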
Zoned Namespaces TP
• ZNS model similar to ZAC/ZBC
– States: Empty, Full, Implicit Open, Explicit Open,
Closed, Read Only, Offline
– State Changes: Write, Zone Management Command
(Open, Close, Finish, Reset), Device Resets
• Zone Size vs. Zone Capacity (NEW)
– Zone Size is fixed
– Zone Capacity is variable
[Figure: Zones X - 1, X, and X + 1, each beginning at its Zone Start LBA. Zone Size (e.g., 512MB) spans the whole zone; Zone Capacity (e.g., 500MB) covers the first part of it. Also shown: the ZNS State Machine.]
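The relationship between Zone Size and Zone Capacity can be sketched with the figure's example values. The `ZoneGeometry` class, its method names, and the 4096-byte block size are hypothetical; a single capacity for all zones is a simplification, since the slide states that capacity is variable.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class ZoneGeometry:
    zone_size: int          # bytes between consecutive zone starts (fixed)
    zone_capacity: int      # writable bytes in a zone (at most zone_size)
    block_size: int = 4096  # assumed logical block size

    def start_lba(self, zone_index: int) -> int:
        # Zone starts are spaced by the zone size, not by the capacity.
        return zone_index * self.zone_size // self.block_size

    def writable_lbas(self, zone_index: int) -> range:
        start = self.start_lba(zone_index)
        return range(start, start + self.zone_capacity // self.block_size)

    def unwritable_bytes(self) -> int:
        # Address space at the end of each zone that cannot be written.
        return self.zone_size - self.zone_capacity


geo = ZoneGeometry(zone_size=512 * MIB, zone_capacity=500 * MIB)
print(geo.start_lba(1), len(geo.writable_lbas(1)), geo.unwritable_bytes() // MIB)
```

With these values each zone starts on a 512MB boundary, while the last 12MB of each zone's address range lies beyond the capacity.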
Zoned Namespaces TP
• ZAC/ZBC requires strict write ordering
– Limits write performance, increases host overhead
• Low scalability with multiple writers to a zone
– One writer per zone -> Good performance
– Multiple writers per zone -> Lock contention
• Performance improves somewhat by writing to
multiple Zones
• With Zone Append, we scale
– Append data to a zone with implicit write pointer
– Drive returns LBA where data was written in zone
Zone Append
[Figure: KIOPS vs. Number of Writers (1 to 4) for three configurations: 1 Zone, 4 Zones, and Zone Append. Test setup: Metal (i7-6700K, 4K, QD1, RW, libaio).]
Zoned Namespaces TP
How does Zone Append work?
Zone Write Example (Queue Depth = 1)
• Host serializes I/O, forces low queue depth
• Insignificant lock contention when using HDDs
• Significant lock contention when using SSDs
[Figure: 4K Write0, 8K Write1 and 16K Write2 are issued to the zone one at a time; the write pointer (WP) is shown after W0, after W1, and after W2.]
Zone Append Example (Queue Depth = 3)
• No host serialization; higher queue depth
• Scalable for HDDs and SSDs
[Figure: 4K Write0, 8K Write1 and 16K Write2 are issued to the zone together; the write pointer (WP) is shown after all writes.]
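The Zone Append example can be sketched as a toy model in which the host names only the zone and learns the LBA from the completion. Everything here is hypothetical: `AppendZone` is not a device API, and the internal lock merely models the fact that a device places each append somewhere, without claiming how real drives do it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AppendZone:
    """Toy device-side zone: the device, not the host, picks the write location."""

    def __init__(self, start_lba: int, capacity_blocks: int):
        self.start_lba = start_lba
        self.capacity_blocks = capacity_blocks
        self._wp = start_lba
        self._device_lock = threading.Lock()  # models ordering inside the device

    def append(self, nblocks: int) -> int:
        # The host supplies no LBA; the device returns the one it used.
        with self._device_lock:
            if self._wp + nblocks > self.start_lba + self.capacity_blocks:
                raise IOError("zone full")
            lba = self._wp
            self._wp += nblocks
            return lba


zone = AppendZone(start_lba=0, capacity_blocks=64)
sizes = [1, 2, 4]  # 4K, 8K and 16K writes, assuming 4K blocks
with ThreadPoolExecutor(max_workers=3) as pool:  # three commands in flight
    lbas = list(pool.map(zone.append, sizes))
placed = sorted(zip(lbas, sizes))
print(placed)  # placement order may differ from submission order
```

The host needs no lock of its own: whichever order the three appends are placed in, the extents do not overlap, and the host records the returned LBAs instead of predicting them.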
Zoned Namespaces TP
• For NVMe™ devices that implement the Zoned Command Set, there is optional support for:
– Variable Capacity
• The completion of a Reset Zone command may result in a notification that the zone capacity has changed
– Zone Excursions
• The device can transition a zone to Full before writes reach the Zone Capacity. The host will receive an AEN, and writes issued after the transition will fail
• If the device implements these, the host shall implement them as well
– Otherwise the state model is incoherent: software should be specifically written to know that zone capacity can change, or that writes may suddenly fail
Attributes: Zone Excursions & Variable Capacity
[Figure: Zones X - 1, X, and X + 1, each beginning at its Zone Start LBA, with Zone Size (e.g., 512MB), Zone Capacity (e.g., 500MB), and a Zone Excursion marked.]
…
The Zone Storage Model
• “Sequential Write Required”
• Write operations must be issued in order to a zone.
• A zone has a write pointer that communicates where the next write must be issued.
• A zone has an associated state machine:
• It controls how a zone is accessed, e.g.,
• Empty or Open -> write operations are allowed.
• Full -> write operations fail.
• The state machine and other zone attributes are maintained in Zone Descriptors. The Zone Descriptors are accessed using the Zone Management Receive command.
• Active Resources and Open Resources restrict how many zones can be in specific states.
• A zone’s state can be manipulated by the host by using the Zone Management Send command
• E.g., Open Zone, Close Zone, Finish Zone, Reset Zone, …
[Figure note: An arrow to or from a shaded area indicates transitions to or from all states in that area.]
[Figure: the Zone State Machine, and the LBA space (LBA 0, 1, 2, ... X-1, X) divided into Zone 0, Zone 1, ... Zone Y. Write pointer in a partially written zone: write operations advance the write pointer; a Zone Reset rewinds the write pointer.]
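A reduced version of the state machine described above can be sketched as follows. It is a simplification and not a restatement of the specification: Read Only and Offline states, Active and Open Resource limits, and corner cases such as closing a zone with nothing written are left out, and the class and method names are hypothetical.

```python
from enum import Enum


class ZoneState(Enum):
    EMPTY = "Empty"
    IMPLICIT_OPEN = "Implicit Open"
    EXPLICIT_OPEN = "Explicit Open"
    CLOSED = "Closed"
    FULL = "Full"


OPEN_STATES = {ZoneState.IMPLICIT_OPEN, ZoneState.EXPLICIT_OPEN}


class ZoneFsm:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.written = 0
        self.state = ZoneState.EMPTY

    def write(self, nblocks: int) -> None:
        if self.state is ZoneState.FULL:
            raise IOError("write to a Full zone fails")
        if self.written + nblocks > self.capacity:
            raise IOError("write exceeds the zone capacity")
        if self.state not in OPEN_STATES:
            self.state = ZoneState.IMPLICIT_OPEN  # a write opens the zone implicitly
        self.written += nblocks
        if self.written == self.capacity:
            self.state = ZoneState.FULL

    def open(self) -> None:
        if self.state is ZoneState.FULL:
            raise IOError("cannot open a Full zone")
        self.state = ZoneState.EXPLICIT_OPEN

    def close(self) -> None:
        # Simplification: ignores the case of an open zone with nothing written.
        if self.state not in OPEN_STATES:
            raise IOError("only an open zone can be closed")
        self.state = ZoneState.CLOSED

    def finish(self) -> None:
        self.state = ZoneState.FULL

    def reset(self) -> None:
        self.written = 0
        self.state = ZoneState.EMPTY


fsm = ZoneFsm(capacity=4)
fsm.write(2)   # Empty -> Implicit Open
fsm.close()    # Implicit Open -> Closed
fsm.write(2)   # Closed -> Implicit Open -> Full (capacity reached)
print(fsm.state.value)
fsm.reset()    # Full -> Empty
```

The `open`, `close`, `finish` and `reset` methods stand in for the Zone Management Send actions listed on the slide; `write` shows how ordinary writes also drive state changes.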

More Related Content

What's hot

What's hot (20)

Scaling for Performance
Scaling for PerformanceScaling for Performance
Scaling for Performance
 
M|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScaleM|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScale
 
Latest performance changes by Scylla - Project optimus / Nolimits
Latest performance changes by Scylla - Project optimus / Nolimits Latest performance changes by Scylla - Project optimus / Nolimits
Latest performance changes by Scylla - Project optimus / Nolimits
 
MariaDB MaxScale
MariaDB MaxScaleMariaDB MaxScale
MariaDB MaxScale
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
ceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-short
 
Bluestore
BluestoreBluestore
Bluestore
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
 
Apache pulsar - storage architecture
Apache pulsar - storage architectureApache pulsar - storage architecture
Apache pulsar - storage architecture
 
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RACThe Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
 
Ceph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking ToolCeph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking Tool
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanPerformance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
 
Understanding blue store, Ceph's new storage backend - Tim Serong, SUSE
Understanding blue store, Ceph's new storage backend - Tim Serong, SUSEUnderstanding blue store, Ceph's new storage backend - Tim Serong, SUSE
Understanding blue store, Ceph's new storage backend - Tim Serong, SUSE
 
Outrageous Performance: RageDB's Experience with the Seastar Framework
Outrageous Performance: RageDB's Experience with the Seastar FrameworkOutrageous Performance: RageDB's Experience with the Seastar Framework
Outrageous Performance: RageDB's Experience with the Seastar Framework
 

Similar to Zoned Storage

OSS Presentation Accelerating VDI by Daniel Beveridge
OSS Presentation Accelerating VDI by Daniel BeveridgeOSS Presentation Accelerating VDI by Daniel Beveridge
OSS Presentation Accelerating VDI by Daniel Beveridge
OpenStorageSummit
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualization
Sisimon Soman
 
Storage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkStorage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talk
Sisimon Soman
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicom
Microsoft Singapore
 

Similar to Zoned Storage (20)

OSS Presentation Accelerating VDI by Daniel Beveridge
OSS Presentation Accelerating VDI by Daniel BeveridgeOSS Presentation Accelerating VDI by Daniel Beveridge
OSS Presentation Accelerating VDI by Daniel Beveridge
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualization
 
Optimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDsOptimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDs
 
Azure DBA with IaaS
Azure DBA with IaaSAzure DBA with IaaS
Azure DBA with IaaS
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
The Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged EnvironmentsThe Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged Environments
 
Storage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkStorage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talk
 
Azure Databases with IaaS
Azure Databases with IaaSAzure Databases with IaaS
Azure Databases with IaaS
 
TechNet Live spor 1 sesjon 6 - more vdi
TechNet Live spor 1   sesjon 6 - more vdiTechNet Live spor 1   sesjon 6 - more vdi
TechNet Live spor 1 sesjon 6 - more vdi
 
Linux on System z – disk I/O performance
Linux on System z – disk I/O performanceLinux on System z – disk I/O performance
Linux on System z – disk I/O performance
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicom
 
VMworld Europe 2014: Virtual SAN Best Practices and Use Cases
VMworld Europe 2014: Virtual SAN Best Practices and Use CasesVMworld Europe 2014: Virtual SAN Best Practices and Use Cases
VMworld Europe 2014: Virtual SAN Best Practices and Use Cases
 
VMworld Europe 2014: Virtual SAN Architecture Deep Dive
VMworld Europe 2014: Virtual SAN Architecture Deep DiveVMworld Europe 2014: Virtual SAN Architecture Deep Dive
VMworld Europe 2014: Virtual SAN Architecture Deep Dive
 
VMworld 2014: Virtual SAN Architecture Deep Dive
VMworld 2014: Virtual SAN Architecture Deep DiveVMworld 2014: Virtual SAN Architecture Deep Dive
VMworld 2014: Virtual SAN Architecture Deep Dive
 
Ceph on arm64 upload
Ceph on arm64   uploadCeph on arm64   upload
Ceph on arm64 upload
 
Cross Data Center Replication with Redis using Redis Enterprise
Cross Data Center Replication with Redis using Redis EnterpriseCross Data Center Replication with Redis using Redis Enterprise
Cross Data Center Replication with Redis using Redis Enterprise
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre
 
https://bit.ly/3LE329L
https://bit.ly/3LE329Lhttps://bit.ly/3LE329L
https://bit.ly/3LE329L
 
DAS RAID NAS SAN
DAS RAID NAS SANDAS RAID NAS SAN
DAS RAID NAS SAN
 

Recently uploaded

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
HyderabadDolls
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 

Recently uploaded (20)

💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 

Zoned Storage

  • 1. © 2020 Western Digital Corporation or its affiliates. All rights reserved. 7/21/20 Delivering Zoned Storage Across the Storage Ecosystem Dave Landsman Director Industry Standards System and Software Technologies 22-July-2020
  • 2. 7/21/20© 2019 Western Digital Corporation or its affiliates. All rights reserved. 2 Why Zoned Block Devices? SMR HDD and NAND Media Both Require Sequential Write Within Zones Zone • SMR HDDs consist of regions (zones) in which the tracks are overlapped • Within each zone, sequential write only • NAND die are composed of erase blocks, consisting of many pages • Within each erase block, sequential write only NAND Die Erase Block
  • 3. 7/21/20© 2019 Western Digital Corporation or its affiliates. All rights reserved. 3 What are Zoned Block Devices? • LBA range is divided into Zones • Writes within a Zone must be sequential • Each Zone has a write pointer that keeps track of the position for the next write • Data in a Zone cannot be overwritten • Zone must be erased before it can be rewritten Zone 0 Zone 1 Zone 2 Zone X Write pointer position WRITE commands advance the write pointer ZONE RESET command rewinds the write pointer Written data LBA 0 LBA n
  • 4. 7/21/20© 2019 Western Digital Corporation or its affiliates. All rights reserved. 4 Zoned Block Devices need new contract w/ host • Host/Device cooperate on data placement in Zones • Benefits for HDD – Areal density growth w/ SMR – Reduced background activities • Benefits for SSD • Reduces Write Amplification & OP • Reduces DRAM • Improves latency outliers and throughput • But to realize the benefit, we needed – Zoned Block Device Interface Standards – Zoned Enabled Storage Stack and Apps Application 1 Data Application 2 Data Application 3 Data Zoned Block Device LBA Space Zone Zoned Storage Zone Enabled Storage Stack and Apps Application 1 Data Application 2 Data Application 3 Data Traditional Device LBA Space Traditional Storage Traditional Storage Stack and Apps Zoned Block Device Interface Standards
  • 5. 7/21/20© 2019 Western Digital Corporation or its affiliates. All rights reserved. 5 Zoned Block Interface Device Standards • Command set interfaces standardized for SMR HDDs – Zoned Block Commands (SCSI) – Zoned ATA Commands (ATA) • ZNS (Zoned Namespaces) recently released for NVMe SSDs – Available on www.nvmexpress.org (NVMe 1.4 Ratified TPs) • ZAC/ZBC and ZNS share same basic storage model ZNS State Machine ZAC/ZBC State Machine
  • 6. 7/21/20© 2019 Western Digital Corporation or its affiliates. All rights reserved. 6 Zoned Block Enabled SW • Foundation is Zoned Block Device (ZBD) – Block layer generic interface for ZBC, ZAC and ZNS commands and protocol • Optimizations at different levels enable different paths/results through SW stack – Legacy App -> Unmodified FS -> ZBD – Legacy App -> Modified FS -> ZBD • FS knows which LBAs in which file – Modified App -> ZBD • App knows most about data; e.g., may use zones as containers for objects Synergy between ZAC/ZBC HDD and ZNS SSD Log Blk Dev Mappers ZBD File SystemsLegacy File System (ext4, xfs) User Space Linux Kernel + ZNS SCSI mid-layer Block Layer NVMe driver + ZNS ZBD Interface SCSI/ATA low level drivers SATA ZAC HDD SAS ZBC HDD POSIX behavior Legacy Applications Passthroughzone ioctl() NVMe ZNS SSD ZBD Compliant Applications (libzbd) (libzbc, libnvme) (f2fs, btrfs) (dm-zoned) • Sequential write constraint exposed to apps
  • 7. 7/21/20© 2019 Western Digital Corporation or its affiliates. All rights reserved. 7 Zoned Storage and the Developer • Sequential write constraint must be handled by lower layers – POSIX compliant file systems with native ZBD support or regular block device emulation • Complexity is hidden from application developer BUT performance overhead may exist – Zone garbage collection (GC) • Application writes sequentially – Requires well design write path • Serialization with mutual exclusion lock, one writter thread per zone, etc • Difficulty often depends on application existing design • Can be very effective – But can be more complex to implement What does “keeping it sequential” really mean? User Space File Access Block Access Direct Device AccessZoned Block AccessFile Access File Access POSIX behavior Sequential write constraint exposed to users Legacy Applications ZBD Compliant Applications passthrough libzbc, libnvme zone ioctl() libzbd
  • 8. App Example: RocksDB (End-to-End Integration of Zones)
      • LSM-tree based key-value store
        – LSM-trees generate mostly sequential writes
      • ZenFS: a new storage backend for RocksDB
        – Maps SSTables to Zones
      • ~1X device write amplification (WA) w/ ZenFS vs. 3-6X WA measured on a standard SSD
        – Files/tables don’t align to the device's physical structure on a standard SSD
      • Patches posted upstream, review in progress
        – https://github.com/facebook/rocksdb/pull/6961
      [Figure: LSM-tree levels, from in-memory hot data (most updated) to persisted cold data (least updated). Levels are append-only and are filled by sync and compaction. Example sizes: 64MB, 128MB, 1G, 100G; "Grows 10X".]
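To put the write amplification figures on this slide in concrete terms, the arithmetic can be sketched as below. The function names and the 1 TiB workload are illustrative assumptions; only the ~1X and 3-6X factors come from the slide.

```python
TIB = 1024 ** 4


def write_amplification(media_bytes, host_bytes):
    """Device write amplification: bytes written to media per byte written by the host."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return media_bytes / host_bytes


def media_bytes_written(host_bytes, wa_factor):
    """Bytes the device writes to its media for a given host workload."""
    return host_bytes * wa_factor


# Hypothetical workload: the host writes 1 TiB.
host_bytes = 1 * TIB
zenfs_media = media_bytes_written(host_bytes, 1)         # ~1X with ZenFS
standard_media_low = media_bytes_written(host_bytes, 3)  # 3X on a standard SSD
standard_media_high = media_bytes_written(host_bytes, 6) # 6X on a standard SSD
```

Under these assumptions the standard SSD writes 3 to 6 TiB to flash for 1 TiB of host data, while the zoned device writes about 1 TiB, which is where the reduced wear comes from.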
  • 9. Summary – Zoned Storage
      • Builds on SMR HDD enhancements already delivered to industry
      • Extending to SSDs
        – Reduced wear
        – Reduced overprovisioning
        – Reduced SSD DRAM
      • ZNS Specification released by NVMe in June 2020
      • ZNS Linux kernel SW enabling under way, creating synergy w/ the ZAC/ZBC SMR HDD ecosystem
      • Application enabling also under way
        – Ceph, RocksDB, Hadoop, …
      • See zonedstorage.io for technical documentation on zoned storage software, kernel interface, etc.
      Making the world safe for Sequential IO!!
  • 11. BACKUP
  • 12. Ceph: Ceph as part of the data infrastructure for zoned storage
      • Ceph Bluestore removes the local file system
        – The Bluestore backend writes data directly to the block device and can handle the sequential write constraint
          • No need to implement all the file-system-specific operations
        – RocksDB uses LSM-trees that naturally generate no/few random updates and can easily be stored on zoned block devices
      • Zones or groups of zones can be mapped to failure domains that may be smaller than the whole device
        – Reduced network utilization and I/O workload for recovery
      • Implementation is on-going (Abutalib)
        – https://github.com/ceph/ceph/pull/35111
      Source: https://ceph.com/community/new-luminous-bluestore/
  • 13. Hadoop / HDFS: Internal prototype working
      • Internal prototype working in a single-server environment with SMR disks
        – Tests on multiple servers starting
      • Support implemented using zonefs to avoid language bindings for the Linux zoned block device interface
        – Zoned block device support from Java!
        – Simplifies development by keeping most of the POSIX semantics used with regular files
        – zonefs use will also naturally enable support of an efficient zone append write path where possible
      [Figure: Hadoop system with client, switch, Name Node and Data Nodes. Data Node stack: Java VM in user space accessing POSIX files; kernel space with zonefs, block I/O layer and scheduler, SCSI layer/driver and NVMe driver; hardware with ZAC/ZBC SMR disks and NVMe ZNS devices. File append: direct I/O by sector size. File read, truncate: regular POSIX file access.]
  • 14. Zoned Namespaces TP
      • ZNS model similar to ZAC/ZBC
        – States: Empty, Full, Implicit Open, Explicit Open, Closed, Read Only, Offline
        – State Changes: Write, Zone Management Command (Open, Close, Finish, Reset), Device Resets
      • Zone Size vs. Zone Capacity (NEW)
        – Zone Size is fixed
        – Zone Capacity is variable
      [Figure: Zones X - 1, X and X + 1, each beginning at its Zone Start LBA, with Zone Size (e.g., 512MB) and a smaller Zone Capacity (e.g., 500MB). Second figure: ZNS State Machine.]
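The Zone Size vs. Zone Capacity distinction can be worked through with the slide's 512MB/500MB example. The sketch below is illustrative only: `ZoneGeometry` is a hypothetical helper, a 4096-byte logical block is assumed, MB is read as MiB, and every zone is assumed to have the same capacity even though the slide notes that capacity is variable.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class ZoneGeometry:
    zone_size_blocks: int      # fixed LBA stride between zone starts
    zone_capacity_blocks: int  # writable blocks at the start of each zone

    def zone_start_lba(self, zone_index):
        return zone_index * self.zone_size_blocks

    def writable_lba_range(self, zone_index):
        start = self.zone_start_lba(zone_index)
        return range(start, start + self.zone_capacity_blocks)

    def is_writable(self, lba):
        # LBAs between the capacity and the zone size exist in the address
        # space but cannot hold data.
        return lba % self.zone_size_blocks < self.zone_capacity_blocks


def example_geometry(block_size=4096):
    """Slide example: 512MB zone size, 500MB zone capacity (block size assumed)."""
    return ZoneGeometry(
        zone_size_blocks=512 * MIB // block_size,
        zone_capacity_blocks=500 * MIB // block_size,
    )
```

With these numbers each zone spans 131072 LBAs but only the first 128000 are writable, so zone starts stay aligned to a power of two while the usable space follows the media.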
  • 15. Zoned Namespaces TP: Zone Append
      • ZAC/ZBC requires strict write ordering
        – Limits write performance, increases host overhead
      • Low scalability with multiple writers to a zone
        – One writer per zone -> Good performance
        – Multiple writers per zone -> Lock contention
      • Performance improves somewhat by writing to multiple Zones
      • With Zone Append, we scale
        – Append data to a zone with implicit write pointer
        – Drive returns the LBA where the data was written in the zone
      [Chart: KIOPS vs. Number of Writers (1 to 4) for 1 Zone, 4 Zones and Zone Append. Metal (i7-6700K, 4K, QD1, RW, libaio).]
  • 16. Zoned Namespaces TP: How does Zone Append work?
      • Zone Write Example (Queue Depth = 1)
        – Host serializes I/O, forces low queue depth
        – Insignificant lock contention when using HDDs
        – Significant lock contention when using SSDs
        [Figure: 4K Write0, 8K Write1 and 16K Write2 are placed in the zone one after another; the write pointer (WP) advances after W0, after W1 and after W2.]
      • Zone Append Example (Queue Depth = 3)
        – No host serialization; higher queue depth
        – Scalable for HDDs and SSDs
        [Figure: 4K Write0, 8K Write1 and 16K Write2 are in flight together; the WP is shown once, after all writes.]
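The difference between the two examples on this slide can be simulated in a few lines. This is a sketch under stated assumptions: `SimulatedZone` is a hypothetical in-memory model rather than an NVMe driver API, sizes are counted in 4K blocks, and a seeded shuffle stands in for commands reaching the device in an arbitrary order.

```python
import random


class SimulatedZone:
    """Hypothetical in-memory zone; only the write pointer is modeled."""

    def __init__(self, start_lba, capacity_blocks):
        self.start_lba = start_lba
        self.capacity_blocks = capacity_blocks
        self.write_pointer = start_lba

    def _advance(self, num_blocks):
        if self.write_pointer + num_blocks > self.start_lba + self.capacity_blocks:
            raise IOError("write would exceed the zone capacity")
        assigned_lba = self.write_pointer
        self.write_pointer += num_blocks
        return assigned_lba

    def write(self, lba, num_blocks):
        # Regular write: the host picks the LBA, which must equal the write pointer.
        if lba != self.write_pointer:
            raise IOError(f"write at {lba}, write pointer is {self.write_pointer}")
        self._advance(num_blocks)

    def zone_append(self, num_blocks):
        # Zone Append: the device picks the LBA and returns it to the host.
        return self._advance(num_blocks)


SIZES = {"Write0": 1, "Write1": 2, "Write2": 4}  # 4K, 8K, 16K in 4K blocks


def append_demo(seed=0):
    order = list(SIZES)
    random.Random(seed).shuffle(order)  # arrival order is not controlled by the host
    zone = SimulatedZone(start_lba=0, capacity_blocks=16)
    placed = {name: zone.zone_append(SIZES[name]) for name in order}
    return zone, placed


def regular_write_out_of_order_fails():
    zone = SimulatedZone(start_lba=0, capacity_blocks=16)
    # The host pre-computed LBAs 0, 1, 3 assuming Write0, Write1, Write2 order,
    # but Write1 reaches the device first.
    try:
        zone.write(1, SIZES["Write1"])
    except IOError:
        return True
    return False
```

With Zone Append every arrival order succeeds and the host learns each LBA from the completion, so all three commands can be in flight at once. With regular writes the host has to keep exactly one write outstanding per zone to guarantee the order.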
  • 17. Zoned Namespaces TP: Attributes: Zone Excursions & Variable Capacity
      • For NVMe™ devices that implement the Zoned Command Set, there is optional support for:
        – Variable Capacity
          • The completion of a Reset Zone command may result in a notification that the zone capacity has changed
        – Zone Excursions
          • The device can transition a zone to Full before writes reach the Zone Capacity. The host will receive an AEN, and a write failure if it writes after the transition
      • If the device implements these, the host shall implement them as well
        – Incoherent state model if not
        – Software should be specifically written to know that zone capacity can change, or writes may suddenly fail
      [Figure: Zones X - 1, X and X + 1, each beginning at its Zone Start LBA, with Zone Size (e.g., 512MB) and Zone Capacity (e.g., 500MB); a Zone Excursion is marked inside a zone.]
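What "software should be specifically written" might look like on the host side is sketched below. Everything here is a hypothetical model: `SimulatedZone`, `ZoneFullError` and `write_all` are illustrative names, the AEN is not modeled, and the excursion point is a made-up parameter that a real host would not know in advance.

```python
class ZoneFullError(Exception):
    """Raised by the simulated zone when a write can no longer be accepted."""


class SimulatedZone:
    def __init__(self, capacity_blocks, excursion_at=None):
        self.capacity_blocks = capacity_blocks
        self.write_pointer = 0
        self.full = False
        self._excursion_at = excursion_at  # device-chosen point, unknown to the host

    def append(self, num_blocks):
        # Zone excursion: the device moves the zone to Full before the
        # capacity has been reached.
        if self._excursion_at is not None and self.write_pointer >= self._excursion_at:
            self.full = True
        if self.full or self.write_pointer + num_blocks > self.capacity_blocks:
            raise ZoneFullError("zone cannot accept this write")
        offset = self.write_pointer
        self.write_pointer += num_blocks
        if self.write_pointer == self.capacity_blocks:
            self.full = True
        return offset

    def reset(self, new_capacity_blocks=None):
        # Variable capacity: the capacity may differ after a reset, so the
        # host re-reads it instead of caching the old value.
        self.write_pointer = 0
        self.full = False
        self._excursion_at = None
        if new_capacity_blocks is not None:
            self.capacity_blocks = new_capacity_blocks
        return self.capacity_blocks


def write_all(zones, num_writes):
    """Write one block at a time, moving to the next zone when a write fails."""
    placements = []
    zone_index = 0
    for _ in range(num_writes):
        while True:
            if zone_index >= len(zones):
                raise RuntimeError("no writable zone left")
            try:
                offset = zones[zone_index].append(1)
            except ZoneFullError:
                zone_index += 1  # treat the zone as Full and continue elsewhere
                continue
            placements.append((zone_index, offset))
            break
    return placements
```

The host never assumes it can fill a zone to its reported capacity: a failed write is handled by moving to another zone, and the capacity is read again after every reset.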
  • 18. … The Zone Storage Model
      • “Sequential Write Required”
        – Write operations to a zone must be issued in order.
        – A zone has a write pointer that indicates where the next write must be issued.
      • A zone has an associated state machine:
        – It controls how a zone is accessed, e.g.,
          • Empty or Open -> write operations are allowed.
          • Full -> write operations fail.
        – The state and other zone attributes are maintained in Zone Descriptors. The Zone Descriptors are accessed using the Zone Management Receive command.
        – Active Resources and Open Resources restrict how many zones can be in a specific state.
      • A zone’s state can be manipulated by the host using the Zone Management Send command
        – E.g., Open Zone, Close Zone, Finish Zone, Reset Zone, …
      [Figure: Zone State Machine. An arrow to or from a shaded area indicates transitions to or from all states in that area.]
      [Figure: Write Pointer in a partially written Zone. The LBA space (LBA 0, 1, 2, …, X-1, X) is divided into Zone 0, Zone 1, …, Zone Y. Write operations advance the write pointer; a Zone Reset rewinds the write pointer.]
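The state machine described on this slide can be approximated with a small model. This is a simplified sketch, not the specification's full transition table: Read Only and Offline states, device resets, and the Active/Open Resource limits are left out, and the `Zone` class and its method names are hypothetical.

```python
from enum import Enum


class ZoneState(Enum):
    EMPTY = "Empty"
    IMPLICIT_OPEN = "Implicit Open"
    EXPLICIT_OPEN = "Explicit Open"
    CLOSED = "Closed"
    FULL = "Full"


class ZoneStateError(Exception):
    """Raised when an operation is not allowed in the current state (in this model)."""


OPEN_STATES = (ZoneState.IMPLICIT_OPEN, ZoneState.EXPLICIT_OPEN)


class Zone:
    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.write_pointer = 0
        self.state = ZoneState.EMPTY

    def write(self, num_blocks):
        if self.state is ZoneState.FULL:
            raise ZoneStateError("write operations fail on a Full zone")
        if self.write_pointer + num_blocks > self.capacity_blocks:
            raise ZoneStateError("write would exceed the zone capacity")
        if self.state in (ZoneState.EMPTY, ZoneState.CLOSED):
            self.state = ZoneState.IMPLICIT_OPEN  # a write opens the zone implicitly
        self.write_pointer += num_blocks          # writes advance the write pointer
        if self.write_pointer == self.capacity_blocks:
            self.state = ZoneState.FULL

    # The methods below model Zone Management Send actions.
    def open(self):
        if self.state is ZoneState.FULL:
            raise ZoneStateError("cannot open a Full zone")
        self.state = ZoneState.EXPLICIT_OPEN

    def close(self):
        if self.state not in OPEN_STATES:
            raise ZoneStateError("only an open zone is closed in this model")
        self.state = ZoneState.CLOSED

    def finish(self):
        self.state = ZoneState.FULL

    def reset(self):
        self.write_pointer = 0  # a Zone Reset rewinds the write pointer
        self.state = ZoneState.EMPTY
```

A short walk-through: writing one block to a fresh four-block zone moves it to Implicit Open, `close()` moves it to Closed, writing the remaining three blocks fills it, and `reset()` returns it to Empty with the write pointer at zero.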