© 2019 Joyent, Inc.
Joyent Technical Discussion
NVMe Hotplug
Jordan Hendricks
August 20, 2019
Agenda
1. Motivation
2. Goals
3. Strategy
4. Implementation
5. Demos!
Left: “Campfire Pinecone.png” by Emeldi is licensed under Creative Commons Attribution-Share Alike 3.0 Unported
Right: “A Plug.jpg” by Maddin the brain is licensed under Creative Commons Attribution-Share Alike 3.0 Unported
Motivation: Why NVMe?
● “NVM” = Non-Volatile Memory
● Standard for non-volatile storage over PCIe
● Designed for NVM instead of hard disks
○ e.g., more queues and deeper queues for I/O
● Several form factors
○ We are interested in U.2 (2.5 inch HDD form factor)
● illumos already has driver support
Motivation: Why hotplug?
● “Hotplug” = generally, insertion or removal of devices on a live system
○ Can be “coordinated” or “surprise”
■ Coordinated = notify OS first
■ Surprise = just pull/insert device
● Operational gains: no need to power off a system first
○ Guards against bad panics or errors if the wrong device is pulled
NVMe SSD in slot
Project Goals
1. Support coordinated insertion/removal of NVMe SSDs
2. Support surprise removal of NVMe SSDs
...for drives not in a zpool
...for drives in a zpool
...without ongoing I/O
...with ongoing I/O
Project Strategy
● Hotplug support exists in the form of the hotplug framework, integrated 10 years ago
● To find issues, we tried various forms of hotplug with an NVMe drive
Project Strategy
● Coordinated removal and insertion mostly worked from the start*:
○ Removal
cfgadm -c unconfigure <slot> // offline device
cfgadm -c disconnect <slot> // power off slot
○ Insertion
cfgadm -c connect <slot> // power on slot
cfgadm -c configure <slot> // online device
*Caveat: except for OS-7494, a PCI configurator issue
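The two command sequences above can be captured in a small helper. This is only a sketch in Python: the function name `cfgadm_sequence` is invented here, it only builds the argument lists and runs nothing, and the slot name `Slot12` is borrowed from the demo later in the deck.

```python
def cfgadm_sequence(action, slot):
    """Return the cfgadm argv lists for a coordinated 'remove' or 'insert' of `slot`."""
    steps = {
        "remove": ["unconfigure", "disconnect"],  # offline device, then power off slot
        "insert": ["connect", "configure"],       # power on slot, then online device
    }
    if action not in steps:
        raise ValueError("action must be 'remove' or 'insert'")
    return [["cfgadm", "-c", step, slot] for step in steps[action]]


if __name__ == "__main__":
    # Print the removal sequence for a hypothetical slot name.
    for argv in cfgadm_sequence("remove", "Slot12"):
        print(" ".join(argv))
```

Keeping the order in one place makes it harder to power off a slot before the device has been offlined.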
Project Strategy
● These surprise removal experiments did not work:
○ Removing a device and plugging it back in: the driver failed to detach on pull and, confusingly, detached when the device was plugged back in
○ Pulling a device with ongoing I/O from dd(1M): dd hung waiting for I/O in the kernel
○ Pulling a device from a mirrored zpool with no I/O: zpool status hung waiting on I/O in the kernel
Implementation: Hotplug Framework
EMPTY: No device in slot
PRESENT: Device present in slot
POWERED: Slot powered on
ENABLED: Device ready for use
General order: EMPTY -> PRESENT -> POWERED -> ENABLED
Note that the slot state could jump straight from ENABLED to EMPTY, but this is the general order of operations.
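As a rough illustration of that ordering, the slot states can be modeled as an ordered enumeration. This is a hypothetical Python model, not the hotplug framework's code; the names `SlotState` and `steps_between` are invented for this sketch.

```python
from enum import IntEnum


class SlotState(IntEnum):
    """Slot states in their usual order (illustrative model only)."""
    EMPTY = 0    # no device in slot
    PRESENT = 1  # device present in slot
    POWERED = 2  # slot powered on
    ENABLED = 3  # device ready for use


def steps_between(current, target):
    """Return the states visited when stepping one at a time from current to target."""
    if target >= current:
        return [SlotState(v) for v in range(current + 1, target + 1)]
    return [SlotState(v) for v in range(current - 1, target - 1, -1)]


if __name__ == "__main__":
    # Coordinated insertion walks upward through every state.
    print([s.name for s in steps_between(SlotState.EMPTY, SlotState.ENABLED)])
    # Coordinated removal walks back down.
    print([s.name for s in steps_between(SlotState.ENABLED, SlotState.EMPTY)])
```

A surprise removal does not follow these steps: the hardware goes from ENABLED to EMPTY in one jump, which is the case the rest of the implementation has to handle.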
Implementation: Hotplug Framework
● Removal case: no detach after pull
○ PCIe bridge receives a PCIe hotplug interrupt indicating the slot status has changed
○ Hotplug code would request a state change for the slot to EMPTY from the interrupt path
○ State change handler fetches the state again from hardware, sees that the connection is already EMPTY, and does nothing
Implementation: Hotplug Framework
● Insertion case: detach after insertion
○ PCIe bridge receives a PCIe hotplug interrupt indicating the slot status has changed
○ Hotplug code would request a state change for the slot to PRESENT from the interrupt path
○ State change handler fetches the state again from hardware, sees that the connection state is PRESENT, and assumes it is already ENABLED
○ Then, when the state change request happens, it thinks it’s going backward (ENABLED -> PRESENT), and detaches drivers
Implementation: Hotplug Framework
● Solution: Change the state transition checks in the hotplug framework to be aware of hot removal
● Internal state of the hotplug framework can be out of sync if state changes happen through surprise removal
● When in doubt, clean up structures
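The removal case and its fix can be sketched with a toy model. Everything here is hypothetical (the `Slot` class and both handler functions are invented for illustration and are not the framework's code), and only the removal case is modeled: a handler that refreshes its state from hardware before comparing sees nothing to do, while a removal-aware handler cleans up whenever hardware reports EMPTY but a driver is still attached.

```python
ORDER = ["EMPTY", "PRESENT", "POWERED", "ENABLED"]


class Slot:
    """Toy model of a hotplug slot (hypothetical; not illumos code)."""

    def __init__(self):
        self.hardware_state = "ENABLED"  # what the hardware reports
        self.cached_state = "ENABLED"    # what the framework last recorded
        self.driver_attached = True

    def surprise_pull(self):
        # The device vanishes without the framework being told first.
        self.hardware_state = "EMPTY"


def naive_state_change(slot, target):
    """Refresh from hardware first, then act only if the state differs."""
    slot.cached_state = slot.hardware_state
    if slot.cached_state == target:
        return  # looks like a no-op, so the driver is never detached
    if ORDER.index(target) < ORDER.index(slot.cached_state):
        slot.driver_attached = False
    slot.cached_state = target


def removal_aware_state_change(slot, target):
    """Like the naive handler, but clean up if hardware is EMPTY with a driver attached."""
    slot.cached_state = slot.hardware_state
    if slot.cached_state == "EMPTY" and slot.driver_attached:
        # Internal state was out of sync with hardware: when in doubt, clean up.
        slot.driver_attached = False
    if slot.cached_state == target:
        return
    if ORDER.index(target) < ORDER.index(slot.cached_state):
        slot.driver_attached = False
    slot.cached_state = target


if __name__ == "__main__":
    for handler in (naive_state_change, removal_aware_state_change):
        slot = Slot()
        slot.surprise_pull()
        handler(slot, "EMPTY")
        print(handler.__name__, "driver_attached =", slot.driver_attached)
```

The only difference between the two handlers is the extra check after the refresh, which treats "hardware says EMPTY but a driver is attached" as evidence of a surprise removal rather than as a no-op.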
Implementation: I/O Hang
● Both dd and zpool status hung on I/O in the kernel after hot removal
● Want a way to notify the NVMe driver that its device is gone
● Solution?
○ Implement removal event callbacks using Nexus Driver Interface Events
○ Add callback to the NVMe driver to fire on removal events
○ Plumb up support in nexus driver to fire removal events
Implementation: I/O Hang
Device tree, root to leaf: rootnex -> npe -> pcieb (with pciehpc) -> nvme -> blkdev
npe: Implement bus ops for NDI events. (Events are passed up the tree until a nexus can handle them.)
nvme: Register for remove events on attach(9E). When a remove event fires, fail all outstanding commands.
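The shape of that design can be sketched as follows. This is a toy Python model under stated assumptions, not the NDI event interface: `Node`, `register`, `fire`, and `NvmeModel` are invented names, and having the handling nexus fire the event directly is a simplification.

```python
class Node:
    """A device-tree node in a toy model (hypothetical; not the NDI API)."""

    def __init__(self, name, parent=None, handles_events=False):
        self.name = name
        self.parent = parent
        self.handles_events = handles_events
        self.callbacks = {}  # event name -> callbacks registered with this nexus

    def register(self, event, callback):
        """Pass the registration up the tree until a nexus can handle it."""
        node = self.parent
        while node is not None and not node.handles_events:
            node = node.parent
        if node is None:
            raise LookupError("no nexus handles events for " + self.name)
        node.callbacks.setdefault(event, []).append(callback)
        return node

    def fire(self, event):
        """Run every callback registered for this event."""
        for callback in self.callbacks.get(event, []):
            callback()


class NvmeModel:
    """Stand-in for a driver instance with commands in flight."""

    def __init__(self, node):
        self.outstanding = ["read-1", "write-2"]
        self.failed = []
        # Analogous to registering for remove events during attach(9E).
        self.handler = node.register("remove", self.on_remove)

    def on_remove(self):
        # Fail every outstanding command so that waiters stop blocking.
        self.failed.extend((cmd, "EIO") for cmd in self.outstanding)
        self.outstanding.clear()


if __name__ == "__main__":
    rootnex = Node("rootnex")
    npe = Node("npe", rootnex, handles_events=True)
    pcieb = Node("pcieb", npe)
    nvme_node = Node("nvme", pcieb)
    driver = NvmeModel(nvme_node)
    print("registered with:", driver.handler.name)
    npe.fire("remove")
    print("failed commands:", driver.failed)
```

Failing the commands, rather than waiting for completions that will never arrive, is what lets blocked callers such as dd return with an error instead of hanging.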
Demo: Coordinated Removal
1: List all connections on the system.
# cfgadm
2: Meanwhile, trace whether the driver is detached…
# dtrace -n 'fbt::nvme_detach:entry'
3: And see how many nvme driver instances we have to start.
# mdb -k
> ::prtconf -d nvme // 4 instances
Demo: Coordinated Removal
4: Power off connection “Slot12”.
# cfgadm -c disconnect Slot12
5: Confirm with cfgadm.
# cfgadm
6: Confirm we see detach output from dtrace.
7: Confirm number of instances in mdb.
# mdb -k
> ::prtconf -d nvme // only 3 instances
Demo: Uncoordinated Removal
1: List all connections on the system.
# cfgadm
2: Meanwhile, trace whether the driver is detached…
# dtrace -n 'fbt::nvme_detach:entry'
3: And see how many nvme driver instances we have to start.
# mdb -k
> ::prtconf -d nvme // 4 instances
Demo: Uncoordinated Removal
4: Hot removal of drive.
5: Observe dtrace output.
6: Observe nvme instances in mdb.
7: Observe in cfgadm.
Demo: Uncoordinated Removal with I/O
1: Start dd command on disk.
# dd if=/dev/urandom of=/dev/rdsk/c3t1d0p0 bs=1M count=10240
2: Pull disk.
3: Observe EIO output.
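From an application's point of view, the EIO in this demo is an ordinary write error. The sketch below shows one way a writer might surface it, in Python; `write_until_error` and `make_failing_write` are invented names, the failing writer only simulates a pulled disk, and partial writes are ignored for brevity.

```python
import errno
import os


def write_until_error(fd, chunk, count, write=os.write):
    """Write `chunk` up to `count` times; return (chunks_written, errno or None)."""
    for written in range(count):
        try:
            write(fd, chunk)
        except OSError as exc:
            # Stop at the first failure and report how far we got.
            return written, exc.errno
    return count, None


def make_failing_write(fail_after):
    """Build a fake write() that raises EIO after `fail_after` successful calls."""
    calls = {"n": 0}

    def fake_write(fd, data):
        if calls["n"] >= fail_after:
            raise OSError(errno.EIO, os.strerror(errno.EIO))
        calls["n"] += 1
        return len(data)

    return fake_write


if __name__ == "__main__":
    # Simulate a disk that disappears after three successful writes.
    written, err = write_until_error(-1, b"x" * 512, 10, write=make_failing_write(3))
    print("chunks written:", written, "errno:", errno.errorcode.get(err))
```

The point is the contrast with the earlier behavior: a hung write never reaches the `except` branch, whereas a failed command gives the caller something to act on.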
Demo: Uncoordinated Removal from zpool
1: Create mirrored zpool.
# zpool create test mirror c3t1d0 c4t1d0
2: Check pool status.
# zpool status test
3: Create some files.
# cd /test; echo foo > foo; cat foo;
4: Pull disk!
Demo: Uncoordinated Removal from zpool
5: Check pool status.
# zpool status test
6: Read and write from the pool.
# cd /test; echo bar > bar; cat foo; cat bar;
7: Online device.
# cfgadm -c configure Slot12
8: Check pool status.
# zpool status test
Remaining Work
● Hot removal from zpool with ongoing I/O panics due to OS-2743
● nvme driver sometimes fails to detach after dd exits because blkdev still has references in the device tree (OS-7956)
● diskinfo sometimes does not list NVMe drives (OS-7940)
● Auto-online device on surprise insertion
● More work to improve NVMe operationally!
Questions?
