DHT2 - O Brother, Where Art Thou with Shyam Ranganathan
1. DHT2 - O Brother, Where Art Thou?
Shyamsundar Ranganathan
Developer
2. Session aims to explore...
"The hypothetical treasure at the end of the journey"
Why DHT2
"The plan..."
DHT2 design
"Known adventures along the way!"
Challenges in DHT2
"The strange characters"
Challenges because of DHT2
"Trouble escaping the chain gang!"
Where are we with DHT2
Loosely inspired by the movie: https://en.wikipedia.org/wiki/O_Brother,_Where_Art_Thou%3F
3. Why DHT2
DHT pitfalls
Directories on all subvolumes
Layout per directory
Rebalance IO path handling and nonoptimal data movement
This impacts scale and correctness!
4. Why DHT2
Correctness can be addressed in DHT,
Broader locking semantics for dentry operations
Possibly single layout adoption
But this increases complexity and could cost performance!
With DHT2 the goal is to fix all of the above, retaining or improving performance
5. DHT2 Design: The file system objects
View the file system as a collection of related objects
"wait a second... isn't that what inodes and data pointers are?"
Yes, but they are not distributed!
Directory objects denote hierarchy, storing <name, inode#> tables
File object maintains inode related metadata
Actual file data is maintained in data object(s)
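The three object types above can be sketched roughly as follows. This is only an illustration, not DHT2 code: the class names, the `resolve` helper and the sample contents are hypothetical, and the hex identifiers are borrowed from the example slides later in the deck (treating 0001 as the root is an assumption).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DirObject:
    """Directory object: a <name, inode#> table that encodes the hierarchy."""
    ino: str
    entries: Dict[str, str] = field(default_factory=dict)  # name -> inode#


@dataclass
class FileObject:
    """File object: inode-related metadata only, no file contents."""
    ino: str
    size: int = 0


@dataclass
class DataObject:
    """Data object: the actual bytes belonging to one file object."""
    ino: str
    data: bytes = b""


# Tree from the example slide: root/{Dir1/{Dir2, File2}, File1}.
# Assumption: 0001 is the root object; sample data is made up.
dirs = {
    "0001": DirObject("0001", {"Dir1": "BA11", "File1": "00EF"}),
    "BA11": DirObject("BA11", {"Dir2": "7525", "File2": "BAC5"}),
    "7525": DirObject("7525"),
}
files = {"00EF": FileObject("00EF", size=5), "BAC5": FileObject("BAC5")}
data_objects = {"00EF": DataObject("00EF", b"hello"), "BAC5": DataObject("BAC5")}


def resolve(path: str, dirs: Dict[str, DirObject], root_ino: str) -> str:
    """Walk directory objects from the root and return the final inode#."""
    ino = root_ino
    for name in (part for part in path.split("/") if part):
        ino = dirs[ino].entries[name]
    return ino
```

Only directory objects know names; once a path is resolved to an inode#, the file object and its data object are found by that identifier alone.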
6. The file system objects
(example)
Client View
. ('root')
├── Dir1
│ ├── Dir2
│ └── File2
└── File1
7. The file system objects (example)
[Diagram: the client view (root, Dir1, Dir2, File1, File2) drawn as Dir Objects and File Objects (inodes/dinode) alongside Data Objects (file data).]
The different objects, segregated by type
8. The file system objects (example)
[Diagram: the client view (root, Dir1, Dir2, File1, File2) drawn as Dir Objects and File Objects (inodes/dinode) alongside Data Objects (file data).]
Namespace hierarchy representation
9. The file system objects (example)
[Diagram: the client view (root, Dir1, Dir2, File1, File2) drawn as Dir Objects and File Objects (inodes/dinode) alongside Data Objects (file data).]
Data association
10. DHT2 Design: Distribution details
Distribute inodes using GFID in the metadata ring
No hierarchy, a directory object lives only on one subvolume
Use GFID as the data object# in the data ring
Distribution is hence not name dependent, and we just use a single layout per ring
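A minimal sketch of name-independent placement on two rings, under stated assumptions: the `Ring` class, the subvolume names and the CRC-modulo placement are hypothetical stand-ins chosen to keep the example short. The deck's own mapping (buckets encoded in the GFID) is described on slide 13.

```python
import uuid
import zlib
from typing import List


class Ring:
    """One layout for the whole ring (hypothetical: CRC32 modulo placement)."""

    def __init__(self, subvolumes: List[str]) -> None:
        self.subvolumes = subvolumes

    def locate(self, gfid: uuid.UUID) -> str:
        # Placement depends only on the GFID, never on a name or parent path.
        return self.subvolumes[zlib.crc32(gfid.bytes) % len(self.subvolumes)]


# Assumption: a small metadata ring and a larger data ring.
metadata_ring = Ring(["meta-0", "meta-1"])
data_ring = Ring([f"data-{i}" for i in range(8)])

gfid = uuid.uuid4()
meta_home = metadata_ring.locate(gfid)  # where the inode (file/dir object) lives
data_home = data_ring.locate(gfid)      # where the data object lives
```

Because the name is not an input to `locate`, renaming an entry does not change where the object lives; there is one layout per ring rather than one per directory.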
11. Distribution details (example)
[Diagram: Dir Objects and File Objects on the Metadata Ring (few bricks), Data Objects on the Data Ring (many bricks). Objects are labelled by GFID (0001, 00EF, BA11, BAC5, 7525) and directory entries become <File1, 00EF>, <Dir1, BA11>, <File2, BAC5>, <Dir2, 7525>.]
Switch names to GFID, add name to dinodes
12. Distribution details (example)
[Diagram: Dir Objects and File Objects on the Metadata Ring (few bricks), Data Objects on the Data Ring (many bricks), shown next to the client view (root, Dir1, Dir2, File1, File2). Objects are labelled by GFID (0001, 00EF, BA11, BAC5, 7525) with directory entries <File1, 00EF>, <Dir1, BA11>, <File2, BAC5>, <Dir2, 7525>.]
13. DHT2 Design: Distribution details (contd.)
Layout is based on bucket to subvolume assignment
Where, buckets >> subvolumes
Bucket ID is encoded into first n bytes of the GFID
Trivial GFID based operations
Collocates file object with parent object
File object# statically inherits parent directory# bucket ID
Optimized readdirp and lookup operations (no hopping unless non-trivially renamed, or a link file)
IOW, optimized (pGFID, basename) based operations
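The bucket scheme on this slide can be sketched as below. It is an illustration under assumptions, not the DHT2 implementation: n is taken as 2 bytes, GFIDs are treated as raw 16-byte values rather than RFC 4122 UUIDs, the round-robin bucket assignment is arbitrary, and all function names are hypothetical.

```python
import os
from typing import List

N_BUCKET_BYTES = 2                  # assumption: "first n bytes" with n = 2
N_BUCKETS = 256 ** N_BUCKET_BYTES   # 65536 buckets, far more than subvolumes
GFID_LEN = 16


def new_dir_gfid() -> bytes:
    """A new directory gets a random GFID, and with it a random bucket."""
    return os.urandom(GFID_LEN)


def new_file_gfid(parent_gfid: bytes) -> bytes:
    """A new file statically inherits the bucket ID of its parent directory."""
    return parent_gfid[:N_BUCKET_BYTES] + os.urandom(GFID_LEN - N_BUCKET_BYTES)


def bucket_of(gfid: bytes) -> int:
    """Decode the bucket ID from the leading bytes of the GFID."""
    return int.from_bytes(gfid[:N_BUCKET_BYTES], "big")


def build_layout(subvolumes: List[str]) -> List[str]:
    """Single layout: bucket -> subvolume (hypothetical round-robin)."""
    return [subvolumes[b % len(subvolumes)] for b in range(N_BUCKETS)]


def locate(gfid: bytes, layout: List[str]) -> str:
    """Finding an object's subvolume is a table lookup on its GFID."""
    return layout[bucket_of(gfid)]


layout = build_layout(["meta-0", "meta-1", "meta-2"])
parent = new_dir_gfid()
child = new_file_gfid(parent)
```

Since `child` shares its parent's bucket, `locate(child, layout)` and `locate(parent, layout)` return the same subvolume, which is what keeps lookup and readdirp on a (pGFID, basename) pair from hopping between subvolumes.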
14. Distribution details (example)
[Diagram: the Metadata Ring (few bricks) and Data Ring (many bricks) with their bricks/subvolumes drawn in, next to the client view (root, Dir1, Dir2, File1, File2). GFIDs shown: 0001, 00EF, BA11, BAC5, 7525; directory entries <File1, 00EF>, <Dir1, BA11>, <File2, BAC5>, <Dir2, 7525>.]
Add bricks/subvolumes
15. Distribution details (example)
[Diagram: the Metadata Ring (few bricks) and Data Ring (many bricks) with bricks/subvolumes carrying bucket labels 00, 75 and BA, next to the client view (root, Dir1, Dir2, File1, File2). GFIDs shown: 0001, 00EF, BA11, BAC5, 7525; directory entries <File1, 00EF>, <Dir1, BA11>, <File2, BAC5>, <Dir2, 7525>.]
Assign buckets to bricks
16. Distribution details (example)
[Diagram: the Metadata Ring (few bricks) and Data Ring (many bricks) with bricks/subvolumes carrying bucket labels 00, 75 and BA, next to the client view (root, Dir1, Dir2, File1, File2). GFIDs shown: 0001, 00EF, BA11, BAC5, 7525; directory entries <File1, 00EF>, <Dir1, BA11>, <File2, BAC5>, <Dir2, 7525>.]
Place directories based on bucket encoded in the GFID
17. Distribution details (example)
[Diagram: the Metadata Ring (few bricks) and Data Ring (many bricks) with bricks/subvolumes carrying bucket labels 00, 75 and BA, next to the client view (root, Dir1, Dir2, File1, File2). GFIDs shown: 0001, 00EF, BA11, BAC5, 7525; directory entries <File1, 00EF>, <Dir1, BA11>, <File2, BAC5>, <Dir2, 7525>.]
Colocate the files under a directory with the same bucket ID
18. DHT2 Design: Rebalance
Reassign buckets to/from newer/removed subvolumes
fix-layout is instantaneous
Files travel with directories (same bucket colocation)
Expand the cluster, but perform no rebalance
aka just add-brick and let min-free-disk+link-to do its job
This is the tough one, use layout versions/histories to pull this off?
Split DHT2 into client-server pieces
Handle IO traffic, locking during rebalance
Better consistency model for transactions
Ability to have different expansion strategies for the 2 rings
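The first point on this slide, reassigning buckets when a subvolume is added, might look like the sketch below. The `add_subvolume` function and its "move just enough buckets for an even share" policy are assumptions made for illustration; the deck does not specify how buckets are chosen.

```python
from collections import Counter
from typing import List, Tuple


def add_subvolume(layout: List[str], new_subvol: str) -> Tuple[List[str], List[int]]:
    """Return (new_layout, moved_buckets) after adding one subvolume.

    layout[b] is the subvolume that owns bucket b. Only buckets taken from
    over-full owners change hands; every other bucket keeps its owner.
    """
    n_subvols = len(set(layout)) + 1
    target = len(layout) // n_subvols      # even share per subvolume
    counts = Counter(layout)
    new_layout = list(layout)
    moved: List[int] = []
    for bucket, owner in enumerate(layout):
        if len(moved) == target:
            break
        if counts[owner] > target:
            counts[owner] -= 1
            new_layout[bucket] = new_subvol
            moved.append(bucket)
    return new_layout, moved


# 12 buckets spread over 3 subvolumes, then a 4th subvolume is added.
old_layout = [f"s{b % 3}" for b in range(12)]
new_layout, moved = add_subvolume(old_layout, "s3")
```

Updating the table is the fix-layout step and touches no objects; in this sketch only the objects in `moved` buckets would need to migrate, and each directory moves together with the files that share its bucket.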
19. Challenges in DHT2
Rename ELOOP checking requires hierarchy
Object backpointers
Time and size information should be in sync between data and metadata objects
Dirty inode, tracked via open fd
Orphan GFID cleanup
Enter transactions/journals!
Directories as files/in a DB
Reduce local FS inode proliferation
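To illustrate why rename loop checking needs hierarchy, here is a sketch that walks parent backpointers. It assumes each directory object stores a single backpointer to its parent's GFID; the `parent_of` table, the function name and the use of 0001 as root are hypothetical, and the GFIDs are taken from the earlier example slides.

```python
from typing import Dict, Optional


def would_create_loop(src_dir: str, new_parent: str,
                      parent_of: Dict[str, Optional[str]]) -> bool:
    """True if moving src_dir under new_parent would make it its own ancestor.

    Walks backpointers from new_parent up to the root. In a distributed
    setup every hop may land on a different subvolume.
    """
    cur: Optional[str] = new_parent
    while cur is not None:
        if cur == src_dir:
            return True
        cur = parent_of.get(cur)
    return False


# Backpointers for root(0001)/Dir1(BA11)/Dir2(7525); the root has no parent.
parent_of = {"0001": None, "BA11": "0001", "7525": "BA11"}

loop = would_create_loop("BA11", "7525", parent_of)   # Dir1 into Dir2
safe = would_create_loop("7525", "0001", parent_of)   # Dir2 up to root
```

Moving Dir1 under its own descendant Dir2 is detected as a loop, while moving Dir2 up to the root is not. Without backpointers, a GFID-addressed object has no way to discover its ancestors.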
20. Challenges because of DHT2
IO path cannot depend on hierarchy (Ex: quota)
Quick-read cannot fetch data in lookups
Anon-fd based operations cannot track dirty inodes
Others
Will changelog play well?
EC has to bother with only data?
Tier may need a rethink
Sharding may accrue cost of missing anon-fd and data/meta-data split of shards
Unknowns!
21. Where are we with DHT2
Introduced DHT Version 2 at the Barcelona summit, 2015
Followed up with 2 discussions upstream on core concepts [1] [2]
Followed up with a POC and some slides/documents to demonstrate the concepts [3]
In limbo since then,
But, not out of the picture yet!
Targeting an experimental release with 4.0