2.
CephFS with OpenStack Manila
based on Bluestore and Erasure Code
• POSIX-compliant shared FS
• Support kernel client and FUSE
• First Stable Release : Jewel (in April 2016)
• Multiple Active MDS : Luminous
• Directory Fragmentation
• Subtree Pinning (see the example below)
• Experimental Features
- Inline Data
- Mantle : programmable metadata load balancer
- Snapshot
- Multiple FileSystem
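A minimal sketch of subtree pinning, assuming a CephFS mount at /mnt/cephfs (hypothetical path); pinning binds a directory subtree to a specific MDS rank:
$ setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir1
A value of -1 removes the pin and returns the subtree to the normal balancer.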
3.
• OpenStack File Share Service
• Incubated project in Juno
• Core Service in Kilo
• share drivers : 20
• Create/Delete share
• Access Allow/Deny
• Quota
• Consistency Group
• Snapshot
• Share Replication
[Diagram] Components : manila-api, manila-scheduler, manila-share (cephfs-native-driver)
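A hedged sketch of the basic share lifecycle with these services (share name, size, and user are illustrative):
$ manila create cephfs 1 --name share1
$ manila access-allow share1 cephx alice
$ manila access-deny share1 {access-id}
$ manila delete share1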
4. BlueStore
• BlueStore = Block + NewStore
• Consume raw block device
• RocksDB for metadata
• Default object store since Luminous
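A minimal sketch of deploying a BlueStore OSD on a raw block device with ceph-volume (device paths are assumptions):
$ ceph-volume lvm create --bluestore --data /dev/sdb
The RocksDB metadata can be placed on a faster device with --block.db, e.g. --block.db /dev/nvme0n1.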
5.
[Diagram] Replicated : full copies of each object are stored on different OSDs. Erasure Code : data is split into K data chunks plus M coding chunks (here 4 + 2); encoding writes K + M chunks, and decoding rebuilds the data from the surviving chunks (up to M may be lost).
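A sketch of creating the 4 + 2 layout above (profile and pool names and PG counts are illustrative):
$ ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
$ ceph osd pool create ec_pool 64 64 erasure ec42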
6. Erasure Code : allow_ec_overwrites
$ ceph osd pool set ec_pool allow_ec_overwrites true
• Supported since Luminous
• RBD and CephFS on EC pools
• Can only be enabled on BlueStore OSDs
• Erasure-coded pools do not support omap
• The CephFS metadata pool must therefore be replicated
$ rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
$ ceph fs new <fs_name> <metadata> <data>
$ setfattr -n ceph.file.layout.pool -v cephfs_data file2
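Putting the commands on this slide together, a hedged sketch (filesystem, pool, and mount names are assumptions): after enabling overwrites, attach the EC pool as an additional data pool and point a directory's layout at it.
$ ceph fs add_data_pool cephfs1 ec_pool
$ setfattr -n ceph.file.layout.pool -v ec_pool /mnt/cephfs/ecdir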
7. Erasure Code : Profile
• K : Data Chunks (4)
• M : Coding Chunks (2)
• Plugin : Jerasure / ISA / Locally Repairable
• Technique : reed_sol_van / cauchy
• ruleset-failure-domain : rack / host / osd
"ISA only runs on Intel processors"
https://ceph.com/geen-categorie/benchmarking-ceph-erasure-code-plugins/
Plugin                        Jerasure                      ISA
Technique                     reed_sol_van   cauchy_good    reed_sol_van   cauchy
Encode Time (s)               1.140          1.039          0.574          0.561
Decode Time (s), 1 OSD lost   0.521          0.522          0.333          0.404
Decode Time (s), 2 OSD lost   1.416          1.113          0.557          0.547
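To inspect a profile's fields (the ec42 profile from the earlier sketch is an assumption); the output lists k, m, plugin, technique, and failure domain:
$ ceph osd erasure-code-profile get ec42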
8. Replication vs. Erasure Coding
• Replication gives better read performance.
• Erasure coding gives better write performance.
https://www.slideshare.net/JoseDeLaRosa7/ceph-perfsizingguide
• EC is only half the cost ($ per GB) of replication.
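The 1/2 figure follows from raw-capacity overhead: 3x replication writes 3 bytes of raw storage per usable byte, while a 4 + 2 EC profile writes (4 + 2) / 4 = 1.5 bytes; 1.5 / 3 = 1/2 the raw capacity, hence roughly half the cost per usable GB.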
11. Manila Configuration
$ manila type-create {fs-name} {DHSS}
$ manila type-key {fs-name} set share_backend_name='{backend name}'
manila.conf (Ceph backend section):
[cephfsnative]
share_backend_name = CEPHFSNATIVE
share_driver = manila.share.drivers.cephfs.cephfs_native.CephFSNativeDriver
driver_handles_share_servers = False
cephfs_conf_path = /etc/ceph/ceph.conf
cephfs_auth_id = manila
cephfs_cluster_name = ceph
cephfs_volume_prefix = /volumes
cephfs_pool_namespace_prefix = fsvolumens_
[Diagram] Create Type : a share type (e.g. ssd) maps to a backend (e.g. ceph_ssd).
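With the backend above, a hedged sketch of wiring a share type to it and creating a share of that type (type and share names are illustrative):
$ manila type-create cephfstype false
$ manila type-key cephfstype set share_backend_name='CEPHFSNATIVE'
$ manila create cephfs 1 --name share1 --share-type cephfstype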
13. Allowing access to shares
• Each share gets its own RADOS namespace : ceph.file.layout.pool_namespace = fsvolumes-fd43f701-e4d1-48c5-ad5a-2bb8f2daaaa0
$ manila access-allow {share-id} cephx {user-name}
• Allowing access creates a cephx identity restricted to the share's path and namespace:
[client.user-name]
key = AQA8+ANW/4ZWNRAAOtWJMFPEihBA1unFImJczA==
caps: [mds] allow rw path=/volumes/_nogroup/5e016f4f-2664-4afe-8f03-9d1afe065785
caps: [mon] allow r
caps: [osd] allow rw pool=cephfs1-data namespace=fsvolumes-fd43f701-e4d1-48c5-ad5a-2bb8f2daaaa0
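Once access is allowed, the generated key can also be fetched directly from Ceph (user name from the example above):
$ ceph auth get-key client.user-name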
14. Mount
$ mount -t ceph 10.10.10.10:6789:/volumes/_nogroup/5e016f4f-2664-4afe-8f03-9d1afe065785
/mnt -o name={user name},secret=AQA8+ANW/4ZWNRAAOtWJMFPEihBA1unFImJczA==
$ manila share-export-location-list {fs name}
+-----------------------+--------------------------------------------------------------------------+
| ID                    | Path                                                                     |
+-----------------------+--------------------------------------------------------------------------+
| {export-location-id}  | 10.10.10.10:6789:/volumes/_nogroup/5e016f4f-2664-4afe-8f03-9d1afe065785 |
+-----------------------+--------------------------------------------------------------------------+
[client.user-name]
key = AQA8+ANW/4ZWNRAAOtWJMFPEihBA1unFImJczA==
caps: [mds] allow rw path=/volumes/_nogroup/5e016f4f-2664-4afe-8f03-9d1afe065785
caps: [mon] allow r
caps: [osd] allow rw pool=cephfs1-data namespace=fsvolumes-fd43f701-e4d1-48c5-ad5a-2bb8f2daaaa0
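A hedged ceph-fuse equivalent of the kernel mount above; -r makes the share path the root of the mounted filesystem (the keyring location is assumed to be configured in ceph.conf):
$ ceph-fuse /mnt -n client.user-name -r /volumes/_nogroup/5e016f4f-2664-4afe-8f03-9d1afe065785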
19. ceph-fuse vs kernel mount
[Diagram] ceph-fuse serves the filesystem from user space; the kernel client runs in kernel space. Kernel mount : fast. ceph-fuse : supports quotas.
20. Quotas
$ setfattr -n ceph.quota.max_bytes -v {bytes} {path}
$ setfattr -n ceph.quota.max_files -v {files} {path}
• Max Bytes : ceph.quota.max_bytes
• Max Files : ceph.quota.max_files
ceph-fuse (quotas enforced)
$ ceph-fuse {mount-point}
kernel mount (quotas not enforced)
$ mount -t ceph {mon-addr}:/ {mount-point}
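A quota can be checked back with getfattr (path is illustrative):
$ getfattr -n ceph.quota.max_bytes {path}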
21. Fuse vs. Kernel
Type                               Ceph-Fuse (sec)   Kernel Mount (sec)
Small File Creation (1000 files)   17                4
Large File Creation (1 GB)         56                23
Tar Extract (files: 70k, 800 MB)   147               32
• Small File : for i in `seq 1000`; do echo hello > test${i}; done
• Large File : dd if=/dev/zero of=./1G bs=1M count=1000 oflag=direct
• Tar Extract : tar xf linux-4.15.14.tar.xz
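A sketch of reproducing such timings with time(1) (not necessarily the exact method used here):
$ time (for i in $(seq 1000); do echo hello > test${i}; done)
$ time dd if=/dev/zero of=./1G bs=1M count=1000 oflag=direct
$ time tar xf linux-4.15.14.tar.xz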
25. MDS High Availability : Hot Standby
[Diagram] MDS #1 (rank:0) and MDS #2 (rank:1) are active; a hot-standby MDS follows them, ready to take over.
• The standby daemon continuously reads the metadata journal of an up rank
• This gives it a warm metadata cache
• This speeds up failover
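A hedged ceph.conf sketch for a hot-standby (standby-replay) daemon following rank 0; the daemon name is illustrative, and these are the Luminous-era options:
[mds.c]
mds_standby_replay = true
mds_standby_for_rank = 0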
26. Multiple MDS
[Diagram] Single MDS : one daemon serves the whole tree. Multiple MDS : MDS #1, #2, #3 each serve part of the tree.
• A single MDS becomes a bottleneck
• Multiple MDS may not improve performance for every workload
• Workloads with many clients working in many separate directories benefit most
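Enabling a second active rank on a Luminous cluster (filesystem name is illustrative; allow_multimds was required at the time):
$ ceph fs set cephfs1 allow_multimds true
$ ceph fs set cephfs1 max_mds 2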