CEPH - a distributed storage and file system (spring 2012)

1. Ceph, or: the link between file systems and octopuses
   Udo Seidel, Linuxtag 2012

2. Agenda
   ● Background
   ● CephFS
   ● Ceph storage
   ● Summary
3. Ceph – what?
   ● So-called parallel distributed cluster file system
   ● Started as part of PhD studies at UCSC
   ● Publicly announced in 2006 at the 7th OSDI
   ● File system shipped with the Linux kernel since 2.6.34
   ● Name derived from a pet octopus (cephalopods)
4. Shared file systems – short intro
   ● Multiple servers access the same data
   ● Different approaches
     ● Network-based, e.g. NFS, CIFS
     ● Clustered
       – Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
       – Distributed parallel, e.g. Lustre … and Ceph
5. Ceph and storage
   ● Distributed file system => distributed storage
   ● Does not use traditional disks or RAID arrays
   ● Uses so-called OSDs
     – Object-based Storage Devices
     – "Intelligent disks"
6. Object-based storage I
   ● Objects of quite general nature
     ● Files
     ● Partitions
   ● ID for each storage object
   ● Separation of metadata operations from storing file data
   ● HA not covered at all
   ● Object-based Storage Devices
7. Object-based storage II
   ● OSD software implementation
     ● Usually an additional layer between computer and storage
     ● Presents an object-based file system to the computer
     ● Uses a "normal" file system to store data on the storage
     ● Delivered as part of Ceph
   ● File systems: Lustre, EXOFS
8. Ceph – the full architecture I
   ● 4 components
     ● Object-based Storage Devices
       – Any computer
       – Form a cluster (redundancy and load balancing)
     ● Metadata Servers
       – Any computer
       – Form a cluster (redundancy and load balancing)
     ● Cluster Monitors
       – Any computer
     ● Clients ;-)
9. Ceph – the full architecture II (diagram)
10. Ceph client view
   ● The kernel part of Ceph
   ● Unusual kernel implementation
     ● "Light" code
     ● Almost no intelligence
   ● Communication channels
     ● To the MDS for metadata operations
     ● To the OSDs to access file data
11. Ceph and OSD
   ● User-land implementation
   ● Any computer can act as an OSD
   ● Uses btrfs as native file system
     ● Since 2009
     ● Before that: self-developed EBOFS
     ● Provides functions of the OSD-2 standard
       – Copy-on-write
       – Snapshots
   ● No redundancy at the disk or even computer level
12. Ceph and OSD – file systems
   ● btrfs preferred
     ● Non-default configuration for mkfs
   ● XFS and ext4 possible
     ● XATTR size is key -> ext4 less recommended
13. OSD failure approach
   ● Any OSD is expected to fail
   ● New OSDs dynamically added/integrated
   ● Data distributed and replicated
   ● Redistribution of data after a change in the OSD landscape
14. Data distribution
   ● File striped
   ● File pieces mapped to object IDs
   ● Assignment of a so-called placement group to each object ID
     ● Via hash function
     ● Placement group (PG): logical container of storage objects
   ● Calculation of a list of OSDs from the PG
     ● CRUSH algorithm
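The pipeline above (file → stripes → object IDs → placement group) can be sketched as follows. This is an illustration, not Ceph's actual code: real Ceph uses its own hash functions and a configurable stripe layout, so the sha1-based hash, the 4 MiB stripe unit and the PG count here are assumptions.

```python
import hashlib

STRIPE_SIZE = 4 * 1024 * 1024   # assumed 4 MiB stripe unit
PG_COUNT = 128                  # assumed number of placement groups

def file_to_objects(inode: int, length: int) -> list:
    """Split a file (identified by its inode number) into fixed-size
    stripes and derive one object ID per stripe."""
    n_stripes = max(1, -(-length // STRIPE_SIZE))  # ceiling division
    return [f"{inode:x}.{i:08x}" for i in range(n_stripes)]

def object_to_pg(object_id: str) -> int:
    """Hash an object ID onto a placement group (PG)."""
    digest = hashlib.sha1(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % PG_COUNT

# A 10 MiB file becomes three objects, each landing in some PG.
objects = file_to_objects(inode=0x1234, length=10 * 1024 * 1024)
pgs = [object_to_pg(o) for o in objects]
```

The PG indirection matters: objects are never mapped to OSDs directly, so when OSDs come and go only the PG → OSD mapping changes.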
15. CRUSH I
   ● Controlled Replication Under Scalable Hashing
   ● Considers several pieces of information
     ● Cluster setup/design
     ● Actual cluster landscape/map
     ● Placement rules
   ● Pseudo-random -> quasi-statistical distribution
   ● Cannot cope with hot spots
   ● Clients, MDS and OSDs can all calculate an object's location
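A toy version of the idea behind CRUSH: placement is a deterministic function of the PG, the OSD map and the weights, so any party holding the same map computes the same answer without a central lookup table. The straw-drawing selection and sha1 hash below are simplifications of the real algorithm, not its implementation.

```python
import hashlib

def _draw(pg: int, osd: str, trial: int) -> float:
    """Deterministic pseudo-random value in [0, 1) for one (PG, OSD) pair."""
    h = hashlib.sha1(f"{pg}:{osd}:{trial}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def select_osds(pg: int, osds: dict, n: int) -> list:
    """Straw-style selection: every OSD draws a weighted pseudo-random
    'straw' for this PG and the n longest straws win.  osds maps an
    OSD name to its weight."""
    straws = {name: _draw(pg, name, 0) * weight for name, weight in osds.items()}
    return sorted(straws, key=straws.get, reverse=True)[:n]
```

Because the function is pure, a client, an MDS or an OSD given the same cluster map will all agree on where a PG lives; this is what makes the "no central directory" property on the slide possible.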
16. CRUSH II (diagram)
17. Data replication
   ● N-way replication
     ● N OSDs per placement group
     ● OSDs in different failure domains
     ● First non-failed OSD in the PG -> primary
   ● Reads and writes go to the primary only
     ● Writes forwarded by the primary to the replica OSDs
     ● Final write commit after all writes on the replica OSDs
   ● Replication traffic stays within the OSD network
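The primary-copy write path can be sketched like this (a simplified in-memory model, not the real OSD protocol):

```python
class OSD:
    """Minimal in-memory stand-in for an object storage daemon."""
    def __init__(self, name: str):
        self.name = name
        self.store = {}

    def write(self, oid: str, data: bytes) -> bool:
        self.store[oid] = data
        return True            # acknowledge the write

def replicated_write(oid: str, data: bytes, pg_osds: list) -> bool:
    """N-way primary-copy replication: the client contacts only the
    first (primary) OSD of the PG; the primary forwards the write to
    the replicas, and the commit is final once every replica acked."""
    primary, *replicas = pg_osds
    primary.write(oid, data)                       # client -> primary
    acks = [r.write(oid, data) for r in replicas]  # primary -> replicas
    return all(acks)                               # final commit
```

The client never talks to the replicas, which is why the replication traffic stays on the OSD-side network.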
18. Ceph caches
   ● Per design
     ● OSD: identical to btrfs access
     ● Client: own caching
   ● Concurrent write access
     ● Caches discarded
     ● Caching disabled -> synchronous I/O
   ● HPC extension of POSIX I/O
     ● O_LAZY
19. Metadata Servers
   ● Form a cluster
   ● Don't store any data
     ● Data stored on the OSDs
     ● Journaled writes with cross-MDS recovery
   ● Change to the MDS landscape
     ● No data movement
     ● Only management information exchanged
   ● Partitioning of the name space
     ● Overlaps on purpose
20. Dynamic subtree partitioning
   ● Weighted subtrees per MDS
   ● "Load" of the MDS re-balanced
21. Metadata management
   ● Small set of metadata
     ● No file allocation table
     ● Object names based on inode numbers
   ● MDS combines operations
     ● Single request for readdir() and stat()
     ● stat() information cached
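The combined readdir()/stat() round trip can be modelled like this (a hypothetical client model, just to show why the MDS request count stays low):

```python
class CephClientModel:
    """Hypothetical client model: readdir() does one MDS round trip
    that also fills the stat() cache, so subsequent stat() calls on
    the listed entries need no further MDS request."""
    def __init__(self, mds: dict):
        self.mds = mds            # directory path -> {entry name: attrs}
        self.attr_cache = {}
        self.mds_requests = 0

    def readdir(self, path: str) -> list:
        self.mds_requests += 1                 # single combined request
        entries = self.mds[path]
        for name, attrs in entries.items():    # piggybacked stat() data
            self.attr_cache[f"{path}/{name}"] = attrs
        return sorted(entries)

    def stat(self, path: str) -> dict:
        return self.attr_cache[path]           # served from the cache
```

A pattern like `ls -l`, which is readdir() plus one stat() per entry, thus costs a single MDS round trip instead of N+1.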
22. Ceph cluster monitors
   ● Status information of the Ceph components is critical
   ● First contact point for new clients
   ● Monitors track changes of the cluster landscape
     ● Update the cluster map
     ● Propagate the information to the OSDs
23. Ceph cluster map I
   ● Objects: computers and containers
   ● Container: bucket for computers or containers
   ● Each object has an ID and a weight
   ● Maps physical conditions
     ● Rack location
     ● Fire cells
24. Ceph cluster map II
   ● Reflects data rules
     ● Number of copies
     ● Placement of copies
   ● Updated version sent to the OSDs
     ● OSDs distribute the cluster map within the OSD cluster
     ● Each OSD re-calculates its PG membership via CRUSH
       – Data responsibilities
       – Order: primary or replica
   ● New I/O accepted after the information sync
25. Ceph – the file system part
   ● Replacement for NFS or other DFS
   ● Storage is just one part
26. Ceph – RADOS
   ● Reliable Autonomic Distributed Object Storage
   ● Direct access to the OSD cluster via librados
   ● Drops/skips the POSIX layer (CephFS) on top
   ● Visible to all Ceph cluster members => shared storage
27. RADOS Block Device
   ● RADOS storage exposed as a block device
     ● /dev/rbd
     ● qemu/KVM storage driver via librados
   ● Upstream since kernel 2.6.37
   ● Replacement for
     ● Shared-disk clustered file systems in HA environments
     ● Storage HA solutions for qemu/KVM
28. RADOS – part I (diagram)
29. RADOS Gateway
   ● RESTful API
     ● Amazon S3 -> s3 tools work
     ● SWIFT APIs
   ● Proxies HTTP to RADOS
   ● Tested with Apache and lighttpd
30. Ceph storage – all in one (diagram)
31. Ceph – first steps
   ● A few servers
     ● At least one additional disk/partition
     ● Recent Linux installed
     ● Ceph installed
     ● Trusted ssh connections
   ● Ceph configuration
     ● Each server is OSD, MDS and monitor
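The all-in-one layout above might look roughly like this in a 2012-era ceph.conf. This is a hedged sketch, not a verified configuration: hostnames, the address and the device path are placeholders, and option names follow the mkcephfs-style configuration of that time.

```ini
[global]
    auth supported = cephx

[mon.a]
    host = node1                        ; placeholder hostname
    mon addr = 192.168.0.1:6789         ; placeholder address
    mon data = /var/lib/ceph/mon.$name

[mds.a]
    host = node1

[osd.0]
    host = node1
    osd data = /var/lib/ceph/osd.$name
    osd journal = /var/lib/ceph/osd.$name/journal
    btrfs devs = /dev/sdb               ; the extra disk/partition
```

With every daemon type listed for the same host, one server plays OSD, MDS and monitor at once; further servers are added as extra [mon.X]/[mds.X]/[osd.N] sections.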
32. Summary
   ● Promising design/approach
   ● High degree of parallelism
   ● Still experimental status -> limited recommendation for production
   ● Big installations?
     ● Back-end file system
     ● Number of components
     ● Layout
33. References
   ● @ceph-devel
34. Thank you!