Cloud Service Models Overview
• What if you want to have an IT department?
– Similar to building a new house in the previous analogy
• You can rent virtualized infrastructure and build up your own
IT system on those resources, which you fully control.
• Technically speaking, this uses the Infrastructure as a Service (IaaS) solution.
– Similar to buying an empty house in the previous analogy
• You can develop your IT system directly on a cloud
platform, without caring about any lower-level resources.
• Technically speaking, this uses the Platform as a Service (PaaS) solution.
– Similar to living in a hotel in the previous analogy
• You can directly use existing IT system solutions provided
by a cloud application service provider, without knowing any
technical details about how these services are implemented.
• Technically speaking, this uses the Software as a Service (SaaS) solution.
From IaaS to PaaS
– Ceph is a free software distributed file system.
– Ceph's main goals are to be POSIX-compatible, and
completely distributed without a single point of failure.
– The data is seamlessly replicated, making it fault tolerant.
– On July 3, 2012, the Ceph development team released
Argonaut, the first release of Ceph with long-term support.
– Ceph is a distributed file system that provides
excellent performance, reliability, and scalability.
– Object-based storage.
– Ceph separates data and metadata operations by
eliminating file allocation tables and replacing them
with generating functions.
– Ceph utilizes a highly adaptive distributed metadata
cluster, improving scalability.
– Clients use OSDs to access data directly, giving high performance.
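For illustration, a minimal sketch of this direct object access using the stock rados CLI (the pool, object, and file names are placeholders, not from the slides):
rados -p data put my-object ./myfile.txt    # write an object straight into the OSD cluster
rados -p data ls                            # list the objects in the pool
rados -p data get my-object ./copy.txt      # read it back; no metadata server is involved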
• Storage capacity, throughput, and client performance, with an
emphasis on HPC.
• Failures are the norm rather than the exception, so the
system must have fault detection and recovery mechanisms.
• Dynamic workloads → load balancing.
• Ceph Filesystem
• File based
• Ceph Block Device
• Block based
• Ceph Object Gateway
– Swift / S3 RESTful API
• Object based
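As a hedged sketch, one typical command per access path (the image, pool, and user names are placeholders; mounting the CephFS interface is shown later in the mount section):
rbd create rbd/vm-disk --size 10240                                # block based: 10 GB image in the 'rbd' pool
rbd map rbd/vm-disk                                                # expose it as a /dev/rbd* device (needs the rbd kernel module)
radosgw-admin user create --uid=demo --display-name="Demo User"   # object based: create a user for the Swift / S3 RESTful API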
• Three main components
– Clients: Near-POSIX file system interface.
– Cluster of OSDs: Stores all data and metadata.
– Metadata cluster: Manages the namespace (file names).
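On a running cluster, the stock ceph CLI shows each of these roles (a sketch, assuming the admin keyring is available on this host):
ceph -s          # overall status: monitors, MDS, OSDs, placement groups
ceph osd tree    # the OSD cluster holding all data and metadata objects
ceph mds stat    # the metadata server cluster managing the namespace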
Three Fundamental Design Features
1. Separating Data and Metadata
– Separation of file metadata management from the
storage of file data.
– Metadata operations are collectively managed by
a metadata server cluster.
– Users can access OSDs directly to get data by using CRUSH.
– Ceph removes data allocation lists entirely.
– CRUSH assigns objects to storage devices.
Separating Data and Metadata
• Ceph separates data and metadata operations.
• Data Distribution with CRUSH
– To avoid imbalance (idle or empty OSDs) or
load asymmetries (hot data on a new device)
→ distribute new data randomly.
– Ceph maps objects into placement groups (PGs);
PGs are assigned to OSDs by CRUSH.
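A small sketch of the object → PG → OSD mapping with the standard tools (the pool name, object name, and PG count are placeholder values):
ceph osd pool create demo-pool 128    # create a pool with 128 placement groups
ceph osd map demo-pool some-object    # show which PG 'some-object' hashes into and which OSDs CRUSH assigns to that PG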
2. Dynamic Distributed Metadata Management
Ceph utilizes a metadata cluster architecture based on Dynamic
Subtree Partitioning (workload balance).
– Dynamic Subtree Partitioning
• Most FS, use static subtree partitioning
→imbalance workloads and easy hash function.
• Ceph’s MDS cluster is based on a dynamic subtree
partitioning. →balance workloads
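A couple of stock commands expose how many MDS ranks are sharing the namespace on the Ceph versions contemporary with these notes (a sketch; the subtree assignment itself is handled internally by the MDS daemons):
ceph mds stat    # which MDS daemons are active and which are standby
ceph mds dump    # the full MDS map for the cluster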
• Client Operation
– File I/O and Capabilities
• The client sends an open request to the MDS; the MDS checks the
request (permissions, size, …) and, if OK, returns the inode number,
file size, and the striping strategy used to map file data into objects.
• Client Synchronization
– If multiple clients (readers and writers) use the same
file, the MDS revokes any previously issued read and write
capabilities, so each I/O must be acknowledged by the OSDs.
• Traditional: update serialization → bad performance.
• Ceph: uses the HPC (high-performance computing) community's
POSIX I/O extensions, so clients can read and write different
parts of the same file.
• Dynamically Distributed Metadata
– MDSs use journaling
• Repetitive metadata updates handled in memory.
• Optimizes on-disk layout for read access.
– Each MDS has its own journal; when an MDS fails, another
node can quickly recover using that journal.
– Inodes are embedded directly within directories.
– Each directory’s content is written to the OSD
cluster using the same striping and distribution
strategy as metadata journals and file data.
– Data is replicated in terms of PGs.
– Clients send all writes to the first non-failed OSD
in an object’s PG (the primary), which assigns a
new version number for the object and PG and
forwards the write to any additional replica OSDs.
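The replication factor behind this primary-copy scheme is a per-pool setting; a minimal sketch (the pool name is a placeholder):
ceph osd pool set demo-pool size 3    # keep three replicas of every object in this pool
ceph osd pool get demo-pool size      # check the current replication factor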
• Failure detection
– When an OSD does not respond → it is marked “down”.
– Its role passes to the next OSD in the PG.
– If the OSD does not recover → it is marked “out”.
– Another OSD joins in its place.
• Recovery and Cluster Updates
– If OSD1 crashes → it is marked “down”.
– OSD2 takes over as primary.
– If OSD1 recovers → it is marked “up”.
– OSD2 receives the update request and sends the new
version of the data to OSD1.
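These down/out/up transitions can also be driven or watched by hand, e.g. (osd.1 is just an example id):
ceph osd down 1    # mark osd.1 down; its PGs fail over to the next OSD
ceph osd out 1     # mark it out; its data is re-replicated elsewhere
ceph osd in 1      # bring it back once recovered; peers push it the newer object versions
ceph -w            # watch the cluster log while recovery proceeds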
• Still under heavy development
• Monitors waste CPU
• Recovery can end in an inconsistent state
• Bugs in file-extend behavior
– Qcow2 images get I/O errors in the VM's kernel,
» but everything looks fine in Ceph's logs.
• Correct the time
• OSDs waste CPU
– ntpdate tock.stdtime.gov.tw
• health HEALTH_WARN clock skew detected on mon.1
– ntpdate tock.stdtime.gov.tw
• CephFS is not stable
– Newer systems can use Ceph RBD
– Traditional systems can only use the POSIX interface (CephFS)
– Operations in a folder can be frozen,
» if that folder is under heavy load.
– Bugs in file-extend behavior
• Mount Ceph with
– Kernel module
• mount -t ceph …
– FUSE client
• ceph-fuse -c /etc/ceph/ceph.conf …
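Filled-in examples of both mount paths, as a sketch (the monitor address, secret file, and mount point are placeholders):
mount -t ceph 10.0.0.1:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client
ceph-fuse -c /etc/ceph/ceph.conf -m 10.0.0.1:6789 /mnt/ceph                               # FUSE client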
root@SSCloud-01:/# cephfs /mnt/dev set_layout -p 5
cephfs is not a super-friendly tool right now — sorry! :(
I believe you will find it works correctly if you specify all the layout parameters,
not just one of them.
root@SSCloud-01:/# cephfs -h
not enough parameters!
usage: cephfs path command [options]*
show_layout -- view the layout information on a file or dir
set_layout -- set the layout on an empty file, or the default layout on a directory
show_location -- view the location information on a file
map -- display file objects, pgs, osds
Useful for setting layouts:
--stripe_unit, -u: set the size of each stripe
--stripe_count, -c: set the number of objects to stripe across
--object_size, -s: set the size of the objects to stripe across
--pool, -p: set the pool to use
Useful for getting location data:
--offset, -l: the offset to retrieve location data for
root@SSCloud-01:/# cephfs /mnt/dev set_layout -u 4194304 -c 1 -s 4194304 -p 5
root@SSCloud-01:/# cephfs /mnt/dev show_layout
• There are three types of storage in IaaS
– File-based, block-based, object-based
• Ceph is a good choice for IaaS
– OpenStack stores images in the Ceph Block Device
– Cinder or nova-volume can boot a VM
• using a copy-on-write clone of an image (see the sketch below)
• CephFS is still under heavy development
– However, newer versions are better.
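A sketch of the copy-on-write clone workflow with the rbd tool (the pool, image, and snapshot names are placeholders; OpenStack's drivers make the equivalent calls through librbd):
rbd create images/ubuntu-base --size 10240 --image-format 2    # format-2 images support layering/cloning
rbd snap create images/ubuntu-base@gold                        # snapshot the golden image
rbd snap protect images/ubuntu-base@gold                       # a snapshot must be protected before cloning
rbd clone images/ubuntu-base@gold volumes/vm-01-disk           # copy-on-write clone used as the VM's boot disk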