Quick-and-Easy Deployment of a Ceph Storage Cluster

Quick & Easy Deployment of
a Ceph Storage Cluster with
SUSE Enterprise Storage
Lars Marowsky-Brée
Distinguished Engineer, Architect Storage & HA
lmb@suse.com

2
Agenda
• Introduction
• Setting the stage
• What is Ceph, and how does it work?
• Designing Ceph clusters
• Deploying Ceph
• Q & A

Software-defined storage market

4
2014: $1.5B - 2017: $2.8B - 22.5% CAGR (IDC)
By 2018, open-source storage will gain 20% of the market share, up from
less than 1% in 2013 (Gartner)

5
Enterprise Data Storage: Market Segments

6
Enterprise Data Capacity Utilization
50-60%
of Enterprise Data
20-25%
15-20%
1-3%
Tier 0
Ultra High
Performance
Tier 1
High-value, OLTP,
Revenue Generating
Tier 2
Backup/Recovery,
Reference Data,
Bulk Data
Tier 3
Object, Archive,
Compliance Archive,
Long-term Retention
Source: Horison Information Strategies - Fred Moore

7
LOW
FUNCTIONALITY
HIGH
FUNCTIONALITY
Object
Storage
Archive
Storage
Data
Backup
Video
Audio
Big
Data
Bulk
Storage
Data
Analytics
OLTP
CRM
ERP
Email
HPC
Compliance
Archive
Enterprise Data Storage: Use Cases
CAPACITY
OPTIMIZED
PERFORMANCE
OPTIMIZED
VM-Aware
DR
Target

8
Ceph at SUSE
• 20+ years success with Linux & Enterprise storage
‒ Ceph as technology preview in SUSE Cloud 3
‒ General support with SUSE Cloud 4
‒ Dedicated storage product since February 2015
• Fast growing team, still hiring!
• Focus areas:
‒ Enterprise-readiness
‒ Ease of use
‒ Stabilization and integration

9
LOW
FUNCTIONALITY
HIGH
FUNCTIONALITY
Object
Storage
Archive
Storage
Data
Backup
Video
Audio
Big
Data
Bulk
Storage
Data
Analytics
OLTP
CRM
ERP
Email
HPC
Compliance
Archive
SUSE Enterprise Storage: Initial Target Market
CAPACITY
OPTIMIZED
PERFORMANCE
OPTIMIZED
VM-Aware
DR
Target
Initial Target Market

10
SUSE Enterprise Storage Pricing Model
SES Base Configuration – Priority Subscription
●
SES and limited use of SUSE Linux Enterprise Server to provide:
●
4 SES storage OSD nodes (1-2 sockets)
●
3 -5 SES MON nodes (customer selects redundancy level)
●
1 SES management node
SES Expansion Node – Priority Subscription

12
From 10,000 Meters
[1] http://www.openstack.org/blog/2013/11/openstack-user-survey-statistics-november-2013/
• Open Source Distributed Storage solution
• Most popular choice of distributed storage for
OpenStack
[1]
• Lots of goodies
‒ Distributed Object Storage
‒ Redundancy
‒ Efficient Scale-Out
‒ Extensible
‒ Can be built on commodity hardware
‒ Lower operational cost

13
From 1,000 Meters
Object Storage
(Like Amazon S3)
Block Device File System
Unified Data Handling for 3 Purposes
●RESTful Interface
●S3 and SWIFT APIs
●Block devices
●Up to 16 EiB
●Thin Provisioning
●Snapshots
●POSIX Compliant
●Separate Data and
Metadata
●For use e.g. with
Hadoop
Autonomous, Redundant Storage Cluster

14
Component Names
radosgw
Object Storage
RBD
Block Device
Ceph FS
File System
RADOS
librados
Direct
Application
Access to
RADOS

15
For a Moment, Zooming to Atom Level
FS
Disk
OSD Object Storage Daemon
File System (btrfs, xfs)
Physical Disk
● OSDs serve storage objects to clients
● Peer to perform replication and recovery

16
Put Several of These in One Node
FS
Disk
OSD
FS
Disk
OSD
FS
Disk
OSD
FS
Disk
OSD
FS
Disk
OSD
FS
Disk
OSD

17
Mix In a Few Monitor Nodes
M • Monitors are the brain cells of the cluster
‒ Cluster Membership
‒ Consensus for Distributed Decision Making
• Do not serve stored objects to clients

18
Voilà, a Small RADOS Cluster
M M
M

19
Application Access
M M
M
Application
librados
Application
librados
radosgw
Socket
Socket
REST

20
Linux Host
RADOS Block Device
M M
M
M
krbd librados

21
RADOS Block Device - VMs
M M
M
M
qemu-rbd librados
VM
KVM

22
Several Ingredients
• Basic Idea
‒ Coarse grained partitioning of storage supports policy based
mapping (don't put all copies of my data in one rack)
‒ Topology map and Rules allow clients to “compute” the exact
location of any storage object
‒ CRUSH: deterministic decentralized placement algorithm
• Three conceptual components
‒ Objects / images
‒ Pools
‒ Placement groups

23
Pools
• A pool is a logical container for storage objects
• A pool has a set of parameters
‒ a name
‒ a numerical ID (internal to RADOS)
‒ number of replicas
‒ number of placement groups
‒ placement rule set
‒ owner
• Pools support certain operations
‒ create/remove/read/write entire objects
‒ snapshot of the entire pool

24
Placement Groups
• Placement groups help balance data across OSDs
• Consider a pool named “swimmingpool”
‒ with a pool ID of 38 and 8192 placement groups (PGs)
• Consider object “rubberduck” in “swimmingpool”
‒ hash(“rubberduck”) % 8192 = 0xb0b
‒ The resulting PG is 38.b0b
• One PG typically spans several OSDs
‒ for balancing
‒ for replication
• One OSD typically serves many PGs

25
CRUSH
• CRUSH uses a map of all OSDs in your
cluster
‒ includes physical topology, like row, rack, host
‒ includes rules describing which OSDs to
consider for what type of pool/PG
• This map is maintained by the monitor
nodes
‒ Monitor nodes use standard cluster algorithms
for consensus building, etc

27
CRUSH in Action: Writing
M M
M
M
38.b0b
swimmingpool/rubberduck
Writes go to one
OSD, which then
propagates the
changes to other
replicas

29
Self-Healing
M M
M
M
38.b0b
Monitors detect a
dead OSD

30
Self-Healing
M M
M
M
38.b0b
Monitors allocate
other OSDs to
PG and update
CRUSH map

31
Self-Healing
M M
M
M
38.b0b
Monitors initiate
recovery

32
Self-Healing
M M
M
M
38.b0b
Future writes
update the new
replica

33
Redundancy Choices
• Replication:
‒ n number of exact, full-size
copies
‒ Potentially increased read
performance due to striping
‒ More copies lower
throughput, increase latency
‒ Increased cluster network
utilization for writes
‒ Rebuilds can leverage
multiple sources
‒ Significant capacity impact
• Erasure coding:
‒ Data split into k parts plus m
redundancy codes
‒ Better space efficiency
‒ Higher CPU overhead
‒ Significant CPU and cluster
network impact, especially
during rebuild
‒ Cannot directly be used with
block devices (see next
slide)
‒ Plenty of algorithms to
choose from

34
Cache Tiering
• Multi-tier storage architecture:
‒ Pool acts as a transparent write-back overlay for another
‒ e.g., SSD 3-way replication over HDDs with erasure coding
‒ Additional configuration complexity and requires workload-
specific tuning
‒ Also available: read-only mode (no write acceleration)
‒ Some downsides (not all features supported), memory
consumption for HitSet
• A good way to combine the advantages of replication
and erasure coding

Software Defined Storage and
Performance

36
Legacy Storage Arrays
• Limits:
‒ Tightly controlled
environment
‒ Limited scalability
‒ Few options
‒ Only certain approved drives
‒ Constrained number of disk
slots
‒ Few memory variations
‒ Only very few networking
choices
‒ Typically fixed controller and
CPU
• Benefits:
‒ Reasonably easy to
understand
‒ Long-term experience and
“gut instincts”
‒ Somewhat deterministic
behavior and pricing

37
Software Defined Storage (SDS)
• Limits:
‒ ?
‒ (Essentially, all limits are
merely bugs.)
• Benefits:
‒ Infinite scalability
‒ Infinite adaptability
‒ Infinite choices
‒ Infinite flexibility

38
Properties of a SDS System
• Throughput
• Latency
• IOPS
• Single client or aggregate
• Availability
• Reliability
• Safety
• Cost
• Adaptability
• Flexibility
• Capacity
• Density

39
Architecting a SDS system
• These goals often conflict:
‒ Availability versus Density
‒ IOPS versus Density
‒ Latency versus Throughput
‒ “Performance” during rebuild
‒ Everything versus Cost
• Many hardware options
• Software topology offers many configuration choices
• There is no one size fits all

40
Designing a Ceph cluster
• Understand your workload
• Make a best guess, based on the desirable properties
and factors
• Build a 1-10% pilot / proof of concept
‒ Preferably using loaner hardware from a vendor to avoid early
commitment.
• Refine and tune this until desired performance is
achieved
• Scale up from there, re-tuning as you go

41
Measuring Ceph performance
• rados bench
‒ Measures backend performance of the RADOS store
• rados load-gen
‒ Generate configurable load on the cluster
• ceph tell osd.XX
• fio rbd backend
‒ Swiss army knife of IO benchmarking on Linux
‒ Can also compare in-kernel rbd with user-space librados
• rest-bench
‒ Measures S3/radosgw performance

43
Assumptions
• You have three nodes:
‒ node1, node2, node3
‒ These nodes will be both OSDs and MONitor nodes
‒ Each of these has /dev/vdb as a disk to be used for ceph
‒ admin
‒ A node from which you want to run calamari and host the web UI

44
Prepare the nodes
• Setup NTP
• Create a ceph user on each
‒ Setup sudo for the ceph user
• Enable sshd and allow all nodes to login to each other
‒ ssh-copy-id
• Install SUSE Linux Enterprise Server 12 and the
SUSE Enterprise Storage 1.0 products on each
• Apply all available online updates

45
Deploy ceph!
• su - ceph
• ceph-deploy new node1 node2 node3
• ceph-deploy mon create-initial
• ceph-deploy osd create node1:vdb node2:vdb
node3:vdb
• Sorry, we’re done ;-)

46
Deploy the calamari web frontend
• On the admin node:
• zypper in calamari-clients
• calamari-ctl initialize
• Open http://admin.node/
• ceph-deploy calamari –master admin –connect node1
node2 node3

50
Summary: Why Ceph?
• Can be scaled arbitrarily
‒ No central bottleneck “master” servers
• Low operational cost
‒ Automation
‒ Commodity hardware
• Can move data close to application
• Redundancy through data replication
‒ Self-healing
‒ No need for RAID
• APIs and cloud integration for self-service
‒ Software defined storage

Thank you.
51
Questions and Answers?

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany
+49 911 740 53 0 (Worldwide)
www.suse.com
Join us on:
www.opensuse.org
52

Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC.
Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of
their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated,
abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE.
Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a
product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making
purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document,
and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The
development, release, and timing of features or functionality described for SUSE products remains at the sole
discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at
any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in
this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All
third-party trademarks are the property of their respective owners.

Quick-and-Easy Deployment of a Ceph Storage Cluster

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to Quick-and-Easy Deployment of a Ceph Storage Cluster

Similar to Quick-and-Easy Deployment of a Ceph Storage Cluster (20)

Recently uploaded

Recently uploaded (20)

Quick-and-Easy Deployment of a Ceph Storage Cluster