OVERVIEW, EXPERIENCES & OUTLOOK
1
Danny Al-Gaaf
Linux-Stammtisch Munich, 24.11.2015
OUTLINE
● Architecture
○ Ceph basics
○ Ceph and OpenStack
● Experiences
○ Support
○ Performance
○ Security
○ HA
● Outlook
○ Hardware
○ Features
● Conclusions
2
ARCHITECTURE
CEPH - BASICS
WHY SOFTWARE DEFINED STORAGE?
4
Proprietary hardware → Common, off-the-shelf hardware
Low cost, standardized supply chain
Scale-Up architecture → Scale-Out architecture
Increase operational flexibility
Hardware-based intelligence → Software-based intelligence
More programmability, agility, and control
Closed development process → Open development process
More flexible, well integrated technology
OVERVIEW
5
● LGPL v2.1
● Object based storage
● Scale horizontally
● Terabytes to exabytes
● Fault tolerant: no SPoF (single point of failure)
● Hardware agnostic
● Commodity Hardware
○ Ethernet (1, 10, 40 GbE …)
○ Fibre Channel
○ SATA/SAS, HDD/SSD, …
● Self-managed
● Community:
○ 635 developers
○ 116 companies
RADOS
6
RADOS COMPONENTS
● OSDs
○ 10s - 1000s per cluster
○ One per device (HDD/SSD/RAID group, SAN …)
○ Store objects
○ Handle replication and recovery
○ Flexible backend
■ FileStore (XFS, btrfs, …)
■ KeyValueStore (levelDB, rocksDB, libkinetic)
■ ErasureCoding (ISA, SHEC, Jerasure, LRC)
○ Cluster and public networks
● MONs
○ Maintain cluster membership and states
○ Use PAXOS protocol to establish quorum consensus
○ Small, lightweight
○ Odd number (to maintain a majority quorum)
7
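A quick way to inspect these components on a running cluster (a minimal sketch, assuming the default cluster name and an admin keyring):
ceph -s               # overall health, MON quorum and PG states
ceph osd tree         # OSDs and their place in the CRUSH hierarchy
ceph quorum_status    # which MONs currently form the PAXOS quorum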
DATA DISTRIBUTION - CRUSH
8
CRUSH - QUICK CALCULATION
9
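The same calculation can be reproduced from the CLI; a small sketch, assuming a pool named rbd and an arbitrary object name:
ceph osd map rbd my-object    # prints the PG the object hashes to and the set of OSDs it maps to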
CRUSH - PLACEMENT
10
● Each PG maps independently to a pseudo-random set of OSDs
● PGs that map to the same OSD generally have their other replicas on different OSDs
● In case of an OSD failure:
○ CRUSH provides another OSD target
○ Each PG on the failed OSD will be re-replicated by a different OSD
○ Highly parallel recovery and rebalancing
CRUSH - PLACEMENT
11
Controlled Replication Under Scalable Hashing
● Pseudo-random data placement function
○ Fast calculation - O(log n)
○ No lists or tables, no lookups
○ Repeatable and deterministic
● Uniform, weighted distribution
● Stable mapping
○ Predictable, bounded migration of data on changes (e.g. adding/removing nodes)
● Rule-based configuration
○ Aware of physical infrastructure topology
■ OSDs in hosts/racks/rows/fire compartments/data center …
○ Adjustable replication
○ Adjustable weighting
● Calculated by clients
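A hedged sketch of how the topology-aware rules can be inspected and extended (file, bucket and rule names are examples):
ceph osd getcrushmap -o crushmap.bin              # export the current CRUSH map
crushtool -d crushmap.bin -o crushmap.txt         # decompile it for inspection or editing
ceph osd crush rule create-simple rack-rule default rack   # simple rule: spread replicas across racks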
LOGICAL SEPARATION - POOLS
12
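Pools are created and tuned from the CLI; a sketch with illustrative pool name, PG count, replica count and rule id:
ceph osd pool create volumes 128 128 replicated   # pool with 128 placement groups
ceph osd pool set volumes size 3                  # keep three replicas
ceph osd pool set volumes crush_ruleset 1         # bind the pool to a specific CRUSH rule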
CLIENT DATA HANDLING
● Client
○ Calculates placement with CRUSH and writes only to the primary OSD
● OSDs
○ Handle data replication, rebalancing, and recovery based on CRUSH
13
REPLICATION VS ERASURE CODING
Full copies of stored objects
● Very high durability
● 3x (200% overhead)
● Quicker recovery
14
One copy plus parity
● Cost-effective durability
● 1.5x (50% overhead)
● Expensive recovery
● RBD/CephFS would require cache tiering
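For the erasure-coding case, a hedged sketch of an EC pool with a replicated cache tier in front of it (pool names and k/m values are examples):
ceph osd erasure-code-profile set ec-profile k=2 m=1 plugin=jerasure
ceph osd pool create ecpool 128 128 erasure ec-profile
ceph osd pool create cachepool 128 128 replicated
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool        # clients now go through the cache tier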
RADOS BLOCK DEVICE
15
RBD FEATURES
16
● Provides a block interface to Ceph
● Stripes images across the entire pool/cluster
● Snapshots (read-only)
● Copy-on-write clones
● Integration
○ Qemu/KVM
○ Xen
○ LXC
○ Linux kernel (rbd.ko)
○ iSCSI (through a gateway), e.g. for VMware or Windows
○ OpenStack, CloudStack, OpenNebula, …
● Incremental backup
● Client-side caching
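A sketch of the snapshot, clone and incremental-backup workflow listed above (pool, image and snapshot names are assumptions):
rbd create volumes/base --size 10240 --image-format 2    # 10 GB image, format 2 needed for cloning
rbd snap create volumes/base@snap1
rbd snap protect volumes/base@snap1                      # snapshots must be protected before cloning
rbd clone volumes/base@snap1 volumes/vm-disk             # copy-on-write clone
rbd export-diff volumes/base@snap1 base-snap1.diff       # full export up to snap1
rbd snap create volumes/base@snap2
rbd export-diff --from-snap snap1 volumes/base@snap2 base-snap1-2.diff   # incremental backup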
RBD - VIRTUAL DISKS
17
RBD - KERNEL MODULE
18
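Mapping an image through rbd.ko looks roughly like this (pool, image and mount point are examples; supported image features depend on the kernel version):
rbd create volumes/kernel-test --size 10240
modprobe rbd
rbd map volumes/kernel-test --id admin        # creates /dev/rbdX plus /dev/rbd/volumes/kernel-test
mkfs.xfs /dev/rbd/volumes/kernel-test
mount /dev/rbd/volumes/kernel-test /mnt
rbd showmapped                                # list currently mapped images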
THE RADOS GATEWAY
19
THE RADOS GATEWAY
20
THE RADOS GATEWAY
● RESTful object storage proxy
○ Store objects via RADOS
○ Stripes large RESTful objects across many RADOS objects
○ API supports buckets, accounts
○ Usage accounting
● Interfaces compatible with:
○ Amazon S3
○ Swift
● Federated RGW
○ Zones, regions topology
○ Global bucket and user namespace
○ Asynchronously replicated across data centers
○ Read affinity
■ Serve local data from local/closest DC
21
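Users and their S3/Swift credentials are managed with radosgw-admin; a sketch with example uid and gateway host:
radosgw-admin user create --uid=demo --display-name="Demo User"              # prints S3 access/secret keys
radosgw-admin subuser create --uid=demo --subuser=demo:swift --access=full
radosgw-admin key create --subuser=demo:swift --key-type=swift --gen-secret
swift -A http://rgw.example.com/auth/1.0 -U demo:swift -K <swift_key> stat   # quick Swift-API smoke test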
CEPHFS
22
CEPHFS
23
META DATA SERVER
● MDS
○ Manages only the metadata for a POSIX-compliant shared file system
■ Directory hierarchy
■ File metadata (owner, timestamps, mode, …)
○ Clients stripe file data in RADOS
■ MDS is not in the data path
○ MDS stores metadata in RADOS
○ Dynamic MDS cluster scales to 10s or 100s of daemons
● CephFS
○ Linux kernel module or fuse driver
○ Not production ready !!!
24
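For testing only (not production, as noted above), a CephFS can be created and mounted roughly like this (pool names, MON address and secret file are assumptions):
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 128
ceph fs new cephfs cephfs_metadata cephfs_data
mount -t ceph 192.168.10.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client
ceph-fuse /mnt/cephfs                                                                           # or FUSE client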
ARCHITECTURE
OPENSTACK
ARCHITECTURE: CEPH AND OPENSTACK
26
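The glue between the two are usually dedicated cephx clients per OpenStack service; a sketch roughly following the upstream Ceph/OpenStack integration docs of that time (the pool names are the conventional ones and may differ):
ceph osd pool create volumes 128
ceph osd pool create images 128
ceph osd pool create vms 128
ceph auth get-or-create client.glance mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'
ceph auth get-or-create client.cinder mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rx pool=images'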
EXPERIENCES
SUPPORT
SUPPORT
● Who claims to provide support?
○ Red Hat (Red Hat Storage)
○ SUSE (SUSE Enterprise Storage)
○ Canonical
○ Mirantis
● What is required for support?
○ Automation
○ Support organisation and processes
○ Knowledge and experience of the technology
■ Ceph, filesystems, kernel, KVM, ...
○ Developers
○ Active upstream contributions
● Consider which distribution you already know and use!
28
SUPPORT
Ceph contributions
29
# company commits authors
1 Red Hat / Inktank 57147 71 / 42 (overlap)
2 Dreamhost 15774 26
3 Intel 1613 11
4 SUSE 1591 18
5 Deutsche Telekom 1480 2
13 Mirantis 283 9
37 Canonical 25 4
Source: metrics.ceph.com (20151121)
Kernel contributions
# company changesets authors
1 Intel 37602 733
2 Red Hat 31490 333
3 SUSE 18018 132
80 Canonical 187 22
173 Mirantis 15 1
Source: linux.git + gitdm (20151121)
EXPERIENCES
PERFORMANCE
PERFORMANCE
● Performance depends on
○ Network
○ OSD drives
○ Journal drives
○ CPU, memory
○ Ceph backend
● What performance do you need?
● What is your use-case and load profile?
31
or:
DO YOU LIKE BUG REPORTS?
32
The customer complained about bad performance on a new production system:
# Current live env:
root@webserver01:~# dd if=/dev/zero of=test.bin bs=1k count=1M
1073741824 bytes (1.1 GB) copied, 11.4649 s, 93.7 MB/s
# Staging env:
root@webserver01:~# dd if=/dev/zero of=test.bin bs=1k count=1M
1073741824 bytes (1.1 GB) copied, 22.4607 s, 47.8 MB/s
... Ceph rbd disk shows bad 4k performance with synthetic MongoDB perf tests ...
HOW TO IDENTIFY BOTTLENECKS ?
33
GET THE RIGHT TOOL
● Ceph’s own performance tools are not fully reliable for deep analysis
● Needed a tool to analyse all layers: rbd.ko, librbd, block devices and files
○ Don’t reinvent the wheel !
○ There is a “swiss army knife” - fio
○ Contributed RBD support for fio upstream
○ Other helpers: blktrace, systemtap
34
“Get Jens' FIO code. It does things right, including writing
actual pseudo-random contents, which shows if the disk
does some de-duplication (aka optimize for
benchmarks): [...] Anything else is suspect - forget about
bonnie or other traditional tools.”
Linus Torvalds
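A minimal fio job against librbd, assuming a pool rbd, a test image fio-test and the admin client; the numbers are purely illustrative:
rbd create rbd/fio-test --size 10240
fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=fio-test --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=1 --runtime=60 --time_based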
PERFORMANCE ANALYSIS - SYMPTOMS
35
ANALYSE THE DISK
36
MITIGATE
● Make use of the RAID controller
○ RAID 0 out of multiple disks
○ Use the RAID write-back cache
● Always move journals to separate drives
● Consider scaling out your cluster
● Carefully design your hardware
○ commodity != consumer grade
● Users must understand your platform
○ e.g. don’t run replicated databases like MongoDB on replicated storage
37
STORAGE HARDWARE RECOMMENDATIONS
● Hardware recommendations in Ceph documentation good starting point
● Network
○ Ethernet (1/10/40 GbE)
○ Multiple NICs and/or ports
○ Bonded NICs
○ Jumbo frames
○ Use the fastest network you can afford
● CPU
○ x86_64 or 64bit ARM
○ Consider at least one dedicated physical core per OSD
● Memory
○ Recommendation: 1 GByte of RAM per TByte of storage
○ More memory doesn’t hurt!
38
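Jumbo frames only help if they work end-to-end; a quick way to set and verify the MTU on a node (interface name and peer address are examples):
ip link set dev eth0 mtu 9000
ping -M do -s 8972 -c 3 192.168.10.2    # 8972 = 9000 minus 20 byte IP and 8 byte ICMP header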
STORAGE HARDWARE RECOMMENDATIONS
● Storage
○ One disk per OSD
○ Journals on separate disk, SSD highly recommended
○ Use enterprise devices !
○ If affordable, prefer SAS over SATA
■ e.g. unrecoverable silent data corruption on the channel: 10^21 vs 10^17
○ Highly depends on performance requirements and cost
○ Consider density regarding size of failure zones (node, rack)
○ Consider different storage types for different use cases
■ Standard, performance, archive/backup/cold data
○ SSDs: keep DWPD in mind!
● Storage controller
○ Bandwidth, performance, cache size
○ HBA preferred over RAID
○ If using RAID (RAID 0): a battery- or flash-protected write-back cache has a huge impact on writes
39
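With the FileStore backend, the journal is placed on a separate (SSD) device when the OSD is created; a hedged sketch using ceph-disk (device names are examples):
ceph-disk prepare /dev/sdb /dev/sdd    # data on the HDD /dev/sdb, journal partition on the SSD /dev/sdd
ceph-disk activate /dev/sdb1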
EXPERIENCES
SECURITY
ATTACK VECTORS
41
● Generic vectors
○ Network
○ Authentication
○ Code flaws
● Many potential component-specific vectors
○ RBD
■ librbd vs. kernel module
○ RadosGW
■ same exposure as any web service
○ CephFS
● Deep dive: http://www.slideshare.net/dalgaaf/open-stack-dost-frankfurt-ceph-security
COUNTERMEASURES - TIPS
● Network
○ Always use separate cluster and public networks
○ Always separate your control nodes from other networks
○ Don’t expose to the open internet
○ Encrypt inter-datacenter traffic
● Avoid hyper-converged infrastructure
○ Isolate compute and storage resources
○ Scale them independently
○ Risk mitigation if daemons are compromised or DoS’d
○ Don’t mix
■ compute and storage
■ control nodes (OpenStack and Ceph)
42
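On top of the separated networks it is cheap to restrict the Ceph ports to the trusted subnets; a sketch with example interfaces and subnets (default ports: 6789 for MONs, 6800-7300 for OSDs):
iptables -A INPUT -i eth0 -p tcp --dport 6789 -s 192.168.10.0/24 -j ACCEPT       # MONs, public network
iptables -A INPUT -i eth0 -p tcp --dport 6800:7300 -s 192.168.10.0/24 -j ACCEPT  # OSDs, public network
iptables -A INPUT -i eth1 -p tcp --dport 6800:7300 -s 192.168.20.0/24 -j ACCEPT  # OSDs, cluster network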
COUNTERMEASURES - RADOSGW
● Big and easy target through the HTTP(S) protocol
● Small appliance per tenant with
○ Separate network
○ SSL-terminating proxy forwarding requests to radosgw
○ WAF (mod_security) to filter requests
○ Placed in a secure/managed zone
● Don’t share buckets/users between tenants
43
COUNTERMEASURES - CEPHX
● Monitors are trusted key servers
○ Store copies of all entity keys
○ Each key has an associated “capability”
■ Plaintext description of what the key user is allowed to do
● What you get
○ Mutual authentication of client + server
○ Extensible authorization w/ “capabilities”
○ Protection from man-in-the-middle, TCP session hijacking
● What you don’t get
○ Secrecy (encryption over the wire)
● What you can do
○ Restrict capabilities associated with each key
○ Limit administrators’ power
■ use ‘allow profile admin’ and ‘allow profile readonly’
■ restrict role-definer or ‘allow *’ keys
○ Careful key distribution (Ceph/OpenStack nodes)
44
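Restricting capabilities per key is done via ceph auth; a sketch with example client name and pool:
ceph auth get-or-create client.backup mon 'allow r' osd 'allow rwx pool=backup'   # limited to one pool
ceph auth caps client.backup mon 'allow r' osd 'allow r pool=backup'              # tighten an existing key
ceph auth list                                                                    # review all keys and caps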
SECURITY - STATUS
● Reactive processes are in place
○ security@ceph.com , CVEs, downstream product updates, etc.
● Proactive measures in progress
○ Code quality improves (SCA, etc.)
○ Unprivileged daemons
○ MAC (SELinux, AppArmor)
○ Encryption
● Progress defining security best-practices
○ Document best practices for security
● Ongoing process
45
EXPERIENCES
HIGH AVAILABILITY
HA - FAILURE SCENARIOS
● Power and network outage
● Failure of a server or component
● Failure of a software service
● Disaster
● Human errors
○ Still often the leading cause of outages
○ Misconfiguration
○ Accidents
○ Emergency power-off
● Deep dive: http://www.slideshare.net/dalgaaf/99999-available-openstack-cloud-a-builders-guide
47
CEPH AND OPENSTACK - WORST CASES
48
● Fire compartment fails
CEPH AND OPENSTACK - WORST CASES
49
● Split brain scenarios
CEPH AND OPENSTACK - MITIGATION
50
● Most DCs have backup rooms
● Only a few servers are needed to host quorum-related services
● Less cost-intensive than three fire compartments
● Can mitigate split brain between FCs (depending on network layout)
MITIGATION - APPLICATIONS
51
● The target of five 9’s is end-to-end (E2E)
○ Five 9’s at the data center level is very expensive
○ No pets allowed !!!
○ Only cloud-ready applications
OUTLOOK
HARDWARE
HARDWARE
● Open Kinetic
○ Classic 3.5” form factor
○ SAS HDD connector
○ Dual 1Gb/s Ethernet
○ Connect drives directly to data center fabric
○ Requires enclosures (e.g. Supermicro)
○ Allows intelligent drives
● Seagate Kinetic KV drives
○ Provide a RESTful KV interface
○ Improve performance of SMR (Shingled Magnetic Recording) drives
○ Eliminates the filesystem and simplifies logic in Ceph
○ Still requires separate compute power for Ceph
53
HARDWARE
● HGST
○ Open Ethernet drive architecture
○ Drive running Linux
○ CPU (32bit ARM) and RAM
○ Each drive appears as a Linux server
○ Can be provisioned with SDS applications
● Toshiba
○ Open Kinetic form factor with an Open-Ethernet-drive-like architecture
○ 64bit MIPS processor
○ Contains e.g. 2x2.5” drives (HDD or SSD plus NVRAM)
○ Drive running Linux
○ Can be provisioned with Ceph
○ Eliminates need for compute power to run OSDs
○ Consumes 14-16 W per drive
54
OUTLOOK
FEATURES
SOME ROADMAP ITEMS
● RBD
○ Mirroring (async)
○ Native VMware driver
● RGW
○ Active/Active multi-site
● CephFS
○ Get it production ready (fsck, repair, …)
● QoS for clients
● Performance
○ NewStore (bypass journal and file system where possible)
○ Cache Tiering improvements
● Security
○ SELinux/AppArmor MAC profiles
○ Non-root daemons
● Encryption in flight ?
56
CONCLUSIONS
Conclusions
● Ceph
○ RBD/RGW is ready for enterprise production environments
○ CephFS is on the way
● Security
○ Reactive processes are in place
○ Proactive measures in progress
● High Availability
○ Requires careful planning, especially within a cloud setup
○ A third fire compartment or a dedicated quorum room is required
○ NO PETS, only cloud-ready applications !!!
● Open Kinetic based drives can simplify your DC
58
Get involved !
● Ceph
○ https://ceph.com/community/contribute/
○ ceph-devel@vger.kernel.org
○ IRC: OFTC
■ #ceph,
■ #ceph-devel
○ Ceph Developer Summit
● OpenStack
○ Cinder, Glance, Manila, ...
59
danny.al-gaaf@bisect.de
dalgaaf
blog.bisect.de
@dannnyalgaaf
linkedin.com/in/dalgaaf
xing.com/profile/Danny_AlGaaf
Danny Al-Gaaf
Senior Cloud Technologist
Q&A - THANK YOU!
