CEPH AT WORK IN BLOOMBERG
Object Store, RBD and OpenStack
August 18, 2015
By: Chris Jones
Copyright 2015 Bloomberg L.P.
BLOOMBERG
30 Years in under 30 Seconds
● Subscription-based financial information provider (Bloomberg Terminal)
● Online, TV, print, and real-time streaming information
● Offices and customers in every major financial market and institution worldwide
BLOOMBERG
Primary product - Information
● Bloomberg Terminal
− Over 60,000 features/functions. For example, the ability to track oil tankers in real time via satellite feeds
● The most important internal feature to me is the “Soup List.”
− Note: Exact numbers are not specified. Contact media relations for specifics and other important information.
BLOOMBERG TERMINAL
Terminal example – you can search for anything, even bios of the wealthiest people in the world (Billionaire’s list).
CLOUD INFRASTRUCTURE
*PROBLEMS TO SOLVE*
CLOUD INFRASTRUCTURE GROUP
Primary customers
– Developers
– Product Groups
● Many different development groups throughout our organization
● Many thousands of developers throughout our organization
● Every one of them wants and needs resources
CLOUD INFRASTRUCTURE GROUP
Resource Problems
● Developers
− Development
− Testing
− Automation (Cattle vs. Pets)
● Organizations
− POC
− Products in production
− Automation
● Security/Networking
− Compliance
HCP “THEOREM”
All distributed storage and cloud computing systems fall under what I call the HCP “Theorem”: Hard, Complex, Painful. Unlike the CAP theorem, where you can have Consistency or Availability but not both, with HCP you are guaranteed to have at least two, if not all three, in a distributed scalable system. The question is: how do you lessen or remove the parts of this endless cycle?
(Diagram: Hard / Complex / Painful cycle)
ISSUES
● Security/Compliance
● Automation
● Logging, Monitoring, Auditability
● Alter thinking – educate (painfully slow)
● Failure (hardware and ideas)
● Increase Tolerance
● Scaling
● Compute
● Distributed Storage
HOW DID WE SOLVE IT OR DID WE?
(Diagram: Hard / Complex / Painful triangle with the “sweet spot” in the middle)
We focused on the “sweet spot”
● Hard
− Open Source products with strong community support
− We looked for compute, networking and storage that scaled
− Engaged Security and Networking teams
● Complex
− Automation – Chef, Ansible. Everything must be able to be rebuilt from source control (Git). No manual steps
− Engaged Security and Networking teams
● Painful
− Created converged architecture (compute/storage). In theory it looked like it would fit in the sweet spot, but in reality it created more pain
− Still working to get our developers to treat their resources as Cattle vs. Pets – NO Pets Policy!
− Talent
− Engaged Security and Networking teams
● Sweet spot
− Ceph – Object Store and RBD Block/Volume
− OpenStack (not all projects)
CEPH AND OPENSTACK
USE IN BLOOMBERG
● Ceph – RGW (Object Store)
● Ceph – RBD (Block/Volumes)
● OpenStack
─ Compute, Keystone, Cinder, Glance…
─ Ephemeral storage (new)
● Object Store is becoming one of the most popular items (a minimal client sketch follows below)
● OpenStack Compute with Ceph-backed block store volumes is very popular
● We are introducing ephemeral compute storage
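Since the Object Store is RGW's S3-compatible API, a minimal client sketch in Python with boto looks roughly like this (the endpoint and credentials are placeholders, not our real configuration):

    import boto
    import boto.s3.connection

    # Endpoint and credentials are placeholders for illustration only.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Round-trip a small object through RGW.
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from RGW')
    print(key.get_contents_as_string())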
STANDARD STACK
(Diagram: OpenStack Converged Cluster)
INTUITIVE OR COUNTER INTUITIVE
Completely Converged Architecture
● OpenStack and Ceph
− Reduced footprint
− Scalability
− Attempt to reduce “Hard” and “Complex” and eliminate “Painful”
● Controller (Head) Nodes
− Ceph Mon, Ceph OSD, RGW
− Nova, Cinder, MySQL, RabbitMQ, etc
● Side Effects (Sometimes you fail)
− Had to increase pain tolerance
− Initial automation did get easier (reduced “Hard”), but “Complex” increased along with “Pain”
− Made it more painful to balance loads
CONVERGED STACK
Converged Architecture Rack Layout
● 3 Head Nodes (Controller Nodes)
− Ceph Monitor
− Ceph OSD
− OpenStack Controllers (All of them!)
− HAProxy
● 1 Bootstrap Node
− Cobbler (PXE Boot)
− Repos
− Chef
− Rally/Tempest
● Remaining Nodes
− Nova Compute
− Ceph OSDs
− RGW – Apache
● Ubuntu
● Shared spine with Hadoop resources
(Rack diagram: Bootstrap Node at top; remaining stack of Compute/Ceph OSD/RGW/Apache nodes)
OSD BANDWIDTH – ATTEMPT TO BETTER IT
(Chart: OSD bandwidth before and after renicing OSD daemons – higher is better)
OSD LATENCY – ATTEMPT TO BETTER IT
(Chart: OSD latency before and after renicing OSD daemons – lower is better. Note: chart mislabeled – the left axis should be ms but reads seconds)
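The renice experiment above is normally a one-liner with the renice command; a rough Python equivalent using psutil, purely as a sketch (the -10 niceness is illustrative, not the exact value behind the charts):

    import psutil

    # Raise the scheduling priority of all ceph-osd daemons (requires root).
    # -10 is an illustrative niceness value, not the one used for the charts above.
    for proc in psutil.process_iter():
        try:
            if proc.name() == 'ceph-osd':
                proc.nice(-10)
                print('reniced ceph-osd pid %d' % proc.pid)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass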
LESSON LEARNED? - BETTER SOLUTION?
Semi-Converged Architecture - POD
● OpenStack and Ceph
− “Complex” increases, but “Hard” and “Painful” decrease. “Painful” could be gone, but we are talking about OpenStack too
● Controller (Head) Nodes
− Nova, Cinder, MySQL, RabbitMQ, etc. split and balanced better
− More purpose-built, but easily provisioned as needed
● Ceph Nodes
− Split Object Store out of the OpenStack cluster so it can scale more easily
− Dedicated Ceph Mons
− Dedicated Ceph OSDs
− Dedicated RGW – replaced Apache with Civetweb
− Much better performance and maintenance
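For reference, the Apache-to-Civetweb switch mentioned above amounts to a one-line ceph.conf change (the bracketed instance name and port are placeholders; 7480 is Civetweb's default):

    [client.rgw.gateway-node]
    rgw frontends = civetweb port=7480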
POD ARCHITECTURE (OPENSTACK/CEPH)
(Diagram – illustrative only, not representative: one OpenStack POD behind a TOR switch with HAProxy, OS-Nova, OS-Rabbit, and OS-DB nodes, plus bootstrap, monitoring, and ephemeral nodes; and one Ceph POD, RBD only, with 3 Ceph Mons and Ceph OSD nodes. Ephemeral is fast but dangerous: not Ceph-backed, exposed via host aggregates & flavors. A number of large providers have taken similar approaches.)
POD ARCHITECTURE (OPENSTACK/CEPH)
(Diagram – illustrative only, not representative: one OpenStack POD behind a TOR switch with Ceph block, OS-Nova, OS-Rabbit, and OS-DB nodes, plus two Ceph PODs, each with 3 Ceph Mons and Ceph OSD nodes. Scale and re-provision PODs as needed; 3 PODs per rack. A number of large providers have taken similar approaches.)
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
(Diagram: Ceph vs. Ephemeral storage, simplified)
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Ceph – Advantages
● All data is replicated at least 3 ways across the cluster
● Ceph RBD volumes can be created, attached and detached from any hypervisor
● Very fast provisioning using COW (copy-on-write) images (see the sketch below)
● Allows easy instance re-launch in the event of hypervisor failure
● High read performance
Ephemeral – Advantages
● Offers read/write speeds that can be 3-4 times faster than Ceph with lower latency
● Can provide fairly large volumes for cheap
Ceph – Disadvantages
● All writes must be acknowledged by multiple nodes before being considered committed (the tradeoff for reliability)
● Higher latency due to Ceph being network based instead of local
Ephemeral – Disadvantages
● Trades data integrity for speed: if one drive fails in RAID 0, all data on that node is lost
● May be difficult to add more capacity (depends on the type of RAID)
● Running in JBOD LVM mode without RAID, performance was not as good as Ceph
● Less important: with RAID, your drives need to be the same size or you lose capacity
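As a sketch of the provisioning flow behind those RBD advantages, using the Python rados/rbd bindings that ship with Ceph (the pool, image, and snapshot names are made up for illustration):

    import rados
    import rbd

    # Connect with the default config; pool and image names are made up.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # assumes a pool named 'rbd'
    try:
        r = rbd.RBD()
        # Layering must be enabled on the parent image for COW cloning.
        r.create(ioctx, 'base-image', 10 * 1024 ** 3,
                 old_format=False, features=rbd.RBD_FEATURE_LAYERING)
        with rbd.Image(ioctx, 'base-image') as img:
            img.create_snap('golden')
            img.protect_snap('golden')  # clones require a protected snapshot
        # The clone is near-instant and shares unmodified data with the parent.
        r.clone(ioctx, 'base-image', 'golden', ioctx, 'vm-volume-1',
                features=rbd.RBD_FEATURE_LAYERING)
    finally:
        ioctx.close()
        cluster.shutdown()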
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Metric                             Ephemeral      Ceph
Block write bandwidth (MB/s)        1,094.02    642.15
Block read bandwidth (MB/s)         1,826.43    639.47
Character read bandwidth (MB/s)         4.93      4.31
Character write bandwidth (MB/s)        0.83      0.75
Block write latency (ms)               9.502    37.096
Block read latency (ms)                8.121     4.941
Character read latency (ms)            2.395     3.322
Character write latency (ms)          11.052    13.587
Note: Ephemeral in JBOD/LVM mode is not as fast as Ceph
OBJECT STORE STACK (SINGLE RACK)
Small single-purpose (lab or whatever) cluster/rack – Red Hat 7.1
● Rack = Cluster
● Smaller Cluster – Storage node number could be “short stack”
● 1 TOR and 1 Rack Mgt Node
● 3 Ceph Mon Nodes (No OSDs)
● Up to 14 Ceph OSD nodes (depends on size)
● 2x or 3x Replication depending on need (3x default)
● 1 RGW (coexist with Mon or OSD Node)
● 10g Cluster interface
● 10g Public interface
● 1g Management interface
● OSD Nodes (lower density nodes)
− Option 1: 6TB HDD x 12 – Journal partition on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals with 5:1 ratio
− Option 3: 6TB HDD x 12 – 1 NVMe SSD for Journals with 12:1 ratio
− Choose based on tolerance level and failure domain for specific use case
− ~1PB of raw space – ~330TB usable (depends on drives; see the math below)
(Rack diagram: TOR/IPMI, 3 Mon nodes, storage nodes)
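As a sanity check on the capacity bullet above: 14 OSD nodes x 12 drives x 6 TB = 1,008 TB, roughly 1 PB raw; divided by 3 for the default replication, that is 336 TB, in line with the ~330 TB usable quoted (before filesystem and journal overhead).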
OBJECT STORE STACK (3 RACK CLUSTER)
(Diagram: 3-rack cluster – leaf/spine network with redundant spines and load balancers; each rack has a TOR leaf switch, one Mon/RGW node, and storage nodes)
OBJECT STORE STACK (3 RACK CLUSTER)
Standard cluster is 3 or more racks
● Min of 3 Racks = Cluster
● 1 TOR and 1 Rack Mgt Node
● 1 Ceph Mon node per rack (No OSDs)
● Up to 15 Ceph OSD nodes per rack (depends on size)
● 1 RGW (dedicated node)
● OSD Nodes (lower density nodes)
− Option 1: 6TB HDD x 12 – Journal partition on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals with 5:1 ratio
− Option 3: 6TB HDD x 12 – 1 NVMe SSD for Journals with 12:1 ratio
− Choose based on tolerance level and failure domain for specific use case
(Rack diagram: TOR/IPMI, one Mon/RGW node, storage nodes)
OBJECT STORE STACK
Standard configuration
● Min of 3 Racks = Cluster
● Cluster Network: Bonded 10g or higher depending on size of cluster
● Public Network: Bonded 10g for RGW interfaces
● 1 Ceph Mon node per rack, except on clusters larger than 3 racks: we need to keep an odd number of Mons, so some racks may not have one. On larger clusters we try to keep racks & Mons in different power zones
● We have developed a healthy “Pain” tolerance. We can survive an entire rack going down, but we mainly see drive failures and some node failures
● Min 1 RGW (dedicated node) per rack (may want more)
● Hardware load balancers to the RGWs, with redundancy
● OSD Nodes (lower density nodes) – we have both options below. Currently looking at new hardware and drive options
− Option 1: 6TB HDD x 12 – Journal partition on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals with 5:1 ratio
AUTOMATION
All of what we do only happens because of automation
● Cloud Infrastructure Group uses Chef and Ansible. We use Ansible for
orchestration and maintenance
● Bloomberg Github: https://github.com/bloomberg/chef-bcpc
● Ceph specific options
− Ceph Deploy: https://github.com/ceph/ceph-deploy (see the sketch below)
− Ceph Ansible: https://github.com/ceph/ceph-ansible
− Ceph Chef: https://github.com/ceph/ceph-cookbook
● Our bootstrap server is our Chef server per cluster
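For a feel of what ceph-deploy (one of the options listed above) does, here is the shape of a minimal bootstrap driven from Python, purely as a sketch; the hostnames are placeholders:

    import subprocess

    # Hypothetical hostnames; ceph-deploy runs from an admin node with SSH access.
    mons = ['mon1', 'mon2', 'mon3']

    def run(args):
        print('+ ' + ' '.join(args))
        subprocess.check_call(args)

    run(['ceph-deploy', 'new'] + mons)             # write initial ceph.conf and mon keyring
    run(['ceph-deploy', 'install'] + mons)         # install Ceph packages on the nodes
    run(['ceph-deploy', 'mon', 'create-initial'])  # form the initial mon quorum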
TESTING
Testing is critical. We use different strategies for the different parts of
OpenStack and Ceph we test
● OpenStack
− Tempest – We currently only use this for patches we make. We plan to use it more in our DevOps pipeline
− Rally – Can’t do distributed testing, but we use it to find bottlenecks in OpenStack itself
● Ceph
− RADOS Bench (see the sketch after this list)
− COSBench – Going to try this with CBT
− CBT – Ceph Benchmark Testing
− CeTune
− Bonnie++
− FIO
● Ceph – RGW
− JMeter – Need to test load at scale. Takes a cloud to test a cloud
● A lot of the time you find it’s your network, load balancers, etc.
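RADOS Bench itself is the rados bench CLI; purely to illustrate what it measures, a toy write benchmark over the librados Python bindings (assuming a pool named 'bench' already exists) might look like:

    import time
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('bench')  # assumes a pool named 'bench' exists

    payload = b'x' * (4 * 1024 * 1024)  # 4 MB objects, rados bench's default size
    count = 64

    start = time.time()
    for i in range(count):
        ioctx.write_full('bench_obj_%d' % i, payload)
    elapsed = time.time() - start
    print('wrote %d x 4 MB in %.2fs -> %.1f MB/s' % (count, elapsed, count * 4 / elapsed))

    # Clean up the benchmark objects.
    for i in range(count):
        ioctx.remove_object('bench_obj_%d' % i)
    ioctx.close()
    cluster.shutdown()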
OPENSOURCE STACK
https://github.com/bloomberg/chef-bcpc
Contribute or keep track to see how we’re changing things
We develop on laptops using VirtualBox before testing on real hardware
CEPH USE CASE DEMAND – GROWING!
(Diagram: Ceph at the center, surrounded by use cases – Object, Immutable, OpenStack, Real-time*, Big Data*?)
*Possible use cases if performance is enhanced
WHAT’S NEXT?
Continue to evolve our POD architecture
● OpenStack
− Work on performance improvements and track stats on usage for departments
− Better monitoring
● Containers and PaaS
− We’re currently evaluating PaaS software and container strategies
● Better DevOps Pipelining
− GoCD and/or Jenkins improved strategies
− Continue to enhance automation and re-provisioning
− Add testing to automation
● Ceph
− Erasure coding
− Performance improvements – Ceph Hackathon showed very promising improvements
− RGW Multi-Master (multi-sync) between datacenters
− Enhanced security – encryption at rest (which we can already do) but with better key management
− Purpose-built pools for specific use cases (e.g., lower density but blazingly fast hot-swappable NVMe SSDs)
− Possible RGW caching, so external pulls come only from a CDN
THANK YOU
ADDITIONAL RESOURCES
● Chris Jones: cjones303@bloomberg.net
● Twitter: @hanschrisjones, @iqstack, @cloudm2
● BCPC: https://github.com/bloomberg/chef-bcpc
− Current repo for Bloomberg’s Converged OpenStack and Ceph cluster
● Ceph Hackathon: http://pad.ceph.com/p/hackathon_2015-08
● Soon: a pure Ceph Object Store (COS) repo will be in the Bloomberg GitHub
− This will have no OpenStack and only be Object Store (RGW – RADOS Gateway), no block devices (RBD)
● Other repos (automation, new projects, etc.):
− IQStack: https://github.com/iqstack - managed by me (disclosure)
− Personal: https://github.com/cloudm2 - me 
− Ansible: https://github.com/ceph/ceph-ansible
− Chef: https://github.com/ceph/ceph-cookbook – this one is going through a major overhaul and is also managed by me for Ceph
