Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Workshop
1. Scale and performance: Servicing the
Fabric and the Workshop
Steve Quenette, Deputy Director, Monash eResearch Centre, and
Blair Bethwaite, Technical Lead, Monash eResearch Centre
Ceph Day Melbourne 2015, Monash University
brought to you by
2. brought to you by
computing for research:
… extremes over spectrums …
1. peak vs long-tail
(spectrum of user expectations)
2. permeability: solo vs multidisciplinary
(spectrum of organisational expectations)
3. paradigms: “there is no spoon”
(spectrum of computing expectations)
3. brought to you by
1. peak vs long-tail
• Leading researchers build tools to see what could
not be seen before, and provide that tool for others.
• All researchers apply tools (of others) on new
problems.
“peak”
and the tail
4. brought to you by
2. permeability
• Implies - over time: research verticals…
• become increasingly complicated, involved and leveraged
• involve many organisations and people
6. brought to you by
technology-driven discovery?…
5.7% (CAGR)
Moore’s Curse, IEEE Spectrum, April 2015
http://www.i-scoop.eu/internet-of-things, https://www.ncta.com/broadband-by-the-numbers
http://www.pwc.com/gx/en/technology/mobile-innovation/assets/pwc-mobile-technologies-index-image-sensor-steady-growth-for-new-capabilities.pdf
[Chart: normalised growth of innovations, 1875-2014, log scale - number of components on a microchip; IoT devices on the internet; light efficiency (outdoor and indoor lights); intercontinental travel; capability of image sensors; fuel conversion efficiency (US passenger car); energy cost of steel (coke, natural gas, electricity); US corn crop yield]
7. brought to you by
4 paradigms
Empirical
(“1st paradigm”)
Collecting and enumerating things.
Enabled by telescopes,
microscopes, …
Theoretical
(“2nd paradigm”)
Properties determined by models.
Enabled by innovations in statistics,
calculus, physical laws, …
Computational
(“3rd paradigm”)
Models significantly more complex and larger than a human can compute.
Enabled by computing growth
Data-driven
(“4th paradigm”)
Significantly more and complex data.
Enabled by sensors, storage, IoT
growth
8. brought to you by
… the 4th is really …
Data-mining
There is so much data that “f” can be discovered with little or no preconception of what “f” is.
Enabled by innovations in data-mining models/approaches (“g”)
Data assimilation
Both models and observations are big and complex.
Enabled by innovations in inverse and optimisation models/approaches
Visualisation
Where much more of x and y can be displayed to humans, and the human brain does the “data-mining”
10. brought to you by
21st century microscopes
look more like…
[Diagram: a 21st century microscope - CAPTURE (light source, samples), ANALYSIS (“filters”), INSIGHT (“lens”) and SHARE DATA, spanning the Australian Synchrotron, Monash Biomedical Imaging, Ramaciotti Cryo-EM, the Monash Research Cloud, Digital Scientific Desktops and CAVE2 Immersive Visualisation]
11. brought to you by
computing for research:
… extremes over spectrums…
1. peak vs long-tail
(spectrum of user expectations)
2. permeability: solo vs multidisciplinary
(spectrum of organisational expectations)
3. paradigms: “there is no spoon”
(spectrum of computing expectations)
self service
multiple market-driven front-ends
quality
accessible & multi-tenant
scale
low latency
bandwidth
front-ends “emerge”
12. brought to you by
fabric and workshop
• Ceph (together with OpenStack and Neutron) means our storage is software defined
• It’s more like a fabric
• Self-service to pieces
• We choose the pieces to be right for researchers who orchestrate their own 21st century microscope
• MeRC, including compute, people, etc., is more like a workshop for microscope builders
13. brought to you by
storage IaaS products
• Customer’s storage capacity can be a mix of…
• Vault
• Lower $/TB; fast to write, slow to retrieve
• Market (Object)
• Moderate $/TB
• Amazon S3-like for modern “Access Layers”
• Remote backup optional
• Market (File)
• Higher $/TB
• For traditional filesystem “Access Layers”
• Remote backup implied
• Computational
• Moderate $/TB
• Direct-attached volumes to R@CMon Cloud
• A user can combine storage capacity from other tenants (e.g. RDSI ReDS merit allocation) per “project” - a usage sketch follows below.
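As a rough illustration of self-service against these tiers, a tenant might attach a Computational volume and create a Market (Object) bucket along these lines; the volume-type name, instance ID and RGW endpoint are placeholders for illustration, not the actual R@CMon identifiers:

  # Computational tier: an RBD-backed Cinder volume attached to a running instance
  cinder create --volume-type <computational-type> --display-name scratch-500g 500
  nova volume-attach <instance-uuid> <volume-uuid> /dev/vdc

  # Market (Object) tier: an S3-style bucket via the RGW endpoint
  s3cmd --host=<rgw-endpoint> --host-bucket=<rgw-endpoint> mb s3://my-research-data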
14. brought to you by
storage Access Layers
• MyTardis
• For Instrument Integration
• From sensor to analysis to open access
• Researcher, Facility & Institutional data management
• Figshare
• Data management for Institutions and the long-tail
• (Can trial through R@CMon Storage)
• Aspera
• RDS/VicNode operated FTP & web access tool for very high-speed data transfer
• OwnCloud (not yet in production)
• Dropbox-like
• Linked to user end-points across Access layers
15. brought to you by
some numbers
By allocations (Q3 2015)…
• Vault: 2.5 PB usable
• Market (Object): 0.6 PB usable
• Market (File): 2 PB usable
• Computational: 0.5 PB usable
• Intent: by end of 2016 all* Monash University “storage” for research will be on this infrastructure
(*) Except the ISO27k-accredited hosting facility, and admin storage space used by researchers
16. brought to you by
at the end of the day, we are still consolidating - it’s just that we’ve asked where consolidation should occur
17. brought to you by
Now over to the techies…
Speaking: Blair Bethwaite, Senior HPC Consultant,
Monash eResearch Centre
Monash Ceph Crew:
Jerico Revote, Rafael Lopez, Swe Aung, Craig
Beckman, George Foscolos, George Kralevski,
Steve Davison, John Mann, Colin Blythe
Please ask questions as we go
18. brought to you by
Ceph@Monash, some history
It all started with The Cloud
https://xkcd.com/908/ (NeCTAR logo added)
19. brought to you by
speaking of accidents
• In early 2013 R@CMon started with Monash’s first zone of the NeCTAR cloud
• Our own local cloud = awesome! But, “where do we store all the things?”
• No persistent volume service provided by NeCTAR; expected from other funding sources
• Plenty of object storage though…
• Enter Cuttlefish!
• “monash-01” Cinder zone backed by Ceph available mid-2013 (config sketch below)
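For context, a minimal sketch of what a Ceph/RBD-backed Cinder zone looks like in cinder.conf; the backend, pool and user names here are assumptions for illustration rather than the monash-01 values:

  [DEFAULT]
  enabled_backends = rbd-monash01

  [rbd-monash01]
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  volume_backend_name = rbd-monash01
  rbd_pool = volumes
  rbd_user = cinder
  rbd_ceph_conf = /etc/ceph/ceph.conf
  # secret registered with libvirt so nova can attach the volumes
  rbd_secret_uuid = <libvirt-secret-uuid>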
20. brought to you by
show and tell: monash-01
• (Disclaimer: we’re not good at names)
• The hardware - repurposed Swift servers:
• 8x Dell R720xd (co-located OSDs & mons x5) - 24TB/node
• 12x 2TB 7.2k NL-SAS (12x RAID0, PERC H710p)
• 2x E5-2650 (2GHz), 32GB RAM
• 20GbE (Intel X520 DP), VLANs for back/front-end
• Ceph Firefly on Ubuntu Precise, 2 replicas, ~90TB usable, 60TB used, 135TB committed (thin provisioning)
21. brought to you by
show and tell: monash-02
• 17x Dell R720xd (virtualised mons x3)
• 9x 4TB 7.2k NL-SAS (9x RAID0, PERC H710p) - 36TB/node
• 3x 200GB Intel DC S3700 SSDs (journals and future cache)
• 1x E5-2630Lv2 (2.4GHz), 32GB RAM
• 2x 10GbE (Mellanox CX-3 DP), back/front-end active on alternate ports (different ToR switches)
• Ceph Firefly on Ubuntu Trusty, 2 replicas, ~300TB usable, 110TB used, 130TB committed
What did we change?
22. brought to you by
show and tell: rds[i]
• 3x Dell R320 (mons)
• 4x Dell R720xd (cache tier) - 18TB/node
• 20x 900GB 10k SAS (20x RAID0, PERC H710p) - rgw hot tier
• 4x 400GB Intel DC S3700 SSDs (journals for rgw hot tier)
• 2x E5-2630v2 (2.6GHz), 128GB RAM
• 56GbE (Mellanox CX-3 DP), VLANs for back/front-end
23. brought to you by
show and tell: rds[i]
• 33x Dell R720xd + 66 MD1200 (2 per node) - 144TB/node
• 8x 6TB 7.2k NL-SAS (8x RAID0, PERC H710p) - rgw EC cold tier (tiering sketch below)
• 24x 4TB 7.2k NL-SAS (24x RAID0, PERC H810) - rbds go here
• 4x 200GB Intel DC S3700 SSDs (journals for rbd pool)
• 2x E5-2630v2 (2.6GHz), 128GB RAM
• 20GbE (Mellanox CX-3 DP), VLANs for back/front-end
• Ceph Hammer on RHEL 7 (Maipo)
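A hedged sketch of the kind of commands that stand up an erasure-coded pool with a replicated cache tier in front of it (Firefly/Hammer-era CLI); the pool names, PG counts, EC profile and thresholds are illustrative, not the production values:

  ceph osd erasure-code-profile set rgw-ec k=4 m=2
  ceph osd pool create rgw-cold 2048 2048 erasure rgw-ec
  ceph osd pool create rgw-hot 1024 1024 replicated
  ceph osd tier add rgw-cold rgw-hot
  ceph osd tier cache-mode rgw-hot writeback
  ceph osd tier set-overlay rgw-cold rgw-hot
  ceph osd pool set rgw-hot hit_set_type bloom
  ceph osd pool set rgw-hot target_max_bytes 50000000000000   # ~50 TB before flush/evict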
25. brought to you by
rgw HA architecture
• DNS round-robin provides initial HA request fanout
• HAProxy instances handle load-balancing and SSL/TLS termination (config sketch below)
• Scale arbitrarily in pairs, with keepalived providing redundancy and HA via virtual/floating IP address (VIP) failover
• RGW instances handle actual client/application protocol (S3, Swift, etc.) traffic
• Scale arbitrarily.
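A minimal sketch of one HAProxy/keepalived pair in this arrangement; the VIP, ports and backend hosts are example values only, not the production addresses:

  # /etc/haproxy/haproxy.cfg (fragment)
  frontend rgw_https
      mode http
      bind 203.0.113.10:443 ssl crt /etc/haproxy/rgw.pem   # VIP, TLS terminated here
      default_backend rgw
  backend rgw
      mode http
      balance roundrobin
      server rgw1 172.16.93.11:7480 check
      server rgw2 172.16.93.12:7480 check

  # /etc/keepalived/keepalived.conf (fragment) - VIP failover between the pair
  vrrp_instance rgw_vip {
      # MASTER here, BACKUP (with lower priority) on the peer
      state MASTER
      interface eth0
      virtual_router_id 51
      priority 150
      virtual_ipaddress {
          203.0.113.10/24
      }
  }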
26. brought to you by
new hardware/capacity
• monash-02 - another 10 nodes same config
• rds - another 9 nodes same config
• Refresh monash-01 cluster:
• 9x Dell R730xd - 96TB/node
• 16x 6TB 7.2k NL-SAS
• 2x 400GB Intel DC P3700 NVMe (journals)
• 1x E5-2630v3 (2.5GHz), 128GB RAM
• 20GbE (Intel X710 DP), VLANs for back/front-end
16x 3.5” data drives in 2RU!
27. brought to you by
pain / nits
• Most problems have been indirect, i.e. operating system and hardware; Ceph itself has been solid
• But it can be very opaque when things go wrong
• E.g., what is wrong in this picture, how bad is it, is there any commonality or correlation of symptoms, does the cluster need intervention to recover? (Some starting points follow the output.)
    cluster b8bf920a-de81-4ea5-b63e-2d5f8cced22d
     health HEALTH_WARN
            23 pgs backfill
            68 pgs backfilling
            1230 pgs degraded
            6017 pgs down
            46 pgs incomplete
            8099 pgs peering
            94 pgs recovering
            41 pgs recovery_wait
            2824 pgs stale
            1204 pgs stuck degraded
            8908 pgs stuck inactive
            2824 pgs stuck stale
            9913 pgs stuck unclean
            1073 pgs stuck undersized
            1092 pgs undersized
            1308 requests are blocked > 32 sec
            recovery 168114/1648042 objects degraded (10.201%)
            recovery 52842/1648042 objects misplaced (3.206%)
            recovery 1056/460665 unfound (0.229%)
            74/256 in osds are down
            1 mons down, quorum 1,2 rcmondc1r75-02-ac,rcmondc1r75-01-ac
     monmap e2: 3 mons at {rcmondc1r75-01-ac=172.16.93.3:6789/0,rcmondc1r75-02-ac=172.16.93.2:6789/0,rcmondc1r75-03-ac=172.16.93.1:6789/0}
            election epoch 51186, quorum 1,2 rcmondc1r75-02-ac,rcmondc1r75-01-ac
     osdmap e103326: 848 osds: 182 up, 256 in; 1153 remapped pgs
      pgmap v3451913: 10560 pgs, 18 pools, 1152 GB data, 449 kobjects
            2547 GB used, 784 TB / 786 TB avail
            168114/1648042 objects degraded (10.201%)
            52842/1648042 objects misplaced (3.206%)
            1056/460665 unfound (0.229%)
                4703 down+peering
                1236 stale+down+peering
                 877 stale+peering
                 634 peering
                 474 active+undersized+degraded
                 422 remapped+peering
                 381 active+clean
                 266 stale+active+clean
                 251 active+remapped
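Some starting points for unpacking a state like that one (all standard Ceph CLI):

  ceph health detail              # expands each HEALTH_WARN line, PG by PG
  ceph osd tree                   # do the down OSDs cluster in particular hosts/racks?
  ceph pg dump_stuck inactive     # which PGs are stuck and which OSDs they map to
  ceph pg <pgid> query            # why one PG is down/incomplete and what it is waiting on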
28. brought to you by
challenges / questions
• Best approach for vNAS performance via KVM+librbd?
• Filesystem, # of rbds, what interface, tuning?
• Disk failure handling process
• Current policy is to redeploy the OSD for any media error (removal/redeploy sketch below)
• http://tracker.ceph.com/projects/ceph/wiki/A_standard_framework_for_Ceph_performance_profiling_with_latency_breakdown
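A hedged sketch of the removal/redeploy steps behind that policy (Hammer-era CLI; the OSD ID, host and device paths are illustrative):

  ceph osd out osd.123                      # drain it; wait for backfill to finish
  stop ceph-osd id=123                      # upstart; 'systemctl stop ceph-osd@123' on EL7
  ceph osd crush remove osd.123
  ceph auth del osd.123
  ceph osd rm osd.123
  # replace the drive / rebuild the single-disk RAID0, then redeploy, e.g.:
  ceph-deploy osd create <host>:<data-device>:<journal-device>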
29. brought to you by
learnings
• use dedicated journals (config sketch below)
• network matters - it becomes much more visible
• RAID controllers with no native JBOD are OK, but be prepared for more complicated ops
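To make the first two learnings concrete, a ceph.conf fragment of the kind we mean; the subnets and journal size are illustrative, not the Monash values:

  [global]
  # client-facing traffic
  public network = 172.16.92.0/24
  # replication/backfill traffic kept separate (and much more visible)
  cluster network = 172.16.93.0/24

  [osd]
  # ~10 GB journal on a dedicated SSD partition
  osd journal size = 10240
  osd mkfs type = xfs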
30. [Embedded material: labels “Open IaaS:”, “Technology:”, “Application layers:”; screenshots of the MASSIVE Business Plan 2013/2014 (MASSIVE-BP-2.3 DRAFT, June 2013, prepared by Wojtek J Goscinski, MASSIVE Coordinator) and the MyTardis site (“Automatically stores your instrument data for sharing”; MyTardis Tech Group Meeting #3, posted 20 August 2015 by Steve Androulakis)]