Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Workshop
1. Scale and performance: Servicing the
Fabric and the Workshop
Steve Quenette, Deputy Director, Monash eResearch Centre, and
Blair Bethwaite, Technical Lead, Monash eResearch Centre
Ceph Day Melbourne 2015, Monash University
brought to you by
2. brought to you by
computing for research:
… extremes over spectrums …
1. peak vs long-tail
(spectrum of user expectations)
2. permeability: solo vs multidisciplinary
(spectrum of organisational expectations)
3. paradigms: “there is no spoon”
(spectrum of computing expectations)
3. brought to you by
1. peak vs long-tail
• Leading researchers build tools to see what could
not be seen before, and provide that tool for others.
• All researchers apply tools (of others) on new
problems.
“peak”
and the tail
4. brought to you by
2. permeability
• Implies - over time: research verticals…
• become increasingly complicated, involved and leveraged
• involve many organisations and people
6. brought to you by
technology-driven discovery?…
5.7% (CAGR)
Moore’s Curse, IEEE Spectrum, April 2015
http://www.i-scoop.eu/internet-of-things, https://www.ncta.com/broadband-by-the-numbers
http://www.pwc.com/gx/en/technology/mobile-innovation/assets/pwc-mobile-technologies-index-image-sensor-steady-growth-for-new-capabilities.pdf
[Chart: normalised growth of innovations, 1875-2014, log scale - number of components on a microchip; IoT devices on the internet; light efficiency (outdoor and indoor lights); intercontinental travel; capability of image sensors; fuel conversion efficiency (US passenger car); energy cost of steel (coke, natural gas, electricity); US corn crop yield]
7. brought to you by
4 paradigms
Empirical
(“1st paradigm”)
Collecting and enumerating things.
Enabled by telescopes,
microscopes, …
Theoretical
(“2nd paradigm”)
Properties determined by models.
Enabled by innovations in statistics,
calculus, physical laws, …
Computational
(“3rd paradigm”)
Models significantly more complex and larger than a human can compute.
Enabled by computing growth
Data-driven
(“4th paradigm”)
Significantly more and complex data.
Enabled by sensors, storage, IoT
growth
8. brought to you by
… the 4th is really …
Data-mining
There is so much data that “f” can be discovered with little or no preconception of what “f” is.
Enabled by innovations in data-mining models/approaches (“g”)
Data assimilation
Both models and observations are big and complex.
Enabled by innovations in inverse and optimisation models/approaches
Visualisation
Where much more of x and y can be displayed to humans, and the human brain does the “data-mining”
10. brought to you by
21st century microscopes
look more like…
[Diagram: a 21st century microscope - CAPTURE (light source, samples), ANALYSIS (“filters”), INSIGHT (“lens”) and SHARE DATA, spanning the Australian Synchrotron, Monash Biomedical Imaging, Ramaciotti Cryo-EM, the Monash Research Cloud, Digital Scientific Desktops and CAVE2 Immersive Visualisation]
11. brought to you by
computing for research:
… extremes over spectrums…
1. peak vs long-tail
(spectrum of user expectations)
2. permeability: solo vs multidisciplinary
(spectrum of organisational expectations)
3. paradigms: “there is no spoon”
(spectrum of computing expectations)
self service
multiple market-driven front-ends
quality
accessible & multi-tenant
scale
low latency
bandwidth
front-ends “emerge”
12. brought to you by
fabric and workshop
• Ceph (together with OpenStack and Neutron) means our storage is software defined
• It’s more like a fabric
• Self-service to pieces
• We choose the pieces to be right for researchers who orchestrate their own 21st century microscope
• MeRC, including compute, people, etc., is more like a workshop for microscope builders
13. brought to you by
storage IaaS products
• Customer’s storage capacity can be a mix of…
• Vault
• Lower $/TB; fast to write, slow to retrieve
• Market (Object)
• Moderate $/TB
• Amazon S3-like for modern “Access Layers”
• Remote backup optional
• Market (File)
• Higher $/TB
• For traditional filesystem “Access Layers”
• Remote backup implied
• Computational
• Moderate $/TB
• Direct-attached volumes to R@CMon Cloud
• A user can combine storage capacity from other tenants (e.g. RDSI ReDS merit allocation) per “project” - a usage sketch follows below.
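As a rough illustration of self-service against these tiers, a tenant might attach a Computational volume and create a Market (Object) bucket along these lines; the volume-type name, instance ID and RGW endpoint are placeholders for illustration, not the actual R@CMon identifiers:

  # Computational tier: an RBD-backed Cinder volume attached to a running instance
  cinder create --volume-type <computational-type> --display-name scratch-500g 500
  nova volume-attach <instance-uuid> <volume-uuid> /dev/vdc

  # Market (Object) tier: an S3-style bucket via the RGW endpoint
  s3cmd --host=<rgw-endpoint> --host-bucket=<rgw-endpoint> mb s3://my-research-data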
14. brought to you by
storage Access Layers
• MyTardis
• For Instrument Integration
• From sensor to analysis to open access
• Researcher, Facility & Institutional data management
• Figshare
• Data management for Institutions and the long-tail
• (Can trial through R@CMon Storage)
• Aspera
• RDS/VicNode operated FTP & web access tool for very high-speed data transfer
• OwnCloud (not yet in production)
• Dropbox-like
• Linked to user end-points across Access layers
15. brought to you by
some numbers
By allocations (Q3 2015)…
• Vault: 2.5 PB usable
• Market (Object): 0.6 PB usable
• Market (File): 2 PB usable
• Computational: 0.5 PB usable
• Intent: by end of 2016 all* Monash University “storage” for research will be on this infrastructure
(*) Except the ISO27k-accredited hosting facility, and admin storage space used by researchers
16. brought to you by
at the end of the day, we are still consolidating - it’s just that we’ve asked where consolidation should occur
17. brought to you by
Now over to the techies…
Speaking: Blair Bethwaite, Senior HPC Consultant,
Monash eResearch Centre
Monash Ceph Crew:
Jerico Revote, Rafael Lopez, Swe Aung, Craig
Beckman, George Foscolos, George Kralevski,
Steve Davison, John Mann, Colin Blythe
Please ask questions as we go
18. brought to you by
Ceph@Monash, some history
It all started with The Cloud
https://xkcd.com/908/ (NeCTAR logo added)
19. brought to you by
speaking of accidents
• In early 2013 R@CMon started with Monash’s first zone of the NeCTAR cloud
• Our own local cloud = awesome! But, “where do we store all the things?”
• No persistent volume service provided by NeCTAR; expected from other funding sources
• Plenty of object storage though…
• Enter Cuttlefish!
• “monash-01” Cinder zone backed by Ceph available mid-2013 (config sketch below)
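For context, a minimal sketch of what a Ceph/RBD-backed Cinder zone looks like in cinder.conf; the backend, pool and user names here are assumptions for illustration rather than the monash-01 values:

  [DEFAULT]
  enabled_backends = rbd-monash01

  [rbd-monash01]
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  volume_backend_name = rbd-monash01
  rbd_pool = volumes
  rbd_user = cinder
  rbd_ceph_conf = /etc/ceph/ceph.conf
  # secret registered with libvirt so nova can attach the volumes
  rbd_secret_uuid = <libvirt-secret-uuid>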
20. brought to you by
show and tell: monash-01
• (Disclaimer: we’re not good at names)
• The hardware - repurposed Swift servers:
• 8x Dell R720xd (co-located OSDs & mons x5) - 24TB/node
• 12x 2TB 7.2k NL-SAS (12x RAID0, PERC H710p)
• 2x E5-2650 (2GHz), 32GB RAM
• 20GbE (Intel X520 DP), VLANs for back/front-end
• Ceph Firefly on Ubuntu Precise, 2 replicas, ~90TB usable, 60TB used, 135TB committed (thin provisioning)
21. brought to you by
show and tell: monash-02
• 17x Dell R720xd (virtualised mons x3)
• 9x 4TB 7.2k NL-SAS (9x RAID0, PERC H710p) - 36TB/node
• 3x 200GB Intel DC S3700 SSDs (journals and future cache)
• 1x E5-2630Lv2 (2.4GHz), 32GB RAM
• 2x 10GbE (Mellanox CX-3 DP), back/front-end active on alternate ports (different ToR switches)
• Ceph Firefly on Ubuntu Trusty, 2 replicas, ~300TB usable, 110TB used, 130TB committed
What did we change?
22. brought to you by
show and tell: rds[i]
• 3x Dell R320 (mons)
• 4x Dell R720xd (cache tier) - 18TB/node
• 20x 900GB 10k SAS (20x RAID0, PERC H710p) - rgw hot tier
• 4x 400GB Intel DC S3700 SSDs (journals for rgw hot tier)
• 2x E5-2630v2 (2.6GHz), 128GB RAM
• 56GbE (Mellanox CX-3 DP), VLANs for back/front-end
23. brought to you by
show and tell: rds[i]
• 33x Dell R720xd + 66 MD1200 (2 per node) - 144TB/node
• 8x 6TB 7.2k NL-SAS (8x RAID0, PERC H710p) - rgw EC cold tier (tiering sketch below)
• 24x 4TB 7.2k NL-SAS (24x RAID0, PERC H810) - rbds go here
• 4x 200GB Intel DC S3700 SSDs (journals for rbd pool)
• 2x E5-2630v2 (2.6GHz), 128GB RAM
• 20GbE (Mellanox CX-3 DP), VLANs for back/front-end
• Ceph Hammer on RHEL 7 (Maipo)
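A hedged sketch of the kind of commands that stand up an erasure-coded pool with a replicated cache tier in front of it (Firefly/Hammer-era CLI); the pool names, PG counts, EC profile and thresholds are illustrative, not the production values:

  ceph osd erasure-code-profile set rgw-ec k=4 m=2
  ceph osd pool create rgw-cold 2048 2048 erasure rgw-ec
  ceph osd pool create rgw-hot 1024 1024 replicated
  ceph osd tier add rgw-cold rgw-hot
  ceph osd tier cache-mode rgw-hot writeback
  ceph osd tier set-overlay rgw-cold rgw-hot
  ceph osd pool set rgw-hot hit_set_type bloom
  ceph osd pool set rgw-hot target_max_bytes 50000000000000   # ~50 TB before flush/evict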
25. brought to you by
rgw HA architecture
• DNS round-robin provides initial HA request fanout
• HAProxy instances handle load-balancing and SSL/TLS termination (config sketch below)
• Scale arbitrarily in pairs, with keepalived providing redundancy and HA via virtual/floating IP address (VIP) failover
• RGW instances handle actual client/application protocol (S3, Swift, etc.) traffic
• Scale arbitrarily.
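A minimal sketch of one HAProxy/keepalived pair in this arrangement; the VIP, ports and backend hosts are example values only, not the production addresses:

  # /etc/haproxy/haproxy.cfg (fragment)
  frontend rgw_https
      mode http
      bind 203.0.113.10:443 ssl crt /etc/haproxy/rgw.pem   # VIP, TLS terminated here
      default_backend rgw
  backend rgw
      mode http
      balance roundrobin
      server rgw1 172.16.93.11:7480 check
      server rgw2 172.16.93.12:7480 check

  # /etc/keepalived/keepalived.conf (fragment) - VIP failover between the pair
  vrrp_instance rgw_vip {
      # MASTER here, BACKUP (with lower priority) on the peer
      state MASTER
      interface eth0
      virtual_router_id 51
      priority 150
      virtual_ipaddress {
          203.0.113.10/24
      }
  }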
26. brought to you by
new hardware/capacity
• monash-02 - another 10 nodes same config
• rds - another 9 nodes same config
• Refresh monash-01 cluster:
• 9x Dell R730xd - 96TB/node
• 16x 6TB 7.2k NL-SAS
• 2x 400GB Intel DC P3700 NVMe (journals)
• 1x E5-2630v3 (2.5GHz), 128GB RAM
• 20GbE (Intel X710 DP), VLANs for back/front-end
16x 3.5” data drives in 2RU!
27. brought to you by
pain / nits
• Most problems have been indirect, i.e. operating system and hardware; Ceph itself has been solid
• But it can be very opaque when things go wrong
• E.g., what is wrong in this picture, how bad is it, is there any commonality or correlation of symptoms, does the cluster need intervention to recover? (Some starting points follow the output.)
    cluster b8bf920a-de81-4ea5-b63e-2d5f8cced22d
     health HEALTH_WARN
            23 pgs backfill
            68 pgs backfilling
            1230 pgs degraded
            6017 pgs down
            46 pgs incomplete
            8099 pgs peering
            94 pgs recovering
            41 pgs recovery_wait
            2824 pgs stale
            1204 pgs stuck degraded
            8908 pgs stuck inactive
            2824 pgs stuck stale
            9913 pgs stuck unclean
            1073 pgs stuck undersized
            1092 pgs undersized
            1308 requests are blocked > 32 sec
            recovery 168114/1648042 objects degraded (10.201%)
            recovery 52842/1648042 objects misplaced (3.206%)
            recovery 1056/460665 unfound (0.229%)
            74/256 in osds are down
            1 mons down, quorum 1,2 rcmondc1r75-02-ac,rcmondc1r75-01-ac
     monmap e2: 3 mons at {rcmondc1r75-01-ac=172.16.93.3:6789/0,rcmondc1r75-02-ac=172.16.93.2:6789/0,rcmondc1r75-03-ac=172.16.93.1:6789/0}
            election epoch 51186, quorum 1,2 rcmondc1r75-02-ac,rcmondc1r75-01-ac
     osdmap e103326: 848 osds: 182 up, 256 in; 1153 remapped pgs
      pgmap v3451913: 10560 pgs, 18 pools, 1152 GB data, 449 kobjects
            2547 GB used, 784 TB / 786 TB avail
            168114/1648042 objects degraded (10.201%)
            52842/1648042 objects misplaced (3.206%)
            1056/460665 unfound (0.229%)
                4703 down+peering
                1236 stale+down+peering
                 877 stale+peering
                 634 peering
                 474 active+undersized+degraded
                 422 remapped+peering
                 381 active+clean
                 266 stale+active+clean
                 251 active+remapped
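Some starting points for unpacking a state like that one (all standard Ceph CLI):

  ceph health detail              # expands each HEALTH_WARN line, PG by PG
  ceph osd tree                   # do the down OSDs cluster in particular hosts/racks?
  ceph pg dump_stuck inactive     # which PGs are stuck and which OSDs they map to
  ceph pg <pgid> query            # why one PG is down/incomplete and what it is waiting on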
28. brought to you by
challenges / questions
• Best approach for vNAS performance via KVM+librbd?
• Filesystem, # of rbds, what interface, tuning?
• Disk failure handling process
• Current policy is to redeploy the OSD for any media error (removal/redeploy sketch below)
• http://tracker.ceph.com/projects/ceph/wiki/A_standard_framework_for_Ceph_performance_profiling_with_latency_breakdown
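A hedged sketch of the removal/redeploy steps behind that policy (Hammer-era CLI; the OSD ID, host and device paths are illustrative):

  ceph osd out osd.123                      # drain it; wait for backfill to finish
  stop ceph-osd id=123                      # upstart; 'systemctl stop ceph-osd@123' on EL7
  ceph osd crush remove osd.123
  ceph auth del osd.123
  ceph osd rm osd.123
  # replace the drive / rebuild the single-disk RAID0, then redeploy, e.g.:
  ceph-deploy osd create <host>:<data-device>:<journal-device>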
29. brought to you by
learnings
• use dedicated journals (config sketch below)
• network matters - it becomes much more visible
• RAID controllers with no native JBOD are OK, but be prepared for more complicated ops
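To make the first two learnings concrete, a ceph.conf fragment of the kind we mean; the subnets and journal size are illustrative, not the Monash values:

  [global]
  # client-facing traffic
  public network = 172.16.92.0/24
  # replication/backfill traffic kept separate (and much more visible)
  cluster network = 172.16.93.0/24

  [osd]
  # ~10 GB journal on a dedicated SSD partition
  osd journal size = 10240
  osd mkfs type = xfs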
30. [Embedded material: labels “Open IaaS:”, “Technology:”, “Application layers:”; screenshots of the MASSIVE Business Plan 2013/2014 (MASSIVE-BP-2.3 DRAFT, June 2013, prepared by Wojtek J Goscinski, MASSIVE Coordinator) and the MyTardis site (“Automatically stores your instrument data for sharing”; MyTardis Tech Group Meeting #3, posted 20 August 2015 by Steve Androulakis)]