Scale and performance: Servicing the
Fabric and the Workshop
Steve Quenette, Deputy Director, Monash eResearch Centre, and
Blair Bethwaite, Technical Lead, Monash eResearch Centre
Ceph Day Melbourne 2015, Monash University
computing for research: 

… extremes over spectrums …
1. peak vs long-tail
(spectrum of user expectations)
2. permeability: solo vs multidisciplinary
(spectrum of organisational expectations)
3. paradigms: “there is no spoon”
(spectrum of computing expectations)
1. peak vs long-tail
• Leading researchers build tools to see what could
not be seen before, and provide that tool for others.
• All researchers apply tools (of others) on new
problems.
“peak” … and the tail
2. permeability
• Implies, over time, that research verticals…
  • become increasingly complicated, involved and leveraged
  • involve many organisations and people
3. discovery paradigms
technology driven discovery?…
[Chart: “Normalised growth - innovations”, 1875-2014, log scale (1 to 1,000,000), annotated 5.7% (CAGR). Series: number of components on a microchip; IoT - number of devices on the internet; light efficiency (outdoor lights); light efficiency (indoor lights); intercontinental travel; capability of image sensors; fuel conversion efficiency (US passenger car); energy cost of steel (coke, natural gas, electricity); US corn crop yield.]
Sources: “Moore’s Curse”, IEEE Spectrum, April 2015; http://www.i-scoop.eu/internet-of-things ; https://www.ncta.com/broadband-by-the-numbers ; http://www.pwc.com/gx/en/technology/mobile-innovation/assets/pwc-mobile-technologies-index-image-sensor-steady-growth-for-new-capabilities.pdf
4 paradigms
Empirical (“1st paradigm”): collecting and enumerating things. Enabled by telescopes, microscopes, …
Theoretical (“2nd paradigm”): properties determined by models. Enabled by innovations in statistics, calculus, physical laws, …
Computational (“3rd paradigm”): models significantly more complex and larger than a human can compute. Enabled by computing growth.
Data-driven (“4th paradigm”): significantly more, and more complex, data. Enabled by growth in sensors, storage and the IoT.
… the 4th is really …
Data-mining: there is so much data that f can be discovered with little or no preconditioning of what “f” is. Enabled by innovations in data-mining models/approaches (“g”).
Data assimilation: both models and observations are big and complex. Enabled by innovations in inverse and optimisation models/approaches.
Visualisation: where very much more of x and y can be displayed to humans, and the human brain does the “data-mining”.
Yes, visualisation is relevant!
21st century microscopes look more like…
[Diagram: CAPTURE (light source, samples) → ANALYSIS (filters) → INSIGHT (lens) → SHARE (data), spanning the Australian Synchrotron, Monash Biomedical Imaging, Ramacciotti Cryo-EM, the CAVE2 immersive visualisation facility, digital scientific desktops and the Monash Research Cloud.]
computing for research: 

… extremes over spectrums…
1. peak vs long-tail
(spectrum of user expectations)
2. permeability: solo vs multidisciplinary
(spectrum of organisational expectations)
3. paradigms: “there is no spoon”
(spectrum of computing expectations)
[Slide annotations: self service; multiple market-driven front-ends; quality; accessible & multi-tenant; scale; low latency; bandwidth; front-ends “emerge”]
fabric and workshop
• Ceph (together with OpenStack and Neutron) means our storage is software defined
• It’s more like a fabric
• Self-service access to the pieces
• We choose the pieces to be right for researchers who orchestrate their own 21st century microscope
• MeRC, including compute, people, etc., is more like a workshop for microscope builders
storage IaaS products
• A customer’s storage capacity can be a mix of…
• Vault
  • Lower $/TB, fast writes, slow retrieval
• Market (Object)
  • Moderate $/TB
  • Amazon S3-like, for modern “Access Layers”
  • Remote backup optional
• Market (File)
  • Higher $/TB
  • For traditional filesystem “Access Layers”
  • Remote backup implied
• Computational
  • Moderate $/TB
  • Directly attached volumes to the R@CMon Cloud (see the sketch after this list)
• A user can combine storage capacity from other tenants (e.g. an RDSI ReDS merit allocation) per “project”.
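For the Computational product this amounts to RBD-backed Cinder volumes attached to R@CMon instances. A minimal sketch using the OpenStack clients of the day; the volume type, instance name and size below are hypothetical, and whether a volume lands on the Computational pool depends on how backends/volume types are configured:

  # create a 500GB RBD-backed volume and attach it to a running instance
  # ("computational" volume type and "my-analysis-vm" are made-up names)
  cinder create --display-name my-scratch --volume-type computational 500
  nova volume-attach my-analysis-vm <volume-uuid>   # appears in the guest as /dev/vdX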
storage Access Layers
• MyTardis
  • For instrument integration
  • From sensor to analysis to open access
  • Researcher, facility & institutional data management
• Figshare
  • Data management for institutions and the long-tail
  • (Can be trialled through R@CMon Storage)
• Aspera
  • RDS/VicNode-operated FTP & web access tool for very high-speed data transfer
• OwnCloud (not yet in production)
  • Dropbox-like
  • Linked to user end-points across Access Layers
(a sketch of the S3 endpoint these layers can build on follows)
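These layers sit on top of the storage products above; in particular, Market (Object) exposes an S3-compatible endpoint via radosgw. A purely illustrative sketch of talking to such an endpoint directly with s3cmd; the endpoint, credentials and bucket name are made up, not our production values:

  # ~/.s3cfg (hypothetical values)
  #   access_key  = MYACCESSKEY
  #   secret_key  = MYSECRETKEY
  #   host_base   = rgw.example.org
  #   host_bucket = %(bucket)s.rgw.example.org
  s3cmd mb s3://instrument-data
  s3cmd put sample-0001.tar s3://instrument-data/
  s3cmd ls s3://instrument-data/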
some numbers
By allocations (Q3 2015)…
• Vault: 2.5uPB
• Market (Object): 0.6uPB
• Market (File): 2uPB
• Computational: 0.5uPB
• Intent: by the end of 2016 all* Monash University “storage” for research will be on this infrastructure
(*) Except the ISO 27k-accredited hosting facility, and admin storage space used by researchers
at the end of the day, we are still consolidating - it’s just that we’ve asked where consolidation should occur
Now over to the techies…
Speaking: Blair Bethwaite, Senior HPC Consultant,
Monash eResearch Centre
Monash Ceph Crew:
Jerico Revote, Rafael Lopez, Swe Aung, Craig
Beckman, George Foscolos, George Kralevski,
Steve Davison, John Mann, Colin Blythe
Please ask questions as we go
Ceph@Monash, some history
It all started with The Cloud
https://xkcd.com/908/ (NeCTAR logo added)
speaking of accidents
• In early 2013 R@CMon started with Monash’s first zone of the NeCTAR cloud
• Our own local cloud = awesome! But, “where do we store all the things?”
• No persistent volume service was provided by NeCTAR; that was expected to come from other funding sources
• Plenty of object storage though…
• Enter Cuttlefish!
• “monash-01” Cinder zone backed by Ceph, available mid-2013 (a minimal config sketch follows)
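For reference, a “Cinder zone backed by Ceph” boils down to pointing Cinder’s standard RBD driver at the cluster. A minimal sketch of the relevant cinder.conf options from that era; the pool, user and secret UUID are placeholders, not our actual values:

  # cinder.conf (excerpt) - standard RBD driver options, placeholder values
  volume_driver   = cinder.volume.drivers.rbd.RBDDriver
  rbd_pool        = volumes
  rbd_ceph_conf   = /etc/ceph/ceph.conf
  rbd_user        = cinder
  rbd_secret_uuid = 00000000-0000-0000-0000-000000000000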
show and tell: monash-01
• (Disclaimer: we’re not good at names)
• The hardware - repurposed Swift servers:
• 8x Dell R720xd (co-located OSDs, 5x mons) - 24TB/node
• 12x 2TB 7.2k NL-SAS (12x RAID0, PERC H710p)
• 2x E5-2650 (2GHz), 32GB RAM
• 20GbE (Intel X520 DP), VLANs for back/front-end
• Ceph Firefly on Ubuntu Precise, 2 replicas, ~90uTB, 60TB used, 135TB committed (thin provisioning)
show and tell: monash-02
• 17x Dell R720xd (virtualised mons x3)
• 9x 4TB 7.2k NL-SAS (9x RAID0, PERC H710p) - 36TB/node
• 3x 200GB Intel DC S3700 SSDs (journals and future cache)
• 1x E5-2630Lv2 (2.4GHz), 32GB RAM
• 2x 10GbE (Mellanox CX-3 DP), back/front-end active on
alternate ports (different ToR switches)
• Ceph Firefly on Ubuntu Trusty, 2 replicas, ~300uTB, 110TB used, 130TB committed
What did we change?
show and tell: rds[i]
• 3x Dell R320 (mons)
• 4x Dell R720xd (cache tier) - 18TB/node
• 20x 900GB 10k SAS (20x RAID0, PERC H710p) - rgw hot tier
• 4x 400GB Intel DC S3700 SSDs (journals for rgw hot tier)
• 2x E5-2630v2 (2.6GHz), 128GB RAM
• 56GbE (Mellanox CX-3 DP), VLANs for back/front-end
show and tell: rds[i]
• 33x Dell R720xd + 66x MD1200 (2 per node) - 144TB/node
• 8x 6TB 7.2k NL-SAS (8x RAID0, PERC H710p) - rgw EC cold tier
• 24x 4TB 7.2k NL-SAS (24x RAID0, PERC H810) - rbds go here
• 4x 200GB Intel DC S3700 SSDs (journals for rbd pool)
• 2x E5-2630v2 (2.6GHz), 128GB RAM
• 20GbE (Mellanox CX-3 DP), VLANs for back/front-end
• Ceph Hammer on RHEL Maipo
rds logical - physical layout
rgw HA architecture
• DNS round-robin provides initial HA request fanout
• HAProxy instances handle load-balancing and SSL/TLS termination
  • Scale arbitrarily in pairs, with keepalived providing redundancy and HA via virtual/floating IP address (VIP) failover (a config sketch follows this list)
• RGW instances handle the actual client/application protocol (S3, Swift, etc.) traffic
  • Scale arbitrarily
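As an illustration, a minimal sketch of one half of such a pair. The hostnames, addresses, certificate path and RGW port (civetweb’s default 7480) are assumptions, not our production config:

  # haproxy.cfg (excerpt) - TLS termination, round-robin across RGW instances
  frontend rgw_https
      bind *:443 ssl crt /etc/haproxy/rgw.pem
      default_backend rgw_pool
  backend rgw_pool
      balance roundrobin
      server rgw1 10.0.0.11:7480 check
      server rgw2 10.0.0.12:7480 check

  # keepalived.conf (excerpt) - VIP failover between the two HAProxy nodes
  vrrp_instance RGW_VIP {
      state MASTER              # BACKUP on the partner node
      interface eth0
      virtual_router_id 51
      priority 101              # lower on the partner node
      virtual_ipaddress {
          192.0.2.10            # the VIP that DNS round-robin points at
      }
  }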
new hardware/capacity
• monash-02 - another 10 nodes same config
• rds - another 9 nodes same config
• Refresh monash-01 cluster:
• 9x Dell R730xd - 96TB/node
• 16x 6TB 7.2k NL-SAS
• 2x 400GB Intel DC P3700 NVMe (journals)
• 1x E5-2630v3 (2.5GHz), 128GB RAM
• 20GbE (Intel X710 DP), VLANs for back/front-end
16x 3.5” data drives in 2RU!
pain / nits
• Most problems have been indirect, i.e. operating system and hardware; Ceph itself has been solid
• But it can be very opaque when things go wrong
• E.g., what is wrong in this picture, how bad is it, is there any commonality or correlation of symptoms, does the cluster need intervention to recover?
    cluster b8bf920a-de81-4ea5-b63e-2d5f8cced22d
     health HEALTH_WARN
            23 pgs backfill
            68 pgs backfilling
            1230 pgs degraded
            6017 pgs down
            46 pgs incomplete
            8099 pgs peering
            94 pgs recovering
            41 pgs recovery_wait
            2824 pgs stale
            1204 pgs stuck degraded
            8908 pgs stuck inactive
            2824 pgs stuck stale
            9913 pgs stuck unclean
            1073 pgs stuck undersized
            1092 pgs undersized
            1308 requests are blocked > 32 sec
            recovery 168114/1648042 objects degraded (10.201%)
            recovery 52842/1648042 objects misplaced (3.206%)
            recovery 1056/460665 unfound (0.229%)
            74/256 in osds are down
            1 mons down, quorum 1,2 rcmondc1r75-02-ac,rcmondc1r75-01-ac
     monmap e2: 3 mons at {rcmondc1r75-01-ac=172.16.93.3:6789/0,rcmondc1r75-02-ac=172.16.93.2:6789/0,rcmondc1r75-03-ac=172.16.93.1:6789/0}
            election epoch 51186, quorum 1,2 rcmondc1r75-02-ac,rcmondc1r75-01-ac
     osdmap e103326: 848 osds: 182 up, 256 in; 1153 remapped pgs
      pgmap v3451913: 10560 pgs, 18 pools, 1152 GB data, 449 kobjects
            2547 GB used, 784 TB / 786 TB avail
            168114/1648042 objects degraded (10.201%)
            52842/1648042 objects misplaced (3.206%)
            1056/460665 unfound (0.229%)
                4703 down+peering
                1236 stale+down+peering
                 877 stale+peering
                 634 peering
                 474 active+undersized+degraded
                 422 remapped+peering
                 381 active+clean
                 266 stale+active+clean
                 251 active+remapped
challenges / questions
• Best approach for vNAS performance via KVM+librbd? (see the sketch after this list)
  • Filesystem, number of rbds, which interface, tuning?
• Disk failure handling process
  • Current policy is to redeploy the OSD for any media error
• http://tracker.ceph.com/projects/ceph/wiki/A_standard_framework_for_Ceph_performance_profiling_with_latency_breakdown
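For context, a rough sketch of what one such vNAS stack looks like end to end: an RBD image attached to the KVM guest through Cinder/librbd, then formatted and exported from inside the guest. All names and sizes are hypothetical, and the filesystem/export choices are exactly the open questions:

  # cloud side: RBD-backed volume attached to the vNAS guest
  cinder create --display-name vnas-data-01 2000
  nova volume-attach vnas-01 <volume-uuid>

  # inside the guest: one filesystem per rbd, exported over NFS
  mkfs.xfs /dev/vdb
  mount /dev/vdb /export/data01
  echo '/export/data01 10.0.0.0/16(rw,no_root_squash)' >> /etc/exports
  exportfs -ra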
learnings
• use dedicated journals (see the sketch after this list)
• the network matters - it becomes much more visible
• RAID controllers with no native JBOD are OK, but be prepared for more complicated ops
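On the first point, a sketch of what dedicated journals look like with the tooling of the day: ceph-deploy takes an optional journal device per OSD, so each spinning data disk gets its own SSD/NVMe journal partition (host and device names below are placeholders):

  # one OSD per data disk, journal on a dedicated SSD/NVMe partition
  ceph-deploy osd prepare osd-node1:sdb:/dev/nvme0n1p1
  ceph-deploy osd prepare osd-node1:sdc:/dev/nvme0n1p2
  ceph-deploy osd prepare osd-node1:sdd:/dev/nvme0n1p3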
[Closing slides: layered architecture diagram, with screenshots of the MASSIVE Business Plan 2013/2014 (MASSIVE-BP-2.3 DRAFT, June 2013, prepared by Wojtek J Goscinski) and MyTardis (“Automatically stores your instrument data for sharing”) as example application layers.]
Application layers: MASSIVE, MyTardis, …
Access Layers: MyTardis, Figshare, CIFS, OwnCloud
Storage products: Vault, Market (object), Market (file), Computational
Tenancies: RDS/VicNode, NeCTAR, Monash, Other
IaaS: Ceph, Lustre, Cloud Storage, HPC
Technology: Open IaaS
