DAOS Adventures at CERN openlab
Miguel F. Medeiros on behalf of openlab team
miguel.fontes.medeiros@cern.ch
19/11/2020
DAOS User Group 2020 (DUG20)
CERN openlab
• CERN’s IT branch for cutting-edge computing technologies.
• Active collaboration with Intel on several projects: https://openlab.cern/members/intel
“CERN openlab is a unique public-private partnership that works to accelerate the development of cutting-edge ICT solutions for the worldwide LHC community and wider scientific research. Through CERN openlab, CERN collaborates with leading ICT companies and research institutes.”
https://openlab.cern/
Disclaimer
• Work performed under the CERN openlab umbrella.
• CERN openlab comprises many projects and collaborations, for which we provide technical support.
• The specific openlab DAOS use cases will not be covered in this talk.
• This talk focuses only on the sysadmin/technical aspects.
• We will present our experience and process in commissioning, testing and benchmarking DAOS.
• We will try to provide insights, findings and, hopefully, valuable feedback for DAOS developers.
Releasing it to our users: the process
• Intel Workshop at CERN in February 2020 (right on time…)
• Benchmark and test it ourselves (with the valuable support of Intel experts)
• Release DAOS with socket configuration
• Allow all users to get acquainted with the system.
• Test the functionality, development and integration aspects.
• Release DAOS with PSM2 configuration
• Exclusive cluster access.
• Allow the users with performance requirements to test their use cases.
Cluster Hardware
• 4x Cascade Lake
• 4x Skylake (only 2x with DAOS)
Benchmarking considerations
• Validated the Omni-Path cluster with MPI tests & benchmarks
• OSU Micro-Benchmarks, Intel MPI
• The benchmark was based on IOR [1] with the DAOS API
• All benchmarks were performed with DAOS v0.9.4
• Topology
• 3x DAOS Servers
• 2x DAOS Clients
[1]: https://github.com/hpc/ior
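As a rough illustration of how an IOR campaign like this could be scripted, the sketch below builds one command line per transfer size. It only uses generic IOR options (-a, -w, -r, -i, -t, -b) and assumes an IOR build that includes the DAOS backend. The transfer sizes, the block size and the build_ior_commands helper are assumptions made for illustration, not the settings used in this talk. The DAOS pool and container options are deliberately left out because they depend on the IOR and DAOS versions.

```python
from typing import List


def build_ior_commands(transfer_sizes: List[str],
                       block_size: str = "1g",
                       repetitions: int = 20,
                       api: str = "DAOS") -> List[List[str]]:
    """Return one IOR argument list per transfer size (hypothetical helper)."""
    commands = []
    for tsize in transfer_sizes:
        commands.append([
            "ior",
            "-a", api,                # I/O backend to use
            "-w", "-r",               # write phase, then read phase
            "-i", str(repetitions),   # repetitions per configuration
            "-t", tsize,              # transfer size
            "-b", block_size,         # block size written by each task
            # DAOS pool/container options omitted on purpose (version dependent)
        ])
    return commands


if __name__ == "__main__":
    # Example sizes are placeholders, not the sweep used in the talk.
    for cmd in build_ior_commands(["4k", "64k", "1m"]):
        print(" ".join(cmd))
```

In practice each command would be launched through the MPI launcher across the client nodes; that part is site specific and not shown.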
Benchmarking with IOR: DAOS Sockets
• Functional tests with Sockets configuration
Each point represents the average of 20 iterations. Error bars are standard deviation.
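The per-point statistics described in the caption can be sketched as follows. The slides do not say whether the sample or the population standard deviation was used; this sketch assumes the sample standard deviation. The summarize helper and the input values are illustrative placeholders, not measurements from the talk.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(samples: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each configuration label to (mean, sample standard deviation)."""
    return {
        label: (statistics.mean(values), statistics.stdev(values))
        for label, values in samples.items()
    }


if __name__ == "__main__":
    # Placeholder values in arbitrary units, NOT results from the benchmark.
    runs = {"example configuration": [100.0, 102.0, 98.0, 101.0, 99.0]}
    for label, (mean, std) in summarize(runs).items():
        print(f"{label}: {mean:.1f} +/- {std:.1f}")
```

Each list would hold the 20 per-iteration bandwidth values reported by IOR for one configuration; the mean becomes the plotted point and the standard deviation its error bar.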
Benchmarking with IOR: DAOS PSM2
• We also tested with PSM2 → performance gains
Each point represents the average of 20 iterations. Error bars are standard deviation.
Benchmarking with IOR: DAOS PSM2 scaling
• Two-node test
Each point represents the average of 20 iterations. Error bars are standard deviation.
Performance limitations
• Finding the missing performance…
• The Cascade Lake machines reach only half the performance.
• We suspect the riser card → the HFI card is PCIe x16, but our riser card only provides x8 electrical, x8 mechanical.
Each point represents the average of 20 iterations. Error bars are standard deviation.
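A back-of-the-envelope check of the riser-card suspicion can be sketched as below. It assumes a PCIe 3.0 link (8 GT/s per lane, 128b/130b encoding) and a 100 Gb/s fabric link; neither figure is stated in the slides, so both are assumptions. The function name is hypothetical.

```python
def pcie_gen3_bandwidth_gbytes(lanes: int) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s for a lane count."""
    gigatransfers_per_s = 8.0         # PCIe 3.0 raw rate per lane
    encoding_efficiency = 128 / 130   # 128b/130b line encoding
    return lanes * gigatransfers_per_s * encoding_efficiency / 8  # bits -> bytes


if __name__ == "__main__":
    link_gbytes = 100 / 8  # ASSUMPTION: 100 Gb/s fabric link = 12.5 GB/s
    for lanes in (8, 16):
        bw = pcie_gen3_bandwidth_gbytes(lanes)
        print(f"x{lanes}: {bw:.2f} GB/s, covers link: {bw >= link_gbytes}")
```

Under these assumptions an x8 slot tops out near 7.9 GB/s, below the 12.5 GB/s of a 100 Gb/s link, while x16 offers about 15.8 GB/s. That would be consistent with roughly halved throughput, but it does not by itself confirm the riser card as the cause.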
A SysAdmin experience – feedback for developers
DISCLAIMER:
• Please note that all comments provided here are based on our experience with DAOS v0.9.4. If something has been improved in the meantime, please ignore its mention.
A SysAdmin experience – feedback for developers 1
• Installation in our environment was challenging
• Dependency resolution with conflicts → we rely on specific software versions for our internal software.
• Warn users about “sanity/pre-flight” checks before compiling.
• Error reporting
• Difficult to troubleshoot some issues.
• We rely on error messages for troubleshooting and not all errors were mapped on https://daos-stack.github.io/admin/troubleshooting/#daos-errors
A SysAdmin experience – feedback for developers 2
• What about a “--detail” option for admins?
• e.g. NVMe size, SCM size, creation date, creator?…
A SysAdmin experience – feedback for developers 3
• DAOS Servers with constant 30% CPU utilization.
A SysAdmin experience – feedback for developers 4
• System metrics
• Nice to have these metrics in handy formats (e.g. JSON).
• Unpack the “message” section.
→ Facilitate integration with monitoring solutions without extra parsers.
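To illustrate the kind of unpacking requested above, here is a minimal sketch. The actual DAOS v0.9.4 output format is not reproduced here: the record layout, the key=value convention inside "message" and every field name are hypothetical stand-ins. The point is only that once the nested text is promoted to top-level fields, a monitoring tool can consume the JSON directly.

```python
import json
from typing import Any, Dict


def unpack_message(record: Dict[str, Any]) -> Dict[str, Any]:
    """Promote 'key=value' pairs found in record['message'] to top-level fields."""
    flat = {k: v for k, v in record.items() if k != "message"}
    for token in str(record.get("message", "")).split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            flat[key] = value
    return flat


if __name__ == "__main__":
    # HYPOTHETICAL record: field names and values are invented for illustration.
    raw = {"host": "server-01", "message": "metric_a=1 metric_b=2"}
    print(json.dumps(unpack_message(raw)))
```

This is exactly the kind of extra parser the slide asks to make unnecessary: if the metrics were emitted as flat JSON in the first place, the helper could be dropped.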
Final thoughts
• System commissioning was challenging and interesting!
• Required some debugging in our environment.
• There is still some room for a performance increase
• Issue with riser cards, one HFI card per socket, etc.
• Scalability testing
• We needed more nodes to fully evaluate the system.
• The DAOS server configuration is a plus
• Quite user friendly!
• Works well with configuration management tools (Puppet).
• The main difficulty was understanding which settings best suited our cluster.
• On a personal note, it was a good experience with several learning opportunities.
Thank you, Intel, for all the support during this benchmarking process!
home.cern


DUG'20: 06 - DAOS Adventures at CERN Openlab

  • 1. DAOS Adventures at CERN openlab Miguel F. Medeiros on behalf of openlab team miguel.fontes.medeiros@cern.ch 19/11/2020 DAOS User Group 2020 (DUG20)
  • 2. CERN openlab (https://openlab.cern/)
      • CERN’s IT branch for cutting-edge computing technologies.
      • Active collaboration with Intel on several projects: https://openlab.cern/members/intel
      • “CERN openlab is a unique public-private partnership that works to accelerate the development of cutting-edge ICT solutions for the worldwide LHC community and wider scientific research. Through CERN openlab, CERN collaborates with leading ICT companies and research institutes.”
  • 3. Disclaimer
      • Work performed under CERN’s openlab umbrella.
      • CERN openlab comprises many projects and collaborations for which we provide technical support.
      • The specific openlab DAOS use cases will not be covered in this talk.
      • This talk will focus only on the sysadmin/technical aspects.
      • We will present our experience and process for commissioning, testing and benchmarking DAOS.
      • We will try to provide insights, findings and hopefully valuable feedback for DAOS developers.
  • 4. Releasing it to our users: the process
      • Intel Workshop at CERN in February 2020 (right on time…)
      • Benchmark and test it ourselves (with the valuable support of Intel experts)
      • Release DAOS with socket configuration
          • Allow all users to get acquainted with the system.
          • Test the functionality, development and integration aspects.
      • Release DAOS with PSM2 configuration
          • Exclusive cluster access.
          • Allow users with performance requirements to test their use cases.
  • 5. Cluster Hardware
      • 4x Cascade Lake
      • 4x Skylake (only 2x with DAOS)
  • 6. Benchmarking considerations
      • Validated the Omni-Path cluster with MPI tests & benchmarks
          • OSU Micro-Benchmarks, Intel MPI
      • Benchmark was based on IOR [1] with DAOS API
      • All benchmarks were performed with DAOS v0.9.4
      • Topology
          • 3x DAOS Servers
          • 2x DAOS Clients
      [1]: https://github.com/hpc/ior
  • 7. Benchmarking with IOR: DAOS Sockets
      • Functional tests with Sockets configuration
      • Plot caption: each point represents the average of 20 iterations. Error bars are standard deviation.
  • 8. Benchmarking with IOR: DAOS PSM2
      • We also tested with PSM2 → performance gains
      • Plot caption: each point represents the average of 20 iterations. Error bars are standard deviation.
  • 9. Benchmarking with IOR: DAOS PSM2 scaling
      • Two node test
      • Plot caption: each point represents the average of 20 iterations. Error bars are standard deviation.
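The plots on slides 7 to 9 reduce 20 iterations to one point with an error bar. A minimal sketch of that reduction follows. It assumes the sample standard deviation (the slides do not say whether sample or population was used), and the bandwidth values are hypothetical placeholders, not results from the talk.

```python
import statistics


def summarize(bandwidths_mib_s):
    """Return (mean, sample standard deviation) for one plotted point."""
    if len(bandwidths_mib_s) < 2:
        raise ValueError("need at least two iterations for a standard deviation")
    mean = statistics.mean(bandwidths_mib_s)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stdev = statistics.stdev(bandwidths_mib_s)
    return mean, stdev


# Hypothetical bandwidths in MiB/s for 20 iterations of one configuration.
samples = [1000.0 + 10.0 * (i % 5) for i in range(20)]
mean, stdev = summarize(samples)
print(f"{mean:.1f} +/- {stdev:.1f} MiB/s")
```

In practice the list would be filled with the per-iteration bandwidths reported by the benchmark for a given transfer size or process count.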
  • 10. Performance limitations
      • Finding the missing performance…
      • Cascade Lake nodes show half the performance.
      • We suspect the riser card: the HFI card is PCIe x16, but our riser card only provides x8 electrical, x8 mechanical.
      • Plot caption: each point represents the average of 20 iterations. Error bars are standard deviation.
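The riser-card suspicion on slide 10 can be checked with back-of-the-envelope arithmetic. The sketch below assumes PCIe Gen3 signalling and a 100 Gb/s Omni-Path link; neither figure is stated on the slide, so treat both as assumptions.

```python
# Assumption: PCIe Gen3 (8 GT/s per lane, 128b/130b line encoding).
GEN3_GT_PER_S = 8.0
ENCODING_EFFICIENCY = 128 / 130

# Assumption: the HFI drives a 100 Gb/s Omni-Path link (12.5 GB/s).
OPA_LINE_RATE_GB_PER_S = 100 / 8


def pcie_gen3_gb_per_s(lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GEN3_GT_PER_S * ENCODING_EFFICIENCY * lanes / 8  # bits to bytes


for lanes in (8, 16):
    bw = pcie_gen3_gb_per_s(lanes)
    limit = "PCIe slot" if bw < OPA_LINE_RATE_GB_PER_S else "fabric link"
    print(f"x{lanes}: {bw:.2f} GB/s, limited by the {limit}")
```

Under these assumptions an x8 slot tops out below the fabric line rate while an x16 slot does not, which is consistent with a node in an x8 riser falling well short of its peers.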
  • 11. A SysAdmin experience – feedback for developers
      • DISCLAIMER: Please note that all comments provided here are based on a DAOS v0.9.4 experience. If something was improved in the meantime please ignore its mention.
  • 12. A SysAdmin experience – feedback for developers 1
      • Installation in our environment was challenging
          • Dependency resolution with conflicts → we rely on specific software versions for our internal software.
          • Warn users about “sanity/pre-flight” checks before compiling.
      • Error reporting
          • Difficult to troubleshoot some issues.
          • We rely on error messages for troubleshooting and not all errors were listed at https://daos-stack.github.io/admin/troubleshooting/#daos-errors
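As an illustration of the “sanity/pre-flight” idea on slide 12, here is a minimal sketch that reports build tools missing from PATH before compilation starts. The tool list is an assumption made for the example, not the official DAOS requirement list, and the check covers presence only, not the version conflicts mentioned on the slide.

```python
import shutil


def missing_tools(required):
    """Return the required executables that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


# Assumed tool list for illustration; consult the DAOS build docs for the real one.
REQUIRED = ["gcc", "go", "scons"]

missing = missing_tools(REQUIRED)
if missing:
    print("pre-flight check failed, missing:", ", ".join(missing))
else:
    print("pre-flight check passed")
```

A fuller check would also compare installed versions against the ones the build expects, which is where the conflicts described above would surface.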
  • 13. A SysAdmin experience – feedback for developers 2
      • What about a “--detail” option for admins?
      • Suggested fields: NVMe size, SCM size, creation date, creator?…
  • 14. A SysAdmin experience – feedback for developers 3
      • DAOS Servers with constant 30% CPU utilization.
  • 15. A SysAdmin experience – feedback for developers 4
      • System metrics
          • Nice to have these metrics in handy formats (e.g. JSON).
          • Unpack the “message” section. → This would ease integration with monitoring solutions without extra parsers.
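To make the “unpack the message section” request on slide 15 concrete, the sketch below promotes `key=value` pairs found inside a free-text `message` field to top-level JSON fields. The record layout and field names are hypothetical and do not reproduce an actual DAOS log or metrics line.

```python
import json


def unpack_message(record):
    """Return a copy of record with 'key=value' pairs from 'message' promoted to fields."""
    out = {k: v for k, v in record.items() if k != "message"}
    leftovers = []
    for token in record.get("message", "").split():
        key, sep, value = token.partition("=")
        if sep and key and value:
            out[key] = value
        else:
            leftovers.append(token)
    if leftovers:
        # Keep whatever free text could not be parsed.
        out["message"] = " ".join(leftovers)
    return out


# Hypothetical record; not an actual DAOS log line.
raw = {"level": "INFO", "message": "pool usage scm_used=1024 nvme_used=4096"}
print(json.dumps(unpack_message(raw), sort_keys=True))
```

With the fields emitted this way at the source, a monitoring stack could ingest them directly, which is the point the slide makes about avoiding extra parsers.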
  • 16. Final thoughts
      • System commissioning was challenging and interesting!
          • Required some debugging in our environment.
      • There is still some room for a performance increase
          • Issue with riser cards, one HFI card per socket, etc.
      • Scalability testing
          • We needed more nodes to fully evaluate the system.
      • The DAOS server configuration is a plus
          • Quite user friendly!
          • Works well with configuration management tools (Puppet).
          • The main difficulty was understanding which settings best suited our cluster.
      • On a personal note, it was a good experience with several learning opportunities.
      Thank you Intel for all the support during this benchmarking process!