SlideShare a Scribd company logo
Very Early Experiences with a 0.5 PByte DAOS Testbed
Steffen Christgau, Tobias Watermann, Thomas Steinke
Supercomputing Department
Zuse Institute Berlin
DAOS User Group Meeting 2020
Computing + Data Storing @ Zuse Institute Berlin
• ZIB operates HLRN-IV "Lise" for German science community
• Motivation for DAOS:
Our current Lustre installation w/o Burst Buffer ⇒ poor IOPS performance
Complement Lustre? — Burst Buffer, BeeGFS/BeeOND, DAOS
Workloads that can benefit from DAOS: turbulence simulation, astrophysics,
small/many files I/O, · · · AI/DL
Evaluation of new storage concepts vs. "traditional" concepts
• DAOS as research and later production platform
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 1 / 10
HLRN-IV "Lise" @ ZIB
8 PFlop/s peak
1,270 nodes, Intel CLX AP
121,920 cores
3,000+ users — 200+ projects
8 PB Lustre
A B
~32 Å
~16 Å
• Chemistry incl. Material Science
• Earth Science incl. Climate Research
• Engineering
• Life Science (biology, medicine)
• Physics incl. Astro and High-Energy Physics
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 2 / 10
DAOS @ ZIB: Exploration Testbed
• DAOS user since July 2019
• version 0.6... and git commits before that
• manual compilation process
• Exploration Testbed: used for DCPM and DAOS exploration
isolated from HLRN
2 Inspur dual-socket nodes (CLX-SP Platinum 8260L)
3 + 6 TB Optane DCPMM and 384 GB + 768 GB DRAM
8 + 16 TB Optane SSD
single 100 Gb/s OmniPath back-to-back
CentOS 7
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 3 / 10
DAOS @ HLRN: Integration Testbed
larger testbed to be integrated in HLRN infrastructure
• 20 dual-socket server nodes (CLX Gold 6240R)
• 192 GB DRAM
• 1.5 TB DCPM
• 25.6 TB NVMe NAND SSD
• 2 x 100 Gbit/s OmniPath
total capacity 512 TB SSD + 30 TB DCPM
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 4 / 10
Installation Experiences with DAOS 1.0 (I)
compared to earlier versions
• Prebuilt DAOS packages are a good thing! We use these, no in-house build.
• online package repositories would be even better (no login please, see oneAPI)
• better OS integeration, support for installation and deployment
• documentation improved a lot
• documentation sometimes more promising than reality
• configuration defaults and comments from examples do not apply
• # scm_class default: dcpm → scm config validation failed: scm_class not set
• immutable after reformat hints are good
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 5 / 10
Installation Experiences with DAOS 1.0 (II)
compared to earlier versions
• error messages not always helpful: failed to connect to pool: -1026
• logs are more useful (sometimes)
• content of stderr vs. log files vs. system logging (journal)
• MPI(CH) integration appreciated!
• middleware in general: unclear version management → packages?!
• user interaction with fusefs/POSIX container: orphaned/forgotten mounts
• dfuse daemon might be a good thing
• PSM2 issue: multi-tennant usage with OmniPath not supported
• intent to use sockets or verbs instead
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 6 / 10
Simple DAOS Key Value API Benchmark
• use DAOS key value (kv) library, DAOS 1.0.1 for very simple test
• perform operation 1000× in blocking fashion, median reported
• key = 4 Bytes, value = 32 Bytes
0 20 40 60 80 100 120 140
fi_pingpong RTT psm2
fi_pingpong RTT sockets
put
get (size-only)
get (value)
get (miss)
remove
2
26
135
118
118
54
92
time / us
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 7 / 10
Current Status
Phase 1: Installation / integration
• hardware shipped mid September, ready to boot OS end of September
• DAOS software integration in HLRN cluster near completion
• early tests on exploration testbed with critical HLRN workloads
MPI IO middleware works seemless with MPI-ready application, see DUG’19
netCDF/HDF5 testing planned (waiting for compatible versions)
before
after
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 8 / 10
Next Planned Step
Phase 2: Research with integration testbed
• Usability & user interface: application integration for a few test cases
• Administration: experiences with capabilities of the management of pools,
containers, . . ., monitoring & performance
Phase 3: Access for selected user projects
• intent to use per-project pool
• provide DAOS as optional and additional offer for heavy IO worklads besides
existing Lustre (work), NFS (home), and SSD drive (scratch)
• support power users in migration → easy when MPI/HDF/netCDF is used for IO
• disseminiation planned for 2021 ff.
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 9 / 10
Summary
• DAOS improved significantly since our first contact...and is still improving
• integration into production system in progress
• mature for early access of (power) users
Thanks for your attention!
Questions?
Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 10 / 10

More Related Content

What's hot

Walk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCWalk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoC
Ceph Community
 
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Community
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William Byrne
Ceph Community
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of Alabama
Kamesh Pemmaraju
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Ceph Community
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
Colleen Corrice
 
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Community
 
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin ZhangLinux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Ceph Community
 
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Ceph Community
 
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix BarbeiraBackup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Ceph Community
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Community
 
Ceph Day Melbourne - Walk Through a Software Defined Everything PoC
Ceph Day Melbourne - Walk Through a Software Defined Everything PoCCeph Day Melbourne - Walk Through a Software Defined Everything PoC
Ceph Day Melbourne - Walk Through a Software Defined Everything PoC
Ceph Community
 
SUSE - performance analysis-with_ceph
SUSE - performance analysis-with_cephSUSE - performance analysis-with_ceph
SUSE - performance analysis-with_ceph
inwin stack
 
Ceph and Openstack in a Nutshell
Ceph and Openstack in a NutshellCeph and Openstack in a Nutshell
Ceph and Openstack in a Nutshell
Karan Singh
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
Ceph Community
 
Ceph optimized Storage / Global HW solutions for SDS, David Alvarez
Ceph optimized Storage / Global HW solutions for SDS, David AlvarezCeph optimized Storage / Global HW solutions for SDS, David Alvarez
Ceph optimized Storage / Global HW solutions for SDS, David Alvarez
Ceph Community
 
Ceph's journey at SUSE
Ceph's journey at SUSECeph's journey at SUSE
Ceph's journey at SUSE
Ceph Community
 
Which Hypervisor is Best?
Which Hypervisor is Best?Which Hypervisor is Best?
Which Hypervisor is Best?
Kyle Bader
 
Ambedded - how to build a true no single point of failure ceph cluster
Ambedded - how to build a true no single point of failure ceph cluster Ambedded - how to build a true no single point of failure ceph cluster
Ambedded - how to build a true no single point of failure ceph cluster
inwin stack
 
New use cases for Ceph, beyond OpenStack, Luis Rico
New use cases for Ceph, beyond OpenStack, Luis RicoNew use cases for Ceph, beyond OpenStack, Luis Rico
New use cases for Ceph, beyond OpenStack, Luis Rico
Ceph Community
 

What's hot (20)

Walk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCWalk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoC
 
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William Byrne
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of Alabama
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
 
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin ZhangLinux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
 
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
 
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix BarbeiraBackup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
 
Ceph Day Melbourne - Walk Through a Software Defined Everything PoC
Ceph Day Melbourne - Walk Through a Software Defined Everything PoCCeph Day Melbourne - Walk Through a Software Defined Everything PoC
Ceph Day Melbourne - Walk Through a Software Defined Everything PoC
 
SUSE - performance analysis-with_ceph
SUSE - performance analysis-with_cephSUSE - performance analysis-with_ceph
SUSE - performance analysis-with_ceph
 
Ceph and Openstack in a Nutshell
Ceph and Openstack in a NutshellCeph and Openstack in a Nutshell
Ceph and Openstack in a Nutshell
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
 
Ceph optimized Storage / Global HW solutions for SDS, David Alvarez
Ceph optimized Storage / Global HW solutions for SDS, David AlvarezCeph optimized Storage / Global HW solutions for SDS, David Alvarez
Ceph optimized Storage / Global HW solutions for SDS, David Alvarez
 
Ceph's journey at SUSE
Ceph's journey at SUSECeph's journey at SUSE
Ceph's journey at SUSE
 
Which Hypervisor is Best?
Which Hypervisor is Best?Which Hypervisor is Best?
Which Hypervisor is Best?
 
Ambedded - how to build a true no single point of failure ceph cluster
Ambedded - how to build a true no single point of failure ceph cluster Ambedded - how to build a true no single point of failure ceph cluster
Ambedded - how to build a true no single point of failure ceph cluster
 
New use cases for Ceph, beyond OpenStack, Luis Rico
New use cases for Ceph, beyond OpenStack, Luis RicoNew use cases for Ceph, beyond OpenStack, Luis Rico
New use cases for Ceph, beyond OpenStack, Luis Rico
 

Similar to DUG'20: 05 - Very Early Experiences with a 0.5 PByte DAOS Testbed

Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
DataWorks Summit
 
Provisioning Servers Made Easy
Provisioning Servers Made EasyProvisioning Servers Made Easy
Provisioning Servers Made Easy
All Things Open
 
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platformApache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
rhatr
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
OpenEBS
 
Stackato
StackatoStackato
Stackato
Jonas Brømsø
 
Performant Django - Ara Anjargolian
Performant Django - Ara AnjargolianPerformant Django - Ara Anjargolian
Performant Django - Ara Anjargolian
Hakka Labs
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
Andrew Yongjoon Kong
 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Wangda Tan
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintro
Doug Chang
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
MariaDB Corporation
 
Realtime analytics with_hadoop
Realtime analytics with_hadoopRealtime analytics with_hadoop
Realtime analytics with_hadoop
Edgar Alejandro Villegas
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
Wei Ting Chen
 
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioningLeveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
DataWorks Summit
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
Evans Ye
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
Red_Hat_Storage
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference Architectures
Kamesh Pemmaraju
 
Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?
Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?
Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?
Red_Hat_Storage
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
DataWorks Summit/Hadoop Summit
 
SQL Saturday San Diego
SQL Saturday San DiegoSQL Saturday San Diego
SQL Saturday San Diego
Kellyn Pot'Vin-Gorman
 

Similar to DUG'20: 05 - Very Early Experiences with a 0.5 PByte DAOS Testbed (20)

Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Provisioning Servers Made Easy
Provisioning Servers Made EasyProvisioning Servers Made Easy
Provisioning Servers Made Easy
 
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platformApache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Stackato
StackatoStackato
Stackato
 
Performant Django - Ara Anjargolian
Performant Django - Ara AnjargolianPerformant Django - Ara Anjargolian
Performant Django - Ara Anjargolian
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintro
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
Realtime analytics with_hadoop
Realtime analytics with_hadoopRealtime analytics with_hadoop
Realtime analytics with_hadoop
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioningLeveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference Architectures
 
Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?
Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?
Software Defined Storage, Big Data and Ceph - What Is all the Fuss About?
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
SQL Saturday San Diego
SQL Saturday San DiegoSQL Saturday San Diego
SQL Saturday San Diego
 

More from Andrey Kudryavtsev

DUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansDUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution Plans
Andrey Kudryavtsev
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
Andrey Kudryavtsev
 
DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...
DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...
DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...
Andrey Kudryavtsev
 
DUG'20: 10 - Storage Orchestration for Composable Storage Architectures
DUG'20: 10 - Storage Orchestration for Composable Storage ArchitecturesDUG'20: 10 - Storage Orchestration for Composable Storage Architectures
DUG'20: 10 - Storage Orchestration for Composable Storage Architectures
Andrey Kudryavtsev
 
DUG'20: 09 - DAOS Middleware Update
DUG'20: 09 - DAOS Middleware UpdateDUG'20: 09 - DAOS Middleware Update
DUG'20: 09 - DAOS Middleware Update
Andrey Kudryavtsev
 
DUG'20: 08 - DAOS-SEGY Mapping
DUG'20: 08 - DAOS-SEGY MappingDUG'20: 08 - DAOS-SEGY Mapping
DUG'20: 08 - DAOS-SEGY Mapping
Andrey Kudryavtsev
 
DUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOSDUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOS
Andrey Kudryavtsev
 
DUG'20: 06 - DAOS Adventures at CERN Openlab
DUG'20: 06 - DAOS Adventures at CERN OpenlabDUG'20: 06 - DAOS Adventures at CERN Openlab
DUG'20: 06 - DAOS Adventures at CERN Openlab
Andrey Kudryavtsev
 
DUG'20: 04 - DAOS Feature Update
DUG'20: 04 - DAOS Feature UpdateDUG'20: 04 - DAOS Feature Update
DUG'20: 04 - DAOS Feature Update
Andrey Kudryavtsev
 
DUG'20: 03 - Online compression with QAT in DAOS
DUG'20: 03 - Online compression with QAT in DAOSDUG'20: 03 - Online compression with QAT in DAOS
DUG'20: 03 - Online compression with QAT in DAOS
Andrey Kudryavtsev
 
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraDUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
Andrey Kudryavtsev
 
DUG'20: 01 - Welcome & DAOS Update
DUG'20: 01 - Welcome & DAOS UpdateDUG'20: 01 - Welcome & DAOS Update
DUG'20: 01 - Welcome & DAOS Update
Andrey Kudryavtsev
 
DAOS Middleware overview
DAOS Middleware overviewDAOS Middleware overview
DAOS Middleware overview
Andrey Kudryavtsev
 

More from Andrey Kudryavtsev (13)

DUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansDUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution Plans
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
 
DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...
DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...
DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa...
 
DUG'20: 10 - Storage Orchestration for Composable Storage Architectures
DUG'20: 10 - Storage Orchestration for Composable Storage ArchitecturesDUG'20: 10 - Storage Orchestration for Composable Storage Architectures
DUG'20: 10 - Storage Orchestration for Composable Storage Architectures
 
DUG'20: 09 - DAOS Middleware Update
DUG'20: 09 - DAOS Middleware UpdateDUG'20: 09 - DAOS Middleware Update
DUG'20: 09 - DAOS Middleware Update
 
DUG'20: 08 - DAOS-SEGY Mapping
DUG'20: 08 - DAOS-SEGY MappingDUG'20: 08 - DAOS-SEGY Mapping
DUG'20: 08 - DAOS-SEGY Mapping
 
DUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOSDUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOS
 
DUG'20: 06 - DAOS Adventures at CERN Openlab
DUG'20: 06 - DAOS Adventures at CERN OpenlabDUG'20: 06 - DAOS Adventures at CERN Openlab
DUG'20: 06 - DAOS Adventures at CERN Openlab
 
DUG'20: 04 - DAOS Feature Update
DUG'20: 04 - DAOS Feature UpdateDUG'20: 04 - DAOS Feature Update
DUG'20: 04 - DAOS Feature Update
 
DUG'20: 03 - Online compression with QAT in DAOS
DUG'20: 03 - Online compression with QAT in DAOSDUG'20: 03 - Online compression with QAT in DAOS
DUG'20: 03 - Online compression with QAT in DAOS
 
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraDUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
 
DUG'20: 01 - Welcome & DAOS Update
DUG'20: 01 - Welcome & DAOS UpdateDUG'20: 01 - Welcome & DAOS Update
DUG'20: 01 - Welcome & DAOS Update
 
DAOS Middleware overview
DAOS Middleware overviewDAOS Middleware overview
DAOS Middleware overview
 

Recently uploaded

Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 

Recently uploaded (20)

Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 

DUG'20: 05 - Very Early Experiences with a 0.5 PByte DAOS Testbed

  • 1. Very Early Experiences with a 0.5 PByte DAOS Testbed Steffen Christgau, Tobias Watermann, Thomas Steinke Supercomputing Department Zuse Institute Berlin DAOS User Group Meeting 2020
  • 2. Computing + Data Storing @ Zuse Institute Berlin • ZIB operates HLRN-IV "Lise" for German science community • Motivation for DAOS: Our current Lustre installation w/o Burst Buffer ⇒ poor IOPS performance Complement Lustre? — Burst Buffer, BeeGFS/BeeOND, DAOS Workloads that can benefit from DAOS: turbulence simulation, astrophysics, small/many files I/O, · · · AI/DL Evaluation of new storage concepts vs. "traditional" concepts • DAOS as research and later production platform Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 1 / 10
  • 3. HLRN-IV "Lise" @ ZIB 8 PFlop/s peak 1,270 nodes, Intel CLX AP 121,920 cores 3,000+ users — 200+ projects 8 PB Lustre A B ~32 Å ~16 Å • Chemistry incl. Material Science • Earth Science incl. Climate Research • Engineering • Life Science (biology, medicine) • Physics incl. Astro and High-Energy Physics Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 2 / 10
  • 4. DAOS @ ZIB: Exploration Testbed • DAOS user since July 2019 • version 0.6... and git commits before that • manual compilation process • Exploration Testbed: used for DCPM and DAOS exploration isolated from HLRN 2 Inspur dual-socket nodes (CLX-SP Platinum 8260L) 3 + 6 TB Optane DCPMM and 384 GB + 768 GB DRAM 8 + 16 TB Optane SSD single 100 Gb/s OmniPath back-to-back CentOS 7 Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 3 / 10
  • 5. DAOS @ HLRN: Integration Testbed larger testbed to be integrated in HLRN infrastructure • 20 dual-socket server nodes (CLX Gold 6240R) • 192 GB DRAM • 1.5 TB DCPM • 25.6 TB NVMe NAND SSD • 2 x 100 Gbit/s OmniPath total capacity 512 TB SSD + 30 TB DCPM Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 4 / 10
  • 6. Installation Experiences with DAOS 1.0 (I) compared to earlier versions • Prebuilt DAOS packages are a good thing! We use these, no in-house build. • online package repositories would be even better (no login please, see oneAPI) • better OS integeration, support for installation and deployment • documentation improved a lot • documentation sometimes more promising than reality • configuration defaults and comments from examples do not apply • # scm_class default: dcpm → scm config validation failed: scm_class not set • immutable after reformat hints are good Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 5 / 10
  • 7. Installation Experiences with DAOS 1.0 (II) compared to earlier versions • error messages not always helpful: failed to connect to pool: -1026 • logs are more useful (sometimes) • content of stderr vs. log files vs. system logging (journal) • MPI(CH) integration appreciated! • middleware in general: unclear version management → packages?! • user interaction with fusefs/POSIX container: orphaned/forgotten mounts • dfuse daemon might be a good thing • PSM2 issue: multi-tennant usage with OmniPath not supported • intent to use sockets or verbs instead Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 6 / 10
  • 8. Simple DAOS Key Value API Benchmark • use DAOS key value (kv) library, DAOS 1.0.1 for very simple test • perform operation 1000× in blocking fashion, median reported • key = 4 Bytes, value = 32 Bytes 0 20 40 60 80 100 120 140 fi_pingpong RTT psm2 fi_pingpong RTT sockets put get (size-only) get (value) get (miss) remove 2 26 135 118 118 54 92 time / us Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 7 / 10
  • 9. Current Status Phase 1: Installation / integration • hardware shipped mid September, ready to boot OS end of September • DAOS software integration in HLRN cluster near completion • early tests on exploration testbed with critical HLRN workloads MPI IO middleware works seemless with MPI-ready application, see DUG’19 netCDF/HDF5 testing planned (waiting for compatible versions) before after Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 8 / 10
  • 10. Next Planned Step Phase 2: Research with integration testbed • Usability & user interface: application integration for a few test cases • Administration: experiences with capabilities of the management of pools, containers, . . ., monitoring & performance Phase 3: Access for selected user projects • intent to use per-project pool • provide DAOS as optional and additional offer for heavy IO worklads besides existing Lustre (work), NFS (home), and SSD drive (scratch) • support power users in migration → easy when MPI/HDF/netCDF is used for IO • disseminiation planned for 2021 ff. Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 9 / 10
  • 11. Summary • DAOS improved significantly since our first contact...and is still improving • integration into production system in progress • mature for early access of (power) users Thanks for your attention! Questions? Christgau/Watermann/Steinke (ZIB) Very Early Experiences with a 0.5 PByte DAOS Testbed DAOS User Group Meeting 2020 10 / 10