This document provides an overview of parallel HDF5 and performance tuning in the HDF5 library. It discusses the design and implementation of parallel HDF5 (PHDF5), including requirements, layers, programming restrictions, examples of APIs, and support for creating/accessing files and datasets in parallel. It also provides examples of writing and reading datasets in parallel using hyperslabs, including by rows, columns, pattern, and chunks. The goal of PHDF5 is to enable efficient parallel I/O while maintaining compatibility with serial HDF5 files.
HDF5 is designed to work well on high-performance parallel systems and clusters. This tutorial reviews the high-performance features of HDF5, including:
o Design of the Parallel HDF5 Library
o Parallel HDF5 Programming Model and Environment
Participants should be familiar with MPI and MPI I/O and have a basic knowledge of the sequential HDF5 Library; the lecture prepares them for the Parallel I/O hands-on session.
The tutorial is designed for users who have exposure to MPI I/O and the basic concepts of HDF5 and would like to learn about the Parallel HDF5 Library. It covers the Parallel HDF5 design and programming model, with several C and Fortran examples used to illustrate the basic ideas of the Parallel HDF5 programming model. Some performance issues, including collective chunked I/O, are discussed. Participants work with the tutorial examples and exercises during the hands-on sessions.
Overview of Parallel HDF5 and Performance Tuning in HDF5 Library
1. Overview of Parallel HDF5 and Performance Tuning in HDF5 Library
Elena Pourmal, Albert Cheng
Science Data Processing Workshop
February 27, 2002
2. Outline
- Overview of Parallel HDF5 Design
- Setting up Parallel Environment
- Programming model for
  - Creating and Accessing a File
  - Creating and Accessing a Dataset
  - Writing and Reading Hyperslabs
- Parallel Tutorial available at http://hdf.ncsa.uiuc.edu/HDF5/doc/Tutor
- Performance tuning in HDF5
3. PHDF5 Requirements
- PHDF5 files compatible with serial HDF5 files
  - Shareable between different serial or parallel platforms
- Single file image to all processes
  - One-file-per-process design is undesirable
    - Expensive post-processing
    - Not usable by a different number of processes
- Standard parallel I/O interface
  - Must be portable to different platforms
4. PHDF5 Initial Target
- Support for MPI programming
- Not for shared memory programming
  - Threads
  - OpenMP
- Has some experiments with
  - Thread-safe support for Pthreads
  - OpenMP if called correctly
5. Implementation Requirements
- No use of threads
  - Not commonly supported (1998)
- No reserved process
  - May interfere with parallel algorithms
- No spawned process
  - Not commonly supported even now
7. Programming Restrictions
- Most PHDF5 APIs are collective
  - MPI definition of collective: all processes of the communicator must participate in the right order
- PHDF5 opens a parallel file with a communicator
  - Returns a file handle
  - Future access to the file is via the file handle
  - All processes must participate in collective PHDF5 APIs
  - Different files can be opened via different communicators
8. Examples of PHDF5 API
- Examples of PHDF5 collective APIs
  - File operations: H5Fcreate, H5Fopen, H5Fclose
  - Object creation: H5Dcreate, H5Dopen, H5Dclose
  - Object structure: H5Dextend (increase dimension sizes)
- Array data transfer can be collective or independent
  - Dataset operations: H5Dwrite, H5Dread
9. What Does PHDF5 Support?
- After a file is opened by the processes of a communicator:
  - All parts of the file are accessible by all processes
  - All objects in the file are accessible by all processes
  - Multiple processes write to the same data array
  - Each process writes to an individual data array
10. PHDF5 API Languages
- C and F90 language interfaces
- Platforms supported: IBM SP2 and SP3, Intel TFLOPS, SGI Origin 2000, HP-UX 11.00 System V, Alpha Compaq Clusters
11. Creating and Accessing a File
Programming model
- HDF5 uses an access template object (file access property list) to control the file access mechanism
- General model to access an HDF5 file in parallel:
  - Set up the access template
  - Open the file
  - Close the file
12. Setup access template
Each process of the MPI communicator creates an access template and sets it up with MPI parallel access information.

C:
herr_t H5Pset_fapl_mpio(hid_t plist_id, MPI_Comm comm, MPI_Info info);

F90:
h5pset_fapl_mpio_f(plist_id, comm, info)
  integer(hid_t) :: plist_id
  integer        :: comm, info
13. C Example: Parallel File Create

comm = MPI_COMM_WORLD;
info = MPI_INFO_NULL;
/*
 * Initialize MPI
 */
MPI_Init(&argc, &argv);
/*
 * Set up file access property list for MPI-IO access
 */
plist_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plist_id, comm, info);
file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
/*
 * Close the file.
 */
H5Fclose(file_id);
MPI_Finalize();
15. Creating and Opening Dataset
- All processes of the MPI communicator open/close a dataset by a collective call
  - C: H5Dcreate or H5Dopen; H5Dclose
  - F90: h5dcreate_f or h5dopen_f; h5dclose_f
- All processes of the MPI communicator extend a dataset with unlimited dimensions before writing to it
  - C: H5Dextend
  - F90: h5dextend_f
16. C Example: Parallel Dataset Create

file_id = H5Fcreate(…);
/*
 * Create the dataspace for the dataset.
 */
dimsf[0] = NX;
dimsf[1] = NY;
filespace = H5Screate_simple(RANK, dimsf, NULL);
/*
 * Create the dataset with default properties (collective).
 */
dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,
                    filespace, H5P_DEFAULT);

H5Dclose(dset_id);
/*
 * Close the file.
 */
H5Fclose(file_id);
17. F90 Example: Parallel Dataset Create

CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)

CALL h5screate_simple_f(rank, dimsf, filespace, error)
!
! Create the dataset with default properties.
!
CALL h5dcreate_f(file_id, "dataset1", H5T_NATIVE_INTEGER, filespace, dset_id, error)
!
! Close the dataset.
CALL h5dclose_f(dset_id, error)
!
! Close the file.
CALL h5fclose_f(file_id, error)
18. Accessing a Dataset
- All processes that have opened the dataset may do collective I/O
- Each process may do an independent and arbitrary number of data I/O access calls
  - C: H5Dwrite and H5Dread
  - F90: h5dwrite_f and h5dread_f
19. Accessing a Dataset
Programming model
- Create and set the dataset transfer property
  - C: H5Pset_dxpl_mpio
    - H5FD_MPIO_COLLECTIVE
    - H5FD_MPIO_INDEPENDENT
  - F90: h5pset_dxpl_mpio_f
    - H5FD_MPIO_COLLECTIVE_F
    - H5FD_MPIO_INDEPENDENT_F
- Access the dataset with the defined transfer property
20. C Example: Collective Write

/*
 * Create property list for collective dataset write.
 */
plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

status = H5Dwrite(dset_id, H5T_NATIVE_INT,
                  memspace, filespace, plist_id, data);
22. Writing and Reading Hyperslabs
Programming model
- Each process defines memory and file hyperslabs
- Each process executes a partial write/read call
  - Collective calls
  - Independent calls
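The slides with the by-row, by-column, by-pattern, and by-chunk hyperslab examples are not reproduced in this extract, so here is a minimal C sketch of the by-rows case only. It is not from the original deck: it assumes the dset_id, NX, NY, and data names from the earlier examples, that mpi_rank and mpi_size were obtained from MPI_Comm_rank/MPI_Comm_size, that NX divides evenly by the number of processes, and it uses the current H5Sselect_hyperslab signature.

/* Illustrative sketch only: each process writes NX/mpi_size contiguous
 * rows of the NX x NY integer dataset created above. */
hsize_t count[2]  = {NX / mpi_size, NY};       /* rows owned by this process */
hsize_t offset[2] = {mpi_rank * count[0], 0};  /* starting row in the file   */

/* Select this process's block of rows in the file dataspace */
filespace = H5Dget_space(dset_id);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

/* Memory dataspace describing the local buffer */
memspace = H5Screate_simple(2, count, NULL);

/* Collective partial write of the selected hyperslab */
plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, plist_id, data);

The by-column, by-pattern, and by-chunk cases differ only in the offset, stride, count, and block values passed to H5Sselect_hyperslab.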
41. H5Pset_meta_block_size
- Sets the minimum metadata block size allocated for metadata aggregation. The larger the size, the fewer the number of small data objects in the file.
- The aggregated block of metadata is usually written in a single write action and always in a contiguous block, potentially significantly improving library and application performance.
- Default is 2 KB.
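As an illustration (not from the slides), a minimal sketch of setting this property on the file access property list used when creating or opening the file; the fapl_id name and the 1 MB value are assumptions for the example.

/* Aggregate metadata into 1 MB blocks (library default is 2 KB) */
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_meta_block_size(fapl_id, 1048576);
file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);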
42. H5Pset_alignment
- Sets the alignment properties of a file access property list so that any file object greater than or equal in size to threshold bytes will be aligned on an address which is a multiple of alignment. This makes a significant improvement on file systems that are sensitive to data block alignments.
- Default values for threshold and alignment are one, implying no alignment. Generally the default values will result in the best performance for single-process access to the file. For MPI-IO and other parallel systems, choose an alignment which is a multiple of the disk block size.
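An illustrative call (not from the slides) on the same assumed fapl_id, asking that any object of 1 MB or larger be placed on a 4 MB boundary; the threshold and alignment values are examples only and would normally be matched to the file system block or stripe size.

/* Align objects >= 1 MB on 4 MB boundaries (example values only) */
H5Pset_alignment(fapl_id, 1048576, 4194304);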
43. H5Pset_fapl_split
- Sets the file driver to store metadata and raw data in two separate files: a metadata file and a raw data file.
- Significant I/O improvement if the metadata file is stored in a Unix file system (good for small I/O) while the raw data file is stored in a parallel file system (good for large I/O).
- Default is no split.
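A sketch (not from the slides) of enabling the split driver on the assumed fapl_id: metadata goes to a "<name>.meta" file with default access and raw data to a "<name>.raw" file accessed through the MPI-IO driver. The extensions and the driver choices here are illustrative assumptions.

hid_t raw_fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(raw_fapl, MPI_COMM_WORLD, MPI_INFO_NULL);  /* raw data via MPI-IO */
H5Pset_fapl_split(fapl_id, ".meta", H5P_DEFAULT,            /* metadata file       */
                  ".raw", raw_fapl);                        /* raw data file       */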
44. Writing contiguous dataset on Tflop at LANL (Mb/sec, various buffer sizes)
[Chart omitted: Mb/sec versus number of processes (2, 4, 8, 16); each process writes 10 Mb of data. Measured functionality: HDF5 writing to a standard HDF5 file; writing directly with MPI I/O (no HDF5); HDF5 writing with the split driver.]
45. H5Pset_cache
- Sets:
  - the number of elements (objects) in the metadata cache
  - the number of elements, the total number of bytes, and the preemption policy value in the raw data chunk cache
- The right values depend on the individual application access pattern.
- Default for the preemption value is 0.75.
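As an illustration (not from the slides), a sketch using the current C signatures on the assumed fapl_id: the existing settings are read back and only the chunk cache byte count is raised to 16 MB, mirroring the 16 MB test on the following slides.

int    mdc_nelmts;
size_t rdcc_nslots, rdcc_nbytes;
double rdcc_w0;
/* Keep the current settings, change only the chunk cache size to 16 MB */
H5Pget_cache(fapl_id, &mdc_nelmts, &rdcc_nslots, &rdcc_nbytes, &rdcc_w0);
H5Pset_cache(fapl_id, mdc_nelmts, rdcc_nslots, 16 * 1024 * 1024, rdcc_w0);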
46. H5Pset_fapl_mpio
- MPI-IO hints can be passed to the MPI-IO layer via the Info parameter.
- E.g., telling ROMIO to use two-phase I/O speeds up collective I/O on the ASCI Red machine.
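A sketch (not from the slides) of passing a hint through the Info argument; hint names and useful values depend on the MPI implementation and file system, and "romio_cb_write" used here is a ROMIO-specific hint that enables collective (two-phase) buffering for writes.

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_cb_write", "enable");   /* ROMIO-specific hint */
H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, info);
MPI_Info_free(&info);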
47. Data Transfer Level Knobs
- H5Pset_buffer
- H5Pset_sieve_buf_size
- H5Pset_hyper_cache
48. H5Pset_buffer
- Sets the maximum size for the type conversion buffer and background buffer used during data transfer. The bigger the size, the better the performance.
- Default is 1 MB.
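An illustrative call (not from the slides) on a dataset transfer property list; the xfer_plist name is an assumption, the 4 MB value is an example, and passing NULL lets the library allocate the buffers itself.

/* Raise the type conversion/background buffer from 1 MB to 4 MB */
H5Pset_buffer(xfer_plist, 4 * 1024 * 1024, NULL, NULL);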
49. H5Pset_sieve_buf_size
- Sets the maximum size of the data sieve buffer. The bigger the size, the fewer I/O requests issued for raw data access.
- Default is 64 KB.
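An illustrative call (not from the slides) on the assumed fapl_id that raises the sieve buffer from the 64 KB default to 1 MB as an example value.

/* Raise the data sieve buffer to 1 MB (default 64 KB) */
H5Pset_sieve_buf_size(fapl_id, 1048576);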
50. H5Pset_hyper_cache
- Indicates whether to cache hyperslab blocks during I/O, a process which can significantly increase I/O speeds.
- Default is to cache blocks with no limit on block size for serial I/O, and to not cache blocks for parallel I/O.
51. Chunk Cache Effect by H5Pset_cache
- Write one integer dataset of 256x256x1024 (256 MB)
- Using chunks of 256x16x1024 (16 MB)
- Two tests:
  - Default chunk cache size (1 MB)
  - Chunk cache size set to 16 MB
52. Chunk Cache Time Definitions
- Total: time to open the file, write the dataset, close the dataset, and close the file
- Dataset Write: time to write the whole dataset
- Chunk Write: best time to write a chunk
- User Time: total Unix user time of the test
- System Time: total Unix system time of the test
53. Chunk Cache Size Results

Cache buffer size (MB)   Chunk write time (sec)   Dataset write time (sec)   Total time (sec)   User time (sec)   System time (sec)
1                        132.58                   2450.25                    2453.09            14                2200.1
16                       0.376                    7.83                       8.27               6.21              3.45
54. Chunk Cache Size Summary
- Big chunk cache size improves performance
- Poor performance mostly due to increased system time
  - Many more I/O requests
  - Smaller I/O requests