This document provides an overview and outline of topics related to advanced features in HDF5, including:
- HDF5 supports various datatypes like atomic, compound, array, and variable-length datatypes. It allows creation of complex user-defined datatypes.
- Partial I/O in HDF5 allows reading and writing subsets of datasets using hyperslab selections, which describe subsets through properties like start point, stride, count, and block size.
- Chunking and compression can be used to improve performance and reduce storage needs when working with subsets of large HDF5 datasets.
This tutorial is designed for HDF5 users with some HDF5 experience. It covers properties of HDF5 objects that affect I/O performance and file sizes. The following HDF5 features are discussed: partial I/O, chunking and compression, and complex HDF5 datatypes such as strings, variable-length arrays, and compound datatypes. We will also discuss references to objects and dataset regions and how they can be used for indexing. Participants will work with the tutorial examples and exercises during the hands-on sessions.
4. An HDF5 Datatype is…
• A description of dataset element type
• Grouped into “classes”:
  • Atomic – integers, floating-point values
  • Enumerated
  • Compound – like C structs
  • Array
  • Opaque
  • References
    • Object – similar to soft link
    • Region – similar to soft link to dataset + selection
  • Variable-length
    • Strings – fixed and variable-length
    • Sequences – similar to Standard C++ vector class
5. HDF5 Datatypes
• HDF5 has a rich set of pre-defined datatypes and supports the creation of an unlimited variety of complex user-defined datatypes.
• Self-describing:
  • Datatype definitions are stored in the HDF5 file with the data.
  • Datatype definitions include information such as byte order (endianness), size, and floating-point representation to fully describe how the data is stored and to ensure portability across platforms.
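Because datatypes are self-describing, an application can recover this information at runtime with the standard H5T query functions. A minimal sketch, assuming an already-open dataset identifier `dataset`:

hid_t       dtype = H5Dget_type(dataset);  /* datatype as stored in the file */
H5T_class_t cls   = H5Tget_class(dtype);   /* e.g. H5T_INTEGER, H5T_FLOAT    */
size_t      size  = H5Tget_size(dtype);    /* element size in bytes          */
H5T_order_t order = H5Tget_order(dtype);   /* H5T_ORDER_LE or H5T_ORDER_BE   */
H5Tclose(dtype);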
6. Datatype Conversion
• Datatypes that are compatible, but not identical, are converted automatically when I/O is performed.
• Compatible datatypes:
  • All atomic datatypes are compatible.
  • Identically structured array, variable-length, and compound datatypes whose base type or fields are compatible.
  • Enumerated datatype values on a “by name” basis.
• Make datatypes identical for best performance.
7. Datatype Conversion Example
[Diagram: an array of integers on an IA32 platform (native integer: little-endian, 4 bytes) and an array of integers on a SPARC64 platform (native integer: big-endian, 8 bytes) are both written with H5Dwrite as H5T_NATIVE_INT; the file stores a little-endian 4-byte integer (H5T_STD_I32LE), and H5Dread converts the stored data to the reader's native type, e.g. VAX G-floating.]
8. Datatype Conversion
Datatype of data on disk:

dataset = H5Dcreate(file, DATASETNAME, H5T_STD_I64BE, space,
                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

Datatype of data in memory buffer:

H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
         H5P_DEFAULT, buf);
H5Dwrite(dataset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
         H5P_DEFAULT, buf);
10. HDF5 Compound Datatypes
• Compound types:
  • Comparable to C structs
  • Members can be any datatype
  • Can write/read by a single field or a set of fields
  • Not all data filters can be applied (shuffling, SZIP)
11. Creating and Writing Compound Dataset
h5_compound.c example:

typedef struct s1_t {
    int    a;
    float  b;
    double c;
} s1_t;

s1_t s1[LENGTH];
12. Creating and Writing Compound Dataset
/* Create datatype in memory. */
s1_tid = H5Tcreate(H5T_COMPOUND, sizeof(s1_t));
H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a), H5T_NATIVE_INT);
H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c), H5T_NATIVE_DOUBLE);
H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b), H5T_NATIVE_FLOAT);

Note:
• Use the HOFFSET macro instead of calculating offsets by hand.
• The order of H5Tinsert calls is not important if HOFFSET is used.
13. Creating and Writing Compound Dataset
/* Create dataset and write data. */
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space,
                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
status  = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL,
                   H5P_DEFAULT, s1);

Note:
• In this example the memory and file datatypes are the same.
• The type is not packed.
• Use H5Tpack to save space in the file:

status  = H5Tpack(s1_tid);
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space,
                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
14. Reading Compound Dataset
/* Create datatype in memory and read data. */
dataset = H5Dopen(file, DATASETNAME, H5P_DEFAULT);
s2_tid  = H5Dget_type(dataset);
mem_tid = H5Tget_native_type(s2_tid, H5T_DIR_DEFAULT);
buf     = malloc(H5Tget_size(mem_tid) * number_of_elements);
status  = H5Dread(dataset, mem_tid, H5S_ALL, H5S_ALL,
                  H5P_DEFAULT, buf);

Note:
• We could construct the memory type as we did in the writing example.
• For general applications we need to discover the type in the file, find the corresponding memory type, allocate space, and do the read.
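Slide 10 noted that a compound dataset can be read one field at a time. A hedged sketch of that technique, reusing the field name "b_name" from the example above; the one-member struct sb_t is introduced here for illustration. When the memory compound type contains only a subset of the file type's fields, the library reads just those fields:

/* Read only the "b_name" field of each record. */
typedef struct {
    float b;
} sb_t;

sb_t  sb[LENGTH];
hid_t sb_tid = H5Tcreate(H5T_COMPOUND, sizeof(sb_t));
H5Tinsert(sb_tid, "b_name", HOFFSET(sb_t, b), H5T_NATIVE_FLOAT);

status = H5Dread(dataset, sb_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, sb);
H5Tclose(sb_tid);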
16. Table Example

a_name    b_name    c_name
(integer) (float)   (double)
   0         0.      1.0000
   1         1.      0.5000
   2         4.      0.3333
   3         9.      0.2500
   4        16.      0.2000
   5        25.      0.1667
   6        36.      0.1429
   7        49.      0.1250
   8        64.      0.1111
   9        81.      0.1000

Multiple ways to store a table:
• Dataset for each field
• Dataset with compound datatype
• If all fields have the same type:
  â—¦ 2-dim array
  â—¦ 1-dim array of array datatype
• Continued…

Choose to achieve your goal!
• Storage overhead?
• Do I always read all fields?
• Do I read some fields more often?
• Do I want to use compression?
• Do I want to access some records?
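For the "1-dim array of array datatype" option, a hedged sketch of building the element type with H5Tarray_create2; the rank, dimensions, and dataset name are illustrative, not taken from the slides:

/* Each dataset element is a fixed-length 1-D array of 3 floats. */
hsize_t adims[1] = {3};
hid_t   atype    = H5Tarray_create2(H5T_NATIVE_FLOAT, 1, adims);

/* Use the array type as the dataset's datatype, then release it. */
hid_t dset = H5Dcreate(file, "table_as_arrays", atype, space,
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Tclose(atype);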
18. HDF5 Fixed and Variable Length Array Storage
[Diagram: two sequences of data records along a time axis, contrasting fixed-length array storage (the same number of data elements at each step) with variable-length storage (the number of elements varies over time).]
19. Storing Variable Length Data in HDF5
• Each element is represented by a C structure:

typedef struct {
    size_t len;  /* length of the element */
    void  *p;    /* pointer to the data   */
} hvl_t;

• Base type can be any HDF5 type:

H5Tvlen_create(base_type);
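A hedged sketch of creating and writing a one-dimensional variable-length dataset of integers; the element lengths and the dataset name are illustrative:

/* Two variable-length elements of different lengths. */
int a[3] = {1, 2, 3};
int b[5] = {10, 20, 30, 40, 50};

hvl_t wdata[2];
wdata[0].len = 3; wdata[0].p = a;
wdata[1].len = 5; wdata[1].p = b;

hid_t   vl_tid  = H5Tvlen_create(H5T_NATIVE_INT);
hsize_t dims[1] = {2};
hid_t   space   = H5Screate_simple(1, dims, NULL);
hid_t   dset    = H5Dcreate(file, "vl_data", vl_tid, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

H5Dwrite(dset, vl_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, wdata);

H5Dclose(dset); H5Sclose(space); H5Tclose(vl_tid);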
21. Reading HDF5 Variable Length Array
• The HDF5 library allocates the memory the data is read into.
• The application only needs to allocate the array of hvl_t elements (pointers and lengths).
• The application must reclaim the memory for the data read in:

hvl_t rdata[LENGTH];
/* Create the memory vlen type */
tvl = H5Tvlen_create(H5T_NATIVE_INT);
ret = H5Dread(dataset, tvl, H5S_ALL, H5S_ALL,
              H5P_DEFAULT, rdata);
/* Reclaim the read VL data */
H5Dvlen_reclaim(tvl, H5S_ALL, H5P_DEFAULT, rdata);
22. Variable Length vs. Array
• Pros of variable-length datatypes vs. arrays:
  • Use less space if compression is unavailable
  • Automatically store the length of the data
  • No maximum size
    • The size of an array is its effective maximum size
• Cons of variable-length datatypes vs. arrays:
  • Substantial performance overhead
    • Each element is a “pointer” to a piece of metadata
  • Variable-length data cannot be compressed
    • Unused space in arrays can be “compressed away”
  • Must be 1-dimensional
24. Storing Strings in HDF5
• Array of characters (array datatype or extra dimension in the dataset)
  • Quick access to each character
  • Extra work to access and interpret each string
• Fixed length:

string_id = H5Tcopy(H5T_C_S1);
H5Tset_size(string_id, size);

  • Wasted space in shorter strings
  • Can be compressed
• Variable length:

string_id = H5Tcopy(H5T_C_S1);
H5Tset_size(string_id, H5T_VARIABLE);

  • Overhead as for all VL datatypes
  • Compression will not be applied to actual data
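A hedged sketch of writing two variable-length strings with the datatype created above; with H5T_VARIABLE strings the write buffer is simply an array of char pointers (the dataset name is illustrative):

const char *strs[2] = {"short", "a considerably longer string"};

hid_t str_tid = H5Tcopy(H5T_C_S1);
H5Tset_size(str_tid, H5T_VARIABLE);

hsize_t dims[1] = {2};
hid_t   space   = H5Screate_simple(1, dims, NULL);
hid_t   dset    = H5Dcreate(file, "strings", str_tid, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

H5Dwrite(dset, str_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, strs);

H5Dclose(dset); H5Sclose(space); H5Tclose(str_tid);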
30. Collect data one way…
[Figure: array of images (3D)]

31. Display data another way…
[Figure: stitched image (2D array)]

32. Data is too big to read…
33. HDF5 Library Features
• The HDF5 library provides capabilities to:
  • Describe subsets of data and perform write/read operations on subsets
    • Hyperslab selections and partial I/O
  • Store descriptions of the data subsets in a file
    • Object references
    • Region references
  • Use efficient storage mechanisms to achieve good performance while writing/reading subsets of data
    • Chunking, compression
34. Partial I/O in HDF5
35. How to Describe a Subset in HDF5?
• Before writing or reading a subset of data, one has to
describe it to the HDF5 Library.
• HDF5 APIs and documentation refer to a subset as a
"selection" or "hyperslab selection".
• If a selection is specified, the HDF5 Library will perform
I/O only on the selection and not on all elements of the
dataset.
36. Types of Selections in HDF5
• Two types of selections
  ◦ Hyperslab selection
    ◦ Regular hyperslab
    ◦ Simple hyperslab
    ◦ Result of set operations on hyperslabs (union, difference, …)
  ◦ Point selection
• Hyperslab selection is especially important for
doing parallel I/O in HDF5 (see the Parallel HDF5
Tutorial)
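A minimal sketch of both selection types (HDF5 1.8 API; space, start1/count1, start2/count2 are illustrative):

    /* Point selection: two individual elements of a 2-D dataspace */
    hsize_t coords[2][2] = {{0, 0}, {3, 5}};
    H5Sselect_elements(space, H5S_SELECT_SET, 2, (const hsize_t *)coords);

    /* Union of two hyperslabs: SET the first, then OR in the second */
    H5Sselect_hyperslab(space, H5S_SELECT_SET, start1, NULL, count1, NULL);
    H5Sselect_hyperslab(space, H5S_SELECT_OR,  start2, NULL, count2, NULL);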
39. Hyperslab Selection
Result of union operation on three simple hyperslabs
40. Hyperslab Description
• Start - starting location of the hyperslab, e.g. (1,1)
• Stride - number of elements that separate the starting
points of adjacent blocks, e.g. (3,2)
• Count - number of blocks, e.g. (2,6)
• Block - block size, e.g. (2,1)
• Everything is "measured" in number of elements
41. Simple Hyperslab Description
• Two ways to describe a simple hyperslab
  ◦ As several blocks:
    Stride – (1,1)
    Count – (4,6)
    Block – (1,1)
  ◦ As one block:
    Stride – (1,1)
    Count – (1,1)
    Block – (4,6)
• There is no performance penalty for either description.
42. H5Sselect_hyperslab Function
space_id   Identifier of the dataspace
op         Selection operator: H5S_SELECT_SET or H5S_SELECT_OR
start      Array with the starting coordinates of the hyperslab
stride     Array specifying which positions along a dimension to select
count      Array specifying how many blocks to select from the dataspace, in each dimension
block      Array specifying the size of the element block (NULL indicates a block size of a single element in each dimension)
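As a sketch, the hyperslab described two slides back (start (1,1), stride (3,2), count (2,6), block (2,1)) would be selected like this (file_space is an illustrative dataspace handle):

    hsize_t start[2]  = {1, 1};
    hsize_t stride[2] = {3, 2};
    hsize_t count[2]  = {2, 6};
    hsize_t block[2]  = {2, 1};

    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, stride, count, block);

Passing NULL for stride or block is equivalent to passing 1 in every dimension.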
43. Reading/Writing Selections
Programming model for reading from a dataset in a file:
1. Open the dataset.
2. Get the file dataspace handle of the dataset and specify the
   subset to read from.
   a. H5Dget_space returns the file dataspace handle. The file
      dataspace describes the array stored in the file (number of
      dimensions and their sizes).
   b. H5Sselect_hyperslab selects the elements of the array that
      participate in the I/O operation.
3. Allocate a data buffer of an appropriate shape and size.
44. Reading/Writing Selections
Programming model (continued):
4. Create a memory dataspace and specify the subset to write to.
   a. The memory dataspace describes the data buffer (its rank and
      dimension sizes).
   b. Use the H5Screate_simple function to create the memory
      dataspace.
   c. Use H5Sselect_hyperslab to select the elements of the data
      buffer that participate in the I/O operation.
5. Issue H5Dread or H5Dwrite to move the data between the file and
   the memory buffer.
6. Close the file dataspace and memory dataspace when done.
45. Example : Reading Two Rows
Data in the file: a 4x6 matrix

   1   2   3   4   5   6
   7   8   9  10  11  12
  13  14  15  16  17  18
  19  20  21  22  23  24

Buffer in memory: a 1-dim array of length 14, initialized to -1

  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
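A minimal sketch of this example (the dataset handle dset and the buffer offset are illustrative): 12 elements (two rows of the 4x6 matrix, starting at row index 1) are selected in the file, and 12 of the 14 buffer elements are selected in memory, so the element counts match. Note the caution below about mixing ranks; reading a 2-D file selection into a 1-D buffer works but is not the tuned path.

    int     rdata[14];                 /* pre-filled with -1            */
    hsize_t fstart[2] = {1, 0};        /* two rows, all 6 columns       */
    hsize_t fcount[2] = {2, 6};
    hsize_t mdim[1]   = {14};
    hsize_t mstart[1] = {1};           /* skip the first buffer element */
    hsize_t mcount[1] = {12};

    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, fstart, NULL, fcount, NULL);

    hid_t mspace = H5Screate_simple(1, mdim, NULL);
    H5Sselect_hyperslab(mspace, H5S_SELECT_SET, mstart, NULL, mcount, NULL);

    H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, rdata);
    H5Sclose(mspace);
    H5Sclose(fspace);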
49. Things to Remember
• The number of elements selected in the file and in the
memory buffer must be the same
  ◦ H5Sget_select_npoints returns the number of
    selected elements in a hyperslab selection
• HDF5 partial I/O is tuned to move data between
selections that have the same dimensionality;
avoid choosing subsets that have different ranks
(as in the example above)
• Allocate a buffer of an appropriate size when
reading data; use H5Tget_native_type and
H5Tget_size to get the correct size of a data
element in memory.
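A small sketch of the buffer-sizing advice (dset and fspace are illustrative handles from the previous steps):

    hid_t    ftype  = H5Dget_type(dset);
    hid_t    ntype  = H5Tget_native_type(ftype, H5T_DIR_DEFAULT);
    size_t   elsize = H5Tget_size(ntype);             /* element size in memory */
    hssize_t npts   = H5Sget_select_npoints(fspace);  /* elements selected      */

    void *buf = malloc((size_t)npts * elsize);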
52. Contiguous storage layout
• Metadata header separate from dataset data
• Data stored in one contiguous block in HDF5 file
[Figure: the metadata cache in application memory holds the dataset header (datatype, dataspace, attributes, …); the dataset data is stored as one contiguous block in the file.]
53. What is HDF5 Chunking?
• Data is stored in chunks of predefined size
• The two-dimensional case may be referred to as data tiling
• The HDF5 library usually writes/reads a whole chunk at a time
[Figure: the same 2-D dataset shown with contiguous layout and with chunked layout.]
54. What is HDF5 Chunking?
• Dataset data is divided into equally sized blocks (chunks).
• Each chunk is stored separately as a contiguous block in the
HDF5 file.
[Figure: the dataset divided into chunks A, B, C, D; the metadata cache in application memory holds the dataset header (datatype, dataspace, attributes, …) and a chunk index; each chunk is a separate contiguous block in the file, located via the chunk index.]
55. Why HDF5 Chunking?
• Chunking is required for several HDF5 features
  ◦ Enabling compression and other filters, such as checksum
  ◦ Extendible datasets
56. Why HDF5 Chunking?
• If used appropriately, chunking improves partial
I/O for big datasets
[Figure: a hyperslab selection overlapping only two chunks; only those two chunks are involved in the I/O.]
57. Creating Chunked Dataset
1. Create a dataset creation property list.
2. Set the property list to use chunked storage layout.
3. Create the dataset with the above property list.

dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 200;
H5Pset_chunk(dcpl_id, rank, ch_dims);
dset_id = H5Dcreate(…, dcpl_id);
H5Pclose(dcpl_id);
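A fuller sketch of the same steps (HDF5 1.8 API; file_id, the dataset name, and the dimensions are illustrative):

    hsize_t dims[2]    = {1000, 2000};
    hsize_t ch_dims[2] = {100, 200};

    hid_t space   = H5Screate_simple(2, dims, NULL);
    hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl_id, 2, ch_dims);

    hid_t dset_id = H5Dcreate(file_id, "chunked", H5T_NATIVE_INT, space,
                              H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
    H5Pclose(dcpl_id);
    H5Sclose(space);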
58. Creating Chunked Dataset
• Things to remember:
  ◦ A chunk always has the same rank as the dataset
  ◦ A chunk's dimensions do not need to be factors
    of the dataset's dimensions
  ◦ Caution: this may cause more I/O than desired
    (the partial chunks along the dataset's edges are still
    read and written whole)
59. Creating Chunked Dataset
• Chunk size cannot be changed after the dataset is
created
• Do not make chunk sizes too small (e.g. 1x1)!
  ◦ Metadata overhead for each chunk (file space)
  ◦ Each chunk is read individually
  ◦ Many small reads are inefficient
60. Writing or Reading Chunked Dataset
1. The chunking mechanism is transparent to the application.
2. Use the same set of operations as for a contiguous dataset,
   for example:
   H5Dopen(…);
   H5Sselect_hyperslab(…);
   H5Dread(…);
3. Selections do not need to coincide precisely with chunk
   boundaries.
61. HDF5 Chunking and compression
• Chunking is required for compression and other filters
• HDF5 filters modify data during I/O operations
• Filters provided by HDF5:
  ◦ Checksum (H5Pset_fletcher32)
  ◦ Data transformation (in 1.8.*)
  ◦ Shuffling filter (H5Pset_shuffle)
• Compression (also called filters) in HDF5:
  ◦ Scale + offset (in 1.8.*) (H5Pset_scaleoffset)
  ◦ N-bit (in 1.8.*) (H5Pset_nbit)
  ◦ GZIP (deflate) (H5Pset_deflate)
  ◦ SZIP (H5Pset_szip)
63. Creating Compressed Dataset
1. Create a dataset creation property list.
2. Set the property list to use chunked storage layout.
3. Set the property list to use filters.
4. Create the dataset with the above property list.

dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(dcpl_id, rank, ch_dims);
H5Pset_deflate(dcpl_id, 9);
dset_id = H5Dcreate(…, dcpl_id);
H5Pclose(dcpl_id);
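Filters are applied in the order they are set on the property list; a common combination (an assumption here, not from the slides) is the shuffle filter followed by deflate, since reordering bytes across elements often improves the compression ratio:

    H5Pset_chunk(dcpl_id, 2, ch_dims);
    H5Pset_shuffle(dcpl_id);       /* reorder bytes across elements first */
    H5Pset_deflate(dcpl_id, 6);    /* then gzip-compress each chunk       */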
65. Accessing a row in contiguous dataset
One seek is needed to find the starting location of a row of data.
The data is read/written using one disk access.
66. Accessing a row in chunked dataset
Five seeks are needed to find each of the five chunks. The data is
read/written using five disk accesses. For this access pattern,
chunked storage is less efficient than contiguous storage.
67. Quiz time
• How might I improve this situation, if it is
common to access my data in this way?
68. Accessing data in contiguous dataset
[Figure: an access pattern crossing M rows of a contiguous dataset.]
M seeks are needed to find the starting location of each element.
The data is read/written using M disk accesses. Performance may be
very bad.
69. Motivation for chunking storage
[Figure: the same access pattern crossing M rows of a chunked dataset; it touches only two chunks.]
Two seeks are needed to find the two chunks. The data is
read/written using two disk accesses. For this pattern,
chunking helps with I/O performance.
70. Motivation for chunk cache
[Figure: chunks A and B, each crossed by two row selections.]
The selection shown is written by two H5Dwrite calls (one for
each row). Chunks A and B are each accessed twice (once for each
row). If both chunks fit into the cache, only two I/O accesses
are needed to write the shown selection.
71. Motivation for chunk cache
[Figure: the same two-row selection over chunks A and B.]
Question: What happens if there is space for only one
chunk at a time?
72. Advanced Exercise
• Write data to a dataset
• The dataset is 512x2048, 4-byte native integers
• Chunks are 256x128: 128 KB each, 2 MB per row of chunks
• Write by rows
73. Advanced Exercise
• Very slow performance
• What is going wrong?
• The chunk cache is only 1 MB by default
[Figure sequence: as each row is written, every chunk in the row is read into the cache and then evicted (written to disk) before the row is complete, so whole chunks are repeatedly read and re-written.]
81. Exercise 1
• Improve performance by changing only the chunk size
  ◦ The access pattern is fixed; memory is limited
• One solution: 64x2048 chunks
  ◦ A row of chunks fits in the cache
    (64 x 2048 x 4 bytes = 512 KB < 1 MB)
82. Exercise 2
• Improve performance by changing only the access pattern
  ◦ The file already exists; the chunk size cannot be changed
• One solution: access by chunk
  ◦ Each selection fits in the cache and is contiguous on disk
83. Exercise 3
• Improve performance while changing neither the chunk size
nor the access pattern
  ◦ No memory limitation
• One solution: set the chunk cache to the size of a row of
chunks (2 MB)
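A sketch of this solution (HDF5 1.8.3 and later; the file and dataset names are illustrative). A row of chunks in the exercise is 16 chunks x 128 KB = 2 MB:

    hid_t dapl_id = H5Pcreate(H5P_DATASET_ACCESS);
    /* 1607 is a prime roughly 100x the 16 chunks that fit in the
       cache, following the library's guidance for the slot count */
    H5Pset_chunk_cache(dapl_id, 1607, 2 * 1024 * 1024,
                       H5D_CHUNK_CACHE_W0_DEFAULT);
    hid_t dset_id = H5Dopen(file_id, "dset", dapl_id);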
84. Exercise 4
• Improve performance while changing neither the chunk size
nor the access pattern
  ◦ The chunk cache size can be set to at most 1 MB
• One solution: disable the chunk cache
  ◦ Avoids repeatedly reading/writing whole chunks
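A sketch of this solution (an assumption based on the library's documented behavior: a cache byte size of 0 disables chunk caching, so each access bypasses the cache):

    hid_t dapl_id = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl_id, H5D_CHUNK_CACHE_NSLOTS_DEFAULT,
                       0 /* bytes: disables the cache */,
                       H5D_CHUNK_CACHE_W0_DEFAULT);
    hid_t dset_id = H5Dopen(file_id, "dset", dapl_id);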
85. More Information
• More detailed information on chunking and the
chunk cache can be found in the draft “Chunking
in HDF5” document at:
http://www.hdfgroup.org/HDF5/doc/_topic/Chunking
87. Acknowledgements
This work was supported by cooperative agreement
number NNX08AO77A from the National
Aeronautics and Space Administration (NASA).
Any opinions, findings, conclusions, or
recommendations expressed in this material are
those of the author[s] and do not necessarily reflect
the views of the National Aeronautics and Space
Administration.