The HDF group has experienced remarkable successes, producing high-quality open source software that is widely used throughout the world. This talk presents a collection of thoughts on the HDF Project's approach to software engineering. In writing these notes we have come to realize that any success the group has had is due to several factors, including:
o Strong, responsible, and continuing relationships with users
o An approach to needs identification, software design, and software implementation based on sound principles of software engineering
o Effective technical processes for developing, testing, integrating and maintaining software
o Business and social processes based on sound group management principles
These factors are little more than platitudes, however. The manner in which they are successfully applied can only be understood by examining the details. In this talk, we describe some of the details, emphasizing mostly those areas in which we have had success.
HDF Software Process - Lessons Learned & Success Factors
1. HDF Software Process
Lessons Learned & Success Factors
Mike Folk, Elena Pourmal, Bob McGrath
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
NOBUGS 2004
HDF-EOS Workshop VIII
2. Outline
• What is HDF? and Who is HDF?
• HDF “Architecture”
• Some statistics
• How do we measure success?
• How can we achieve success?
• Group practices
• Summing up – strengths, weaknesses, needs
4. HDF in a nutshell – what it is
• File format and I/O libraries for storing, managing and archiving large, complex scientific and other data
• Tools and utilities
• Open source, free for any use (U of I license)
• Well maintained and supported
• From the HDF group, NCSA, University of Illinois
• http://hdf.ncsa.uiuc.edu
-4-
HDF
5. HDF in a nutshell - features
• General
  – simple and flexible data model
• Flexible
  – stores data of diverse origins, sizes, and types
  – supports complex data structures and types
• Portable
  – available for many operating systems and machines
• Scalable
  – works in high-end computing environments
  – accommodates data of any size or multiplicity
• Efficient
  – fast access, including parallel I/O
  – stores big data efficiently
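To make the efficiency bullets concrete, here is a minimal sketch of creating a chunked, compressed dataset, using the HDF5 1.6-era C API that was current at the time of this talk; the file name "example.h5" and dataset name "/temperature" are illustrative, not from the slides.

#include "hdf5.h"

int main(void)
{
    hsize_t dims[2]  = {1024, 1024};   /* a 2-D array of floats */
    hsize_t chunk[2] = {64, 64};       /* stored and compressed chunk by chunk */

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* Chunking enables partial I/O and is required for compression. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);           /* gzip compression, level 6 */

    hid_t dset = H5Dcreate(file, "/temperature", H5T_NATIVE_FLOAT, space, dcpl);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

Chunked storage is also what lets the library read and write subsets of a large array efficiently, which matters for the scalability claims above.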
6. HDF in a nutshell - users
• Apps in industry, academia, government
– More than 200 distinct applications
• Large user base
– E.g. NASA estimates 1.6 million users
• Underlying format for community standards
– E.g. HDF-EOS, SAF, CGNS, NPOESS, NeXus
7. Example of HDF file: mixing and grouping objects
[Figure: a single HDF file whose groups (foo, a, b, c, x, y, z) mix diverse objects: a 1 GB 3-D array, a 2-D array, two raster images with a palette, a small table (lat | lon | temp), and a text annotation: “This file was create as a part of… see http://hdf.ncsa.uiuc.edu”.]
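As a hypothetical illustration of mixing object types in one file, the sketch below stores the lat/lon/temp table from the figure as a 1-D dataset of a compound datatype inside a group; the names are taken loosely from the figure, and the API is again the 1.6-era C interface.

#include "hdf5.h"

typedef struct {
    int   lat;
    int   lon;
    float temp;
} record_t;

int main(void)
{
    record_t table[3] = { {12, 23, 3.1f}, {15, 24, 4.2f}, {17, 21, 3.6f} };
    hsize_t  dim = 3;

    hid_t file = H5Fcreate("mixed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t grp  = H5Gcreate(file, "/foo", 0);  /* a group, as in the figure */

    /* Describe the C struct layout to HDF5 as a compound datatype. */
    hid_t rec = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(rec, "lat",  HOFFSET(record_t, lat),  H5T_NATIVE_INT);
    H5Tinsert(rec, "lon",  HOFFSET(record_t, lon),  H5T_NATIVE_INT);
    H5Tinsert(rec, "temp", HOFFSET(record_t, temp), H5T_NATIVE_FLOAT);

    hid_t space = H5Screate_simple(1, &dim, NULL);
    hid_t dset  = H5Dcreate(grp, "table", rec, space, H5P_DEFAULT);
    H5Dwrite(dset, rec, H5S_ALL, H5S_ALL, H5P_DEFAULT, table);

    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(rec);
    H5Gclose(grp);
    H5Fclose(file);
    return 0;
}

Raster images, palettes, and arrays could be added to the same file alongside the table, which is the point of the slide.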
11. Supported languages and compilers
• C
• Wrappers:
– C++
– Fortran90
– Java
• Vendors’ compilers (SUN, IBM, HP, etc.)
• PGI and Absoft (Fortran)
• GNU C (e.g. gcc 3.3.2)
12. Supported Machines and OS
• Solaris 2.7, 2.8 (32/64-bit)
• IRIX 6.5, IRIX64 6.5
• HP-UX 11.00
• AIX 5.1 (32/64-bit modes)
• OSF1
• FreeBSD
• Linux (SuSE, RH8, RH9), including 64-bit
• Altix (SGI Linux)
• IA-32 and IA-64
• Windows 2000, XP
• Mac OS X
• Crays (T3E, SV1, T90 IEEE)
• DOE National Labs machines
• Linux Clusters
13. Architecture in context
[Diagram: a layered stack. At the top, tools and applications (HDF5 applications in C, C++, F90, Java); below them, the HDF5 programming interface; beneath that, the low-level interface with serial and parallel I/O paths; at the bottom, the file itself on platforms such as IA32/Linux RH, SGI/IRIX32, Wintel/XP, and Cray/SV1.]
14. Architecture in context
[Diagram: the same layered stack as the previous slide, with community APIs (HDF-EOS, SAF, CGNS) added between the tools and applications layer and the HDF5 programming interface.]
15. The testing challenge
Machines × operating systems
× compilers × languages
× serial and parallel
× compression options
× configuration options
× virtual file options
× backward compatibility
= a large number
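To see how quickly that product grows, take purely illustrative counts (not figures from the talk): 10 machine/OS combinations × 5 compilers × 4 languages × 2 (serial/parallel) × 3 compression options × 4 configuration options × 3 virtual file options × 2 backward-compatibility modes = 28,800 distinct configurations to test.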
22. How do we measure success?
• Mission
• Goals and objectives
• Strong and continuing relationships with users
• High quality software
• Strong committed development team
• Great working environment
• Adequate funding
23. Mission, goals and objectives
• Mission
  – To develop, promote, deploy, and support open and free technologies that facilitate scientific data exchange, access, analysis, archiving and discovery
• Goals (examples)
  – Innovate and evolve the technologies in concert with a changing world of technologies
  – Maintain a high level of quality and reliability
  – Collaborate and build communities
  – Build a team
24. Mission, goals and objectives
• Objectives - how we reach the goal
• Example:
  – Goal
    • Maintain a high level of quality and reliability
  – Objectives
    • Improve testing
    • Implement a program to ensure excellent software engineering practices
    • Develop and execute a plan to meet quality/reliability standards
25. Users
• Number of users
• Happy users
• Unhappy users
• Users achieve their goals by using HDF technologies
• Users coming back with new needs
• Financial support from users
26. Software
• Technology that addresses users’ needs and demands (current and future)
  – E.g. big files, parallel access, multiple objects
• Usability
  – Number and types of applications
  – Appropriate APIs and data models
  – Available tools
  – Interoperability with other software
    • E.g. IDL, MatLab, Mathematica
27. Software
• Stability
  – Can data be shared?
  – Can software run on needed platforms?
• Sustainability
  – Can data written 15 years ago on an obsolete platform still be read?
  – Will the software be available in 15 years?
• Acceptability
  – De facto standard
    • Open standard for exchange of remote-sensed data
    • Over 3,000,000,000,000,000 bytes (3 petabytes) stored in HDF and HDF-EOS
29. How can we achieve success?
• Maintain strong, responsible, and continuing relationships with users
• An approach to needs identification, software design, and software implementation based on sound principles of software engineering
• Effective technical processes for developing, testing, integrating and maintaining software
• Business and social processes based on sound group management principles
30. Stages of software development at HDF
• Getting started
• Creating an implementation approach
• Implementation and maintenance
• Relations with users and sponsors
• Group practices
31. Getting started
• Discover a need
• Identify a sponsor
• Clarify the need, its role, and its importance
• Enter task into the project plan
  – Make initial estimate of time and resources for the task
  – Give it a priority
  – Identify task’s lead
  – Identify a person who will work on the task
32. Creating implementation approach
• Write up a needs/approach RFC (Request For Comment)
  – Actively solicit feedback from developers/sponsors
  – Revise until satisfied
• Write up a design/approach RFC
  – Get feedback from developers/sponsors
  – Revise until satisfied
• Revise project plan according to RFC results
• Archive RFC
33. Implementation and maintenance
• Identify validation plan (needs improvement)
• Implement
  – Library or tool
  – Tests
  – Documentation
• Ask sponsor and friendly users for feedback
• Review results and repeat appropriate steps above as needed
• Clean up (documentation, Web, etc.) and announce
• Support (debug, fix, add more tests, advertise)
34. Relations with users and sponsors
• Who are our sponsors?
  – Organizations and communities with institutional and financial commitment to HDF
    • NCSA, NASA, DOE ASCI, Boeing, …
  – Agencies supporting R&D
    • NCSA, NASA, DOE, NSF, …
  – Collaborators who make in-kind contributions
    • Cactus, PyTables, NeXus, CGNS …
  – HDF group members
35. Relations with users and sponsors
• Each task is associated with a sponsor
• Each task has a priority, which should be confirmed with the sponsor
• Each task falls into one of these categories
  – Research
  – R&D (research, possibly integrate into product)
  – Development
    • Technology infusion
    • Library or tools enhancement
37. Group practices - technical
• Source code management: CVS
• Bug tracking: Bugzilla
– Bugs entered by support staff and developers
– Prioritized by staff
– Easy bugs fixed “on the fly”
38. Group practices - technical
• The testing challenge
• Code testing
  – Testing before code check-in
  – Regression testing
  – Remote testing
  – Testing of different configurations
  – Backward compatibility testing
39. Daily test report
From: HDF group system admin <hdfadmin@ncsa.uiuc.edu>
To: hdf5lib@ncsa.uiuc.edu
Subject: HDF5_Daily_Tests_FAILED!!!

*** HDF5 Tests on 041022 ***
=============================
Watchers List
=============================
HDF5 Daily test features/platforms watchers and procedure
---------------------------------------------------------
Procedure:
The watcher will investigate and report the cause of failure by 11am.
The developer who checked in the error code may report so by then too.
The watcher or the developer should get the failure fixed and report it by 3pm.
40. Group practices - technical
• Release levels
– Development release
– Official release
– Past releases
42. Group practices – business and social
• Staff breakdown
  – User support
  – Documentation
  – QA
  – Software development
  – Testing
  – Team leadership
  – System administration
[Diagram: the HDF Project divided into teams: support, documentation, QA, and maintenance; basic library development; tools and Java; and parallel I/O, Grid, and big machines.]
• Team lead for each team
• Most staff in two or more teams
• Staff relationships
  – Complement each other
  – Overlap each other
  – Keep each other honest
43. Group practices – business and social
• Accountability of everyone to the whole process
• Help desk
• Approaches to carrying out tasks
  – Paying attention to technical proposals
  – Weekly HDF5 developers’ meetings
  – HDF seminars
• Management and administration
  – Performance reviews with emphasis on goals and development
  – Critical to success
  – That’s another talk
45. Strengths
• User support
• Staff
– High quality, diverse staff with good morale
– Staff commitment and enthusiasm
• Ability to address all aspects of product development
– Emphasis on quality control
– Fast bug fixing and frequent releases
– Ability to focus on a single product over a long term
• High level of support from sponsors
• Project’s visibility through NCSA, NASA, DOE, users
46. Weaknesses
• Software development team
  – Library expertise still concentrated among too few developers
  – Team communication is challenging
• Processes
  – Release/maintenance take too much time and resources
  – Configuration and porting are a huge time sink
  – We don’t do enough prototyping
  – Hard to keep up with new technologies
  – Parallel I/O hard to support
47. More weaknesses & challenges
• Usability
  – Software too hard to use for casual users
  – Insufficient documentation
  – Insufficient tools for high level users
  – Insufficient interoperability with common tools and formats
• Marketing
  – Marketing effort is inadequate
  – Need to connect better with users and potential users
• Viable long-term support
48. Most immediate needs
• Configuration and build
• Testing and prototyping
• Marketing
• Reporting
  – Performance reports
  – General reports to users
  – HDF book
• Sustainable business model
Format and software for scientific data. HDF5 is a different format from earlier versions of HDF, as is the library.
Stores images, multidimensional arrays, tables, etc. That is, you can construct all of these different kinds of structures and store them in HDF5. You can also mix and match them in HDF5 files according to your needs.
Emphasis on storage and I/O efficiency. Both the library and the format are designed to address this.
Free and commercial software support. As far as HDF5 goes, this is just a goal now. There is commercial support for HDF4, but little if any for HDF5 at this time. We are working with vendors to change this.
Emphasis on standards. You can store data in HDF5 in a variety of ways, so we try to work with users to encourage them to organize HDF5 files in standard ways.
Users from many engineering and scientific fields
Like HDF4, HDF5 has a grouping structure.
The main difference is that every HDF5 file starts with a root group, whereas HDF4 doesn’t need any groups at all.
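A small sketch (hypothetical file and group names, 1.6-era C API) of what that grouping structure looks like in code: the root group "/" exists as soon as the file does, and other groups hang off it.

#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("grouped.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* "/" already exists; build a small hierarchy beneath it. */
    hid_t foo = H5Gcreate(file, "/foo", 0);
    hid_t bar = H5Gcreate(foo, "bar", 0);   /* i.e. /foo/bar */

    H5Gclose(bar);
    H5Gclose(foo);
    H5Fclose(file);
    return 0;
}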
It is useful to think about HDF software in terms of layers.
At the bottom layer is the HDF5 file or other data source.
Above that are two layers corresponding to the HDF library.
First there is a low-level interface that concentrates on basic I/O: opening and closing files, reading and writing bytes, seeking, etc. HDF5 provides a public API at this level so that people can write their own drivers for reading and writing to places other than those already provided with the library. The drivers already provided include UNIX stdio and MPI-IO.
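For example, here is a sketch of choosing between those two shipped drivers through a file access property list; the parallel branch assumes the library was built with MPI support (the H5_HAVE_PARALLEL macro), and the file name is illustrative.

#include "hdf5.h"
#ifdef H5_HAVE_PARALLEL
#include <mpi.h>
#endif

int main(int argc, char **argv)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

#ifdef H5_HAVE_PARALLEL
    MPI_Init(&argc, &argv);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);  /* MPI-IO driver */
#else
    (void)argc; (void)argv;
    H5Pset_fapl_stdio(fapl);                                /* UNIX stdio driver */
#endif

    hid_t file = H5Fcreate("driver.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Fclose(file);
    H5Pclose(fapl);
#ifdef H5_HAVE_PARALLEL
    MPI_Finalize();
#endif
    return 0;
}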
Then comes the high-level, object-specific interface. This is the API that most people who develop HDF5 applications use. This is where you create a dataset or group, read and write datasets and subsets, etc.
At the top are applications, or perhaps APIs used by applications. Examples of the latter are the HDF-EOS API that supports NASA’s EOSDIS datatypes, and the DSL API that supports the ASCI data models.
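To illustrate that object-specific layer, here is a sketch that opens the hypothetical dataset created in the earlier example and reads a 2×3 subset of it (a hyperslab), again with the 1.6-era C API.

#include "hdf5.h"

int main(void)
{
    float   subset[2][3];
    hsize_t start[2] = {4, 10};   /* where the subset begins in the file */
    hsize_t count[2] = {2, 3};    /* how many elements to read           */

    hid_t file   = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen(file, "/temperature");
    hid_t fspace = H5Dget_space(dset);

    /* Select the subset in the file and describe the memory buffer. */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, subset);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}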