Karen Cariani and Jon W. Dunn presentation, Open Repositories 2016, Dublin, June 2016. https://www.conftool.com/or2016/index.php?page=browseSessions&form_session=141#paperID104
HydraDAM2: Repository Challenges and Solutions for Large Media Files
1. HydraDAM2: Repository Challenges and Solutions for Large Media Files
Karen Cariani, Senior Director, WGBH Media Library and Archives
Jon Dunn, Assistant Dean for Library Technologies, Indiana University
3. Challenges of Audio and Video
Descriptive metadata
Technical metadata
Large preservation files
Multiple files with similar metadata
Storage dependent on frequency of access
—Bandwidth capability
4. Preservation Needs
Multiple copies
Save original files
Validity – checksums
Regular storage migration
Persistence
File format issues
— Migration ease
— Future playback
Fixity checks on big files (a streaming checksum sketch follows this slide)
Big files
— Speed of access to preservation files for reuse
— Processing speed
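Fixity checking multi-hundred-gigabyte preservation masters generally means hashing them in chunks rather than reading whole files into memory. Below is a minimal illustrative sketch in plain Ruby; it is not taken from the HydraDAM2 codebase, and the file path, chunk size, and choice of SHA-256 are assumptions for illustration.

```ruby
require 'digest'

# Compute a SHA-256 fixity value by streaming the file in 16 MB chunks,
# so large preservation masters never have to fit in memory.
def fixity_sha256(path, chunk_size: 16 * 1024 * 1024)
  digest = Digest::SHA256.new
  File.open(path, 'rb') do |io|
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end

# Compare a freshly computed checksum against the value recorded at ingest.
def fixity_ok?(path, expected_hexdigest)
  fixity_sha256(path) == expected_hexdigest
end

# Example (hypothetical path and stored value):
# puts fixity_ok?('/masters/item-1234-master.mkv', 'ab12...') ? 'fixity OK' : 'FIXITY FAILURE'
```

Streaming keeps memory use constant regardless of file size, which matters when the same routine has to run across petabytes of holdings.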
5. Some History: HydraDAM1
Began with HydraDAM 1, which was based on Sufia and Fedora 3
—Self-deposit institutional repository application
Adapted to add bulk ingest, bulk edit, characterization of files, and transcoding of proxies
Limitations:
—Assumed a full workflow pipeline for ingestion of A/V materials
—Processing performance problems
6. Indiana University Context
• Over 3 million special collections items at IU Bloomington
• Within and outside the Libraries
• Many sources of A/V
• Music and other performing arts
• Ethnomusicology, anthropology
• Public broadcasting stations
• Film collections
• Athletics
7. MDPI: IU Media Digitization and Preservation Initiative
Goal: “To digitize, preserve and make universally available by IU’s Bicentennial—subject to copyright or other legal restrictions—all of the time-based media objects on all campuses of IU judged important by experts.”
280,000+ items
~7 PB over 4 years
9 TB per day peak
http://mdpi.iu.edu/
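As a rough back-of-the-envelope check (assuming a steady digitization rate and decimal units): 7 PB ≈ 7,000 TB spread over 4 × 365 ≈ 1,460 days works out to roughly 4.8 TB per day sustained, so the quoted 9 TB per day peak is about twice the average ingest rate.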
8. IU MDPI Repository Needs
[Diagram: media files and metadata feed a Digital Preservation Repository (HydraDAM2) holding masters and mezzanines, with copies in out-of-region storage; transcodes feed a separate Access Repository.]
9. HydraDAM2 Project Objectives
To extend the HydraDAM digital asset management system to operate in conjunction with Fedora 4.
— Hydra “Head” for digital audio/video preservation
Develop Fedora 4 content models for digital asset preservation objects, including descriptive, structural, and digital provenance metadata, based on current standards and practices and utilizing new features in Fedora 4 for storage and indexing of RDF.
Implement support in HydraDAM for different storage models, appropriate to different types of institutions.
Integrate HydraDAM into preservation workflows that feed access systems at IU (Avalon) and WGBH (Open Vault) and conduct testing of large files and high-throughput workflows.
Document and disseminate information about our implementation and experience to the library, archive, digital repository, audiovisual preservation, and Hydra communities.
10. NEH Desired Outcomes
How hard is it to do?
Is it implementable elsewhere?
Is it feasible for broad use?
NEH Preservation and Access R&D Grant:
January 2015 – January 2017
11. Project progress
Slow start getting developers in place
Coordinating work across organizations
Developing data models: what is shared and what differs
Determining where code splits for different storage needs
Workable agile development schedule split across geographically distributed organizations
12. Storage use cases
WGBH storing files offline on LTO tape directly from local workstation
—Bandwidth issues to move large preservation files across the network
—Easier for us to hand deliver
Indiana University utilizing a central HSM system for nearline storage
—Auto delivery of large files through network
13. Storage use cases
Not storing media preservation files in Fedora or in a filesystem managed by Fedora
— WGBH: just the location of the files on LTO tape
— IU: URL in Fedora that redirects to download of content from HSM
How do we accommodate both needs with common code? (A storage-adapter sketch follows this slide.)
— Where does the code split off?
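One way to keep the code common while the storage differs is a small adapter layer that the rest of the application codes against, with one implementation per back end. The sketch below is hypothetical and not the project's actual design; the class and method names (StorageAdapter, TapeLocationAdapter, HsmUrlAdapter, locate, retrieve) and the URLs are invented for illustration.

```ruby
# Hypothetical common interface both institutions could implement differently.
class StorageAdapter
  # Return a locator (tape barcode, URL, ...) for a stored preservation file.
  def locate(file_id)
    raise NotImplementedError
  end

  # Retrieve the bytes; may be slow or require human action.
  def retrieve(file_id, destination_path)
    raise NotImplementedError
  end
end

# WGBH-style: the repository records only where the file lives on LTO tape.
class TapeLocationAdapter < StorageAdapter
  def initialize(tape_index)
    @tape_index = tape_index # e.g. { 'file-123' => 'LTO-0042:/barcode/path.mov' }
  end

  def locate(file_id)
    @tape_index.fetch(file_id)
  end

  def retrieve(file_id, destination_path)
    # Offline tape: queue a request for staff to mount the tape and copy the file.
    puts "Request queued: restore #{locate(file_id)} to #{destination_path}"
  end
end

# IU-style: the repository stores a URL that redirects to the HSM/nearline system.
class HsmUrlAdapter < StorageAdapter
  def locate(file_id)
    "https://hsm.example.edu/files/#{file_id}" # hypothetical endpoint
  end

  def retrieve(file_id, destination_path)
    # Nearline HSM: the file can be pulled over the network automatically.
    require 'open-uri'
    File.binwrite(destination_path, URI.open(locate(file_id)).read)
  end
end
```

The split point is then a configuration choice of which adapter class to load, rather than divergent application code.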
14. Not in Fedora because
Files are big
Costly in terms of performance to push in and out of Fedora
Federation or projection in Fedora would allow Fedora to register content in and out, but with limitations
—Now deprecated in Fedora
Volumes (petabytes) of data too large to put on spinning disk because too costly
So storing on tape
16. HydraDAM2 PCDM data model (IU case)
[Diagram: a PCDM:Collection hasMember a Hydra:GenericWork (extends PCDM:Object), which hasMember Hydra:FileSet objects (extend PCDM:Object). FileSets hasFile PCDM:File resources: master file binary, mezzanine file binary, and access file binary (shown for files A and B), plus PCDM:File resources for POD XML, Memnon/IU XML, and MODS XML metadata binaries. An RDF sketch of these relationships follows this slide.]
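To make the diagram's relationships concrete, the sketch below asserts the same PCDM containment triples using the Ruby rdf gem. The resource URIs are placeholders, and HydraDAM2 itself builds these objects through the Hydra/ActiveFedora stack rather than by writing raw RDF, so treat this only as an illustration of the model.

```ruby
require 'rdf'

PCDM = RDF::Vocabulary.new('http://pcdm.org/models#')

# Placeholder URIs standing in for Fedora 4 resources.
collection = RDF::URI.new('http://repo.example.edu/collection/mdpi')
work       = RDF::URI.new('http://repo.example.edu/work/item-1234')
file_set   = RDF::URI.new('http://repo.example.edu/fileset/item-1234-a')
master     = RDF::URI.new('http://repo.example.edu/file/item-1234-a-master')
mezzanine  = RDF::URI.new('http://repo.example.edu/file/item-1234-a-mezz')
access     = RDF::URI.new('http://repo.example.edu/file/item-1234-a-access')

graph = RDF::Graph.new
graph << [collection, PCDM.hasMember, work]      # Collection --hasMember--> GenericWork
graph << [work,       PCDM.hasMember, file_set]  # GenericWork --hasMember--> FileSet
graph << [file_set,   PCDM.hasFile,   master]    # FileSet --hasFile--> master binary
graph << [file_set,   PCDM.hasFile,   mezzanine] # FileSet --hasFile--> mezzanine binary
graph << [file_set,   PCDM.hasFile,   access]    # FileSet --hasFile--> access binary

graph.each_statement { |statement| puts statement.inspect }
```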
17. Fedora 4 Asynchronous Storage: Proof of Concept
[Diagram: an Asynchronous Storage Proxy (a Rails application with an asynchronous-storage UI gem) fronts Apache Camel routes and service-translation blueprints that connect to local tape storage services, large files on disk, and cloud storage services, with notifications back to the user. Annotations: an asynchronous-aware user interface provides the interactions; the proxy provides an API with common endpoints and responses; the translation blueprints map from the common API to the specific storage APIs; the proxy should be able to be an API-X sharable service. A hypothetical client sketch of the proxy interaction follows this slide.]
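The point of the proxy is that retrieval from slow storage is asynchronous: a client asks for a file, receives a job reference, and polls (or is notified) until the back end has staged the content. The endpoint paths and JSON fields in the sketch below are hypothetical, invented only to show the interaction pattern; they are not the project's actual API.

```ruby
require 'net/http'
require 'json'
require 'uri'

PROXY = 'https://storage-proxy.example.edu' # hypothetical deployment URL

# Ask the proxy to stage a file from slow storage (tape, HSM, cloud).
def request_retrieval(file_id)
  uri = URI("#{PROXY}/jobs")
  response = Net::HTTP.post(uri, { file_id: file_id }.to_json,
                            'Content-Type' => 'application/json')
  JSON.parse(response.body).fetch('job_id')
end

# Poll until the job completes; a real system could rely on notifications instead.
def wait_for(job_id, interval: 30)
  loop do
    response = Net::HTTP.get_response(URI("#{PROXY}/jobs/#{job_id}"))
    job = JSON.parse(response.body)
    return job['download_url'] if job['state'] == 'complete'
    raise "retrieval failed: #{job['error']}" if job['state'] == 'failed'
    sleep interval
  end
end

# job = request_retrieval('item-1234-a-master')
# puts wait_for(job)
```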
18. Invoking asynchronous interactions from the Fedora 4 API
[Diagram: a Fedora 4 repository contains an RDF resource container node and a non-RDF resource node that acts as a URL redirect; access to the redirecting node via the Fedora 4 API invokes an immediate redirect to the stored URL. That URL points at the Asynchronous Interactions UI (backed by Apache Camel routes, the Asynchronous Storage Proxy, and the slow storage service), immediately invoking the asynchronous interactions for a unique identifier (preferably a persistent URL). The redirecting node uses the external-body MIME type and can be set through the Fedora 4 API and via Hydra Works file behaviors. A sketch of creating such a redirecting node follows this slide.]
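The redirecting node in the diagram relies on Fedora 4's external-content mechanism: a binary created with a message/external-body content type carrying a URL makes Fedora answer requests for that binary with a redirect to the URL instead of serving bytes itself. The sketch below shows the idea with plain Net::HTTP; the repository address and paths are placeholders, and in HydraDAM2 this would normally be set through the Hydra stack rather than raw HTTP.

```ruby
require 'net/http'
require 'uri'

fedora_binary = URI('http://localhost:8080/fcrepo/rest/works/item-1234/master')
redirect_to   = 'https://storage-proxy.example.edu/retrieve/item-1234-a-master' # placeholder

# Create (or update) a non-RDF resource whose content lives elsewhere.
# Fedora 4 treats message/external-body with access-type=URL as a pointer
# and redirects clients to that URL instead of serving the bytes itself.
Net::HTTP.start(fedora_binary.host, fedora_binary.port) do |http|
  put = Net::HTTP::Put.new(fedora_binary)
  put['Content-Type'] = %(message/external-body; access-type=URL; URL="#{redirect_to}")
  response = http.request(put)
  puts "Fedora responded: #{response.code}"
end

# A later request for the binary's content is answered with an HTTP redirect
# to the stored URL, which here would land on the Asynchronous Interactions UI.
```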
33. Where We’re Going
Ensure content models are on the right track
Continue development
— Build out storage proxy interaction with IU mass storage
— Build out WGBH storage implementation
— Additional user functionality
— Build out descriptive metadata / PBcore support
Batch ingest
Feed to/from Avalon Media System
Pilot implementation
Production implementation
Who are we? WGBH is Boston’s public television station. We produce fully one third of the content broadcast on PBS, including the series you see here, as well as Downton Abbey and Sherlock. In addition to television, we have two radio stations and a large, award-winning Interactive department that is the number one producer of the sites you’ll find on PBS.org. As you can see, we produce a wide variety of programming, from public affairs to history and science, children’s programs, arts, culture, drama, and how-tos. We have been on the air since 1951 with radio and 1955 with television.
At heart, and through our mission, we are an educational and cultural institution. We originated out of a consortium of academic universities in the Boston area. Because we have produced so much, we have a large archive of educational programming that is of interest to scholars and researchers, in addition to the public.
A quick check on preservation needs: so this digital stuff really sucks. Film or stone are much longer-lasting media, but digital gives us much better and broader access. So how do we preserve this fragile stuff that needs migration every 3-5 years? Well, you need multiple copies, and you save the originals because they should be whole. Checksums are validity checks on files to make sure you have all the bits. Migration covers not only the content (the files) but also all the technology, systems, and storage you use. And doing this with big media files is hard, time-consuming, and subject to damage and errors.
We were generously awarded a grant to see if we could build a media preservation DAM system using open source software. In particular, we wanted to test the Hydra tech stack, see what it would take to build, what it would take for others to install (better documentation), and really see how to integrate with an open source community.