An overview of the different application scenarios at the Austrian National Library related to Web Archiving and the Austrian Books Online project.
SCAPE Information Day at BL - Large Scale Processing with Hadoop (SCAPE Project)
This document discusses using Hadoop for large scale processing. It provides an overview of Hadoop and MapReduce frameworks and how they allow distributing processing across many nodes to efficiently process large amounts of data in parallel. It also gives examples of how Hadoop has been used at the British Library for digital preservation tasks like format migration and analysis.
Hadoop and its applications at the State and University Library, SCAPE Inform... (SCAPE Project)
Per Møldrup-Dalum introduced how the State and University Library in Denmark has deployed Hadoop in connection with the SCAPE project. With Hadoop the library has been able to process large amounts of data much faster than before.
The presentation was given at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. For more information about the demo day, see this blog post, http://bit.ly/SCAPE_SB_Demo, about the event.
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D... (SCAPE Project)
The State and University Library, Denmark, hosted an information and demonstration day on 25 June 2014 for delegates from other large cultural heritage institutions in Denmark. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
One of the presentations was given by Asger Askov Blekinge, who showed how the library has worked on integrating its digital object management system with Hadoop. The library is currently digitizing 32 million newspaper pages and is using Hadoop map/reduce jobs to do quality assurance on the digitized files, with the help of the SCAPE Stager/Loader, so that updated, QA’ed files are stored in the repository.
Hybrid Cloud for CERN
Experience with Open Telecom Cloud and OpenStack
1) CERN uses a hybrid cloud approach combining on-premise resources with public clouds like the Open Telecom Cloud to address its increasing computing needs for projects like the LHC.
2) A pilot project using the Open Telecom Cloud was successful but identified some issues around networking and storage.
3) CERN is now participating in the HNSciCloud project to jointly procure innovative hybrid cloud services that fully integrate commercial clouds with in-house and European e-infrastructure resources.
The document provides an overview of the ResourceSync framework, which aims to enable synchronization of web resources between source and destination servers. It describes the core capabilities that a source server can provide, including describing content through resource lists, packaging content in dumps, describing changes through change lists, and packaging changes in dumps. It also outlines key processes for destinations, such as baseline and incremental synchronization. The agenda covers motivation/use cases, framework walkthrough, technical details, and implementation. ResourceSync is designed as a modular framework based on sitemaps to describe resources and changes.
Jachym Cepicky gave a status report on PyWPS. PyWPS is an implementation of the OGC WPS standard written in Python. Version 4 is being rewritten to take advantage of improvements in Python and geospatial libraries since version 1 was created in 2006. Version 4.0 includes validators, a server based on Werkzeug, an IOHandler, and file storage. Version 4.1 is planned to include output via GeoServer, MapServer and QGIS, a REST API, and database/external storage. Progress has been limited by lack of resources for the open source project.
RIPEstat is introducing new features including a Widget API, Data API, and improved performance. The Widget API allows users to embed RIPEstat plugins on their own websites. The Data API provides access to RIPEstat data in JSON format. Performance has been improved by migrating several plugins to a new backend data cluster. Future plans include adding more widgets and data services, extending existing plugins, and improving backend performance. Feedback is encouraged on the RIPEstat website and mailing lists.
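The Data API returns plain JSON over HTTPS. Below is a minimal sketch in Java of querying it, assuming the public https://stat.ripe.net/data/<name>/data.json?resource=... URL layout of the RIPEstat Data API; the "network-info" endpoint and the example resource are illustrative only.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch of a RIPEstat Data API query (Java 11+ HttpClient).
public class RipeStatQuery {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Endpoint name and resource are illustrative assumptions.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(
                    "https://stat.ripe.net/data/network-info/data.json?resource=193.0.6.139"))
                .build();
        // The response body is a JSON document with a "data" object.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```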
On 29 January 2020 ARCHIVER launched its Request for Tender with the purpose of awarding several Framework Agreements and work orders for the provision of R&D for hybrid end-to-end archival and preservation services that meet the innovation challenges of European Research communities, in the context of the European Open Science Cloud.
The tender was closed on 28 April 2020 and 15 R&D bids were submitted, with consortia that included 43 companies and organisations. The best bids have been selected and will start the first phase of the ARCHIVER R&D (Solution Design) in June 2020.
On Monday 8 June the selected consortia for the ARCHIVER design phase were announced during a Public Award Ceremony starting at 14.00 CEST.
In light of the COVID-19 outbreak and the consequent movement restrictions imposed in several countries, the event was organised as a webinar, virtually hosted by Port d’Informació Científica (PIC), a member of the Buyers Group of the ARCHIVER consortium.
The Kick-off marks the beginning of the Solution Design Phase.
Tim Bell gave a presentation on 4 November 2014 in Paris about using OpenStack at CERN to help answer fundamental physics questions. Some key challenges discussed included the large amount of data generated from particle collisions, which is expected to grow to 400 PB/year by 2023. OpenStack has been in production at CERN since 2013 and is used across multiple clouds totaling over 150,000 cores. The presentation covered CERN's experience migrating to OpenStack and addressed cultural barriers to adoption.
CERN operates the Large Hadron Collider (LHC) and other particle physics experiments that generate enormous amounts of data. To handle this data, CERN uses a hybrid cloud model combining on-premise resources with public cloud services. CERN is participating in a joint procurement project called HNSciCloud to procure cloud services that can seamlessly integrate with CERN's internal resources and European research networks. The goal is to establish a hybrid cloud platform to meet the growing computing needs of particle physics and other research domains dealing with large datasets.
This webinar covered tools from the UK Data Service Census Support for working with UK census data, boundaries, and postcodes. It demonstrated how to use the Boundary Data Selector to download census boundaries, the Thematic Mapper to create choropleth maps from census data, and the Postcode Data Selector to extract postcode data and add lookups to other geographies. The webinar provided an overview of the UK census and types of data available, and explained how these online tools can be used to access and visualize UK census and geographic data.
CERN is an international research organization located near Geneva that operates the largest particle physics laboratory in the world. It has over 3,000 staff members and associates from over 20 member states, with an annual budget of 1 billion Swiss francs. CERN operates the Large Hadron Collider, a 27 km ring that collides protons and heavy ions to study particle physics. Data from particle collisions is collected by large detectors and analyzed using the Worldwide LHC Computing Grid, which distributes computing tasks across data centers in over 40 countries. Some of CERN's key results include the 2012 discovery of the Higgs boson. CERN faces challenges in scaling its computing capabilities to handle rapidly increasing data volumes from the LHC and is transforming its computing infrastructure in response.
The document discusses the Finnish Meteorological Institute's (FMI) approach to providing weather data in an INSPIRE compliant format. It describes how FMI opened all of its data in 2013 through a single data portal that serves as both an open data and INSPIRE portal. It then covers the various data models used to structure different types of weather data, including observations, forecasts, and radar images. Finally, it discusses experiences with implementing the different models and serving the wide range of weather data sets.
The document discusses HDF and netCDF data support in ArcGIS. It provides an overview of how HDF and netCDF data can be directly ingested and used as raster datasets, mosaic datasets, feature layers, and tables in ArcGIS. This allows for scientific data to be displayed, analyzed, and shared using common GIS tools and services. It also describes existing Python tools for working with netCDF data and outlines future areas of development, including improved support for HDF5, THREDDS/OPeNDAP access, and evolving data standards.
This document discusses using RIPE Atlas measurements to analyze how "local" internet traffic stays within countries. The presenter describes running traceroutes between RIPE Atlas probes within countries to identify the presence of internet exchange points (IXPs) and out-of-country paths. Case studies on Sweden, France, and Argentina/Chile show results. Code for processing RIPE Atlas data and running monthly measurements for many countries is provided, with the goal of identifying opportunities for networks to improve local peering and routing.
The HDF Group provides a hosted JupyterLab environment called HDF Kita Lab, which offers access to HDF data stored on AWS S3 via the HDF Kita Server. HDF Kita Lab extends JupyterLab with features like auto-configuration of the Kita Server and HDF branding. It runs on a Kubernetes cluster in AWS that can scale to handle different numbers of users. Each user gets computing resources and access to HDF data on S3 for analysis via commonly used Python packages. The data on S3 provides unlimited storage and sharing capabilities between users.
This document provides an overview of visualizing big imaging data in radio astronomy. It discusses:
1) Facilities like Pawsey Supercomputing Centre and ICRAR that provide computational resources for processing and visualizing large astronomy data.
2) Common astronomy image formats like FITS and emerging "big data" formats like JPEG2000 that allow for multi-resolution and streaming visualization.
3) The SkuareView framework that implements remote visualization of JPEG2000 encoded astronomy data using JPIP by streaming different resolutions and regions of interest without downloading full datasets.
4) A demo of using SkuareView to interactively visualize multi-TB radio astronomy datasets stored in the cloud.
This document discusses disaster recovery for OpenStack clouds using hybrid cloud solutions. It describes using replication between on-premises and cloud storage for near-zero downtime disaster recovery. The goals are to protect OpenStack application instances and data, and recover them in an alternative cloud location if needed. The proposed solution would build disaster recovery as an OpenStack project using existing components like Heat, Swift, and Cinder with a pluggable architecture.
Prototype Phase Kick-off Event and Ceremony (Archiver)
On Monday 7 December 2020, the selected consortia for the ARCHIVER prototype phase were announced during a Public Award Ceremony.
The Kick-off marks the beginning of the Prototype Implementation Phase, in which the three consortia selected to move forward will build prototypes of their solutions, including all components; basic functionality, interoperability, and security tests will be performed by IT specialists from the buyers’ group.
The document discusses the National Snow and Ice Data Center's (NSIDC) use of HDF and HDF-EOS file formats to manage and distribute scientific cryosphere data. NSIDC collects data from satellites like MODIS, AMSR-E and aircraft missions, processes the data, and makes it available in HDF(5) formats. The HDF formats allow for efficient storage of multi-dimensional scientific datasets along with metadata. NSIDC develops tools to allow users to access, analyze and visualize data stored in HDF files.
This document summarizes a presentation given at the ONE Conference 2013 about using cloud computing for Earth observation ground segments. It describes four cases where the European Space Agency used cloud computing:
1. Mass re-processing of satellite data on Amazon Web Services for validation purposes, allowing processing of 30,000 products in 5 weeks.
2. Coupling large data dissemination and processing capabilities on dedicated servers from Hetzner for analyzing 38,000 satellite images and serving 3,000 users.
3. A collaborative exploitation platform using multiple cloud providers through Helix Nebula for exploiting Earth observation data from various sources and making it available to over 200 users.
4. Plans for a sandbox service providing researchers and service providers
The document describes several deployment scenarios for storing and processing scientific data from instruments like the MAGIC Telescopes.
It outlines scenarios for: 1) large file safe-keeping, 2) mixed file safe-keeping including smaller files and reprocessing outputs, 3) in-archive data processing, 4) distributing data to instrument analysts, and 5) external user access.
Challenges include meeting data transfer timelines, ensuring data integrity over long periods, and providing flexible access and metadata tools to support analysis and discovery. Commercial providers could offer solutions but managing trust and customization needs is also discussed.
This document provides an overview of the status of HDF-EOS software and tools. It describes HDF-EOS5, a rewrite of HDF-EOS2 based on HDF5, which is used operationally by EOS instrument teams. The document also outlines software releases, major developments including bug fixes, and future plans, and provides contact information for support.
The document summarizes theories on the relationship between the mind and the brain. It explains that consciousness has two components, motor and perceptual, and that the brain is composed of the limbic system and the reticular formation. It also analyzes monist and dualist theories on the relationship between immaterial experience and matter, and approaches for assessing the mind-brain state, such as neuromagnetic imaging and neuropsychological tests.
Learn about the People to People Ambassador Program's student travel program to London this holiday. This 10-day program gives students an opportunity to explore London, one of the most amazing cities in the world. Students will experience this sought-after global destination and its holiday traditions in ways most only dream about. Learn more: http://bit.ly/15EoukD
This document discusses various security issues that can arise in source control systems. It describes buffer overflow attacks, where a program writes data past the end of a memory buffer. It also discusses citizen/casual programmers who may not follow proper security practices. Covert channels that can transfer data in violation of security policies are described. The document outlines controls and best practices around these issues like parameter checking, memory protection, and auditing and logging.
The poem is about two lovers who have been through ups and downs in their relationship but ultimately keep coming back to each other. It describes how despite the stops and starts, their love is resilient and they are meant to be together forever. The poem uses imagery of two angels rescued from falling to represent the narrator and their partner finding their way back to each other through all of life's challenges.
This document summarizes different tenses in English grammar including the present simple, present continuous, past simple, past continuous, present perfect simple, will, and going to. For each tense it provides the positive, negative and interrogative forms, example sentences using the tense, time expressions commonly used with the tense, and the uses or meanings of the tense.
The letter provides elders guidance on handling accusations of child abuse, including legal reporting requirements and protecting victims. It outlines steps to take regarding known abusers, including warnings to parents if an individual is considered a predator likely to reoffend. Elders are directed to contact the Legal Department for legal advice on any abuse matters and to protect both children and the congregation's reputation.
The document summarizes the European Archival Records and Knowledge (E-ARK) project, which developed an OAIS-compliant system for fast creation, search, and access of archival information packages. It describes the key components and functionality of the E-ARK reference implementation, including tools for ingest, archival storage, data management, access, and data mining of archived content. Current pilots of the E-ARK system are being used by several national archives for large-scale archiving and access of records.
The Escuela Superior Politécnica de Chimborazo offers a levelling course from April to August 2013 with a module on problem solving. The course includes six students: Yajaira Villalva, Karina Tigse, Liseth Carpintero, Sadia Shiguango, Jhonny Quishpi and Edison Quishpe.
This document discusses information security governance and risk management. It covers several topics:
- The roles and responsibilities of various information security positions such as the security officer, system administrators, and end users.
- Security policies, standards, procedures, and frameworks that organizations can implement to formalize security practices.
- Compliance with regulations and how to map different compliance frameworks.
- Managing risks from third parties, acquisitions, and other organizational changes.
- Ensuring proper information security governance through activities like risk analysis, security awareness training, and oversight from executive management.
Hemophilia is a sex-linked disorder caused by a mutation in a blood clotting protein, affecting about 1 in 5,000 males. Symptoms include deep bruising, swelling from internal bleeding, joint pain, headaches, and unusual bleeding after immunizations. It is treated by replacing the deficient clotting factor through medication or donated blood plasma.
SCAPE Presentation at the Elag2013 conference in Gent/Belgium (Sven Schlarb)
Presentation of the European project SCAPE (www.scape-project.eu) at the Elag2013 conference in Gent/Belgium. The presentation includes details about use cases and implementation at the Austrian National Library.
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20... (SCAPE Project)
This presentation was given by Per Møldrup-Dalum at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation an overview of the project, its results and how to sustain it is given. For more information, see this blog post, http://bit.ly/SCAPE_SB_Demo, about the event.
This presentation describes the EU-funded project SCAPE – Scalable Preservation Environments –, its developments and sustainability plans.
The SCAPE project has developed scalable services for planning and execution of institutional preservation strategies on an open source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects.
The project run-time was around 3½ years from 2011 to 2014.
Read more about SCAPE at www.scape-project.eu
SCAPE Information Day at BL - Some of the SCAPE Outputs Available (SCAPE Project)
The British Library hosted a ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Some tools were presented and demonstrated in more detail (see the other presentations) and the day was closed with a presentation by Will Palmer, Carl Wilson and Peter May of some of the other outputs that SCAPE has delivered.
A brief introduction to the SCAPE project co-funded by the European Union under the FP7 ICT program. A blog post leading you through the presentation can be found here: http://www.openplanetsfoundation.org/blogs/2012-12-10-scape-project-%E2%80%93-brief-introduction
SCAPE Webinar: Tools for uncovering preservation risks in large repositories (SCAPE Project)
This presentation originates from a webinar presented by Luís Faria. The webinar presents the SCAPE-developed tools Scout and C3PO and demonstrates how to identify preservation risks in your content and, at the same time, share your content profile information with others to open new opportunities.
Scout, the preservation watch system, centralizes all the necessary knowledge on the same platform, cross-referencing this knowledge to uncover all preservation risks. Scout automatically fetches information from several sources to populate its knowledge base. For example, Scout integrates with C3PO to get large-scale characterization profiles of content. Furthermore, Scout aims to be a knowledge exchange platform, to allow the community to bring together all the necessary information into the system. The sharing of information opens new opportunities for joining forces against common problems.
The webinar was held 26 June 2014.
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat... (SCAPE Project)
At the ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014 Rune Bruun Ferneke-Nielsen presented how the library uses Jpylyzer, a SCAPE developed tool, to validate millions of JPEG 2000 files in connection with a large newspaper digitization project.
The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
Ross King, Project Director of SCAPE, gave a short presentation of the EU funded project SCAPE, including descriptions of tools for planning and monitoring digital preservation, scalable computation and repositories, SCAPE Testbeds and where to learn more.
The presentation was given at the workshop ‘Preservation at Scale’ http://bit.ly/17ppAln in connection with the iPres2013 conference in Lisbon, Portugal, in September 2013.
Scape information day at BL - Using Jpylyzer and Schematron for validating JP... (SCAPE Project)
The SCAPE developed tool Jpylyzer has long been in production use at a variety of institutions. The British Library uses Jpylyzer in combination with Schematron to validate JPEG2000 files.
The presentation by Will Palmer was given at the ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
Preservation Policy in SCAPE - Training, Aarhus (SCAPE Project)
This presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.
Barbara Sierman, Koninklijke Bibliotheek in the Netherlands, introduced the policy concept, previous work on policies and the work that has been done within SCAPE on preservation policies. SCAPE will build a catalogue of policy elements with three levels – guidance, preservation procedure, and control policies.
Rainer Schmidt, AIT Austrian Institute of Technology, presented Scalable Preservation Workflows from SCAPE at the five-day ‘Digital Preservation Advanced Practitioner Training’ event (http://bit.ly/1fYCvMO), hosted by DPC, in Glasgow on 15-19 July 2013.
The presentation gives an introduction to the SCAPE Platform, presents scenarios from the SCAPE Testbeds, and finally describes how to create scalable workflows and execute them on the SCAPE Platform.
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014 (SCAPE Project)
Hadoop has been used at the State and University Library, Denmark, in connection with an experiment on the migration of a large collection of audio files from mp3 to wav. This experiment was presented by Bolette Ammitzbøll Jurik at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
The experiment used Hadoop and Taverna but also xcorrSound waveform-compare which is a small tool developed within SCAPE to compare the content of audio files.
Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
This presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.
Artur Kulmukhametov, Vienna University of Technology, introduced the importance of content profiling and how this can be done with the help of the SCAPE developed tool C3PO. Content profiling is based on characteristics extracted from the files’ metadata and will help the user to plan digital preservation. The tool C3PO can be easily integrated with both PLATO and Scout.
At the iPres2013 conference in Lisbon, Portugal, in September 2013 Luís Faria, KEEP SOLUTIONS LDA, presented SCAPE work on monitoring of digital repositories and the tool, Scout, which has been developed in this connection. Scout is a web-based service that assists content holders in monitoring their digital repository and provides an ontological knowledge base for compiling the information needed to detect preservation risks and opportunities.
Automatic Preservation Watch Using Information Extraction on the Web (Luis Faria)
iPRES 2013 presentation of a proof-of-concept experiment of using Information Extraction Technologies to do automatic preservation watch using natural language information on the Web.
SCAPE Information Day at BL - Characterising content in web archives with Nanite (SCAPE Project)
This presentation was given by Will Palmer at ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation Will Palmer introduced the SCAPE developed tool Nanite which can help institutions analyze their web archive data.
EOSC support for the scientific computing needs of Earth Observation with the EGI Federated Cloud
The European Open Science Cloud (EOSC) supports multi-disciplinary science, and Earth Observation is one of the major use cases.
EOSC will provide capacity and capabilities for fostering the exploitation of EO data; this can be achieved by federating the cloud providers of EGI and DIAS with data analytics tools. In this presentation, we show how EOSC can rely on a public-private cloud federation to deliver its compute platform for EO.
An image based approach for content analysis in document collections (SCAPE Project)
Reinhold Huber-Mörk of the Austrian Institute of Technology presented ‘An image based approach for content analysis in document collections’ at ISVC'13 (9th International Symposium on Visual Computing) in Rethymnon, Crete, Greece, on 31 July 2013.
The development of tools for library workflows for duplicate content detection and content verification of complex documents was presented, accompanied by results of the work.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
GridMate - End to end testing is a critical piece to ensure quality and avoid... (ThomasParaiso2)
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
20240609 QFM020 Irresponsible AI Reading List May 2024
Application scenarios of the SCAPE project at the Austrian National Library
1. Sven Schlarb
Österreichische Nationalbibliothek
LIBER Satellite Event: APARSEN & SCAPE Workshop
21 May 2014, Austrian National Library, Vienna
Application scenarios of the SCAPE project at the Austrian National Library
2. Overview
• Examples of Big Data in memory institutions
• What are the SCAPE Testbeds?
• Motivation for the Austrian National Library
• Hadoop in a nutshell
• SCAPE Platform setup at the Austrian National Library
• Selected SCAPE tools
• Application scenarios
• Web Archiving
• Austrian Books Online
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
3. • Google Books Project: 30 million digital books
• http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched
• Europeana: Metadata about over 24 million objects
• Europeana annual report and accounts 2012, Europeana Foundation, April 2013
• Hathi Trust: 10 million volumes (over 5.6 million titles) comprising over 3.7 billion book page images
• http://www.hathitrust.org/statistics_info
• Internet Archive: 364 billion pages, about 10 petabytes.
• http://archive.org and http://archive.org/web/petabox.php
Books, Journals, Newspapers, Websites. Big data?
4. SCAPE Project Overview
Takeup
• Stakeholders and Communities
• Dissemination
• Training Activities
• Sustainability
Platform
• Automation
• Workflows
• Parallelization
• Virtualization
5. • Good:
• Storing structured data
• Expressive query language
• ACID, type safety
• But:
• SQL Joins not efficient at scale
• ÖNB 2011: failed to create a complete web-archive index using single-instance MySQL (write performance!)
• Solution?
• Scaling vertically → bigger servers → hardware costs!
• Scaling horizontally → sharding → maintenance costs!
Pushing the boundaries of RDBMSs (e.g. MySQL)
6. Comparison of storage costs
• Hadoop means a cost advantage because:
• It usually runs on relatively inexpensive (commodity) hardware
• No binding to specific vendors
• Open-source software
Source: BITKOM guide ‘Big-Data-Technologien – Wissen für Entscheider’, 2014, p. 39
7. Dealing with large amounts of data
• Required to move data: from NAS to server, or to the cloud. Multi-terabyte scenarios?
• Immediate processing
• Unified storage and processing capabilities
• Distributed I/O
8. Some basic Hadoop assumptions
• When dealing with large data sets it is usually easier to bring the processor to the data than the data to the processor
• Fine-granular parallelisation: all processing cores of the cluster are used as processors
• Designed for failure: in large clusters hardware failure is the norm rather than the exception
• Redundancy: redundant storage of data blocks (default: 3 copies)
• Data locality: free nodes with direct access to the data do the processing
9. What is Hadoop (physically)?
Hadoop = MapReduce + HDFS: distributed processing (MapReduce) on top of distributed storage (HDFS).
• 2 x quad-core CPUs: 10 Map slots (parallelisation), 4 Reduce slots (aggregation)
• 4 x 1 TB hard disks with replication factor 3: 1.33 TB effective
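The effective-capacity figure follows directly from dividing the raw disk capacity by HDFS's replication factor of three:

```latex
\text{effective capacity} = \frac{\text{raw capacity}}{\text{replication factor}}
                          = \frac{4 \times 1\,\text{TB}}{3} \approx 1.33\,\text{TB}
```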
10. Configuration per CPU
Configuration of one quad-core CPU (= 1 node):
• 4 physical cores
• 8 hyper-threading cores (the system "sees" 8 cores)
• Core allocation: 1 for the OS, 5 for Map tasks, 2 for Reduce tasks
11. Experimental Cluster
Name Node / Job Tracker (master):
• CPU: 1 x 2.53 GHz quad-core CPU (8 hyper-threading cores)
• RAM: 16 GB
• DISK: 2 x 1 TB disks configured as RAID0 (performance) – 2 TB effective
Data Nodes / Task Trackers (workers):
• CPU: 2 x 2.40 GHz quad-core CPUs (16 hyper-threading cores)
• RAM: 24 GB
• DISK: 3 x 1 TB disks configured as RAID5 (redundancy) – 2 TB effective
• Of the 16 HT cores: 5 for Map, 2 for Reduce, 1 for the operating system (configuration sketch below)
• Cluster total: 25 processing cores for Map tasks, 10 processing cores for Reduce tasks
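In the MRv1 generation of Hadoop shown here (Job Tracker / Task Trackers), this Map/Reduce split is a per-node setting in mapred-site.xml. A minimal sketch with values matching the allocation above (these are the standard MRv1 property names, not the library's actual cluster files):

    <configuration>
      <!-- Concurrent map tasks per Task Tracker -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>5</value>
      </property>
      <!-- Concurrent reduce tasks per Task Tracker -->
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>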
12. What is Hadoop (conceptually)?
[Diagram: the input data is divided into input splits (here 3 splits of 3 records each, records 1–9); each split is processed by a Map task (Task 1, Task 2, Task 3); the intermediate output is sorted, shuffled, and merged; Reduce tasks then write the aggregated results as the output data.]
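In code, this model boils down to two methods. A minimal sketch of the canonical word-count job, shown only to make the Map and Reduce signatures concrete:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: called once per record of an input split; emits (word, 1) pairs
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: after sort/shuffle/merge, all values for one key arrive together
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                ctx.write(word, new IntWritable(sum));
            }
        }
    }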
13. Platform instance architecture at the Austrian National Library
• Access via REST API
• Taverna workflow engine for complex jobs
• Hive as the frontend for analytic queries
• MapReduce/Pig for Extraction, Transform, and Load (ETL)
• "Small" objects in HDFS or HBase
• "Large" digital objects stored on a NetApp filer
15. Overview of application scenarios
• Web Archiving
• Web archive MIME type identification
• Characterisation of web archive data
• Austrian Books Online
• Scenario 2: Image file format migration
• Scenario 3: Comparison of book derivatives
• Scenario 4: MapReduce in digitised book quality assurance
16. Web archiving
• Storage: ca. 45 TB
• ca. 1.7 billion objects
• Domain harvesting
• the entire top-level domain .at every 2 years
• Selective harvesting
• important websites that change regularly
• Event harvesting
• special occasions and events (e.g. elections)
17. File format identification in web archives
18. File format identification in web archives
[Diagram: (W)ARC container files, as written by the HERITRIX web crawler, hold heterogeneous records (JPG, GIF, HTML, MIDI, ...). A custom (W)ARC InputFormat with a (W)ARC RecordReader feeds the records into MapReduce; in the Map phase Apache Tika detects each record's MIME type (e.g. image/jpg), and the Reduce phase aggregates the counts per MIME type: image/jpg 1, image/gif 1, text/html 2, audio/midi 1.]
Software integration                                   Throughput (GB/min)
Tika detector API in the Map phase                     6.17
FILE as a command-line application with MapReduce      1.70
Tika JAR as a command-line application with MapReduce  0.01

Data volume   Number of ARC files   Throughput (GB/min)
1 GB          10 x 100 MB           1.57
2 GB          20 x 100 MB           2.50
10 GB         100 x 100 MB          3.06
20 GB         200 x 100 MB          3.40
100 GB        1000 x 100 MB         3.71
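The first table shows why the integration method matters: embedding the Tika detector API directly in the Map phase avoids starting an external process for every record, which makes it orders of magnitude faster than invoking the Tika JAR on the command line. A sketch of such a mapper, assuming a (W)ARC input format whose record reader hands each record's payload to the map task as a BytesWritable (the class and key names are illustrative, not the actual SCAPE code):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.tika.Tika;

    // Assumes an input format whose record reader emits (record URL, record payload)
    public class MimeDetectMapper
            extends Mapper<Text, BytesWritable, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Tika tika = new Tika(); // reused for all records: no process start-up cost

        @Override
        protected void map(Text url, BytesWritable payload, Context ctx)
                throws IOException, InterruptedException {
            // Detect the MIME type directly from the record bytes
            String mime = tika.detect(
                    new ByteArrayInputStream(payload.getBytes(), 0, payload.getLength()));
            // A summing reducer (as in the word-count sketch) aggregates counts per type
            ctx.write(new Text(mime), ONE);
        }
    }

The second table shows throughput rising with the number of ARC files: larger inputs produce more splits and keep more of the cluster's Map slots busy at once.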
19. Characterisation of web archive data
20. Characterisation of web archive data
21. Characterisation of web archive data
22. Austrian Books Online
• Public-private partnership with Google
• Only public-domain works
• Objective: scan ~ 600,000 volumes (~ 200 million pages)
• ~ 70 project team members, 20+ in the core team
• ~ 200,000 physical volumes scanned so far (~ 60 million pages)
23. ADOCO (Austrian Books Online Download & Control)
[Diagram: Google – public-private partnership – ADOCO download and control workflow; see PairTree: https://confluence.ucop.edu/display/Curation/PairTree]
24. Quality-assured image file format migration
• TIFF to JPEG2000 migration
• Objective: reduce storage costs by reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: mitigate the JPEG2000 file format obsolescence risk
• Different preservation tool categories (a wrapper sketch follows below):
• Validation
• Migration
• Quality assurance
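Tools in all three categories are typically command-line applications; a common pattern, generalised by SCAPE's ToMaR, is to wrap the invocation in a map task so each node processes the files it can reach. A minimal sketch, assuming an input text file of image paths and an encoder installed on every node (the opj_compress call and paths are purely illustrative, not the project's actual setup):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: a text file with one image path per line (illustrative setup).
    // Each map call shells out to a migration tool installed on every node.
    public class MigrationMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text pathLine, Context ctx)
                throws IOException, InterruptedException {
            String src = pathLine.toString().trim();
            String dst = src.replaceAll("\\.tif$", ".jp2");
            // Hypothetical encoder invocation; any TIFF-to-JP2 tool would do here
            Process p = new ProcessBuilder("opj_compress", "-i", src, "-o", dst)
                    .redirectErrorStream(true)
                    .start();
            int rc = p.waitFor();
            ctx.write(new Text(src), new Text(rc == 0 ? "OK" : "FAILED rc=" + rc));
        }
    }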
25. Comparison of book derivatives
• Compare different versions of the same book
• Images come from different scanning sources
• Images have been manipulated (cropped, rotated)
26. Using MapReduce for Quality Assurance
• 60,000 books, ~ 24 million pages
• Using Taverna's "Tool service" (remote ssh execution)
• Orchestration of different types of Hadoop jobs:
• Hadoop Streaming API
• Hadoop Map/Reduce
• Hive
• Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
27. Using MapReduce for Quality Assurance
[Figure: examples of correct cropping vs. a cropping error, annotated with image width (Bildbreite) and text-block width (Blockbreite)]
Assumption: a "significant" difference between the average block width and the image width is an indicator of possible text loss due to a cropping error.
28. Using MapReduce for Quality Assurance
• Create input text files with file paths (JP2 & HTML)
• Read image metadata using Exiftool (Hadoop Streaming API)
• Create a sequence file containing all HTML files
• Calculate the average block width using MapReduce (see the sketch below)
• Load the data into Hive tables
• Execute the SQL test query
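A sketch of the averaging step, assuming the sequence file delivers (page identifier, OCR HTML) pairs to the mapper and that block widths can be pulled from the markup with a simple pattern (the block-width attribute is hypothetical; real OCR formats encode geometry differently):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BlockWidthJob {
        // Hypothetical pattern; real OCR output encodes block geometry differently
        private static final Pattern WIDTH = Pattern.compile("block-width=\"(\\d+)\"");

        // Map: emit (page id, width) for every text block found on the page
        public static class WidthMapper
                extends Mapper<Text, Text, Text, IntWritable> {
            @Override
            protected void map(Text pageId, Text html, Context ctx)
                    throws IOException, InterruptedException {
                Matcher m = WIDTH.matcher(html.toString());
                while (m.find()) {
                    ctx.write(pageId, new IntWritable(Integer.parseInt(m.group(1))));
                }
            }
        }

        // Reduce: average the block widths of each page
        public static class AvgReducer
                extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text pageId, Iterable<IntWritable> widths, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0, n = 0;
                for (IntWritable w : widths) { sum += w.get(); n++; }
                if (n > 0) {
                    ctx.write(pageId, new DoubleWritable((double) sum / n));
                }
            }
        }
    }

The Hive test query then relates each page's average block width to the image width extracted by Exiftool and flags pages where the relation looks suspicious.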
29. Summary
• Libraries can build cost-efficient solutions for storing large data collections
• HDFS as storage master or staging area?
• Local cluster vs. cloud?
• Apache Hadoop offers a stable core for building a large-scale processing platform, ready to be used in production
• Carefully select the additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, YARN, Ambari, etc.) that fit your needs
30. Further information
These slides on SlideShare:
• http://de.slideshare.net/SvenSchlarb/application-scenarios-of-the-scape-project-at-the-austrian-national-library
• Project website: www.scape-project.eu
• GitHub repository: www.github.com/openplanets
• Project wiki: www.wiki.opf-labs.org/display/SP/Home
SCAPE tools mentioned:
• ToMaR: http://openplanets.github.io/ToMaR/#
• Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer
• Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
• C3PO: http://ifs.tuwien.ac.at/imp/c3po
Thank you! Questions?
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).