Rainer Schmidt, AIT Austrian Institute of Technology, presented ‘Scalable Preservation Workflows’ from SCAPE at the five-day ‘Digital Preservation Advanced Practitioner Training’ event (http://bit.ly/1fYCvMO), hosted by the DPC in Glasgow on 15-19 July 2013.
The presentation introduces the SCAPE Platform, presents scenarios from the SCAPE Testbeds, and describes how to create scalable workflows and execute them on the platform.
Deep learning has become widespread as frameworks such as TensorFlow and PyTorch have made it easy to onboard machine learning applications. However, while it is easy to start developing with these frameworks on your local developer machine, scaling up a model to run on a cluster and train on huge datasets is still challenging. Code and dependencies have to be copied to every machine and defining the cluster configurations is tedious and error-prone. In addition, troubleshooting errors and aggregating logs is difficult. Ad-hoc solutions also lack resource guarantees, isolation from other jobs, and fault tolerance.
To solve these problems and make scaling deep learning easy, we have made several enhancements to Hadoop and built an open-source deep learning platform called TonY. In this talk, Anthony and Keqiu will discuss new Hadoop features useful for deep learning, such as GPU resource support, and deep dive into TonY, which lets you run deep learning programs natively on Hadoop. We will discuss TonY's architecture and how it allows users to manage their deep learning jobs, acting as a portal from which to launch notebooks, monitor jobs, and visualize training results.
There is a lot more to Hadoop than MapReduce. An increasing number of engineers and researchers involved in processing and analyzing large amounts of data regard Hadoop as an ever-expanding ecosystem of open-source libraries, including NoSQL, scripting, and analytics tools.
How Oracle has managed to separate the SQL engine of its flagship database, which processes the queries, from the access drivers that allow reading data both from files on the Hadoop Distributed File System and from the data warehousing tool Hive.
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...Databricks
Deep Learning is now the standard in object detection, but it is not easy to analyze large amounts of images, especially in an interactive fashion. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images.
In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code because of recent developments in Apache Spark. Thanks to Spark’s unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high level information such as houses or buildings. This is possible thanks to Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale thanks to the power of Spark.
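The exact Magellan and Deep Learning Pipelines code from the talk is not reproduced here, but the following minimal PySpark sketch illustrates the idea of treating a large image corpus as just another DataFrame. The bucket path, the region-extraction pattern, and the stand-in "detection" column are assumptions for illustration only.

```python
# Minimal sketch (not the talk's code): load satellite tiles with Spark's
# built-in image data source (Spark 2.4+) and aggregate per-region counts.
# Paths, the region pattern, and the dummy detection column are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("geo-image-demo").getOrCreate()

images = spark.read.format("image").load("s3://my-bucket/satellite-tiles/")

# In the real pipeline a deep learning model would produce per-tile detections;
# here a constant stands in so the query structure stays visible.
detections = images.select(
    F.col("image.origin").alias("tile_path"),
    F.lit(1).alias("building_count"),
)

summary = (detections
           .withColumn("region", F.regexp_extract("tile_path", r"tiles/([^/]+)/", 1))
           .groupBy("region")
           .agg(F.sum("building_count").alias("buildings")))
summary.show()
```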
A summarized version of a presentation regarding Big Data architecture, covering from Big Data concept to Hadoop and tools like Hive, Pig and Cassandra
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan AgrawalDatabricks
Ingesting data from a variety of sources such as MySQL, Oracle, Kafka, Salesforce, BigQuery, S3, SaaS applications, and OSS, with billions of records, into a data lake (for reporting, ad-hoc analytics, and ML jobs) with reliability, consistency, schema-evolution support, and within the expected SLA has always been challenging. Ingestion also comes in different flavors, such as full ingestion and incremental ingestion with or without compaction/de-duplication and transformations, each with its own complexity of state management and performance. Add to that dependency management, where hundreds or thousands of downstream jobs depend on the ingested data, so on-time data availability is of utmost importance. Most data teams end up creating ad-hoc ingestion pipelines written in different languages and technologies, which adds operational overhead, and the knowledge is mostly limited to a few people.
In this session, I will talk about how we leveraged Spark's DataFrame abstraction to create a generic ingestion platform capable of ingesting data from varied sources with reliability, consistency, automatic schema evolution, and transformation support. I will also discuss how we developed Spark-based data-sanity checks as one of the core components of this platform to ensure 100% correctness of the ingested data and auto-recovery when inconsistencies are found. This talk will also cover how Hive table creation and schema modification were part of this platform, providing read-time consistency without locking while Spark ingestion jobs were writing to the same Hive tables, and how we maintained different versions of the ingested data so that we could roll back if required and let users go back in time and read a snapshot of the data at that moment.
After this talk, you should understand the challenges involved in ingesting data reliably from different sources and how Spark's DataFrame abstraction can be leveraged to solve them in a unified way.
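As a hedged illustration of the DataFrame-based pattern described above (not the speaker's actual platform), the sketch below reads one of the supported source types through JDBC, stamps load metadata, and appends to a data lake path. The connection details, table names, and target path are placeholders.

```python
# Hedged sketch of a generic, DataFrame-based ingestion step.
# Connection details, table names, and the target path are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generic-ingestion").getOrCreate()

# 1. Read from one of many possible sources through the same DataFrame API.
source_df = (spark.read
             .format("jdbc")
             .option("url", "jdbc:mysql://db-host:3306/sales")
             .option("dbtable", "orders")
             .option("user", "ingest")
             .option("password", "secret")
             .load())

# 2. Apply source-agnostic transformations, e.g. stamp each row with load metadata.
enriched = (source_df
            .withColumn("_ingested_at", F.current_timestamp())
            .withColumn("_load_date", F.current_date()))

# 3. Append to the data lake; Parquet keeps the schema with the data, so new
#    columns in later loads can be merged by readers (mergeSchema) at query time.
(enriched.write
 .mode("append")
 .partitionBy("_load_date")
 .parquet("hdfs:///datalake/raw/sales/orders"))
```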
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
Devon Energy is a Fortune 500 company focused on unconventional upstream oil and gas production. With a companywide focus on innovation and data-driven decision making, IT has been challenged to make more data available to more people more quickly. To this end, we have leveraged the scale of Microsoft Azure and Databricks’ Unified Analytics Platform to help reimagine our integration, data warehousing and analytics landscape to improve agility while moving our workloads to the cloud. We are in the third year of this transformation and have lessons learned around improving the testability of data pipelines, code management, model training and deployment, promotion, and user empowerment. In this talk, we will share our experience managing the lifecycle of data engineering and machine learning solutions and striking the balance between agility and reliability in a single platform, while democratizing data access to users from all disciplines across the company.
Author: Paul Bruffett
Doug Cutting on the State of the Hadoop EcosystemCloudera, Inc.
Doug Cutting, Apache Hadoop Co-founder, explains how the growth of the Hadoop ecosystem has made Hadoop a much more powerful machine, and how the continued expansion will lead to great things.
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout which addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, listed below (a minimal usage sketch follows the list):
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
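As a hedged sketch of how these features surface to users (not taken from the talk, which predates the Spark 3 catalog API shown here), the snippet below creates an Iceberg table, evolves its schema, and reads a consistent snapshot from Spark SQL. The catalog name, warehouse location, and table name are assumptions, and the Iceberg runtime jar must be on the Spark classpath.

```python
# Hedged sketch of Iceberg usage from Spark SQL; catalog name, warehouse
# location, and table names are placeholders, not details from the talk.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("iceberg-demo")
         .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.demo.type", "hadoop")
         .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
         .getOrCreate())

# Table layout and partitioning live in Iceberg metadata, so query planning
# never needs directory listings.
spark.sql("""
  CREATE TABLE IF NOT EXISTS demo.db.events (
    id BIGINT, ts TIMESTAMP, payload STRING)
  USING iceberg
  PARTITIONED BY (days(ts))
""")

# Schema evolution is a metadata-only change.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (source STRING)")

# Reads see a consistent snapshot without taking locks.
spark.sql("SELECT count(*) FROM demo.db.events").show()
```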
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop plays a critical role in the data infrastructure. We want to talk about the journey of Hadoop at Uber and our future plans for scaling to billions of trips, about the most unique use cases Uber has, and about how Hadoop and the ecosystem we built helped us on this journey. We will describe how we scaled from 10 to 2,000 nodes and how we plan to scale to tens of thousands of nodes, share our mistakes, learnings, and wins, and explain how we process billions of events per day. We will also cover unique challenges and real-world use cases, and how we will co-locate Uber’s service architecture with batch workloads (e.g. data pipelines, machine learning, and analytical workloads). Uber has made many improvements to the current Hadoop ecosystem and has solved some problems in ways that have not been tried before. This presentation should encourage the audience to use these examples and to enhance the ecosystem themselves, which grows the community around these projects and helps the whole big data space. The audience is anybody working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes. The talk will help them understand the Hadoop ecosystem and how to use it efficiently, and will introduce some of the technologies the Uber team is building in the big data space.
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDatabricks
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization.
A comprehensive overview of the full range of Hadoop operations and tools: cluster management, coordination, ingestion, streaming, formats, storage, resources, processing, workflow, analysis, search, and visualization.
The Hadoop Distributed File System (HDFS) has evolved from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where organizations store all of their data. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling the storage management of HDFS: the centralized scheme within the NameNode becomes the main bottleneck and limits the total number of files that can be stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it handles large numbers of small files inefficiently under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
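The scaling limit described above is easiest to see with a back-of-the-envelope calculation. The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly cited rule of thumb, not a number from the talk, but it illustrates why a single NameNode struggles with billions of small files.

```python
# Rough NameNode heap estimate, assuming ~150 bytes of heap per namespace
# object (file or block). The constants are illustrative rules of thumb.
BYTES_PER_OBJECT = 150
BLOCKS_PER_FILE = 1.5  # small files typically occupy one block, larger ones more

def namenode_heap_gb(num_files):
    objects = num_files * (1 + BLOCKS_PER_FILE)  # each file plus its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

for files in (100_000_000, 1_000_000_000, 5_000_000_000):
    print(f"{files:>13,} files -> ~{namenode_heap_gb(files):,.0f} GB of NameNode heap")
```

At a billion files this already reaches several hundred gigabytes of heap, which is what motivates moving block management out of the NameNode and into distributed storage containers.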
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
The SCAPE developed tool Jpylyzer has long been in production use at a variety of institutions. The British Library uses Jpylyzer in combination with Schematron to validate JPEG2000 files.
The presentation by Will Palmer was given at the ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
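A hedged sketch of the Jpylyzer-plus-Schematron pattern described above: run jpylyzer on a JP2 file, then validate its XML report against an institutional Schematron policy. The file names are placeholders, and the snippet assumes jpylyzer and lxml are installed locally.

```python
# Hedged sketch: characterise a JP2 with jpylyzer, then check the XML report
# against a Schematron policy. Paths are placeholders.
import subprocess
from lxml import etree, isoschematron

# 1. jpylyzer writes its validation/characterisation report as XML to stdout.
jpylyzer_xml = subprocess.run(
    ["jpylyzer", "balloon.jp2"], capture_output=True, check=True
).stdout
report = etree.fromstring(jpylyzer_xml)

# 2. Validate the report against a policy expressed in Schematron.
schematron = isoschematron.Schematron(etree.parse("jp2_policy.sch"))
if schematron.validate(report):
    print("File conforms to policy")
else:
    print(schematron.validation_report)
```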
Taverna and myExperiment. SCAPE presentation at a Hack-a-thonSCAPE Project
Presentation by Alexandra Nenadic, University of Manchester, on how to create workflows in Taverna and how the SCAPE project shares its workflows via myExperiment.
Presented at 'Practical Tools for Digital Preservation: A Hack-a-thon' in York, September 28, 2011.
Matchbox tool. Quality control for digital collections – SCAPE Training event...SCAPE Project
This is an introduction to the Matchbox tool, a tool for quality control for digital collections. The introduction was given at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012. Presenters were Roman Graf and Reinhold Huber-Mörk from Austrian Institute of Technology and Alexander Schindler from Vienna University of Technology.
TAVERNA Components - Semantically annotated and sharable units of functionalitySCAPE Project
Donal Fellows from School of Computer Science at University of Manchester gave a talk on Taverna Components at the 14th Annual Bioinformatics Open Source Conference (BOSC 2013) in Berlin, July 2013. The talk describes the usefulness of components and how they are implemented and used.
The work behind this presentation was based on SCAPE as well as the BioVeL project and the WF4Ever project.
Planets, OPF & SCAPE - presentation of tools on digital preservationSCAPE Project
Andrew Jackson from British Library presents digital preservation tools from the EU projects Planets and SCAPE and the Open Planets Foundation which is a network providing practical solutions and expertise in digital preservation.
Presented at 'Practical Tools for Digital Preservation: A Hack-a-thon' in York, September 28, 2011.
Characterisation - 101. An introduction to the identification and characteris...SCAPE Project
This is an introduction to the identification and characterization of file formats and which tools can be used for this. The intro was given by Carl Wilson from Open Planets Foundation at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014SCAPE Project
Hadoop has been used at the State and University Library, Denmark, in connection with an experiment on the migration of a large collection of audio files from mp3 to wav. This experiment was presented by Bolette Ammitzbøll Jurik at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
The experiment used Hadoop and Taverna, but also xcorrSound waveform-compare, a small tool developed within SCAPE to compare the content of audio files (a small sketch of the migrate-and-compare step follows below).
Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
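For illustration, here is a hedged sketch of one migrate-and-compare step run locally on a single file; the tool invocations (ffmpeg, mpg123, and xcorrSound's waveform-compare) are assumptions about a local install, not the library's actual workflow, which wrapped such steps as Hadoop jobs.

```python
# Hedged sketch of one migration + QA step, run on a single file outside Hadoop.
# Tool names and flags are assumptions; the SCAPE experiment ran this at scale.
import subprocess
import sys

src = "recording.mp3"

# 1. Migrate mp3 -> wav with one decoder.
subprocess.run(["ffmpeg", "-y", "-i", src, "migrated.wav"], check=True)

# 2. Produce an independent reference decode with a second decoder.
subprocess.run(["mpg123", "-w", "reference.wav", src], check=True)

# 3. Compare the two waveforms; a non-zero exit code flags a suspect migration.
result = subprocess.run(["waveform-compare", "migrated.wav", "reference.wav"])
sys.exit(0 if result.returncode == 0 else 1)
```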
SCAPE Preservation Platform. Design and DeploymentSCAPE Project
Rainer Schmidt, AIT Austrian Institute of Technology GmbH, gave an architectural overview of the SCAPE preservation platform, its system requirements, and a flexible deployment model for dynamically reconfiguring the system, and provided initial insights on employing an open-source cloud platform for its realization.
Jpylyzer, a validation and feature extraction tool developed in SCAPE projectSCAPE Project
Jpylyzer is a tool for validation and feature extraction for the JP2 (JPEG 2000 Part 1) still image format. The tool is being developed in the SCAPE Project and was presented by Johan van der Knijff at Archiving 2012 in Copenhagen.
Audio Quality Assurance. An application of cross correlationSCAPE Project
Jesper Sindahl Nielsen, State and University Library, Denmark, presented algorithms for automated quality assurance of audio files in the context of preservation actions and access. Cross-correlation is used to compare the sound waves.
In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, 144-149.
ISBN 978-0-9917997-0-1
Duplicate detection for quality assurance of document image collectionsSCAPE Project
Reinhold Huber-Mörk, Austrian Institute of Technology, presented a method for quality assurance of scanned content based on computer vision at iPres 2012, Toronto.
In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, 136-143.
ISBN 978-0-9917997-0-1
PDF/A-3 for preservation. Notes on embedded files and JPEG2000SCAPE Project
Johan van der Knijff, the National Library of the Netherlands, presented his views on ‘PDF/A-3 for preservation’ based on notes on embedded files and JPEG2000.
The presentation was given at a DPC briefing (http://bit.ly/1b487mD) which introduced and reviewed recent developments with the PDF/A standard, with particular emphasis on PDF/A version 3, published in October 2012. The meeting took place in Leeds on 13 March 2013.
This is a general presentation of the EU Project SCAPE, http://www.scape-project.eu from 2011. The project is about large-scale digital preservation and runs from 2011 to 2014.
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...SCAPE Project
The State and University Library, Denmark, hosted an information and demonstration day on 25 June 2014 for delegates from other large cultural heritage institutions in Denmark. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
One of the presentations was given by Asger Askov Blekinge who showed how the library has worked on integrating its digital object management system with Hadoop. The library is currently digitizing 32 million newspaper pages and is using Hadoop map/reduce jobs to do quality assurance on the digitized files with the help of the SCAPE Stager/Loader so updated QA’ed files are stored in the repository.
Quality assurance for document image collections in digital preservation SCAPE Project
Reinhold Huber-Mörk, AIT Austrian Institute of Technology, gave a presentation on ‘Quality assurance for document image collections in digital preservation’ at the Acivs conference in Brno, Czech Republic in September 2012. Acivs is short for Advanced Concepts for Intelligent Vision Systems and focuses on techniques for building adaptive, intelligent, safe and secure imaging systems.
SCAPE - Building Digital Preservation InfrastructureSCAPE Project
Dr. Ross King, AIT Austrian Institute of Technology GmbH, gave an invited talk about the FP7 project SCAPE at the eSciDoc Days in Berlin, October 27, 2011, https://www.escidoc.org/JSPWiki/en/ESciDocDays.
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
This presentation was given by Will Palmer at ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation Will Palmer introduced Hadoop and the way the British Library and SCAPE have used Hadoop to process large-scale data.
Evolving Domains, Problems and Solutions for Long Term Digital PreservationSCAPE Project
Overview of FP7 projects, including ARCOMEM, ENSURE, SCAPE and TIMBUS. Presentation by Dr. Ross King, AIT Austrian Institute of Technology GmbH, at iPres 2011, Singapore. In: Proceedings of the 8th International Conference on Preservation of Digital Objects (iPRES 2011), 2011, 194-204. ISBN 978-981-07-0441-4
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSven Schlarb
Presentation of the European project SCAPE (www.scape-project.eu) at the Elag2013 conference in Gent/Belgium. The presentation includes details about use cases and implementation at the Austrian National Library.
Ross King, Project Director of SCAPE, gave a short presentation of the EU funded project SCAPE, including descriptions of tools for planning and monitoring digital preservation, scalable computation and repositories, SCAPE Testbeds and where to learn more.
The presentation was given at the workshop ‘Preservation at Scale’ (http://bit.ly/17ppAln) in connection with the iPres2013 conference in Lisbon, Portugal, in September 2013.
This presentation contains the following slides:
Introduction To OLAP
Data Warehousing Architecture
The OLAP Cube
OLTP Vs. OLAP
Types Of OLAP
ROLAP V/s MOLAP
Benefits Of OLAP
Introduction - Apache Kylin
Kylin - Architecture
Kylin - Advantages and Limitations
Introduction - Druid
Druid - Architecture
Druid vs Apache Kylin
References
For any queries, contact us at argonauts007@gmail.com.
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
Big data represents a real challenge that is at once technical, commercial, and societal: exploiting massive data opens up possibilities for radical transformation of companies and of how data is used. At least, provided we are technically capable of it, because acquiring, storing, and exploiting massive quantities of data pose real technical challenges.
A big data architecture covers the creation and administration of all the technical systems that make proper exploitation of the data possible.
There is an enormous variety of tools for handling massive quantities of data, for storage, analysis, or distribution, for example. But how do you assemble these different tools into an architecture that can scale, tolerate failures, and be extended easily, all without letting costs explode?
The success of a big data operation depends on its architecture, on the right infrastructure, and on the use made of it: ‘Data into Information into Value’.
A big data architecture is composed of four main parts: Integration, Data Processing & Storage, Security, and Operations.
The analysis of large amounts of data requires a NoSQL database, a software framework that supports distributed computing, and a search engine. On these fronts, Amazon Web Services provides the services DynamoDB, Elastic MapReduce, and CloudSearch.
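As a hedged sketch of touching those three services from Python with boto3 (the table, cluster, and domain names, region, and endpoint are placeholders, not part of the original webinar):

```python
# Hedged sketch: the three AWS services named above, driven from boto3.
# Names, region, instance types, and the CloudSearch endpoint are placeholders.
import boto3

# NoSQL storage: write an item to a DynamoDB table.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
dynamodb.Table("events").put_item(Item={"event_id": "42", "payload": "hello"})

# Distributed computing: start a small Elastic MapReduce (EMR) cluster.
emr = boto3.client("emr", region_name="eu-west-1")
emr.run_job_flow(
    Name="analysis-cluster",
    ReleaseLabel="emr-6.15.0",
    Instances={"InstanceCount": 3,
               "MasterInstanceType": "m5.xlarge",
               "SlaveInstanceType": "m5.xlarge"},
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# Search: query an existing CloudSearch domain endpoint.
search = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-demo-xxxx.eu-west-1.cloudsearch.amazonaws.com")
print(search.search(query="preservation")["hits"]["found"])
```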
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing the incoming data in real time. Performing large-scale stream analysis requires specialized tools and techniques which have become widely available in the last couple of years. This talk will give a deep, technical overview of the Apache stream processing landscape. We compare several frameworks, including Flink, Spark, Storm, Samza and Apex. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner to help you select the best tools for your specific applications. We will touch on the topics of API expressivity, runtime architecture, performance, fault tolerance and strong use cases for the individual frameworks. This talk is targeted at anyone interested in streaming analytics, whether from a user's or a contributor's perspective. Attendees can expect to get a clear view of the available open-source stream processing architectures.
In this webinar, we'll see how to use Spark to process data from various sources in R and Python and how new tools like Spark SQL and data frames make it easy to perform structured data processing.
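A minimal PySpark sketch of the kind of structured processing described (the JSON path and column names are placeholders; the webinar's own examples also cover R, which is not shown here):

```python
# Minimal sketch: load a semi-structured source and query it both through the
# DataFrame API and through Spark SQL. Path and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("webinar-demo").getOrCreate()

# Schema is inferred from the JSON records.
logs = spark.read.json("hdfs:///data/access_logs/*.json")

# DataFrame API ...
top_pages = (logs.filter(logs.status == 200)
                 .groupBy("page").count()
                 .orderBy("count", ascending=False))

# ... or the equivalent via Spark SQL on a temporary view.
logs.createOrReplaceTempView("logs")
top_pages_sql = spark.sql(
    "SELECT page, count(*) AS hits FROM logs "
    "WHERE status = 200 GROUP BY page ORDER BY hits DESC")

top_pages.show(10)
```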
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
This presentation was given by Will Palmer at ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation Will Palmer introduced the SCAPE developed tool Nanite which can help institutions analyze their web archive data.
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
The British Library hosted a ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Some tools were presented and demonstrated in more detail (see the other presentations) and the day was closed with a presentation by Will Palmer, Carl Wilson and Peter May of some of the other outputs that SCAPE has delivered.
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Project
Alecs Geuder from the British Library presented a new SCAPE-developed tool called ‘Flint’ at the ‘SCAPE Information Day at the British Library’, on 14 July 2014. Flint is a format and file validation tool which can be used to validate your files and/or formats against a policy. At the British Library, Flint is used to deal with non-print legal deposit.
The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
This presentation originates from a webinar presented by Luís Faria. The webinar presents the SCAPE-developed tools Scout and C3PO and demonstrates how to identify preservation risks in your content and, at the same time, share your content profile information with others to open new opportunities.
Scout, the preservation watch system, centralizes all the necessary knowledge on the same platform, cross-referencing this knowledge to uncover all preservation risks. Scout automatically fetches information from several sources to populate its knowledge base. For example, Scout integrates with C3PO to get large-scale characterization profiles of content. Furthermore, Scout aims to be a knowledge exchange platform, to allow the community to bring together all the necessary information into the system. The sharing of information opens new opportunities for joining forces against common problems.
The webinar was held 26 June 2014.
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
This presentation was given by Per Møldrup-Dalum at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
This presentation gives an overview of the project, its results, and how it will be sustained. For more information, see this blog post, http://bit.ly/SCAPE_SB_Demo, about the event.
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
At the ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014 Rune Bruun Ferneke-Nielsen presented how the library uses Jpylyzer, a SCAPE developed tool, to validate millions of JPEG 2000 files in connection with a large newspaper digitization project.
The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
Hadoop and its applications at the State and University Library, SCAPE Inform...SCAPE Project
Per Møldrup-Dalum introduced how the State and University Library in Denmark has deployed Hadoop in connection with the SCAPE project. With Hadoop, the library has been able to process large amounts of data much faster than was possible before.
The presentation was given at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. For more information about the demo day, see this blog post, http://bit.ly/SCAPE_SB_Demo, about the event.
This presentation describes the EU-funded project SCAPE – Scalable Preservation Environments –, its developments and sustainability plans.
The SCAPE project has developed scalable services for planning and execution of institutional preservation strategies on an open source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects.
The project run-time was around 3½ years from 2011 to 2014.
Read more about SCAPE at www.scape-project.eu
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
Sven Schlarb from the Austrian National Library gave an overview of the library's different application scenarios related to Web Archiving and the Austrian Books Online project.
The presentation was given at the LIBER Satellite Event on Long term accessibility of digital resources in theory and practice, https://liber2014.univie.ac.at/satellite-event/, in Vienna on 21 May 2014.
This presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.
Artur Kulmukhametov, Vienna University of Technology, introduced the importance of content profiling and how this can be done with the help of the SCAPE developed tool C3PO. Content profiling is based on characteristics extracted from the files’ metadata and will help the user to plan digital preservation. The tool C3PO can be easily integrated with both PLATO and Scout.
The presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.
Catherine Jones, Science and Technology Facilities Council, presented the concept of control policies and what is needed to produce machine understandable control policies.
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
This presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.
Barbara Sierman, Koninklijke Bibliotheek in the Netherlands, introduced the policy concept, previous work on policies and the work that has been done within SCAPE on preservation policies. SCAPE will build a catalogue of policy elements with three levels – guidance, preservation procedure, and control policies.
An image based approach for content analysis in document collectionsSCAPE Project
Reinhold Huber-Mörk of the Austrian Institute of Technology presented ‘An image based approach for content analysis in document collections’ at ISVC'13 (9th International Symposium on Visual Computing) in Rethymnon, Crete, Greece, on 31 July 2013.
The development of tools for library workflows for duplicate content detection and content verification for complex documents was presented, accompanied by results of the work.
Sven Schlarb of the Austrian National Library presented SCAPE (in German). Besides giving a general overview of SCAPE the presentation also includes descriptions of SCAPE solutions, including tools, software integration, planning, and more.
The presentation was given at the Austrian Library day on ‘National Initiatives on Digital Information. Repositories, Research data and long-term preservation in Austria’ (http://www.obvsg.at/voeb-obvsg-bibliothekstage-2013/programm-410/) on 4 October 2013 in Vienna.
At the iPres2013 conference in Lisbon, Portugal, in September 2013 Luís Faria, KEEP SOLUTIONS LDA, presented SCAPE work on monitoring of digital repositories and the tool, Scout, which has been developed in this connection. Scout is a web-based service that assists content holders in monitoring their digital repository and provides an ontological knowledge base for compiling the information needed to detect preservation risks and opportunities.
Barbara Sierman, the National Library of the Netherlands, presented ‘Policy levels in SCAPE’ at the iPres2013 conference in Lisbon, Portugal, in September 2013.
The policy work is part of the SCAPE project and is based on an analysis of digital preservation policies from partner institutions.
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...SCAPE Project
Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012SCAPE Project
This presentation is an introduction to Digital Preservation given by David Tarrant, Open Planets Foundation, at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
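As a hedged sketch of the Python binding mentioned above (assuming the pypowsybl package; this is not the workshop notebook itself), one can load a bundled example network, run an AC power flow, and inspect bus voltages:

```python
# Hedged sketch of driving PowSyBl from Python via pypowsybl (assumed installed).
import pypowsybl as pp

# Load one of the example networks bundled with pypowsybl.
network = pp.network.create_ieee14()

# Run an AC load flow and print the solver status of the main component.
results = pp.loadflow.run_ac(network)
print(results[0].status)

# Bus voltages are exposed as a pandas DataFrame.
print(network.get_buses()[["v_mag", "v_angle"]].head())
```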
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed in releasing software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI-powered automation technology capabilities of UiPath. Hosted with our local partner Marc Ellis, you will also enjoy a half-day packed with industry insights and networking with automation peers.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35: Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk encourages a more independent approach to using PHP frameworks, moving towards more flexible and future-proof PHP development.
Scalable Preservation Workflows
1. SCAPE
Rainer Schmidt
DP Advanced Practitioners Training
July 16th, 2013
University of Glasgow
Scalable Preservation Workflows
design, parallelisation, and execution
2. SCAlable Preservation Environments
SCAPE
2
• European Commission FP7 Integrated Project
• 16 Organizations, 8 Countries
• 42 months: February 2011 – July 2014
• Budget: 11.3 Million Euro (8.6 Million Euro funded)
• Consortium: data centers, memory institutions,
research centers, universities & commercial partners
• recently extended to involve HPC centers.
• Dealing with (digital) preservation processes at scale
• such as ingestion, migration, analysis and monitoring
of digital data sets
• Focus on scalability, robustness, and automation.
The Project
3. SCAlable Preservation Environments
SCAPE
3
What I will show you
• Example Scenarios from the SCAPE DL Testbed and how
they are formalized using Workflow Technology
• Introduction to the SCAPE Platform: underlying technologies, preservation services, and how to set it up.
• How the paradigm differs from a client-server set-up, and whether a standard tool can be executed against my data.
• How to create scalable workflows and execute them on
the platform.
• A practical demonstration (and available VM) for creating
and running such workflows.
5. SCAlable Preservation Environments
SCAPE
5
• Ability to process large and
complex data sets in
preservation scenarios
• Increasing amount of data in
data centers and memory
institutions
Volume, Velocity, and Variety
of data
cf. Jisc (2012) Activity Data: Delivering benefits from the data deluge.
available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx
Motivation
6. SCAlable Preservation Environments
SCAPE
Austrian National Library (ONB)
• Web Archiving
• Scenario 1: Web Archive Mime Type Identification
• Austrian Books Online
• Scenario 2: Image File Format Migration
• Scenario 3: Comparison of Book Derivatives
• Scenario 4: MapReduce in Digitised Book Quality Assurance
7. SCAlable Preservation Environments
SCAPE
• Physical storage 19 TB
• Raw data 32 TB
• Number of objects
1.241.650.566
• Domain harvesting
• Entire top-level-domain
.at every 2 years
• Selective harvesting
• Interesting frequently
changing websites
• Event harvesting
• Special occasions and
events (e.g. elections)
Web Archiving - File Format identification
8. SCAlable Preservation Environments
SCAPE
• Public private partnership with
Google Inc.
• Only public domain
• Objective to scan ~ 600.000 Volumes
• ~ 200 Mio. pages
• ~ 70 project team members
• 20+ in core team
• ~ 130K physical volumes scanned
• ~ 40 Mio pages
Austrian Books Online
10. SCAlable Preservation Environments
SCAPE
• Task: Image file format migration
• TIFF to JPEG2000 migration
• Objective: Reduce storage costs by
reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: Mitigation of the JPEG2000 file format obsolescence risk
• Challenges:
• Integrating validation, migration,
and quality assurance
• Computing intensive quality
assurance
Image file format migration
11. SCAlable Preservation Environments
SCAPE
Comparison of book derivatives – Matchbox tool
• Quality Assurance for different book versions
• Images have been manipulated (cropped,
rotated) and stored in different locations
• Images subject to different modification
procedures
• Detailed image comparison and detection of
near duplicates and corresponding images
• Feature extraction invariant under color
space, scale, rotation, cropping
• Detecting visual keypoints and
structural similarity
• Automated Quality Assurance workflows
• Austrian National Library - Book scan project
• The British Library - “Dunhuang” manuscripts
12. SCAlable Preservation Environments
SCAPE
Data Preparation and QA
• Goal: Preparing large document collections for data analysis.
• Example: Detecting quality issues due to cropping errors.
• Large volumes of HTML files generated as part of a book
collection
• Representing layout and text of corresponding book page
• HTML tags representing e.g. width and height of text or image block
• QA Workflow using multiple tools (sketched after this list)
• Generate image metadata using Exiftool
• Parse HTML and calculate the block size of each book page
• Normalize the data and load it into a database
• Execute a query to detect quality issues
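A minimal command-line sketch of the first two steps, assuming a scanned page image and its HTML rendering sit side by side (the file names and the HTML attribute layout are hypothetical):
  # Step 1: extract the image dimensions with Exiftool (CSV output for later normalisation)
  exiftool -csv -ImageWidth -ImageHeight page_0001.jp2 > page_0001_image.csv
  # Step 2: pull the width/height attributes of the text and image blocks out of the HTML
  # (real book-page HTML may use different attribute names)
  grep -o 'WIDTH="[0-9]*" HEIGHT="[0-9]*"' page_0001.html > page_0001_blocks.txt
The normalised values would then be loaded into a database (or a Hive table, as discussed later) and queried per page to flag cropping errors.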
14. SCAlable Preservation Environments
SCAPE
Goal of the SCAPE Platform
• Hardware and software platform to support scalable
preservation in terms of computation and storage.
• Employing a scale-out architecture to support preservation activities against large amounts of data.
• Integration of existing tools, workflows, and
data sources and sinks.
• A data center service providing a scalable execution
and storage backend for different object management
systems.
• Based on a minimal set of defined services for processing tools and/or queries close to the data.
15. SCAlable Preservation Environments
SCAPE
Underlying Technologies
• The SCAPE Platform is built on top of existing data-intensive
computing technologies.
• Reference Implementation leverages Hadoop Software Stack (HDFS,
MapReduce, Hive, …)
• Virtualization and packaging model for dynamic deployments of
tools and environments
• Debian packages and IaaS support.
• Repository Integration and Services
• Data/Storage Connector API (Fedora and Lily)
• Object Exchange Format (METS/PREMIS representation)
• Workflow modeling, translation, and provisioning.
• Taverna Workbench and Component Catalogue
• Workflow Compiler and Job Submission Service
16. SCAlable Preservation Environments
SCAPE
16
Components of the Platform
• Execution Platform
• Deploy SCAPE tools and parallelized (WF) applications
• Executable via CLI and Service API
• Scripts/Drivers aiding integration.
• Workflow Support
• Describe and validate preservation workflows using a
defined component model
• Register and semantic search using Component Catalogue
• Repository Integration
• Fedora implementation on top of CI
• Loader Application, Object Model, and Connector APIs.
20. SCAlable Preservation Environments
SCAPE
• Open-source software framework for large-scale data-
intensive computations running on large clusters of
commodity hardware.
• Derived from Google's File System and MapReduce publications.
• Hadoop = MapReduce + HDFS
• MapReduce: Programming Model (Map, Shuffle/Sort,
Reduce) and Execution Environment.
• HDFS: Virtual distributed file system overlay on top of local
file systems.
Hadoop Overview #1
21. SCAlable Preservation Environments
SCAPE
• Designed for a write-once, read-many-times access model.
• Data IO is handled via HDFS (see the command sketch below).
• Data is divided into blocks (typically 64MB) that are distributed and replicated over the data nodes.
• Parallelization logic is strictly separated from the user program.
• Automated data decomposition and communication between processing steps.
• Applications benefit from built-in support for data locality and fail-safety.
• Applications scale out on big clusters, processing very large data volumes.
Hadoop Overview #2
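Before any processing starts, the payload has to be moved onto HDFS; a minimal sketch using the standard `hadoop fs` shell (the paths are hypothetical):
  # Copy a local image collection into HDFS; the framework splits it into 64MB blocks
  # and replicates them across the data nodes.
  hadoop fs -mkdir /user/scape/tiff-masters
  hadoop fs -put /data/tiff-masters/*.tif /user/scape/tiff-masters/
  # Inspect what landed on the cluster
  hadoop fs -ls /user/scape/tiff-masters
  hadoop fs -du /user/scape/tiff-masters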
22. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
22
[Diagram: input data is divided into input splits containing records; Map tasks process the splits in parallel; a Sort/Shuffle/Merge phase groups the intermediate output; Reduce tasks produce the aggregated results that are written to the output data.]
23. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
23
Map takes <k1, v1> and transforms it to <k2, v2> pairs
[Same Map/Shuffle-Sort/Reduce diagram as above, highlighting the Map phase.]
24. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
24
Shuffle/Sort takes <k2, v2> and transforms it to <k2, list(v2)>
[Same diagram as above, highlighting the Shuffle/Sort phase.]
25. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
25
Reduce takes <k2, list(v2)> and transforms it to <k3, v3>
[Same diagram as above, highlighting the Reduce phase.]
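The same key/value flow can be illustrated locally with plain UNIX tools; this is only a sketch of the model (the input file is hypothetical), not how Hadoop itself is invoked:
  # mime-types.txt holds one MIME type per line, e.g. "text/html", "image/jpeg", ...
  # Map: emit <mime-type, 1>; Shuffle/Sort: group equal keys; Reduce: sum per key.
  cat mime-types.txt |
    awk '{print $1 "\t1"}' |
    sort |
    awk -F'\t' '{c[$1] += $2} END {for (k in c) print k "\t" c[k]}'
On Hadoop the sorting and grouping happen inside the framework between the Map and Reduce tasks, and every stage runs distributed across the cluster.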
27. SCAlable Preservation Environments
SCAPE
Platform Deployment
• There is no prescribed deployment model
• Private, institutionally-shared, external data center
• Possible to deploy on “bare-metal” or using
virtualization and cloud middleware.
• Platform Environment packaged as VM image
• Automated and scalable deployment.
• Presently supporting Eucalyptus (and AWS) clouds.
• SCAPE provides two shared Platform instances
• Stable non-virtualized data-center cluster
• Private-cloud based development cluster
• Partitioning and dynamic reconfiguration
28. SCAlable Preservation Environments
SCAPE
Deploying Environments
• IaaS enabling packaging and dynamic deployment of (complex)
Software Environments
• But requires complex virtualization infrastructure
• Data-intensive technology is able to deal with a constantly
varying number of cluster nodes.
• Node failures are expected and automatically handled
• System can grow/shrink on demand
• A Network Attached Storage solution can be used as a data source
• But it does not meet the scalability and performance needs for computation
• SCAPE Hadoop Clusters
• Linux + Preservation tools + SCAPE Hadoop libraries
• Optionally Higher-level services (repository, workflow, …)
29. SCAlable Preservation Environments
SCAPE
ONB Experimental Cluster
[Cluster diagram: Job Tracker, Task Trackers, Data Nodes, Name Node]
CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (performance) – 2 TB effective
• Of 16 HT cores: 5 for Map; 2 for Reduce; 1 for operating system.
• 25 processing cores for Map tasks and 10 cores for Reduce tasks in total
CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (redundancy) – 2 TB effective
30. SCAlable Preservation Environments
SCAPE
SCAPE Shared Clusters
• AIT (dev. cluster)
• 10 dual core nodes, 4 six-core
nodes, ~85 TB disk storage.
• Xen and Eucalyptus virtualization
and cloud management
• IMF (central instance)
• Low consumption machines in
NoRack column
• dual core AMD 64-bit processor,
8GB RAM, 15TB on 5 disks
• production data center facility
32. SCAlable Preservation Environments
SCAPE
32
• Wrapping Sequential Tools
• Using a wrapper script (Hadoop Streaming API)
• PT’s generic Java wrapper allows one to use pre-defined
patterns (based on toolspec language)
• Works well for processing a moderate number of files
• e.g. applying migration tools or FITS.
• Writing a custom MapReduce application
• Much more powerful and usually performs better.
• Suitable for more complex problems and file formats, such
as Web archives.
• Using a high-level language like Hive or Pig
• Very useful for analysing (semi-)structured data, e.g. characterization output (see the sketch below).
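As a rough sketch of the third approach, characterization output loaded into a Hive table could be analysed from the command line like this (the table and column names are hypothetical):
  # Count objects per detected MIME type in a (hypothetical) characterization table
  hive -e "SELECT mime_type, COUNT(*) AS objects
           FROM characterization_output
           GROUP BY mime_type
           ORDER BY objects DESC;"
Hive compiles the query into MapReduce jobs, so it uses the same cluster resources as the wrapper and custom-application approaches.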
33. SCAlable Preservation Environments
SCAPE
• Preservation tools and libraries are pre-packaged so they
can be automatically deployed on cluster nodes
• SCAPE Debian Packages
• Supporting SCAPE Tool Specification Language
• MapReduce libs for processing large container files
• For example METS and (W)arc RecordReader
• Application Scripts
• Based on Apache Hive, Pig, Mahout
• Software components to assemble complex data-parallel workflows
• Taverna and Oozie Workflows
Available Tools
34. SCAlable Preservation Environments
SCAPE
34
Sequential Workflows
• In order to run a workflow (or activity) on the cluster it will
have to be parallelized first!
• A number of different parallelization strategies exist
• Approach typically determined on a case-by-case basis
• May lead to changes of activities, workflow structure, or
the entire application.
• Automated parallelization will only work to a certain degree
• Trivial workflows can be deployed/executed without requiring individual parallelization (wrapper approach).
• SCAPE driver program for parallelizing Taverna workflows.
• SCAPE template workflows developed for different institutional scenarios.
35. SCAlable Preservation Environments
SCAPE
35
Parallel Workflows
• Are typically derived from sequential (conceptual) workflows
created for desktop environment (but may differ
substantially!).
• Rely on MapReduce as the parallel programming model and
Apache Hadoop as execution environment
• Data decomposition is handled by the Hadoop framework based on input format handlers (e.g. text, warc, mets-xml, etc.)
• Can make use of a workflow engine (like Taverna and Oozie)
for orchestrating complex (composite) processes.
• May include interactions with data management systems (repositories) and sequential (concurrently executed) tools.
• Tool invocations are based on an API or command-line interface and performed as part of a MapReduce application.
37. SCAlable Preservation Environments
SCAPE
37
Tool Specification Language
• The SCAPE Tool Specification Language (toolspec) provides a
schema to formalize command line tool invocations.
• Can be used to automate a complex tool invocation (many
arguments) based on a keyword (e.g. ps2pdfs)
• Provides a simple and flexible mechanism to define tool
dependencies, for example of a workflow.
• Can be resolved by the execution system using Linux
packages.
• The toolspec is minimalistic and can be easily created for
individual tools and scripts.
• Tools provided as SCAPE Debian packages come with a
toolspec document by default.
39. SCAlable Preservation Environments
SCAPE
39
MapRed Toolwrapper
• Hadoop provides scalability, reliability, and robustness
supporting processing data that does not fit on a single
machine.
• Application must however be made compliant with the
execution environment.
• Our intention was to provide a wrapper that allows one to execute a command-line tool on the cluster in much the same way as on a desktop environment.
• The user simply specifies a toolspec file, a command name, and the payload data (see the sketch below).
• Supports HDFS references and (optionally) standard IO streams.
• Supports the SCAPE toolspec to execute preinstalled tools or other applications available via the OS command-line interface.
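A minimal sketch of such an invocation, reusing the flags shown later on the Image Migration slide (the job name and input list file are hypothetical, and the exact flag semantics may differ):
  # input-files.txt lists one payload file path per line (optionally with an output name)
  # -j: job name, -i: input file list, -r: directory holding the toolspec documents
  hadoop jar mpt-mapred.jar -j tiff2jp2-migration -i input-files.txt -r toolspecs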
40. SCAlable Preservation Environments
SCAPE
40
Hadoop Streaming API
• Hadoop streaming API supports the execution of scripts (e.g.
bash or python) which are automatically translated and
executed as MapReduce applications.
• Can be used to process data with common UNIX filters using
commands like echo, awk, tr.
• Hadoop is designed to process its input based on key/value
pairs. This means the input data is interpreted and split by the
framework.
• Perfect for processing text but difficult for binary data.
• The streaming API uses streams to read/write from/to HDFS.
• Preservation tools typically do not support HDFS file pointers and/or IO streaming through stdin/stdout.
• Hence, DP tools are hardly usable with the streaming API (see the sketch below).
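For data that really is plain text, a streaming job can nevertheless be assembled entirely from UNIX filters; a minimal sketch (the paths are hypothetical and the location of the streaming JAR varies between Hadoop versions):
  # Count identical lines (e.g. previously extracted MIME types): the shuffle/sort phase
  # groups equal lines, so a plain `uniq -c` works as the reducer.
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/scape/mime-lists \
    -output /user/scape/mime-counts \
    -mapper /bin/cat \
    -reducer "uniq -c"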
41. SCAlable Preservation Environments
SCAPE
41
Suitable Use-Cases
• Use MapRed Toolwrapper when dealing with (a large number
of) single files.
• Be aware that this may not be an ideal strategy and there are more efficient ways to deal with many files on Hadoop (Sequence Files, HBase, etc.).
• However, practical and sufficient in many cases, as there is
no additional application development required.
• A typical example is file format migration on a moderate
number of files (e.g. 100.000s), which can be included in a
workflow with additional QA components.
• Very helpful when payload is simply too big to be computed
on a single machine.
42. SCAlable Preservation Environments
SCAPE
42
Example – Exploring an uncompressed WARC
• Unpacked a 1GB WARC.GZ on local computer
• 2.2 GB unpacked => 343.288 files
• `ls` took ~40s,
• count *.html files with `file` took ~4 hrs => 60.000 html files
• Provided the corresponding bash command as a toolspec:
• <command>if [ "$(file ${input} | awk '{print $2}')" == HTML ]; then echo "HTML"; fi</command>
• Moved data to HDFS and executed pt-mapred with toolspec.
• 236min on local file system
• 160min with 1 mapper on HDFS (this was a surprise!)
• 85min (2), 52min (4), 27min (8)
• 26min with 8 mappers and IO streaming (also a surprise)
43. SCAlable Preservation Environments
SCAPE
43
Ongoing Work
• Source project and README on Github presently under
openplanets/scape/pt-mapred*
• Will be migrated to its own repository soon.
• Presently required to generate an input file that specifies input
file paths (along with optional output file names).
• TODO: ingest binaries directly from an input directory path, allowing Hadoop to take advantage of data locality.
• Input/output streaming and piping between toolspec commands have already been implemented.
• TODO: Add support for Hadoop Sequence Files.
• Look into possible integration with Hadoop Streaming API.
* https://github.com/openplanets/scape/tree/master/pt-mapred
45. SCAlable Preservation Environments
SCAPE
45
What we mean by Workflow
• Formalized (and repeatable) processes/experiments consisting
of one or more activities interpreted by a workflow engine.
• Usually modeled as DAGs based on control-flow and/or
data-flow logic.
• Workflow engine functions as a coordinator/scheduler that
triggers the execution of the involved activities
• May be performed by a desktop or server-sided
component.
• Example workflow engines are Taverna workbench, Taverna
server, and Apache Oozie.
• Not equally rich and designed for different purposes:
experimentation & science, SOA, Hadoop integration.
46. SCAlable Preservation Environments
SCAPE
46
Taverna
• A workflow language and graphical editing environment based
on a dataflow model.
• Linking activities (tools, web services) based on data pipes.
• High-level workflow diagram abstracting low-level implementation details
• Think of a workflow as a kind of configurable script.
• Easier to explain, share, reuse and repurpose.
• Taverna workbench provides a desktop environment to run
instances of that language.
• Workflows can also be run in headless and server mode.
• It doesn't necessarily run on a grid, cloud, or cluster but can be
used to interact with those resources.
47. SCAlable Preservation Environments
SCAPE
47
• Extract TIFF Metadata with
Matchbox and Jpylyzer
• Perform OpenJpeg
TIFF to JP2 migration
• Extract JP2 Metadata with
Matchbox and Jpylyzer
• Validation based on Jpylyzer
profiles
• Compare SIFT image
features to test visual
similarity
• Generate Report
Image Migration #1
48. SCAlable Preservation Environments
SCAPE
48
• No significant changes in
workflow structure
compared to sequential
workflow.
• Orchestrating remote
activities using Taverna’s
Tool Plugin over SSH.
• Using the Platform’s MapRed toolwrapper to invoke command-line tools on the cluster
Image Migration #2
Command: hadoop jar mpt-mapred.jar
-j $jobname -i $infile -r toolspecs
49. SCAlable Preservation Environments
SCAPE
WARC Identification #1
[Diagram: a (W)ARC container holding JPG, GIF, HTM, HTM and MID records is read by a (W)ARC RecordReader based on the HERITRIX web crawler's (W)ARC read/write libraries; in the Map phase Apache Tika detects the MIME type of each record; the Reduce phase aggregates the counts, e.g. image/jpg 1, image/gif 1, text/html 2, audio/midi 1.]

Tool integration pattern                               Throughput (GB/min)
TIKA detector API call in Map phase                    6,17
FILE called as command line tool from map/reduce       1,70
TIKA JAR command line tool called from map/reduce      0,01

Amount of data   Number of ARC files   Throughput (GB/min)
1 GB             10 x 100 MB           1,57
2 GB             20 x 100 MB           2,5
10 GB            100 x 100 MB          3,06
20 GB            200 x 100 MB          3,40
100 GB           1000 x 100 MB         3,71
54. SCAlable Preservation Environments
SCAPE
54
Quality Assurance #3 – Using Apache Oozie
• Remote Workflow
scheduler for Hadoop
• Accessible via REST interface
• Control-flow oriented
Workflow language
• Well integrated with Hadoop
stack (MapRed, Pig, HDFS).
• Hadoop API called directly, no more SSH interaction required.
• Deals with classpath
problems and different
library versions.
56. SCAlable Preservation Environments
SCAPE
56
• When dealing with large amounts of data in terms of #files, #objects, #records, and #TB of storage, traditional data management techniques begin to fail (file system operations, databases, tools, etc.).
• Scalability and robustness are key.
• Data-intensive technologies can help a great deal but do not
support desktop tools and workflows used in many domains
out of the box.
• SCAPE has ported a number of preservation scenarios identified
by its user groups from sequential workflows to a scalable
(Hadoop-based) environment.
• The required effort can vary a lot depending on the
infrastructure in place, the nature of the data, scale, complexity,
and required performance.
Conclusions
57. SCAlable Preservation Environments
SCAPE
57
• Project website: www.scape-project.eu
• Github: https://github.com/openplanets/
• SCAPE Group on MyExperiment: http://www.myexperiment.org
• SCAPE tools: http://www.scape-project.eu/tools
• SCAPE on Slideshare: http://www.slideshare.net/SCAPEproject
• SCAPE Application Areas at the Austrian National Library:
• http://www.slideshare.net/SvenSchlarb/elag2013-schlarb
• Submission and execution of SCAPE workflows:
• http://www.scape-project.eu/deliverable/d5-2-job-submission-language-and-interface
Resources