Presentation from Digital Curator Dave Thompson on systems and processes for digitisation at the Wellcome Library for our second Digitisation Open Day.
This document provides an agenda for a Big Data summer training session presented by Amrit Chhetri. The agenda includes modules on Big Data analytics with Apache Hadoop, installing Apache Hadoop on Ubuntu, using HBase, advanced Python techniques, and performing ETL with tools like Sqoop and Talend. Amrit introduces himself and his background before delving into the topics to be covered in the training.
This document provides an overview of object-based storage. It defines object-based storage as storing file data in the form of objects based on content and attributes rather than location. The key components are objects, object storage devices (OSDs), and metadata servers. Objects have file-like methods and contain data, metadata, and attributes. The document compares block-based and file-based storage, discusses drivers for object storage like big unstructured data, and outlines the process for storing and retrieving objects from OSDs. Benefits highlighted include security, reliability, platform independence, scalability, and manageability.
This document provides an overview of bio big data and related technologies. It discusses what big data is and why bio big data is necessary given the large size of genomic data sets. It then outlines and describes Hadoop, Spark, machine learning, and streaming in the context of bio big data. For Hadoop, it explains HDFS, MapReduce, and the Hadoop ecosystem. For Spark, it covers RDDs, Spark SQL, MLlib, and Spark Streaming. The document is intended as an introduction to key concepts and tools for working with large biological data sets.
I presented this keynote talk at the WorldComp conference in Las Vegas on July 13, 2009. In it, I summarize what grid computing is about (focusing in particular on the "integration" function rather than the "outsourcing" function, which is what people call "cloud" today), using biomedical examples.
Big Data: Guidelines and Examples for the Enterprise Decision Maker (MongoDB)
This document provides an overview of a real-time directed content system that uses MongoDB, Hadoop, and MapReduce. It describes:
- The key participants in the system and their roles in generating, analyzing, and operating on data
- An architecture that uses MongoDB for real-time user profiling and content recommendations, Hadoop for periodic analytics on user profiles and content tags, and MapReduce jobs to update the profiles
- How the system works over time to continuously update user profiles based on their interactions with content, rerun analytics daily to update tags and baselines, and make recommendations based on the updated profiles
- How the system supports both real-time and long-term analytics needs through this integrated approach.
The document discusses key concepts related to big data, including what data and big data are, the three Vs of big data (volume, velocity, and variety), sources and types of big data, how big data differs from traditional databases, applications of big data across fields such as healthcare and social media, tools for working with big data like Hadoop and MongoDB, and challenges and solutions related to big data.
The document outlines the vision, mission, and strategy of the STFC (Science and Technology Facilities Council) in implementing e-Science technologies. The goals are to exploit data from STFC facilities through innovative infrastructure, integrate activities nationally and internationally, and improve computation and data management capabilities to enable new scientific discoveries.
Cloud computing involves delivering computing services over the internet. It has three main components: client computers, distributed servers located in different geographic locations, and data centers housing servers and applications. There are three main service models: Software as a Service (SaaS), which provides ready-to-use software; Platform as a Service (PaaS), which provides operating systems and runtime environments; and Infrastructure as a Service (IaaS), which provides fundamental resources such as compute, storage, and networking. Deployment models include public, private, hybrid, and community clouds, based on access restrictions. Big data refers to very large volumes of digital data that cannot be analyzed with traditional techniques and instead requires distributed processing across cloud infrastructure to yield insights.
This document provides an overview of big data concepts, including what big data is, how it is used, and common tools involved. It defines big data in terms of a cluster of technologies like Hadoop, HDFS, and HCatalog used for fetching, processing, and visualizing large datasets. MapReduce and Hadoop clusters are described as common processing techniques. Example use cases mentioned include business intelligence. Resources for getting started with tools like Hortonworks and Cloudera, as well as examples of MapReduce jobs, are also provided.
Globus Online provides services to enable easy and reliable data transfer between campus resources and national cyberinfrastructure. It uses Globus Transfer for simple file transfers and Globus Connect to easily integrate campus resources. Globus Connect Multi-User allows administrators to easily deploy GridFTP servers and authentication for multiple users, facilitating campus bridging. Several universities have found success using these Globus services to enable terabyte-scale data sharing across their campuses and with national resources.
The document describes a data ingest system that digitizes content from multiple data providers and stores redundant copies across a global acquisition and storage preservation system. It replicates the files, metadata, and services needed to deliver the content through an access system.
This document provides an introduction to big data and Hadoop. It discusses what big data is, characteristics of big data like volume, velocity and variety. It then introduces Hadoop as a framework for storing and analyzing big data, describing its main components like HDFS and MapReduce. The document outlines a typical big data workflow and gives examples of big data use cases. It also provides an overview of setting up Hadoop on a single node, including installing Java, configuring SSH, downloading and extracting Hadoop files, editing configuration files, formatting the namenode, starting Hadoop daemons and testing the installation.
This document outlines an introductory workshop on big data held by the BigData Community. The workshop agenda includes an introduction to big data and the Hadoop ecosystem, demonstrations of Hadoop installation in standalone and pseudo-distributed modes, and a hands-on Java application example. Attendees are guided through setting up a test environment, downloading and configuring Hadoop, and testing the installation. The goal is to give 120 students across 5 universities an awareness of big data science and engineering through hands-on training.
Big data processing using Hadoop poster presentation (Amrut Patil)
This document compares implementing Hadoop infrastructure on Amazon Web Services (AWS) versus commodity hardware. It discusses setting up Hadoop clusters on both AWS Elastic Compute Cloud (EC2) instances and several retired PCs running Ubuntu. The document also provides an overview of the Hadoop architecture, including the roles of the NameNode, DataNode, JobTracker, and TaskTracker in distributed storage and processing within Hadoop.
The document summarizes the Research Data Family services at the University of Oxford. It discusses the history of research data management at Oxford dating back to 2008. It outlines several key services including DataPlan for creating data management plans, DataStage for lightweight data curation, DataBank as the research data repository, DataFinder as the research data catalogue, and training and support services. Future plans include further integrating these services and making them more sustainable and interoperable with other university and publishing systems.
This document discusses open source tools for big data analytics. It introduces Hadoop, HDFS, MapReduce, HBase, and Hive as common tools for working with large and diverse datasets. It provides overviews of what each tool is used for, its architecture and components. Examples are given around processing log and word count data using these tools. The document also discusses using Pentaho Kettle for ETL and business intelligence projects with big data.
The document introduces MongoDB as an open source, high performance database that is a popular NoSQL option. It discusses how MongoDB stores data as JSON-like documents, supports dynamic schemas, and scales horizontally across commodity servers. MongoDB is seen as a good alternative to SQL databases for applications dealing with large volumes of diverse data that need to scale.
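The "JSON-like documents with dynamic schemas" idea can be illustrated without a running database server. A minimal Python sketch follows (the collection and field names are invented for illustration; MongoDB itself stores documents in a binary JSON form, BSON):

```python
import json

# Two documents in the same hypothetical "users" collection. Unlike rows in a
# SQL table, they need not share the same set of fields (dynamic schema).
users = [
    {"_id": 1, "name": "Ada", "interests": ["hadoop", "mongodb"]},
    {"_id": 2, "name": "Grace", "city": "Boston"},  # no "interests" field
]

# Documents serialize naturally to JSON text
print(json.dumps(users[0]))

# A simple query: find users whose document contains a given field
with_interests = [u for u in users if "interests" in u]
print([u["name"] for u in with_interests])  # ['Ada']
```

The same flexibility is what lets MongoDB absorb diverse, evolving data without up-front schema migrations.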
Hadoop, SQL and NoSQL, No longer an either/or question (DataWorks Summit)
The document discusses the convergence of SQL, NoSQL and Hadoop technologies. It notes that these were originally separated but are now joining together. Analytics problems now span these different platforms, and many platforms now support multiple data workloads and personas. However, challenges remain around common security, federated querying, and workload management across platforms. The ideal solution would be a logical hub to coordinate these functions consistently across platforms.
MongoDB is a document-oriented database, and a very flexible one, as it offers horizontal scalability.
This presentation gives a basic introduction to MongoDB, including installation steps and basic commands.
Foundations for the future of science discusses using artificial intelligence and machine learning to advance scientific research. Key points discussed include using AI to analyze large datasets, develop scientific models, and automate experimental workflows. The document also outlines several examples of how the Globus data platform is currently enabling AI-powered scientific applications across multiple domains. Overall, the document advocates that embracing "AI for science" has the potential to accelerate scientific discovery by overcoming limitations in human analysis capabilities and computational resources.
Copyright clearance for genetics books - a pilot project at the Wellcome Library (Wellcome Library)
The Wellcome Library is digitizing around 2,000 genetics books from 1850 onward to make them freely available online. Due to the age of the works, 90% are expected to still have active copyrights. The Wellcome is working with ALCS and PLS to determine the copyright status and locate rights holders of each work to request permission. So far 31% of rights holders have been identified for the first 500 works, while 10% could not be identified. Works that are out of copyright, licensed for free access, or whose rights holders cannot be located may be put online. The project aims to test the feasibility and cost-effectiveness of large-scale copyright clearance for digitization.
Systems and Processes: making order out of chaos (Wellcome Library)
Presentation from Digital Curator Dave Thompson on systems and processes for digitisation at the Wellcome Library for our fourth Digitisation Open Day.
Webinar - Order out of Chaos: Avoiding the Migration Migraine (Peak Hosting)
When your business has outgrown your current managed hosting provider, the logical thing is to search for something better. Change can be difficult and chaotic, but it doesn’t have to be.
This webinar focuses on best practices for making your migration from the cloud as pain free as possible, including a discussion on what you need to know and ask of your migration provider to ensure it goes smoothly. As an example of this, we will outline Peak Hosting’s migration process, as well as discuss one of our customer migrations and why they chose to undertake it.
Presentation by Digitisation Project Manager Matthew Brack on things to think about when doing digitisation projects, for our fourth Digitisation Open Day.
Dave Thompson, a digital curator at the Wellcome Library, gave a presentation on digital curation and preservation. He explained that as a digital curator, he is responsible for maintaining two digital systems and working with archivists and digitization teams. Much of his time is spent in meetings, project management, and process improvement rather than directly managing data. He emphasized that digital data must be preserved for both technical reasons like preventing obsolescence as well as social and economic reasons. Finally, he stressed that successful long-term data management requires imagination, engagement with data creators and users, and recognizing that digital preservation is as much a social challenge as a technical one.
After an earthquake, the airport in Port-au-Prince, Haiti was in chaos, with aid planes jostling for space on the single open runway and landing at random. A small team of U.S. Air Force special-operations troops set up a folding table and established a system to direct incoming cargo planes from 30 to 40 miles out and guide them onto the runway. They managed traffic safely enough to allow planes to take off and land every 5 minutes, bringing in millions of pounds of supplies over the subsequent days and weeks.
This document discusses digitization processes and systems. It describes three key IT systems used in digitization: 1) a workflow management system called Goobi that tracks production, 2) a digital object repository called Preservica for storage, and 3) a front end player for access. Goobi automates workflows to make the production process efficient and scalable. Preservica ingests content from Goobi and creates administrative metadata for preservation. The player provides a single access point for delivering content from Preservica. Overall, the document emphasizes that digitization supports strategic goals and having systems and processes in place helps manage the end-to-end production and delivery of digitized content.
Analytics with unified file and object (Sandeep Patil)
The presentation takes you through one way to achieve in-place Hadoop-based analytics for your file and object data. It also gives an example of storage integration with cloud cognitive services.
Data Management - Full Stack Deep Learning (Sergey Karayev)
This document discusses data management for deep learning projects. It covers five main topics: sources of data, labeling data, data storage, data versioning, and data processing. For data sources, it describes obtaining publicly available datasets, collecting and labeling proprietary data, and techniques for data augmentation. For labeling data, it discusses interfaces for annotators, sources of labor like outsourcing, and labeling software. For storage, it outlines options for files, objects, databases, and data lakes. It describes different levels of data versioning from unversioned to specialized solutions. And it proposes using workflows and schedulers like Airflow to automate multi-step data processing tasks.
From Business Intelligence to Big Data - hack/reduce Dec 2014 (Adam Ferrari)
Talk given on Dec. 3, 2014 at MIT, sponsored by Hack/Reduce. This talk looks at the history of Business Intelligence, from first-generation OLAP tools through modern Data Discovery and visualization tools. Looking forward, it asks what we can learn from that evolution as numerous new tools and architectures for analytics emerge in the Big Data era.
This document provides an overview of big data and Apache Hadoop. It defines big data as large and complex datasets that are difficult to process using traditional database management tools. It discusses the sources and growth of big data, as well as the challenges of capturing, storing, searching, sharing, transferring, analyzing and visualizing big data. It describes the characteristics and categories of structured, unstructured and semi-structured big data. The document also provides examples of big data sources and uses Hadoop as a solution to the challenges of distributed systems. It gives a high-level overview of Hadoop's core components and characteristics that make it suitable for scalable, reliable and flexible distributed processing of big data.
This document discusses the components and technologies of digital libraries. It describes the key components as selection and acquisition, organization through metadata assignment, indexing and storage in a repository, and search and retrieval via a digital library website. It then associates various technologies with these components, such as metadata standards, document formats, repository systems like DSpace and Fedora, and semantic technologies.
Research Data (and Software) Management at Imperial: (Everything you need to ... (Sarah Anna Stewart)
A presentation on research data management tools, workflows and best practices at Imperial College London with a focus on software management. Presented at the 2017 session of the HPC Summer School (Dept. of Computing).
This document discusses web data extraction and analysis using Hadoop. It begins by explaining that web data extraction involves collecting data from websites using tools like web scrapers or crawlers. Next, it describes that the data extracted is often large in volume and requires processing tools like Hadoop for analysis. The document then provides details about using MapReduce on Hadoop to analyze web data in a parallel and distributed manner by breaking the analysis into mapping and reducing phases.
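The mapping and reducing phases described in that summary can be sketched in plain Python. This is a word-count toy that mirrors the two phases conceptually, not actual Hadoop code (function names are mine):

```python
def map_phase(lines):
    # Map: emit a (word, 1) pair for every token in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group pairs by key and sum the counts per word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["the cat sat", "the dog ran", "the cat ran"]
print(reduce_phase(map_phase(lines)))
# {'the': 3, 'cat': 2, 'sat': 1, 'dog': 1, 'ran': 2}
```

In real Hadoop, the map tasks run in parallel across HDFS blocks and the framework performs the grouping (shuffle/sort) between the two phases, but the data flow is the same.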
Buzz Moschetti presents on using MongoDB and Hadoop together for success with big data projects. He outlines a real-time directed content system that uses MongoDB for operational data and recommendations, Hadoop for batch analytics, and integrates the two with real-time updates. The system dynamically updates user profiles and recommendations based on user clicks and periodic re-analysis of all data in Hadoop. It provides both real-time and long-term analytics capabilities through this integrated architecture.
This document provides an overview of Archivematica and Access to Memory (AtoM) and how they can be used together for digital preservation and access. Archivematica is an open source digital preservation system that uses standards to create preservation packages (Archival Information Packages or AIPs) while AtoM is a content management system that can be used to describe and provide access to content. The document discusses how content could be described and managed in AtoM, preserved using Archivematica, and then have access copies and metadata handed back to AtoM for access. Integration with other systems like DSpace is also mentioned. Key features of Archivematica like standards compliance, flexibility and handling different types of digital content are
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that cannot be processed using traditional computing techniques due to the volume, variety, velocity, and other characteristics of the data. It discusses traditional data processing versus big data and introduces Hadoop as an open-source framework for storing, processing, and analyzing large datasets in a distributed environment. The document outlines the key components of Hadoop including HDFS, MapReduce, YARN, and Hadoop distributions from vendors like Cloudera and Hortonworks.
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureDenodo
Watch full webinar here:
Data lakes have been both praised and loathed. They can be incredibly useful to an organization, but it can also be the source of major headaches. Its ease to scale storage with minimal cost has opened the door to many new solutions, but also to a proliferation of runaway objects that have coined the term data swamp.
However, the addition of an MPP engine, based on Presto, to Denodo’s logical layer can change the way you think about the role of the data lake in your overall data strategy.
Watch on-demand this session to learn:
- The new MPP capabilities that Denodo includes
- How to use them to your advantage to improve security and governance of your lake
- New scenarios and solutions where your data fabric strategy can evolve
Slides for a talk at NDF 2017 by Stuart Yeates and Max Sullivan. See https://web.archive.org/web/20180213055412/http://www.ndf.org.nz/2017-workshops/#mets METS is Metadata for Encoding and Transmission Standard, see https://www.loc.gov/standards/mets/
Materials (sample METS files) are at https://figshare.com/articles/METS_metadata_for_complete_beginners_workshop_samples_/5606917
The document discusses key concepts related to databases including data, information, database management systems (DBMS), database design, and entity relationship modeling. It defines data as raw unorganized facts and information as organized, meaningful data. A database is a collection of organized data that can be easily accessed, managed and updated. Effective database design involves conceptual, logical and physical data modeling to structure data and relationships. The entity relationship model uses entities, attributes, and relationships to graphically represent data structures and relationships.
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
This document provides definitions and summaries of different types of databases:
- It defines databases, data, and database management systems. It also defines flat file databases, relational databases, document-oriented databases, embedded databases, hypertext databases, operational databases, and distributed databases.
- Relational databases organize data into formally described tables that can be accessed in many ways without reorganizing tables. Document databases are designed for storing and managing document-oriented information. Embedded databases are integrated with application software. Operational databases contain reference and event data for transaction systems. Distributed databases consist of data files located across network sites.
Government GraphSummit: And Then There Were 15 StandardsNeo4j
Todd Pihl PhD., Technical Project Mgr. & Mark Jensen, Director of Data Managements and Interoperability, National Institute of Health, Frederick National Labs for Cancer Research
Data repositories such as NCI’s Cancer Research Data Commons receive data that use a variety of data models and vocabularies. This presents a significant obstacle to finding and using the data outside of their original purpose. In this talk we’ll show how using Neo4j allows different data models to be represented and mapped to each other, giving data managers a new way to provide harmonized data to their users.
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.
In this talk, we will discuss the challenges behind enforcing compliance at scale as well as LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features for users to define custom filtering rules using SQL, assigning such customizations to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects like Hadoop, Hive, Gobblin, and Wherehows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used for enforcing compliance on other stores like Pinot, Salesforce, and Espresso.
While there is no one-size fits all solution to guaranteeing user data privacy, this talk will provide a blueprint and concrete example of how to enforce compliance at scale, which we hope proves useful to organizations working to improve their privacy commitments. ISSAC BUENROSTRO, Staff Software Engineer, LinkedIn and ANTHONY HSU, Staff Software Engineer, LinkedIn
The document provides an introduction to PREMIS (Preservation Metadata: Implementation Strategies) and its application in audiovisual archives. It discusses the challenges of digital preservation and the need for preservation metadata to ensure long-term access. It then summarizes the key aspects of PREMIS, including the PREMIS Data Dictionary, its relationship to the OAIS reference model, the five interacting entities in the PREMIS data model, and issues around implementing PREMIS in archives.
Similar to Systems, processes & how we stop the wheels falling off (20)
The Wellcome Library, in considering a project to digitise and transcribe recipe manuscripts using crowdsourcing technologies, commissioned this report from Ben Brumfield and Mia Ridge in Summer 2015. The report addresses issues specific to this project, and to the Wellcome Library's digital infrastructure.
ProQuest Early European Books: Partner PerspectiveWellcome Library
Matthew Brack's presentation from the Jisc and ProQuest symposium "Improving research outcomes with Early European Books", Senate House, London, 13 October 2014.
Creating an online resource for medical archives at the Wellcome LibraryWellcome Library
The Wellcome Library is digitizing its medical archives to create an online resource. It aims to digitize 1.1 million pages from its own collections and 500,000 pages from external partners. Digitizing involves flattening, removing staples from, and photographing archives page-by-page. Sensitive personal information will be restricted from public access online and require registration. Non-sensitive open access material over 100 years old can be freely accessed, while newer material requires registration.
The document discusses various digitization projects at the Wellcome Library. It describes projects to digitize early European books, genetics books, and London Ministry of Health reports. For each project, it provides details on the scope, number of items being digitized, and access and use of the digitized materials. It also outlines the digitization workflow, including cataloging, retrieval, conservation, capture, quality review, and systems used. Challenges discussed include creating engaging digital collections and fully exploiting the possibilities of new technologies.
This document discusses the image capture department and four digitization case studies. It describes the department's work streams including strategic digitization with 3 FTEs, collections photography with 1 FTE and 1 contractor, and corporate photography with 2 FTEs. It then summarizes four digitization projects: the digitization of 870 glass plate negatives from 1869-1871 with an outcome of 500Mb files captured over 6 months with 1 FTE; the digitization of 3000 AIDS posters from 1980-2000 with an outcome of 80Mb files captured over 6 months with 1 new FTE; the digitization of 500 fragile Arabic manuscripts from the 12th-20th centuries with an outcome of 70-80Mb files captured over 18 months
This document discusses the relationship between conservation and digitization. It notes that digitization projects involve both those with expertise in digital collections and physical collections. Good projects require understanding how physical and digital collections interact. Conservation is one part of the digitization workflow. The challenges of conservation for digitization projects include larger volumes of materials, less time to spend on each item, and more stakeholders who may not understand conservation. It also provides examples of how conservation prepares different types of physical materials for digitization.
Copyright Clearance for Genetics Books, A pilot project at the Wellcome LibraryWellcome Library
The Wellcome Library is digitizing around 2,000 genetics books from over 50 countries published after 1850. Due to the age of the works, up to 90% are expected to still have active copyrights. The library is working with rights clearance organizations to determine the copyright status and locate copyright holders to request permission to publish the works openly online. Initial results found rights holders identified for 36% of the first 500 works, while 31% had some rights holders identified. The library will make works available that are out of copyright, licensed for open access, or whose rights holders cannot be identified or contacted.
Managing Large Scale Digitisation at the Wellcome LibraryWellcome Library
The document summarizes the Wellcome Library's efforts to digitize its collections at a large scale. It discusses the library's collections, its pilot digitization project focused on genetics from 2010-2013, and its new strategic approach to digitization, which includes streamlining processes, outsourcing more digitization, establishing governance groups, and implementing a new digital asset management system to help scale up operations. The library aims to provide global online access to its unique medical history collections through large-scale digitization while preserving the materials.
The document summarizes the Wellcome Library's efforts to upscale digitization using the Goobi workflow system. It discusses:
1) The Wellcome Library's collections and goals of providing global online access to explore medicine's cultural contexts.
2) The library's digitization process has transformed from small, ad-hoc projects to a large, strategic program using automated processes and central tracking.
3) The Goobi workflow system streamlines digitization tasks and project management across the library and its partners to ingest, preserve, and provide access to digitized collections.
The document summarizes the Wellcome Trust's open access policy and efforts to increase compliance. It discusses two routes for complying - publishing in an open access journal or self-archiving. It also addresses meeting publication costs, improving workflows to simplify open access options, and clarifying policies to better support open access goals. The ultimate aim is to improve access to research in order to advance scientific understanding.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Systems, processes & how we stop the wheels falling off
1. Systems, processes & how we stop the wheels falling off
Digitisation Open Day, September 2013
Dave Thompson
Digital Curator, Wellcome Library
2. Digitisation – process overview
• Plan project: catalogue, identify material, identify resources, plan process
• Digitise/process: review as you go
• Deliver: refine processes
• Document/share at each stage
• Underpinned by funding, staff, equipment, IT, storage, data management planning
• Open source player
3. Meanwhile, at the coal face…
[Diagram: digitised images + descriptive metadata + administrative metadata + creation of METS = ingestion into repository → access]
4. Thinking conceptually … OAIS
In OAIS speak this is a SIP: an aggregation of an object & its metadata in a form that is acceptable to the repository, e.g. JPEG2000 images and MARC XML.
The Open Archival Information System (OAIS) Reference Model is an ISO standard that describes a conceptual model of an archive. It sets out the activities of an archive & the processes involved in submission, storage & access. Developed by NASA after they 'lost' space data through obsolescence.
5. Thinking conceptually… OAIS
In OAIS speak this is an AIP: the object & its metadata stored in a repository.
OAIS talks of 3 information packages:
1. Submission Information Package = what is ingested
2. Archival Information Package = what is stored
3. Dissemination Information Package = what is made available
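The three packages can be thought of as one object passing through two transformations. The sketch below is purely illustrative: the class and field names, the `closed_` filename convention, and the ingest logic are assumptions for clarity, not the actual SDB or Goobi data model.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three OAIS information packages.
# All names here are assumptions, not the Wellcome/SDB data model.

@dataclass
class InformationPackage:
    object_files: list          # e.g. JPEG2000 images
    descriptive_md: dict        # e.g. MARC-derived metadata
    administrative_md: dict = field(default_factory=dict)

def ingest(sip: InformationPackage) -> InformationPackage:
    """SIP -> AIP: the repository adds administrative metadata on ingest."""
    admin = {f: {"checksum": "computed on ingest"} for f in sip.object_files}
    return InformationPackage(sip.object_files, sip.descriptive_md, admin)

def disseminate(aip: InformationPackage) -> InformationPackage:
    """AIP -> DIP: only the parts we are able to make available."""
    public = [f for f in aip.object_files if not f.startswith("closed_")]
    return InformationPackage(public, aip.descriptive_md,
                              {f: aip.administrative_md[f] for f in public})

sip = InformationPackage(["b12345_0001.jp2", "closed_0002.jp2"], {"title": "Example"})
aip = ingest(sip)
dip = disseminate(aip)   # the closed file is withheld from the DIP
```

The point of the sketch is the asymmetry: the AIP always carries more than the DIP, because access restrictions are applied at dissemination, not at storage.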
6. Thinking conceptually …OAIS
In OAIS speak this is a DIP: the parts of the object & its metadata that we are able to make available.
As defined in the Digital Preservation Coalition (DPC) handbook, access is assumed to mean continued, ongoing usability of a digital resource, retaining all qualities of authenticity, accuracy and functionality deemed to be essential for the purposes the digital material was created and/or acquired for.
7. Let's tackle the basics… processing
Administrative metadata (AMD): technical description of the files. Automatically created by Safety Deposit Box (SDB) on ingest into our repository. Used by the player for display purposes.
Administrative metadata is typically created automatically; it could include:
• File size
• Image height × width
• File format
• Checksum
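Attributes like these can be derived mechanically from the files themselves. A minimal sketch in Python of what a repository computes on ingest (not the actual SDB implementation; image height/width would need an imaging library such as Pillow, so only file-level attributes are shown):

```python
import hashlib
import os

def administrative_metadata(path: str) -> dict:
    """Compute basic file-level administrative metadata.

    A minimal sketch of the kind of AMD a repository derives on
    ingest -- not the actual SDB implementation. Image dimensions
    would need an imaging library and are omitted here.
    """
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large master images don't load into memory at once.
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return {
        "file_size": os.path.getsize(path),
        "file_format": os.path.splitext(path)[1].lstrip(".") or "unknown",
        "checksum_sha256": sha256.hexdigest(),
    }
```

Run over each master file at ingest, the checksum in particular supports later fixity checking: recompute and compare to detect silent corruption.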
8. Let's tackle the basics… processing
DMD: MARC, converted to MARC XML; this becomes MODS in the METS. Material must be catalogued before we can store it & make it available.
Descriptive metadata (DMD) is typically human generated, AKA cataloguing metadata: ISAD(G) for archival material, MARC for bibliographic material. MODS = Metadata Object Description Schema.
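To make the MARC-to-MODS step concrete, here is a toy mapping of two MARC fields (245$a title, 100$a main entry name) into a minimal MODS record. Real conversion uses the Library of Congress MARCXML-to-MODS stylesheets; the dictionary-based input and two-field coverage here are simplifications for illustration.

```python
import xml.etree.ElementTree as ET

# Toy sketch of the DMD journey: MARC fields re-expressed as a minimal
# MODS record. Production pipelines use the LoC MARCXML->MODS XSLT;
# this covers only title (245$a) and name (100$a) for illustration.
MODS_NS = "http://www.loc.gov/mods/v3"

def marc_to_mods(marc_fields: dict) -> ET.Element:
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    if "245a" in marc_fields:
        title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
        ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = marc_fields["245a"]
    if "100a" in marc_fields:
        name = ET.SubElement(mods, f"{{{MODS_NS}}}name")
        ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = marc_fields["100a"]
    return mods

record = marc_to_mods({"245a": "An account of the foxglove",
                       "100a": "Withering, William"})
```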
9. Let's tackle the basics… processing
Safety Deposit Box (SDB): the place where we store digital stuff. Ingest is automatically initiated by Goobi. A database associates objects with their DMD & AMD; SDB is the source for dissemination.
Digital repositories offer a convenient infrastructure through which to store, manage, re-use and curate digital materials. They are used by a variety of communities, may carry out many different functions, and can take many forms.
10. Let's tackle the basics… processing
METS is metadata about structure & pagination created by humans; the METS file itself is built automatically.
A Metadata Encoding & Transmission Standard (METS) file is an aggregated collection of DMD & AMD (a file list with structure) that provides a mechanism for managed access. A METS file allows metadata from different systems to be combined into a portable format.
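The aggregation just described (descriptive section + administrative section + file list + structural map) can be sketched as a skeleton METS document. This is far from a complete or valid production METS file: the sections are empty shells, attribute handling is simplified (real METS uses `xlink:href` on `FLocat`), and the identifiers are invented. It only shows how the one XML document ties the pieces together.

```python
import xml.etree.ElementTree as ET

# Skeleton of the METS aggregation: DMD + AMD + a file list with
# structure in one portable XML document. Illustrative only -- not a
# valid production METS file (e.g. FLocat should use xlink:href).
METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def build_mets(images: list) -> ET.Element:
    mets = ET.Element(f"{{{METS_NS}}}mets")
    ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", ID="DMD1")   # descriptive (e.g. MODS)
    ET.SubElement(mets, f"{{{METS_NS}}}amdSec", ID="AMD1")   # administrative
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", USE="MASTER")
    for i, name in enumerate(images, start=1):
        f = ET.SubElement(grp, f"{{{METS_NS}}}file", ID=f"FILE{i}")
        ET.SubElement(f, f"{{{METS_NS}}}FLocat", href=name)
    # Physical structMap: one div per page, pointing back into the file list.
    smap = ET.SubElement(mets, f"{{{METS_NS}}}structMap", TYPE="PHYSICAL")
    volume = ET.SubElement(smap, f"{{{METS_NS}}}div", TYPE="volume")
    for i in range(1, len(images) + 1):
        page = ET.SubElement(volume, f"{{{METS_NS}}}div", TYPE="page", ORDER=str(i))
        ET.SubElement(page, f"{{{METS_NS}}}fptr", FILEID=f"FILE{i}")
    return mets

doc = build_mets(["b12345_0001.jp2", "b12345_0002.jp2"])
xml = ET.tostring(doc, encoding="unicode")
```

The structMap is what carries the human-created pagination and structure; the file list and metadata sections are assembled automatically around it.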
11. The formats
• JPEG2000 is our master image format.
• We create dissemination images (JPEG) on the fly.
• Also use PDF, MPEG2, MP3.
12. The systems
• Goobi. Manages & tracks the production of digitised content.
• SDB. Repository that stores digitised content along with its DMD & AMD.
• Player. User interface to view digitised material.
13. How Goobi works – the basics
• Project based.
• Workflow driven.
• Users accept 'tasks'.
• A user's role determines what projects they belong to & what tasks they can perform.
15. How Goobi works – METS editing
Pagination as per original
Descriptive metadata
Structure
16. Lessons from Goobi
• Design your workflows in advance. But be flexible.
• Automate as much as possible; it saves time & is more efficient.
• Document processes & procedures.
• Share what you learn.
17. How SDB works – the basics
• Workflow based; easily 'talks' to other systems.
• Content agnostic.
• Creates administrative metadata on ingest.
• Preservation orientated.
19. How SDB works – behind the scenes
• No public access to SDB.
• Little direct staff access to SDB content.
• High levels of automation of ingest, via Goobi.
• Platform for dissemination, mediated by the player.
20. Lessons from SDB
• Plan your systems integration: which system talks to which, and how.
• Plan workflows & processes.
• Data management plan. Your eggs in one basket.
• Plan what you'll do when it all turns to custard.
22. How the player works
• Makes HTTP requests to SDB for content.
• Draws access conditions from the METS file.
• Permitted actions drawn from METS.
• Draws DMD from the live catalogue.
24. Summary
• Digitisation is an end to end process that brings together objects & metadata.
• You have to think about the whole system to deliver results. The process is one of combining metadata from different systems.
• Document plans & document process.
• Be prepared to be flexible & to change as necessary. But try to stick to the plan!
25. Further reading
• Wellcome Library – http://wellcomelibrary.org
• Metadata Encoding & Transmission Standard at the Library of Congress – http://www.loc.gov/standards/mets/
• Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 2. June 2012 – http://public.ccsds.org/publications/RefModel.aspx
• Tessella, Safety Deposit Box – http://www.tessella.com/tag/safety-deposit-box/
• Data management planning – http://www.dcc.ac.uk/resources/data-management-plans
• Repository Software Comparison: Building Digital Library Infrastructure at LSE – http://www.ariadne.ac.uk/issue64/fay
26. Thank you
Questions now, questions later…?
Dave Thompson, Digital Curator
Wellcome Library
d.thompson@wellcome.ac.uk - #welldigi
http://wellcomelibrary.org/