Max Eckard, Lead Archivist for Digital Initiatives at the Bentley Historical Library, will cover the Bentley's integration of ArchivesSpace and Archivematica to streamline digital archiving workflows. He will highlight the decision-making process behind integrating both systems, things he wishes he’d known then that he knows now, goals for the future, and other tips and tricks. In his role at the Bentley Historical Library, Max oversees the digitization program, digital curation activities, web archives, and associated infrastructure.
Iceberg: a modern table format for big data (Ryan Blue & Parth Brahmbhatt, Netflix)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
Claremont Report on Database Research: Research Directions (Eric A. Brewer) - infoblog
This is a set of slides from the Claremont Report on Database Research, see http://db.cs.berkeley.edu/claremont/ for more details. These particular slides are from a "Research Directions" talk by "Eric A. Brewer". (Uploaded for discussion at the Stanford InfoBlog, http://infoblog.stanford.edu/.)
This document summarizes Netflix's big data capabilities and how they use Tableau to analyze and visualize their data. Some key points:
1. Netflix collects up to 100 billion data events per day across multiple tables exceeding 10 billion rows daily, totaling over 2 petabytes of compressed data stored in Amazon S3 buckets.
2. Their Hadoop cluster contains 2,000 EC2 nodes with 22.5 terabytes of RAM used to process this massive amount of data.
3. Tableau is used across many Netflix teams like Data Science, Platform, and IT to visually explore, analyze, and present their big data in a more user-friendly way than Excel.
4. Tableau enables teams
This XML Prague 2015 pre-conference presentation shows practical usage of linked data sources. These sources can help to enrich content with entities, add links to external data sources, and use the enriched content in question answering, machine translation, or other scenarios. The aim is to show the practical application of linked data sources in XML tooling. The presentation is an update that provides outcomes of the related session held at XML Prague 2014.
An Open Talk at DeveloperWeek Austin 2017 by Kimberly Wilkins (@dba_denizen), Principal Engineer - Databases at ObjectRocket. Featuring new use cases like Bitcoin, AI, IoT, and all the cool things.
The IT committee update document discusses:
1) The IT committee has 18 members and is recruiting new members until December 12th. 2) Past meetings included discussions around server virtualization, alternative solutions for the infocenter, a new Galaxy strategy, and mobile app ideas. 3) Upcoming meetings will take place in Brussels from December 14-16 with 8-10 members attending.
Couchbase Lite is a NoSQL mobile database that uses a document data model with key-value pairs and handles data one document at a time. It supports push and pull replication for syncing documents between devices and servers, including both continuous and one-shot replication with options for persistent or non-persistent settings. The document provides details on Couchbase Lite's data structures, basic operations, replication features, and includes links to related resources and a demo app that uses Cloudant as the backend data layer.
Grafana is an open source analytics and monitoring solution that allows users to visualize data and metrics from various sources. It provides a flexible dashboard interface that supports creating and sharing visualizations, alerting, and templating. Grafana has evolved over several major versions to support more data sources, improved UX, alerting capabilities, and a plugin system. It aims to continue expanding supported data sources and features like reporting, live data streaming, and clustering.
This document summarizes a presentation about using the Migrate API in Drupal for data migration. It introduces Drupal and the Migrate API, describes how to perform Drupal-to-Drupal migration with the Migrate API and Drupal-to-Drupal Migration module, and how migration logic works in Drupal 8 to improve the upgrade process. Resources for learning more about Drupal and the Migrate API are also provided.
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli... - Flink Forward
Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical dataset with high volume that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs with a streaming approach to event-time partitioning.
CData - Triangle Woodard Group - QuickBooks - Jerod Johnson
This document discusses how CData Software provides integration components that allow users to access and analyze QuickBooks data from various applications and tools. It describes CData's Excel add-in and ODBC driver, which allow QuickBooks data to be accessed and visualized in Excel and Tableau, respectively. The document promotes CData's software as providing reliable, standards-based connectivity to QuickBooks data from any application.
The document summarizes the history and purpose of Arkstore, a semantic data storage project. It began in 2011 as a university project in Russia and was commercialized in 2012-2013 as Coldsnipe company. In 2013 it became the ARKSTORE project focused on semantic web storage. Arkstore provides persistent storage of knowledge through mechanisms like backups and high availability. It uses various storage systems and has layers including an API, web interface, Ark semantic web engine, and Ark DataStore which aggregates storage systems. It supports various ontologies and public datasets to store and retrieve semantic data.
PayPay migrated their payment database from Amazon Aurora to TiDB in 3 months. They chose TiDB for its horizontal scalability, high availability, and ability to remove the need for application-level sharding. They performed an accuracy verification by comparing data between the old and new databases, as well as across microservices. Performance and availability testing was also conducted during the migration to validate the migration was successful. After 3 months of the new TiDB database in production, PayPay saw the expected performance improvements and zero incidents, finding TiDB to be a reliable replacement.
The document discusses data collection methods for improving machine translation systems. It describes uploading usage data from users to servers for manual transcription and translation to integrate new data. It also discusses collecting new training data through recording speech and bilingual texts in new language pairs and domains. Two approaches are mentioned: translating only important sentences or sorting sentences by importance and using non-professionals to reduce costs. Other projects discussed include pre-installing Jibbigo on iPod touches and customizing hardware for different translation applications.
City of Atlanta Oracle Application Footprint - Danny Bryant
The City of Atlanta has grown its Oracle footprint significantly over time. It currently uses Oracle E-Business Suite, Hyperion, OBIEE, Application Express, Siebel for customer service requests, and Taleo for recruiting and performance management. There are plans to migrate the E-Business Suite from 11i to 12.2.x. The migration brings both concerns about potential issues and the benefit of new features.
This document provides an overview of Grafana, an open source metrics dashboard and graph editor for Graphite, InfluxDB and OpenTSDB. It discusses Grafana's features such as rich graphing, time series querying, templated queries, annotations, dashboard search and export/import. The document also covers Grafana's history and alternatives. It positions Grafana as providing richer features than Graphite Web and highlights features like multiple y-axes, unit formats, mixing graph types, thresholds and tooltips.
This document discusses Red Hat's Open Data Hub platform for multi-tenant data analytics and machine learning. It describes the challenges of sharing data and compute resources across teams and the Open Data Hub architecture which allows teams to spin up and down their own compute clusters while sharing a common data store. Key elements of the Open Data Hub include Spark, Ceph storage, JupyterHub notebooks, and TensorFlow/Keras for modeling. The document provides an overview of data structures, analytics workflows, and the components and roadmap for the Open Data Hub platform.
SortaSQL is a proposal to add seamless horizontal scalability to SQL databases by using the filesystem to store and retrieve data. The SQL database would store metadata and handle queries, while an embedded key-value store manages record storage on files in the local or distributed filesystem. This allows queries to scale across many servers by letting the filesystem handle replication, performance and locking of distributed data files. The architecture involves an application communicating with PostgreSQL over SQL, which uses a SortaSQL plugin to retrieve rows from Kyoto Cabinet key-value files on the POSIX filesystem. Case studies at CloudFlare show how a 400GB per day dataset can be efficiently stored and queried at scale using this approach.
Leo Hsu and Regina Obe
We'll demonstrate integrating PostGIS in both PHP and ASP.NET applications.
We'll demonstrate using the new PostGIS 1.5 geography offering to extend existing web applications with proximity analysis.
More advanced uses: displaying maps and stats using OpenLayers and WMS/WFS services, and rolling your own WFS-like service using the PostGIS KML/GML/GeoJSON output functions.
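The proximity analysis mentioned above boils down to a single geography-typed query. As a minimal sketch (the talk itself demonstrates PHP and ASP.NET; the connection string, table, and column names below are hypothetical), here is the same ST_DWithin pattern driven from Python with psycopg2:

```python
import psycopg2

# Hypothetical connection string and schema; assumes a "places" table with a
# geography(Point, 4326) column named "geog" and PostGIS installed.
conn = psycopg2.connect("dbname=gisdb user=gis password=secret host=localhost")
cur = conn.cursor()

# Find every place within 5 km of a point of interest. With the geography type,
# ST_DWithin takes its distance in meters and handles great-circle math for us.
poi = "SRID=4326;POINT(-71.06 42.36)"
cur.execute(
    """
    SELECT name, ST_Distance(geog, ST_GeographyFromText(%s)) AS meters
    FROM places
    WHERE ST_DWithin(geog, ST_GeographyFromText(%s), %s)
    ORDER BY meters
    """,
    (poi, poi, 5000),
)
for name, meters in cur.fetchall():
    print(f"{name}: {meters:.0f} m away")

cur.close()
conn.close()
```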
Internet-enabled GIS Using Free and Open Source Tools - John Reiser
Internet-enabled GIS can be developed using free and open source tools like MapServer, GeoServer, TileCache, and OpenLayers. Open source GIS software allows data and applications to be freely shared, adapted, and improved by a community. Pre-rendering map tiles improves rendering speed compared to generating maps from source data for each request. The open source GIS community collaborates to build and enhance software and data.
KohaCon 2018 was held in Portland, Oregon from May 21-25 with over 230 registered users from around the world. The conference included a cultural day and 3-day hackfest after 3 days of presentations on topics like EDI standards in the US, the SubjectsPlus discovery tool, linked data, data-driven decision making, and the Koha ILL module. Upcoming EDS and citation plugins were demonstrated. Talks also covered the Koha manual, Coral ERM integration, Elasticsearch indexing, and customizations at BULAC library. KohaCon 2019 will be held in Dublin, Ireland from May 20-26, 2019.
DSpace at ILRI: A semi-technical overview of “CGSpace” - CIARD Movement
This document provides a semi-technical overview of CGSpace, a digital repository managed by the International Livestock Research Institute (ILRI) that is used by nine CGIAR centers to store over 50,000 research items and receives around 250,000 hits per month. It discusses the history and use of DSpace at ILRI, how content is organized and described, strategies for search engine optimization and dissemination, and the technical skills required for maintenance and development.
Presto talk @ Global AI conference 2018 Boston - kbajda
Presented at Global AI Conference in Boston 2018:
http://www.globalbigdataconference.com/boston/global-artificial-intelligence-conference-106/speaker-details/kamil-bajda-pawlikowski-62952.html
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, LinkedIn, Bloomberg, and FINRA, Presto experienced unprecedented growth in popularity in both on-premises and cloud deployments over the last few years. Presto is really a SQL-on-Anything engine: a single query can access data from Hadoop, S3-compatible object stores, RDBMS, NoSQL, and custom data stores. This talk will cover some of the best use cases for Presto and recent advancements in the project, such as the Cost-Based Optimizer and geospatial functions, as well as the roadmap going forward.
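To make the "SQL-on-Anything" claim concrete, here is a minimal sketch of a federated query from Python. It assumes the presto-python-client package and a coordinator at localhost:8080 with hive and mysql catalogs already configured; the table names are hypothetical:

```python
import prestodb  # pip install presto-python-client (assumed available)

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# One query joins a Hive/S3 table with an operational MySQL table --
# Presto federates across both connectors at query time.
cur.execute(
    """
    SELECT c.country, count(*) AS orders
    FROM hive.default.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY c.country
    ORDER BY orders DESC
    LIMIT 10
    """
)
for country, orders in cur.fetchall():
    print(country, orders)
```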
This document provides an introduction to graph databases. It defines a graph store as a tool for storing and retrieving highly related data where many things are connected to many other things. It notes that graph databases are optimized for this type of data and discusses some popular graph database implementations. It then explores why graph databases may be useful and some limitations. The document provides examples of graph data modeling and querying capabilities. It also outlines some advanced graph database features and how to interact with a graph database using different programming languages.
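As an illustration of the kind of modeling and querying the document describes, here is a minimal sketch using Neo4j's official Python driver and Cypher (Neo4j is just one of the popular implementations alluded to; the URI and credentials are placeholders):

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Model "many things connected to many other things": people who know people.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    # Traversal queries are where graph stores shine: reach friends-of-friends
    # with a single variable-length pattern instead of recursive joins.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:KNOWS*1..2]->(other) "
        "RETURN DISTINCT other.name AS name",
        name="Alice",
    )
    print([record["name"] for record in result])

driver.close()
```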
PGDay.Amsterdam 2018 - Jeroen de Graaff - Step-by-step implementation of Post... - PGDay.Amsterdam
Rijkswaterstaat is the executive agency of the Ministry of Infrastructure and Water Management in the Netherlands. During this presentation, I will share our journey to develop and apply PostgreSQL at Rijkswaterstaat. Our work is ICT-driven, and access to our data, both historical and current, is key to executing our task now and in the future.
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
In April 2014, the Bentley Historical Library received a $355,000 grant from the Mellon Foundation to integrate ArchivesSpace, Archivematica and DSpace into an end-to-end digital archives workflow. This presentation will identify key project goals and outcomes and demonstrate features and functionality of Archivematica’s new “Appraisal and Arrangement” tab.
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio - Alluxio, Inc.
Alluxio Bay Area Meetup March 14th
Join the Alluxio Meetup group: https://www.meetup.com/Alluxio
Alluxio Community slack: https://www.alluxio.org/slack
Data Day Texas 2017: Scaling Data Science at Stitch Fix - Stefan Krawczyk
At Stitch Fix we have a lot of Data Scientists. Around eighty at last count. One reason why I think we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they’re end-to-end responsible for their work; they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor and debug everything and anything required to get the output desired. They’re full data-stack data scientists!
The teams in the organization do a variety of different tasks:
- Clothing recommendations for clients.
- Clothes reordering recommendations.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP.
… and more!
They’re also quite prolific at what they do -- we are approaching 4500 job definitions at last count. So one might be wondering now, how have we enabled them to get their jobs done without getting in the way of each other?
This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientist, the Data Platform team tries to provide abstractions and infrastructure to help the Data Scientists. The relationship is a collaborative partnership, where the Data Scientist is free to make their own decisions and thus choose the way they do their work, and the onus then falls on the Data Platform team to convince Data Scientists to use their tools; the easiest way to do that is by designing the tools well.
In regard to scaling Data Science, the Data Platform team has helped establish some patterns and infrastructure that help alleviate contention. Contention on:
- Access to Data
- Access to Compute Resources:
  - Ad-hoc compute (think prototype, iterate, workspace)
  - Production compute (think where things are executed once they’re needed regularly)
For the talk (and this post) I only focused on how we reduced contention on Access to Data, & Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that I invite you to take a look through the slides.
This document summarizes the development of Lore's machine learning and NLP platform using Python. It started as a monolithic Python server but evolved into a microservices architecture using Docker, Kubernetes, and Celery for parallelization. Key lessons included using DevOps tools like Docker for development and deployment, Celery to parallelize tasks, and wrapping services to improve modularity, flexibility, and performance. The platform now supports multiple products and consulting work in a scalable and maintainable way.
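Since the summary credits Celery with the parallelization, here is a minimal hedged sketch of that pattern (the broker URL, task name, and payloads are hypothetical, not Lore's actual code):

```python
# tasks.py -- a minimal Celery app; start workers with: celery -A tasks worker
from celery import Celery, group

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task
def tokenize(document: str) -> list:
    """A stand-in for one small NLP step that can run on any worker."""
    return document.lower().split()

if __name__ == "__main__":
    # Fan a batch of documents out across workers and collect the results.
    docs = ["First example document", "Second example document"]
    job = group(tokenize.s(doc) for doc in docs)
    print(job.apply_async().get(timeout=30))
```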
#lspe Building a Monitoring Framework using DTrace and MongoDB - dan-p-kimmel
A talk I gave at the Large Scale Production Engineering meetup at Yahoo! about building monitoring tools and how to use DTrace to get more out of your monitoring data.
This document discusses change data capture (CDC) and its components. CDC is an approach that identifies, captures, and delivers changes made to enterprise data sources. It feeds these changes into a central data stream that can be combined with other data sources in real-time. The document outlines Kafka Connect, Debezium, Schema Registry, and Apache Avro which are key parts of the CDC architecture. It also discusses future steps like supporting additional databases and improving deployment, as well as open issues around performance and compatibility with certain databases.
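To ground the architecture described above, here is a minimal sketch of a downstream consumer reading Debezium change events from Kafka with confluent-kafka. For simplicity it assumes the connector uses the JSON converter rather than Avro/Schema Registry, and the broker and topic names are hypothetical:

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-demo",
    "auto.offset.reset": "earliest",
})
# Debezium topics are conventionally named <server>.<schema>.<table>.
consumer.subscribe(["inventory.public.customers"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        payload = event.get("payload", event)
        # "op" is c/u/d/r (create, update, delete, snapshot read);
        # "before"/"after" carry the row images for the change.
        print(payload.get("op"), payload.get("after") or payload.get("before"))
finally:
    consumer.close()
```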
The document discusses a project to investigate using Archivematica, an open-source digital preservation system, to provide digital preservation functionality for research data at the Universities of Hull and York. The project involved three phases: exploring Archivematica and research data needs, developing Archivematica features, and implementing proof-of-concept systems at both universities. Key findings included that Archivematica could meet many preservation needs but had limitations identifying research file formats, and that collaboration was important for addressing challenges in preserving research data long-term.
Webinar slides: DevOps Tutorial: how to automate your database infrastructure - Severalnines
Join our guest speaker Riaan Nolan of mukuru.com, the First Puppet Labs Certified Professional in South Africa, as he walks us through the facets of DevOps integrations and the mission-critical advantages that database automation can bring to your database infrastructure.
Infrastructure automation isn’t easy, but it’s not rocket science either. Done right, it is a worthwhile investment, but deciding on which tools to invest in can be a confusing and overwhelming process. Riaan will share some of his secrets on how to proceed with this and he knows what he’s talking about: he saves the companies he works for substantial amounts on their monthly IT bills, typically around 50%.
Don’t miss this opportunity to find efficiencies in your database infrastructure: watch this webinar to understand the key pain points that indicate it’s time to invest in database automation.
AGENDA
DevOps and databases - what are the challenges
Managing databases in a DevOps environment
- Requirements from microservice environments
- Automated deployments
- Performance monitoring
- Backups
- Schema changes
- Version upgrades
- Automated failover
- Integration with ChatOps and other tools
Data distribution
- Database hosting in cloud environments
- Managing data flows
Cloud Automation on AWS
SPEAKERS
Riaan Nolan was the First Puppet Labs Certified Professional in South Africa. Riaan uses Amazon EC2, VPC and Autoscale with Cloudformation to spin up complete stacks with Autoscaling Fleets. He saves companies substantial amounts on their monthly IT bills, typically around 50% - yes, at one company that meant $500k+ per year. And he’s participated in a number of community tech related forums. He uses next generation technologies such as AWS, Cloudformation, Autoscale, Puppet, GlusterFS, NGINX, Magento and PHP to power huge eCommerce stores. His specialties are Puppet Automation, Cloud Deployments, eCommerce, eMarketing, Specialized Linux Services, Windows, Process making, Budgets, Asset Tracking, Procurement.
- Devops Lead, Mukuru
- Expert Live Systems Administrator, foodpanda | Hellofood
- Senior Systems Administrator / Infrastructure Lead, Rocket Internet GmbH
- Senior Technology Manager, Africa Internet Accelerator
Art van Scheppingen is a Senior Support Engineer at Severalnines. He’s a pragmatic MySQL and Database expert with over 15 years experience in web development. He previously worked at Spil Games as Head of Database Engineering, where he kept a broad vision upon the whole database environment: from MySQL to Couchbase, Vertica to Hadoop and from Sphinx Search to SOLR. He regularly presents his work and projects at various conferences (Percona Live, FOSDEM) and related meetups.
Data for all: Empowering teams with scalable Shiny applications @ useR 2019 - Ruan Pearce-Authers
Shiny, alongside packages like dplyr and ggplot2, offers an unparalleled developer experience for creating self-service analytics dashboards that empower teams to make data-driven decisions. However, out of the box, Shiny is not well-suited to deployment in a multi-user environment. As part of our mission to establish a data culture in a game development studio, we wanted to deploy a suite of Shiny dashboards such that exploring player behaviour became part of every team’s workflow. In this talk, we will discuss the architecture of the supporting cloud infrastructure, including packaging, service orchestration, and authentication. Also, we will show how we’ve adapted Shiny to a multi-user environment using its new support for promises in combination with the future package. Integrating Shiny into this production-grade architecture allows for a streamlined data science workflow that enables data scientists to focus on creating dashboard content with a built-in code review process, and also to deploy changes to production in a button click. We hope to demonstrate how any data-driven organisation can augment their team-wide workflow by leveraging this end-to-end Shiny pipeline.
Behind the Scenes at Coolblue - Feb 2017 - Pat Hermens
This document discusses various tools in the Elastic Stack including Kibana, Elasticsearch, Beats, and Logstash. It provides brief descriptions of each tool and why they are used. Additional logging and monitoring tools are also mentioned, along with links to documentation, code samples, and other resources from the discussion.
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka - Hua Chu
This document discusses using Apache Kafka for heterogeneous job processing. It describes how the speaker's company evolved their job processing infrastructure from using a database with cron jobs, to Resque backed by Redis, to a custom system using Kafka. The custom system aims to provide durability and scalability for long-running jobs by decoupling jobs into smaller tasks communicated through Kafka topics. It achieves reliability by ensuring Kafka message replication and allowing tasks to recover from failures.
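A minimal sketch of the decoupling idea described above: one long-running job is split into smaller task messages on a Kafka topic, which any worker can pick up. The topic name and task schema are hypothetical, and confluent-kafka is assumed:

```python
import json
import uuid
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def submit_job(video_path: str, chunks: int) -> str:
    """Break one long-running job into small, independently retryable tasks."""
    job_id = str(uuid.uuid4())
    for i in range(chunks):
        task = {"job_id": job_id, "chunk": i, "source": video_path}
        # Keying by job_id keeps a job's tasks on one partition, preserving order.
        producer.produce("transcode-tasks", key=job_id, value=json.dumps(task))
    producer.flush()  # broker replication acks give the durability the talk relies on
    return job_id

print(submit_job("videos/demo.mp4", chunks=8))
```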
GSoC2014 - Uniritter Presentation May, 2015 - Fabrízio Mello
This presentation is about the work that I did during the Google Summer of Code 2014 for PostgreSQL. The project is about changing an unlogged table to logged and vice versa. Project wiki page: https://wiki.postgresql.org/wiki/Allow_an_unlogged_table_to_be_changed_to_logged_GSoC_2014
I present this work to Uniritter IT students in Canoas/RS (2015-05-18) and Porto Alegre/RS (FAPA - 2015-05-20).
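The capability this project worked toward later landed as plain DDL in PostgreSQL (ALTER TABLE ... SET LOGGED / SET UNLOGGED, available since 9.5). A minimal sketch of the round trip from Python, with a hypothetical table name:

```python
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

# Unlogged tables skip WAL, so bulk loads are fast, but the data is not
# crash-safe and is not replicated.
cur.execute("CREATE UNLOGGED TABLE IF NOT EXISTS staging_events (id int, payload text)")
cur.execute(
    "INSERT INTO staging_events "
    "SELECT g, 'row ' || g FROM generate_series(1, 100000) g"
)

# Flip the table to logged once the load is done; PostgreSQL writes the data
# into the WAL at this point so it becomes durable and replicable.
cur.execute("ALTER TABLE staging_events SET LOGGED")

cur.close()
conn.close()
```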
This document discusses using PyTables to analyze large datasets. PyTables is built on HDF5 and uses NumPy to provide an object-oriented interface for efficiently browsing, processing, and querying very large amounts of data. It addresses the problem of CPU starvation by utilizing techniques like caching, compression, and high performance libraries like Numexpr and Blosc to minimize data transfer times. PyTables allows fast querying of data through flexible iterators and indexing to facilitate extracting important information from large datasets.
This document discusses PyTables, a Python library for managing hierarchical datasets and efficiently analyzing large amounts of data. It begins by introducing PyTables and its use of HDF5 for portability and extensibility. Key features of PyTables discussed include its object-oriented interface, optimization of memory and disk usage, and fast querying capabilities. The document then covers techniques for maximizing performance like Numexpr for complex expressions, NumPy for powerful data containers, compression algorithms, and caching. Blosc compression is highlighted for its ability to compress faster than memory speed.
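A minimal sketch of the in-kernel querying these summaries describe, using PyTables' where() (which delegates condition evaluation to Numexpr); the file and column names are just for illustration:

```python
import numpy as np
import tables  # pip install tables (PyTables)

class Reading(tables.IsDescription):
    sensor = tables.Int32Col()
    value = tables.Float64Col()

with tables.open_file("readings.h5", mode="w") as h5:
    table = h5.create_table("/", "readings", Reading)
    row = table.row
    for i in range(100_000):
        row["sensor"] = i % 16
        row["value"] = np.random.random()
        row.append()
    table.flush()

    # In-kernel query: the condition string is compiled by Numexpr and evaluated
    # chunk by chunk, so matching rows are extracted without loading the whole table.
    hits = [r["value"] for r in table.where("(sensor == 3) & (value > 0.99)")]
    print(f"{len(hits)} matching rows")
```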
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas - MongoDB
Moving to a new home is daunting. Packing up all your things, getting a vehicle to move it all, unpacking it, updating your mailing address, and making sure you did not leave anything behind. Well, the move to MongoDB Atlas is similar, but all the logistics are already figured out for you by MongoDB.
Similar to Integrating ArchivesSpace and Archivematica at the Bentley Historical Library (20)
From Natural Language to Structured Solr Queries using LLMs - Sease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive”) gap remains between the data users’ needs and the data producers’ constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
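As a rough sketch of the proposed flow (not the presenters' implementation): a natural-language request plus the index's field metadata goes to an LLM, the model returns a structured Solr query, and pysolr executes it. The llm_complete function is a hypothetical stand-in for whatever LLM API is used, and the Solr core and fields are placeholders:

```python
import pysolr  # pip install pysolr

SOLR = pysolr.Solr("http://localhost:8983/solr/articles", timeout=10)
FIELDS = "title (text), author (string), year (int), body (text)"

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError("plug in your LLM client here")

def natural_language_search(question: str):
    prompt = (
        "Translate the user request into a Solr 'q' string using only these "
        f"fields: {FIELDS}. Return the query string only.\n"
        f"Request: {question}"
    )
    solr_q = llm_complete(prompt)        # e.g. 'author:"Jane Doe" AND year:[2020 TO *]'
    return SOLR.search(solr_q, rows=10)  # the structured query runs against the index

# results = natural_language_search("recent papers by Jane Doe")
```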
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
In this talk we will discuss DDoS protection tools and best practices, discuss network architectures, and look at what AWS has to offer. We will also examine one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022. We'll see what techniques helped to keep web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on the Ukraine experience.
Northern Engraving | Nameplate Manufacturing Process - 2024 - Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
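A minimal sketch of the trigram technique the talk mentions (the drugs table and column are hypothetical; pg_trgm is a standard contrib extension):

```python
import psycopg2

conn = psycopg2.connect("dbname=meds user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

# pg_trgm adds trigram similarity operators, which tolerate misspellings
# that plain LIKE/ILIKE would miss.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")

query = "ibuprofin"  # note the misspelling
cur.execute(
    """
    SELECT name, similarity(name, %s) AS score
    FROM drugs
    WHERE name %% %s            -- '%%' escapes pg_trgm's '%' similarity operator
    ORDER BY score DESC
    LIMIT 5
    """,
    (query, query),
)
print(cur.fetchall())
```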
Discover the Unseen: Tailored Recommendation of Unwatched Content - ScyllaDB
The session shares how JioCinema approaches "watch discounting." This capability ensures that if a user has watched a certain amount of a show or movie, the platform no longer recommends that particular content to the user. Flawless operation of this feature promotes the discovery of new content, improving the overall user experience.
JioCinema is an Indian over-the-top media streaming service owned by Viacom18.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels - Northern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations point of view? Is it possible to apply our beloved cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial for, or limiting to, your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working in practice.
Keywords: AI, Containeres, Kubernetes, Cloud Native
Event Link: https://meine.doag.org/events/cloudland/2024/agenda/#agendaId.4211
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
The Microsoft 365 Migration Tutorial For Beginner.pptx - operationspcvita
This presentation will help you understand the power of Microsoft 365. It covers every productivity app included in Office 365, outlines typical Office 365 migration scenarios, and explains how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc... - DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime: $5-10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for developing highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work with and manage large amounts of data in real time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
Dandelion Hashtable: beyond billion requests per second on a commodity server - Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor... - GlobalLogic Ukraine
During the talk we will answer why application performance needs to be improved and what the most effective ways to do so are. We will also discuss what a cache is, what types of caches exist and, most importantly, how to find a performance bottleneck.
Video and event details: https://bit.ly/45tILxj
Integrating ArchivesSpace and Archivematica at the Bentley Historical Library
1. Integrating ArchivesSpace
and Archivematica at the
Bentley Historical Library
Max Eckard
Lead Archivist for Digital Initiatives
Integrations with ArchivesSpace | February 19, 2020
8. Why Integrate ArchivesSpace and Archivematica?
● Shortcomings with previous workflow for digital processing
○ FileMaker Pro: Metadata was not in a widely used, open system
○ Microsoft Word → EAD: Workflow for EAD generation too localized and complicated
○ AutoPro: Not intended to be a long-term solution
○ Lack of system(s) of record: Lots (and lots… and lots…) of duplicate metadata entry
○ Scale: Scale was an issue!
● ArchivesSpace
○ System of record for descriptive and administrative metadata
○ Gradual migration of metadata from disparate systems into ArchivesSpace
● Archivematica
○ System of record for digital processing/Archival Information Package (AIP) creation
○ Gradual migration of digital backlog into Archivematica
Both systems (DSpace, too!) play nicely with others!
9. ● Worked with Artefactual
● Developed a new “Appraisal and Arrangement” tab in Archivematica
● Integrated Archivematica and ArchivesSpace in a way that permits archivists to:
○ After initial transfer…
■ Display resource records in a treeview
■ Create and edit descriptive metadata for new or existing archival objects
■ Drag and drop digital content onto archival description, creating digital objects
associated with those archival objects
○ After AIP creation and deposit…
■ Write newly-minted DSpace handles back to ArchivesSpace
Sponsoring New Features in Archivematica
10.
11.
12. How well has this systems integration aged?
● Pretty well! But...
● We use the particular workflow outlined here less and less
○ Works well for relatively small, heterogeneous transfers
○ Doesn’t work as well for:
■ Relatively large or homogenous transfers
■ Transfers destined for digital repositories besides DSpace
● That said, we use integration more and more (in more loosely- and tightly-coupled gradations)
13. Lessons Learned
● Systems integration is where it’s at! Side note: The integration of people is as
challenging as the integration of systems.
● More important than any particular workflow is…
○ Having systems that are designed to play nicely with others
○ The technical upskilling we did as a team while implementing, migrating to, and integrating
ArchivesSpace and Archivematica
■ bentley-historical-library/aip-repackaging: Scripts to support repackaging and depositing
Archivematica AIPs to DSpace
■ bentley-historical-library/dappr: A client to communicate with a remote DSpace
installation using its backend API
■ Lots of other ArchivesSpace repositories...
We are now much more flexible in how we approach
digital processing!
15. References
● [1]: Dallas Pillen, “Integrating Archive-It and ArchivesSpace at the Bentley
Historical Library,” Integrations with ArchivesSpace, December 4, 2019
Resources
● Archival Integration (blog)
● ArchivesSpace-Archivematica-DSpace Workflow Integration
○ Part 1: Configuring ArchivesSpace and DSpace Integration within Archivematica
○ Part 2: Appraisal, Arrangement to ArchivesSpace and Deposit to DSpace
● BHL Archival Curation: Digital Processing
Editor's Notes
Thought I’d start with an overview of our institutional context/technical ecosystem. I actually borrowed this slide from my colleague, Dallas Pillen, who recently gave a webinar in this very Integrations with ArchivesSpace series on ArchivesSpace - Archive-It integration.
So, we’ve got a lot going on here. We’ve got… (And I’m not even mentioning all of the other more localized database and spreadsheet systems we use here as well.)
It’s REALLY IMPORTANT to note that these systems are used by a wide variety of stakeholders within the Bentley (including the “back of the house” Curation team and the “front of the house” Reference and Academic Programs team) as well as beyond (including novice researchers like U-M undergrads and more advanced researchers from both inside and outside the U-M community, as well as the general public).
They are likewise hosted and supported by a wide variety of stakeholders. For some, we rely on U-M Library LIT, for others, we rely on U-M ITS, and for still others, we’re experimenting with hosting and supporting them ourselves.
Managing all of these systems--especially the handoffs of data and metadata between them--can get overwhelming. So we actively look for ways to integrate them with one another, creating a kind of functional coupling between them so that they act as a coordinated whole to fulfill a number of archival workflows. And actually the integration I’ll be talking about today was part of a larger project that’s kind of the one that kicked this whole thing off for us.
So, yeah, the Archivematica-ArchivesSpace integration that I’ll talk about today was actually part of larger, Mellon Foundation-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project (2014-2016) that united three Open-Source Software platforms.
The point of the ArchivesSpace-Archivematica portion of the integration was to...
But, for additional context, as part of the overall workflow, we also wanted to:
Streamline the ingest and deposit of content in a preservation repository.
Find solutions that met the Bentley’s local needs, but which were also flexible and scalable for other institutions; modular, so that institutions may adopt some, none, or all of the development features; and based upon open standards so that other tools and/or repository platforms could be integrated.
Share all code and documentation with the archives and digital preservation communities.
And, just to give you a sense of why we were interested in this, let me show you where we were coming from.
We had been doing digital processing with a bunch of localized, disparate silos of data and metadata that didn’t really work together at all.
So for example, here’s how we tracked accessions in a FileMaker Pro database (affectionately called BEAL).
We also used to do arrangement and description work of both physical and digital archives in Microsoft Word documents, generating Encoded Archival Description (EAD) using macros applied to various Microsoft Word styles.
And, we did digital processing with the AutomatedProcessor (AutoPro), a homegrown digital preservation tool written in Windows shell scripts.
While the use of FileMaker Pro, Microsoft Word, and AutoPro lowered technical barriers and introduced efficiencies into our digital processing initiatives, there were numerous shortcomings.
The use of a custom FileMaker Pro database, for example, limited our ability to take advantage of the affordances of more widely-used systems (e.g., Archon and Archivists’ Toolkit and later ArchivesSpace), such as the ability to integrate with other tools.
Using Microsoft Word to generate EADs was certainly easier than hand-encoding XML, but training processors in Microsoft Word styles and macros made the process very localized and more complicated for entering descriptive information than, say, ArchivesSpace.
AutoPro had limited error handling, a poor user interface, and various support issues, and was never really intended to be a long-term solution.
In general, there was also a lack of well-defined system(s) of record: This meant lots of duplicate/redundant metadata entry in various platforms, and also meant we had a really hard time managing this metadata over time.
None of these tools, I’ll add, really helped us work at scale. At all.
Meanwhile, ArchivesSpace and Archivematica had emerged as two of the most exciting open source platforms for working with digital archives. We were adopting ArchivesSpace to…and Archivematica to…and, best of all... [animation]... that is, both use common metadata standards, have APIs, are open-source (although that’s not necessarily a prerequisite for systems integration, but it helps), etc.
And so, for this grant, we sponsored some development in Archivematica, essentially paying Artefactual to develop a new “Appraisal and Arrangement” tab in Archivematica. In addition to being the spot where this ArchivesSpace-Archivematica integration would happen, this introduced functionality to appraise and review digital content from within Archivematica. Not going to talk about this much but feel free to ask any questions...
As it pertains to this webinar, however, it integrated Archivematica and ArchivesSpace via the introduction of an ArchivesSpace ‘pane’ within the Appraisal and Arrangement tab. This feature utilizes the ArchivesSpace API to:
Display resource records in a tree view depicting the intellectual hierarchy of archival objects in ArchivesSpace
Create and edit descriptive metadata for new or existing archival objects; authored in Archivematica and written to ArchivesSpace
It also permits archivists to drag and drop digital content onto archival description to create ArchivesSpace digital object records.
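To give a concrete sense of what the ArchivesSpace side of this looks like, here is a minimal Python sketch of creating a digital object and attaching it to an existing archival object via the ArchivesSpace API. The backend URL, credentials, repository id, and archival object id are all placeholders; this illustrates the kind of calls involved, not the integration's actual code.

```python
import requests

ASPACE_API = "http://aspace.example.edu:8089"   # placeholder backend URL
REPO_ID = 2                                     # placeholder repository id

# Authenticate and grab a session token
resp = requests.post(f"{ASPACE_API}/users/admin/login", params={"password": "admin"})
session = {"X-ArchivesSpace-Session": resp.json()["session"]}

# Create a digital object record
digital_object = {
    "jsonmodel_type": "digital_object",
    "title": "Example digital content",
    "digital_object_id": "example-do-001",
}
do_uri = requests.post(
    f"{ASPACE_API}/repositories/{REPO_ID}/digital_objects",
    headers=session, json=digital_object,
).json()["uri"]

# Attach it as an instance on an existing archival object (id 4567 is a placeholder)
ao_url = f"{ASPACE_API}/repositories/{REPO_ID}/archival_objects/4567"
archival_object = requests.get(ao_url, headers=session).json()
archival_object.setdefault("instances", []).append(
    {"instance_type": "digital_object", "digital_object": {"ref": do_uri}}
)
requests.post(ao_url, headers=session, json=archival_object)
```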
After launching the ingest of Submission Information Packages (SIPs) in Archivematica, a related integration of Archivematica and DSpace automatically uploads the fully ingested AIP and its associated descriptive metadata as a unique item in DSpace. The persistent URL of that item (its “handle”) is in turn written back to ArchivesSpace, where it serves as a link to the digital content, both within ArchivesSpace itself and when the archival description is exported to an EAD finding aid.
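The handle write-back step can likewise be sketched in a few lines of Python. The digital object URI and the handle value below are placeholders; again, this is illustrative rather than the actual integration code.

```python
import requests

ASPACE_API = "http://aspace.example.edu:8089"            # placeholder backend URL
DO_URI = "/repositories/2/digital_objects/123"           # placeholder digital object
HANDLE = "http://hdl.handle.net/2027.42/999999"          # placeholder DSpace handle

resp = requests.post(f"{ASPACE_API}/users/admin/login", params={"password": "admin"})
session = {"X-ArchivesSpace-Session": resp.json()["session"]}

# Record the deposited item's handle as a file version on the digital object,
# so ArchivesSpace (and any exported EAD) can link out to the content in DSpace
digital_object = requests.get(f"{ASPACE_API}{DO_URI}", headers=session).json()
digital_object.setdefault("file_versions", []).append({"file_uri": HANDLE})
requests.post(f"{ASPACE_API}{DO_URI}", headers=session, json=digital_object)
```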
Screencast
All in all, this systems integration, when mapped to a digital preservation workflow like the Digital Curation Centre Curation Lifecycle model, looks something like this. As you can maybe see, Archivematica’s the one doing the driving, data is flowing both bidirectionally and unidirectionally between systems at various stages of the workflow, and Archivematica is using various integration methodologies like the ArchivesSpace API to integrate with ArchivesSpace (as well as SWORD v2 and the DSpace API to integrate with DSpace, even though I didn’t really touch on those).
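For the DSpace side, a SWORD v2 deposit is essentially an HTTP POST of a packaged SIP/AIP to a collection's SWORD endpoint. Here is a rough Python sketch with a placeholder endpoint, credentials, and package; the exact packaging format and URL depend on how the DSpace instance is configured, and this is not the code Archivematica itself uses.

```python
import requests

# Placeholder SWORD v2 collection endpoint on a DSpace instance
SWORD_URL = "https://dspace.example.edu/swordv2/collection/2027.42/1234"
headers = {
    "Content-Type": "application/zip",
    "Content-Disposition": "filename=aip.zip",
    "Packaging": "http://purl.org/net/sword/package/METSDSpaceSIP",
}

# Deposit a packaged AIP as a new item in the target DSpace collection
with open("aip.zip", "rb") as package:
    resp = requests.post(SWORD_URL, headers=headers, data=package,
                         auth=("depositor@example.edu", "password"))
print(resp.status_code)  # 201 Created on success; the response body is an Atom entry
```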
So, this was four years ago. You might be asking, how well has this systems integration aged?
Pretty well, actually! We still use it, to this day, just this morning, in fact! It has survived upgrades to both ArchivesSpace (started with 2.2 and are now on 2.5x), and Archivematica (1.6 to 1.9), as well as the latter’s migration to a new server.
I will say that we use the particular workflow I outlined here less and less. It works really well for relatively small, heterogeneous transfers, but that doesn’t exactly match the kind of accessions or transfers we usually get, which, more often, are small and homogeneous or, regardless of whether they’re homogeneous or heterogeneous, are definitely trending bigger and bigger (whether you measure that by number of files or by size). As I showed earlier, we also have some more specialized platforms for digitized images and streaming audiovisual material, and this particular workflow obviously connects to DSpace but not to them.
That said, we do use this strategy of integration, and in particular integrations with systems, like ArchivesSpace and Archivematica and various other repository systems, that are designed to play nicely with others, MORE and MORE.
Which leads me to lessons learned. We’ve definitely drunk the Kool-Aid and think systems integration is where it’s at. Side note: It turns out the integration of people is at least as challenging as, if not more challenging than, the integration of systems (the proxies for those people), which is why I mentioned all of those stakeholders at the beginning.
Looking back, I think we’d now say that more important than any particular workflow is…
Just simply having systems that are designed to play nicely with others
Really, the technical upskilling we did as a team while implementing, migrating to, and integrating ArchivesSpace and Archivematica.
As some evidence of that, here are a couple of GitHub repositories with code we’ve developed on our own based off of these integrations and the workflow they support:
AIP Repackaging scripts to… essentially these help us to replicate the integration part of what I just showed without locking us in to a particular workflow
Similarly, DAPPr, a Python-based API wrapper for DSpace that, again, allows us to get data and metadata into and out of DSpace in a programmatic way without being forced to use a particular workflow
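As a purely hypothetical illustration of what “repackaging” an AIP can involve (and not the actual bentley-historical-library scripts), here is a short Python sketch that pulls the METS file and preserved objects out of an extracted Archivematica AIP so they can be packaged for deposit elsewhere; the directory layout assumed here is the standard BagIt-style Archivematica AIP structure.

```python
from pathlib import Path
import shutil

def repackage_aip(aip_dir: str, out_dir: str) -> None:
    """Copy the METS file and original objects out of an extracted Archivematica AIP
    so they can be repackaged for deposit elsewhere (illustrative only)."""
    aip = Path(aip_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Archivematica AIPs follow BagIt: the payload lives under data/, including the AIP METS
    mets_files = list(aip.glob("data/METS.*.xml"))
    if not mets_files:
        raise FileNotFoundError("No METS file found; is this an extracted Archivematica AIP?")
    shutil.copy2(mets_files[0], out / mets_files[0].name)

    # Copy the preserved objects, skipping Archivematica's submissionDocumentation
    objects_root = aip / "data" / "objects"
    for obj in objects_root.rglob("*"):
        if obj.is_file() and "submissionDocumentation" not in obj.parts:
            target = out / "objects" / obj.relative_to(objects_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, target)
```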
All of this results in the fact that… [animation]... We now much prefer to sit down with a digital processing problem, think about what we want the end product to look like and all the different ways we might get there, and go from there.
Actually I think we’re saving questions for the end.