Max Eckard, Lead Archivist for Digital Initiatives at the Bentley Historical Library, will cover the Bentley's integration of ArchivesSpace and Archivematica to streamline digital archiving workflows. He will highlight the decision-making process behind integrating both systems, things he wishes he’d known then that he knows now, goals for the future, and other tips and tricks. In his role at the Bentley Historical Library, Max oversees the digitization program, digital curation activities, web archives, and associated infrastructure.
Iceberg: a modern table format for big data (Ryan Blue & Parth Brahmbhatt, Netflix)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
Claremont Report on Database Research: Research Directions (Eric A. Brewer) - infoblog
This is a set of slides from the Claremont Report on Database Research, see http://db.cs.berkeley.edu/claremont/ for more details. These particular slides are from a "Research Directions" talk by "Eric A. Brewer". (Uploaded for discussion at the Stanford InfoBlog, http://infoblog.stanford.edu/.)
This document summarizes Netflix's big data capabilities and how they use Tableau to analyze and visualize their data. Some key points:
1. Netflix collects up to 100 billion data events per day across multiple tables exceeding 10 billion rows daily, totaling over 2 petabytes of compressed data stored in Amazon S3 buckets.
2. Their Hadoop cluster contains 2,000 EC2 nodes with 22.5 terabytes of RAM used to process this massive amount of data.
3. Tableau is used across many Netflix teams like Data Science, Platform, and IT to visually explore, analyze, and present their big data in a more user-friendly way than Excel.
4. Tableau enables teams
This XML Prague 2015 pre-conference presentation shows practical usage of linked data sources. These sources can help to enrich content with entities, add links to external data sources, and use the enriched content in question answering, machine translation, or other scenarios. The aim is to show the practical application of linked data sources in XML tooling. The presentation is an update that provides outcomes of the related session held at XML Prague 2014.
An Open Talk at DeveloperWeek Austin 2017 by Kimberly Wilkins (@dba_denizen), Principal Engineer - Databases at ObjectRocket. Featuring new use cases like Bitcoin, AI, IoT, and all the cool things.
The IT committee update document discusses:
1) The IT committee has 18 members and is recruiting new members until December 12th. 2) Past meetings included discussions around server virtualization, alternative solutions for the infocenter, a new Galaxy strategy, and mobile app ideas. 3) Upcoming meetings will take place in Brussels from December 14-16 with 8-10 members attending.
Couchbase Lite is a NoSQL mobile database that uses a document data model with key-value pairs and handles data one document at a time. It supports push and pull replication for syncing documents between devices and servers, including both continuous and one-shot replication with options for persistent or non-persistent settings. The document provides details on Couchbase Lite's data structures, basic operations, replication features, and includes links to related resources and a demo app that uses Cloudant as the backend data layer.
Grafana is an open source analytics and monitoring solution that allows users to visualize data and metrics from various sources. It provides a flexible dashboard interface that supports creating and sharing visualizations, alerting, and templating. Grafana has evolved over several major versions to support more data sources, improved UX, alerting capabilities, and a plugin system. It aims to continue expanding supported data sources and features like reporting, live data streaming, and clustering.
This document summarizes a presentation about using the Migrate API in Drupal for data migration. It introduces Drupal and the Migrate API, describes how to perform Drupal-to-Drupal migration with the Migrate API and Drupal-to-Drupal Migration module, and how migration logic works in Drupal 8 to improve the upgrade process. Resources for learning more about Drupal and the Migrate API are also provided.
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli... - Flink Forward
Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical dataset with high volume that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs with a streaming approach to event-time partitioning.
CData - Triangle Woodard Group - QuickBooks - Jerod Johnson
This document discusses how CData Software provides integration components that allow users to access and analyze QuickBooks data from various applications and tools. It describes CData's Excel add-in and ODBC driver, which allow QuickBooks data to be accessed and visualized in Excel and Tableau, respectively. The document promotes CData's software as providing reliable, standards-based connectivity to QuickBooks data from any application.
The document summarizes the history and purpose of Arkstore, a semantic data storage project. It began in 2011 as a university project in Russia and was commercialized in 2012-2013 as Coldsnipe company. In 2013 it became the ARKSTORE project focused on semantic web storage. Arkstore provides persistent storage of knowledge through mechanisms like backups and high availability. It uses various storage systems and has layers including an API, web interface, Ark semantic web engine, and Ark DataStore which aggregates storage systems. It supports various ontologies and public datasets to store and retrieve semantic data.
PayPay migrated their payment database from Amazon Aurora to TiDB in 3 months. They chose TiDB for its horizontal scalability, high availability, and ability to remove the need for application-level sharding. They performed an accuracy verification by comparing data between the old and new databases, as well as across microservices. Performance and availability testing was also conducted during the migration to validate the migration was successful. After 3 months of the new TiDB database in production, PayPay saw the expected performance improvements and zero incidents, finding TiDB to be a reliable replacement.
The document discusses data collection methods for improving machine translation systems. It describes uploading usage data from users to servers for manual transcription and translation to integrate new data. It also discusses collecting new training data through recording speech and bilingual texts in new language pairs and domains. Two approaches are mentioned: translating only important sentences or sorting sentences by importance and using non-professionals to reduce costs. Other projects discussed include pre-installing Jibbigo on iPod touches and customizing hardware for different translation applications.
City of Atlanta Oracle Application Footprint - Danny Bryant
The City of Atlanta has grown its Oracle footprint significantly over time. It currently uses Oracle E-Business Suite, Hyperion, OBIEE, Application Express, Siebel for customer service requests, and Taleo for recruiting and performance management. There are plans to migrate the E-Business Suite from 11i to 12.2.x. The migration brings both concerns about potential issues and the benefit of new features.
This document provides an overview of Grafana, an open source metrics dashboard and graph editor for Graphite, InfluxDB and OpenTSDB. It discusses Grafana's features such as rich graphing, time series querying, templated queries, annotations, dashboard search and export/import. The document also covers Grafana's history and alternatives. It positions Grafana as providing richer features than Graphite Web and highlights features like multiple y-axes, unit formats, mixing graph types, thresholds and tooltips.
This document discusses Red Hat's Open Data Hub platform for multi-tenant data analytics and machine learning. It describes the challenges of sharing data and compute resources across teams and the Open Data Hub architecture which allows teams to spin up and down their own compute clusters while sharing a common data store. Key elements of the Open Data Hub include Spark, Ceph storage, JupyterHub notebooks, and TensorFlow/Keras for modeling. The document provides an overview of data structures, analytics workflows, and the components and roadmap for the Open Data Hub platform.
SortaSQL is a proposal to add seamless horizontal scalability to SQL databases by using the filesystem to store and retrieve data. The SQL database would store metadata and handle queries, while an embedded key-value store manages record storage on files in the local or distributed filesystem. This allows queries to scale across many servers by letting the filesystem handle replication, performance and locking of distributed data files. The architecture involves an application communicating with PostgreSQL over SQL, which uses a SortaSQL plugin to retrieve rows from Kyoto Cabinet key-value files on the POSIX filesystem. Case studies at CloudFlare show how a 400GB per day dataset can be efficiently stored and queried at scale using this approach.
Leo Hsu and Regina Obe
We'll demonstrate integrating PostGIS in both PHP and ASP.NET applications.
We'll demonstrate using the new PostGIS 1.5 geography offering to extend existing web applications with proximity analysis.
More advanced uses: displaying maps and stats using OpenLayers and WMS/WFS services, and rolling your own WFS-like service using the PostGIS KML/GML/GeoJSON output functions.
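The proximity analysis mentioned above boils down to a single geography-typed query. As a minimal sketch (the talk itself demonstrates PHP and ASP.NET; the connection string, table, and column names below are hypothetical), here is the same ST_DWithin pattern driven from Python with psycopg2:

```python
import psycopg2

# Hypothetical connection string and schema; assumes a "places" table with a
# geography(Point, 4326) column named "geog" and PostGIS installed.
conn = psycopg2.connect("dbname=gisdb user=gis password=secret host=localhost")
cur = conn.cursor()

# Find every place within 5 km of a point of interest. With the geography type,
# ST_DWithin takes its distance in meters and handles great-circle math for us.
poi = "SRID=4326;POINT(-71.06 42.36)"
cur.execute(
    """
    SELECT name, ST_Distance(geog, ST_GeographyFromText(%s)) AS meters
    FROM places
    WHERE ST_DWithin(geog, ST_GeographyFromText(%s), %s)
    ORDER BY meters
    """,
    (poi, poi, 5000),
)
for name, meters in cur.fetchall():
    print(f"{name}: {meters:.0f} m away")

cur.close()
conn.close()
```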
Internet-enabled GIS Using Free and Open Source Tools - John Reiser
Internet-enabled GIS can be developed using free and open source tools like MapServer, GeoServer, TileCache, and OpenLayers. Open source GIS software allows data and applications to be freely shared, adapted, and improved by a community. Pre-rendering map tiles improves rendering speed compared to generating maps from source data for each request. The open source GIS community collaborates to build and enhance software and data.
KohaCon 2018 was held in Portland, Oregon from May 21-25 with over 230 registered users from around the world. The conference included a cultural day and 3-day hackfest after 3 days of presentations on topics like EDI standards in the US, the SubjectsPlus discovery tool, linked data, data-driven decision making, and the Koha ILL module. Upcoming EDS and citation plugins were demonstrated. Talks also covered the Koha manual, Coral ERM integration, Elasticsearch indexing, and customizations at BULAC library. KohaCon 2019 will be held in Dublin, Ireland from May 20-26, 2019.
DSpace at ILRI: A semi-technical overview of “CGSpace” - CIARD Movement
This document provides a semi-technical overview of CGSpace, a digital repository managed by the International Livestock Research Institute (ILRI) that is used by nine CGIAR centers to store over 50,000 research items and receives around 250,000 hits per month. It discusses the history and use of DSpace at ILRI, how content is organized and described, strategies for search engine optimization and dissemination, and the technical skills required for maintenance and development.
Presto talk @ Global AI conference 2018 Boston - kbajda
Presented at Global AI Conference in Boston 2018:
http://www.globalbigdataconference.com/boston/global-artificial-intelligence-conference-106/speaker-details/kamil-bajda-pawlikowski-62952.html
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, LinkedIn, Bloomberg, and FINRA, Presto experienced unprecedented growth in popularity in both on-premises and cloud deployments over the last few years. Presto is really a SQL-on-Anything engine: a single query can access data from Hadoop, S3-compatible object stores, RDBMS, NoSQL, and custom data stores. This talk will cover some of the best use cases for Presto and recent advancements in the project, such as the Cost-Based Optimizer and geospatial functions, as well as the roadmap going forward.
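To make the "SQL-on-Anything" claim concrete, here is a minimal sketch of a federated query from Python. It assumes the presto-python-client package and a coordinator at localhost:8080 with hive and mysql catalogs already configured; the table names are hypothetical:

```python
import prestodb  # pip install presto-python-client (assumed available)

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# One query joins a Hive/S3 table with an operational MySQL table --
# Presto federates across both connectors at query time.
cur.execute(
    """
    SELECT c.country, count(*) AS orders
    FROM hive.default.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY c.country
    ORDER BY orders DESC
    LIMIT 10
    """
)
for country, orders in cur.fetchall():
    print(country, orders)
```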
This document provides an introduction to graph databases. It defines a graph store as a tool for storing and retrieving highly related data where many things are connected to many other things. It notes that graph databases are optimized for this type of data and discusses some popular graph database implementations. It then explores why graph databases may be useful and some limitations. The document provides examples of graph data modeling and querying capabilities. It also outlines some advanced graph database features and how to interact with a graph database using different programming languages.
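As an illustration of the kind of modeling and querying the document describes, here is a minimal sketch using Neo4j's official Python driver and Cypher (Neo4j is just one of the popular implementations alluded to; the URI and credentials are placeholders):

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Model "many things connected to many other things": people who know people.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    # Traversal queries are where graph stores shine: reach friends-of-friends
    # with a single variable-length pattern instead of recursive joins.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:KNOWS*1..2]->(other) "
        "RETURN DISTINCT other.name AS name",
        name="Alice",
    )
    print([record["name"] for record in result])

driver.close()
```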
PGDay.Amsterdam 2018 - Jeroen de Graaff - Step-by-step implementation of Post... - PGDay.Amsterdam
Rijkswaterstaat is the executive agency of the Ministry of Infrastructure and Water Management in the Netherlands. During this presentation, I will share our journey to develop and apply PostgreSQL at Rijkswaterstaat. Our work is ICT-driven, and access to our data, both historical and current, is key to executing our task now and in the future.
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
In April 2014, the Bentley Historical Library received a $355,000 grant from the Mellon Foundation to integrate ArchivesSpace, Archivematica and DSpace into an end-to-end digital archives workflow. This presentation will identify key project goals and outcomes and demonstrate features and functionality of Archivematica’s new “Appraisal and Arrangement” tab.
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio - Alluxio, Inc.
Alluxio Bay Area Meetup March 14th
Join the Alluxio Meetup group: https://www.meetup.com/Alluxio
Alluxio Community slack: https://www.alluxio.org/slack
Data Day Texas 2017: Scaling Data Science at Stitch Fix - Stefan Krawczyk
At Stitch Fix we have a lot of Data Scientists. Around eighty at last count. One reason why I think we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they’re end-to-end responsible for their work; they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor and debug everything and anything required to get the output desired. They’re full data-stack data scientists!
The teams in the organization do a variety of different tasks:
- Clothing recommendations for clients.
- Clothes reordering recommendations.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP.
… and more!
They’re also quite prolific at what they do -- we are approaching 4500 job definitions at last count. So one might be wondering now, how have we enabled them to get their jobs done without getting in the way of each other?
This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientist, the Data Platform team tries to provide abstractions and infrastructure to help the Data Scientists. The relationship is a collaborative partnership, where the Data Scientist is free to make their own decisions and thus choose the way they do their work, and the onus then falls on the Data Platform team to convince Data Scientists to use their tools; the easiest way to do that is by designing the tools well.
In regard to scaling Data Science, the Data Platform team has helped establish some patterns and infrastructure that help alleviate contention. Contention on:
- Access to Data
- Access to Compute Resources:
  - Ad-hoc compute (think prototype, iterate, workspace)
  - Production compute (think where things are executed once they’re needed regularly)
For the talk (and this post) I only focused on how we reduced contention on Access to Data, & Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that I invite you to take a look through the slides.
This document summarizes the development of Lore's machine learning and NLP platform using Python. It started as a monolithic Python server but evolved into a microservices architecture using Docker, Kubernetes, and Celery for parallelization. Key lessons included using DevOps tools like Docker for development and deployment, Celery to parallelize tasks, and wrapping services to improve modularity, flexibility, and performance. The platform now supports multiple products and consulting work in a scalable and maintainable way.
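Since the summary credits Celery with the parallelization, here is a minimal hedged sketch of that pattern (the broker URL, task name, and payloads are hypothetical, not Lore's actual code):

```python
# tasks.py -- a minimal Celery app; start workers with: celery -A tasks worker
from celery import Celery, group

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task
def tokenize(document: str) -> list:
    """A stand-in for one small NLP step that can run on any worker."""
    return document.lower().split()

if __name__ == "__main__":
    # Fan a batch of documents out across workers and collect the results.
    docs = ["First example document", "Second example document"]
    job = group(tokenize.s(doc) for doc in docs)
    print(job.apply_async().get(timeout=30))
```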
#lspe Building a Monitoring Framework using DTrace and MongoDB - dan-p-kimmel
A talk I gave at the Large Scale Production Engineering meetup at Yahoo! about building monitoring tools and how to use DTrace to get more out of your monitoring data.
This document discusses change data capture (CDC) and its components. CDC is an approach that identifies, captures, and delivers changes made to enterprise data sources. It feeds these changes into a central data stream that can be combined with other data sources in real-time. The document outlines Kafka Connect, Debezium, Schema Registry, and Apache Avro which are key parts of the CDC architecture. It also discusses future steps like supporting additional databases and improving deployment, as well as open issues around performance and compatibility with certain databases.
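To ground the architecture described above, here is a minimal sketch of a downstream consumer reading Debezium change events from Kafka with confluent-kafka. For simplicity it assumes the connector uses the JSON converter rather than Avro/Schema Registry, and the broker and topic names are hypothetical:

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-demo",
    "auto.offset.reset": "earliest",
})
# Debezium topics are conventionally named <server>.<schema>.<table>.
consumer.subscribe(["inventory.public.customers"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        payload = event.get("payload", event)
        # "op" is c/u/d/r (create, update, delete, snapshot read);
        # "before"/"after" carry the row images for the change.
        print(payload.get("op"), payload.get("after") or payload.get("before"))
finally:
    consumer.close()
```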
The document discusses a project to investigate using Archivematica, an open-source digital preservation system, to provide digital preservation functionality for research data at the Universities of Hull and York. The project involved three phases: exploring Archivematica and research data needs, developing Archivematica features, and implementing proof-of-concept systems at both universities. Key findings included that Archivematica could meet many preservation needs but had limitations identifying research file formats, and that collaboration was important for addressing challenges in preserving research data long-term.
Webinar slides: DevOps Tutorial: how to automate your database infrastructure - Severalnines
Join our guest speaker Riaan Nolan of mukuru.com, the First Puppet Labs Certified Professional in South Africa, as he walks us through the facets of DevOps integrations and the mission-critical advantages that database automation can bring to your database infrastructure.
Infrastructure automation isn’t easy, but it’s not rocket science either. Done right, it is a worthwhile investment, but deciding on which tools to invest in can be a confusing and overwhelming process. Riaan will share some of his secrets on how to proceed with this and he knows what he’s talking about: he saves the companies he works for substantial amounts on their monthly IT bills, typically around 50%.
Don’t miss this opportunity to find efficiencies in your database infrastructure: watch this webinar to understand the key pain points that indicate it’s time to invest in database automation.
AGENDA
DevOps and databases - what are the challenges
Managing databases in a DevOps environment
- Requirements from microservice environments
- Automated deployments
- Performance monitoring
- Backups
- Schema changes
- Version upgrades
- Automated failover
- Integration with ChatOps and other tools
Data distribution
- Database hosting in cloud environments
- Managing data flows
Cloud Automation on AWS
SPEAKERS
Riaan Nolan was the First Puppet Labs Certified Professional in South Africa. Riaan uses Amazon EC2, VPC and Autoscale with Cloudformation to spin up complete stacks with Autoscaling Fleets. He saves companies substantial amounts on their monthly IT bills, typically around 50% - yes, at one company that meant $500k+ per year. And he’s participated in a number of community tech related forums. He uses next generation technologies such as AWS, Cloudformation, Autoscale, Puppet, GlusterFS, NGINX, Magento and PHP to power huge eCommerce stores. His specialties are Puppet Automation, Cloud Deployments, eCommerce, eMarketing, Specialized Linux Services, Windows, Process making, Budgets, Asset Tracking, Procurement.
- Devops Lead, Mukuru
- Expert Live Systems Administrator, foodpanda | Hellofood
- Senior Systems Administrator / Infrastructure Lead, Rocket Internet GmbH
- Senior Technology Manager, Africa Internet Accelerator
Art van Scheppingen is a Senior Support Engineer at Severalnines. He’s a pragmatic MySQL and Database expert with over 15 years experience in web development. He previously worked at Spil Games as Head of Database Engineering, where he kept a broad vision upon the whole database environment: from MySQL to Couchbase, Vertica to Hadoop and from Sphinx Search to SOLR. He regularly presents his work and projects at various conferences (Percona Live, FOSDEM) and related meetups.
Data for all: Empowering teams with scalable Shiny applications @ useR 2019 - Ruan Pearce-Authers
Shiny, alongside packages like dplyr and ggplot2, offers an unparalleled developer experience for creating self-service analytics dashboards that empower teams to make data-driven decisions. However, out of the box, Shiny is not well-suited to deployment in a multi-user environment. As part of our mission to establish a data culture in a game development studio, we wanted to deploy a suite of Shiny dashboards such that exploring player behaviour became part of every team’s workflow. In this talk, we will discuss the architecture of the supporting cloud infrastructure, including packaging, service orchestration, and authentication. Also, we will show how we’ve adapted Shiny to a multi-user environment using its new support for promises in combination with the future package. Integrating Shiny into this production-grade architecture allows for a streamlined data science workflow that enables data scientists to focus on creating dashboard content with a built-in code review process, and also to deploy changes to production in a button click. We hope to demonstrate how any data-driven organisation can augment their team-wide workflow by leveraging this end-to-end Shiny pipeline.
Behind the Scenes at Coolblue - Feb 2017 - Pat Hermens
This document discusses various tools in the Elastic Stack including Kibana, Elasticsearch, Beats, and Logstash. It provides brief descriptions of each tool and why they are used. Additional logging and monitoring tools are also mentioned, along with links to documentation, code samples, and other resources from the discussion.
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka - Hua Chu
This document discusses using Apache Kafka for heterogeneous job processing. It describes how the speaker's company evolved their job processing infrastructure from using a database with cron jobs, to Resque backed by Redis, to a custom system using Kafka. The custom system aims to provide durability and scalability for long-running jobs by decoupling jobs into smaller tasks communicated through Kafka topics. It achieves reliability by ensuring Kafka message replication and allowing tasks to recover from failures.
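A minimal sketch of the decoupling idea described above: one long-running job is split into smaller task messages on a Kafka topic, which any worker can pick up. The topic name and task schema are hypothetical, and confluent-kafka is assumed:

```python
import json
import uuid
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def submit_job(video_path: str, chunks: int) -> str:
    """Break one long-running job into small, independently retryable tasks."""
    job_id = str(uuid.uuid4())
    for i in range(chunks):
        task = {"job_id": job_id, "chunk": i, "source": video_path}
        # Keying by job_id keeps a job's tasks on one partition, preserving order.
        producer.produce("transcode-tasks", key=job_id, value=json.dumps(task))
    producer.flush()  # broker replication acks give the durability the talk relies on
    return job_id

print(submit_job("videos/demo.mp4", chunks=8))
```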
GSoC2014 - Uniritter Presentation May, 2015 - Fabrízio Mello
This presentation is about the work that I did during the Google Summer of Code 2014 for PostgreSQL. The project is about changing an unlogged table to logged and vice versa. Project wiki page: https://wiki.postgresql.org/wiki/Allow_an_unlogged_table_to_be_changed_to_logged_GSoC_2014
I present this work to Uniritter IT students in Canoas/RS (2015-05-18) and Porto Alegre/RS (FAPA - 2015-05-20).
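The capability this project worked toward later landed as plain DDL in PostgreSQL (ALTER TABLE ... SET LOGGED / SET UNLOGGED, available since 9.5). A minimal sketch of the round trip from Python, with a hypothetical table name:

```python
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

# Unlogged tables skip WAL, so bulk loads are fast, but the data is not
# crash-safe and is not replicated.
cur.execute("CREATE UNLOGGED TABLE IF NOT EXISTS staging_events (id int, payload text)")
cur.execute(
    "INSERT INTO staging_events "
    "SELECT g, 'row ' || g FROM generate_series(1, 100000) g"
)

# Flip the table to logged once the load is done; PostgreSQL writes the data
# into the WAL at this point so it becomes durable and replicable.
cur.execute("ALTER TABLE staging_events SET LOGGED")

cur.close()
conn.close()
```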
This document discusses using PyTables to analyze large datasets. PyTables is built on HDF5 and uses NumPy to provide an object-oriented interface for efficiently browsing, processing, and querying very large amounts of data. It addresses the problem of CPU starvation by utilizing techniques like caching, compression, and high performance libraries like Numexpr and Blosc to minimize data transfer times. PyTables allows fast querying of data through flexible iterators and indexing to facilitate extracting important information from large datasets.
This document discusses PyTables, a Python library for managing hierarchical datasets and efficiently analyzing large amounts of data. It begins by introducing PyTables and its use of HDF5 for portability and extensibility. Key features of PyTables discussed include its object-oriented interface, optimization of memory and disk usage, and fast querying capabilities. The document then covers techniques for maximizing performance like Numexpr for complex expressions, NumPy for powerful data containers, compression algorithms, and caching. Blosc compression is highlighted for its ability to compress faster than memory speed.
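A minimal sketch of the in-kernel querying these summaries describe, using PyTables' where() (which delegates condition evaluation to Numexpr); the file and column names are just for illustration:

```python
import numpy as np
import tables  # pip install tables (PyTables)

class Reading(tables.IsDescription):
    sensor = tables.Int32Col()
    value = tables.Float64Col()

with tables.open_file("readings.h5", mode="w") as h5:
    table = h5.create_table("/", "readings", Reading)
    row = table.row
    for i in range(100_000):
        row["sensor"] = i % 16
        row["value"] = np.random.random()
        row.append()
    table.flush()

    # In-kernel query: the condition string is compiled by Numexpr and evaluated
    # chunk by chunk, so matching rows are extracted without loading the whole table.
    hits = [r["value"] for r in table.where("(sensor == 3) & (value > 0.99)")]
    print(f"{len(hits)} matching rows")
```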
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas - MongoDB
Moving to a new home is daunting. Packing up all your things, getting a vehicle to move it all, unpacking it, updating your mailing address, and making sure you did not leave anything behind. Well, the move to MongoDB Atlas is similar, but all the logistics are already figured out for you by MongoDB.
Similar to Integrating ArchivesSpace and Archivematica at the Bentley Historical Library (20)
From Natural Language to Structured Solr Queries using LLMs - Sease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive”) gap remains between the data users’ needs and the data producers’ constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
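As a rough sketch of the proposed flow (not the presenters' implementation): a natural-language request plus the index's field metadata goes to an LLM, the model returns a structured Solr query, and pysolr executes it. The llm_complete function is a hypothetical stand-in for whatever LLM API is used, and the Solr core and fields are placeholders:

```python
import pysolr  # pip install pysolr

SOLR = pysolr.Solr("http://localhost:8983/solr/articles", timeout=10)
FIELDS = "title (text), author (string), year (int), body (text)"

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError("plug in your LLM client here")

def natural_language_search(question: str):
    prompt = (
        "Translate the user request into a Solr 'q' string using only these "
        f"fields: {FIELDS}. Return the query string only.\n"
        f"Request: {question}"
    )
    solr_q = llm_complete(prompt)        # e.g. 'author:"Jane Doe" AND year:[2020 TO *]'
    return SOLR.search(solr_q, rows=10)  # the structured query runs against the index

# results = natural_language_search("recent papers by Jane Doe")
```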
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
In this talk we will discuss DDoS protection tools and best practices, discuss network architectures, and look at what AWS has to offer. We will also examine one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022. We'll see what techniques helped to keep web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on the Ukraine experience.
Northern Engraving | Nameplate Manufacturing Process - 2024 - Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
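A minimal sketch of the trigram technique the talk mentions (the drugs table and column are hypothetical; pg_trgm is a standard contrib extension):

```python
import psycopg2

conn = psycopg2.connect("dbname=meds user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

# pg_trgm adds trigram similarity operators, which tolerate misspellings
# that plain LIKE/ILIKE would miss.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")

query = "ibuprofin"  # note the misspelling
cur.execute(
    """
    SELECT name, similarity(name, %s) AS score
    FROM drugs
    WHERE name %% %s            -- '%%' escapes pg_trgm's '%' similarity operator
    ORDER BY score DESC
    LIMIT 5
    """,
    (query, query),
)
print(cur.fetchall())
```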
Discover the Unseen: Tailored Recommendation of Unwatched Content - ScyllaDB
The session shares how JioCinema approaches "watch discounting." This capability ensures that if a user has watched a certain amount of a show or movie, the platform no longer recommends that particular content to the user. Flawless operation of this feature promotes the discovery of new content, improving the overall user experience.
JioCinema is an Indian over-the-top media streaming service owned by Viacom18.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels - Northern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations point of view? Is it possible to apply our beloved cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial for, or limiting to, your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working in practice.
Keywords: AI, Containeres, Kubernetes, Cloud Native
Event Link: https://meine.doag.org/events/cloudland/2024/agenda/#agendaId.4211
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
The Microsoft 365 Migration Tutorial For Beginner.pptx - operationspcvita
This presentation will help you understand the power of Microsoft 365. It covers every productivity app included in Office 365, outlines typical Office 365 migration scenarios, and explains how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc... - DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime: $5-10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for developing highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work with and manage large amounts of data in real time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
Dandelion Hashtable: beyond billion requests per second on a commodity server - Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor... - GlobalLogic Ukraine
During the talk we will answer why application performance needs to be improved and what the most effective ways to do so are. We will also discuss what a cache is, what types of caches exist and, most importantly, how to find a performance bottleneck.
Video and event details: https://bit.ly/45tILxj
Integrating ArchivesSpace and Archivematica at the Bentley Historical Library
1. Integrating ArchivesSpace
and Archivematica at the
Bentley Historical Library
Max Eckard
Lead Archivist for Digital Initiatives
Integrations with ArchivesSpace | February 19, 2020
8. Why Integrate ArchivesSpace and Archivematica?
● Shortcomings with previous workflow for digital processing
○ FileMaker Pro: Metadata was not in a widely used, open system
○ Microsoft Word → EAD: Workflow for EAD generation too localized and complicated
○ AutoPro: Not intended to be a long-term solution
○ Lack of system(s) of record: Lots (and lots… and lots…) of duplicate metadata entry
○ Scale: Scale was an issue!
● ArchivesSpace
○ System of record for descriptive and administrative metadata
○ Gradual migration of metadata from disparate systems into ArchivesSpace
● Archivematica
○ System of record for digital processing/Archival Information Package (AIP) creation
○ Gradual migration of digital backlog into Archivematica
Both systems (DSpace, too!) play nicely with others!
9. ● Worked with Artefactual
● Developed a new “Appraisal and Arrangement” tab in Archivematica
● Integrated Archivematica and ArchivesSpace in a way that permits archivists to:
○ After initial transfer…
■ Display resource records in a treeview
■ Create and edit descriptive metadata for new or existing archival objects
■ Drag and drop digital content onto archival description, creating digital objects
associated with those archival objects
○ After AIP creation and deposit…
■ Write newly-minted DSpace handles back to ArchivesSpace
Sponsoring New Features in Archivematica
10.
11.
12. How well has this systems integration aged?
● Pretty well! But...
● We use the particular workflow outlined here less and less
○ Works well for relatively small, heterogeneous transfers
○ Doesn’t work as well for:
■ Relatively large or homogenous transfers
■ Transfers destined for digital repositories besides DSpace
● That said, we use integration more and more (in more loosely- and tightly-coupled gradations)
13. Lessons Learned
● Systems integration is where it’s at! Side note: The integration of people is as
challenging as the integration of systems.
● More important than any particular workflow is…
○ Having systems that are designed to play nicely with others
○ The technical upskilling we did as a team while implementing, migrating to, and integrating
ArchivesSpace and Archivematica
■ bentley-historical-library/aip-repackaging: Scripts to support repackaging and depositing
Archivematica AIPs to DSpace
■ bentley-historical-library/dappr: A client to communicate with a remote DSpace
installation using its backend API
■ Lots of other ArchivesSpace repositories...
We are now much more flexible in how we approach
digital processing!
15. References
● [1]: Dallas Pillen, “Integrating Archive-It and ArchivesSpace at the Bentley
Historical Library,” Integrations with ArchivesSpace, December 4, 2019
Resources
● Archival Integration (blog)
● ArchivesSpace-Archivematica-DSpace Workflow Integration
○ Part 1: Configuring ArchivesSpace and DSpace Integration within Archivematica
○ Part 2: Appraisal, Arrangement to ArchivesSpace and Deposit to DSpace
● BHL Archival Curation: Digital Processing
Editor's Notes
Thought I’d start with an overview of our institutional context/technical ecosystem. I actually borrowed this slide from my colleague, Dallas Pillen, who recently gave a webinar in this very Integrations with ArchivesSpace series on ArchivesSpace - Archive-It integration.
So, we’ve got a lot going on here. We’ve got… (And I’m not even mentioning all of the other more localized database and spreadsheet systems we use here as well.)
It’s REALLY IMPORTANT to note that these systems are used by a wide variety of stakeholders within the Bentley (including the “back of the house” Curation team and the “front of the house” Reference and Academic Programs team) as well as beyond (including novice researchers like U-M undergrads and more advanced researchers from both inside and outside the U-M community, as well as the general public).
They are likewise hosted and supported by a wide variety of stakeholders. For some, we rely on U-M Library LIT, for others, we rely on U-M ITS, and for still others, we’re experimenting with hosting and supporting them ourselves.
Managing all of these systems--especially the handoffs of data and metadata between them--can get overwhelming. So we actively look for ways to integrate them with one another, creating a kind of functional coupling between them so that they act as a coordinated whole to fulfill a number of archival workflows. And actually the integration I’ll be talking about today was part of a larger project that’s kind of the one that kicked this whole thing off for us.
So, yeah, the Archivematica-ArchivesSpace integration that I’ll talk about today was actually part of larger, Mellon Foundation-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project (2014-2016) that united three Open-Source Software platforms.
The point of the ArchivesSpace-Archivematica portion of the integration was to...
But, for additional context, as part of the overall workflow, we also wanted to:
Streamline the ingest and deposit of content in a preservation repository.
Find solutions that met the Bentley’s local needs, but which were also flexible and scalable for other institutions; modular, so that institutions may adopt some, none, or all of the development features; and based upon open standards so that other tools and/or repository platforms could be integrated.
Share all code and documentation with the archives and digital preservation communities.
And, just to give you a sense of why we were interested in this, let me show you where we were coming from.
We had been doing digital processing with a bunch of localized, disparate silos of data and metadata that didn’t really work together at all.
So for example, here’s how we tracked accessions in a FileMaker Pro database (affectionately called BEAL).
We also used to do arrangement and description work of both physical and digital archives in Microsoft Word documents, generating Encoded Archival Description (EAD) using macros applied to various Microsoft Word styles.
And, we did digital processing with the AutomatedProcessor (AutoPro), a homegrown digital preservation tool written in Windows shell scripts.
While the use of FileMaker Pro, Microsoft Word, and AutoPro lowered technical barriers and introduced efficiencies into our digital processing initiatives, there were numerous shortcomings.
The use of a custom FileMaker Pro database, for example, limited our ability to take advantage of the affordances of more widely-used systems (e.g., Archon and Archivists’ Toolkit and later ArchivesSpace), such as the ability to integrate with other tools.
Using Microsoft Word to generate EADs was certainly easier than hand-encoding XML, but training processors in Microsoft Word styles and macros made the process very localized and more complicated for entering descriptive information than, say, ArchivesSpace.
AutoPro had limited error handling, a poor user interface, and various support issues, and was never really intended to be a long-term solution.
In general, there was also a lack of well-defined system(s) of record: This meant lots of duplicate/redundant metadata entry in various platforms, and also meant we had a really hard time managing this metadata over time.
None of these tools, I’ll add, really helped us work at scale. At all.
Meanwhile, ArchivesSpace and Archivematica had emerged as two of the most exciting open source platforms for working with digital archives. We were adopting ArchivesSpace to…and Archivematica to…and, best of all... [animation]... that is, both use common metadata standards, have APIs, are open-source (although that’s not necessarily a prerequisite for systems integration, but it helps), etc.
And so, for this grant, we sponsored some development in Archivematica, essentially paying Artefactual to develop a new “Appraisal and Arrangement” tab in Archivematica. In addition to being the spot where this ArchivesSpace-Archivematica integration would happen, this introduced functionality to appraise and review digital content from within Archivematica. Not going to talk about this much but feel free to ask any questions...
As it pertains to this webinar, however, it integrated Archivematica and ArchivesSpace via the introduction of an ArchivesSpace ‘pane’ within the Appraisal and Arrangement tab. This feature utilizes the ArchivesSpace API to:
Display resource records in a tree view depicting the intellectual hierarchy of archival objects in ArchivesSpace
Create and edit descriptive metadata for new or existing archival objects; authored in Archivematica and written to ArchivesSpace
It also permits archivists to drag and drop digital content onto archival description to create ArchivesSpace digital object records.
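To give a concrete sense of what the ArchivesSpace side of this looks like, here is a minimal Python sketch of creating a digital object and attaching it to an existing archival object via the ArchivesSpace API. The backend URL, credentials, repository id, and archival object id are all placeholders; this illustrates the kind of calls involved, not the integration's actual code.

```python
import requests

ASPACE_API = "http://aspace.example.edu:8089"   # placeholder backend URL
REPO_ID = 2                                     # placeholder repository id

# Authenticate and grab a session token
resp = requests.post(f"{ASPACE_API}/users/admin/login", params={"password": "admin"})
session = {"X-ArchivesSpace-Session": resp.json()["session"]}

# Create a digital object record
digital_object = {
    "jsonmodel_type": "digital_object",
    "title": "Example digital content",
    "digital_object_id": "example-do-001",
}
do_uri = requests.post(
    f"{ASPACE_API}/repositories/{REPO_ID}/digital_objects",
    headers=session, json=digital_object,
).json()["uri"]

# Attach it as an instance on an existing archival object (id 4567 is a placeholder)
ao_url = f"{ASPACE_API}/repositories/{REPO_ID}/archival_objects/4567"
archival_object = requests.get(ao_url, headers=session).json()
archival_object.setdefault("instances", []).append(
    {"instance_type": "digital_object", "digital_object": {"ref": do_uri}}
)
requests.post(ao_url, headers=session, json=archival_object)
```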
After launching the ingest of Submission Information Packages (SIPs) in Archivematica, a related integration of Archivematica and DSpace automatically uploads the fully ingested AIP and its associated descriptive metadata as a unique item in DSpace. The persistent URL of that item (its “handle”) is in turn written back to ArchivesSpace, where it serves as a link to the digital content, both within ArchivesSpace itself and when the archival description is exported to an EAD finding aid.
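The handle write-back step can likewise be sketched in a few lines of Python. The digital object URI and the handle value below are placeholders; again, this is illustrative rather than the actual integration code.

```python
import requests

ASPACE_API = "http://aspace.example.edu:8089"            # placeholder backend URL
DO_URI = "/repositories/2/digital_objects/123"           # placeholder digital object
HANDLE = "http://hdl.handle.net/2027.42/999999"          # placeholder DSpace handle

resp = requests.post(f"{ASPACE_API}/users/admin/login", params={"password": "admin"})
session = {"X-ArchivesSpace-Session": resp.json()["session"]}

# Record the deposited item's handle as a file version on the digital object,
# so ArchivesSpace (and any exported EAD) can link out to the content in DSpace
digital_object = requests.get(f"{ASPACE_API}{DO_URI}", headers=session).json()
digital_object.setdefault("file_versions", []).append({"file_uri": HANDLE})
requests.post(f"{ASPACE_API}{DO_URI}", headers=session, json=digital_object)
```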
Screencast
All in all, this systems integration, when mapped to a digital preservation workflow like the Digital Curation Centre Curation Lifecycle model, looks something like this. As you can maybe see, Archivematica’s the one doing the driving, data is flowing both bidirectionally and unidirectionally between systems at various stages of the workflow, and Archivematica is using various integration methodologies like the ArchivesSpace API to integrate with ArchivesSpace (as well as SWORD v2 and the DSpace API to integrate with DSpace, even though I didn’t really touch on those).
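For the DSpace side, a SWORD v2 deposit is essentially an HTTP POST of a packaged SIP/AIP to a collection's SWORD endpoint. Here is a rough Python sketch with a placeholder endpoint, credentials, and package; the exact packaging format and URL depend on how the DSpace instance is configured, and this is not the code Archivematica itself uses.

```python
import requests

# Placeholder SWORD v2 collection endpoint on a DSpace instance
SWORD_URL = "https://dspace.example.edu/swordv2/collection/2027.42/1234"
headers = {
    "Content-Type": "application/zip",
    "Content-Disposition": "filename=aip.zip",
    "Packaging": "http://purl.org/net/sword/package/METSDSpaceSIP",
}

# Deposit a packaged AIP as a new item in the target DSpace collection
with open("aip.zip", "rb") as package:
    resp = requests.post(SWORD_URL, headers=headers, data=package,
                         auth=("depositor@example.edu", "password"))
print(resp.status_code)  # 201 Created on success; the response body is an Atom entry
```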
So, this was four years ago. You might be asking, how well has this systems integration aged?
Pretty well, actually! We still use it, to this day, just this morning, in fact! It has survived upgrades to both ArchivesSpace (started with 2.2 and are now on 2.5x), and Archivematica (1.6 to 1.9), as well as the latter’s migration to a new server.
I will say that we use the particular workflow I outlined here less and less. It works really well for relatively small, heterogeneous transfers, but that doesn’t exactly match the kind of accessions or transfers we usually get, which, more often, are small and homogeneous or, regardless of whether they’re homogeneous or heterogeneous, are definitely trending bigger and bigger (whether you measure that by number of files or by size). As I showed earlier, we also have some more specialized platforms for digitized images and streaming audiovisual material, and this particular workflow obviously connects to DSpace but not to them.
That said, we do use this strategy of integration, and in particular integrations with systems, like ArchivesSpace and Archivematica and various other repository systems, that are designed to play nicely with others, MORE and MORE.
Which leads me to lessons learned. We’ve definitely drunk the Kool-Aid and think systems integration is where it’s at. Side note: It turns out the integration of people is at least as challenging as, if not more challenging than, the integration of systems (the proxies for those people), which is why I mentioned all of those stakeholders at the beginning.
Looking back, I think we’d now say that more important than any particular workflow is…
Just simply having systems that are designed to play nicely with others
Really, the technical upskilling we did as a team while implementing, migrating to, and integrating ArchivesSpace and Archivematica.
As some evidence of that, here are a couple of GitHub repositories with code we’ve developed on our own based off of these integrations and the workflow they support:
AIP Repackaging scripts to… essentially these help us to replicate the integration part of what I just showed without locking us in to a particular workflow
Similarly, DAPPr, a Python-based API wrapper for DSpace that, again, allows us to get data and metadata into and out of DSpace in a programmatic way without being forced to use a particular workflow
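As a purely hypothetical illustration of what “repackaging” an AIP can involve (and not the actual bentley-historical-library scripts), here is a short Python sketch that pulls the METS file and preserved objects out of an extracted Archivematica AIP so they can be packaged for deposit elsewhere; the directory layout assumed here is the standard BagIt-style Archivematica AIP structure.

```python
from pathlib import Path
import shutil

def repackage_aip(aip_dir: str, out_dir: str) -> None:
    """Copy the METS file and original objects out of an extracted Archivematica AIP
    so they can be repackaged for deposit elsewhere (illustrative only)."""
    aip = Path(aip_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Archivematica AIPs follow BagIt: the payload lives under data/, including the AIP METS
    mets_files = list(aip.glob("data/METS.*.xml"))
    if not mets_files:
        raise FileNotFoundError("No METS file found; is this an extracted Archivematica AIP?")
    shutil.copy2(mets_files[0], out / mets_files[0].name)

    # Copy the preserved objects, skipping Archivematica's submissionDocumentation
    objects_root = aip / "data" / "objects"
    for obj in objects_root.rglob("*"):
        if obj.is_file() and "submissionDocumentation" not in obj.parts:
            target = out / "objects" / obj.relative_to(objects_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, target)
```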
All of this results in the fact that… [animation]... We now much prefer to sit down with a digital processing problem, think about what we want the end product to look like and all the different ways we might get there, and go from there.
Actually I think we’re saving questions for the end.