Running DSpace: Technical overview, lessons learned, workflows and essential skills

Presented by Alan Orth and Sisay Webshet at the DSpace Ethiopia Interest Group Meeting and Training, Addis Ababa, Ethiopia, 28 October – 1 November 2013.

Running DSpace
Technical overview, lessons learned, workflows
and essential skills
Alan Orth and Sisay Webshet
DSpace Ethiopia Interest Group Meeting
Addis Ababa, 28 October 2013

DSpace Instances
One server, two instances...
CGSpace

DSpace Test

Hosted at CGNET in California, USA

Instance Overview
CGSpace
(cgspace.cgiar.org)
● “Production”
● Should always be up &
stable
● Is the “reference”
implementation

DSpace Test
(dspacetest.cgiar.org)
● “Development”
● Changes to style,
functionality, DSpace etc
are tested here first
● Sometimes wiped clean

Living With Legacy Decisions...
CGSpace and DSpace Test on the same machine…
● In 2010 CGSpace had a fraction of the content, users, etc, so
it didn’t affect the running of the system
● Not true anymore!
● 100s of 1000s of monthly views...
● Large assetstore, log files, RAM / CPU usage, etc

(Near) Future Plans
Separate instances on Amazon EC2!

cgspace.cgiar.org

dspacetest.cgiar.org

CGSpace Code Is 100% Open

Source code is on GitHub: github.com/ilri/DSpace

How The Code Is Organized
Production code lives in the 3_x-prod branch; this
is stable, tested code. Updates (if any) come from
the development branch on Monday.
Development code lives in the 3_x-dev branch;
this is semi-tested code! Changes throughout the
week.
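
The weekly update from development to production could be scripted along these lines. This is a minimal sketch, not the actual CGSpace procedure: the remote name `origin`, the use of a `--no-ff` merge, and the helper `promotion_commands` are all assumptions made for illustration. Only the branch names and the Monday schedule come from the slide.

```python
from __future__ import annotations

from datetime import date

# Branch names as given on the slide.
PROD_BRANCH = "3_x-prod"
DEV_BRANCH = "3_x-dev"


def promotion_commands(today: date, remote: str = "origin") -> list[str]:
    """Return git commands for the weekly dev-to-prod update.

    Hypothetical helper: returns an empty list on any day except Monday.
    The remote name and the merge strategy are assumptions.
    """
    if today.weekday() != 0:  # date.weekday() returns 0 for Monday
        return []
    return [
        f"git fetch {remote}",
        f"git checkout {PROD_BRANCH}",
        f"git merge --no-ff {remote}/{DEV_BRANCH}",
        f"git push {remote} {PROD_BRANCH}",
    ]


if __name__ == "__main__":
    # Print the commands for review rather than running them.
    for command in promotion_commands(date.today()):
        print(command)
```

Printing the commands instead of executing them leaves a human in the loop, which suits a branch that is meant to hold only stable, tested code.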

“Social Coding” on GitHub
● Anyone can “fork” the
code repository to their
own GitHub account
● Source code repositories
can share code via “pull
requests”
● Developers can comment
on changes and discuss
issues

GitHub “OctoCat”

“Pull request” from @mire introducing the CUA module

Workflow Lessons Learned
● Sending changes is good, but leaves the
burden of merging to me
● Sending patches is better, but requires the sender
to know how to generate them
● Sending a pull request is best, but requires the
sender to know how to use git, branches, etc

Go Forth And Fork!

… and send pull requests!

Scenario: Create A New Theme
Creating an XMLUI theme for a new community
1. Create community in DSpace (e.g., 10568/38440)
2. Add custom metadata (e.g., cg.subject.bioversity)
3. Add custom submission template (input-forms.xml)
4. Copy existing XMLUI theme (e.g., ILRI) as a reference, and
customize for center-specific metadata, look & feel, etc
5. Update search & browse indexes (dspace.cfg)
6. Update XMLUI config for new theme (xmlui.xconf)
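
Step 6, the xmlui.xconf update, might look roughly like the sketch below. It assumes the usual xmlui.xconf layout of a `<themes>` block containing `<theme>` rules with `name`, `handle` (or `regex`) and `path` attributes. The theme name and path "Bioversity" are hypothetical, as is the helper `add_theme`; the handle is the example from the slide. The new rule is placed first on the assumption that the first matching rule wins, so check that against your own DSpace version.

```python
import xml.etree.ElementTree as ET


def add_theme(xconf_xml: str, name: str, handle: str, path: str) -> str:
    """Insert a handle-specific <theme> rule at the top of <themes>.

    Hypothetical helper for illustration; it works on an XML string and
    does not touch any file on disk.
    """
    root = ET.fromstring(xconf_xml)
    themes = root.find("themes")
    if themes is None:
        themes = ET.SubElement(root, "themes")
    rule = ET.Element("theme", {"name": name, "handle": handle, "path": path})
    # Assumption: rules are evaluated in order, so the specific handle
    # rule goes before any catch-all regex rule.
    themes.insert(0, rule)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<xmlui><themes>'
        '<theme name="Default" regex=".*" path="Mirage/" />'
        '</themes></xmlui>'
    )
    print(add_theme(sample, "Bioversity", "10568/38440", "Bioversity/"))
```

In practice most people edit xmlui.xconf by hand; the sketch only shows which attributes tie a theme directory to a community handle.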

DSpace Sysadmin Crashcourse
DSpace...
● is a Java application
● builds using maven and ant
● uses PostgreSQL as a database backend
● stores PDFs and other blobs in the filesystem
(“assetstore”)
● runs best on Linux
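
A quick way to see whether a server has the pieces listed above is to look for the command-line tools on the PATH. This is a minimal sketch: the executable names `java`, `mvn`, `ant` and `psql` are assumed to be the usual ones for Java, Maven, Ant and the PostgreSQL client, and `missing_tools` is a hypothetical helper, not part of DSpace.

```python
import shutil

# Assumed executable names for the tools DSpace builds and runs with.
REQUIRED_TOOLS = ("java", "mvn", "ant", "psql")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without the
    real tools being installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

The check says nothing about versions or about whether PostgreSQL is actually running; it only catches the most basic gaps before a build is attempted.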

CGSpace Stack

Apache httpd

Apache Tomcat

PostgreSQL

Bitstreams

Debian GNU/Linux

Why Not Use Tomcat Directly?
Any sysadmin will tell you that working with Tomcat is a joy*.
Surprisingly**, these things are annoying in Tomcat:
● Virtual hosting
● SSL
● redirects
● caching and manipulating headers
*for some definitions of “joy”
**not surprising, actually

Essential Technical Skills
Managing a DSpace instance doesn’t require “programmers”
or “developers” (but it doesn’t hurt).
Mainly, you’ll need:
● Linux experience (Debian, CentOS, Ubuntu)
● Administration experience (web servers, log files, cron jobs,
security)
● Software development concepts (git, patches,
branching/merging)

Better lives through livestock
ilri.org

The presentation has a Creative Commons licence. You are free to re-use or distribute this work, provided credit is given to ILRI.
