DSpace 4.2 Transmission: Import/Export
1. DSpace 4.2 Advanced Training – Content Transmission
DSpace 4.2 Advanced Training by James Creel is licensed under a
Creative Commons Attribution 4.0 International License. Special
thanks to the DuraSpace Foundation and the Texas Digital Library
for making this course possible.
2. Module Outline
• Harvesting and Disseminating with OAI/PMH
• Reading content with REST
• Export and Import with SIPs
• Depositing content with SWORD
• Importing content with the Simple Archive Format (SAF)
3. Introduction to Harvesting
• Open Archives Initiative
• Protocol for Metadata Harvesting
• Object Reuse and Exchange
• Harvesting with DSpace XMLUI
• Choice of collection source
• Replicate metadata (OAI-PMH) or metadata + data (OAI-PMH +
OAI-ORE)
• What an excellent way to rapidly populate one’s repository!
4. Introduction to Harvesting
• Go ahead and create a new collection wherever you please.
• We will be harvesting content from remote DSpace
repositories.
• Having created the collection, one is taken to the edit view.
Click the tab for Content Source
6. How do we learn about the harvest
source?
• Point your browser to
http://repository.tamu.edu/dspace-oai/request?verb=ListSets
to see a list of collections at TAMU.
• An OAI server answers requests for six verbs: Identify,
ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and
GetRecord.
• Point your browser to
http://www.openarchives.org/OAI/openarchivesprotocol.html
for details
• In the 1.8.x days, one would need to keep that page open when
trying to craft queries to OAI. Under 3.x and higher, there is a
lovely stylesheet courtesy of Lyncode that makes typical queries
easy and automatic.
7. Configuring the Content Source
• A sample OAI Provider – OAK Trust: The Texas A&M Digital
Repository: http://repository.tamu.edu/dspace-oai/request
• OAI Set spec: com_1969.1_5670
• Test the settings to make sure things are copasetic, then save.
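A set spec like the one above can be read out of a ListSets response. The following is a minimal sketch using only the Python standard library; the sample XML is illustrative (the set name is a made-up placeholder), and a real response will contain many more sets.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative response only; the setName is a hypothetical placeholder.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>com_1969.1_5670</setSpec>
      <setName>Example Community</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def parse_list_sets(xml_text):
    """Return (setSpec, setName) pairs found in a ListSets response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for node in root.iterfind(".//oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        pairs.append((spec, name))
    return pairs


for spec, name in parse_list_sets(SAMPLE_LIST_SETS):
    print(spec, "-", name)
```

The setSpec value is what goes into the "OAI Set spec" box of the Content Source tab.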
10. Your oai webapp provides a
machine-readable dissemination
service.
• Try some requests:
• http://localhost:8080/oai/request?verb=Identify
• http://localhost:8080/oai/request?verb=ListMetadataFormats
• http://localhost:8080/oai/request?verb=ListSets
• http://localhost:8080/oai/request?verb=ListRecords&metadataPrefix=ore
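Request URLs like these can also be assembled in code rather than typed by hand. This is a small sketch, not part of DSpace: the function name is hypothetical, and it only builds the URL string without contacting any server.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


print(oai_request_url("http://localhost:8080/oai/request", "Identify"))
print(oai_request_url("http://localhost:8080/oai/request",
                      "ListRecords", metadataPrefix="ore"))
```

Swapping localhost for a neighbor's IP address gives the URL to harvest from their machine.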
11. We can experiment with
harvesting from each other’s
repositories
• From your command line, run ipconfig
• Your IP address will be listed as the IPv4 address
• You can craft an OAI request URL for your server using
the IP address as the host name.
• If you like, invite a neighbor to harvest one of your
collections.
12. Automating Harvesting (1/3)
• Requests to harvest large collections can easily time
out.
• This calls for a scheduler that runs independently of
the browser.
• Find it in the XMLUI under
the control panel.
13. Automating Harvesting (2/3)
• When automated, the harvester will conduct its
activity on all collections that are configured to
harvest.
• Once started, the harvester will operate at regular
intervals as specified by
harvester.harvestFrequency in
modules/oai.cfg.
14. Automating Harvesting (3/3)
• Start – initiate the periodic process
• Pause – wait for the current operation to complete,
then suspend further operations
• Stop – wait for the currently harvested item to
complete, then suspend further operations (which will
likely break further harvests of the containing
collection)
• Reset Harvest Status – clears the status of each
harvested collection so that they may be initiated
anew
15. Which formats are available to
your harvester?
• This is configurable in [dspace-install-dir]\modules\oai.cfg
under the
harvester.oai.metadataformats.[declared-metadata-format-name]
values
• Where [declared-metadata-format-name] is declared in
your xoai.xml
• Let’s add “rdf” to that list and try harvesting with it.
16. Dissemination – Metadata
Crosswalks
• Metadata in DSpace exist in key-value pairs with field names
given by the metadata registry.
• Fields may be exported in the formats that oai indicates from
the ListMetadataFormats verb.
• Dissemination crosswalks are encoded as XSL files inside the
[dspace-install-dir/config/crosswalks]
directory
• The .properties crosswalk files appear to be unused for OAI
dissemination since DSpace 3.x.
• The crosswalks are active in specific contexts that can be
configured.
17. Configuring Metadata Crosswalks –
XOAI Configuration Entities
• Open up the
C:\dspace\config\crosswalks\oai\xoai.xml
file with jEdit.
• The top level Configuration element contains <Contexts>,
<Formats>, <Transformers>, <Filters>, and
<Sets>.
• Each of these contains, in turn, what you would expect -
<Context> elements, <Format> elements,
<Transformer> elements, <Filter> elements, and
<Set> elements.
• Each kind of element plays a distinct role, covered in the slides that follow.
18. Configuring Metadata Crosswalks –
XOAI Configuration – Setting up
Contexts
• The <Context> element refers to instances of all the other
elements.
• The baseurl attribute determines how to address the context
in your url path
• The <Format> elements name the crosswalks to be available
• The <Transformer> element names a stylesheet to apply to
the final XML output
• The <Filter> elements name Java classes that will eliminate
results unacceptable to the context
• The <Set> element appears simply to alias the set of all records
in the context.
19. Configuring Metadata Crosswalks –
XOAI Configuration – Setting up
Formats
• The <Format> elements have an id attribute which allows
them to be referenced in the <Context>
• They also contain, minimally, a
• <Prefix> by which they are addressed in OAI requests
• <XSLT> designating the xsl file doing the crosswalk
• And should include
• <Namespace> designating the namespace of XML output
• <SchemaLocation> designating the schema specification of
that XML
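To see how a context's formats resolve to the prefixes used in OAI requests, here is a rough sketch that reads a cut-down configuration. The XML fragment is simplified and hypothetical: a real xoai.xml carries an XML namespace and more elements, and the ref attribute used to point a context at a format id is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment. Real xoai.xml files are namespaced and
# larger; the "ref" attribute is an assumption made for this illustration.
SAMPLE_XOAI = """<Configuration>
  <Contexts>
    <Context baseurl="request">
      <Format ref="oaidc"/>
    </Context>
  </Contexts>
  <Formats>
    <Format id="oaidc">
      <Prefix>oai_dc</Prefix>
      <XSLT>metadataFormats/oai_dc.xsl</XSLT>
      <Namespace>http://www.openarchives.org/OAI/2.0/oai_dc/</Namespace>
      <SchemaLocation>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</SchemaLocation>
    </Format>
  </Formats>
</Configuration>"""


def prefixes_by_context(xml_text):
    """Map each context's baseurl to the metadata prefixes it offers."""
    root = ET.fromstring(xml_text)
    # Index the declared formats by their id attribute.
    prefix_by_id = {
        fmt.get("id"): fmt.findtext("Prefix")
        for fmt in root.iterfind("Formats/Format")
    }
    result = {}
    for context in root.iterfind("Contexts/Context"):
        refs = [fmt.get("ref") for fmt in context.iterfind("Format")]
        result[context.get("baseurl")] = [prefix_by_id[ref] for ref in refs]
    return result


print(prefixes_by_context(SAMPLE_XOAI))
```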
20. Configuring Metadata Crosswalks –
XOAI Configuration – Setting up
Transformers
• The <Transformer> element contains an id attribute by
which it is referenced in the <Context> and an <XSLT>
element designating its XSL file.
21. Configuring Metadata Crosswalks –
XOAI Configuration – Setting up
Filters
• The <Filter> elements contain an id attribute by which
they are referenced in the <Context> and
• <Class> which names the java class doing the filtering
• <Parameter> with a key attribute and one or more <Value>
elements that are used to parameterize the filtering method.
22. Configuring Metadata Crosswalks –
XOAI Configuration – Setting up
Sets
• The <Set> element has the usual id attribute and
• <Pattern> which renders as the set spec in the OAI response
• <Name> which renders as the set’s name
23. Exercise – A Custom Context
• Let’s imagine a use case where there is a requirement to be
harvested by a vendor or partner.
• Only items with certain fields are suitable for their index (for
example, those with a title, author, and type)
• Create a new context with an appropriate filter.
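Before configuring the filter, it can help to write down the rule it must enforce. The sketch below is only the predicate, in Python for brevity; the actual filter is a Java class named in the <Class> element. The field names dc.title, dc.contributor.author, and dc.type are assumptions about how the title, author, and type are stored.

```python
# Assumed field names for title, author, and type.
REQUIRED_FIELDS = ("dc.title", "dc.contributor.author", "dc.type")


def has_required_fields(metadata, required=REQUIRED_FIELDS):
    """Return True if every required field has at least one non-blank value.

    metadata is an iterable of (field_name, value) pairs, mirroring the
    key-value shape of DSpace metadata.
    """
    present = {name for name, value in metadata if value and value.strip()}
    return all(field in present for field in required)


complete = [
    ("dc.title", "A Study of Harvesting"),
    ("dc.contributor.author", "Doe, Jane"),
    ("dc.type", "Thesis"),
]
incomplete = [("dc.title", "Untyped Item"), ("dc.contributor.author", "Doe, Jane")]

print(has_required_fields(complete))
print(has_required_fields(incomplete))
```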
24. Configuring Metadata Crosswalks –
Styling for Human Readability
• The webapps\oai\static\style.xsl stylesheet is used to render
the OAI responses in a nice readable format with the links of
interest also provided.
• One may also change the stylesheet being used by OAI by
changing the stylesheet attribute of the
<Configuration> root element of xoai.xml.
• Let’s experiment with some changes to the style –
• New branding
• Links to each of the contexts
25. The REST Webapp (1/4)
• Representational State Transfer – A scalable, simple approach
to web services.
• Stateless on the server side – client maintains any session data
• Cacheable – responses should indicate whether the client can
save them in a web cache
• Layerable – Client need not know or care whether the server is
behind a proxy
• Simple, Uniform Requests – resources identifiable by URI,
responses report their format and their cacheability
26. The REST Webapp (2/4)
• Read Only in 4.x
• JSON or XML depending on your HTTP Header: Accept
• Possible values are application/xml and application/json
• Your browser may default to one or the other, but your
application code (or developer’s browser) can specify.
• Communities, Collections, Items and Bitstreams are queryable
resources
• The ?expand query parameter followed by a comma
delimited list will provide more detail than the default queries
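Application code can choose the response format by setting the Accept header itself. A minimal sketch follows: the function name is hypothetical, http://localhost:8080/rest is an assumed base URL for a local install, and the request object is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def rest_request(base_url, path, expand=(), accept="application/json"):
    """Build (but do not send) a GET request for the REST webapp."""
    url = base_url.rstrip("/") + path
    if expand:
        # Keep commas literal so the list reads as in the slides.
        url += "?" + urlencode({"expand": ",".join(expand)}, safe=",")
    return Request(url, headers={"Accept": accept})


req = rest_request(
    "http://localhost:8080/rest",   # assumed base URL
    "/communities",
    expand=["collections", "logo"],
    accept="application/xml",
)
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the call against a running server.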
27. The REST Webapp (3/4)
• Communities
• /rest/communities lists all
• /rest/communities/:id gets one
• ?expand possibilities: parentCommunity,
collections, subCommunities, logo, all
• Collections
• /rest/collections lists all
• /rest/collections/:id gets one
• ?expand possibilities: parentCommunityList,
parentCommunity, items, license, logo, all
28. The REST Webapp (4/4)
• Items
• /rest/items/:id gets one
• ?expand possibilities: metadata, parentCollection,
parentCollectionList, parentCommunityList,
bitstreams, all
• Bitstreams
• /rest/bitstreams/:bitstreamID gets one
• /rest/bitstreams/:bitstreamID/retrieve to download
• ?expand possibilities: parent, all
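Since each resource type accepts a different set of expand values, a small lookup table can catch mistakes before a request is made. This is a sketch built from the lists on these slides; the function name is hypothetical.

```python
# Expand values per resource type, as listed on the slides.
EXPAND_OPTIONS = {
    "communities": {"parentCommunity", "collections", "subCommunities",
                    "logo", "all"},
    "collections": {"parentCommunityList", "parentCommunity", "items",
                    "license", "logo", "all"},
    "items": {"metadata", "parentCollection", "parentCollectionList",
              "parentCommunityList", "bitstreams", "all"},
    "bitstreams": {"parent", "all"},
}


def expand_query(resource, options):
    """Return a '?expand=...' suffix after checking the options are allowed."""
    allowed = EXPAND_OPTIONS[resource]
    unknown = [option for option in options if option not in allowed]
    if unknown:
        raise ValueError(f"not valid for {resource}: {', '.join(unknown)}")
    return "?expand=" + ",".join(options) if options else ""


print("/rest/items/:id" + expand_query("items", ["metadata", "bitstreams"]))
```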
29. The DSpace Packager
• Utilized with the dspace packager command-line script
• Submission Information Packages
• Dissemination Information Packages
30. Submission Packages (SIPs)
• Four package formats supported by default:
• DSpace Archival Information Package (AIP) – used for backing up
and restoring DSpace repository content
• DSPACE-ROLES – used for backing up and restoring DSpace groups
and epersons
• METS – A zipfile containing MODS descriptive metadata and
designating content bitstreams and their disposition
• PDF – A single PDF file can be considered a package (supposing its
embedded metadata are suitable)
31. Submission Packages (SIPs)
• An example – importing a PDF as a package
• Track down a pdf on the interwebs – here’s one!
• http://hdl.handle.net/1969.1/2313
• Copy it to [dspace-install-dir] i.e. C:\dspace
• Learn about the packager with the
C:\dspace\bin\dspace packager --help --type PDF
command
• Can you craft the command to make the submission?
32. Submission Packages (SIPs) –
PDF example
• We need a -t for type, -p for parent collection, -e for eperson
email, and finally the name of the “package”
• Once this succeeds, however, the quality of the metadata is
likely to be very poor indeed! Embedded metadata are
seldom well populated.
33. Submission Packages (SIPs)
• An example – importing a METS package
• Of interest as this is also the package used by default for SWORD
deposits
• Find the file mets-sip-example.zip in the
W:\Development\resources directory.
• Copy it to [dspace-install-dir] i.e. C:\dspace
• Learn about the packager with the C:\dspace\bin\dspace
packager --help --type METS command
• Can you craft the command to make the submission?
34. Submission Packages (SIPs) –
METS example
• We need at least the -t flag for type, -p for parent collection,
-e for eperson, and finally the filename of the package.
• C:\dspace\bin\dspace packager -t METS -p
[collection-handle] -e admin@admin.com
mets-sip-example.zip
35. Dissemination Packages (DIPs)
• DSpace Archival Information Package
• DSPACE-ROLES
• METS
• No need to export PDFs, we might suppose.
• As a final packaging exercise, use the packager to disseminate
an item. This will require the additional -i (identifier, i.e.
handle of the object) and -d (disseminate instead of the
default, submit)
• Can you craft the command?
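One way to check your answer is to assemble the argument list in code, which makes the difference between submitting and disseminating explicit. This is a sketch under assumptions: the function name is hypothetical, the handles in the example are placeholders, and passing -e when disseminating is assumed rather than confirmed by the slides. It only builds the list; nothing is executed.

```python
def packager_args(mode, package_type, eperson, package_file,
                  parent_handle=None, item_handle=None,
                  dspace_cmd=r"C:\dspace\bin\dspace"):
    """Build the argument list for a 'dspace packager' call.

    mode is "submit" (the default behaviour of the packager) or
    "disseminate" (adds -d and -i).
    """
    args = [dspace_cmd, "packager"]
    if mode == "submit":
        if not parent_handle:
            raise ValueError("submit needs a parent collection handle (-p)")
        args += ["-p", parent_handle]
    elif mode == "disseminate":
        if not item_handle:
            raise ValueError("disseminate needs an object handle (-i)")
        args += ["-d", "-i", item_handle]
    else:
        raise ValueError(f"unknown mode: {mode}")
    args += ["-t", package_type, "-e", eperson, package_file]
    return args


# Placeholder handles; substitute ones from your own repository.
print(" ".join(packager_args("submit", "METS", "admin@admin.com",
                             "mets-sip-example.zip",
                             parent_handle="123456789/2")))
print(" ".join(packager_args("disseminate", "METS", "admin@admin.com",
                             "exported-item.zip",
                             item_handle="123456789/10")))
```

The returned list can be handed to subprocess.run on a machine where DSpace is installed.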
36. Dissemination Packages (DIPs)
• A successful dissemination:
• Let’s complete the circle by submitting this package to another
(or even the same) collection.
37. SWORD
• Simple Web Service Offering Repository Deposit
• DSpace comes with servers for v1 and v2
• Big innovation of v2 is ability to update items, but client
support is currently limited
• Accessible via a client or (e.g.) a cURL command.
• Accepts deposits via METS packages by default
• Requires an administrative eperson account
38. SWORD – accessing via cURL
command
• A cURL executable is provided at
W:\Development\curl-7.37.0-win32
• Copy that directory to your own C:\Development.
• This command is an extremely robust tool that enables
communication of data over protocols with and without
encryption – we are interested only in HTTP today.
39. SWORD – accessing via cURL
command – getting the
servicedocument
• Clues to the meaning may be found at
http://curl.haxx.se/docs/manpage.html
40. SWORD – accessing via cURL
command – Making a deposit
• A long, long command indeed…
• curl
• -i
• --data-binary "@mets-sip-example.zip"
• -H "Content-Disposition: filename=mets-sip-example.zip"
• -H "Content-Type: application/zip"
• -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP"
• -H "X-No-Op: false"
• -H "X-Verbose: true"
• --user "admin@admin.com:admin"
• http://localhost:8080/sword/deposit/123456789/26
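For those who would rather script the deposit, the same request can be expressed with the Python standard library. This is a sketch, not a tested SWORD client: the function name is hypothetical, it mirrors the headers of the cURL command above, and it uses HTTP Basic authentication as cURL's --user option does by default. The request is only constructed here, not sent.

```python
import base64
from urllib.request import Request


def build_deposit_request(deposit_url, payload, filename, user, password):
    """Build (but do not send) a SWORD v1 deposit request."""
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    headers = {
        "Content-Disposition": f"filename={filename}",
        "Content-Type": "application/zip",
        "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
        "X-No-Op": "false",
        "X-Verbose": "true",
        "Authorization": f"Basic {token}",
    }
    return Request(deposit_url, data=payload, headers=headers, method="POST")


# In practice payload would be the bytes of mets-sip-example.zip.
request = build_deposit_request(
    "http://localhost:8080/sword/deposit/123456789/26",
    payload=b"placeholder bytes, not a real zip",
    filename="mets-sip-example.zip",
    user="admin@admin.com",
    password="admin",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would attempt the deposit against a running SWORD server.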
41. SWORD – accessing via cURL
command – Making a deposit
• Find that text in the W:\Development\resources\curl-deposit-notes
file.
• In an amusing turn of events, this deposit will fail from most of
our localhost machines: behind the scenes, the SWORD
server attempts to write a temporary file named after your
IP address, which contains colon characters, and colons are
illegal in Windows filenames.
• This can be gleaned from the
C:\Development\tomcat\logs\localhost.[today].log
• Instead, let’s experiment with deposits to other servers in the
room.
42. SWORD – Bringing up the
DSpace Client
• Activate the aspect in xmlui.xconf
• Target repositories are configured in the
[dspace-install-dir]\config\modules\sword-client.cfg file
43. SWORD – Utilizing the DSpace
SWORD Client
• Serves at this time only to copy existing items to another
SWORD-enabled repository.
• To utilize, navigate to the item’s page while logged in as an
administrator.
• Let’s try some deposits to localhost and our neighbors.
44. SWORD – Looking Forward to
SWORD v2 in Practice
• SWORD v2 offers the capability to change the content and
metadata of previously deposited items
• Java libraries for the client are available, but I have not seen an
implemented GUI.
• cURL usage is theoretically possible as well, but looks
like a bit of heavy lifting.
45. Batch Imports
• DSpace Simple Archive Format (SAF)
• The DSpace import script
• Adding items
• Replacing items
• Deleting items
• Importing from real sources
• Example: CSV
• Example: MARC XML
46. DSpace SAF (1/3) - Overview
• The top level directory contains one directory for each item in
the batch.
• Each item directory must contain:
• The bitstream files
• A contents manifest named contents
• A metadata file dublin_core.xml
• Optionally, other metadata files with names like
metadata_[schema].xml where [schema] is the schema’s
name.
Scott Phillips provides a fine guide at
http://www.scottphillips.com/2009/05/howto-dspace-batch-ingest/
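A quick check of the layout described above can save a failed import. The sketch below reports which of the two required files are missing from an item directory; the function name and the example directory name are hypothetical.

```python
from pathlib import Path

# Files every SAF item directory must contain, per the overview above.
REQUIRED_FILES = ("contents", "dublin_core.xml")


def missing_saf_files(item_dir):
    """Return the required SAF files that are absent from item_dir."""
    item_dir = Path(item_dir)
    return [name for name in REQUIRED_FILES if not (item_dir / name).is_file()]


def check_batch(batch_dir):
    """Map each item directory in a batch to its list of missing files."""
    return {
        child.name: missing_saf_files(child)
        for child in sorted(Path(batch_dir).iterdir())
        if child.is_dir()
    }
```

Calling check_batch on the top-level batch directory returns an empty list for every item that is ready to import.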
47. DSpace SAF (2/3) – Contents
Manifest
• The contents manifest, named contents, lists each bitstream
that will be in the item as well as its disposition:
• Bundle
• Permissions
• Primacy
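As an illustration, one manifest line per bitstream could be produced like this. The exact syntax is an assumption drawn from the usual SAF convention of a filename followed by tab-separated key:value options (bundle:, permissions:, primary:); check the DSpace documentation for your version before relying on it. The function name is hypothetical.

```python
def contents_line(filename, bundle=None, permissions=None, primary=False):
    """Format one line of a SAF contents manifest (assumed syntax)."""
    parts = [filename]
    if bundle:
        parts.append(f"bundle:{bundle}")
    if permissions:
        parts.append(f"permissions:{permissions}")
    if primary:
        parts.append("primary:true")
    # Options are separated from the filename and each other by tabs.
    return "\t".join(parts)


lines = [
    contents_line("thesis.pdf", bundle="ORIGINAL", primary=True),
    contents_line("license.txt", bundle="LICENSE"),
]
print("\n".join(lines))
```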
48. DSpace SAF (3/3) – Metadata
• The SAF uses a specific XML format for the encoding of Dublin
Core style metadata.
• dublin_core.xml
• metadata_[schema].xml where [schema] is another
metadata schema in your repository’s registry
• The containing element is dublin_core with a schema
attribute.
• The field elements are dcvalue with element and
qualifier attributes (the schema is declared on the containing element).
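The metadata file is simple enough to generate with a few lines of code. This sketch builds the XML with the standard library, placing the schema attribute on the containing dublin_core element and writing "none" for a missing qualifier; the function name and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(values, schema="dc"):
    """Return SAF metadata XML for (element, qualifier, text) triples."""
    root = ET.Element("dublin_core", {"schema": schema})
    for element, qualifier, text in values:
        node = ET.SubElement(
            root,
            "dcvalue",
            {"element": element, "qualifier": qualifier or "none"},
        )
        node.text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_metadata_xml([
    ("title", None, "A Sample Title"),
    ("contributor", "author", "Doe, Jane"),
])
print(xml_text)
```

Calling it with a different schema name produces the body of a metadata_[schema].xml file in the same way.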
49. Example imports…
• Provided are some rough code examples that will parse a CSV
metadata file (and associated content files) or a MARC XML
file (and associated content files).
• The code examples are in Java and best comprehended in a
nicely configured development environment, but we can work
with them using jEdit and the command line.
• We will conduct these imports into the repository and
consider the advantages and disadvantages of the approach.
50. An example import: CSV
• Create the import processor application in your
C:\Development\SAFCreator directory
• mvn clean package
• Run it with java -jar target\SAFCreator-0.0.1-SNAPSHOT.one-jar.jar
• You will be presented with a Java Swing interface where you can
specify a CSV metadata file, a directory for source files, a
directory for SAF output, and other details for the batch.
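The heart of any CSV import is the mapping from spreadsheet columns to metadata fields. The following is a rough sketch of that idea only, not a description of how the course's Java tool works: it assumes column headers of the form dc.element or dc.element.qualifier, and the sample data is invented.

```python
import csv
import io

# Invented sample data; the column naming scheme is an assumption.
SAMPLE_CSV = """filename,dc.title,dc.contributor.author,dc.type
thesis1.pdf,First Thesis,"Doe, Jane",Thesis
thesis2.pdf,Second Thesis,,Thesis
"""


def rows_to_dcvalues(csv_text):
    """Turn each CSV row into a list of (element, qualifier, text) triples."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = []
        for column, cell in row.items():
            # Skip non-metadata columns and empty cells.
            if column is None or not column.startswith("dc.") or not cell:
                continue
            parts = column.split(".")
            element = parts[1]
            qualifier = parts[2] if len(parts) > 2 else "none"
            values.append((element, qualifier, cell))
        items.append(values)
    return items


for item in rows_to_dcvalues(SAMPLE_CSV):
    print(item)
```

Note how the empty author cell in the second row is dropped silently, which is exactly the kind of quiet metadata loss worth watching for during imports.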
51. An example import: CSV
• Import the SAF as follows:
• c:\dspace\bin\dspace import -a -e
admin@admin.com -s c:\Development\SAF\test-output
-c 123456789/2 -m
c:\Development\SAF\test-output\map.map
52. An example import: MARC XML
• This example may be found in the import/marc directory
• Create the program with
• javac -sourcepath . *.java
• jar cfm xslimporter.jar manifest.mf *.class
• Run with
• java -jar xslimporter.jar
• To see a common import difficulty, attempt an import as we did for
the CSV example.
• This will result in some schema-related errors, a very common
problem when doing imports.
53. An example import: MARC XML
• Add the following to a new thesis metadata schema and
re-attempt the import.
• degree.name
• degree.level
• degree.discipline
• degree.department
54. Consider the Import Results
• Idiosyncrasies of certain field values are more apparent in
different syntactic contexts.
• Different metadata origins entail different complexities in the
processing.
• Importation into a digital repository is a crucial step in the life
of a digital resource, as it is a chance to refine metadata, after
which it can be easily transmitted via crosswalks.
• However, it is a time when metadata are at risk of loss for lack
of care.
55. Final Thoughts on Content
Transmission
• Along with preservation, one of the greatest services provided
by digital repositories
• Yet, like preservation, good transmission requires constant work
• Crosswalks must be maintained to standards as well as local
practices
• Our means of importing content are constantly improving but
face a moving target
• New collection types inevitably require new development work if
their ingestion is to be automated