Fedora Commons in the
CLARIN Infrastructure
Menzo Windhouwer
menzo.windhouwer@meertens.knaw.nl
Meertens Institute, TLA, CLARIN ERIC
Overview
1. An overview of Fedora Commons (3.8.1)
2. Current usage by CLARIN centres
3. TLA-FLAT: a CLARIN compatible repository solution based on Fedora
Commons and Islandora
Fedora Commons
• fedora-commons.org
• 300 registered installations
• 1997: started as a research project at Cornell University
• Implemented as a Java servlet
• 2009: joined the DSpace foundation (now DuraSpace)
• 2014: Fedora Commons 4 released
• More RDF-based
• Not backward compatible in functionality, e.g., the APIs
• Data migration utilities available
• 2015: last Fedora Commons 3 release (3.8.1)
• wiki.duraspace.org/display/FEDORA38/
• github.com/fcrepo3
• the focus of this talk
Fedora Commons main features
• Digital Objects
• Content Model Architecture (FOXML)
• Datastreams
• Relationships between Digital Objects (RDF)
• APIs (REST/SOAP)
• Access
• Management
• Security (XACML)
• Access control
• Policies
• Message queue
• OAI-PMH
• Replication & mirroring
• Versioning
• Checksums
Digital Objects - Model
“Fedora uses a "compound digital object" design
which aggregates one or more content items
into the same digital object. Content items can
be of any format and can either be stored locally
in the repository, or stored externally and just
referenced by the digital object. The Fedora
digital object model is simple and flexible so
that many different kinds of digital objects can
be created, yet the generic nature of the Fedora
digital object allows all objects to be managed in
a consistent manner in a Fedora repository.”
Digital Objects – Content Model Architecture
1. Data Object
• “Data objects are what we normally think
of when we imagine a repository storing
digital collections. Data objects can
represent such varied entities as images,
books, electronic texts, learning objects,
publications, datasets, and many other
entities.”
2. Content Model Object
• “[A]cts as a container for the Content
Model document which is a formal model
that characterizes a class of digital
objects.”
3. Service Definition Object
4. Service Deployment Object
Digital Objects - Datastreams
• “The content represented by a Datastream is treated as an opaque bit
stream; it is up to the user to determine how to interpret the content (i.e.
data or metadata).”
• Where does this bit stream live?
1. Internal XML Content
“the content is stored as XML in-line within the digital object XML file” (FOXML)
2. Managed Content
“the content is stored in the repository and the digital object XML maintains an internal
identifier that can be used to retrieve the content from storage”
3. Externally Referenced Content
“the content is stored outside the repository and the digital object XML maintains a URL that
can be dereferenced by the repository to retrieve the content from a remote location”
4. Redirect Referenced Content
“the content is stored outside the repository and the digital object XML maintains a URL that is
used to redirect the client when an access request is made”
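The four storage options above correspond to the CONTROL_GROUP attribute on a FOXML datastream (X, M, E and R in Fedora Commons 3). The following is a minimal sketch, assuming a heavily trimmed, hypothetical digital object (`demo:1`) and a helper name (`datastream_storage`) chosen only for illustration; it lists where each datastream's content lives.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"

# CONTROL_GROUP codes as used in Fedora Commons 3 FOXML
CONTROL_GROUPS = {
    "X": "Internal XML Content",
    "M": "Managed Content",
    "E": "Externally Referenced Content",
    "R": "Redirect Referenced Content",
}

# Hypothetical, trimmed digital object; not taken from a real repository.
SAMPLE_FOXML = f"""
<foxml:digitalObject PID="demo:1" xmlns:foxml="{FOXML_NS}">
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X"/>
  <foxml:datastream ID="OBJ" STATE="A" CONTROL_GROUP="M"/>
  <foxml:datastream ID="TN" STATE="A" CONTROL_GROUP="E"/>
</foxml:digitalObject>
"""


def datastream_storage(foxml_text):
    """Return {datastream ID: storage type} for each datastream in a FOXML document."""
    root = ET.fromstring(foxml_text.strip())
    result = {}
    for ds in root.findall(f"{{{FOXML_NS}}}datastream"):
        code = ds.get("CONTROL_GROUP")
        result[ds.get("ID")] = CONTROL_GROUPS.get(code, f"unknown ({code})")
    return result


storage = datastream_storage(SAMPLE_FOXML)
for ds_id, kind in storage.items():
    print(f"{ds_id}: {kind}")
```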
Digital Objects - Relations
• Relationships between Digital Objects
• Collections, compounds, cross references, …
• Using the Fedora relationship ontology
• Domain specific relationships
• Encoded in RDF
• RELS-EXT: relations from the DO to other DOs or external resources
• RELS-INT: relations from datastreams in the DO to other resources
Digital Objects - FOXML
<foxml:digitalObject PID="lat:1839_00_0000_0000_0016_7E07_7" xmlns:foxml="info:fedora/fedora-system:def/foxml#" …>
<foxml:objectProperties>
<foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="A"/>
<foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="deerhunt"/>
</foxml:objectProperties>
<foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X">
<foxml:datastreamVersion ID="DC.0" FORMAT_URI="http://www.openarchives.org/OAI/2.0/oai_dc/"
MIMETYPE="text/xml" LABEL="Dublin Core Record for this object">
<foxml:xmlContent>
<oai_dc:dc …>
<dc:title>deerhunt story</dc:title>
<dc:description xml:lang="eng">The text was recorded at Madison University in the 1960s. The text was recorded indoors.</dc:description>
...
</oai_dc:dc>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="CMD" STATE="A" CONTROL_GROUP="X">
<foxml:datastreamVersion ID="CMD.0" LABEL="CMD Record for this object" MIMETYPE="application/x-cmdi+xml" …>
<foxml:xmlContent>
<cmd:CMD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.1" …>...</cmd:CMD>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
…
<foxml:datastream ID="RELS-EXT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
<foxml:datastreamVersion ID="RELS-EXT.0" LABEL="RDF Statements about this object" MIMETYPE="text/xml">
<foxml:xmlContent>
<rdf:RDF xmlns:oai="http://www.openarchives.org/OAI/2.0/" xmlns:fedora="info:fedora/fedora-system:def/relations-external#" …>
<rdf:Description rdf:about="info:fedora/lat:1839_00_0000_0000_0016_7E07_7">
<fedora:isMemberOfCollection rdf:resource="info:fedora/lat:1839_00_0000_0000_0016_7E41_8"/>
<fedora-model:hasModel rdf:resource="info:fedora/islandora:compoundCModel"/>
<fedora-model:hasModel rdf:resource="info:fedora/islandora:sp_cmdiCModel"/>
<oai:itemID xmlns="http://www.openarchives.org/OAI/2.0/">oai:flat.example.com.:lat:1839_00_0000_0000_0016_7E07_7</oai:itemID>
</rdf:Description>
</rdf:RDF>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="TN" STATE="A" CONTROL_GROUP="E">
<foxml:datastreamVersion ID="TN.0" LABEL="icon.png" MIMETYPE="image/png">
<foxml:contentLocation TYPE="URL" REF="file:/app/flat/icons/folder.png"/>
</foxml:datastreamVersion>
</foxml:datastream>
</foxml:digitalObject>
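As an illustration of how the RELS-EXT statements in the example above can be read, here is a small sketch that extracts the collection membership and content models from the RDF/XML. The `fedora-model` prefix binding is elided in the slide; the sketch assumes it is bound to `info:fedora/fedora-system:def/model#`. The function name `relations` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL = "info:fedora/fedora-system:def/relations-external#"
MODEL = "info:fedora/fedora-system:def/model#"  # assumed binding of fedora-model

# Trimmed RELS-EXT payload using the PIDs from the example above.
RELS_EXT = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:fedora="{REL}" xmlns:fedora-model="{MODEL}">
  <rdf:Description rdf:about="info:fedora/lat:1839_00_0000_0000_0016_7E07_7">
    <fedora:isMemberOfCollection rdf:resource="info:fedora/lat:1839_00_0000_0000_0016_7E41_8"/>
    <fedora-model:hasModel rdf:resource="info:fedora/islandora:compoundCModel"/>
    <fedora-model:hasModel rdf:resource="info:fedora/islandora:sp_cmdiCModel"/>
  </rdf:Description>
</rdf:RDF>
"""


def relations(rdf_text):
    """Return (subject, predicate, object) triples for resource-valued statements."""
    root = ET.fromstring(rdf_text.strip())
    triples = []
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for stmt in desc:
            target = stmt.get(f"{{{RDF}}}resource")
            if target is not None:
                triples.append((subject, stmt.tag, target))
    return triples


triples = relations(RELS_EXT)
models = [o for _, p, o in triples if p == f"{{{MODEL}}}hasModel"]
collections = [o for _, p, o in triples if p == f"{{{REL}}}isMemberOfCollection"]
print(models)
print(collections)
```

Literal-valued statements such as oai:itemID are skipped here on purpose; only statements with an rdf:resource attribute are treated as relations.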
APIs (REST/SOAP)
• The ‘RESTful’ APIs provide easy HTTP URLs to access (API-A) objects and their datastreams:
1. https://www.meertens.knaw.nl/flat/objects/lat:10744_1b9e0d44_ef4d_496c_8939_6129b5ee5b49/datastreams/CMD/content?asOfDateTime=2017-01-27T11:30:52.732Z
2. https://www.meertens.knaw.nl/flat/objects/lat:10744_792194f7_d1fd_400c_ab2b_9b51f4fe3907/datastreams/OBJ/content?asOfDateTime=2017-01-27T11:31:01.207Z
Used as redirect for a handle
Notice the use of a timestamp to refer to a specific version of the datastream
• API-M provides methods to update objects and their datastreams
• Access to API-M can be limited using repository wide XACML policies
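The URL pattern in the two examples above can be assembled programmatically. This is a minimal sketch; the function name `datastream_content_url` and its parameters are hypothetical, and only the path layout and the asOfDateTime parameter are taken from the slide.

```python
from urllib.parse import quote, urlencode


def datastream_content_url(base_url, pid, ds_id, as_of=None):
    """Build an API-A URL for the content of one datastream.

    as_of is an optional timestamp selecting a specific datastream version.
    """
    path = (
        f"{base_url.rstrip('/')}/objects/{quote(pid, safe=':')}"
        f"/datastreams/{quote(ds_id)}/content"
    )
    if as_of is None:
        return path
    # Keep ':' unescaped so the result matches the form shown on the slide.
    return f"{path}?{urlencode({'asOfDateTime': as_of}, safe=':')}"


url = datastream_content_url(
    "https://www.meertens.knaw.nl/flat",
    "lat:10744_1b9e0d44_ef4d_496c_8939_6129b5ee5b49",
    "CMD",
    as_of="2017-01-27T11:30:52.732Z",
)
print(url)
```

Leaving out `as_of` yields the URL of the current version, which is the form to use when the latest content is wanted rather than a frozen one.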
Security (XACML)
• eXtensible Access Control Markup Language (XACML) is an OASIS standard for encoding access control policies
“Each XACML policy defines: (1) a "target" describes what the policy applies to (by referring to attributes of
users, operations, objects, datastreams, dates, and more), and (2) one or more "rules" to permit or deny access.”
→ A rather cryptic and verbose language
• Repository wide policies
• Access to API-M (methods) by certain users/roles from certain IP addresses
• …
• Object specific policies
• Which users can access which datastreams
• …
• User profiles
• Plug any auth filter into the application server
• Hardcoded users
• …
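To make the "target plus rules" structure of a policy concrete, here is a deliberately simplified sketch. It is not XACML syntax and not how Fedora Commons evaluates policies; the class names, the request attributes (`api`, `role`, `ip`) and the first-matching-rule strategy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Request = Dict[str, str]


@dataclass
class Rule:
    effect: str                          # "Permit" or "Deny"
    condition: Callable[[Request], bool]


@dataclass
class Policy:
    target: Callable[[Request], bool]    # what the policy applies to
    rules: List[Rule] = field(default_factory=list)

    def evaluate(self, request):
        """Return the effect of the first matching rule, if the target applies."""
        if not self.target(request):
            return "NotApplicable"
        for rule in self.rules:
            if rule.condition(request):
                return rule.effect
        return "NotApplicable"


# Hypothetical repository-wide policy: API-M only for administrators on localhost.
api_m_policy = Policy(
    target=lambda r: r.get("api") == "API-M",
    rules=[
        Rule("Permit", lambda r: r.get("role") == "administrator" and r.get("ip") == "127.0.0.1"),
        Rule("Deny", lambda r: True),
    ],
)

decision_admin = api_m_policy.evaluate({"api": "API-M", "role": "administrator", "ip": "127.0.0.1"})
decision_guest = api_m_policy.evaluate({"api": "API-M", "role": "guest", "ip": "10.0.0.5"})
decision_read = api_m_policy.evaluate({"api": "API-A", "role": "guest", "ip": "10.0.0.5"})
print(decision_admin, decision_guest, decision_read)
```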
Fedora Commons as a basis - extensions
• Faceted search: gsearch (Solr)
• Listens to the FC message queue
• Runs an XSLT to create a SOLR document
• OAI-PMH: Proai
• Occasionally queries FC's resource index
• Can deliver other metadata datastreams than the default Dublin Core
• …
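gsearch performs the FOXML-to-Solr conversion with an XSLT. As a rough illustration of what such a transformation produces, the sketch below does a comparable conversion in Python for a Dublin Core record, emitting a Solr XML update message. The field names (`PID`, `dc.title`, ...) and the function name are assumptions, not the names gsearch or FLAT actually use.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Trimmed Dublin Core record based on the FOXML example in this deck.
DC_RECORD = f"""
<oai_dc:dc xmlns:oai_dc="{OAI_DC_NS}" xmlns:dc="{DC_NS}">
  <dc:title>deerhunt story</dc:title>
  <dc:identifier>lat:1839_00_0000_0000_0016_7E07_7</dc:identifier>
</oai_dc:dc>
"""


def dc_to_solr_add(pid, dc_text):
    """Turn a DC record into a Solr <add><doc> update message (as a string)."""
    dc = ET.fromstring(dc_text.strip())
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="PID").text = pid
    for element in dc:
        local_name = element.tag.split("}", 1)[1]  # drop the namespace part
        ET.SubElement(doc, "field", name=f"dc.{local_name}").text = (element.text or "").strip()
    return ET.tostring(add, encoding="unicode")


solr_xml = dc_to_solr_add("lat:1839_00_0000_0000_0016_7E07_7", DC_RECORD)
print(solr_xml)
```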
Fedora Commons as a basis - frontends
• Islandora
• Drupal based
• Large set of modules, relatively easy to extend
• Still based on Fedora Commons 3
• Ongoing experiments/development, e.g., CLAW for Islandora
• Hydra
• Ruby on Rails based
• More hardcoded workflow and data models
• …
• Portland Common Data Model
• Common data model (content models), so migration between front-ends/frameworks becomes easier
Repository solutions in use by CLARIN centres
[Bar chart: number of B centres (axis 0 to 9) per repository solution: Fedora Commons, DSpace, custom, LAT, GIT, eSciDoc. Source: repository info on 20 B centres in the Centre registry.]
Notes:
• Meertens: custom -> Fedora Commons
• MPI: LAT -> Fedora Commons
• eSciDoc: Fedora Commons under the hood
• Various C centres also run a Fedora Commons (based) repository
How happy are these centres with Fedora Commons?
• Sent out a questionnaire to 9 centres: 6 responses :-)
Do you (still) consider Fedora Commons a sustainable repository solution for your center?
[Pie chart; answer options: yes, no]
Would you advise new CLARIN centers to use Fedora Commons as (the basis for) their CLARIN-compatible repository solution?
[Pie chart; answer options: yes, no, maybe]
Respondent comments:
• “If you are member of CLARIN-D then you probably might want to choose Fedora, but if you're in another country you might want to take a closer look at other solutions (DSpace or TLA software).” 👍🏻
• “Depends partly on available technical expertise”
Fedora Commons versions
Which version of Fedora Commons does your centre use in production?
[Bar chart: number of centres (axis 0 to 2.5) per version: 3.6, 3.6.2, 3.7.1, 3.8.1, 4]
Do you plan a move to Fedora Commons 4?
[Pie chart; answer options: yes, no, maybe]
Respondent comments:
• “benefit from Linked Open Data approach; within next 2 years”
• “We are migrating to version 4 right now. We also made major enhancements to our front-end. We are planning to go into production with it within the next months.”
Size of the centre’s repositories
# Digital Objects:
ca. 150
2,500
3,038
10,000
33,000
# bytes:
ca. 125M (metadata only)
5G
16G
ca. 500G
Both MPI and Meertens currently have over 100,000 CMD records in the VLO, which describe resources that take up several TB (and up to 1M DOs). Experiments did reveal problems in the FC area, but they can be repaired :-)
Community support
How helpful was/is the documentation available within the Fedora Commons community?
[Chart; answer options: not at all, somewhat, ok, very much]
How helpful was/is the support by the Fedora Commons community?
[Chart; answer options: not at all, somewhat, ok, very much]
How helpful was/is the documentation on Fedora Commons by the CLARIN community?
[Chart; answer options: not at all, somewhat, ok, very much]
How helpful was/is the support for Fedora Commons within the CLARIN community?
[Chart; answer options: not at all, somewhat, ok, very much]
Respondent comments:
• “Unfortunately there seem to be no more Fedora User Groups in Europe...”
• “Being one of the first centers to use Fedora Commons, we did use the documentation available within the FC community. At that time there was not much CLARIN documentation.”
• “This blog entry was very useful for us: http://asingh.com.np/blog/fedora-commons-installation-and-configuration-guide/”
• “an option for the case one has never made use of the support should have been included”
Frontends
Do you use a front-end, e.g., Islandora, Hydra or your own, next to Fedora Commons?
[Bar chart: number of centres (axis 0 to 3.5) per front-end: none, Islandora, custom]
Respondent comments:
• “own front-end, based on Django (EulFedora) and MySQL”
• “We developed our own, called Erdo”
• “The built-in user interface is not adequate. You will need to replace it with something better.”
Additional advice
• “Let Apache httpd (or Apache Tomcat) take care for most of the
configuration (access control) and configure Fedora Commons to be
"open". Take care what to store in Fedora and what not (it can be very
unhandy to store too many data streams inside Fedora).”
• “I consider the two offered RDF query languages (SPARQL, ITQL) by
Fedora as insufficient, as both miss important features, e.g ITQL can't
use regexp search and can't sort strings numerically and SPARQL can't
use COUNT operator and also cannot sort strings numerically (at least
in version 3.6.2).”
• “For CMDI metadata, you also need the Proai OAI provider. Use the
version customised for Fedora Commons.”
FLAT’s predecessors
• The Language Archive (TLA) at the MPI for Psycholinguistics
• long history in digital archiving, especially resources on endangered languages
• home-built LAT (Language Archiving Technology)
• 2014 – now: preparing to switch to a stack that is largely off-the-shelf software: Fedora Commons + Islandora
• choice made after an INNET repository workshop and several pilots
• initial version based on scripts kindly provided by IDS
• started as EasyLAT, now known as (TLA-)FLAT (Fedora Language Archiving Technology)
• doing a lot of cleanup/curation along the way from LAT to FLAT
• The Meertens Institute
• collecting valuable (Dutch) (physical) humanities resources for over a century
• digitization projects
• digital born resources
• KNAW participates in TLA and the Meertens Institute teamed up with the MPI to
modernize its setup and develop FLAT
TLA-FLAT base line
• Meet the technical CLARIN B centre requirements
• Meet the technical Data Seal of Approval (DSA) requirements
• Meet organization specific requirements
• Meet at least the CLARIN B centre and DSA requirements, as far as possible, in the Fedora Commons backend
• frontends (and frontend technologies) come and go quickly
• How far can we get using available components, configuration and a
limited level of tailor made software?
• Mainly to add support for CMDI
• Start with Fedora Commons 3.8.x and Islandora 7.x-1.x, move along with
the Islandora community to Fedora Commons 4
Islandora 7.x-1.x
• islandora.ca
• An open-source software framework designed to help institutions and organizations and their audiences collaboratively
manage, and discover digital assets using a best-practices framework.
• Islandora was originally developed by the University of Prince Edward Island's Robertson Library, but is now implemented and
contributed to by an ever-growing international community.
• Built on a base of Drupal (7.x), Fedora (3.x), and Solr, Islandora releases solution packs which empower users to work with
data types (such as image, video, and pdf) and knowledge domains (such as Chemistry and the Digital Humanities).
Solution packs also often provide integration with additional viewers, editors, and data processing applications.
• wiki.duraspace.org/display/ISLANDORA/Islandora
• github.com/Islandora
• github.com/Islandora-Labs/islandora_awesome
• github.com/discoverygarden
• Digital Objects are not Drupal nodes, the Islandora modules interact with Fedora Commons via an intermediate (PHP)
layer, Tuque
• In CLAW Digital Objects are Drupal nodes synchronized using Apache Camel
CLARIN B centre requirements
• [CLARIN-B-2] Centres need to adhere to the security guidelines, i.e. the
servers need to have accepted certificates.
• [CLARIN-B-3] Centres need to join the national identity federation where
available and join the CLARIN service provider federation to support single
identity and single sign-on operation based on SAML 2.0 and trust
declarations.
• [CLARIN-B-5] Centres need to offer component based metadata (CMDI)
that make use of elements from accepted registries such as ISOcat in
accordance with the CLARIN agreements, i.e. metadata needs to be
harvestable via OAI-PMH.
• [CLARIN-B-6] Centres need to associate PIDs records according to the
CLARIN agreements with their objects and add them to the metadata
record.
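Requirement CLARIN-B-5 implies that a harvester can request CMDI records through the standard OAI-PMH ListRecords verb. A minimal sketch of such a request URL follows; the endpoint `https://repository.example.org/oaiprovider/` is a placeholder and the metadata prefix `cmdi` is an assumption about how a centre names its CMDI format.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; substitute the centre's real OAI-PMH base URL.
harvest_url = oai_list_records_url("https://repository.example.org/oaiprovider/", "cmdi")
print(harvest_url)
```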
DSA requirements
• [DSA-10] The data repository enables the users to discover and use
the data and refer to them in a persistent way.
• [DSA-11] The data repository ensures the integrity of the digital
objects and the metadata.
• [DSA-12] The data repository ensures the authenticity of the digital
objects and the metadata.
• [DSA-13] The technical infrastructure explicitly supports the tasks and
functions described in internationally accepted archival standards like
OAIS.
Meertens Institute & TLA requirements
• [Home-1] The repository should support arbitrary deep collection hierarchies.
• [Home-2] The repository should support handles as persistent identifiers.
• [Home-3] The repository should work with arbitrary CMDI profiles.
• [Home-4] The repository should provide resource level access control.
• [Home-5] The repository should allow collection management to review submissions before the
resources are actually ingested.
• [Home-6] The repository should allow system management to determine the location of
resources on persistent storage, e.g., from fast access times to secure tape drives.
• [Home-7] The repository should allow the storage of arbitrary relationships between data sets.
• [Home-8] The repository should provide entry points for interaction with Virtual Research Environments.
• [Home-9] The repository should allow for collection management oriented metadata, which
might not be public.
FLAT’s place at the Meertens Institute
[Architecture diagram. Components: Drupal, Islandora, Fedora Commons, Deposition Service (DoorKeeper), Workspace (ownCloud), Virtual Research Environment, Persistent storage, SOLR (MTAS), Backups (EUDAT), Collection Management, Infrastructures (CLARIN). Packages and interfaces: SIP, AIP, SWORD, CMDI SP, OAI-PMH.]
FLAT’s place at the MPI/TLA
[Architecture diagram. Components: Drupal, Islandora, Fedora Commons, Deposition Service (DoorKeeper), Workspace (ownCloud), Deposition UI, Persistent storage, Backups (DANS), Infrastructures (CLARIN). Packages and interfaces: SIP, AIP, SWORD, CMDI SP, OAI-PMH.]
FLAT modules
• Core
• Fedora Commons and Islandora setup
• CMDI Solution Pack
• CMD to FOXML conversion
• Proai setup
• Indexing (SOLR)
• gsearch-based solution for CMDI
• Meertens’ CMDI indexer
• SWORD 2.0
• Reuses a deposit via SWORD approach and implementation by DANS
• DoorKeeper
• Deposition UI
• IMDI conversion
• Shibboleth
Shibboleth setup is very server specific, so there is a module that illustrates the Drupal setup and can be combined with a test IdP.
CMDI Solution Pack
• Registers a metadata renderer in Islandora
• Triggers when a Digital Object uses the CMDI content model and
renders the CMD datastream
• The default render XSLT can be overwritten by profile specific XSLTs
• Not FLAT specific, i.e., could be reused outside of FLAT
Archival Information Package (AIP)
[Diagram: an example AIP as a graph of Digital Objects. Nodes: Collection, Collection + CMDI, Collection + Compound + CMDI, Compound + CMDI, Image, Video. Datastreams shown on the nodes: CMD, OBJ, RELS-EXT, DC. Edges: isMemberOfCollection, isConstituentOf, contentLocation.]
FLAT reuses a lot of Islandora’s content models, so rendering is easy. And they can easily be taken along without Islandora.
FLAT’s DoorKeeper
• A configurable chain of actions that
• Validate the CMDI, also according to centre specific requirements
• Check the validity of resources against preferred formats (FITS)
• Assess metadata quality
• Offer the SIP for evaluation to collection management
• Move new resources from a temporary workspace into persistent locations
• Expand WebACL to XACML
• Version management
• Assign and create handles (EPIC)
• Interact with Fedora Commons' API-M
• Trigger indexing
• Create backup bags (for DANS or EUDAT)
• Creates user- and developer-oriented logs
• Interaction via a REST API or the command line
• Uses dynamic class loading, i.e., easily extensible with centre specific actions
• Not too FLAT specific, e.g., usable by other repository setups, or Fedora could be replaced by DSpace :-)
Actions are, in general, lean and mean, so it's relatively easy to implement one in Java.
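The "configurable chain of actions" idea can be sketched in a few lines. The real DoorKeeper is written in Java and loads action classes dynamically; the Python below only illustrates the pattern, with a simple name registry standing in for class loading. All class names, the `perform` signature and the handle value are hypothetical.

```python
class Action:
    """One step in the chain; returns False to stop processing the SIP."""

    def perform(self, sip, log):
        raise NotImplementedError


class ValidateMetadata(Action):
    def perform(self, sip, log):
        ok = bool(sip.get("record"))
        log.append("validate: ok" if ok else "validate: missing CMD record")
        return ok


class AssignHandle(Action):
    def perform(self, sip, log):
        sip["handle"] = "hdl:EXAMPLE/0001"  # placeholder, not a real handle
        log.append("handle: assigned")
        return True


# Stand-in for dynamic class loading: look actions up by configured name.
REGISTRY = {cls.__name__: cls for cls in (ValidateMetadata, AssignHandle)}


def run_chain(action_names, sip):
    """Run the configured actions in order; stop at the first failure."""
    log = []
    for name in action_names:
        action = REGISTRY[name]()
        if not action.perform(sip, log):
            log.append(f"stopped at {name}")
            return False, log
    return True, log


CHAIN = ["ValidateMetadata", "AssignHandle"]  # would normally come from configuration
ok, log = run_chain(CHAIN, {"record": "record.cmdi"})
failed_ok, failed_log = run_chain(CHAIN, {})
print(ok, log)
print(failed_ok, failed_log)
```

Because the chain is just a list of names, a centre can reorder, drop or add actions in configuration without touching the actions themselves.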
Submission Information Package (SIP)
• A CMD record referring with
• relative paths to resources within the package
• absolute paths to resources already on the server
• For example, in the user’s ownCloud data directory
• (block access to system files!)
• Additional files
• Access control
• License
• …
• When using the SWORD 2.0 interface these are put in a bag and zipped for upload
• The SWORD interface allows upload in parts
+-test-sip/
  +-bag-info.txt
  +-bagit.txt
  +-data/
  | +-metadata/
  | | +-policy.n3
  | | +-record.cmdi
  | +-resources/
  |   +-my comic.pdf
  |   +-secret.txt
  +-manifest-md5.txt
  +-tagmanifest-md5.txt
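The manifest-md5.txt file in the bag above lists one checksum per payload file under data/. The sketch below shows how such manifest lines could be produced; the function name is hypothetical, the file contents are dummies, and reading whole files into memory is a simplification that would not suit large resources.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_manifest(bag_dir):
    """Return manifest lines ('<md5> <path relative to the bag>') for all payload files."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{digest} {path.relative_to(bag).as_posix()}")
    return lines


# Build a throwaway bag that mimics part of the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "test-sip"
    (bag / "data" / "metadata").mkdir(parents=True)
    (bag / "data" / "resources").mkdir(parents=True)
    (bag / "data" / "metadata" / "record.cmdi").write_text("<CMD/>")
    (bag / "data" / "resources" / "secret.txt").write_text("hello")
    manifest = md5_manifest(bag)
    (bag / "manifest-md5.txt").write_text("\n".join(manifest) + "\n")

print("\n".join(manifest))
```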
Security
• To hide the intricacies of XACML and design choices for content
models we use WebACL to specify the access rules for a SIP
@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# make a specific resource (identified by the ID of the ResourceProxy) in the SIP accessible to a specific user
[acl:accessTo <sip#h1>; acl:mode acl:Read; acl:agent <#other1>].
# a colleague
<#other1> a foaf:Person ;
foaf:account [foaf:accountServiceHomepage <#flat>; foaf:accountName "sarah@meertens.knaw.nl"].
# give the owner read and write access
[acl:accessTo <sip>; acl:mode acl:Read, acl:Write; acl:agent <#owner>].
# the owner
<#owner> a foaf:Person ;
foaf:account [foaf:accountServiceHomepage <#flat>; foaf:accountName "bob@meertens.knaw.nl"].
shortcuts
Shibboleth
EPPNs
CMDI indexing for faceted search
1. gsearch for CMDI
• Based on an XSLT that processes the FOXML
• FLAT generates an XSLT for the (internal) CMD datastream
• Based on the profiles in your CMD records
• And a VLO-like mapping
• Facet = VLO facet
• Facet = concept
• Facet = hard coded XPath
• Only the configured facets will be available
• Can also be used for the required CMD to DC mapping
• Allows you to run FLAT for your CMD records out of the box
2. Meertens CMD indexer
• Analyzes the profiles in your CMD records
• Creates facets for all semantic paths it finds
• Facet names based on concept links (plus context)
• At runtime switch between facets for querying and rendering
Includes indexing of collection and compound relationships. Islandora can use SOLR for this instead of the resource index (by default Mulgara), which is needed in case of large collections/compounds. Replacing Mulgara by another triple store, e.g., Blazegraph, is even better, but requires all components to use SPARQL instead of ITQL.
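The "Facet = hard coded XPath" option can be pictured as a table that maps facet names to paths in the CMD record. The sketch below is an illustration only: the record is a namespace-free toy (real CMD records are namespaced and profile specific), and the facet names, paths and function name are assumptions rather than FLAT's actual configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical facet configuration: facet name -> path inside the record.
FACET_PATHS = {
    "name": "./Components/Session/Name",
    "language": "./Components/Session/Language/Name",
}

# Toy record without namespaces, loosely modelled on a session description.
RECORD = """
<CMD>
  <Components>
    <Session>
      <Name>deerhunt</Name>
      <Language><Name>English</Name></Language>
      <Language><Name>Dutch</Name></Language>
    </Session>
  </Components>
</CMD>
"""


def extract_facets(record_text, facet_paths):
    """Return {facet: [values]} for every configured facet found in the record."""
    root = ET.fromstring(record_text.strip())
    facets = {}
    for facet, path in facet_paths.items():
        values = [el.text.strip() for el in root.findall(path) if el.text and el.text.strip()]
        if values:
            facets[facet] = values
    return facets


facets = extract_facets(RECORD, FACET_PATHS)
print(facets)
```

Only configured facets appear in the result, which mirrors the gsearch-based option; the Meertens indexer instead derives facets from all semantic paths it finds.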
Deposition UI
• Drupal/Islandora module
• Create a project
• Upload a CMD record
• Or create a new one using a form
• Upload resources
• Via a project specific ownCloud data directory
• Dropbox-like functionality
• possibility to link with other providers (Dropbox, Google Drive, …)
• no need to worry about uploading ‘big’ files
• Freeze a project
• Validate the SIP using the DoorKeeper (async)
• Deposit a valid project
• Validate and deposit the SIP using the DoorKeeper (async)
New vs legacy data
• New data goes via the DoorKeeper, so it's checked against the centre's policies!
• Legacy (meta)data can be bulk loaded into Fedora Commons:
• Convert IMDI to CMDI (optional)
• Create FOXML for CMD records and resources
• ResourceProxies should contain the local paths to resources, e.g., via @lat:localURI
• Bulk load into Fedora Commons
• Index for faceted search
• Update handles
• EPICify (github.com/meertensinstituut/EPICify)
Scripts available, but need to be generalized.
Branding
• Drupal has extensive facilities for styling and templating
• Drupal has many modules and blocks for additional functionality
• Islandora as well, and also offers solution packs
• During FOXML creation resource specific content models can be used
• Take care, after bulk import or via a DoorKeeper action, that needed
derivatives are created
• Enable solution pack specific viewers
• Some experiments have been done
• FLAT comes with a basic style, but the MPI/TLA and Meertens
instances look very different
Where are we?
• Set of Docker images that extend each other to build up a complete
solution for a:
• Read only interface for bulk loaded existing (meta)data (master)
• Upload of new data via the DoorKeeper (develop)
• Update metadata resource proxies in the CMDI collection hierarchy
• User audit trails and checksums for big files
• Updating existing data via the DoorKeeper
• Versioning
• Ongoing cleanup and enrichment of (legacy) metadata and resources,
e.g., controlled vocabularies, license information
In production at the Meertens Institute (www.meertens.knaw.nl/flat), and we are continuously moving cleaned (meta)data from the old setup to FLAT. CLARIN B certification based on FLAT has started.
Being connected to the Meertens Institute's questionnaire system at the moment.
A containerization platform that allows easy development, testing and deployment.
FLAT is moving
• github.com/TheLanguageArchive/FLAT
• Its birthplace, but FLAT is moving to
• github.com/TLA-FLAT
• Code can be more clearly split over multiple repositories
• DoorKeeper
• Bundles of actions
• Servlet wrapper
• CMDI Solution Pack
• …
• Docker setups
• finer granularity
• Place for cooperation on
• code
• configuration
• actions
• knowledge sharing
• Q&A, issues
A Dockerfile precisely describes what software to install and how to configure it to get a running system.
Fedora Commons, Islandora and Drupal documentation is sometimes hard to find/read and the full stack has many layers and corners. We can share our experience CLARIN-wide.
Let’s visit FLAT!
Conclusions
• Fedora Commons (3.8.1) provides much of the basic functionality needed by a CLARIN B centre
• Fedora Commons has a proven record of being a stable and
satisfactory repository solution for many existing CLARIN centres
• Transition from version 3 to 4 is starting to happen
• TLA-FLAT is a modular, CLARIN-compliant, Fedora Commons-based solution that is easy to step into, and a platform for sharing knowledge on running a Fedora Commons repository and its context
Thanks!
Questions?
now or later :-)
menzo.windhouwer@meertens.knaw.nl
Please visit
github.com/TheLanguageArchive/FLAT
github.com/TLA-FLAT
TLA-FLAT team
MI: Marc Kemps-Snijders, Menzo Windhouwer, Rob Zeeman, Bas van der Veen
MPI: André Moreira, Daniel von Rhein, Paul Trilsbeek, Guilherme Silva

Semantic Mapping in CLARIN Component Metadata.
 
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
 
A CMD Core Model for CLARIN Web Services
A CMD Core Model for CLARIN Web ServicesA CMD Core Model for CLARIN Web Services
A CMD Core Model for CLARIN Web Services
 
LDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data CategoriesLDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data Categories
 
What do cats have to do with explicit semantics?
What do cats have to do with explicit semantics?What do cats have to do with explicit semantics?
What do cats have to do with explicit semantics?
 
ISOcat to LMF to TEI
ISOcat to LMF to TEIISOcat to LMF to TEI
ISOcat to LMF to TEI
 
On the way to a Relation Registry for ISOcat data categories
On the way to a Relation Registry for ISOcat data categoriesOn the way to a Relation Registry for ISOcat data categories
On the way to a Relation Registry for ISOcat data categories
 
The ISO-DCR
The ISO-DCRThe ISO-DCR
The ISO-DCR
 
Use of ISOcat within CMDI
Use of ISOcat within CMDIUse of ISOcat within CMDI
Use of ISOcat within CMDI
 
ISOcat: a short introduction
ISOcat: a short introductionISOcat: a short introduction
ISOcat: a short introduction
 
Sustainable operability: Keeping complex linguistic resources alive.
Sustainable operability: Keeping complex linguistic resources alive.Sustainable operability: Keeping complex linguistic resources alive.
Sustainable operability: Keeping complex linguistic resources alive.
 

Recently uploaded

Recently uploaded (20)

Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Transforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UXTransforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UX
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
The architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdfThe architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdf
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 

Fedora Commons in the CLARIN Infrastructure

  • 1. Fedora Commons in the CLARIN Infrastructure Menzo Windhouwer menzo.windhouwer@meertens.knaw.nl Meertens Institute, TLA, CLARIN ERIC
  • 2. Overview 1. An overview of Fedora Commons (3.8.1) 2. Current usage by CLARIN centres 3. TLA-FLAT: a CLARIN compatible repository solution based on Fedora Commons and Islandora
  • 4. Fedora Commons • fedora-commons.org • 300 registered installations • 1997: started as a research project at Cornell University • Implemented as a Java servlet • 2009: joined the DSpace foundation (now DuraSpace) • 2014: Fedora Commons 4 released • More RDF-based • Not backward compatible in terms of functionality, e.g., APIs • Data migration utilities available • 2015: last Fedora Commons 3 release (3.8.1) • wiki.duraspace.org/display/FEDORA38/ • github.com/fcrepo3 • focus
  • 5. Fedora Commons main features • Digital Objects • Content Model Architecture (FOXML) • Datastreams • Relationships between Digital Objects (RDF) • APIs (REST/SOAP) • Access • Management • Security (XACML) • Access control • Policies • Message queue • OAI-PMH • Replication & mirroring • Versioning • Checksums
  • 7. Digital Objects - Model “Fedora uses a "compound digital object" design which aggregates one or more content items into the same digital object. Content items can be of any format and can either be stored locally in the repository, or stored externally and just referenced by the digital object. The Fedora digital object model is simple and flexible so that many different kinds of digital objects can be created, yet the generic nature of the Fedora digital object allows all objects to be managed in a consistent manner in a Fedora repository.”
  • 8. Digital Objects – Content Model Architecture 1. Data Object • “Data objects are what we normally think of when we imagine a repository storing digital collections. Data objects can represent such varied entities as images, books, electronic texts, learning objects, publications, datasets, and many other entities.” 2. Content Model Object • “[A]cts as a container for the Content Model document which is a formal model that characterizes a class of digital objects.” 3. Service Definition Object 4. Service Deployment Object
  • 9. Digital Objects - Datastreams • “The content represented by a Datastream is treated as an opaque bit stream; it is up to the user to determine how to interpret the content (i.e. data or metadata).” • Where does this bit stream live? 1. Internal XML Content “the content is stored as XML in-line within the digital object XML file” (FOXML) 2. Managed Content “the content is stored in the repository and the digital object XML maintains an internal identifier that can be used to retrieve the content from storage” 3. Externally Referenced Content “the content is stored outside the repository and the digital object XML maintains a URL that can be dereferenced by the repository to retrieve the content from a remote location” 4. Redirect Referenced Content “the content is stored outside the repository and the digital object XML maintains a URL that is used to redirect the client when an access request is made”
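The four storage options above correspond to the single-letter `CONTROL_GROUP` attribute seen in the FOXML slides (`X` for inline XML, `M` for managed, `E` for externally referenced, `R` for redirect). A minimal Python sketch of that mapping follows; the enum and helper names are illustrative assumptions, not part of any Fedora Commons client library.

```python
from enum import Enum


class ControlGroup(Enum):
    """FOXML CONTROL_GROUP codes for the four datastream storage options."""
    INLINE_XML = "X"   # XML stored in-line in the FOXML file
    MANAGED = "M"      # content stored in the repository, referenced by an internal id
    EXTERNAL = "E"     # content outside the repository, dereferenced by the repository
    REDIRECT = "R"     # content outside the repository, client is redirected to it


def stored_in_repository(group: ControlGroup) -> bool:
    """Return True when the bit stream itself lives inside the repository."""
    return group in (ControlGroup.INLINE_XML, ControlGroup.MANAGED)


if __name__ == "__main__":
    for code in "XMER":
        group = ControlGroup(code)
        print(code, group.name, stored_in_repository(group))
```

The distinction matters for backups and migration: only `X` and `M` datastreams travel with the repository's own storage.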
  • 10. Digital Objects - Relations • Relationships between Digital Objects • Collections, compounds, cross references, … • Using the Fedora relationship ontology • Domain specific relationships • Encoded in RDF • RELS-EXT: relations from the DO to other DOs or external resources • RELS-INT: relations from datastreams in the DO to other resources
  • 11. Digital Objects - FOXML
      <foxml:digitalObject PID="lat:1839_00_0000_0000_0016_7E07_7" xmlns:foxml="info:fedora/fedora-system:def/foxml#" …>
        <foxml:objectProperties>
          <foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="A"/>
          <foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="deerhunt"/>
        </foxml:objectProperties>
        <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X">
          <foxml:datastreamVersion ID="DC.0" FORMAT_URI="http://www.openarchives.org/OAI/2.0/oai_dc/" MIMETYPE="text/xml" LABEL="Dublin Core Record for this object">
            <foxml:xmlContent>
              <oai_dc:dc …>
                <dc:title>deerhunt story</dc:title>
                <dc:description xml:lang="eng">The text was recorded at Madison University in the 1960s. The text was recorded indoors.</dc:description>
                ...
              </oai_dc:dc>
            </foxml:xmlContent>
          </foxml:datastreamVersion>
        </foxml:datastream>
        <foxml:datastream ID="CMD" STATE="A" CONTROL_GROUP="X">
          <foxml:datastreamVersion ID="CMD.0" LABEL="CMD Record for this object" MIMETYPE="application/x-cmdi+xml" …>
            <foxml:xmlContent>
              <cmd:CMD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.1" …>...</cmd:CMD>
            </foxml:xmlContent>
          </foxml:datastreamVersion>
        </foxml:datastream>
        …
  • 12. Digital Objects - FOXML
        …
        <foxml:datastream ID="RELS-EXT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
          <foxml:datastreamVersion ID="RELS-EXT.0" LABEL="RDF Statements about this object" MIMETYPE="text/xml">
            <foxml:xmlContent>
              <rdf:RDF xmlns:oai="http://www.openarchives.org/OAI/2.0/" xmlns:fedora="info:fedora/fedora-system:def/relations-external#" …>
                <rdf:Description rdf:about="info:fedora/lat:1839_00_0000_0000_0016_7E07_7">
                  <fedora:isMemberOfCollection rdf:resource="info:fedora/lat:1839_00_0000_0000_0016_7E41_8"/>
                  <fedora-model:hasModel rdf:resource="info:fedora/islandora:compoundCModel"/>
                  <fedora-model:hasModel rdf:resource="info:fedora/islandora:sp_cmdiCModel"/>
                  <oai:itemID xmlns="http://www.openarchives.org/OAI/2.0/">oai:flat.example.com.:lat:1839_00_0000_0000_0016_7E07_7</oai:itemID>
                </rdf:Description>
              </rdf:RDF>
            </foxml:xmlContent>
          </foxml:datastreamVersion>
        </foxml:datastream>
        <foxml:datastream ID="TN" STATE="A" CONTROL_GROUP="E">
          <foxml:datastreamVersion ID="TN.0" LABEL="icon.png" MIMETYPE="image/png">
            <foxml:contentLocation TYPE="URL" REF="file:/app/flat/icons/folder.png"/>
          </foxml:datastreamVersion>
        </foxml:datastream>
      </foxml:digitalObject>
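To make the FOXML structure concrete, here is a small Python sketch that reads an abridged version of the example object with the standard library and lists its datastreams and its RELS-EXT statements. The sample document is shortened from the slides, the `rdf` and `fedora-model` namespace declarations that the slides elide are filled in with their standard URIs, and the function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Abridged FOXML, shortened from the example object on the slides.
FOXML = """<foxml:digitalObject PID="lat:1839_00_0000_0000_0016_7E07_7"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:datastream ID="RELS-EXT" STATE="A" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="RELS-EXT.0" MIMETYPE="text/xml">
      <foxml:xmlContent>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
                 xmlns:fedora-model="info:fedora/fedora-system:def/model#">
          <rdf:Description rdf:about="info:fedora/lat:1839_00_0000_0000_0016_7E07_7">
            <fedora:isMemberOfCollection rdf:resource="info:fedora/lat:1839_00_0000_0000_0016_7E41_8"/>
            <fedora-model:hasModel rdf:resource="info:fedora/islandora:compoundCModel"/>
          </rdf:Description>
        </rdf:RDF>
      </foxml:xmlContent>
    </foxml:datastreamVersion>
  </foxml:datastream>
  <foxml:datastream ID="TN" STATE="A" CONTROL_GROUP="E">
    <foxml:datastreamVersion ID="TN.0" LABEL="icon.png" MIMETYPE="image/png">
      <foxml:contentLocation TYPE="URL" REF="file:/app/flat/icons/folder.png"/>
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>"""

NS = {
    "foxml": "info:fedora/fedora-system:def/foxml#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def list_datastreams(foxml_text):
    """Map each datastream ID to its CONTROL_GROUP code."""
    root = ET.fromstring(foxml_text)
    return {ds.get("ID"): ds.get("CONTROL_GROUP")
            for ds in root.findall("foxml:datastream", NS)}


def list_relations(foxml_text):
    """Return (predicate local name, target URI) pairs found in RELS-EXT."""
    root = ET.fromstring(foxml_text)
    relations = []
    for description in root.iter("{%s}Description" % NS["rdf"]):
        for statement in description:
            target = statement.get("{%s}resource" % NS["rdf"])
            if target is not None:
                predicate = statement.tag.split("}", 1)[1]
                relations.append((predicate, target))
    return relations


if __name__ == "__main__":
    print(list_datastreams(FOXML))
    for predicate, target in list_relations(FOXML):
        print(predicate, "->", target)
```

Literal-valued statements such as `oai:itemID` carry no `rdf:resource` attribute, so this sketch would skip them; a real consumer would more likely query the resource index than parse FOXML directly.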
  • 13. APIs (REST/SOAP) • The ‘RESTful’ APIs provide easy HTTP URLs to access (API-A) objects and their datastreams: 1. https://www.meertens.knaw.nl/flat/objects/lat:10744_1b9e0d44_ef4d_496c_8939_6129b5ee5b49/datastreams/CMD/content?asOfDateTime=2017-01-27T11:30:52.732Z 2. https://www.meertens.knaw.nl/flat/objects/lat:10744_792194f7_d1fd_400c_ab2b_9b51f4fe3907/datastreams/OBJ/content?asOfDateTime=2017-01-27T11:31:01.207Z • Used as the redirect target for a handle • Notice the use of a timestamp to refer to a specific version of the datastream • API-M provides methods to update objects and their datastreams • Access to API-M can be limited using repository wide XACML policies
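The URL pattern on this slide (`/objects/{pid}/datastreams/{dsid}/content`, optionally with `asOfDateTime`) can be assembled as in the Python sketch below. The base URL and PID in the usage example are hypothetical placeholders, and the helper name is an assumption rather than part of any Fedora Commons client.

```python
from urllib.parse import quote, urlencode


def datastream_content_url(base_url, pid, dsid, as_of=None):
    """Build an API-A URL for the content of one datastream.

    as_of is an optional timestamp string that selects the datastream
    version that was current at that moment.
    """
    url = "%s/objects/%s/datastreams/%s/content" % (
        base_url.rstrip("/"),
        quote(pid, safe=":"),   # keep the PID's namespace colon readable
        quote(dsid, safe=""),
    )
    if as_of:
        url += "?" + urlencode({"asOfDateTime": as_of}, safe=":")
    return url


if __name__ == "__main__":
    # Hypothetical repository and PID, for illustration only.
    print(datastream_content_url(
        "https://repo.example.org/flat", "lat:example_1", "CMD",
        as_of="2017-01-27T11:30:52.732Z"))
```

Because the timestamp pins a specific version, a handle that redirects to such a URL keeps resolving to the same bit stream even after the datastream is updated.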
  • 14. Security (XACML) • eXtensible Access Control Markup Language (XACML) is an OASIS standard to encode access control policies “Each XACML policy defines: (1) a "target" describes what the policy applies to (by referring to attributes of users, operations, objects, datastreams, dates, and more), and (2) one or more "rules" to permit or deny access.” • Rather cryptic and bloated language • Repository wide policies • Access to API-M (methods) by certain users/roles from certain IP addresses • … • Object specific policies • Which users can access which datastreams • … • User profiles • Plug in any auth filter in the application server • Hardcoded users • …
  • 15. Fedora Commons as a basis - extensions • Faceted search: gsearch (Solr) • Listens to the FC message queue • Runs an XSLT to create a Solr document • OAI-PMH: Proai • Occasionally queries FC's resource index • Can deliver other metadata datastreams than the default Dublin Core • …
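gsearch does the FOXML-to-Solr step with an XSLT; the Python sketch below only illustrates the shape of that transformation for a Dublin Core record, producing a Solr XML update message. The field naming scheme (`PID`, `dc.<element>`) and the function name are assumptions for illustration, not gsearch's actual configuration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def dc_to_solr_doc(pid, dc_xml):
    """Turn a Dublin Core record into a Solr <add><doc> update message."""
    dc_root = ET.fromstring(dc_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="PID").text = pid
    for element in dc_root:
        # Only index non-empty elements from the Dublin Core namespace.
        if element.tag.startswith("{%s}" % DC_NS) and element.text and element.text.strip():
            local_name = element.tag.split("}", 1)[1]
            ET.SubElement(doc, "field", name="dc." + local_name).text = element.text.strip()
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    record = (
        '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>deerhunt story</dc:title>"
        "</oai_dc:dc>"
    )
    print(dc_to_solr_doc("lat:example_1", record))
```

In a gsearch setup the same mapping would be triggered by a message on the queue whenever an object is created or modified.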
  • 16. Fedora Commons as a basis - frontends • Islandora • Drupal based • Large set of modules, relatively easily extensible • Still based on Fedora Commons 3 • Ongoing experiments/development, e.g., CLAW for Islandora • Hydra • Ruby on Rails based • More hardcoded workflow and data models • … • Portland Common Data Model • Common data model (content models) so migration between front-ends/frameworks becomes easier
  • 17. Overview 1. An overview of Fedora Commons (3.8.1) 2. Current usage by CLARIN centres 3. TLA-FLAT: a CLARIN compatible repository solution based on Fedora Commons and Islandora
  • 18. Repository solutions in use by CLARIN centres • [Bar chart: number of B centres per repository solution (Fedora Commons, DSpace, custom, LAT, GIT, eSciDoc); repository info on 20 B centres in the Centre registry] • Notes: • Meertens: custom -> Fedora Commons • MPI: LAT -> Fedora Commons • eSciDoc: Fedora Commons under the hood • Various C centres also run a Fedora Commons (based) repository
  • 19. How happy are these centres with Fedora Commons? • Sent out a questionnaire to 9 centres: 6 responses • Do you (still) consider Fedora Commons a sustainable repository solution for your center? [Chart: yes / no] • Would you advise new CLARIN centers to use Fedora Commons as (the basis for) their CLARIN-compatible repository solution? [Chart: yes / no / maybe] • Comments: “If you are member of CLARIN-D then you probably might want to choose Fedora, but if you're in another country you might want to take a closer look at other solutions (DSpace or TLA software).” 👍🏻 “Depends partly on available technical expertise”
  • 20. Fedora Commons versions • Which version of Fedora Commons does your centre use in production? [Bar chart: number of centres per version (3.6, 3.6.2, 3.7.1, 3.8.1, 4)] • Do you plan a move to Fedora Commons 4? [Chart: yes / no / maybe] • Comments: “benefit from Linked Open Data approach; within next 2 years” “We are migrating to version 4 right now. We also made major enhancements to our front-end. We are planning to go into production with it within the next months.”
  • 21. Size of the centre’s repositories • # Digital Objects: ca. 150; 2,500; 3,038; 10,000; 33,000 • # bytes: ca. 125M (metadata only); 5G; 16G; ca. 500G • Both MPI and Meertens currently have over 100,000 CMD records in the VLO, which describe resources that take up several TB (and up to 1M DOs). Experiments did reveal problems in the FC area, but they can be repaired
  • 22. Community support • How helpful was/is the documentation available within the Fedora Commons community? [Chart: not at all / somewhat / ok / very much] • How helpful was/is the support by the Fedora Commons community? [Chart: not at all / somewhat / ok / very much] • How helpful was/is the documentation on Fedora Commons by the CLARIN community? [Chart: not at all / somewhat / ok / very much] • How helpful was/is the support for Fedora Commons within the CLARIN community? [Chart: not at all / somewhat / ok / very much] • Comments: “Unfortunately there seem to be no more Fedora User Groups in Europe...” “Being one of the first centers to use Fedora Commons, we did use the documentation available within the FC community. At that time there was not much CLARIN documentation. This blog entry was very useful for us: http://asingh.com.np/blog/fedora-commons-installation-and-configuration-guide/” “an option for the case one has never made use of the support should have been included”
  • 23. Frontends • Do you use a front-end, e.g., Islandora, Hydra or your own, next to Fedora Commons? [Bar chart: number of centres per front-end (none, Islandora, custom)] • Comments: “own front-end, based on Django (EulFedora) and MySQL” “We developed our own, called Erdo” “The built-in user interface is not adequate. You will need to replace it with something better.”
  • 24. Additional advice • “Let Apache httpd (or Apache Tomcat) take care for most of the configuration (access control) and configure Fedora Commons to be "open". Take care what to store in Fedora and what not (it can be very unhandy to store too many data streams inside Fedora).” • “I consider the two offered RDF query languages (SPARQL, ITQL) by Fedora as insufficient, as both miss important features, e.g ITQL can't use regexp search and can't sort strings numerically and SPARQL can't use COUNT operator and also cannot sort strings numerically (at least in version 3.6.2).” • “For CMDI metadata, you also need the Proai OAI provider. Use the version customised for Fedora Commons.”
  • 25. Overview 1. An overview of Fedora Commons (3.8.1) 2. Current usage by CLARIN centres 3. TLA-FLAT: a CLARIN compatible repository solution based on Fedora Commons and Islandora
  • 26. FLAT’s predecessors • The Language Archive (TLA) at the MPI for Psycholinguistics • long history in digital archiving, especially resources on endangered languages • home-built LAT (Language Archiving Technology) • 2014 – now: preparing to switch to a stack that is largely based on off-the-shelf software based on Fedora Commons + Islandora • choice made after an INNET repository workshop and several pilots • initial version based on scripts kindly provided by IDS • started as EasyLAT, now known as (TLA-)FLAT (Fedora Language Archiving Technology) • doing a lot of cleanup/curation along the way from LAT to FLAT • The Meertens Institute • collecting valuable (Dutch) (physical) humanities resources for over a century • digitization projects • born-digital resources • KNAW participates in TLA and the Meertens Institute teamed up with the MPI to modernize its setup and develop FLAT
  • 28. TLA-FLAT base line • Meet the technical CLARIN B centre requirements • Meet the technical Data Seal of Approval (DSA) requirements • Meet organization specific requirements • Meet at least the CLARIN B centre and DSA requirements, as much as possible, with the Fedora Commons backend • frontends (technologies) come and go quickly • How far can we get using available components, configuration and a limited level of tailor-made software? • Mainly to add support for CMDI • Start with Fedora Commons 3.8.x and Islandora 7.x-1.x, move along with the Islandora community to Fedora Commons 4
  • 29. Islandora 7.x-1.x • islandora.ca • An open-source software framework designed to help institutions and organizations and their audiences collaboratively manage, and discover digital assets using a best-practices framework. • Islandora was originally developed by the University of Prince Edward Island's Robertson Library, but is now implemented and contributed to by an ever-growing international community. • Built on a base of Drupal (7.x), Fedora (3.x), and Solr, Islandora releases solution packs which empower users to work with data types (such as image, video, and pdf) and knowledge domains (such as Chemistry and the Digital Humanities). Solution packs also often provide integration with additional viewers, editors, and data processing applications. • wiki.duraspace.org/display/ISLANDORA/Islandora • github.com/Islandora • github.com/Islandora-Labs/islandora_awesome • github.com/discoverygarden • Digital Objects are not Drupal nodes, the Islandora modules interact with Fedora Commons via an intermediate (PHP) layer, Tuque • In CLAW Digital Objects are Drupal nodes synchronized using Apache Camel
  • 30. CLARIN B centre requirements • [CLARIN-B-2] Centres need to adhere to the security guidelines, i.e. the servers need to have accepted certificates. • [CLARIN-B-3] Centres need to join the national identity federation where available and join the CLARIN service provider federation to support single identity and single sign-on operation based on SAML 2.0 and trust declarations. • [CLARIN-B-5] Centres need to offer component based metadata (CMDI) that make use of elements from accepted registries such as ISOcat in accordance with the CLARIN agreements, i.e. metadata needs to be harvestable via OAI-PMH. • [CLARIN-B-6] Centres need to associate PIDs records according to the CLARIN agreements with their objects and add them to the metadata record.
  • 31. DSA requirements • [DSA-10] The data repository enables the users to discover and use the data and refer to them in a persistent way. • [DSA-11] The data repository ensures the integrity of the digital objects and the metadata. • [DSA-12] The data repository ensures the authenticity of the digital objects and the metadata. • [DSA-13] The technical infrastructure explicitly supports the tasks and functions described in internationally accepted archival standards like OAIS.
  • 32. Meertens Institute & TLA requirements • [Home-1] The repository should support arbitrarily deep collection hierarchies. • [Home-2] The repository should support handles as persistent identifiers. • [Home-3] The repository should work with arbitrary CMDI profiles. • [Home-4] The repository should provide resource level access control. • [Home-5] The repository should allow collection management to review submissions before the resources are actually ingested. • [Home-6] The repository should allow system management to determine the location of resources on persistent storage, e.g., from fast access times to secure tape drives. • [Home-7] The repository should allow the storage of arbitrary relationships between data sets. • [Home-8] The repository should provide entry points for interaction with Virtual Research Environments. • [Home-9] The repository should allow for collection management oriented metadata, which might not be public.
  • 33. FLAT’s place at the Meertens Institute • [Architecture diagram with the components: Drupal, Islandora, Fedora Commons, Deposition Service (DoorKeeper), SIP, AIP, Workspace (ownCloud), Virtual Research Environment, Persistent storage, SOLR (MTAS), Backups (EUDAT), Collection Management, Infrastructures (CLARIN), SWORD, CMDI SP, OAI-PMH]
  • 34. FLAT’s place at the MPI/TLA • [Architecture diagram with the components: Drupal, Islandora, Fedora Commons, Deposition Service (DoorKeeper), SIP, AIP, Workspace (ownCloud), Deposition UI, Persistent storage, Backups (DANS), Infrastructures (CLARIN), SWORD, CMDI SP, OAI-PMH]
  • 35. FLAT modules • Core • Fedora Commons and Islandora setup • CMDI Solution Pack • CMD to FOXML conversion • Proai setup • Indexing (SOLR) • gsearch-based solution for CMDI • Meertens’ CMDI indexer • SWORD 2.0 • Reuses a deposit via SWORD approach and implementation by DANS • DoorKeeper • Deposition UI • IMDI conversion • Shibboleth • Note: Shibboleth setup is very server specific, so there is a module that illustrates the Drupal setup and can be combined with a test IdP.
  • 36. CMDI Solution Pack • Registers a metadata renderer in Islandora • Triggers when a Digital Object uses the CMDI content model and renders the CMD datastream • The default render XSLT can be overridden by profile specific XSLTs • Not FLAT specific, i.e., could be reused outside of FLAT
  • 37. Archival Information Package (AIP) • [Diagram: Digital Objects labelled Collection + CMDI, Collection, Collection + Compound + CMDI, Compound + CMDI, Image and Video, carrying the datastreams CMD, RELS-EXT, DC and OBJ, and linked by isMemberOfCollection, isConstituentOf and contentLocation relations] • FLAT reuses a lot of Islandora’s content models so rendering is easy. And they can be easily taken along without Islandora.
  • 38. FLAT’s DoorKeeper • A configurable chain of actions that • Validate the CMDI, also according to centre specific requirements • Check the validity of resources against preferred formats (FITS) • Assess metadata quality • Offer the SIP for evaluation to collection management • Move new resources from a temporary workspace into persistent locations • Expand WebACL to XACML • Version management • Assign and create handles (EPIC) • Interact with Fedora Commons’ API-M • Trigger indexing • Create backup bags (for DANS or EUDAT) • Create user- and developer-oriented logs • Interaction via a REST API or the command line • Uses dynamic class loading, i.e., easily extensible with centre specific actions • Not too FLAT specific, e.g., usable by other repository setups or replace Fedora by DSpace • Actions are, in general, lean and mean, so it's relatively easy to implement one in Java.
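The "configurable chain of actions" idea can be sketched as follows. The DoorKeeper itself is written in Java and loads action classes dynamically; this Python sketch is not its API. The class names, the registry (which stands in for dynamic class loading), the SIP dictionary and the placeholder handle prefix are all assumptions made for illustration.

```python
class Action:
    """Base class for one step in the chain (hypothetical interface)."""
    name = "action"

    def perform(self, sip, log):
        raise NotImplementedError


class RequireRecord(Action):
    """Stand-in for CMDI validation: only checks that a record is present."""
    name = "require-record"

    def perform(self, sip, log):
        ok = bool(sip.get("record"))
        log.append("%s: %s" % (self.name, "ok" if ok else "missing CMD record"))
        return ok


class AssignHandle(Action):
    """Stand-in for handle creation: assigns a placeholder identifier."""
    name = "assign-handle"

    def perform(self, sip, log):
        sip["handle"] = "hdl:EXAMPLE/" + sip["id"]
        log.append("%s: %s" % (self.name, sip["handle"]))
        return True


# Maps configured action names to classes; stands in for dynamic class loading.
REGISTRY = {cls.name: cls for cls in (RequireRecord, AssignHandle)}


def run_chain(action_names, sip):
    """Run the configured actions in order and stop at the first failure."""
    log = []
    for name in action_names:
        action = REGISTRY[name]()
        if not action.perform(sip, log):
            return False, log
    return True, log


if __name__ == "__main__":
    ok, log = run_chain(["require-record", "assign-handle"],
                        {"id": "sip1", "record": "<CMD/>"})
    print(ok, log)
```

Keeping each action small and ordering them by configuration is what lets a centre insert its own checks without touching the others.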
  • 39. Submission Information Package (SIP) • A CMD record referring with • relative paths to resources within the package • absolute paths to resources already on the server • For example, in the user’s ownCloud data directory • (block access to system files!) • Additional files • Access control • License • … • When using the SWORD 2.0 interface these are put in a bag and zipped for upload • The SWORD interface allows upload in parts
      +-test-sip/
        +-bag-info.txt
        +-bagit.txt
        +-data/
        | +-metadata/
        | | +-policy.n3
        | | +-record.cmdi
        | +-resources/
        |   +-my comic.pdf
        |   +-secret.txt
        +-manifest-md5.txt
        +-tagmanifest-md5.txt
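The `manifest-md5.txt` file in the bag lists a checksum for every payload file under `data/`. A minimal Python sketch of how such lines could be produced with the standard library is shown below; the function name is an assumption, and a real deposit would more likely use an existing BagIt tool.

```python
import hashlib
from pathlib import Path


def md5_manifest(bag_dir):
    """Return manifest lines ("<md5>  <path relative to the bag>") for data/."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append("%s  %s" % (digest, path.relative_to(bag).as_posix()))
    return lines


if __name__ == "__main__":
    import sys
    # Usage: python manifest.py path/to/test-sip
    for line in md5_manifest(sys.argv[1]):
        print(line)
```

Reading whole files into memory is fine for a sketch; for large resources the digest should be updated in chunks instead.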
  • 40. Security • To hide the intricacies of XACML and design choices for content models we use WebACL to specify the access rules for a SIP
      @prefix acl: <http://www.w3.org/ns/auth/acl#> .
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .
      # make a specific resource (identified by the ID of the ResourceProxy) in the SIP accessible to a specific user
      [acl:accessTo <sip#h1>; acl:mode acl:Read; acl:agent <#other1>].
      # a colleague
      <#other1> a foaf:Person ; foaf:account [foaf:accountServiceHomepage <#flat>; foaf:accountName "sarah@meertens.knaw.nl"].
      # give the owner read and write access
      [acl:accessTo <sip>; acl:mode acl:Read, acl:Write; acl:agent <#owner>].
      # the owner
      <#owner> a foaf:Person ; foaf:account [foaf:accountServiceHomepage <#flat>; foaf:accountName "bob@meertens.knaw.nl"].
      Slide callouts: shortcuts; Shibboleth EPPNs
  • 41. CMDI indexing for faceted search 1. gsearch for CMDI • Based on an XSLT that processes the FOXML • FLAT generates an XSLT for the (internal) CMD datastream • Based on the profiles in your CMD records • And a VLO-like mapping • Facet = VLO facet • Facet = concept • Facet = hard coded XPath • Only the configured facets will be available • Can also be used for the required CMD to DC mapping • Allows running FLAT for your CMD records out-of-the-box 2. Meertens CMD indexer • Analyzes the profiles in your CMD records • Creates facets for all semantic paths it finds • Facet names based on concept links (plus context) • At runtime switch between facets for querying and rendering • Includes indexing of collection and compound relationships. Islandora can use the SOLR for this instead of the resource index (by default Mulgara), which is needed in case of large collections/compounds. Replacing Mulgara by another triple store, e.g., Blazegraph, is even better, but requires all components to use SPARQL instead of ITQL.
  • 42. Deposition UI • Drupal/Islandora module • Create a project • Upload a CMD record • Or create a new one using a form • Upload resources • Via a project specific ownCloud data directory • dropbox-like functionality • possibility to link with other providers (dropbox, google drive, ….) • no need to worry about uploading ‘big’ files • Freeze a project • Validate the SIP using the DoorKeeper (async) • Deposit a valid project • Validate and deposit the SIP using the DoorKeeper (async)
  • 43. New vs legacy data • New data goes via the DoorKeeper so it is checked against the centre's policies! • Legacy (meta)data can be bulk loaded into Fedora Commons: • Convert IMDI to CMDI (optional) • Create FOXML for CMD records and resources • ResourceProxies should contain the local paths to resources, e.g., via @lat:localURI • Bulk load into Fedora Commons • Index for faceted search • Update handles • EPICify (github.com/meertensinstituut/EPICify) • Scripts available, but need to be generalized.
  • 44. Branding • Drupal has extensive facilities for styling and templating • Drupal has many modules and blocks for additional functionality • Islandora as well, and it also offers solution packs • During FOXML creation, resource-specific content models can be used • Make sure the needed derivatives are created, either after bulk import or via a DoorKeeper action • Enable solution-pack-specific viewers • Some experiments have been done • FLAT comes with a basic style, but the MPI/TLA and Meertens instances look very different
  • 46. Where are we? • Set of Docker images that extend each other to build up a complete solution for: • Read-only interface for bulk loaded existing (meta)data (master) • Upload of new data via the DoorKeeper (develop) • Update metadata resource proxies in the CMDI collection hierarchy • User audit trails and checksums for big files • Updating existing data via the DoorKeeper • Versioning • Ongoing cleanup and enrichment of (legacy) metadata and resources, e.g., controlled vocabularies, license information In production at the Meertens Institute (www.meertens.knaw.nl/flat), and we are continuously moving cleaned (meta)data from the old setup to FLAT. CLARIN B certification based on FLAT has started. FLAT is currently being connected to the Meertens Institute's questionnaire system. A containerization platform that allows easy development, testing and deployment.
  • 47. FLAT is moving • github.com/TheLanguageArchive/FLAT • Its birthplace, but FLAT is moving to • github.com/TLA-FLAT • Code can be more clearly split over multiple repositories • DoorKeeper • Bundles of actions • Servlet wrapper • CMDI Solution Pack • … • Docker setups • finer granularity • Place for cooperation on • code • configuration • actions • knowledge sharing • Q&A, issues A Dockerfile precisely describes what software to install and how to configure it to get a running system. Fedora Commons, Islandora and Drupal documentation is sometimes hard to find or read, and the full stack has many layers and corners. We can share our experience CLARIN-wide.
  • 49. Conclusions • Fedora Commons (3.8.1) provides much of the basic functionality needed by a CLARIN B centre • Fedora Commons has a proven record of being a stable and satisfactory repository solution for many existing CLARIN centres • Transition from version 3 to 4 is starting to happen • TLA-FLAT is a modular, CLARIN-compliant, Fedora Commons-based solution that is easy to get started with, and a platform for sharing knowledge on running a Fedora Commons repository and its context
  • 50. Thanks! Questions? Now or later: menzo.windhouwer@meertens.knaw.nl Please visit github.com/TheLanguageArchive/FLAT and github.com/TLA-FLAT TLA-FLAT team MI: Marc Kemps-Snijders, Menzo Windhouwer, Rob Zeeman, Bas van der Veen MPI: André Moreira, Daniel von Rhein, Paul Trilsbeek, Guilherme Silva