2. Overview
• Infrastructure Core
– CLARIN Centres
– Metadata and Searching for data
– Federated Content Search
• Resource Curation
– Data Curation
– Software Curation & Web Applications
• Interoperability
• What you can do
• Education and Training
• Conclusions
4. Infrastructure Core
• 5 CLARIN Centres (‘Type B Centres’)
1. MPI
2. Meertens Institute
3. INL
4. Huygens ING
5. DANS
• 3 CLARIN Data Providers (‘Type D Centres’)
1. National Library (KB)
2. Utrecht University Library
3. Netherlands Institute for Sound and Vision
5. Infrastructure Core
• CLARIN Centres
– Have set up a proper repository system
• So resources can be stored there
– Have their CMDI-metadata harvestable
• So resources are visible to others
– Support for persistent identifiers (PIDs)
• So links to resources are ‘never’ broken
– Long-term archiving solution in place
• So resources will not get lost
– Provisions for federated identity management
• So you can login with your own institute account (single sign-on)
– Have acquired the Data Seal of Approval
• So the data repositories can be trusted and are sustainable
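The slide does not name the harvesting mechanism. A minimal sketch, assuming the centres expose their metadata through OAI-PMH with `cmdi` as the metadata prefix; the endpoint URL and the token value are hypothetical:

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument: when it is
    sent, metadataPrefix must be left out.
    """
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
ENDPOINT = "https://repository.example.org/oai"
first_page = list_records_url(ENDPOINT)
next_page = list_records_url(ENDPOINT, resumption_token="abc123")
```

A harvester would fetch `first_page`, read the resumption token from the response, and keep requesting pages until no token is returned.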
6. Infrastructure Core
• CLARIN Type A Centres in NL
– Offer services for the whole CLARIN infrastructure
– Mainly MPI, some Meertens (and UU)
• Enables you to search for resources:
– Harvesting of metadata, Virtual Language Observatory, Meertens
Metadata Search (Meertens), CLARIN-NL Portal (UU)
• Enables you to create metadata
– CMDI registry, CMDI Profile editor, Metadata editor
• Enables you to ensure semantic interoperability
– ISOCAT, RELCAT, SchemaCat
– CLAVAS, CLARIN Concept Registry (Meertens)
– Transfer from MPI to other centres (in EU) on-going
8. Infrastructure Core
• Metadata and Metadata Search
– CMDI metadata created for all data dealt with in CLARIN-NL
– Using flexible CMDI
• If needed, with user defined profiles and components
– Searching for data possible via the
• VLO
• Meertens Metadata Search
• Some work done on metadata for software
– Partially reflected in CLARIN-NL Portal
– But not (yet) in CMDI records / VLO
9. Infrastructure Core
• Metadata and Metadata Search
– CMDI ‘too flexible’
– Big variation in granularity
– Hardly any metadata elements are obligatory
• some crucial metadata elements are lacking
• VLO
– Gives access to over 800k metadata records
– KB metadata are not included (> 1 million)
– Many records are of external origin, with no control over the metadata
– Limited search options via VLO
• Search via VLO is not as easy as it should be
• CLARIN-NL Portal improves this for NL resources
• Will be taken up in CLARIAH
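One way to make the missing-elements problem visible is to check records against a list of required elements. A minimal sketch: the list of required element names and the sample record are made up for illustration and do not correspond to a real CMDI profile.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: element names a centre might choose to require.
REQUIRED = ("Title", "Language", "Licence")


def local_name(tag):
    """Strip an XML namespace, turning '{ns}Title' into 'Title'."""
    return tag.rsplit("}", 1)[-1]


def missing_elements(xml_text, required=REQUIRED):
    """Return the required element names that are absent or have no text."""
    root = ET.fromstring(xml_text)
    present = {
        local_name(el.tag)
        for el in root.iter()
        if el.text and el.text.strip()
    }
    return [name for name in required if name not in present]


# Made-up record: the Licence element is there but empty.
SAMPLE = """<CMD><Components><Corpus>
  <Title>Example corpus</Title>
  <Language>nld</Language>
  <Licence></Licence>
</Corpus></Components></CMD>"""

print(missing_elements(SAMPLE))  # prints ['Licence']
```

The check ignores namespaces and nesting on purpose, so it works across profiles of different granularity; a real policy would probably be defined per profile.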
11. Infrastructure Core
• Federated Content Search (FCS)
– Search via a single interface across multiple, distributed data sources
• NL centres created ‘end points’ for selected
resources
– So they can participate in FCS
• Development of search interface and aggregator
– Different approaches NL v. DE
– NL Development stopped, adopted DE approach
– See CLARIN-D FCS Aggregator
• So far, only string (keyword) search is possible
• Will be taken up again in CLARIAH
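The slide does not spell out the protocol. A minimal sketch of how an aggregator could fan out one keyword query, assuming the end points speak SRU with CQL queries; the end point URLs and centre names are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical end points; real ones are published by each centre.
ENDPOINTS = {
    "centre-a": "https://a.example.org/sru",
    "centre-b": "https://b.example.org/sru",
}


def search_urls(keyword, endpoints=ENDPOINTS, max_records=10):
    """Build one SRU searchRetrieve URL per end point for a keyword query."""
    # Quote the keyword as a CQL string, escaping embedded double quotes.
    cql = '"%s"' % keyword.replace('"', '\\"')
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,
        "maximumRecords": max_records,
    }
    query_string = urlencode(params)
    return {name: base + "?" + query_string for name, base in endpoints.items()}


for name, url in search_urls("fiets").items():
    print(name, url)
```

An aggregator would send these requests in parallel and merge the hits into a single result list; that part is left out here.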
13. Data Curation
• By the CLARIN Data Curation Service (DCS)
– E.g. LESLLA, dialect dictionaries, IPNV Interviews with
veterans
• Via open calls and closed calls
– In many (small) projects
• Recent examples: VALID, DSS, eBNM+
• Broad coverage of the humanities
• Contributed significantly to user involvement
14. Data Curation
Discipline                      Count
Linguistics                     16
History                         9
Literary Studies                5
Cultural Sciences               4
Communication & Media Studies   2
Religion Studies                2
Computational Linguistics       1
Philosophy                      1
Political Sciences              1
17. Software Curation / Web Applications
• Via open calls and closed calls
– In many (small) projects
– Curation / upgrades of existing software
• AVResearcherXL (QuaMerdes), SHEBANQ, ColTime and EXILSEA
upgrades of ELAN, PaQu, Cornetto Interface, …
– Creation of new software
• DSS, eBNM+, RemBench, OpenSONAR, PICCL, AutoSearch, …
– Broad coverage of the humanities
– Contributed significantly to user involvement
18. Software Curation / Web Applications
Discipline                      Count
Linguistics                     27
History                         14
Literary Studies                5
Communication & Media Studies   4
Cultural Sciences               4
Political Sciences              4
Computational Linguistics       3
3 others, each with 1-2
19. Software Curation / Web Applications
Linguistics                     Count
Syntax                          13
Morpho-syntax                   7
Historical linguistics          5
Lexicology                      5
Dialectology                    4
Sign Language                   4
7 others, each with 2
21. Interoperability
• Interoperability
– Do tools apply to data seamlessly?
– Can data be combined seamlessly?
– Can tools be combined seamlessly?
– Does CLARIN support data in real-world formats?
22. Interoperability
• Syntactic Interoperability
– FoLIA becoming a de facto standard format for
linguistically annotated text corpora in the Netherlands
• TTNWW, PICCL, VU-DNC, Nederlab, Basilex, …
– CLAM de facto standard in NL for turning software into
RESTful web services
– But
• there are also other important formats that must be supported
(TEI, LASSY XML, …)
• And still too many ad-hoc formats, often without explicit syntax
and semantics
23. Interoperability
• Semantic Interoperability
– Data Categories for metadata elements actually used
(e.g. in the VLO)
– Data Categories for many data (content) elements defined
but hardly used yet
– ISOCAT was a useful data category registry
• But had many problems
– Now replaced by the CLARIN Concept Registry
• Solves some of ISOCAT’s problems but not all
• Will be addressed in CLARIAH
24. Interoperability
• Support for real world formats
– New research data do not come in standardized formats
– But as mixtures of .doc, .docx, HTML, PDF, plain text,
ePub, …
– And multiple standard formats must be supported in
CLARIN (e.g. both FoLIA and TEI)
– Support for data conversions via the OpenConvert project
– But more is needed
• Will be addressed in CLARIAH
26. What you can do
• Find and select existing data
– Virtual Language Observatory, Meertens Metadata
Search, CLARIN-NL Portal
• Create new data through OCR and orthographic
normalisation
– PICCL
• Create metadata for new or existing data
– CMDI Registry, CMDI profile editor, metadata editors (e.g.
ARBIL), …
27. What you can do
• Make semantics of metadata and data explicit
– ISOCAT, RELCAT, SchemaCAT
• now replaced by CLARIN Concept Registry (CCR)
– CLAVAS
• Enrich data with various kinds of annotations
– TTNWW
• Orthographic normalisation, pos-tagging, lemmatisation,
parsing, named entity recognition, …
– Adelheid, INPOLDER, PaQu, ColTime and EXILSEA
extensions to ELAN
• Upload enriched data to search applications
– PaQu, AutoSearch
28. What you can do
• Search and browse data, and analyze (meta)data and
query results
– OpenSONAR, GrETEL, PaQu, MIMORE, FESLI, SHEBANQ,
AutoSearch, …
– Arthurian Fiction, NameScape, COBWWWEB, eBNM+, C-DSD,
DSS, RemBench, Nederlab, …
– Interviews, WIP, VK, Polimedia, CKCC, DSS,
AVResearcherXL, …
– DUELME, WFT-GTB, CORNETTO, …
29. What you can do
• Visualize data analyses
– COAVA, FESLI, MIMORE, Gabmap, SHEBANQ, Nederlab,
OpenSONAR, …
– CKCC, MIGMAP, AVResearcherXL
• Store new data safely at a CLARIN Centre
– All 5 centres have the Data Seal of Approval
– 4 centres are certified CLARIN Centres
30. Invitation
• Use (elements from) the CLARIN infrastructure
• Join user groups of specific services
• Provide feedback so that we can further improve
CLARIN
• So that you can improve your research
32. Education & Training
• How do you learn to use these tools?
– Courses / tutorials regularly organized
– LOT summer / winter school courses
– Demonstration scenarios and/or screen casts
• E.g. for Gabmap, GrETEL, OpenSONAR
– Educational modules via the portal:
• https://dev.clarin.nl/node/CLARIN%20Educational%20Packages
– Helpdesk: helpdesk@clarin.nl
33. Education & Training
• Do you want to know more?
– Visit the CLARIN-NL portal
• http://portal.clarin.nl
– View the CLARIN-NL movies
• http://www.clarin.nl/node/403
– Visit the demonstrations today
– Ask me (or others) today
35. Conclusions (1)
• CLARIN is starting to provide the data, facilities and services to carry
out humanities research supported by large amounts of data and tools
• With easy interfaces and easy search options (no technical background
needed)
• Some training in using the tools is needed
– To use the possibilities optimally
– To understand the limitations of the data and the tools
– Educational modules for selected functionality are available
– Tutorials / trainings will continue to be regularly organized
36. Conclusions (2)
• But there is still a lot to do
– Extensions of and improvements in metadata
– Improvements of VLO
– Improved functionality for most tools
• Need / desire found by actual use of the tools
– Extend and improve search options for individual resources
– Create options of searching across different resources of the same type
– Improved interoperability
37. Conclusions (3)
• A successor project is needed!
• CLARIAH www.clariah.nl
• Proposal approved June 1, 2014
• Started Jan 1st, 2015
• Kick-off this afternoon