1. A CMD Core Model for CLARIN Web Services
Menzo Windhouwer, Daan Broeder, Dieter van Uytvanck
The Language Archive – Max Planck Institute for Psycholinguistics
Nijmegen, The Netherlands
2. CLARIN vision
Our vision is that the resources for processing
language, the data to be processed as well as appropriate
guidance, advice and training be made available and can
be accessed over a distributed network from the user's
desktop. CLARIN proposes to make this vision a reality:
the user will have access to guidance and advice through
distributed knowledge centres, and via a single sign-on
the user will have access to repositories of data with
standardized descriptions, processing tools ready to
operate on standardized data, and all of this will be
available on the internet using a service oriented
architecture based on secure grid technologies.
http://www.clarin.eu/external/index.php?page=about-clarin&sub=0
22 May 2012 LREC Metadata 2012 workshop 2
3. Outline
• Web Service architectures
• Component Metadata Infrastructure (CMDI)
• National CLARIN initiatives
• CMD core model for Web Services
• Usage of the core model
• Future work and conclusions
4. Service
service: tokenize
input: text=‘Welcome to Istanbul’, separator=whitespace
output: tokens=[‘Welcome’, ‘to’, ‘Istanbul’]
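The service on this slide can be sketched as a plain function. This is only an illustration: the function name and signature are hypothetical, and it assumes that `separator=whitespace` means splitting on runs of whitespace.

```python
def tokenize(text, separator="whitespace"):
    """Split text into tokens (hypothetical stand-in for the slide's service)."""
    if separator == "whitespace":
        # str.split() without arguments splits on runs of whitespace.
        return text.split()
    # Assumption: any other separator value is used as a literal delimiter.
    return text.split(separator)


tokens = tokenize("Welcome to Istanbul")
print(tokens)
```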
5. Web Service
web server: http://www.example.com/
service: tokenize
input: text=‘Welcome to Istanbul’, separator=whitespace
output: tokens=[‘Welcome’, ‘to’, ‘Istanbul’]
How to invoke a service and pass the parameters?
Different Service Oriented Architectures.
6. SOA: RESTful Resource Orientation
web server: http://www.example.com/
service: tokenize
input: text=‘Welcome to Istanbul’, separator=whitespace
output: tokens=[‘Welcome’, ‘to’, ‘Istanbul’]
• Resource oriented instead of service oriented
• A URL tells what resource to operate on
• Uses HTTP verbs (PUT, GET, POST, DELETE) to tell what to do
1. create a resource
PUT http://www.example.com/text
Content-type: text/plain
Welcome to Istanbul
response
201
Location: http://www.example.com/text/123
2. request tokens resource
GET http://www.example.com/text/123/tokens/whitespace
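The two steps above can be mimicked without any networking. The sketch below is an in-memory stand-in for the server, not a real HTTP implementation: the class name `TextResources`, its methods, and the choice to start numbering at 123 are assumptions made to mirror the slide.

```python
import itertools


class TextResources:
    """In-memory stand-in for the resource-oriented server on the slide."""

    def __init__(self):
        self._texts = {}
        self._ids = itertools.count(123)  # start at 123 to mirror the slide

    def create(self, body):
        """Step 1: store a text and return (status, Location path)."""
        text_id = str(next(self._ids))
        self._texts[text_id] = body
        return 201, f"/text/{text_id}"

    def get(self, path):
        """Step 2: GET /text/<id>/tokens/<separator> returns (status, tokens)."""
        parts = path.strip("/").split("/")
        if len(parts) == 4 and parts[0] == "text" and parts[2] == "tokens":
            text = self._texts.get(parts[1])
            # Only the 'whitespace' separator from the slide is supported here.
            if text is not None and parts[3] == "whitespace":
                return 200, text.split()
        return 404, None


server = TextResources()
status, location = server.create("Welcome to Istanbul")
print(status, location)
print(server.get(location + "/tokens/whitespace"))
```

The point of the sketch is that the path names a resource (the text, then its tokens), while the operation is carried by the verb rather than by the URL.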
7. SOA: Remote Procedure Call
web server: http://www.example.com/
service: tokenize
input: text=‘Welcome to Istanbul’, separator=whitespace
output: tokens=[‘Welcome’, ‘to’, ‘Istanbul’]
• XML-RPC and SOAP are HTTP-oriented RPC standards
• A URL may function as an endpoint for several operations
• Uses a standard envelope format to tell what to do
POST http://www.example.com/services
Content-type: text/xml
<methodCall>
<methodName>tokenize</methodName>
<params>
<param><value><string>Welcome to Istanbul</string></value></param>
<param><value><string>whitespace</string></value></param>
</params>
</methodCall>
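An envelope like the one above does not have to be written by hand. As a small sketch, Python's standard `xmlrpc.client` module can serialize and parse such a call; the parameter order (text, then separator) is taken from the slide, and nothing is actually sent to a server.

```python
import xmlrpc.client

# Serialize the call; the parameter order (text, separator) follows the slide.
payload = xmlrpc.client.dumps(
    ("Welcome to Istanbul", "whitespace"), methodname="tokenize"
)
print(payload)

# Parsing the envelope recovers the parameters and the method name.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```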
8. SOA: REST-RPC hybrid
web server: http://www.example.com/
service: tokenize
input: text=‘Welcome to Istanbul’, separator=whitespace
output: tokens=[‘Welcome’, ‘to’, ‘Istanbul’]
• Mixes REST and RPC
• Can be more service than resource oriented
• URL indicates what operation to perform on which data
GET http://www.example.com/tokenize?text=Welcome+to+Istanbul&separator=whitespace
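The request URL above can be built and taken apart with the standard library. This is a minimal sketch: the helper names `build_url` and `parse_url` are hypothetical, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://www.example.com/tokenize"  # endpoint from the slide


def build_url(text, separator="whitespace"):
    # urlencode applies form encoding, so spaces become '+'.
    return BASE + "?" + urlencode({"text": text, "separator": separator})


def parse_url(url):
    """Server side: recover the operation name and its parameters."""
    parts = urlsplit(url)
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path.lstrip("/"), query


url = build_url("Welcome to Istanbul")
print(url)
print(parse_url(url))
```

Here the path segment names the operation and the query string carries its arguments, which is what makes the style a hybrid rather than purely resource oriented.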
9. Interface Description Language
• RPC architectures tend to have an IDL that describes
which operations are available at the endpoint
– SOAP: Web Services Description Language
• WSDL (2)
• For REST and REST-RPC hybrids an IDL is controversial
– Once you have the resource URL you can ‘just’ follow the links,
e.g., like a web crawler does with HTML pages
– However, REST(-RPC) leaves too much freedom for a
machine to infer how to retrieve a resource or invoke a service
• RFC 6570: URI Template
• WADL (old W3C submission by Sun)
• WSDL 2
• ...
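To give an impression of what a URI Template offers, the sketch below expands simple `{name}` expressions only, i.e., Level 1 of RFC 6570; operators such as `{?query}` are not handled. The template itself is hypothetical and modelled on the RESTful example URL used earlier.

```python
import re
from urllib.parse import quote


def expand(template, variables):
    """Expand {name} expressions (RFC 6570 Level 1 simple expansion only)."""

    def substitute(match):
        # Level 1 percent-encodes everything outside the unreserved set;
        # undefined variables expand to the empty string.
        value = variables.get(match.group(1))
        return "" if value is None else quote(str(value), safe="")

    return re.sub(r"\{([A-Za-z0-9_]+)\}", substitute, template)


# Hypothetical template, modelled on the RESTful example URL.
TEMPLATE = "http://www.example.com/text/{id}/tokens/{separator}"
print(expand(TEMPLATE, {"id": 123, "separator": "whitespace"}))
```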
10. Profile matching
• To place Web Services in a chain or a workflow, a
user can be supported by profile matching
– which service can operate on the input the user currently
has available
• The IDL describes the technical needs to invoke a
service, but profile matching needs more semantic
information
– it’s not useful to invoke the tokenizer on a project name,
although it is a string of characters
– next to a technical description also a semantic
description is needed
11. National CLARIN initiatives - Spain
• IULA at UPF (continuation in PANACEA)
• architecture: RPC (SOAP)
• IDL: WSDL
• semantic description:
– SoapLab 2 and myGrid inspired
– a CMD profile has been created
http://www.panacea-lr.eu/
12. National CLARIN initiatives - Germany
• WebLicht (D-SPIN continuation in CLARIN-D)
• architecture: REST-RPC
• IDL: none as there is a known pattern, i.e., POST TCF documents
• semantic description:
– WebLicht used a proprietary service description
– WebLicht 2.0 uses a core model compliant CMD profile
http://clarin-d.net/index.php/en/language-resources/weblicht-en
13. National CLARIN initiatives – Netherlands and Flanders
• TTNWW project
• architecture: any
• IDL: when available and supported by the framework (Taverna)
• semantic description:
– a CMD profile has been created
– a core model compliant CMD profile has been created
http://www.clarin.nl/group/76#TTNWW
14. A CMD core model for CLARIN Web Services
• Several CMD profiles have been created by the
national initiatives
– large overlap due to common area:
• service
• input and output specifications
– differences due to design choices:
• multiple operations per description
• separate technical description (IDL) or none at all
• handling of embedded parameters
• ...
• The CMD core model aims to align these
profiles, but also allows extensions to
accommodate differences in design choices
15. Additional aims for the core model
• The core model should preferably provide
enough information to
– do (basic) profile matching
– invoke a service
• This should allow (national) CLARIN web service
chaining and workflow engines to potentially
use all CLARIN web services
17. Salient points
• A (technical) service description is mandatory, but the model
doesn’t prescribe an IDL
• The location/endpoint of the service is part of the technical
description, i.e., only the PID/URL of the service description is
part of the semantic description
• A description can cover multiple operations
• Parameters might be (deeply) embedded in a technical input
document, e.g., the token or lemma layer inside a TCF
document; this is covered by parameter groups
• Names of operations, parameters and/or groups in the semantic
description should be resolvable in the technical description, so
after profile matching it is known how parameters should
technically be passed on during invocation
• Supports parameter (profile) matching on various semantic levels:
MIME type, data type, data category, semantic type
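The requirement that names in the semantic description resolve in the technical description can be checked mechanically. The sketch below assumes both descriptions have already been reduced to plain dictionaries; this data layout and the function name `unresolved_names` are hypothetical and not part of the core model.

```python
def unresolved_names(semantic, technical):
    """Return (operation, parameter) names missing from the technical description."""
    missing = []
    for operation, parameters in semantic.items():
        if operation not in technical:
            # The operation itself cannot be resolved.
            missing.append((operation, None))
            continue
        for parameter in parameters:
            if parameter not in technical[operation]:
                missing.append((operation, parameter))
    return missing


# Hypothetical descriptions: operation name -> parameter names.
semantic = {"tokenize": ["text", "separator"]}
technical = {"tokenize": {"text": "query", "separator": "query"}}
print(unresolved_names(semantic, technical))
```

An empty result means that, after profile matching, every operation and parameter can be traced to the place where it is technically passed.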
18. Parameter matching
• MIME type reveals the media type: text/plain
• Data type is generally an XML Schema data type: ID
• Data category is generally an ISOcat data category
PID: /project id/ (DC-2535)
• Semantic type is generally a service specific type:
clam.project.adelheid
• The tokenize service could specify text/plain as its
input MIME type, but an Adelheid project name, the
output of the Adelheid create project service, would
still not be proper input
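The four matching levels can be sketched as a simple comparison. The values for the Adelheid project name come from the slide; the class `ParameterProfile`, the rule that an unspecified level accepts anything, and the semantic type `example.running-text` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParameterProfile:
    mime_type: str
    data_type: Optional[str] = None
    data_category: Optional[str] = None
    semantic_type: Optional[str] = None


LEVELS = ("mime_type", "data_type", "data_category", "semantic_type")


def matches(output, wanted):
    """True if `output` satisfies every level that `wanted` constrains."""
    for level in LEVELS:
        required = getattr(wanted, level)
        # Assumption: a level left unspecified (None) accepts anything.
        if required is not None and required != getattr(output, level):
            return False
    return True


# Output of the Adelheid create project service, as described on the slide.
project_name = ParameterProfile("text/plain", "ID", "DC-2535", "clam.project.adelheid")
# Two hypothetical input profiles for the tokenizer.
mime_only_input = ParameterProfile("text/plain")
typed_input = ParameterProfile("text/plain", semantic_type="example.running-text")

print(matches(project_name, mime_only_input))
print(matches(project_name, typed_input))
```

With only the MIME type constrained, the project name is accepted as tokenizer input; adding a constraint on a more semantic level rejects it.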
19. From UML to CMD
• Transformation to deal with inheritance:
– each non-abstract class becomes a component
– each atomic attribute becomes an element, but
– each referential attribute becomes a component
with the referred class as a child component, except
– when this class is abstract all non-abstract
subclasses become child components
– copy cardinality constraints where possible
http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_1311927752335
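The transformation rules above can be sketched over a toy model. The class names (`Service`, `Description`, `WsdlDescription`, `WadlDescription`), the dictionary output, and the string cardinalities are all hypothetical; the actual UML model is not reproduced here, and handling of inherited attributes is left out.

```python
from dataclasses import dataclass, field


@dataclass
class UmlClass:
    name: str
    abstract: bool = False
    parent: str = ""                                # superclass name, if any
    atomic: dict = field(default_factory=dict)      # attribute -> cardinality
    references: dict = field(default_factory=dict)  # attribute -> (class, cardinality)


def concrete_subclasses(model, name):
    """All non-abstract descendants of the named class."""
    found = []
    for cls in model.values():
        if cls.parent == name:
            if not cls.abstract:
                found.append(cls.name)
            found.extend(concrete_subclasses(model, cls.name))
    return found


def children_for(model, target):
    # An abstract target is replaced by its non-abstract subclasses.
    if model[target].abstract:
        return concrete_subclasses(model, target)
    return [target]


def to_cmd(model):
    components = {}
    for cls in model.values():
        if cls.abstract:
            continue  # only non-abstract classes become components
        components[cls.name] = {
            "elements": dict(cls.atomic),  # atomic attributes become elements
            "components": {
                attr: {"children": children_for(model, target), "cardinality": card}
                for attr, (target, card) in cls.references.items()
            },
        }
    return components


classes = [
    UmlClass("Service", atomic={"name": "1"},
             references={"description": ("Description", "1")}),
    UmlClass("Description", abstract=True),
    UmlClass("WsdlDescription", parent="Description", atomic={"location": "1"}),
    UmlClass("WadlDescription", parent="Description", atomic={"location": "1"}),
]
model = {cls.name: cls for cls in classes}
cmd = to_cmd(model)
print(cmd["Service"])
```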
21. Usage of the CMD core model
• The core model is only a starting point, i.e., provides enough
information for basic profile matching and technical invocation
• It is a template to form a basis for the CMD profile of specific web
service registries, i.e., the model can be extended
• However, instantiations should also maintain compliance with the
core model
– cardinalities should be within the boundaries of the core model, e.g.,
mandatory elements cannot become optional
– closed value domains cannot be extended, but open value domains can be
turned into closed ones
– data category references should not be changed as this would imply different
semantics
• CLARIN-NL ToolService profile is a compliant extension:
http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_1311927752306
• Validate compliance of instances with the core model:
http://www.isocat.org/clarin/ws/cmd-core/
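The three compliance rules can be expressed as a per-element check. This is a sketch under assumptions and not the validator behind the URL above: the `ElementSpec` structure, the function name `violations`, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ElementSpec:
    min_occurs: int
    max_occurs: Optional[int]                # None means unbounded
    data_category: Optional[str] = None      # e.g. an ISOcat reference
    values: Optional[FrozenSet[str]] = None  # None means an open value domain


def violations(core, ext):
    """List the ways `ext` breaks compliance with `core` (empty if compliant)."""
    problems = []
    if ext.min_occurs < core.min_occurs:
        problems.append("minimum cardinality lowered")
    if core.max_occurs is not None and (
        ext.max_occurs is None or ext.max_occurs > core.max_occurs
    ):
        problems.append("maximum cardinality raised")
    if core.values is not None and (
        ext.values is None or not ext.values <= core.values
    ):
        problems.append("closed value domain extended")
    if ext.data_category != core.data_category:
        problems.append("data category reference changed")
    return problems


# Hypothetical core element: mandatory, single, open value domain.
core = ElementSpec(1, 1, "DC-2535")
closed = ElementSpec(1, 1, "DC-2535", frozenset({"a", "b"}))  # open turned closed
optional = ElementSpec(0, 1, "DC-0000")  # made optional, category changed

print(violations(core, closed))
print(violations(core, optional))
```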
22. Current status and future work
• Current state:
– There are some compliant profiles:
• CLARIN-NL ToolService profile
– but not in use by TTNWW yet
• WebLicht 2.0 profile
– in use, but still missing technical description (WADL)
– The core model is successful if there is a workflow/chaining
engine invoking web services which were originally targeted at
another engine or none at all
• Future work:
– Complex chains of web services are captured in (mini) workflows
• can this core model also describe the mini workflows?
– Alignment with or reuse by other initiatives
• e.g., META-SHARE meta model is also based on components and
ISOcat, and contains a section on Tools and Services
– Identify common extensions and incorporate them into the core
• e.g., default values, cardinalities, asynchronicity
23. Thanks for your attention!
Please visit:
http://www.isocat.org/clarin/ws/cmd-core/