Language Resource Processing Configuration and Run
Introduction
This memo describes the steps to configure and run language resource processing. It
is intended for internal use only.
Architecture overview
Main components
There are three main components involved in the language resource processing:
● The Resource Server (hereafter RS) manages information about resources,
their status and associated files.
● The Workflow Server (hereafter WS) is responsible for processing resource
input files into output files that are loaded to the Virtuoso server. The WS is
implemented using Oozie and Hadoop.
● The processing components provided by DERI and the other participants.
Data and Processing Flow
The following diagram shows the communication between the WS and the RS during the
processing of a resource:
The flow:
1. The flow is started by the administrator with an HTTP call to the RS REST API.
The call URL contains the resource ID as a parameter. Example: POST
/resources/48957c5d-456c-4d7a-abc9-3062c91dafdd/processed
2. The first step in the processing is done by the RS. It downloads the resource input
file and uploads it to the SCP server under the name ${resource_id}.ext
3. The resource server then selects the flow by resource type, sets the flow properties
and starts the flow using the WS API of Oozie.
4. Oozie executes the flow, which contains data-moving steps and the execution of the
resource processing components. The penultimate step in the flow is
the loading of data to the Virtuoso server, which is done by the miniLoader java
action.
5. The last step in the Oozie flow is the notification of the resource server about the
Virtuoso load status. The resource server then notifies the LRPMA about the processing
status.
Processing setup overview
The whole processing is configured in the following steps:
1. resource type definition
2. registration of resource
3. definition of workflow
Processing setup
Definition of the resource type
First, it is necessary to create a resource type using the resource server. Creating a
resource type is an HTTP POST request, so it can be done either with a
command-line HTTP tool such as curl or with a REST client. Screenshots
from the Postman REST client are included in the following text for illustration. The
request parameters are also given in tables because they are easier to read (and copy & paste).
The HTTP header Content-Type should be set to "application/json".
The resource server address is http://54.201.101.125:9999. Suppose that it is
necessary to process resources provided by Paradigma ltd. that contain a lexicon,
so the result of the processing will be one graph.
Request
POST http://54.201.101.125:9999/resourcestypes
Example body
{
  "id": "paradigma",
  "description": "type intended for processing of resources provided by Paradigma",
  "graphsSuffixes": ["lexicon"]
}
Example response
{
  "id": "paradigma"
}
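For illustration, the same request could be sent from the command line with curl; a sketch using the address and body shown above:
curl -X POST http://54.201.101.125:9999/resourcestypes \
  -H "Content-Type: application/json" \
  -d '{"id": "paradigma", "description": "type intended for processing of resources provided by Paradigma", "graphsSuffixes": ["lexicon"]}'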
The resource type defines which workflow is used for processing of the resource, and
the resource type ID is used as the name of the subfolder on HDFS for the Oozie workflow.
Registration of the resource
The language resource should be registered in the resource server. Normally this is
done via the LRPMA, but it is possible to do it manually for test purposes using the
resource server REST API.
Request
POST http://54.201.101.125:9999/resources
Example body
{
  "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
  "resourceType": "paradigma",
  "downloadUri": "scp://ubuntu@54.201.101.125/home/ubuntu/ParadigmaData/hotel_ca_tricks.csv",
  "credentials": "-----BEGIN RSA PRIVATE KEY----- …",
  "language": "ca",
  "domain": "hotel",
  "provider": "Paradigma ltd",
  "licence": "LRGPL",
  "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/"
}
Example response
{
  "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0"
}
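The registration can be sent the same way with curl; a sketch, assuming the example body above is saved in a local file named resource.json (a hypothetical file name) with the private key filled in:
curl -X POST http://54.201.101.125:9999/resources \
  -H "Content-Type: application/json" \
  -d @resource.json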
Definition of Workflow
Processing steps are defined by an XML workflow file that should be copied to the Hadoop
Distributed File System, to the location configured in the Resource Server
configuration. The flow contains actions. Every action defines the next action in case of
its success.
Properties populated by the resource server are used in the workflow definition
XML files.
Properties of flows populated by the Resource Server:
Properties calculated or retrieved from the resource properties:
Property                              Description
rsresourceid                          id of the resource
rsgraphprefix                         prefix for graphs, please see the miniLoader java action description below
rsgraphsufix0, [rsgraphsufix1], ...   graph suffixes, one for each file produced by the flow
rsdomain                              domain of the processed resource
rslanguage                            language of the processed resource
rsprovider                            provider
rslicense                             license
oozie.wf.application.path             ${hdfs-folder-uri}/${resourceTypeId}; hdfs-folder-uri is specified in conf.properties of the RS, resourceTypeId is a property of the resource on the RS
The resource server also copies properties from the resource server configuration file
conf/job.properties to the flow properties. This can be used for properties common to
all flows, such as:
Property                Description
nameNode                HDFS name node address
jobTracker              MapReduce job tracker address
queueName               MapReduce jobs queue name
user.name               user used to run the Oozie flow
inputfolder             folder where downloaded resource files are stored
rspfilesdir             folder for processed files
rsvirtuosoloadfolder    absolute path to the folder where files for loading are stored
rsvirtuosohost          hostname or address of the Virtuoso server
rsvirtuosojdbcport      JDBC port
rsvirtuosojdbcuser      JDBC user
rsvirtuosojdbcpasswd    JDBC password
rsprocessedurl          URL to which the result of the Virtuoso load is sent
Example:
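A minimal sketch of what conf/job.properties might contain; every host name, port and path below is an illustrative placeholder, not a value from the actual deployment:
nameNode=hdfs://namenode.example.org:8020
jobTracker=jobtracker.example.org:8021
queueName=default
user.name=ubuntu
inputfolder=/home/ubuntu/inputs/
rspfilesdir=/home/ubuntu/data/outputs/
rsvirtuosoloadfolder=/home/ubuntu/virtuoso/load/
rsvirtuosohost=virtuoso.example.org
rsvirtuosojdbcport=1111
rsvirtuosojdbcuser=dba
rsvirtuosojdbcpasswd=changeme
rsprocessedurl=http://54.201.101.125:9999/resources/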
Configuring Actions
Workflows usually contain the following sequence:
◦ moving data to a place where it can be reached by the first processing
component
◦ processing by the first component
◦ moving data to a place where it can be reached by the second processing
component
◦ processing by the second component
◦ ….
◦ loading to the Virtuoso triple store
Moving the resource file to the processing components
The following snippet shows an example configuration of the first step in the flow, which
moves the resource files to a folder where they can be picked up by a processing
component.
<workflow-app xmlns="uri:oozie:workflow:0.3" name="deri-workflow">
  <start to="move-resource-file"/>
  <action name="move-resource-file" retry-max="2" retry-interval="1">
    <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
      <host>ubuntu@ptwf</host>
      <command>${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv</command>
      <capture-output/>
    </sshWithRetry>
    <ok to="lemon-marl-generator"/>
    <error to="fail"/>
  </action>
Configuring processing
The following XML snippet shows an example of processing by the Lemon Marl
generator.
<action name="lemon-marl-generator" retry-max="3" retry-interval="1">
<sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
<host>ubuntu@ptnuig</host>
<command>~/bin/runLemonMarlGeneratorParadigma.sh
/home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/$
{rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}$
{rsgraphsufix0}</command>
<capture-output/>
</sshWithRetry>
<ok to="move-file2virtuoso"/>
<error to="fail"/>
</action>
Moving data to Virtuoso Server
The following XML snippet shows an action which moves the output of the previous step to
the Virtuoso server.
<action name="move-file2virtuoso" retry-max="2" retry-interval="1">
<sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
<host>ubuntu@ptnuig</host>
<command>${moveScriptPath} /home/ubuntu/data/outputs/$
{rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}$
{rsresourceid}.ttl</command>
<capture-output/>
</sshWithRetry>
<ok to="load2virtuoso"/>
<error to="fail"/>
</action>
Load data to the Virtuoso Server
The following XML snippet shows an example configuration of the miniLoader
component that is used to load the processed resource files to the Virtuoso
server.
  <action name="load2virtuoso" retry-max="2" retry-interval="10">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>
      <main-class>com.sindice.miniloader.Miniloader</main-class>
      <arg>${rsvirtuosohost}</arg>
      <arg>${rsvirtuosojdbcport}</arg>
      <arg>${rsvirtuosojdbcuser}</arg>
      <arg>${rsvirtuosojdbcpasswd}</arg>
      <arg>${rsvirtuosoloadfolder}${rsresourceid}.ttl</arg>
      <arg>${rsgraphprefix}${rsgraphsufix0}</arg>
      <capture-output/>
    </java>
    <ok to="notify_rs" />
    <error to="fail" />
  </action>
Notifying the resource server
The last step notifies the RS that the data was loaded to the Virtuoso server.
<action name="notify_rs">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>curl</exec>
<argument>-H</argument>
<argument>Content-Type:application/json</argument>
<argument>-X</argument>
<argument>POST</argument>
<argument>-d</argument>
<argument>${wf:actionData('load2virtuoso')
['miniloader_json4rs']}</argument>
<argument>${rsprocessedurl}$
{rsresourceid}/processed</argument>
</shell>
9. <ok to="end" />
<error to="fail" />
</action>
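The action snippets above refer to fail and end nodes that are not shown. In a complete workflow definition they would be declared with the standard Oozie kill and end elements; a sketch, with an illustrative message text:
  <kill name="fail">
    <message>Processing failed, error message: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>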
Copy the configuration to the HDFS
The property "hdfs-folder-uri" in the conf.properties RS configuration file defines the path
where the configuration should be stored.
The resource type ID (paradigma) is part of the HDFS path, so it is first necessary to
check whether it exists:
If the folder for the given resource type does not exist yet, it is necessary to create it.
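For example, the check and the folder creation can be done with the HDFS shell; the path below follows the /user/ubuntu/nuig-flows/paradigma/ location used by the copy commands that follow (depending on the Hadoop version, -mkdir may need the -p flag to create missing parent folders):
hadoop fs -ls /user/ubuntu/nuig-flows/
hadoop fs -mkdir /user/ubuntu/nuig-flows/paradigma
hadoop fs -mkdir /user/ubuntu/nuig-flows/paradigma/lib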
Now it is necessary to copy the workflow and the required jars. In this case only the
miniloader jar is required, and it should be copied to the lib subfolder.
hadoop fs -put workflow.xml /user/ubuntu/nuig-flows/paradigma/
hadoop fs -put ~/virtuoso-miniloader-0.0.1-SNAPSHOT.jar /user/ubuntu/nuig-flows/paradigma/lib
Processing Resources
Processing is started by an HTTP POST request to the RS server with an empty body.
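For example, for the resource registered above (the ID is the one from the registration example):
curl -X POST http://54.201.101.125:9999/resources/48957c5d-456c-4d7a-abc9-3062c91dafE0/processed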
It is possible to monitor the status of the processing using the Oozie web console:
clicking the line of the running job opens a detail window.