Language Resource Processing Configuration and Run

Eurosentiment resources pipeline description.


Published in: Software, Technology

Transcript

Intro

This memo describes the steps to configure and run language resource processing. It is intended for internal use only.

Architecture overview

Main components

There are three main components involved in language resource processing:
● The Resource Server (hereafter RS) manages information about resources, their status and associated files.
● The Workflow Server (hereafter WS) processes resource input files into output files that are loaded into the Virtuoso server. The WS is implemented using Oozie and Hadoop.
● Processing components provided by DERI and other participants.

Data and Processing Flow

[Diagram: communication between the WS and the RS while a resource is processed]

The flow:
1. The flow is started by the administrator with an HTTP call to the RS REST API. The call URL contains the resource ID as a parameter. Example:
   POST /resources/48957c5d-456c-4d7a-abc9-3062c91dafdd/processed
2. The first processing step is done by the RS. It downloads the resource input file and uploads it to the SCP server under the name ${resource_id}.ext
3. The resource server then selects the flow by resource type, sets the flow properties and starts the flow using the Oozie WS API.
4. Oozie executes the flow, which contains data moving steps and the execution of the resource processing components. The penultimate step in the flow is loading the data into the Virtuoso server, which is done by the miniLoader java action.
5. The last step in the Oozie flow notifies the resource server about the Virtuoso load status. The resource server then notifies the LRPMA about the processing status.

Processing set up overview

The whole processing is configured by the following steps:
1. resource type definition
2. registration of the resource
3. definition of the workflow

Processing set up

Definition of the resource type

First, it is necessary to create a resource type on the resource server. A resource type is created with an HTTP POST request, so it can be done either with a command-line HTTP tool such as curl or with a REST client. The original memo illustrates the requests with screenshots from the Postman REST client. The request parameters are also listed in tables because they are easier to read (and to copy and paste). The HTTP header Content-Type should be set to "application/json". The resource server address is http://54.201.101.125:9999.

Suppose that it is necessary to process resources provided by Paradigma ltd. Each resource contains a lexicon, so the result of processing will be one graph.
Request
    POST http://54.201.101.125:9999/resourcestypes
Example body
    {
      "id": "paradigma",
      "description": "type intended for processing of resources provided by Paradigma ",
      "graphsSuffixes": ["lexicon"]
    }
Example response
    { "id": "paradigma" }

The resource type defines which workflow is used to process the resource, and the resource type id is used as the name of the subfolder on HDFS that holds the Oozie workflow.

Registration of the resource

The language resource should be registered in the resource server. Normally this is done via the LRPMA, but it is possible to do it manually for test purposes using the resource server REST API.

Request
    POST http://54.201.101.125:9999/resources
Example body
    {
      "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
      "resourceType": "paradigma",
      "downloadUri": "scp://ubuntu@54.201.101.125/home/ubuntu/ParadigmaData/hotel_ca_tricks.csv",
      "credentials": "-----BEGIN RSA PRIVATE KEY----- …...,
      "language": "ca",
      "domain": "hotel",
      "provider": "Paradigma ltd",
      "licence": "LRGPL",
      "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/"
    }
Example response
    { "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0" }

Definition of Workflow

Processing steps are defined by an XML workflow file that should be copied to the Hadoop Distributed File System (HDFS), to the location configured in the Resource Server configuration file. The flow contains actions. Every action defines the next action to run in case of its success. Properties populated by the resource server are used in the workflow definition XML files.

Properties of flows populated by the Resource Server:

Properties calculated or retrieved from the resource properties:

Property                             Description
rsresourceid                         id of the resource
rsgraphprefix                        prefix for graphs; see the miniLoader java action description below
rsgraphsufix0, [rsgraphsufix1]...    graph suffixes, one for each file produced by the flow
rsdomain                             domain of the processed resource
rslanguage                           language of the processed resource
rsprovider                           provider
rslicense                            license
oozie.wf.application.path            ${hdfs-folder-uri}/${resourceTypeId}; hdfs-folder-uri is specified in conf.properties of the RS, resourceTypeId is a property of the resource on the RS

The resource server also copies properties from its configuration file conf/job.properties to the flow properties. This can be used for properties that are common to all flows, such as:

Property                 Description
nameNode                 HDFS name node address
jobTracker               MapReduce job tracker address
queueName                MapReduce jobs queue name
user.name                user used to run the Oozie flow
inputfolder              folder where downloaded resource files are stored
rspfilesdir              folder for processed files
rsvirtuosoloadfolder     absolute path to the folder where files for loading are stored
rsvirtuosohost           hostname or address of the Virtuoso server
rsvirtuosojdbcport       JDBC port
rsvirtuosojdbcuser       user
rsvirtuosojdbcpasswd     password
rsprocessedurl           URL to which the result of the Virtuoso load is sent

[Example: figure not included in the transcript]
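The way the flow properties in the two tables above fit together can be sketched in Python. This is only an illustration under assumptions, not the RS implementation: the function names, the dictionary layout, the sample job properties and the value used for hdfs-folder-uri are hypothetical, and the memo does not say which side wins if a name appears in both sources (here the resource-derived value overwrites the common one). Only the property names and the resource fields come from the memo. The second function renders the properties as a Hadoop-style configuration XML, which is the format the Oozie web-services API accepts for job properties.

```python
import xml.etree.ElementTree as ET


def build_flow_properties(resource, graph_suffixes, job_properties, hdfs_folder_uri):
    """Merge common job properties with properties derived from one resource.

    Assumption: resource-derived values overwrite entries from job.properties.
    """
    props = dict(job_properties)  # properties common to all flows
    props.update({
        "rsresourceid": resource["id"],
        "rsgraphprefix": resource["graphNamesPrefix"],
        "rsdomain": resource["domain"],
        "rslanguage": resource["language"],
        "rsprovider": resource["provider"],
        "rslicense": resource["licence"],
        # ${hdfs-folder-uri}/${resourceTypeId}
        "oozie.wf.application.path": f"{hdfs_folder_uri}/{resource['resourceType']}",
    })
    # One rsgraphsufixN property per graph suffix of the resource type.
    for index, suffix in enumerate(graph_suffixes):
        props[f"rsgraphsufix{index}"] = suffix
    return props


def to_oozie_configuration(props):
    """Render properties as <configuration><property>... XML."""
    root = ET.Element("configuration")
    for name, value in sorted(props.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    resource = {
        "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
        "resourceType": "paradigma",
        "language": "ca",
        "domain": "hotel",
        "provider": "Paradigma ltd",
        "licence": "LRGPL",
        "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/",
    }
    common = {"queueName": "default"}  # hypothetical job.properties content
    props = build_flow_properties(resource, ["lexicon"], common, "/user/ubuntu/nuig-flows")
    print(to_oozie_configuration(props))
```

With these inputs, the graph loaded by the flow would be named by concatenating rsgraphprefix and rsgraphsufix0, as the workflow actions do with ${rsgraphprefix}${rsgraphsufix0}.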
Configuring Actions

Workflows usually contain the following sequence:
◦ Move the data to a place where it can be reached by the first processing component
◦ Processing by the first component
◦ Move the data to a place where it can be reached by the second processing component
◦ Processing by the second component
◦ ….
◦ Load into the Virtuoso triple store

Moving the resource file to the processing components

The following snippet shows an example configuration of the first step in the flow, which moves the resource file to a folder where it can be picked up by a processing component.

    <workflow-app xmlns="uri:oozie:workflow:0.3" name="deri-workflow">
      <start to="move-resource-file"/>
      <action name="move-resource-file" retry-max="2" retry-interval="1">
        <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
          <host>ubuntu@ptwf</host>
          <command>${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv</command>
          <capture-output/>
        </sshWithRetry>
        <ok to="lemon-marl-generator"/>
        <error to="fail"/>
      </action>

Configuring processing

The following XML snippet shows an example of processing by the Lemon Marl generator.

      <action name="lemon-marl-generator" retry-max="3" retry-interval="1">
        <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
          <host>ubuntu@ptnuig</host>
          <command>~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}</command>
          <capture-output/>
        </sshWithRetry>
        <ok to="move-file2virtuoso"/>
        <error to="fail"/>
      </action>

Moving data to Virtuoso Server

The following XML snippet shows an action that moves the output of the previous step to the Virtuoso server.

      <action name="move-file2virtuoso" retry-max="2" retry-interval="1">
        <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
          <host>ubuntu@ptnuig</host>
          <command>${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl</command>
          <capture-output/>
        </sshWithRetry>
        <ok to="load2virtuoso"/>
        <error to="fail"/>
      </action>

Load data to the Virtuoso Server

The following XML snippet shows an example configuration of the miniLoader component, which loads the processed resource files into the Virtuoso server.
      <action name="load2virtuoso" retry-max="2" retry-interval="10">
        <java>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.job.queue.name</name>
              <value>${queueName}</value>
            </property>
          </configuration>
          <main-class>com.sindice.miniloader.Miniloader</main-class>
          <arg>${rsvirtuosohost}</arg>
          <arg>${rsvirtuosojdbcport}</arg>
          <arg>${rsvirtuosojdbcuser}</arg>
          <arg>${rsvirtuosojdbcpasswd}</arg>
          <arg>${rsvirtuosoloadfolder}${rsresourceid}.ttl</arg>
          <arg>${rsgraphprefix}${rsgraphsufix0}</arg>
          <capture-output/>
        </java>
        <ok to="notify_rs" />
        <error to="fail" />
      </action>

Notifying the resource server

The last step notifies the RS that the data was loaded into the Virtuoso server.

      <action name="notify_rs">
        <shell xmlns="uri:oozie:shell-action:0.1">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <exec>curl</exec>
          <argument>-H</argument>
          <argument>Content-Type:application/json</argument>
          <argument>-X</argument>
          <argument>POST</argument>
          <argument>-d</argument>
          <argument>${wf:actionData('load2virtuoso')['miniloader_json4rs']}</argument>
          <argument>${rsprocessedurl}${rsresourceid}/processed</argument>
        </shell>
        <ok to="end" />
        <error to="fail" />
      </action>

Copy the configuration to the HDFS

The property "hdfs-folder-uri" in the RS configuration file conf.properties defines the path where the configuration should be stored. The resource type ID (paradigma) is part of the HDFS path, so it is first necessary to check whether the folder exists. If the folder for the given resource type does not exist yet, it must be created.

Next, copy the workflow and the required jars. In this case only the miniloader jar is required, and it should be copied to the lib subfolder.

    hadoop fs -put workflow.xml /user/ubuntu/nuig-flows/paradigma/
    hadoop fs -put ~/virtuoso-miniloader-0.0.1-SNAPSHOT.jar /user/ubuntu/nuig-flows/paradigma/lib

Processing Resources

Processing is started by an HTTP POST request to the RS with an empty body.
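The three RS calls described in this memo (creating a resource type, registering a resource, starting the processing) can be sketched with the Python standard library. This is a minimal sketch, not an official client: the helper names are hypothetical, and the functions only build the requests so that they can be inspected before anything is sent. The server address, paths and JSON field names are the ones quoted in the memo.

```python
import json
import urllib.request

RS_BASE_URL = "http://54.201.101.125:9999"  # resource server address from the memo


def rs_post(path, payload=None):
    """Build (but do not send) a POST request to the RS REST API."""
    if payload is None:
        # Empty body, as used to start the processing.
        return urllib.request.Request(RS_BASE_URL + path, data=b"", method="POST")
    return urllib.request.Request(
        RS_BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_resource_type_request(type_id, description, graph_suffixes):
    """POST /resourcestypes with the resource type definition."""
    return rs_post("/resourcestypes", {
        "id": type_id,
        "description": description,
        "graphsSuffixes": graph_suffixes,
    })


def register_resource_request(resource):
    """POST /resources with the resource description (a dict)."""
    return rs_post("/resources", resource)


def start_processing_request(resource_id):
    """POST /resources/<id>/processed with an empty body."""
    return rs_post(f"/resources/{resource_id}/processed")


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `send(start_processing_request("48957c5d-456c-4d7a-abc9-3062c91dafE0"))` would trigger the flow for the resource registered earlier, provided the RS is reachable from the machine running the script.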
The status of the processing can be monitored in the Oozie web console. Clicking the line of the running job opens the detail window.
When processing has finished, all steps should have the status OK.
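The same check can be scripted instead of being read off the web console. The sketch below assumes the job information is fetched from the Oozie web-services API (`GET <oozie-url>/v1/job/<job-id>?show=info`), whose JSON answer contains an `actions` list with a `name` and a `status` per action. The Oozie base URL and the job id are not given in this memo, so they are left as parameters, and the function names are hypothetical.

```python
import json
import urllib.request


def job_info_url(oozie_url, job_id):
    """URL of the job information resource of the Oozie web-services API."""
    return f"{oozie_url}/v1/job/{job_id}?show=info"


def fetch_job_info(oozie_url, job_id):
    """Download and parse the job information (needs access to the Oozie server)."""
    with urllib.request.urlopen(job_info_url(oozie_url, job_id)) as response:
        return json.load(response)


def actions_not_ok(job_info):
    """Return (name, status) pairs for every action whose status is not OK."""
    return [
        (action.get("name"), action.get("status"))
        for action in job_info.get("actions", [])
        if action.get("status") != "OK"
    ]
```

An empty list returned by `actions_not_ok(fetch_job_info(oozie_url, job_id))` corresponds to the situation described above, where every step of the finished flow shows OK.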
When a resource has been processed successfully, it is possible to make a SPARQL request to verify the content.

Appendix A: example of whole flow definition

    <workflow-app xmlns="uri:oozie:workflow:0.3" name="deri-workflow">
      <start to="move-resource-file"/>
      <action name="move-resource-file" retry-max="2" retry-interval="1">
        <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
          <host>ubuntu@ptwf</host>
          <command>${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv</command>
          <capture-output/>
        </sshWithRetry>
        <ok to="lemon-marl-generator"/>
        <error to="fail"/>
      </action>
      <action name="lemon-marl-generator" retry-max="3" retry-interval="1">
        <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
          <host>ubuntu@ptnuig</host>
          <command>~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}</command>
          <capture-output/>
        </sshWithRetry>
        <ok to="move-file2virtuoso"/>
        <error to="fail"/>
      </action>
      <action name="move-file2virtuoso" retry-max="2" retry-interval="1">
        <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
          <host>ubuntu@ptnuig</host>
          <command>${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl</command>
          <capture-output/>
        </sshWithRetry>
        <ok to="load2virtuoso"/>
        <error to="fail"/>
      </action>
      <action name="load2virtuoso" retry-max="2" retry-interval="10">
        <java>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.job.queue.name</name>
              <value>${queueName}</value>
            </property>
          </configuration>
          <main-class>com.sindice.miniloader.Miniloader</main-class>
          <arg>${rsvirtuosohost}</arg>
          <arg>${rsvirtuosojdbcport}</arg>
          <arg>${rsvirtuosojdbcuser}</arg>
          <arg>${rsvirtuosojdbcpasswd}</arg>
          <arg>${rsvirtuosoloadfolder}${rsresourceid}.ttl</arg>
          <arg>${rsgraphprefix}${rsgraphsufix0}</arg>
          <capture-output/>
        </java>
        <ok to="notify_rs" />
        <error to="fail" />
      </action>
      <action name="notify_rs">
        <shell xmlns="uri:oozie:shell-action:0.1">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <exec>curl</exec>
          <argument>-H</argument>
          <argument>Content-Type:application/json</argument>
          <argument>-X</argument>
          <argument>POST</argument>
          <argument>-d</argument>
          <argument>${wf:actionData('load2virtuoso')['miniloader_json4rs']}</argument>
          <argument>${rsprocessedurl}${rsresourceid}/processed</argument>
        </shell>
        <ok to="dir4processed_file" />
        <error to="fail" />
      </action>
      <action name="dir4processed_file">
        <shell xmlns="uri:oozie:shell-action:0.1">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <exec>mkdir</exec>
          <argument>${rspfilesdir}/${rsresourceid}</argument>
        </shell>
        <ok to="move_processed_file" />
        <error to="fail" />
      </action>
      <action name="move_processed_file">
        <shell xmlns="uri:oozie:shell-action:0.1">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <exec>mv</exec>
          <argument>${rsvirtuosoloadfolder}${rsresourceid}.ttl</argument>
          <argument>${rspfilesdir}/${rsresourceid}</argument>
        </shell>
        <ok to="end" />
        <error to="fail" />
      </action>
      <kill name="fail">
        <message>SSH action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>
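Because every action names its successor in its ok and error elements, a mistyped node name only shows up when the flow is run. A small pre-flight check, run before the workflow file is copied to HDFS, can catch such mistakes. The following is a sketch under assumptions: the function name and the shortened sample workflow are hypothetical, and the check only understands the node kinds used in this memo (start, action, kill, end), not decision, fork or join nodes.

```python
import xml.etree.ElementTree as ET

SAMPLE_WORKFLOW = """
<workflow-app xmlns="uri:oozie:workflow:0.3" name="sample">
  <start to="notify_rs"/>
  <action name="notify_rs">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <exec>curl</exec>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def dangling_transitions(workflow_xml):
    """Return (source, target) pairs whose target is not a node defined in the workflow."""
    root = ET.fromstring(workflow_xml.strip())
    defined = set()
    transitions = []
    for node in root:
        kind = local_name(node.tag)
        if kind == "start":
            transitions.append(("start", node.get("to")))
        elif kind in ("action", "kill", "end"):
            defined.add(node.get("name"))
            if kind == "action":
                # <ok> and <error> are direct children of <action>.
                for child in node:
                    if local_name(child.tag) in ("ok", "error"):
                        transitions.append((node.get("name"), child.get("to")))
    return [(source, target) for source, target in transitions if target not in defined]


if __name__ == "__main__":
    print(dangling_transitions(SAMPLE_WORKFLOW))
```

The function takes the workflow as a string, so the contents of workflow.xml can be read from disk and passed in directly. An empty result means every transition points at a defined node.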
