RDF and other linked data standards: how to make use of big localization data

Dave Lewis
FEISGILTT Vancouver, 29th Oct 2014

Problem – Data Management
•  Language assets (TM and TB) scattered across the localisation pipeline,
   –  different versions,
   –  different stages of quality,
   –  in different ‘silos’
   –  challenging to pool & clean these resources
•  Difficult to share and search language resources within and across organisations
   –  impacts on consistency and cost & quality of translation
•  Merging and extending language resources is complex
   –  makes leveraging new resources extremely difficult
  
Why RDF/Linked Data?
•  Creates relationships (links) between data, i.e. linked data
•  Easier to integrate and leverage language resources regardless of their format, where they are stored or who owns them
   –  saves money
•  Easier to search & analyse language resources
   –  saves time in finding the most suitable resources for your projects
•  Enriches language resources with additional meaning
   –  allows them to be better used
•  Easy tracking of provenance
   –  helps manage versioning
  
Linked Data for Language Technology
•  Open Data on the Web: W3C Semantic Web standards for data published on the Web
   –  Fine-grained inter-linking of data “cells”: URL
   –  Extensible meta-data: Resource Description Framework (RDF)
   –  Standard query API: SPARQL
•  LIDER Project:
   –  Stakeholder data needs for LT and language resources
   –  Best practices and guidelines to apply linked data
•  Existing open data vocabularies
   –  Lexical-conceptual data: LEMON vocabulary
   –  Encyclopedic: DBpedia
   –  Resource meta-data, licensing, annotation, provenance etc.
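
The three building blocks above (URLs as identifiers, RDF statements as extensible meta-data, and pattern-based querying in the style of SPARQL) can be illustrated with a minimal sketch in plain Python. This is not an RDF library and not SPARQL itself; the `example.org` identifiers and the `language` / `translationOf` property names are assumptions made up for the illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical example.org identifiers, used only for illustration.
EX = "http://example.org/"

triples = [
    (EX + "tm/segment1", EX + "vocab/language", "en"),
    (EX + "tm/segment2", EX + "vocab/language", "es"),
    (EX + "tm/segment2", EX + "vocab/translationOf", EX + "tm/segment1"),
]


def match(store: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None plays the role of a query variable."""
    for triple in store:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


# "Which resources are translations of segment1?"
translations = [subj for subj, _, _ in
                match(triples, p=EX + "vocab/translationOf", o=EX + "tm/segment1")]
print(translations)
```

Because every statement has the same three-part shape, new kinds of meta-data (licensing, provenance, quality) can be added as further triples without changing the query function.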

Linguistic Linked Data: Lexicons
[Diagram: linked lexicon entries for the Spanish word “red”. Recoverable labels: Form (singular [RED], plural [REDES]) with phonetic form and number; Form with gender “feminine”; Sense; written form “red”, “malla” (equivalent), “network”; translation es - en; image.]
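
The kind of lexicon entry shown in the diagram can be sketched as a small data structure. This is only an illustration under assumptions: the class and field names below are invented for the example and are not terms from the LEMON vocabulary, and the plural written form “redes” is supplied from general knowledge of Spanish rather than read off the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    written_form: str
    number: str  # "singular" or "plural"


@dataclass
class Sense:
    # Target language code -> written form of the translation.
    translations: Dict[str, str] = field(default_factory=dict)
    # Same-language equivalents of this sense.
    equivalents: List[str] = field(default_factory=list)


@dataclass
class LexicalEntry:
    lemma: str
    language: str
    gender: str
    forms: List[Form]
    senses: List[Sense]


# The Spanish entry "red" with the properties named in the diagram.
red = LexicalEntry(
    lemma="red",
    language="es",
    gender="feminine",
    forms=[Form("red", "singular"), Form("redes", "plural")],
    senses=[Sense(translations={"en": "network"}, equivalents=["malla"])],
)


def translate(entry: LexicalEntry, target_lang: str) -> List[str]:
    """Collect the translations of every sense that has one in target_lang."""
    return [s.translations[target_lang]
            for s in entry.senses if target_lang in s.translations]


print(translate(red, "en"))
```

Attaching translations to senses rather than to the word itself is what lets one written form carry several meanings, each linked to different equivalents.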
  
Use Case: BabelNet.org

Data Challenges in Using LT
•  Language Technology is statistical
•  Quality is limited by the distance between the training data and the job at hand
•  Training data is the key asset for LT
   –  e.g. for L10n it is Translation Memories and Term Bases
•  Challenges for managing training data
   –  Discover
   –  Select
   –  Curate
   –  Share/Pool/Sell
   –  Understanding quality
   –  Measure impact on productivity
  
Active Curation: Managing LT/LR Lifecycle
[Diagram labels: Language Workers; Language Technology; Language Resources; Active Curation]
  
Use Case: FALCON Project
•  Tool Chain
   –  Website translation
   –  Translation Management
   –  Terminology Management
•  Language Technology
   –  Machine Translation
   –  Term Identification
•  Linked Data
   –  Parallel Text
   –  Terms: Lexical-conceptual
[Diagram label: XLIFF + ITS 2.0]
•  Building the Localization Web = Decentralised Annotated Global Translation Memory and Term Base
•  Terms and translations become linkable resources
•  Meta-data from the L10n tool chain adds value
•  Use in training Machine Translation and Text Analytics

FALCON Demo: Locworld Expo
[Diagram labels: Web Site Translation; Translation Management System; Terminology Management System; Machine Translation; Federated L3Data Platform; Translation Management; Text Analysis; Localization Tool Chain; Language Technology; Language Resources; Public Resources; DCU; TCD]
  
Words as Resources on the Web

The Web of Content
•  http://www.ex.org/obama_en.html
   –  “Barack Obama is the 44th president of the United States of America. He was first elected in 2008.”
•  http://www.ex.org/obama_es.html
   –  “Barack Obama si el 44 º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2008.”

The Localization Web

Translation Data
•  http://data.ex.org/String_0001 (Derived From http://www.ex.org/obama_en.html)
   –  Text: “Barack Obama is the 44th president of the United States of America.”
   –  Lang: en
•  http://data.ex.org/String_0002 (Derived From http://www.ex.org/obama_es.html; Translated From String_0001)
   –  Text: “Barack Obama es el 44 º presidente de los Estados Unidos de América.”
   –  Lang: es
   –  TranslatedBy: Google Translate

Terminology Data
•  Term: “United States of America.” Lang: en
•  Term: “Estados Unidos de América.” Lang: es
•  Relation between the two terms: Translation Of
•  Linked resources: http://babelnet.org/345621, http://babelnet.org/57835

Encyclopaedic Data
•  http://Dbpedia.org/Page/Barak_Obama
   –  Topic: Barack Obama
   –  Lang: en
   –  BirthDate: 1961-08-04
   –  Spouse: Michelle Obama
   –  Residence: White House
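
The translation data on this slide can be written down as RDF-style statements. The sketch below serialises them in N-Triples syntax using plain Python string handling. The string URLs and relation names (Derived From, Translated From, TranslatedBy) come from the slide, but the `http://data.ex.org/vocab#` namespace and the exact property spellings are assumptions, since the slide does not give URIs for the relations.

```python
# Hypothetical vocabulary namespace: the slide names the relations, not their URIs.
V = "http://data.ex.org/vocab#"
S1 = "http://data.ex.org/String_0001"
S2 = "http://data.ex.org/String_0002"


def uri(u):
    """Wrap an identifier in N-Triples angle brackets."""
    return "<" + u + ">"


def lit(text, lang=None):
    """Format a literal in N-Triples syntax, optionally language-tagged."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"' + ("@" + lang if lang else "")


triples = [
    (uri(S1), uri(V + "text"),
     lit("Barack Obama is the 44th president of the United States of America.", "en")),
    (uri(S1), uri(V + "derivedFrom"), uri("http://www.ex.org/obama_en.html")),
    (uri(S2), uri(V + "text"),
     lit("Barack Obama es el 44 º presidente de los Estados Unidos de América.", "es")),
    (uri(S2), uri(V + "derivedFrom"), uri("http://www.ex.org/obama_es.html")),
    (uri(S2), uri(V + "translatedFrom"), uri(S1)),
    (uri(S2), uri(V + "translatedBy"), lit("Google Translate")),
]


def to_ntriples(ts):
    """Serialise triples, one statement per line, each terminated by ' .'."""
    return "\n".join(s + " " + p + " " + o + " ." for s, p, o in ts)


def objects(ts, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in ts if s == subject and p == predicate]


print(to_ntriples(triples))
# Follow the provenance link from the Spanish string back to its source string.
print(objects(triples, uri(S2), uri(V + "translatedFrom")))
```

Keeping the machine translation system as a statement about the target string is what makes it possible later to filter training data by how each translation was produced.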
  
L10n Use Case: Closing the Loop
•  Active Curation: systematic harvesting of LT-ready TM and TB from the localization tool chain
•  Data and tools to optimise process flow:
   –  Prioritize segments for postediting and input to incremental MT retraining
   –  Target postedits to extract target terms and new morphologies
•  Postediting instrumentation:
   –  Postedit time and resource use (terms, concordance) vs. automation of MT metrics
   –  iOmegaT: instrumented open source CAT tool
•  LREC, AMTA, EDF, MLW, LocFoc, Multilingual, EdMedia, FEISGILTT

Research and Innovation Roadmap
•  https://www.w3.org/community/ld4lt/wiki/Linguistic_Linked_Data_for_Content_Analytics:_a_Roadmap
[Diagram labels: Global Customer Engagement Use Cases; Public Sector and Civil Society Use Cases; Linguistic Linked Data Life Cycle and Value Network Requirements]
  
Best Practices for Multilingual Linked Open Data
•  Linguistic vocabularies
•  Resource-specific vocabularies
•  Best Practices for Multilingual Linked Data
   –  Practices for naming
   –  Practices for dereferencing
   –  Practices for textual information
   –  Practices for linking
   –  Identification of languages
   –  DataID
   –  OWL Metamodel for Language Resources
   –  License Ontology
•  Guidelines for Converting WordNets to Linked Data
•  Guidelines for Linguistic Linked Data Generation: Multilingual Knowledge Bases
•  Guidelines for Linguistic Linked Data Generation: Bilingual Dictionaries
•  Guidelines for Converting TBX into Linked Data
•  Guidelines for NIF-based NLP Services
•  Comparison of Repositories

Dublin Workshop Session
•  There is a need for a common API to text analysis services, live update of linked data sources, user feedback mechanisms, and annotation relevance indicators.
•  "Too much information is no information": linked data information can help the translator only if it does not lead to an information overflow.
•  A stand-off annotation mechanism is needed to deal with annotation overlap. NIF could be a solution.
•  For the localization industry, licensing metadata is of key importance. Only with such metadata can one also work with internal (i.e. closed) linked data.
•  Terminology and linked data is a hot topic, also discussed in the LD4LT group. Currently there is no standard mapping of the TBX format to RDF.
•  Bitext (i.e. aligned text of a source and one or several translations) could be exposed as linked data, as an alternative to TMX.
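
The point about stand-off annotation and overlap can be made concrete with a short sketch. Annotations are stored apart from the text as character offsets, so two annotations may cross each other, which inline markup cannot express. The `#char=begin,end` fragment follows the offset-based URI style used by NIF; the class, function names and labels here are assumptions for illustration only, not NIF's own API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    begin: int  # character offset where the span starts (inclusive)
    end: int    # character offset where the span ends (exclusive)
    label: str  # hypothetical annotation type, e.g. "term"

    def uri(self, doc_uri: str) -> str:
        """Identify the span with an offset-based fragment on the document URI."""
        return f"{doc_uri}#char={self.begin},{self.end}"


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two spans share at least one character."""
    return a.begin < b.end and b.begin < a.end


def annotate(text: str, phrase: str, label: str) -> Annotation:
    """Create an annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), label)


TEXT = "Barack Obama is the 44th president of the United States of America."
DOC = "http://data.ex.org/String_0001"

person = annotate(TEXT, "Barack Obama", "entity")
country = annotate(TEXT, "United States of America", "term")
title = annotate(TEXT, "president of the United States", "phrase")

print(person.uri(DOC))
# The term and the phrase cross each other; neither is nested in the other.
print(overlaps(country, title), overlaps(person, country))
```

Because the text itself is never modified, any number of tools can add their own annotations to the same string without interfering with one another.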

More Information
•  Contact: dave.lewis@cs.tcd.ie
   –  http://www.falcon-project.eu
•  LIDER: best practices and roadmap
   –  http://www.lider-project.eu/
•  See also:
   –  Linked Data for Language Technology (LD4LT) W3C Community Group
      •  http://www.w3.org/community/ld4lt/
   –  Best Practice in Multilingual Linked Open Data
      •  http://www.w3.org/community/bpmlod/
   –  OntoLex Community Group
      •  http://www.w3.org/community/ontolex/
  
