    Presentation Transcript

    • Contributions for building a Corpora-Flow system. André Santos, andrefs@cpan.org. Informatics Engineering MSc, University of Minho, December 2011
    • Concepts
        • Aligned parallel corpus: set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, ...) from each text.
        • Corpora-flow: adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus.
        • This presentation and the underlying master thesis describe the implementation of several tools to be used in typical corpus building activities.
    • Context
        • The work developed for this master thesis was motivated and supported by Project Per-fide, an ongoing project at the University of Minho which aims to build large parallel corpora between Portuguese and six other languages.
    • Corpora building challenges
        • file format and format conversion
        • finding duplicated files
        • text encoding format
        • structural residues
        • section delimiters
        • unpaired sections (parallel corpora)
        • ...
    • Corpora building challenges
        • Severe problems which often lead to bad results
        • Many (most?) of them are hard/impossible to solve completely
        • Find the problem and report it when it is not solvable automatically
        • Provide intelligent ways of describing what was found and done
    • 5 key issues
        • Book cleaning
        • Duplicates and candidate pairs detection
        • Book synchronization
        • Alignment evaluation
        • Corpora-flow system
    • Book processing problems – Motivation
        • Raw excerpt from La Jangada, Jules Verne:
            (...) d<92>entrée, donnant accès dans la salle commune.
            Une légère véranda, qui en proté-
            M <96> 86 <96>
            ^L
            geait la partie antérieure contre l<92>action
            des rayons solaires, reposait sur de sveltes bambous. (...)
        • Problems in the excerpt:
            • <92>: right single quotation mark (CP1252)
            • <96>: en dash (CP1252)
            • ^L: page break (0xC)
            • proté-(...)geait: transpagination
        • After cleaning:
            (...) d’entrée, donnant accès dans la salle commune.
            Une légère véranda, qui en protégeait _pb1_
            la partie antérieure contre l’action
            des rayons solaires, reposait sur de sveltes bambous. (...)
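The cleaning step illustrated by the Jules Verne excerpt can be sketched roughly as follows. This is a minimal illustration in Python, not the code of Text::Perfide::BookCleaner: the function name `clean_page_breaks` and the regular expression are hypothetical, and the sketch assumes that a transpagination always looks like "hyphen at end of line, one header or page-number line, form feed, rest of the word".

```python
import re

# Assumed layout of a transpagination (an assumption, not taken from the slides):
# a word hyphenated at the end of a line, one header/page-number line,
# a form feed (0x0C), then the rest of the word.
# The second alternative catches any other form feed.
PAGE_BREAK = re.compile(r"-\n[^\n\x0c]*\n\x0c(\w+) ?|\x0c")


def clean_page_breaks(raw: bytes) -> str:
    """Decode CP1252 bytes, rejoin words split across pages, mark page breaks."""
    # Decoding turns byte 0x92 into a right single quotation mark
    # and byte 0x96 into an en dash.
    text = raw.decode("cp1252")
    count = 0

    def replace(match):
        nonlocal count
        count += 1
        tail = match.group(1) or ""  # rest of the split word, if any
        return f"{tail} _pb{count}_\n"

    return PAGE_BREAK.sub(replace, text)
```

Numbering the `_pbN_` marks keeps the original page boundaries recoverable after the page-number lines are removed.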
    • Book cleaning
        • Subdivided into several steps [figure: diagram of the cleaning steps]
    • Sections ontology
        • contains common section types
        • used to automatically generate the code to recognize section delimiters
        • allows discussion/cooperation with people with no programming knowledge
        • code becomes more simple and clean
        • Example entries:
            chap
                PT capítulo, cap, capitulo
                FR chapitre, chap
                EN chapter, chap
                NT sec
            end
                PT fim
                FR fin
                EN the_end
                BT _alone
            scene
                PT cena
                FR scène
                EN scene
                RU глава
                BT act
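As an illustration of how such an ontology could drive the generation of recognizer code, the sketch below builds one regular expression per language from a small dictionary. It is written in Python and is only an assumption about the general idea: the `ONTOLOGY` structure, the function names, the reading of `_` as whitespace, and the handling of section numbers are all hypothetical, and the NT/BT relations are left out.

```python
import re

# Hypothetical in-memory form of a few ontology entries.
ONTOLOGY = {
    "chap": {"PT": ["capítulo", "cap", "capitulo"],
             "FR": ["chapitre", "chap"],
             "EN": ["chapter", "chap"]},
    "end": {"PT": ["fim"], "FR": ["fin"], "EN": ["the_end"]},
    "scene": {"PT": ["cena"], "FR": ["scène"], "EN": ["scene"]},
}


def build_delimiter_regex(ontology, lang):
    """Generate one regex that recognizes section delimiter lines for a language."""
    alternatives = []
    for section_type, terms_by_lang in ontology.items():
        # Longest terms first, so "capítulo" is tried before "cap".
        terms = sorted(terms_by_lang.get(lang, []), key=len, reverse=True)
        if not terms:
            continue
        # Assumption: "_" in a term stands for whitespace ("the_end" -> "the end").
        words = "|".join(
            r"\s+".join(re.escape(part) for part in term.split("_"))
            for term in terms
        )
        alternatives.append(f"(?P<{section_type}>{words})")
    # Optional trailing section number (arabic, or upper-case roman numerals).
    return re.compile(
        r"\s*(?:" + "|".join(alternatives) + r")\b\.?\s*(?-i:[0-9IVXLC]*)\s*",
        re.IGNORECASE,
    )


def section_type(line, regex):
    """Return the section type named by a delimiter line, or None."""
    match = regex.fullmatch(line)
    if match is None:
        return None
    return next(name for name, value in match.groupdict().items() if value is not None)
```

Keeping the vocabulary in data rather than in hand-written patterns is what lets people without programming knowledge review and extend it.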
    • Duplicates and pairs detection
        • Motivation
            • Duplicates can result in a biased corpus
            • Finding candidate pairs for alignment
        • Language independent elements (LIEs): terms which are usually kept untranslated
            • year references: “1973”
            • proper names: “Hamlet”
        • Measuring similarity
            • $\mathrm{similarity}(A, B) = \dfrac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|}$
        • Thresholds
            • < 0.2: unrelated
            • > 0.4: pair
            • > 0.9: duplicates
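The similarity measure and thresholds above amount to a Jaccard index over the two sets of LIEs. The Python sketch below shows one way this might be computed. It does not reproduce Text::Perfide::BookPairs: `extract_lies` is a deliberately naive, hypothetical heuristic (four-digit numbers and capitalized ASCII words that do not start a sentence), and the label "undecided" for scores between 0.2 and 0.4 is an assumption, since the slide does not name that range.

```python
import re


def extract_lies(text):
    """Naive LIE extraction: 4-digit numbers and mid-sentence capitalized words."""
    years = set(re.findall(r"\b\d{4}\b", text))
    names = set(re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", text))
    return years | names


def similarity(lies_a, lies_b):
    """Jaccard index: size of the intersection over size of the union."""
    union = lies_a | lies_b
    if not union:
        return 0.0
    return len(lies_a & lies_b) / len(union)


def classify(score):
    """Apply the thresholds from the slide."""
    if score > 0.9:
        return "duplicates"
    if score > 0.4:
        return "pair"
    if score < 0.2:
        return "unrelated"
    return "undecided"  # assumption: the slide leaves this range unnamed
```

For example, an English and a Portuguese sentence that share "1973" and "Hamlet" but differ in one place name each have two of four LIEs in common, which gives a score of 0.5 and the label "pair".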
    • Book synchronization
        • Definition: structural alignment at section level, based on previously added section delimiting marks.
        • Motivation
            • Some aligners cannot handle large documents
            • Section delimiters can act as anchor points
            • Unpaired sections can be discarded
        • Implementation
            • match similar section delimiters
            • synchronization points
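One simple way to obtain synchronization points from two sequences of section delimiters is to look for their common subsequence. The sketch below uses Python's standard `difflib.SequenceMatcher` for that. This is a stand-in technique chosen for illustration, not the algorithm of Text::Perfide::BookSync, and the representation of a delimiter as a `(type, number)` tuple is an assumption.

```python
from difflib import SequenceMatcher


def synchronization_points(sections_a, sections_b):
    """Return index pairs (i, j) where section i of A matches section j of B."""
    matcher = SequenceMatcher(a=sections_a, b=sections_b, autojunk=False)
    points = []
    for block in matcher.get_matching_blocks():
        # Each block describes `size` consecutive matching sections.
        for offset in range(block.size):
            points.append((block.a + offset, block.b + offset))
    return points


def unpaired(sections, points, side):
    """Indices of sections with no counterpart (side 0 for A, side 1 for B)."""
    paired = {point[side] for point in points}
    return [i for i in range(len(sections)) if i not in paired]
```

Sections that receive no synchronization point are the unpaired ones, which can then be reported or discarded before the chunks are passed to an aligner.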
    • Output
        • pair of files with synchronization marks
        • pair of files divided into smaller pairs of chunks
        • text report
        • synchronization matrix
    • Alignment evaluation
        • Motivation
            • compare alignments of the same documents (performed by different tools, with different options, ...)
            • determine if an alignment was successful
        • Comparing alignments
            • parse TMX files and output the total number of correspondences of each type: 0:1/1:0, 1:1, 2:1/1:2 and 2:2
            • evaluate the other tools developed
            • compare the performance of the available alignment tools
    • Alignment evaluation
        • Determine if an alignment was successful
            • Summarize a TMX by sampling.
            • Sampling can be performed based on:
                • number of samples desired
                • explicit sampling points
                • translation units which match a given regular expression
            • Output is a (much?) smaller TMX file
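The three sampling modes listed above can be sketched with Python's standard XML parser. This is only an illustration under assumptions: `sample_tmx` and its parameters are hypothetical names rather than the interface of Text::Perfide::TMX::Utils, the TMX is assumed to be small enough to parse in memory, and "number of samples" is read here as evenly spaced translation units.

```python
import re
import xml.etree.ElementTree as ET


def sample_tmx(tmx_text, n=None, points=None, pattern=None):
    """Return a smaller TMX string keeping only the sampled translation units."""
    root = ET.fromstring(tmx_text)
    body = root.find("body")
    units = list(body.findall("tu"))

    if points is not None:
        # Explicit sampling points: zero-based indices of translation units.
        keep = {i for i in points if 0 <= i < len(units)}
    elif pattern is not None:
        # Keep units where any segment matches the regular expression.
        regex = re.compile(pattern)
        keep = {
            i for i, tu in enumerate(units)
            if any(regex.search("".join(seg.itertext())) for seg in tu.iter("seg"))
        }
    elif n:
        # Evenly spaced samples across the document.
        step = max(1, len(units) // n)
        keep = set(range(0, len(units), step)[:n])
    else:
        keep = set(range(len(units)))

    for i, tu in enumerate(units):
        if i not in keep:
            body.remove(tu)
    return ET.tostring(root, encoding="unicode")
```

Sampling by regular expression is handy for spot checks such as pulling every unit that mentions a given proper name, which a human can then inspect in both languages.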
    • Alignment evaluation
        • Example from The Name of the Rose, Umberto Eco: Adson (DE) = Адсо (RU)
    • Distribution
        • All the tools implemented as Perl modules:
            • Text::Perfide::BookCleaner
            • Text::Perfide::BookPairs
            • Text::Perfide::BookSync
            • Text::Perfide::TMX::Utils
        • publicly available on CPAN
        • including tests and documentation
        • additional effort required to make code installable and usable by other people
    • Corpora-flow
        • Motivation
            • building a corpus is a complex task
            • linear pipeline is not powerful enough
        • Workflow
            • states
            • actions
            • conditions
            • context
        • Makefiles
            • file-oriented
            • timestamps and dependencies
            • fail-fast and resumable execution
            • parallelization
    • Corpora-flow
        • workflow + Makefiles = corpora-flow DSL (→ Slay::Makefile)
        • Grammar:
            workflow: rule*
            rule: pre-condition* action post-condition*
            action: targets dependencies function
            condition: filename function
            target: pattern*
            dependencies: pattern*
            function: Perl code
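To make the grammar more concrete, the sketch below models a rule as an action with pre-conditions and post-conditions and runs rules in a Makefile-like way, rebuilding a target only when it is missing or older than a dependency. It is a Python illustration under several assumptions, not the corpora-flow DSL or Slay::Makefile: the `Rule` class and `run_workflow` are hypothetical, patterns are replaced by literal file names, functions are Python callables instead of Perl code, pre-conditions are checked on dependencies and post-conditions on the target, and rules run in the order given rather than by dependency resolution.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rule:
    target: str
    dependencies: List[str]
    action: Callable[[str, List[str]], None]
    pre_conditions: List[Callable[[str], bool]] = field(default_factory=list)
    post_conditions: List[Callable[[str], bool]] = field(default_factory=list)


def is_stale(rule):
    """Makefile-style check: target missing or older than any dependency."""
    if not os.path.exists(rule.target):
        return True
    target_time = os.path.getmtime(rule.target)
    return any(os.path.getmtime(dep) > target_time for dep in rule.dependencies)


def run_workflow(rules):
    """Run stale rules in order; stop at the first failed condition (fail-fast)."""
    executed = []
    for rule in rules:
        if not is_stale(rule):
            continue  # up to date, which makes the workflow resumable
        for check in rule.pre_conditions:
            if not all(check(dep) for dep in rule.dependencies):
                raise RuntimeError(f"pre-condition failed for {rule.target}")
        rule.action(rule.target, rule.dependencies)
        for check in rule.post_conditions:
            if not check(rule.target):
                raise RuntimeError(f"post-condition failed for {rule.target}")
        executed.append(rule.target)
    return executed
```

Because up-to-date targets are skipped, a run that failed halfway can simply be started again and will resume at the first stale rule.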
    • Conclusions
        • Evaluation of the tools has shown that they do help to solve problems
        • Most of the methods devised can be applied in other contexts
        • Working within a larger project:
            • provides requirements and resources
            • specific needs and priorities
        • Making code available to other people:
            • requires additional effort
            • gives meaning to the work
            • external contributions
        • Higher level objects help to organize and discuss
    • Future work
        • Document cleaners
            • other types of documents (e.g. scientific articles)
            • algorithm for finding section delimiters with notion of hierarchy
            • create ebooks/bilingual books
        • Duplicates and pair detection
            • list of correspondences (e.g. Adson → Адсо, London → Londres)
            • calculate best threshold values in real time
    • Future work
        • Document synchronization
            • interactive mode
            • improvements on synchronization matrix and metrics
            • hierarchical sections
            • other section alignment algorithms
        • Corpora-flow
            • finish specification and implementation
            • implement a corpora-flow for Project Per-fide