Contributions for building a
  Corpora-Flow system

         André Santos
      andrefs@cpan.org


       Informatics Engineering MSc
            University of Minho




          December 2011
Concepts
    Aligned parallel corpus: Set of parallel texts in
             which correspondences have been marked
             between blocks (paragraphs, sentences,
             words, . . . ) from each text.
    Corpora-flow: Adaptation of the concept of
             workflow to the several tasks, decisions
             and sequences of steps involved in the
             process of building a corpus.




    This presentation and the underlying master's thesis
    describe the implementation of several tools to be
    used in typical corpus-building activities.
Context


    The work developed in the context of this master's
    thesis was motivated and supported by
    Project Per-fide, an ongoing project at the
    University of Minho which aims to build large
    parallel corpora between Portuguese and six other
    languages.




Corpora building challenges


     file format and format conversion
     finding duplicated files
     text encoding format
     structural residues
     section delimiters
     unpaired sections (parallel corpora)
     ...



Corpora building challenges


     Severe problems which often lead to bad results
     Many (most?) of them are hard/impossible to
     solve completely
     Find the problem and report it when it is not
     solvable automatically
     Provide intelligent ways of describing what was
     found and done



5 key issues


      Book cleaning
      Duplicates and candidate pairs detection
      Book synchronization
      Alignment evaluation
      Corpora-flow system



Book processing problems – Motivation
    (...) d <92>’ entrée, donnant accès dans la salle commune.
    Une légère véranda, qui en proté-
    M
                          <96>- 86 <96>-
     ^L geait la partie antérieure contre l <92>’ action
    des rayons solaires, reposait sur de sveltes bambous. (...)
                                           La Jangada, Jules Verne




                  <92>’ : right single quotation mark (CP1252)
                  <96>- : en dash (CP1252)
                      ^L : page break (0xC)

        proté-(...)geait : transpagination


    (...) d ’ entrée, donnant accès dans la salle commune.
    Une légère véranda, qui en protégeait _pb1_
    la partie antérieure contre l ’ action
    des rayons solaires, reposait sur de sveltes bambous. (...)

Book cleaning
    Subdivided into several steps:




Sections ontology
    contains common section types
    used to automatically generate the code to recognize section
    delimiters
    allows discussion/cooperation with people with no
    programming knowledge
    code becomes simpler and cleaner

    Example entries:
        chap
            PT capítulo, cap, capitulo
            FR chapitre, chap
            EN chapter, chap
            NT sec

        end
            PT fim
            FR fin
            EN the_end
            BT _alone

        scene
            PT cena
            FR scène
            EN scene
            RU глава
            BT act


Duplicates and pairs detection
    Motivation
        Duplicates can result in a biased corpus
        Finding candidate pairs for alignment

    Language independent elements (LIEs)
        terms which are usually kept untranslated
               year references – “1973”
               proper names – “Hamlet”

    Measuring similarity
        $\mathrm{similarity}(A, B) = \dfrac{|A_{LIEs} \cap B_{LIEs}|}{|A_{LIEs} \cup B_{LIEs}|}$

    Thresholds
        < 0.2: unrelated
        > 0.4: pair
        > 0.9: duplicates
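
The similarity measure is the Jaccard index over the two sets of LIEs, which can be sketched as follows. The extraction heuristic (four-digit numbers and capitalized ASCII words) and the names `extract_lies`, `similarity` and `classify` are assumptions for this example; the slide does not label scores between 0.2 and 0.4, so the sketch reports them as "undecided".

```python
import re


def extract_lies(text):
    """Collect candidate LIEs: 4-digit numbers and capitalized words.

    Assumed heuristic for illustration only; it also picks up ordinary
    words that happen to start a sentence.
    """
    years = re.findall(r"\b\d{4}\b", text)
    names = re.findall(r"\b[A-Z][a-z]+\b", text)
    return set(years) | set(names)


def similarity(text_a, text_b):
    """Jaccard index of the two LIE sets (0.0 when both are empty)."""
    lies_a, lies_b = extract_lies(text_a), extract_lies(text_b)
    union = lies_a | lies_b
    return len(lies_a & lies_b) / len(union) if union else 0.0


def classify(score):
    """Apply the thresholds listed on the slide."""
    if score > 0.9:
        return "duplicates"
    if score > 0.4:
        return "pair"
    if score < 0.2:
        return "unrelated"
    return "undecided"  # 0.2 to 0.4 is not labelled on the slide
```

For example, "Hamlet was staged in 1973 in London." and "Hamlet foi encenado em 1973 em Londres." share two LIEs out of four distinct ones, giving a score of 0.5 and the label "pair".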

Book synchronization
     Definition
     Structural alignment at section level, based on
     previously added section delimiting marks.

     Motivation
         Some aligners cannot handle large documents
         Section delimiters can act as anchor points
         Unpaired sections can be discarded

     Implementation
         match similar section delimiters
         synchronization points
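
One simple way to obtain synchronization points from two sequences of section delimiters is sketched below. It swaps in the longest-matching-block comparison from Python's standard `difflib` module, which is not necessarily the matching algorithm used by the synchronization tool, and the section labels are made up for the example.

```python
from difflib import SequenceMatcher


def synchronization_points(sections_a, sections_b):
    """Return (index_a, index_b) pairs of sections that match in order."""
    matcher = SequenceMatcher(a=sections_a, b=sections_b, autojunk=False)
    points = []
    for block in matcher.get_matching_blocks():
        # The final block always has size 0 and contributes nothing.
        for offset in range(block.size):
            points.append((block.a + offset, block.b + offset))
    return points


def unpaired(sections, paired_indexes):
    """Indexes of sections that have no counterpart in the other file."""
    return [i for i in range(len(sections)) if i not in set(paired_indexes)]
```

Sections between two consecutive points can then be handed to an aligner as a smaller pair of chunks, and sections that appear in no point are the unpaired ones.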

Output


     pair of files with synchronization marks
     pair of files divided into smaller pairs of chunks
     text report
     synchronization matrix


Alignment evaluation
     Motivation
         compare alignments of the same documents
         (performed by different tools, with different options, . . . )
         determine if an alignment was successful




     Comparing alignments
         parse TMX files and output the total number of
         correspondences of each type:
         0:1/1:0, 1:1, 2:1/1:2 and 2:2
         evaluate the other tools developed
         compare the performance of the available
         alignment tools
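
Counting correspondence types can be sketched as below. The sketch assumes the alignment has already been read into a list of (source sentences, target sentences) pairs; how those sentence groups are recovered from a particular TMX file is not covered here. Mirror-image types are merged, as on the slide, and the label format is a choice made for this example.

```python
from collections import Counter


def correspondence_counts(alignment):
    """Count n:m correspondence types in a list of sentence-group pairs."""
    counts = Counter()
    for source, target in alignment:
        low, high = sorted((len(source), len(target)))
        if low == high:
            label = f"{low}:{high}"
        else:
            # Merge mirror images, e.g. 1:2 and 2:1 share one label.
            label = f"{low}:{high}/{high}:{low}"
        counts[label] += 1
    return counts
```

Comparing the resulting counters for two alignments of the same documents gives a quick view of how differently two tools, or two option sets, grouped the sentences.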
Alignment evaluation


     Determine if an alignment was successful
         Summarize a TMX by sampling. Sampling can
         be performed based on:
             number of samples desired
             explicit sampling points
             translation units which match a given regular
             expression
         Output is a (much?) smaller TMX file
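
The three sampling modes can be sketched with Python's standard XML library. This is an illustration only, not the interface of Text::Perfide::TMX::Utils: the function name `sample_tmx` and its parameters are hypothetical, the whole file is loaded in memory, and "number of samples" is interpreted here as evenly spaced translation units.

```python
import re
import xml.etree.ElementTree as ET


def sample_tmx(tmx_text, count=None, points=None, pattern=None):
    """Return a smaller TMX document holding only the sampled <tu> elements."""
    root = ET.fromstring(tmx_text)
    body = root.find("body")
    units = list(body.findall("tu"))

    if points is not None:
        # Explicit sampling points: 0-based indexes of translation units.
        keep = {i for i in points if 0 <= i < len(units)}
    elif pattern is not None:
        # Keep units in which any segment matches the regular expression.
        rx = re.compile(pattern)
        keep = {
            i for i, tu in enumerate(units)
            if any(rx.search("".join(seg.itertext())) for seg in tu.iter("seg"))
        }
    elif count:
        # Evenly spaced units, at most `count` of them.
        step = max(1, len(units) // count)
        keep = set(range(0, len(units), step)[:count])
    else:
        keep = set(range(len(units)))

    for i, tu in enumerate(units):
        if i not in keep:
            body.remove(tu)
    return ET.tostring(root, encoding="unicode")
```

A real corpus-sized TMX would call for a streaming parser rather than loading the document at once; the selection logic would stay the same.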



Alignment evaluation
                     Adson (DE) = Адсо (RU)

                                             The Name of the Rose, Umberto Eco

Distribution

      All the tools implemented as Perl modules:
          Text::Perfide::BookCleaner
          Text::Perfide::BookPairs
          Text::Perfide::BookSync
          Text::Perfide::TMX::Utils
      publicly available on CPAN
      including tests and documentation
      additional effort required to make code
      installable and usable by other people


Corpora-flow
     Motivation
         building a corpus is a complex task
         linear pipeline is not powerful enough


     Workflow
         states
         actions
         conditions
         context

     Makefiles
         file-oriented
         timestamps and dependencies
         fail-fast and resumable execution
         parallelization
Corpora-flow

            workflow + Makefiles = corpora-flow

     DSL (→ Slay::Makefile)
     workflow:         rule*
     rule:             pre-condition* action post-condition*
     action:           targets dependencies function
     condition:        filename function
     target:           pattern*
     dependencies:     pattern*
     function:         Perl code
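
To make the grammar more concrete, the sketch below models one rule in Python. It is not the DSL itself nor Slay::Makefile: the class and function names are hypothetical, targets and dependencies are plain filenames rather than patterns, and every rule is assumed to have at least one target. It shows the Makefile side (rebuild only when a dependency is newer than a target) combined with the workflow side (conditions checked before and after the action, failing fast).

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List

Condition = Callable[[str], bool]  # receives a filename


@dataclass
class Rule:
    targets: List[str]
    dependencies: List[str]
    function: Callable[[List[str], List[str]], None]
    pre_conditions: List[Condition] = field(default_factory=list)
    post_conditions: List[Condition] = field(default_factory=list)


def is_stale(rule):
    """Makefile-style check: a target is missing or older than a dependency."""
    if not all(os.path.exists(t) for t in rule.targets):
        return True
    newest_dep = max((os.path.getmtime(d) for d in rule.dependencies), default=0.0)
    oldest_target = min(os.path.getmtime(t) for t in rule.targets)
    return newest_dep > oldest_target


def run_rule(rule):
    """Run one rule; return True if the action was executed."""
    if not is_stale(rule):
        return False  # up to date, so an interrupted run can resume cheaply
    for condition in rule.pre_conditions:
        for dep in rule.dependencies:
            if not condition(dep):
                raise RuntimeError(f"pre-condition failed for {dep}")
    rule.function(rule.targets, rule.dependencies)
    for condition in rule.post_conditions:
        for target in rule.targets:
            if not condition(target):
                raise RuntimeError(f"post-condition failed for {target}")
    return True
```

A pre-condition could, for instance, reject an empty input file, and a post-condition could check that the cleaned output exists before later rules depend on it.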


Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     Making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
Future work
     Document cleaners
         other types of documents (e.g. scientific
         articles)
         algorithm for finding section delimiters with
         notion of hierarchy
         create ebooks/bilingual books

     Duplicates and pair detection
         list of correspondences (e.g. Adson → Адсо,
         London → Londres)
         calculate best threshold values in real time
Future work

     Document synchronization
         interactive mode
         improvements on synchronization matrix and
         metrics
         hierarchical sections
         other section alignment algorithms

     Corpora-flow
         finish specification and implementation
         implement a corpora-flow for Project Per-fide

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

Slides

  • 1. Contributions for building a Corpora-Flow system
       André Santos, andrefs@cpan.org
       Informatics Engineering MSc, University of Minho
       December 2011
  • 3. Concepts
       Aligned parallel corpus: Set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, ...) from each text.
       Corpora-flow: Adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus.
       This presentation and the underlying master thesis describe the implementation of several tools to be used in typical corpus building activities.
  • 4. Context
       The work developed for this master thesis was motivated and supported by Project Per-fide, an ongoing project at the University of Minho which aims to build large parallel corpora between Portuguese and six other languages.
  • 5. Corpora building challenges
       file format and format conversion
       finding duplicated files
       text encoding format
       structural residues
       section delimiters
       unpaired sections (parallel corpora)
       ...
  • 6. Corpora building challenges
       Severe problems which often lead to bad results
       Many (most?) of them are hard/impossible to solve completely
       Find the problem and report it when it is not solvable automatically
       Provide intelligent ways of describing what was found and done
  • 7. 5 key issues
       Book cleaning
       Duplicates and candidate pairs detection
       Book synchronization
       Alignment evaluation
       Corpora-flow system
  • 9. Book processing problems – Motivation
       (...) d <92>’ entrée, donnant accès dans la salle commune.
       Une légère véranda, qui en proté-
       M <96>- 86 <96>-
       ^L geait la partie antérieure contre l <92>’ action
       des rayons solaires, reposait sur de sveltes bambous. (...)
       La Jangada, Jules Verne

       <92>’ : right single quotation mark (CP1252)
       <96>- : en dash (CP1252)
       ^L : page break (0xC)
       proté-(...)geait : transpagination
  • 10. Book processing problems – Motivation
       Cleaned version:
       (...) d ’ entrée, donnant accès dans la salle commune.
       Une légère véranda, qui en protégeait _pb1_
       la partie antérieure contre l ’ action
       des rayons solaires, reposait sur de sveltes bambous. (...)
       La Jangada, Jules Verne
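
The cleaning shown on this slide can be illustrated with a minimal Python sketch. It is not the code of Text::Perfide::BookCleaner (which is a Perl module); the function name `clean_page_breaks` and the exact layout it expects (hyphen, newline, optional page-number line, form feed) are assumptions made for illustration. Only the CP1252 code points, the form feed and the `_pb1_` marker come from the slides.

```python
import re


def clean_page_breaks(raw: bytes) -> str:
    """Decode CP1252 bytes and rejoin words split across a page break.

    Assumed layout (hypothetical): a word fragment ending in '-', a newline,
    an optional page-number line, a form feed (0x0C), then the rest of the word.
    """
    # Decoding as CP1252 turns 0x92 into U+2019 and 0x96 into U+2013.
    text = raw.decode("cp1252")
    counter = 0

    def join_word(match):
        nonlocal counter
        counter += 1
        # Glue both halves of the word and leave a numbered page-break mark.
        return f"{match.group(1)}{match.group(2)} _pb{counter}_"

    pattern = re.compile(r"(\w+)-\n(?:[^\n\f]*\n)?\f\s*(\w+)")
    return pattern.sub(join_word, text)


if __name__ == "__main__":
    sample = b"qui en prot\xe9-\n\x96 86 \x96\n\x0cgeait la partie"
    print(clean_page_breaks(sample))
```

The page-number line is simply discarded here; a real cleaner would probably need to recognise headers and footers more carefully.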
  • 11. Book cleaning
       Subdivided into several steps.
  • 12. Sections ontology
       contains common section types
       used to automatically generate the code to recognize section delimiters
       allows discussion/cooperation with people with no programming knowledge
       code becomes simpler and cleaner

       chap
           PT capítulo, cap, capitulo
           FR chapitre, chap
           EN chapter, chap
           NT sec
       end
           PT fim
           FR fin
           EN the_end
           BT _alone
       scene
           PT cena
           FR scène
           EN scene
           RU глава
           BT act
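
As a rough illustration of generating recognizer code from the ontology, the Python sketch below builds one regular expression from a dictionary of section types. The dictionary holds only the PT/FR/EN terms listed on the slide; the function names, the line format it accepts (a term optionally followed by an Arabic or Roman number) and the use of Python instead of Perl are assumptions, not the thesis implementation.

```python
import re

# Section types and terms taken from the slide (PT/FR/EN entries only).
ONTOLOGY = {
    "chap": ["capítulo", "cap", "capitulo", "chapitre", "chap", "chapter"],
    "end": ["fim", "fin", "the_end"],
    "scene": ["cena", "scène", "scene"],
}


def build_recognizer(ontology):
    """Generate a single regex with one named group per section type."""
    branches = []
    for section_type, terms in ontology.items():
        # Longest terms first, so "chapitre" is tried before "chap".
        ordered = sorted(set(terms), key=lambda t: (-len(t), t))
        alternatives = "|".join(re.escape(t) for t in ordered)
        branches.append(f"(?P<{section_type}>{alternatives})")
    return re.compile(
        r"^\s*(?:" + "|".join(branches) + r")\b\.?\s*(?P<number>\d+|[IVXLC]+)?\s*$",
        re.IGNORECASE,
    )


RECOGNIZER = build_recognizer(ONTOLOGY)


def recognize(line):
    """Return (section_type, number) for a delimiter line, else None."""
    match = RECOGNIZER.match(line)
    if match is None:
        return None
    for section_type in ONTOLOGY:
        if match.group(section_type) is not None:
            return section_type, match.group("number")
    return None


if __name__ == "__main__":
    for line in ["Chapitre IV", "CAPÍTULO 12", "Fim", "Chapman said hello"]:
        print(repr(line), "->", recognize(line))
```

Because the regex is derived from data, adding a language or a section type only means editing the dictionary, which is the cooperation benefit the slide points to.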
  • 13. Duplicates and pairs detection
       Motivation
           Duplicates can result in a biased corpus
           Finding candidate pairs for alignment
       Language independent elements (LIEs)
           terms which are usually kept untranslated
           year references – “1973”
           proper names – “Hamlet”
       Measuring similarity
           $\mathrm{similarity}(A, B) = \frac{|A_{LIEs} \cap B_{LIEs}|}{|A_{LIEs} \cup B_{LIEs}|}$
       Thresholds
           < 0.2: unrelated
           > 0.4: pair
           > 0.9: duplicates
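
The similarity measure on this slide is the ratio between shared and total LIEs, which a short Python sketch can make concrete. The thresholds are the ones on the slide; the extraction heuristic (four-digit numbers and capitalized words) and the label `undecided` for scores the slide leaves unclassified are assumptions for illustration, not the behaviour of Text::Perfide::BookPairs.

```python
import re


def extract_lies(text):
    """Crude, hypothetical LIE extractor: 4-digit years and capitalized words."""
    years = set(re.findall(r"\b\d{4}\b", text))
    names = set(re.findall(r"\b[A-Z][a-z]+\b", text))
    return years | names


def similarity(lies_a, lies_b):
    """|A ∩ B| / |A ∪ B| over two sets of language independent elements."""
    union = lies_a | lies_b
    if not union:
        return 0.0
    return len(lies_a & lies_b) / len(union)


def classify(score):
    """Apply the thresholds from the slide."""
    if score > 0.9:
        return "duplicates"
    if score > 0.4:
        return "pair"
    if score < 0.2:
        return "unrelated"
    return "undecided"  # assumption: the slide does not name this range


if __name__ == "__main__":
    a = {"1973", "Hamlet", "London"}
    b = {"1973", "Hamlet", "Londres"}
    score = similarity(a, b)
    print(score, classify(score))
```

In the usage example two of the four distinct elements are shared, so the score is 0.5 and the books are reported as a candidate pair.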
  • 14. Book synchronization
       Definition
           Structural alignment at section level, based on previously added section delimiting marks.
       Motivation
           Some aligners cannot handle large documents
           Section delimiters can act as anchor points
           Unpaired sections can be discarded
       Implementation
           match similar section delimiters
           synchronization points
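
The slide does not say how similar section delimiters are matched. One possible approach, shown here purely as an assumption, is to treat each book as a sequence of section labels and let Python's standard `difflib.SequenceMatcher` find the matching runs; each matched position becomes a synchronization point and everything else is unpaired. The function names and label format are hypothetical.

```python
from difflib import SequenceMatcher


def synchronization_points(sections_a, sections_b):
    """Return (index_in_a, index_in_b) pairs for sections found in both books."""
    matcher = SequenceMatcher(a=sections_a, b=sections_b, autojunk=False)
    points = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            points.append((block.a + offset, block.b + offset))
    return points


def unpaired(sections, paired_indices):
    """Sections with no counterpart, which could then be discarded."""
    paired = set(paired_indices)
    return [s for i, s in enumerate(sections) if i not in paired]


if __name__ == "__main__":
    book_a = ["chap 1", "chap 2", "chap 3", "end"]
    book_b = ["chap 1", "chap 3", "end"]
    points = synchronization_points(book_a, book_b)
    print(points)
    print(unpaired(book_a, [i for i, _ in points]))
```

Exact label equality is a strong simplification; delimiters in different languages would first need to be normalized to a common section type and number.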
  • 15. Output
       pair of files with synchronization marks
       pair of files divided into smaller pairs of chunks
       text report
       synchronization matrix
  • 18. Alignment evaluation
       Motivation
           compare alignments of the same documents (performed by different tools, with different options, ...)
           determine if an alignment was successful
       Comparing alignments
           parse TMX files and output the total number of correspondences of each type: 0:1/1:0, 1:1, 2:1/1:2 and 2:2
           evaluate the other tools developed
           compare the performance of the available alignment tools
  • 19. Alignment evaluation
       Determine if an alignment was successful
           Summarize a TMX by sampling.
           Sampling can be performed based on:
               number of samples desired
               explicit sampling points
               translation units which match a given regular expression
           Output is a (much?) smaller TMX file
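
The three sampling modes listed on this slide can be sketched in Python with the standard `xml.etree.ElementTree` module. This is not Text::Perfide::TMX::Utils: the function name `sample_tmx`, its parameters, and the choice of evenly spaced units when a number of samples is requested are all assumptions, since the slide does not say how the samples are chosen.

```python
import re
import xml.etree.ElementTree as ET


def sample_tmx(tmx_text, n=None, points=None, pattern=None):
    """Return a smaller TMX keeping only the sampled translation units (<tu>)."""
    root = ET.fromstring(tmx_text)
    body = root.find("body")
    units = list(body.findall("tu"))
    keep = set()

    if n:
        # Assumption: spread the n samples evenly over the document.
        count = min(n, len(units))
        keep.update(i * len(units) // count for i in range(count))
    if points:
        # Explicit sampling points, given as zero-based unit indices.
        keep.update(p for p in points if 0 <= p < len(units))
    if pattern:
        # Units in which any segment matches the regular expression.
        regex = re.compile(pattern)
        for i, tu in enumerate(units):
            if any(regex.search(seg.text or "") for seg in tu.iter("seg")):
                keep.add(i)

    for i, tu in enumerate(units):
        if i not in keep:
            body.remove(tu)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    texts = ["s0", "s1", "s2", "s3", "Adson spricht", "s5"]
    tmx = "<tmx version='1.4'><header/><body>" + "".join(
        f"<tu><tuv xml:lang='de'><seg>{t}</seg></tuv></tu>" for t in texts
    ) + "</body></tmx>"
    print(sample_tmx(tmx, pattern="Adson"))
```

Sampling by regular expression is the mode that suits checks such as looking up every unit that mentions a proper name and confirming that the other language mentions it too.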
  • 21. Alignment evaluation
       Adson (DE) = Адсо (RU)
       The Name of the Rose, Umberto Eco
  • 24. Distribution
       All the tools implemented as Perl modules:
           Text::Perfide::BookCleaner
           Text::Perfide::BookPairs
           Text::Perfide::BookSync
           Text::Perfide::TMX::Utils
       publicly available on CPAN
       including tests and documentation
       additional effort required to make code installable and usable by other people
  • 25. Corpora-flow
       Motivation
           building a corpus is a complex task
           linear pipeline is not powerful enough
       Workflow
           states
           actions
           conditions
           context
       Makefiles
           file-oriented
           timestamps and dependencies
           fail-fast and resumable execution
           parallelization
  • 26. Corpora-flow
       workflow + Makefiles = corpora-flow DSL (→ Slay::Makefile)
           workflow: rule*
           rule: pre-condition* action post-condition*
           action: targets dependencies function
           condition: filename function
           target: pattern*
           dependencies: pattern*
           function: Perl code
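
To make the grammar above more tangible, here is a small Python sketch that mirrors its productions as data classes and runs a rule with Makefile-style timestamp checks. It is an illustration under stated assumptions, not the corpora-flow DSL or Slay::Makefile: targets and dependencies are literal file names instead of patterns, functions are Python callables instead of Perl code, and `is_stale` and `run_rule` are hypothetical helpers.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Condition:
    # condition: filename function
    filename: str
    function: Callable[[str], bool]


@dataclass
class Action:
    # action: targets dependencies function
    targets: List[str]
    dependencies: List[str]
    function: Callable[[List[str], List[str]], None]


@dataclass
class Rule:
    # rule: pre-condition* action post-condition*
    action: Action
    pre_conditions: List[Condition] = field(default_factory=list)
    post_conditions: List[Condition] = field(default_factory=list)


def is_stale(action):
    """Makefile semantics: rebuild if a target is missing or older than a dependency."""
    if not action.targets or not all(os.path.exists(t) for t in action.targets):
        return True
    newest_dep = max((os.path.getmtime(d) for d in action.dependencies), default=0.0)
    oldest_target = min(os.path.getmtime(t) for t in action.targets)
    return newest_dep > oldest_target


def run_rule(rule):
    """Run one rule, failing fast when a condition does not hold."""
    if not is_stale(rule.action):
        return "up-to-date"
    for cond in rule.pre_conditions:
        if not cond.function(cond.filename):
            raise RuntimeError(f"pre-condition failed for {cond.filename}")
    rule.action.function(rule.action.targets, rule.action.dependencies)
    for cond in rule.post_conditions:
        if not cond.function(cond.filename):
            raise RuntimeError(f"post-condition failed for {cond.filename}")
    return "done"
```

A workflow would then be a list of such rules; running it a second time skips every rule whose targets are already newer than their dependencies, which is what makes execution resumable.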
  • 27. Conclusions
       Evaluation of the tools has shown that they do help to solve problems
       Most of the methods devised can be applied in other contexts
       Working within a larger project:
           provides requirements and resources
           specific needs and priorities
       Making code available to other people:
           requires additional effort
           gives meaning to the work
           external contributions
       Higher level objects help to organize and discuss
  • 37. Future work
       Document cleaners
           other types of documents (e.g. scientific articles)
           algorithm for finding section delimiters with notion of hierarchy
           create ebooks/bilingual books
       Duplicates and pair detection
           list of correspondences (e.g. Adson → Адсо, London → Londres)
           calculate best threshold values in real time
  • 38. Future work
       Document synchronization
           interactive mode
           improvements on synchronization matrix and metrics
           hierarchical sections
           other section alignment algorithms
       Corpora-flow
           finish specification and implementation
           implement a corpora-flow for Project Per-fide