Processing XML: a rewriting system approach

Alberto Simões
Teacher, programmer at University of Minho
alberto.simoes@eu.ipp.pt


  Portuguese Perl Workshop – 2010




       Alberto Simões   Processing XML: a rewriting system approach
Motivation and Goals


     XML is usually generated from structured information:
         databases, spreadsheets, forms, etc.

     but it can also be generated from unstructured
     (or poorly structured) data:
         textual documents, domain-specific languages;

 The question arises:
 How to produce XML documents from textual documents?
     write a parser (natural language, domain-specific, etc.);
     produce XML by rewriting the textual document!




How does textual rewriting work?


    write rewriting rules:

                rule ≅ pattern × restriction × action


         pattern a regular (or irregular) expression that should
                 be textually matched;
      restriction conditional code that checks whether the rule
                  should be applied;
          action a piece of code (or simply a string) that
                 produces the text that replaces the
                 originally matched text;
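
The triple above can be sketched in plain Perl. This is only an illustration of the idea, not how Text::RewriteRules stores its rules: the hash keys, the toy rule (halve even numbers) and the helper name apply_rule are all assumptions made for the example.

```perl
use strict;
use warnings;

# A rule modelled as the triple from the slide: pattern, restriction, action.
# The hash keys and the toy rule are illustrative names only.
my %rule = (
    pattern     => qr/(\d+)/,                 # what to match
    restriction => sub { $_[0] % 2 == 0 },    # apply only to even numbers
    action      => sub { $_[0] / 2 },         # text that replaces the match
);

# Rewrite every match for which the restriction holds; leave the rest as is.
sub apply_rule {
    my ($rule, $text) = @_;
    my ($pattern, $ok, $action) = @{$rule}{qw(pattern restriction action)};
    $text =~ s/$pattern/ $ok->($1) ? $action->($1) : $& /ge;
    return $text;
}

print apply_rule(\%rule, "7 apples, 10 pears, 4 plums"), "\n";
```

When the restriction fails, the matched text is put back unchanged, which is one simple way to model "the rule is not applied".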



Are there text rewriting tools?


 For this work we used Text::RewriteRules:

     written in Perl:
          the power of the Perl regular expression engine;
          reflective language (code can be generated on the fly);
     supports different rewriting approaches:
          fixed-point rewriting approach;
          sliding-cursor rewriting approach;
          lexical analyzer approach;

     home-developed;




Fixed-point rewriting approach

 Algorithm
     easy to understand;
     a sequence of rules that are applied in order;
     the first rule is applied, and each following rule is applied
     only if no previous rule can be applied;
     a rule may change the document in such a way that a
     previous rule becomes applicable again;
     the process ends when no rule can be applied
     (or when a specific rule forces the system to end);

 Code example: anonymization of emails
   RULES anonymize
   \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
   ENDRULES
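
The fixed-point algorithm can be sketched in plain Perl as a loop that always restarts from the first rule. This is a minimal sketch of the idea, not the code Text::RewriteRules generates; the two toy rules and the name fixed_point are assumptions chosen so that the rule ordering is visible.

```perl
use strict;
use warnings;

# Rules in priority order: [ pattern, replacement string ].
# These two toy rules are illustrative; they are not from the talk.
my @rules = (
    [ qr/aa/, 'b' ],
    [ qr/bb/, 'c' ],
);

# Apply the first rule that matches, then start again from the first rule.
# Stop when a full pass over the rules changes nothing (the fixed point).
sub fixed_point {
    my ($text) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            my ($pattern, $replacement) = @$rule;
            if ($text =~ s/$pattern/$replacement/) {
                $changed = 1;
                last;    # restart from the first rule
            }
        }
    }
    return $text;
}

print fixed_point("aaaa"), "\n";    # aaaa -> baa -> bb -> c
```

Note that nothing guarantees termination in general: a rule whose output matches its own pattern would loop forever, so the rule set has to be written with that in mind.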


Sliding-cursor rewriting approach
 Algorithm
     the cursor is placed at the beginning of the string;
     patterns are matched only if they occur right after the cursor;
     if a rule is applied, the cursor is placed after the rewritten region;
     if no rule matches, the cursor moves ahead one character;
     the process ends when the cursor reaches the end of the string;
     text that was already rewritten is never rewritten again.
 Code example: brute force translation
 RULES/m translate
 (\w+)=e=> $translation{$1} !! exists($translation{$1})
 ENDRULES

 Example (_ marks the cursor)
   _ latest train
   último _ train
   último combóio _
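
The same trace can be reproduced with a small plain-Perl sketch of the sliding-cursor idea. It keeps the text to the left of the cursor and the text to the right of it in two separate strings; the two-entry dictionary and the function name translate are assumptions for the example, and this is not the module's actual implementation.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Toy dictionary; the entries mirror the example trace above.
my %translation = (latest => 'último', train => 'combóio');

# $done is the text left of the cursor, $rest the text right of it.
sub translate {
    my ($text) = @_;
    my ($done, $rest) = ('', $text);
    while (length $rest) {
        if ($rest =~ /^(\w+)/ && exists $translation{$1}) {
            $done .= $translation{$1};           # rewritten text stays behind the cursor
            $rest  = substr($rest, length $1);   # cursor jumps past the matched region
        }
        else {
            $done .= substr($rest, 0, 1, '');    # no rule applies: move one character
        }
    }
    return $done;
}

print translate("latest train"), "\n";
```

Because rewritten text is only ever appended to the left of the cursor, it is never matched again, which is the property stated in the last item of the algorithm.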
Valid Rewriting Rules

 Different approaches have different possible rules...
 but the most relevant rules are:
         ==> simple pattern substitution: the left-hand side is
             a Perl regular expression and the right-hand side
             is the string that will replace the match;
        =e=> similar to the previous one, but the right-hand side
             is Perl code to be evaluated. The result is
             used to replace the match;
  =begin=> has no left-hand side; the right-hand side code is
           executed before the rewrite starts;
     =end=> has no right-hand side; when the left-hand side
            pattern matches, the rewrite system quits;
 Rules can include a restriction block (!!) to the right of the action.


Rewriting Text into XML



 How to produce XML from weakly structured data?
     write a parser;
     or rewrite the data step by step into XML!



 Two case studies:
     Rewriting a dictionary in textual format into TEI;
     Rewriting an XML DSL authoring tool into XML;




Rewriting Text into TEI
Rewrite this...

*Cachimbo*,
_m._
Apparelho de fumador, composto d..
Peça de ferro, em que entra o es..
Buraco, em que se encaixa a vela..
* _Bras. de Pernambuco._
Bebida, preparada com aguardente..
* _Pl. Gír._
Pés.
(Do químb. _quixima_)

...into this!

<entry id="cachimbo">
<form><orth>Cachimbo</orth></form>
<sense>
<gramGrp>m.</gramGrp>
<def>
Apparelho de fumador, composto d..
Peça de ferro, em que entra o es..
Buraco, em que se encaixa a vela..
</def>
</sense>
<sense ast="1">
<usg type="geo">Bras. de Pernamb..
<def>
Bebida, preparada com aguardente..
</def>
</sense>
<sense ast="1"><gramGrp>Pl.</gra..
<usg type="style">Gír.</usg>
<def>
Pés.
</def>
</sense>
<etym ori="químb">(Do químb. _qu..
Rewriting Text into TEI

 This rewrite was all based on:
     a few tables (grammatical and usage strings);
          entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
          entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit

     rewriting the little existing mark-up into a better XML structure;
 ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_


     rewriting the new XML structure to detect and annotate a
     more complex structure;
 <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g


     detecting and correcting wrong XML elements.
 </form></sense>==></form>
 </form></def>\n</sense>==></form>
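
The last step, correcting wrong XML elements, can be sketched with the two clean-up rules from the slide written as plain Perl substitutions. The function name fix_markup and the sample string are assumptions for the example; the real system ran these rules inside Text::RewriteRules together with the earlier passes.

```perl
use strict;
use warnings;

# Apply the two clean-up rules from the slide as plain substitutions:
# closing tags emitted right after </form>, where no <sense> or <def>
# had been opened yet, are dropped.
sub fix_markup {
    my ($xml) = @_;
    $xml =~ s{</form></sense>}{</form>}g;
    $xml =~ s{</form></def>\n</sense>}{</form>}g;
    return $xml;
}

my $broken = "<form><orth>Cachimbo</orth></form></def>\n</sense>\n<sense>";
print fix_markup($broken), "\n";
```

Fixing the output of earlier rules with later rules like this is often simpler than making the earlier rules context-aware.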


Rewriting Text into TEI

 Case study conclusions:
     flexible tool;
     works on big files:
         Text file is 13 MB;
         Output XML is 30 MB;
         Process takes about nine minutes!
     we even rewrote XML into XML.



                Hey!! XML is text!!
              How can we rewrite it!?

Rewriting XML



    different from the usual DOM- or SAX-oriented approaches;
    looks at XML as text, that is, as unstructured data;
    rewriting can be done:
        as in any other text rewriting system;
        taking advantage of irregular expressions.



     Irregular expressions? Are you kidding?




Not so regular expressions

 Perl has a powerful regular expression engine:
     regular expressions can define capture zones:
     small pieces of the match that can be used later;
     regular expressions can define look-ahead or look-behind:
     check the context of the matching zone;
     since Perl 5.10, regular expressions can be recursive:
     a regular expression can depend on itself.
     my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

 For XML, we defined two classes:
 [[:XML:]] matches any well-formed XML fragment;
 [[:XML(tag):]] matches an XML fragment with a specific
          root element;
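
To see what recursion buys, the balanced-parentheses pattern can be tried on nested input, following the example given in the perlre documentation. The helper name first_call is an assumption for this sketch; how the [[:XML:]] classes are built internally is not shown here.

```perl
use strict;
use warnings;

# Group 1 matches one parenthesised block; (?-1) recurses into the most
# recently opened group, so nested blocks are matched to any depth.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Return the name and the balanced argument block of the first call found,
# or an empty list when there is no balanced block.
sub first_call {
    my ($text) = @_;
    return $text =~ /(\w+)$parens/ ? ($1, $2) : ();
}

my ($name, $args) = first_call("foo(bar(baz)+baz(bop)) tail");
print "function: $name\n";    # foo
print "argument: $args\n";    # (bar(baz)+baz(bop))
```

A classical regular expression cannot count nesting depth; the same kind of recursion is what makes it possible to match a whole XML element including its nested children.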


Rewriting XML

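 As a simple example, we can remove duplicate translation units
 from a translation memory file:
 Code example
 use Text::RewriteRules;
 use XML::DT;
 use Digest::MD5 qw(md5);

 my %visited;    # digests of the translation units seen so far

 RULES/m duplicates
 ([[:XML(tu):]])==>!!duplicate($1)
 ENDRULES

 # True when an identical translation unit was already seen.
 sub duplicate {
   my $tu = shift;
   # Digest of the unit's content as returned by dtstring.
   my $tumd5 = md5(dtstring($tu,
                            -default => sub{$c}));
   return 1 if exists $visited{$tumd5};
   $visited{$tumd5}++;
   return 0;
 }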
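
For readers without the modules at hand, the same idea can be sketched with core Perl only. This is a simplification under stated assumptions: <tu> elements are taken not to nest, so a non-greedy match replaces [[:XML(tu):]], and the digest is computed over the whitespace-normalised unit instead of the output of dtstring. The function name remove_duplicates is hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Remove every <tu>...</tu> block whose content was already seen.
# Assumes <tu> elements are not nested, so a non-greedy match is enough.
sub remove_duplicates {
    my ($tmx) = @_;
    my %visited;    # digests of the units seen so far
    $tmx =~ s{(<tu\b[^>]*>.*?</tu>)}{
        my $unit = $1;
        (my $key = $unit) =~ s/\s+/ /g;    # ignore whitespace differences
        $visited{ md5_hex(encode_utf8($key)) }++ ? '' : $unit;
    }gse;
    return $tmx;
}

my $tmx = "<tu><tuv>hello</tuv></tu>\n"
        . "<tu><tuv>world</tuv></tu>\n"
        . "<tu><tuv>hello</tuv></tu>\n";
print remove_duplicates($tmx);
```

Storing digests instead of the units themselves keeps the memory used per unit constant, which matters for large translation memories.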


Conclusions


    The rewriting approach is:
        flexible;
        powerful;
        easy to learn;
        grows quickly;
        big systems can be difficult to maintain;
    The Perl regular expression engine:
        makes it easy to match anything;
        almost supports full grammars;
        makes it possible to define block structures;

    So, it can be applied to XML easily!




Thank you




               Thank You!



              Alberto Simões
        alberto.simoes@eu.ipp.pt




Language Identification: A neural network approach
Alberto Simões2.2K views
Making the most of a 100-year-old dictionary by Alberto Simões
Making the most of a 100-year-old dictionaryMaking the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionary
Alberto Simões1.5K views
Dictionary Alignment by Rewrite-based Entry Translation by Alberto Simões
Dictionary Alignment by Rewrite-based Entry TranslationDictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment by Rewrite-based Entry Translation
Alberto Simões903 views
EMLex-A5: Specialized Dictionaries by Alberto Simões
EMLex-A5: Specialized DictionariesEMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized Dictionaries
Alberto Simões2.9K views
Aula 04 - Introdução aos Diagramas de Sequência by Alberto Simões
Aula 04 - Introdução aos Diagramas de SequênciaAula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de Sequência
Alberto Simões1.2K views
Aula 03 - Introdução aos Diagramas de Atividade by Alberto Simões
Aula 03 - Introdução aos Diagramas de AtividadeAula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de Atividade
Alberto Simões3.8K views
Aula 02 - Engenharia de Requisitos by Alberto Simões
Aula 02 - Engenharia de RequisitosAula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de Requisitos
Alberto Simões1.6K views
Aula 01 - Planeamento de Sistemas de Informação by Alberto Simões
Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de Informação
Alberto Simões4.4K views
Building C and C++ libraries with Perl by Alberto Simões
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with Perl
Alberto Simões1.6K views
Arquitecturas de Tradução Automática by Alberto Simões
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução Automática
Alberto Simões751 views
Extracção de Recursos para Tradução Automática by Alberto Simões
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução Automática
Alberto Simões677 views

Recently uploaded

Network Source of Truth and Infrastructure as Code revisited by
Network Source of Truth and Infrastructure as Code revisitedNetwork Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisitedNetwork Automation Forum
32 views45 slides
Kyo - Functional Scala 2023.pdf by
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfFlavio W. Brasil
418 views92 slides
The Forbidden VPN Secrets.pdf by
The Forbidden VPN Secrets.pdfThe Forbidden VPN Secrets.pdf
The Forbidden VPN Secrets.pdfMariam Shaba
20 views72 slides
Uni Systems for Power Platform.pptx by
Uni Systems for Power Platform.pptxUni Systems for Power Platform.pptx
Uni Systems for Power Platform.pptxUni Systems S.M.S.A.
58 views21 slides
PRODUCT LISTING.pptx by
PRODUCT LISTING.pptxPRODUCT LISTING.pptx
PRODUCT LISTING.pptxangelicacueva6
18 views1 slide
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...James Anderson
126 views32 slides

Recently uploaded(20)

The Forbidden VPN Secrets.pdf by Mariam Shaba
The Forbidden VPN Secrets.pdfThe Forbidden VPN Secrets.pdf
The Forbidden VPN Secrets.pdf
Mariam Shaba20 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson126 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty22 views
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2218 views
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely29 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10345 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc72 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb14 views
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf

Processing XML: a rewriting system approach

  • 1. Processing XML A rewriting system approach Alberto Simões alberto.simoes@eu.ipp.pt Portuguese Perl Workshop – 2010 Alberto Simões Processing XML: a rewriting system approach
  • 2. Motivation and Goals
      XML is usually generated from structured information: databases, spreadsheets, forms, etc.
      But it can also be generated from unstructured (or poorly structured) data: textual documents, domain-specific languages.
      The question arises: how to produce XML documents from textual documents?
        write a parser (natural language, domain specific, etc.);
        or produce XML by rewriting the textual document!
  • 7. How does textual rewriting work?
      Write rewriting rules: rule ≅ pattern × restriction × action
        pattern: a regular (or irregular) expression that should be textually matched;
        restriction: conditional code that checks whether the rule should be applied;
        action: a piece of code (or simply a string) that produces the text that replaces the originally matched text.
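The pattern × restriction × action triple can be sketched in plain Perl. This is only an illustration of the idea, not how Text::RewriteRules is implemented: the hash layout, the `apply_rule` helper and the tiny `%translation` table are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical lookup table used by the restriction and the action.
my %translation = (train => 'comboio');

# A rule as a triple: pattern x restriction x action (illustrative layout).
my $rule = {
    pattern     => qr/(\w+)/,
    restriction => sub { exists $translation{ $_[0] } },
    action      => sub { $translation{ $_[0] } },
};

# Apply one rule to every match in the text. When the restriction
# rejects a match, the matched text is put back unchanged.
sub apply_rule {
    my ($rule, $text) = @_;
    my ($pattern, $restriction, $action) =
        @{$rule}{qw(pattern restriction action)};
    $text =~ s{$pattern}{
        my $matched = $1;
        $restriction->($matched) ? $action->($matched) : $matched;
    }ge;
    return $text;
}

print apply_rule($rule, "latest train"), "\n";
```

Here "latest" matches the pattern but fails the restriction, so only "train" is rewritten.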
  • 11. Are there text rewriting tools?
      For this work we used Text::RewriteRules:
        written in Perl: the power of the Perl regular expression engine; a reflexive language (code can be generated on the fly);
        supports different rewriting approaches: fixed-point rewriting; sliding-cursor rewriting; lexical analyzer;
        home-developed.
  • 15. Fixed-point rewriting approach
      Algorithm:
        easy to understand;
        a sequence of rules that are applied in order;
        the first rule is applied, and each following rule is applied only if no previous rule can be applied;
        a rule may change the document in a way that makes a previous rule applicable again;
        the process ends when no rule can be applied (or when a specific rule forces the system to end).
      Code example: anonymization of emails
        RULES anonymize
        \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
        ENDRULES
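The fixed-point algorithm described on this slide can be sketched as a small loop in plain Perl. This is a minimal illustration under assumptions, not the module's code: the `fixed_point` function, the rule list layout and the second (space-collapsing) rule are invented for the example.

```perl
use strict;
use warnings;

# Each rule is [ pattern, replacement string ]; the layout is illustrative.
my @rules = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, '[[hidden email]]' ],
    [ qr/  +/,                               ' '                ],
);

# Try the rules in order; after any successful rewrite, start again
# from the first rule. Stop when a full pass applies no rule.
sub fixed_point {
    my ($text, @rules) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            my ($pattern, $replacement) = @$rule;
            if ($text =~ s/$pattern/$replacement/) {
                $changed = 1;
                last;    # an earlier rule may be applicable again
            }
        }
    }
    return $text;
}

print fixed_point('write to  alice@example.com', @rules), "\n";
```

Termination depends on the rules: a rule whose output matches its own pattern would loop forever, which is why the replacement text here contains no `@`.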
  • 17. Sliding-cursor rewriting approach
      Algorithm:
        the cursor is placed at the beginning of the string;
        patterns are matched only if they occur right after the cursor;
        if a rule is applied, the cursor is placed after the rewritten region;
        if no rule matches, the cursor moves ahead one character;
        the process ends when the cursor reaches the end of the string;
        it never rewrites text that was already rewritten.
      Code example: brute force translation
        RULES/m translate
        (\w+)=e=> $translation{$1} !! exists($translation{$1})
        ENDRULES
      Example (_ marks the cursor):
        _ latest train
        último _ train
        último combóio _
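The sliding-cursor steps can be sketched in plain Perl with the `\G` anchor, which forces a match to start exactly at the current position. This is an assumption-laden sketch rather than the module's implementation: `sliding_cursor` and the translation table passed to it are hypothetical names.

```perl
use strict;
use warnings;

# Rewrite $text with a single rule: a word is replaced by its entry in
# the (hypothetical) translation table, if it has one.
sub sliding_cursor {
    my ($text, $translation) = @_;
    my $out    = '';
    my $cursor = 0;
    while ($cursor < length $text) {
        pos($text) = $cursor;    # \G anchors the match at the cursor
        if ($text =~ /\G(\w+)/gc && exists $translation->{$1}) {
            $out   .= $translation->{$1};    # rewritten text is never revisited
            $cursor = pos($text);            # jump past the matched region
        }
        else {
            $out .= substr($text, $cursor, 1);    # no rule applies here
            $cursor++;                            # move ahead one character
        }
    }
    return $out;
}

my %translation = (latest => 'ultimo', train => 'comboio');
print sliding_cursor('latest train', \%translation), "\n";
```

Because output is appended to a separate string, a table such as `(a => 'b', b => 'c')` turns "a" into "b" and stops there, unlike the fixed-point approach.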
  • 20. Valid Rewriting Rules
      Different approaches allow different kinds of rules, but the most relevant are:
        ==>       simple pattern substitution: the left-hand side is a Perl regular expression and the right-hand side is the string that replaces the match;
        =e=>      like the previous one, but the right-hand side is Perl code to be evaluated; its result replaces the match;
        =begin=>  has no left-hand side; the right-hand side code is executed before the rewrite starts;
        =end=>    has no right-hand side; when the left-hand side pattern matches, the rewrite system quits.
      Rules can include a restriction block (!!) to the right of the action.
  • 26. Rewriting Text into XML
      How to produce XML from weakly structured data?
        write a parser;
        or rewrite the data step by step into XML!
      Two case studies:
        rewriting a dictionary in textual format into TEI;
        rewriting an XML DSL authoring tool into XML.
  • 28. Rewriting Text into TEI
      Rewrite this. . .
        *Cachimbo*,
        _m._
        Apparelho de fumador, composto d..
        Peça de ferro, em que entra o es..
        Buraco, em que se encaixa a vela..
        * _Bras. de Pernambuco._
        Bebida, preparada com aguardente..
        * _Pl. Gír._
        Pés.
        (Do químb. _quixima_)
      . . . into this!
        <entry id="cachimbo">
        <form><orth>Cachimbo</orth></form>
        <sense>
        <gramGrp>m.</gramGrp>
        <def>
        Apparelho de fumador, composto d..
        Peça de ferro, em que entra o es..
        Buraco, em que se encaixa a vela..
        </def>
        </sense>
        <sense ast="1">
        <usg type="geo">Bras. de Pernamb..
        <def>
        Bebida, preparada com aguardente..
        </def>
        </sense>
        <sense ast="1"><gramGrp>Pl.</gra..
        <usg type="style">Gír.</usg>
        <def>
        Pés.
        </def>
        </sense>
        <etym ori="químb">(Do químb. _qu..
  • 29. Rewriting Text into TEI
      This rewrite was all based on:
        a few tables (grammatical and usage strings);
          entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
          entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit
        rewriting the little existing mark-up into a better XML structure (rule truncated on the slide);
          ((\* )?_([^_]|_[^_]{1,5}_)+_( \*)?)\n=e=>$a=$1;end_def.end_sense.start_
        rewriting the new XML structure to detect and annotate a more complex structure (rule truncated on the slide);
          <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g
        detecting and correcting wrong XML elements.
          </form></sense>==></form>
          </form></def>\n</sense>==></form>
  • 33. Rewriting Text into TEI
      Case study conclusions:
        flexible tool;
        works on big files: the text file is 13 MB; the output XML is 30 MB; the process takes about nine minutes!
        we even rewrote XML into XML.
      Hey!! XML is text!! How can we rewrite it!?
  • 35. Rewriting XML
      different from the usual DOM or SAX oriented approaches;
      looks at XML as text, as non-structured data;
      the rewrite can be done:
        as in any other text rewriting system;
        taking advantage of irregular expressions.
      Irregular expressions? Are you kidding?
  • 39. Not so regular expressions
      Perl has a powerful regular expression engine:
        regular expressions can define capture zones: small pieces of the match that can be used later;
        regular expressions can define look-ahead or look-behind: they check the context of the matching zone;
        since Perl 5.10, regular expressions can be recursive: a regular expression can depend on itself.
          my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
      For XML, we defined two classes:
        [[:XML:]] matches any well-formed XML fragment;
        [[:XML(tag):]] matches an XML fragment with a specific root element.
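To make the recursion concrete, here is a hedged sketch of how a pattern in the spirit of [[:XML(tag):]] could be written with the same `(?-1)` mechanism. It is not the definition used by Text::RewriteRules: the `xml_element` helper is hypothetical, and it assumes no self-closing form of the target tag and no comments or CDATA sections containing `<`.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns need Perl 5.10 or later

# Balanced parentheses, as on the slide.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Hypothetical helper: build a pattern for one element named by the
# argument, including nested elements of the same name.
sub xml_element {
    my ($tag) = @_;
    return qr{
        (                                     # group 1, target of the recursion
            <\Q$tag\E\b[^<>]*>                # opening tag, with any attributes
            (?:
                [^<]++                        # plain text
              | (?-1)                         # a nested element of the same name
              | <(?!/?\Q$tag\E\b)[^<>]*>      # any tag with a different name
            )*+
            </\Q$tag\E\s*>                    # matching closing tag
        )
    }x;
}

my $bold = xml_element('b');
if ('x <b>one <b>two</b> <i>z</i></b> y' =~ $bold) {
    print "$1\n";
}
```

The relative reference `(?-1)` is what keeps the pattern reusable: it points at the enclosing group wherever the compiled pattern ends up being embedded.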
  • 45. Rewriting XML
      As a simple example, we can remove duplicate translation units in a translation memory file.
      Code example:
        RULES/m duplicates
        ([[:XML(tu):]])==>!!duplicate($1)
        ENDRULES

        # md5 presumably comes from Digest::MD5; dtstring and $c from XML::DT
        sub duplicate {
            my $tu = shift;
            # normalize the unit before hashing it
            my $tumd5 = md5(dtstring($tu, -default => sub{$c}));
            return 1 if exists $visited{$tumd5};
            $visited{$tumd5}++;
            return 0;
        }
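The same idea can be tried without the rewriting DSL. The sketch below is a simplified stand-in, not the slide's method: `remove_duplicate_tus` is a hypothetical name, it relies on `<tu>` elements not nesting (so a non-greedy match is enough), it hashes the raw text with the core Digest::MD5 module instead of normalizing it with dtstring, and it assumes byte strings.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Drop every <tu> element whose text was already seen earlier.
# Two units that differ only in whitespace or attributes count as different.
sub remove_duplicate_tus {
    my ($tmx) = @_;
    my %visited;
    $tmx =~ s{((<tu\b.*?</tu>)\s*)}{ $visited{ md5_hex($2) }++ ? '' : $1 }gse;
    return $tmx;
}

my $tmx = '<body><tu><seg>a</seg></tu><tu><seg>b</seg></tu><tu><seg>a</seg></tu></body>';
print remove_duplicate_tus($tmx), "\n";
```

Hashing keeps memory use bounded by the number of distinct units rather than by their size, which matters for large translation memories.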
  • 46. Conclusions
      The rewriting approach is: flexible; powerful; easy to learn; but it grows quickly, and big systems can be difficult to maintain.
      The Perl regular expression engine: makes it easy to match anything; almost supports full grammars; makes it possible to define block structures.
      So, it can be applied to XML easily!
  • 47. Thank you! Alberto Simões, alberto.simoes@eu.ipp.pt