Processing XML: a rewriting system approach

1,218 views

Published on

Yet another method to parse XML: rewrite it!

Published in: Technology, News & Politics
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,218
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
16
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Processing XML: a rewriting system approach

  1. 1. Processing XML A rewriting system approach Alberto Simões alberto.simoes@eu.ipp.pt Portuguese Perl Workshop – 2010 Alberto Simões Processing XML: a rewriting system approach
  2. 2. Motivation and Goals XML is usually generated from structured information: databases, spreadsheets, forms, etc. but it can be generated from unstructured (or poorly-structured data): textual documents, domain specific languages; Question arises: How to produce XML documents from textual documents? write a parser (natural language, domain specific, etc); produce XML by rewriting the textual document! Alberto Simões Processing XML: a rewriting system approach
  3. 3. Motivation and Goals XML is usually generated from structured information: databases, spreadsheets, forms, etc. but it can be generated from unstructured (or poorly-structured data): textual documents, domain specific languages; Question arises: How to produce XML documents from textual documents? write a parser (natural language, domain specific, etc); produce XML by rewriting the textual document! Alberto Simões Processing XML: a rewriting system approach
  4. 4. Motivation and Goals XML is usually generated from structured information: databases, spreadsheets, forms, etc. but it can be generated from unstructured (or poorly-structured data): textual documents, domain specific languages; Question arises: How to produce XML documents from textual documents? write a parser (natural language, domain specific, etc); produce XML by rewriting the textual document! Alberto Simões Processing XML: a rewriting system approach
  5. 5. Motivation and Goals XML is usually generated from structured information: databases, spreadsheets, forms, etc. but it can be generated from unstructured (or poorly-structured data): textual documents, domain specific languages; Question arises: How to produce XML documents from textual documents? write a parser (natural language, domain specific, etc); produce XML by rewriting the textual document! Alberto Simões Processing XML: a rewriting system approach
  6. 6. Motivation and Goals XML is usually generated from structured information: databases, spreadsheets, forms, etc. but it can be generated from unstructured (or poorly-structured data): textual documents, domain specific languages; Question arises: How to produce XML documents from textual documents? write a parser (natural language, domain specific, etc); produce XML by rewriting the textual document! Alberto Simões Processing XML: a rewriting system approach
  7. 7. Hows does textual rewriting works? write rewriting rules: rule ∼ pattern × restriction × action = pattern a regular (or irregular) expression that should be textually matched; restriction conditional code that checks whether the rule should be applied; action a piece of code (or simply a string) that produces text that should replace the originally matched text; Alberto Simões Processing XML: a rewriting system approach
  8. 8. Hows does textual rewriting works? write rewriting rules: rule ∼ pattern × restriction × action = pattern a regular (or irregular) expression that should be textually matched; restriction conditional code that checks whether the rule should be applied; action a piece of code (or simply a string) that produces text that should replace the originally matched text; Alberto Simões Processing XML: a rewriting system approach
  9. 9. Hows does textual rewriting works? write rewriting rules: rule ∼ pattern × restriction × action = pattern a regular (or irregular) expression that should be textually matched; restriction conditional code that checks whether the rule should be applied; action a piece of code (or simply a string) that produces text that should replace the originally matched text; Alberto Simões Processing XML: a rewriting system approach
  10. 10. Hows does textual rewriting works? write rewriting rules: rule ∼ pattern × restriction × action = pattern a regular (or irregular) expression that should be textually matched; restriction conditional code that checks whether the rule should be applied; action a piece of code (or simply a string) that produces text that should replace the originally matched text; Alberto Simões Processing XML: a rewriting system approach
  11. 11. Are there text rewriting tools? For this work we used Text::RewriteRules: written in Perl: Perl regular expression engine power; Reflexive language (code can be generated on the fly); supports different rewriting approaches: Fixed-point rewriting approach; Sliding-cursor rewriting approach; Lexical analyzer approach; home-developed; Alberto Simões Processing XML: a rewriting system approach
  12. 12. Are there text rewriting tools? For this work we used Text::RewriteRules: written in Perl: Perl regular expression engine power; Reflexive language (code can be generated on the fly); supports different rewriting approaches: Fixed-point rewriting approach; Sliding-cursor rewriting approach; Lexical analyzer approach; home-developed; Alberto Simões Processing XML: a rewriting system approach
  13. 13. Are there text rewriting tools? For this work we used Text::RewriteRules: written in Perl: Perl regular expression engine power; Reflexive language (code can be generated on the fly); supports different rewriting approaches: Fixed-point rewriting approach; Sliding-cursor rewriting approach; Lexical analyzer approach; home-developed; Alberto Simões Processing XML: a rewriting system approach
  14. 14. Are there text rewriting tools? For this work we used Text::RewriteRules: written in Perl: Perl regular expression engine power; Reflexive language (code can be generated on the fly); supports different rewriting approaches: Fixed-point rewriting approach; Sliding-cursor rewriting approach; Lexical analyzer approach; home-developed; Alberto Simões Processing XML: a rewriting system approach
  15. 15. Fixed-point rewriting approach Algorithm easy to understand; a sequence of rules that are applied by order; first rule is applied, and following rules are only applied if there is no previous rule that can be applied; it might happen that a rule changes the document in a way that a previous rule will be applied again; the process ends when there are no rules that can be applied (or if a specific rule forces the system to end); Code example: anonymization of emails RULES anonymize w+(.w+)*@w+.w+(.w+)*==>[[hidden email]] ENDRULES Alberto Simões Processing XML: a rewriting system approach
  16. 16. Fixed-point rewriting approach Algorithm easy to understand; a sequence of rules that are applied by order; first rule is applied, and following rules are only applied if there is no previous rule that can be applied; it might happen that a rule changes the document in a way that a previous rule will be applied again; the process ends when there are no rules that can be applied (or if a specific rule forces the system to end); Code example: anonymization of emails RULES anonymize w+(.w+)*@w+.w+(.w+)*==>[[hidden email]] ENDRULES Alberto Simões Processing XML: a rewriting system approach
  17. 17. Sliding-cursor rewriting approach Algorithm the cursor is placed in the beginning of the string; patterns are matched if they occur right after the cursor; if a rule is applied, the cursor is placed after that region; if no rule matches, the cursor moves ahead one character; process ends when cursor reaches the end of the string; it will never rewrite text that was already rewritten. Code example: brute force translation RULES/m translate (w+)=e=> $translation{$1} !! exists($translation{$1}) ENDRULES Example _ latest train último _ train último combóio _ Alberto Simões Processing XML: a rewriting system approach
  18. 18. Sliding-cursor rewriting approach Algorithm the cursor is placed in the beginning of the string; patterns are matched if they occur right after the cursor; if a rule is applied, the cursor is placed after that region; if no rule matches, the cursor moves ahead one character; process ends when cursor reaches the end of the string; it will never rewrite text that was already rewritten. Code example: brute force translation RULES/m translate (w+)=e=> $translation{$1} !! exists($translation{$1}) ENDRULES Example _ latest train último _ train último combóio _ Alberto Simões Processing XML: a rewriting system approach
  19. 19. Sliding-cursor rewriting approach Algorithm the cursor is placed in the beginning of the string; patterns are matched if they occur right after the cursor; if a rule is applied, the cursor is placed after that region; if no rule matches, the cursor moves ahead one character; process ends when cursor reaches the end of the string; it will never rewrite text that was already rewritten. Code example: brute force translation RULES/m translate (w+)=e=> $translation{$1} !! exists($translation{$1}) ENDRULES Example _ latest train último _ train último combóio _ Alberto Simões Processing XML: a rewriting system approach
  20. 20. Valid Rewriting Rules Different approaches have different possible rules. . . but the most relevant rules are: ==> simple pattern substitution: left hand side includes a Perl regular expression and right hand side includes the string that will replace the match; =e=> similar to the previous one, but right hand side includes Perl code to be evaluated. The result will be used to replace the match; =begin=> without a left hand side, the right hand side code is executed before the rewrite starts; =end=> without a right hand side, when the left hand side pattern matches quits the rewrite system; they can include a restriction block (!!) at the right of the action. Alberto Simões Processing XML: a rewriting system approach
  21. 21. Valid Rewriting Rules Different approaches have different possible rules. . . but the most relevant rules are: ==> simple pattern substitution: left hand side includes a Perl regular expression and right hand side includes the string that will replace the match; =e=> similar to the previous one, but right hand side includes Perl code to be evaluated. The result will be used to replace the match; =begin=> without a left hand side, the right hand side code is executed before the rewrite starts; =end=> without a right hand side, when the left hand side pattern matches quits the rewrite system; they can include a restriction block (!!) at the right of the action. Alberto Simões Processing XML: a rewriting system approach
  22. 22. Valid Rewriting Rules Different approaches have different possible rules. . . but the most relevant rules are: ==> simple pattern substitution: left hand side includes a Perl regular expression and right hand side includes the string that will replace the match; =e=> similar to the previous one, but right hand side includes Perl code to be evaluated. The result will be used to replace the match; =begin=> without a left hand side, the right hand side code is executed before the rewrite starts; =end=> without a right hand side, when the left hand side pattern matches quits the rewrite system; they can include a restriction block (!!) at the right of the action. Alberto Simões Processing XML: a rewriting system approach
  23. 23. Valid Rewriting Rules Different approaches have different possible rules. . . but the most relevant rules are: ==> simple pattern substitution: left hand side includes a Perl regular expression and right hand side includes the string that will replace the match; =e=> similar to the previous one, but right hand side includes Perl code to be evaluated. The result will be used to replace the match; =begin=> without a left hand side, the right hand side code is executed before the rewrite starts; =end=> without a right hand side, when the left hand side pattern matches quits the rewrite system; they can include a restriction block (!!) at the right of the action. Alberto Simões Processing XML: a rewriting system approach
  24. 24. Valid Rewriting Rules Different approaches have different possible rules. . . but the most relevant rules are: ==> simple pattern substitution: left hand side includes a Perl regular expression and right hand side includes the string that will replace the match; =e=> similar to the previous one, but right hand side includes Perl code to be evaluated. The result will be used to replace the match; =begin=> without a left hand side, the right hand side code is executed before the rewrite starts; =end=> without a right hand side, when the left hand side pattern matches quits the rewrite system; they can include a restriction block (!!) at the right of the action. Alberto Simões Processing XML: a rewriting system approach
  25. 25. Valid Rewriting Rules Different approaches have different possible rules. . . but the most relevant rules are: ==> simple pattern substitution: left hand side includes a Perl regular expression and right hand side includes the string that will replace the match; =e=> similar to the previous one, but right hand side includes Perl code to be evaluated. The result will be used to replace the match; =begin=> without a left hand side, the right hand side code is executed before the rewrite starts; =end=> without a right hand side, when the left hand side pattern matches quits the rewrite system; they can include a restriction block (!!) at the right of the action. Alberto Simões Processing XML: a rewriting system approach
  26. 26. Rewriting Text into XML How to produce XML from weak-structured data? write a parser; or rewrite the data step-by-step into XML! Two case studies: Rewriting a dictionary in textual format into TEI; Rewriting a XML DSL authoring tool into XML; Alberto Simões Processing XML: a rewriting system approach
  27. 27. Rewriting Text into XML How to produce XML from weak-structured data? write a parser; or rewrite the data step-by-step into XML! Two case studies: Rewriting a dictionary in textual format into TEI; Rewriting a XML DSL authoring tool into XML; Alberto Simões Processing XML: a rewriting system approach
  28. 28. Rewriting Text into TEI Rewrite this. . . . . . into this! *Cachimbo*, <entry id="cachimbo"> _m._ <form><orth>Cachimbo</orth></form> Apparelho de fumador, composto d.. <sense> Peça de ferro, em que entra o es.. <gramGrp>m.</gramGrp> Buraco, em que se encaixa a vela.. <def> * _Bras. de Pernambuco._ Apparelho de fumador, composto d.. Bebida, preparada com aguardente.. Peça de ferro, em que entra o es.. * _Pl. Gír._ Buraco, em que se encaixa a vela.. Pés. </def> (Do químb. _quixima_) </sense> <sense ast="1"> <usg type="geo">Bras. de Pernamb.. <def> Bebida, preparada com aguardente.. </def> </sense> <sense ast="1"><gramGrp>Pl.</gra.. <usg type="style">Gír.</usg> <def> Pés. </def> </sense> <etym ori="químb">(Do químb. _qu.. Alberto Simões Processing XML: a rewriting system approach
  29. 29. Rewriting Text into TEI This rewrite was all based on: a few tables (grammatical and usage strings); entries genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph entries domains: Agr Anat Anthrop Apicult Arith Artilh Archit rewrite the few mark-up into better XML structure; ((* )?_([^_]|_[^_]{1,5}_)+_( *)?)n=e=>$a=$1;end_def.end_sense.start_ rewrite the new XML structure to detect and annotate a more complex structure; <gramGrp>([^<]*)s**s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast="1"".g detect and correct wrong XML elements. </form></sense>==></form> </form></def>n</sense>==></form> Alberto Simões Processing XML: a rewriting system approach
  30. 30. Rewriting Text into TEI This rewrite was all based on: a few tables (grammatical and usage strings); entries genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph entries domains: Agr Anat Anthrop Apicult Arith Artilh Archit rewrite the few mark-up into better XML structure; ((* )?_([^_]|_[^_]{1,5}_)+_( *)?)n=e=>$a=$1;end_def.end_sense.start_ rewrite the new XML structure to detect and annotate a more complex structure; <gramGrp>([^<]*)s**s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast="1"".g detect and correct wrong XML elements. </form></sense>==></form> </form></def>n</sense>==></form> Alberto Simões Processing XML: a rewriting system approach
  31. 31. Rewriting Text into TEI This rewrite was all based on: a few tables (grammatical and usage strings); entries genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph entries domains: Agr Anat Anthrop Apicult Arith Artilh Archit rewrite the few mark-up into better XML structure; ((* )?_([^_]|_[^_]{1,5}_)+_( *)?)n=e=>$a=$1;end_def.end_sense.start_ rewrite the new XML structure to detect and annotate a more complex structure; <gramGrp>([^<]*)s**s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast="1"".g detect and correct wrong XML elements. </form></sense>==></form> </form></def>n</sense>==></form> Alberto Simões Processing XML: a rewriting system approach
  32. 32. Rewriting Text into TEI This rewrite was all based on: a few tables (grammatical and usage strings); entries genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph entries domains: Agr Anat Anthrop Apicult Arith Artilh Archit rewrite the few mark-up into better XML structure; ((* )?_([^_]|_[^_]{1,5}_)+_( *)?)n=e=>$a=$1;end_def.end_sense.start_ rewrite the new XML structure to detect and annotate a more complex structure; <gramGrp>([^<]*)s**s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast="1"".g detect and correct wrong XML elements. </form></sense>==></form> </form></def>n</sense>==></form> Alberto Simões Processing XML: a rewriting system approach
  33. 33. Rewriting Text into TEI Case study conclusions: flexible tool; works on big files: Text file is 13 MB; Output XML is 30 MB; Process takes about nine minutes! we event rewrote XML into XML. Hey!! XML is text!! How can we rewrite it!? Alberto Simões Processing XML: a rewriting system approach
  34. 34. Rewriting Text into TEI Case study conclusions: flexible tool; works on big files: Text file is 13 MB; Output XML is 30 MB; Process takes about nine minutes! we event rewrote XML into XML. Hey!! XML is text!! How can we rewrite it!? Alberto Simões Processing XML: a rewriting system approach
  35. 35. Rewriting XML different from the usual DOM or SAX oriented approaches; looks to XML as text, non structured data; rewrite can be done: as any other text write system; taking advantage of irregular expressions. Irregular expressions? Are you kidding? Alberto Simões Processing XML: a rewriting system approach
  36. 36. Rewriting XML different from the usual DOM or SAX oriented approaches; looks to XML as text, non structured data; rewrite can be done: as any other text write system; taking advantage of irregular expressions. Irregular expressions? Are you kidding? Alberto Simões Processing XML: a rewriting system approach
  37. 37. Rewriting XML different from the usual DOM or SAX oriented approaches; looks to XML as text, non structured data; rewrite can be done: as any other text write system; taking advantage of irregular expressions. Irregular expressions? Are you kidding? Alberto Simões Processing XML: a rewriting system approach
  38. 38. Rewriting XML different from the usual DOM or SAX oriented approaches; looks to XML as text, non structured data; rewrite can be done: as any other text write system; taking advantage of irregular expressions. Irregular expressions? Are you kidding? Alberto Simões Processing XML: a rewriting system approach
  39. 39. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones: small pieces of the match that can be used later; regular expressions can define look-ahead or look-behind: check the context of the matching zone; since Perl 5.10, regular expressions can be recursive: regular expression that depends on themself. my $parens = qr/(((?:[^()]++|(?-1))*+))/; For XML, we defined two classes: [[:XML:]] matches any well formed XML fragment; [[:XML(tag):]] matches a XML fragment with a specific root element; Alberto Simões Processing XML: a rewriting system approach
  40. 40. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones: small pieces of the match that can be used later; regular expressions can define look-ahead or look-behind: check the context of the matching zone; since Perl 5.10, regular expressions can be recursive: regular expression that depends on themself. my $parens = qr/(((?:[^()]++|(?-1))*+))/; For XML, we defined two classes: [[:XML:]] matches any well formed XML fragment; [[:XML(tag):]] matches a XML fragment with a specific root element; Alberto Simões Processing XML: a rewriting system approach
  41. 41. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones: small pieces of the match that can be used later; regular expressions can define look-ahead or look-behind: check the context of the matching zone; since Perl 5.10, regular expressions can be recursive: regular expression that depends on themself. my $parens = qr/(((?:[^()]++|(?-1))*+))/; For XML, we defined two classes: [[:XML:]] matches any well formed XML fragment; [[:XML(tag):]] matches a XML fragment with a specific root element; Alberto Simões Processing XML: a rewriting system approach
  42. 42. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones: small pieces of the match that can be used later; regular expressions can define look-ahead or look-behind: check the context of the matching zone; since Perl 5.10, regular expressions can be recursive: regular expression that depends on themself. my $parens = qr/(((?:[^()]++|(?-1))*+))/; For XML, we defined two classes: [[:XML:]] matches any well formed XML fragment; [[:XML(tag):]] matches a XML fragment with a specific root element; Alberto Simões Processing XML: a rewriting system approach
  43. 43. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones: small pieces of the match that can be used later; regular expressions can define look-ahead or look-behind: check the context of the matching zone; since Perl 5.10, regular expressions can be recursive: regular expression that depends on themself. my $parens = qr/(((?:[^()]++|(?-1))*+))/; For XML, we defined two classes: [[:XML:]] matches any well formed XML fragment; [[:XML(tag):]] matches a XML fragment with a specific root element; Alberto Simões Processing XML: a rewriting system approach
  44. 44. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones: small pieces of the match that can be used later; regular expressions can define look-ahead or look-behind: check the context of the matching zone; since Perl 5.10, regular expressions can be recursive: regular expression that depends on themself. my $parens = qr/(((?:[^()]++|(?-1))*+))/; For XML, we defined two classes: [[:XML:]] matches any well formed XML fragment; [[:XML(tag):]] matches a XML fragment with a specific root element; Alberto Simões Processing XML: a rewriting system approach
  45. 45. Rewriting XML As a simple example, we can remove duplicate translation units in a translation memory file: Code example RULES/m duplicates ([[:XML(tu):]])==>!!duplicate($1) ENDRULES sub duplicate { my $tu = shift; my $tumd5 = md5(dtstring($tu, -default => sub{$c})); return 1 if exists $visited{$tumd5}; $visited{$tumd5}++ return 0; } Alberto Simões Processing XML: a rewriting system approach
  46. 46. Conclusions The rewriting approach is: flexible; powerful; easy to learn; grows quickly; big systems can be difficult to maintain; The Perl regular engine: makes it easy to match anything; almost supports full grammars; makes it possible to define block structures; So, it can be applied to XML easily! Alberto Simões Processing XML: a rewriting system approach
  47. 47. Thank you Thank You! Alberto Simões alberto.simoes@eu.ipp.pt Alberto Simões Processing XML: a rewriting system approach

×