Processing XML: a rewriting system approach
Yet another method to parse XML: rewrite it!

Published in: Technology, News & Politics

Transcript

  • 1. Processing XML A rewriting system approach Alberto Simões alberto.simoes@eu.ipp.pt Portuguese Perl Workshop – 2010 Alberto Simões Processing XML: a rewriting system approach
  • 2. Motivation and Goals
      XML is usually generated from structured information: databases, spreadsheets, forms, etc.
      It can also be generated from unstructured or poorly structured data: textual documents, domain-specific languages.
      The question arises: how can we produce XML documents from textual documents?
        • write a parser (natural language, domain specific, etc.);
        • produce XML by rewriting the textual document!
  • 7. How does textual rewriting work?
      We write rewriting rules: rule ≅ pattern × restriction × action
        • pattern: a regular (or irregular) expression that should be textually matched;
        • restriction: conditional code that checks whether the rule should be applied;
        • action: a piece of code (or simply a string) that produces the text that replaces the originally matched text.
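The rule triple described above can be sketched outside Perl. The following is a minimal Python illustration, not Text::RewriteRules code: the names Rule, always, and apply_once are hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable


def always(match):
    """Default restriction (hypothetical helper): accept every match."""
    return True


@dataclass
class Rule:
    # pattern x restriction x action, as on the slide
    pattern: re.Pattern
    action: Callable[[re.Match], str]
    restriction: Callable[[re.Match], bool] = always


def apply_once(rule, text):
    """Rewrite the first match accepted by the restriction.

    Returns (new_text, applied)."""
    for m in rule.pattern.finditer(text):
        if rule.restriction(m):
            return text[:m.start()] + rule.action(m) + text[m.end():], True
    return text, False


# Illustrative rule: tag numbers, but only those greater than 100.
big_number = Rule(
    pattern=re.compile(r"\d+"),
    action=lambda m: f"<num>{m.group(0)}</num>",
    restriction=lambda m: int(m.group(0)) > 100,
)
```

With this rule, apply_once(big_number, "7 and 250 and 300") skips "7" because the restriction rejects it and rewrites only the first accepted match, "250".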
  • 11. Are there text rewriting tools?
      For this work we used Text::RewriteRules:
        • written in Perl:
            • the power of the Perl regular expression engine;
            • a reflexive language (code can be generated on the fly);
        • supports different rewriting approaches:
            • fixed-point rewriting approach;
            • sliding-cursor rewriting approach;
            • lexical analyzer approach;
        • home-developed.
  • 15. Fixed-point rewriting approach
      Algorithm:
        • easy to understand;
        • a sequence of rules that are applied in order;
        • the first rule is applied, and each following rule is applied only if no earlier rule can be applied;
        • a rule may change the document in a way that makes an earlier rule applicable again;
        • the process ends when no rule can be applied (or when a specific rule forces the system to end).
      Code example: anonymization of emails
        RULES anonymize
        \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
        ENDRULES
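The fixed-point algorithm above can be sketched in a few lines of Python. This illustrates the control flow only and is not how Text::RewriteRules is implemented: the function name, the (pattern, replacement) rule format, the choice to rewrite one occurrence per step, and the max_steps safety limit are all assumptions made for the example.

```python
import re


def fixed_point_rewrite(text, rules, max_steps=10_000):
    """Apply the first applicable rule, restart from the top, stop when none applies.

    rules is an ordered list of (pattern, replacement) pairs (hypothetical format).
    max_steps is an assumed guard against rule sets that never reach a fixed point.
    """
    for _ in range(max_steps):
        for pattern, replacement in rules:
            new_text, count = re.subn(pattern, replacement, text, count=1)
            if count:
                text = new_text
                break              # a rule fired: go back to the first rule
        else:
            return text            # no rule fired: fixed point reached
    raise RuntimeError("no fixed point reached within max_steps")


# The email anonymization rule from the slide, as a (pattern, replacement) pair.
EMAIL_RULES = [(r"\w+(\.\w+)*@\w+\.\w+(\.\w+)*", "[[hidden email]]")]
```

Calling fixed_point_rewrite("write to j.doe@mail.example.com", EMAIL_RULES) replaces the address with [[hidden email]]. The process terminates because the replacement text contains no "@", so the rule can no longer match.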
  • 17. Sliding-cursor rewriting approach
      Algorithm:
        • the cursor is placed at the beginning of the string;
        • patterns are matched only if they occur right after the cursor;
        • if a rule is applied, the cursor is placed after the rewritten region;
        • if no rule matches, the cursor moves ahead one character;
        • the process ends when the cursor reaches the end of the string;
        • text that was already rewritten is never rewritten again.
      Code example: brute force translation
        RULES/m translate
        (\w+)=e=> $translation{$1} !! exists($translation{$1})
        ENDRULES
      Example (the cursor is shown as _):
        _ latest train
        último _ train
        último combóio _
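The sliding-cursor steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the module's implementation: the function name, the (pattern, action, restriction) tuple format, and the TRANSLATION table are made up for the example.

```python
import re


def sliding_cursor_rewrite(text, rules):
    """Rewrite text left to right; output is never rescanned.

    rules is a list of (compiled_pattern, action, restriction) tuples
    (hypothetical format mirroring pattern, =e=> action, and !! restriction).
    """
    out = []
    pos = 0
    while pos < len(text):
        for pattern, action, restriction in rules:
            m = pattern.match(text, pos)       # must match right at the cursor
            if m and m.end() > pos and restriction(m):
                out.append(action(m))          # emit the replacement
                pos = m.end()                  # cursor jumps past the match
                break
        else:
            out.append(text[pos])              # no rule: copy one character
            pos += 1
    return "".join(out)


# Assumed two-word dictionary, taken from the slide's example.
TRANSLATION = {"latest": "último", "train": "combóio"}

RULES = [(
    re.compile(r"\w+"),
    lambda m: TRANSLATION[m.group(0)],         # action
    lambda m: m.group(0) in TRANSLATION,       # restriction
)]
```

Because replacements go straight to the output and the cursor only moves forward, a table mapping "a" to "b" and "b" to "c" turns "ab" into "bc", not "cc". This is the practical difference from the fixed-point approach.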
  • 20. Valid Rewriting Rules
      Different approaches support different kinds of rules, but the most relevant are:
        • ==>  simple pattern substitution: the left-hand side is a Perl regular expression and the right-hand side is the string that replaces the match;
        • =e=>  like the previous one, but the right-hand side is Perl code to be evaluated; its result replaces the match;
        • =begin=>  has no left-hand side; the right-hand side code is executed before the rewrite starts;
        • =end=>  has no right-hand side; when the left-hand side pattern matches, the rewrite system quits.
      Rules can include a restriction block (!!) to the right of the action.
  • 26. Rewriting Text into XML
      How can we produce XML from weakly structured data?
        • write a parser;
        • or rewrite the data step by step into XML!
      Two case studies:
        • rewriting a dictionary in textual format into TEI;
        • rewriting an XML DSL authoring tool into XML.
  • 28. Rewriting Text into TEI
      Rewrite this. . .
        *Cachimbo*,
        _m._
        Apparelho de fumador, composto d..
        Peça de ferro, em que entra o es..
        Buraco, em que se encaixa a vela..
        * _Bras. de Pernambuco._
        Bebida, preparada com aguardente..
        * _Pl. Gír._
        Pés.
        (Do químb. _quixima_)
      . . . into this!
        <entry id="cachimbo">
        <form><orth>Cachimbo</orth></form>
        <sense>
        <gramGrp>m.</gramGrp>
        <def>
        Apparelho de fumador, composto d..
        Peça de ferro, em que entra o es..
        Buraco, em que se encaixa a vela..
        </def>
        </sense>
        <sense ast="1">
        <usg type="geo">Bras. de Pernamb..
        <def>
        Bebida, preparada com aguardente..
        </def>
        </sense>
        <sense ast="1"><gramGrp>Pl.</gra..
        <usg type="style">Gír.</usg>
        <def>
        Pés.
        </def>
        </sense>
        <etym ori="químb">(Do químb. _qu..
  • 29. Rewriting Text into TEI
      This rewrite was all based on:
        • a few tables (grammatical and usage strings):
            entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
            entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit
        • rewriting the little existing mark-up into a better XML structure:
            ((\* )?_([^_]|_[^_]{1,5}_)+_( \*)?)\n=e=>$a=$1;end_def.end_sense.start_
        • rewriting the new XML structure to detect and annotate a more complex structure:
            <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g
        • detecting and correcting wrong XML elements:
            </form></sense>==></form>
            </form></def>\n</sense>==></form>
  • 33. Rewriting Text into TEI
      Case study conclusions:
        • flexible tool;
        • works on big files:
            • the text file is 13 MB;
            • the output XML is 30 MB;
            • the process takes about nine minutes!
        • we even rewrote XML into XML.
      Hey!! XML is text!! How can we rewrite it!?
  • 35. Rewriting XML
        • different from the usual DOM or SAX oriented approaches;
        • looks at XML as text, as unstructured data;
        • the rewrite can be done:
            • as in any other text rewriting system;
            • taking advantage of irregular expressions.
      Irregular expressions? Are you kidding?
  • 39. Not so regular expressions
      Perl has a powerful regular expression engine:
        • regular expressions can define capture zones: small pieces of the match that can be used later;
        • regular expressions can define look-ahead or look-behind: checks on the context of the matching zone;
        • since Perl 5.10, regular expressions can be recursive: a regular expression can refer to itself.
            my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
      For XML, we defined two classes:
        • [[:XML:]] matches any well-formed XML fragment;
        • [[:XML(tag):]] matches an XML fragment with a specific root element.
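To make concrete what a class like [[:XML(tag):]] has to recognize, here is a rough Python sketch. Python's standard re module has no recursive patterns, so the sketch swaps the recursion for an explicit stack of open tags. The function name and the simplified tag syntax are assumptions: comments, CDATA sections, processing instructions, and ">" inside attribute values are not handled.

```python
import re

# Simplified tag tokenizer (assumption: see the limitations in the lead-in).
# Groups: 1 = "/" for a closing tag, 2 = tag name, 3 = attributes, 4 = "/" if self-closing.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*?)(/?)>")


def match_xml_fragment(text, pos=0, root=None):
    """Return the end offset of the balanced element starting at pos, or None.

    If root is given, the element must have that tag name.
    """
    m = TAG.match(text, pos)
    if not m or m.group(1):              # must begin with an opening tag
        return None
    if root is not None and m.group(2) != root:
        return None
    if m.group(4):                       # self-closing element: done
        return m.end()
    stack = [m.group(2)]                 # names of currently open elements
    cursor = m.end()
    while stack:
        m = TAG.search(text, cursor)
        if not m:
            return None                  # ran out of tags: not balanced
        closing, name, _attrs, self_closing = m.groups()
        if closing:
            if stack.pop() != name:
                return None              # close tag does not match the open one
        elif not self_closing:
            stack.append(name)
        cursor = m.end()
    return cursor
```

Given text = '<tu id="1"><seg>a</seg><seg/></tu> tail', text[:match_xml_fragment(text, root="tu")] yields the whole tu element without the trailing text. The recursive Perl pattern on the slide plays the same balancing role as the stack here.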
  • 45. Rewriting XML
      As a simple example, we can remove duplicate translation units in a translation memory file.
      Code example:
        RULES/m duplicates
        ([[:XML(tu):]])==>!!duplicate($1)
        ENDRULES

        # Returns true when a unit with the same text content was already seen.
        # md5 comes from Digest::MD5; dtstring comes from XML::DT.
        sub duplicate {
          my $tu = shift;
          my $tumd5 = md5(dtstring($tu, -default => sub{$c}));
          return 1 if exists $visited{$tumd5};
          $visited{$tumd5}++;
          return 0;
        }
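The same idea, hashing the text content of each translation unit and dropping units whose hash was already seen, can be sketched in Python. This is only an illustration: the function name is hypothetical, the regular expressions assume that tu elements are never nested, and stripping tags with a regex is a simplification of what dtstring does in the Perl version.

```python
import hashlib
import re

# Assumption: <tu> elements are not nested, so a non-greedy match suffices.
# \b keeps "<tu" from matching "<tuv".
TU = re.compile(r"<tu\b[^>]*>.*?</tu>", re.DOTALL)
TAGS = re.compile(r"<[^>]+>")


def remove_duplicate_units(tmx_text):
    """Delete every translation unit whose text content was already seen."""
    seen = set()

    def keep_or_drop(match):
        unit = match.group(0)
        content = TAGS.sub("", unit)                       # text only, tags stripped
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen:
            return ""                                      # duplicate: rewrite to nothing
        seen.add(digest)
        return unit                                        # first occurrence: keep as is

    return TU.sub(keep_or_drop, tmx_text)
```

The first occurrence of each unit is kept and later copies are rewritten to the empty string, which mirrors the slide's rule with an empty right-hand side guarded by the duplicate restriction.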
  • 46. Conclusions
      The rewriting approach is:
        • flexible;
        • powerful;
        • easy to learn;
        • quick to grow: big systems can be difficult to maintain.
      The Perl regular expression engine:
        • makes it easy to match anything;
        • almost supports full grammars;
        • makes it possible to define block structures.
      So, it can be applied to XML easily!
  • 47. Thank you! Alberto Simões alberto.simoes@eu.ipp.pt
