Perly Parsing with Regexp::Grammars

A short description of Perly grammar processors leading up to Regexp::Grammars. Develops two R::G modules, one for single-line logfile entries, another for larger FASTA format entries in the NCBI "nr.gz" file. The second example shows how to derive one grammar from another by overriding tags in the base grammar.


Presentation Transcript

  • Perly Parsers:

    Perl-byacc
    Parse::Yapp
    Parse::RecDescent
    Regexp::Grammars

    Steven Lembark
    Workhorse Computing
    lembark@wrkhors.com
  • Grammars are the guts of compilers

    ● Compilers convert text from one form to another.
      – C compilers convert C source to CPU-specific assembly.
      – Databases compile SQL into RDBMS ops.
    ● Grammars define structure, precedence, valid inputs.
      – Realistic ones are often recursive or context-sensitive.
      – The complexity of defining grammars led to a variety of tools for defining them.
      – The standard format for a long time has been "BNF", which is the input to YACC.
    ● They are wasted on flat text.
      – If "split /\t/" does the job, skip grammars entirely (see the sketch below).
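    If the input really is flat, split does it all; a minimal sketch (the field names are made up):

        use strict;
        use warnings;

        # Tab-delimited input needs no grammar at all.
        my @lines
        = (
            "1367874132\tINFO\tStarted emerge",
            "1367874133\tWARN\tterminating.",
        );

        for my $line ( @lines )
        {
            my ( $stamp, $level, $message ) = split /\t/, $line;

            print "$level at $stamp: $message\n";
        }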
  • The first Yet Another: YACC

    ● Yet Another Compiler Compiler.
      – YACC takes in a standard-format grammar structure.
      – It processes tokens and their values, organizing the results according to the grammar into a structure.
    ● Between the source and YACC is a tokenizer.
      – This parses the inputs into individual tokens defined by the grammar.
      – It doesn't know about structure, only breaking the text stream up into tokens.
  • Parsing is a pain in the lex

    ● The real pain is gluing the parser and tokenizer together.
      – Tokenizers deal in the language of patterns.
      – Grammars are defined in terms of structure.
    ● Passing data between them makes for most of the difficulty.
      – One issue is the global yylex call, which makes having multiple parsers difficult.
      – Context-sensitive grammars with multiple sub-grammars are painful.
  • The perly way

    ● Regexen, logic, glue... hmm... been there before.
      – The first approach most of us try is lexing with regexen.
      – Then add captures and if-blocks, or execute (?{code}) blocks inside of each regex (see the sketch below).
    ● The problem is that the grammar is embedded in your code structure.
      – You have to modify the code structure to change the grammar or its tokens.
      – Hubris, maybe, but Truly Lazy it ain't.
      – Avoiding that was the whole reason for developing standard grammars & their handlers in the first place.
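    For contrast, a quick sketch of the hand-rolled style on a couple of emerge-log lines: the token patterns and the handling logic are welded together, so changing either means rewriting the loop.

        use strict;
        use warnings;

        # Grammar-in-the-code: each branch is both tokenizer and handler.
        for my $line
        (
            "1367874132:  Started emerge on: May 06, 2013 21:02:12",
            "1367874132:  *** emerge --jobs --deep talk",
        )
        {
            if( $line =~ m{^ (\d+) : \s+ [*]{3} \s+ (.+) }x )
            {
                print "command $1: $2\n";
            }
            elsif( $line =~ m{^ (\d+) : \s+ (.+) }x )
            {
                print "message $1: $2\n";
            }
        }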
  • Early Perl Grammar Modules

    ● These take in a YACC grammar and spit out compiler code.
    ● Intentionally looked like YACC:
      – Able to re-cycle existing YACC grammar files.
      – Benefit from using Perl as a built-in lexer.
      – Perl-byacc & Parse::Yapp.
    ● Good: Recycles knowledge for YACC users.
    ● Bad: Still not lazy: the grammars are difficult to maintain, and you still have to plug in post-processing code to deal with the results.
  • Example: Parse::Yapp grammar

    %right  '='
    %left   '-' '+'
    %left   '*' '/'
    %left   NEG
    %right  '^'

    %%

    input:  # empty
        |   input line      { push( @{ $_[1] }, $_[2] ); $_[1] }
    ;

    line:   '\n'            { $_[1] }
        |   exp '\n'        { print "$_[1]\n" }
        |   error '\n'      { $_[0]->YYErrok }
    ;

    exp:    NUM
        |   VAR             { $_[0]->YYData->{VARS}{$_[1]} }
        |   VAR '=' exp     { $_[0]->YYData->{VARS}{$_[1]} = $_[3] }
        |   exp '+' exp     { $_[1] + $_[3] }
        |   exp '-' exp     { $_[1] - $_[3] }
        |   exp '*' exp     { $_[1] * $_[3] }
  • The Swiss Army Chainsaw

    ● Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers.
    ● Grammars are largely declarative, using OO Perl to do the heavy lifting.
      – The OO interface allows multiple, context-sensitive parsers.
      – Rules with Perl blocks allow the code to do anything.
      – Results can be acquired from a hash, an array, or $1.
      – Left, right, associative tags simplify messy situations.
  • Example P::RD

    ● This is part of an infix formula compiler I wrote.
    ● It compiles equations to a sequence of closures.

    add_op  : '+' | '-' | '%'   { $item[ 1 ] }
    mult_op : '*' | '/' | '^'   { $item[ 1 ] }

    add     : <leftop: mult add_op mult>
              { compile_binop @{ $item[1] } }

    mult    : <leftop: factor mult_op factor>
              { compile_binop @{ $item[1] } }
  • Just enough rope to shoot yourself...

    ● The biggest problem: P::RD is sloooooooow.
    ● The learning curve is perl-ish: shallow and long.
      – Unless you really know what all of it does, you may not be able to figure out the pieces.
      – Lots of really good docs that most people never read.
    ● Perly blocks also made it look too much like a job-dispatcher.
      – People used it for a lot of things that are not compilers.
      – Good & bad thing: it really is a compiler.
  • R.I.P. P::RD

    ● Supposed to be replaced with Parse::FastDescent.
      – Damian dropped work on P::FD for Perl6.
      – His goal was to replace the shortcomings of P::RD with something more complete, and quite a bit faster.
    ● The result is Perl6 Grammars.
      – Declarative syntax extends matching with rules.
      – Built into Perl6 as a structure, not an add-on.
      – Much faster.
      – Not available in Perl5.
  • Regexp::Grammars

    ● Perl5 implementation derived from Perl6.
      – Back-porting an idea, not the Perl6 syntax.
      – Much better performance than P::RD.
    ● Extends the v5.10 recursive matching syntax, leveraging the regex engine.
      – Most of the speed issues are with regex design, not the parser itself.
      – Simplifies mixing code and matching.
      – Single place to get the final results.
      – Cleaner syntax with automatic whitespace handling.
  • Extending regexen

    ● "use Regexp::Grammars" turns on the added syntax.
      – Block-scoped (avoids collisions with existing code).
    ● You will probably want to add "xm" or "xs":
      – Extended syntax (x) avoids whitespace issues.
      – Multi-line mode (m) simplifies line anchors for line-oriented parsing.
      – Single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
      – I use "xm" with explicit "\n" or "\s" matches to span lines where necessary.
  • What you get

    ● The parser is simply a regex-ref (see the sketch below).
      – You can bless it or have multiple parsers for context grammars.
    ● Grammars can reference one another.
      – Extending grammars via objects or modules is straightforward.
    ● Comfortable for incremental development or refactoring.
      – The largely declarative syntax helps.
      – OOP provides inheritance with overrides for rules.
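    Using one is just a binding match; a minimal sketch, assuming the trivial line grammar from the next slide:

        use strict;
        use warnings;
        use Data::Dumper;

        my $parser = do
        {
            use Regexp::Grammars;

            qr{
                <data>

                <rule: data > <[text]>+
                <rule: text > .+
            }xm;
        };

        # The parser is an ordinary regex-ref: match, then read %/.
        if( "line one\nline two\n" =~ $parser )
        {
            print Dumper \%/;
        }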
  • Example: Creating a compiler

    ● Context can be a do-block, subroutine, or branch logic.
    ● "data" is the entry rule.
    ● All this does is read lines into an array with automatic ws handling.

    my $compiler = do
    {
        use Regexp::Grammars;

        qr{
            <data>

            <rule: data > <[text]>+
            <rule: text > .+
        }xm;
    };
  • Results: %/

    ● The results of parsing are in a tree-hash named %/.
      – Keys are the rule names that produced the results.
      – Empty keys ('') hold input text (for errors or debugging).
      – Easy to handle with Data::Dumper.
    ● The hash has at least one key for the entry rule, plus one empty key for input data if context is being saved.
    ● For example, feeding two lines of a Gentoo emerge log through the line grammar gives:
  • Parsing a few lines of logfile

    {
        '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
    1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',

        data =>
        {
            '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
    1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',

            text =>
            [
                '1367874132:  Started emerge on: May 06, 2013 21:02:12',
                '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
            ]
        }
    }
  • Getting rid of context

    ● The empty-keyed values are useful for development or explicit error messages.
    ● They also get in the way and can cost a lot of memory on large inputs.
    ● You can turn them on and off with <context:> and <nocontext:> in the rules (see the sketch below).
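    A sketch of the toggles, assuming the line grammar from earlier: context is off globally, then switched back on inside the one rule being debugged.

        my $parser = do
        {
            use Regexp::Grammars;

            qr{
                <nocontext:>            # off globally: no '' keys saved

                <data>

                <rule: data > <[line]>+
                <rule: line > <context:> <ref_id> <text>    # back on for this rule

                <token: ref_id > ^ (\d+)
                <rule: text    > .+
            }xm;
        };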
  • You usually want [] with +

    qr{
        <nocontext:>            # turn off globally

        <data>

        <rule: data > <text>+   # oops, left off the []!
        <rule: text > .+
    }xm;

    warn:
        Repeated subrule <text>+ will only capture its final match
        (Did you mean <[text]>+ instead?)

    {
        data =>
        {
            text => '1367874132:  *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
        }
    }
  • An array[ref] of text

    qr{
        <nocontext:>            # turn off globally

        <data>

        <rule: data > <[text]>+
        <rule: text > (.+)
    }xm;

    {
        data =>
        {
            text =>             # the [text] parses to an array of text
            [
                '1367874132:  Started emerge on: May 06, 2013 21:02:12',
                '1367874132:  *** emerge --jobs --autounmask-write --...'
            ],
            ...
  • Breaking up lines

    ● Each log entry is prefixed with an entry id.
    ● Parsing the ref_id off the front adds:

    <data>

    <rule: data   > <[line]>+
    <rule: line   > <ref_id> <[text]>
    <token: ref_id> ^ (\d+)
    <rule: text   > .+

    line =>
    [
        {
            ref_id => 1367874132,
            text   => ': Started emerge on: May 06, 2013 21:02:12'
        },
        ...
    ]
  • Removing cruft: "ws"

    ● It would be nice to remove the leading ": " from the text lines.
    ● In this case the "whitespace" needs to include a colon along with the spaces.
    ● Whitespace is defined by <ws: ... >:

    <rule: line> <ws: [\s:]+ > <ref_id> <text>

    {
        ref_id => 1367874132,
        text   => '*** emerge --jobs --autounmask-wr...'
    }
  • The *** prefix means something

    ● It would be nice to know what type of line was being processed.
    ● <prefix= regex > assigns the regex's capture to the "prefix" tag:

    <rule: line > <ws: [\s:]* > <ref_id> <entry>

    <rule: entry >
        <prefix= ( [*][*][*] ) > <text>
    |   <prefix= ( [>][>][>] ) > <text>
    |   <prefix= ( [=][=][=] ) > <text>
    |   <prefix= ( [:][:][:] ) > <text>
    |   <text>
  • "entry" now contains an optional prefix

    {
        entry  => { text => 'Started emerge on: May 06, 2013 21:02:12' },
        ref_id => 1367874132
    },
    {
        entry  =>
        {
            prefix => '***',
            text   => 'emerge --jobs --autounmask-write...'
        },
        ref_id => 1367874132
    },
    {
        entry  =>
        {
            prefix => '>>>',
            text   => 'emerge (1 of 2) sys-apps/...'
        },
        ref_id => 1367874256
    }
  • Aliases can also assign tag results

    ● Aliases assign a key to rule results.
    ● The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix= ( [*][*][*] ) > <command=text>
    |   <prefix= ( [>][>][>] ) > <stage=text>
    |   <prefix= ( [=][=][=] ) > <status=text>
    |   <prefix= ( [:][:][:] ) > <final=text>
    |   <message=text>
  • Generic "text" replaced with a type:

    {
        entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
        ref_id => 1367874132
    },
    {
        entry  =>
        {
            command => 'emerge --jobs --autounmask-write --...',
            prefix  => '***'
        },
        ref_id => 1367874132
    },
    {
        entry  =>
        {
            command => 'terminating.',
            prefix  => '***'
        },
        ref_id => 1367874133
    },
  • Parsing without capturing

    ● At this point we don't really need the prefix strings since the entries are labeled.
    ● A leading . tells R::G to parse but not store the results in %/:

    <rule: entry >
        <.prefix= ( [*][*][*] ) > <command=text>
    |   <.prefix= ( [>][>][>] ) > <stage=text>
    |   <.prefix= ( [=][=][=] ) > <status=text>
    |   <.prefix= ( [:][:][:] ) > <final=text>
    |   <message=text>
  • "entry" now has typed keys:

    {
        entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
        ref_id => 1367874132
    },
    {
        entry  => { command => 'emerge --jobs --autounmask-write -...' },
        ref_id => 1367874132
    },
    {
        entry  => { command => 'terminating.' },
        ref_id => 1367874133
    },
  • The "entry" nesting gets in the way

    ● The named subrule is not hard to get rid of: just move its syntax up one level:

    <ws: [\s:]* > <ref_id>
    (
        <.prefix= ( [*][*][*] ) > <command=text>
    |   <.prefix= ( [>][>][>] ) > <stage=text>
    |   <.prefix= ( [=][=][=] ) > <status=text>
    |   <.prefix= ( [:][:][:] ) > <final=text>
    |   <message=text>
    )
  • Result: array of "line" with ref_id & type

    data =>
    {
        line =>
        [
            {
                message => 'Started emerge on: May 06, 2013 21:02:12',
                ref_id  => 1367874132
            },
            {
                command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
                ref_id  => 1367874132
            },
            {
                command => 'terminating.',
                ref_id  => 1367874133
            },
            {
                message => 'Started emerge on: May 06, 2013 21:02:17',
                ref_id  => 1367874137
            },
  • Funny names for things

    ● Maybe "command" and "status" aren't the best way to distinguish the text.
    ● You can store an optional token followed by text:

    <rule: entry > <ws: [\s:]* > <ref_id> <type>? <text>

    <token: type>
        ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )
  • Entries now have "text" and "type"

    entry =>
    [
        {
            ref_id => 1367874132,
            text   => 'Started emerge on: May 06, 2013 21:02:12'
        },
        {
            ref_id => 1367874133,
            text   => 'terminating.',
            type   => '***'
        },
        {
            ref_id => 1367874137,
            text   => 'Started emerge on: May 06, 2013 21:02:17'
        },
        {
            ref_id => 1367874137,
            text   => 'emerge --jobs --autounmask-write --...',
            type   => '***'
        },
  • Prefix alternations look ugly

    ● Using a count works:

        [*]{3} | [>]{3} | [:]{3} | [=]{3}

      but isn't all that much more readable.
    ● Given the way these are used, use a character class:

        [*>:=]{3}
  • This is the skeleton parser

    ● Doesn't take much:
      – Declarative syntax.
      – No Perl code at all!
    ● Easy to modify by extending the definition of "text" for specific types of messages.

    qr{
        <nocontext:>

        <data>

        <rule: data   > <[entry]>+
        <rule: entry  > <ws: [\s:]* > <ref_id> <prefix>? <text>

        <token: ref_id> ^ (\d+)
        <token: prefix> [*>=:]{3}
        <token: text  > .+
    }xm;
  • Finishing the parser

    ● Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
      – Sub-rules can be defined for the different line types (see the sketch below).

    <rule: command> "emerge" <.ws> <[switch]>+

    <token: switch> ( [-][-] \S+ )

    ● This is what makes the grammars useful: nested, context-sensitive content.
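    Wired into the skeleton, that looks something like this sketch; only the emerge lines get the deeper structure, everything else falls through to plain text:

        my $parser = do
        {
            use Regexp::Grammars;

            qr{
                <nocontext:>

                <data>

                <rule: data    > <[entry]>+
                <rule: entry   > <ws: [\s:]* > <ref_id> <prefix>?
                                 ( <command> | <text> )

                <rule: command > emerge <.ws> <[switch]>+
                <token: switch > ( [-][-] \S+ )

                <token: ref_id > ^ (\d+)
                <token: prefix > [*>=:]{3}
                <token: text   > .+
            }xm;
        };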
  • Inheriting & Extending Grammars

    ● <grammar: name> and <extends: name> allow a building-block approach.
    ● Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.
    ● This makes modular or context-sensitive grammars relatively simple to compose (see the sketch below).
      – References can cross package or module boundaries.
      – Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
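    The shape of it in miniature (the grammar and rule names here are made up): one block defines a named grammar, a second regex extends it and supplies the entry rule.

        use strict;
        use warnings;

        # Define a reusable grammar: no entry rule, so it
        # produces no results by itself.
        my $base = do
        {
            use Regexp::Grammars;

            qr{
                <grammar: My::Base>

                <rule: pair > <key> = <value>

                <token: key   > \w+
                <token: value > \S+
            }x;
        };

        # Derive a parser: <extends:> pulls in the rules,
        # and <[pair]>+ acts as the entry point.
        my $parser = do
        {
            use Regexp::Grammars;

            qr{
                <extends: My::Base>

                <[pair]>+
            }x;
        };

        "a = 1 b = 2" =~ $parser
            and print scalar @{ $/{pair} }, " pairs\n";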
  • The Non-Redundant File

    ● NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.
    ● It is moderately large: 140+GB uncompressed.
    ● The file consists of a simple FASTA format with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters...]
    >Heading 2
    ...
  • Example: A short nr.gz FASTA entry

    ● Headings are grouped by species, separated by ctrl-A ("\cA") characters, shown below as "^A" since they are invisible.
      – Each species has a set of source & identifier pairs followed by a single description.
      – The within-species separator is a pipe ("|") with optional whitespace.
      – Species counts in some headers run into the thousands.

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^A
    gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^A
    gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^A
    gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...
    KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...
    VQKLLNPDQ
  • First step: Parse FASTA

    qr{
        <grammar: Parse::Fasta>
        <nocontext:>

        <rule: fasta   > <.start> <head> <.ws> <[body]>+
        <rule: head    > .+ <.ws>
        <rule: body    > ( <[seq]> | <.comment> ) <.ws>

        <token: start  > ^ [>]
        <token: comment> ^ [;] .+
        <token: seq    > ^ [\n\w-]+
    }xm;

    ● Instead of defining an entry rule, this just defines a name, "Parse::Fasta".
      – It cannot be used to generate results by itself.
      – It is accessible anywhere via Regexp::Grammars.
  • The output needs help, however

    ● The "<seq>" token captures newlines that need to be stripped out to get a single string.
    ● Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{...})
      – This allows inserting almost-arbitrary code into the regex.
      – "Almost" because the code cannot include regexen.

    seq =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY
    DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
    VQKLLNPDQ'
    ]
  • Munging results: $MATCH

    ● $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
    ● In this case I take the "seq" match contents out of %/, join them with nothing, and use "tr" to strip the newlines.
      – join + split won't work because split uses a regex.

    <rule: body > ( <[seq]> | <.comment> ) <.ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })
  • One more step: remove the arrayref

    ● Now the body is a single string.
    ● No need for an arrayref to contain one string.
    ● Since the body has one entry, assign offset zero:

    <rule: fasta> <.start> <head> <.ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    body =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
    ],
  • Result: a generic FASTA parser

    {
        fasta =>
        [
            {
                body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
                head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
            }
        ]
    }

    ● The head and body are easily accessible.
    ● Next: parse the nr-specific header.
  • Deriving a grammar

    ● Existing grammars are "extended".
    ● The derived grammars are capable of producing results.
    ● In this case, the derived parser references the grammar and extracts a list of fasta entries:

    <extends: Parse::Fasta>

    <[fasta]>+
  • Splitting the head into identifiers

    ● Overloading fasta's "head" rule allows splitting out identifiers for individual species.
    ● Catch: \cA is a separator, not a terminator.
      – The tail item on the list doesn't have a \cA to anchor on.
      – Using ".+ [\cA\n]" walks off the header onto the sequence.
      – This is a common problem with separators & tokenizers.
      – It can be handled with special tokens in the grammar, but R::G provides a cleaner way.
  • First pass: literal "tail" item

    ● This works but is ugly:
      – Have two rules for the main list and the tail.
      – Alias the tail to get them all in one place.

    <rule: head> <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident > .+? \cA
    <token: final > .+ \n
  • Breaking up the header

    ● The last header item is aliased to "ident".
    ● This breaks up all of the entries:

    head =>
    {
        ident =>
        [
            'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
            'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
            'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
            'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
        ]
    }
  • Dealing with separators: % <sep>

    ● Separators happen often enough:
      – 1, 2, 3, 4, 13, 91        # numbers by commas, spaces
      – g-c-a-g-t-t-a-c-a         # characters by dashes
      – /usr/local/bin            # basenames by dir markers
      – /usr:/usr/local:bin       # dirs separated by colons
      that R::G has special syntax for dealing with them.
    ● Combine the item with % and a separator (see the sketch below):

    <rule: list    > <[item]>+ % <separator>    # one-or-more
    <rule: list_zom> <[item]>* % <separator>    # zero-or-more
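    A minimal, self-contained sketch of the one-or-more form:

        use strict;
        use warnings;

        my $list_rx = do
        {
            use Regexp::Grammars;

            qr{
                <list>

                <rule: list > <[item]>+ % [,]   # items separated by commas
                <token: item> \d+
            }x;
        };

        "1, 2, 3, 13, 91" =~ $list_rx
            and print "@{ $/{list}{item} }\n";  # 1 2 3 13 91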
  • Cleaner nr.gz header rule

    ● The separator syntax cleans things up:
      – No more tail rule with an alias.
      – No code block required to strip the separators and trailing newline.
      – The non-greedy match ".+?" avoids capturing separators.

    qr{
        <nocontext:>
        <extends: Parse::Fasta>

        <[fasta]>+

        <rule: head  > <[ident]>+ % [\cA]
        <token: ident> .+?
    }xm
  • The nested "ident" tag is extraneous

    ● Simpler to replace the "head" with a list of identifiers.
    ● Replace $MATCH from the "head" rule with the nested identifier contents:

    qr{
        <nocontext:>
        <extends: Parse::Fasta>

        <[fasta]>+

        <rule: head > <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

        <token: ident> .+?
    }xm
  • Result:

    {
        fasta =>
        [
            {
                body => 'MASTQNIVEEVQKMLDT...NPDQ',
                head =>
                [
                    'gi|66816243|ref|XP_6...rt=CAF-1',
                    'gi|793761|dbj|BAA0626...oideum]',
                    'gi|60470106|gb|EAL68086...m discoideum AX4]'
                ]
            }
        ]
    }

    ● The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.
    ● Not bad for a dozen lines of grammar with a few lines of code.
  • One more level of structure: idents

    ● Species have <source> | <identifier> pairs followed by a description.
    ● Add a separator clause: % (?: \s* [|] \s* )
      – This can be parsed into a hash something like:

    gi|66816243|ref|XP_642131.1| hypothetical ...

      becomes:

    {
        gi   => 66816243,
        ref  => 'XP_642131.1',
        desc => 'hypothetical...'
    }
  • Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc }
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head  > <[ident]>+ % [\cA]
    <token: ident> <[taxa]>+ % (?: \s* [|] \s* )
    <token: taxa > .+?
  • Result: head with sources, "desc"

    {
        fasta =>
        {
            body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
            head =>
            [
                {
                    desc => '30S ribosomal protein S18 [Lactococ...',
                    gi   => 15674171,
                    ref  => 'NP_268346.1'
                },
                {
                    desc => '30S ribosomal protein S18 [Lactoco...',
                    gi   => 116513137,
                    ref  => 'YP_812044.1'
                },
                ...
  • Balancing R::G with calling code

    ● The regex engine could process all of nr.gz.
      – Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
      – Better approach: <fasta> on single entries, but chunking the input on ">" removes it as a leading character.
      – Making it optional with <.start>? fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process single fasta record in %/
    }
  • Fasta base grammar: 3 lines of code

    qr{
        <grammar: Parse::Fasta>
        <nocontext:>

        <rule: fasta > <.start>? <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

        <rule: head  > .+ <.ws>
        <rule: body  > ( <[seq]> | <.comment> ) <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

        <token: start  > ^ [>]
        <token: comment> ^ [;] .+
        <token: seq    > ^ ( [\n\w-]+ )
    }xm;
  • Extension to Fasta: 6 lines of code

    qr{
        <nocontext:>
        <extends: Parse::Fasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

        <rule: head  > <[ident]>+ % [\cA]
        <rule: ident > <[taxa]>+ % (?: \s* [|] \s* )
        <token: taxa > .+?
    }xm
  • Result: use grammars

    ● Most of the "real" work is done under the hood.
      – Regexp::Grammars does the lexing and basic compilation.
      – Code is only needed for cleanups or re-arranging structs.
    ● Code can simplify your grammar.
      – Too much code makes them hard to maintain.
      – The trick is keeping the balance between simplicity in the grammar and cleanup in the code.
    ● Either way, the result is going to be more maintainable than hardwiring the grammar into code.
  • Aside: KwikFix for Perl v5.18

    ● v5.17 changed how the regex engine handles inline code.
    ● Code that used to be eval-ed in the regex is now compiled up front.
      – This requires "use re 'eval'" and "no strict 'vars'".
      – One for the Perl code, the other for $MATCH and friends.
    ● The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

    ● Look up $^H in perlvar to see how it works.
  • Use Regexp::Grammars

    ● Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
      – Frankly, even if you do have old grammars...
    ● Regexp::Grammars avoids the performance pitfalls of P::RD.
      – It is worth taking the time to learn how to optimize NDF regexen, however.
    ● Or, better yet, use Perl6 grammars, available today in your local copy of Rakudo Perl6.
  • More info on Regexp::Grammars

    ● The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
    ● The ./demo directory has a number of working, if un-annotated, examples.
    ● "perldoc perlre" shows how recursive matching works in v5.10+.
    ● PerlMonks has plenty of good postings.
    ● There is a Perl Review article by brian d foy on recursive matching in Perl 5.10.