Perly Parsers:
Perl-byacc
Parse::Yapp
Parse::RecDescent
Regexp::Grammars
Steven Lembark
Workhorse Computing
lembark@wrkhors.com
Grammars are the guts of compilers
● Compilers convert text from one form to another.
– C compilers convert C source to CPU-specific assembly.
– Databases compile SQL into RDBMS ops.
● Grammars define structure, precedence, valid inputs.
– Realistic ones are often recursive or context-sensitive.
– The complexity in defining grammars led to a variety of tools for defining
them.
– The standard format for a long time has been “BNF”, which is the input to
YACC.
● They are wasted on 'flat text'.
– If “split /\t/” does the job, skip grammars entirely.
The first Yet Another: YACC
● Yet Another Compiler Compiler
– YACC takes in a standard-format grammar structure.
– It processes tokens and their values, organizing the
results according to the grammar into a structure.
● Between the source and YACC is a tokenizer.
– This parses the inputs into individual tokens defined by
the grammar.
– It doesn't know about structure, only breaking the text
stream up into tokens.
Parsing is a pain in the lex
● The real pain is gluing the parser and tokenizer
together.
– Tokenizers deal in the language of patterns.
– Grammars are defined in terms of structure.
● Passing data between them makes for most of the
difficulty.
– One issue is the global yylex call, which makes having
multiple parsers difficult.
– Context-sensitive grammars with multiple sub-
grammars are painful.
The perly way
● Regexen, logic, glue... hmm... been there before.
– The first approach most of us try is lexing with regexen.
– Then add captures and if-blocks or execute (?{code})
blocks inside of each regex.
● The problem is that the grammar is embedded in
your code structure.
– You have to modify the code structure to change the
grammar or its tokens.
– Hubris, maybe, but Truly Lazy it ain't.
– This was the whole reason for developing standard
grammars & their handlers in the first place.
Early Perl Grammar Modules
● These take in a YACC grammar and spit out
compiler code.
● Intentionally looked like YACC:
– Able to re-cycle existing YACC grammar files.
– Benefit from using Perl as a built-in lexer.
– Perl-byacc & Parse::Yapp.
● Good: Recycles knowledge for YACC users.
● Bad: Still not lazy: The grammars are difficult to
maintain and you still have to plug in post-
processing code to deal with the results.
%right '='
%left '-' '+'
%left '*' '/'
%left NEG
%right '^'
%%
input: #empty
| input line { push(@{$_[1]},$_[2]); $_[1] }
;
line: '\n' { $_[1] }
| exp '\n' { print "$_[1]\n" }
| error '\n' { $_[0]->YYErrok }
;
exp: NUM
| VAR { $_[0]->YYData->{VARS}{$_[1]} }
| VAR '=' exp { $_[0]->YYData->{VARS}{$_[1]}=$_[3] }
| exp '+' exp { $_[1] + $_[3] }
| exp '-' exp { $_[1] - $_[3] }
| exp '*' exp { $_[1] * $_[3] }
Example: Parse::Yapp grammar
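● A driver sketch (hypothetical: assumes the grammar above was
compiled with “yapp -m Calc” into Calc.pm; the lexer follows
the style of the Parse::Yapp examples):
use strict;
use warnings;
use Calc;
# Hand-rolled lexer: consumes YYData->{INPUT} in place,
# returning ( TYPE, value ) pairs until the input is empty.
sub lexer
{
    my $parser = shift;
    for( $parser->YYData->{INPUT} )
    {
        s/^[ \t]+//;
        return ( '', undef )  if $_ eq '';
        return ( 'NUM', $1 )  if s/^(\d+(?:\.\d+)?)//;
        return ( 'VAR', $1 )  if s/^([A-Za-z_]\w*)//;
        return ( $1, $1 )     if s/^(.)//s;
    }
}
my $parser = Calc->new;
$parser->YYData->{INPUT} = "a = 1 + 2 * 3\n";
$parser->YYParse
(
    yylex   => \&lexer,
    yyerror => sub { warn "Syntax error\n" },
);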
The Swiss Army Chainsaw
● Parse::RecDescent extended the original BNF
syntax, combining the tokens & handlers.
● Grammars are largely declarative, using OO Perl to
do the heavy lifting.
– OO interface allows multiple, context-sensitive parsers.
– Rules with Perl blocks allow the code to do anything.
– Results can be acquired from a hash, an array, or $1.
– Left and right associative tags simplify messy situations.
Example P::RD
● This is part
of an infix
formula
compiler I
wrote.
● It compiles
equations to
a sequence
of closures.
add_op : '+' | '-' | '%' { $item[ 1 ] }
mult_op : '*' | '/' | '^' { $item[ 1 ] }
add : <leftop: mult add_op mult>
{
compile_binop @{ $item[1] }
}
mult : <leftop: factor mult_op factor>
{
compile_binop @{ $item[1] }
}
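● A self-contained sketch of the same shape (hypothetical rules;
it computes values directly instead of compiling closures):
use strict;
use warnings;
use Parse::RecDescent;
my $grammar = <<'END_GRAMMAR';
expr    : add /\z/  { $item[1] }
add_op  : '+' | '-'
mult_op : '*' | '/'
factor  : /\d+(?:\.\d+)?/
add     : <leftop: mult add_op mult>
          {
            # <leftop:> returns [ operand, op, operand, ... ]
            my @list  = @{ $item[1] };
            my $value = shift @list;
            while( @list )
            {
                my ( $op, $rhs ) = splice @list, 0, 2;
                $value = $op eq '+' ? $value + $rhs : $value - $rhs;
            }
            $value
          }
mult    : <leftop: factor mult_op factor>
          {
            my @list  = @{ $item[1] };
            my $value = shift @list;
            while( @list )
            {
                my ( $op, $rhs ) = splice @list, 0, 2;
                $value = $op eq '*' ? $value * $rhs : $value / $rhs;
            }
            $value
          }
END_GRAMMAR
my $parser = Parse::RecDescent->new( $grammar )
    or die "Bad grammar\n";
# Each top-level rule becomes a method on the parser.
defined( my $value = $parser->expr( '1 + 2 * 3' ) )
    or die "Parse failed\n";
print "$value\n";   # 7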
Just enough rope to shoot yourself...
● The biggest problem: P::RD is sloooooooow.
● Learning curve is perl-ish: shallow and long.
– Unless you really know what all of it does you may not
be able to figure out the pieces.
– Lots of really good docs that most people never read.
● Perly blocks also made it look too much like a job-
dispatcher.
– People used it for a lot of things that are not compilers.
– Good & Bad thing: it really is a compiler.
R.I.P. P::RD
● Supposed to be replaced with Parse::FastDescent.
– Damian dropped work on P::FD for Perl6.
– His goal was to replace the shortcomings of P::RD with
something more complete, and quite a bit faster.
● The result is Perl6 Grammars.
– Declarative syntax extends matching with rules.
– Built into Perl6 as a structure, not an add-on.
– Much faster.
– Not available in Perl5.
Regexp::Grammars
● Perl5 implementation derived from Perl6.
– Back-porting an idea, not the Perl6 syntax.
– Much better performance than P::RD.
● Extends the v5.10 recursive matching syntax,
leveraging the regex engine.
– Most of the speed issues are with regex design, not the
parser itself.
– Simplifies mixing code and matching.
– Single place to get the final results.
– Cleaner syntax with automatic whitespace handling.
Extending regexen
● “use Regexp::Grammars” turns on added syntax.
– block-scoped (avoids collisions with existing code).
● You will probably want to add “xm” or “xs”
– extended syntax avoids whitespace issues.
– multi-line mode (m) simplifies line anchors for line-
oriented parsing.
– single-line mode (s) makes ignoring line-wrap
whitespace largely automatic.
– I use “xm” with explicit “\n” or “\s” matches to span
lines where necessary.
What you get
● The parser is simply a regex-ref.
– You can bless it or have multiple parsers for context
grammars.
● Grammars can reference one another.
– Extending grammars via objects or modules is
straightforward.
● Comfortable for incremental development or
refactoring.
– Largely declarative syntax helps.
– OOP provides inheritance with overrides for rules.
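● For example (a sketch with hypothetical names), per-context
parsers can live in a plain hash of regex-refs:
my %parser =
(
    emerge => $emerge_parser,
    fasta  => $fasta_parser,
);
if( $text =~ $parser{ $context } )
{
    process_results( \%/ );   # hypothetical handler
}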
my $compiler
= do
{
use Regexp::Grammars;
qr
{
<data>
<rule: data > <[text]>+
<rule: text > .+
}xm
};
Example: Creating a compiler
● Context can be
a do-block,
subroutine, or
branch logic.
● “data” is the
entry rule.
● All this does is
read lines into
an array with
automatic ws
handling.
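● Applying the parser is just a regex match (a sketch, with
Data::Dumper to inspect the results):
use Data::Dumper;
my $text = do { local $/; <> };   # slurp the input
if( $text =~ $compiler )
{
    print Dumper \%/;             # data => { text => [ ... ] }
}
else
{
    warn "Input did not match\n";
}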
Results: %/
● The results of parsing are in a tree-hash named %/.
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or
debugging).
– Easy to handle with Data::Dumper.
● The hash has at least one key for the entry rule, plus one
empty key for the input data if context is being saved.
● For example, feeding two lines of a Gentoo emerge
log through the line grammar gives:
{
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
data =>
{
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
text =>
[
'1367874132: Started emerge on: May 06, 2013
21:02:12',
'
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
]
Parsing a few lines of logfile
Getting rid of context
● The empty-keyed values are useful for
development or explicit error messages.
● They also get in the way and can cost a lot of
memory on large inputs.
● You can turn them on and off with <context:> and
<nocontext:> in the rules.
qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <text>+ # oops, left off the []!
<rule: text > .+
}xm;
warn | Repeated subrule <text>+ will only capture its
final match
| (Did you mean <[text]>+ instead?)
|
{
data => {
text => '
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
}
}
You usually want [] with +
{
data =>
{
text => the [text] parses to an array of text
[
'1367874132: Started emerge on: May 06, 2013 21:02:12',
'
1367874132: *** emerge --jobs --autounmask-write –...
],
...
qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <[text]>+
<rule: text > (.+)
}xm;
An array[ref] of text
Breaking up lines
● Each log entry is prefixed with an entry id.
● Parsing the ref_id off the front adds:
<data>
<rule: data > <[line]>+
<rule: line > <ref_id> <[text]>
<token: ref_id > ^(\d+)
<rule: text > .+
line =>
[
{
ref_id => '1367874132',
text => ': Started emerge on: May 06, 2013 21:02:12'
},
…
]
Removing cruft: “ws”
● Be nice to remove the leading “: “ from text lines.
● In this case the “whitespace” needs to include a
colon along with the spaces.
● Whitespace is defined by <ws: … >
<rule: line> <ws:[\s:]+> <ref_id> <text>
{
ref_id => '1367874132',
text => '*** emerge --jobs --autounmask-wr...
}
The '***' prefix means something
● Be nice to know what type of line was being
processed.
● <prefix= regex > assigns the regex's capture to the
“prefix” tag:
<rule: line > <ws:[\s:]*> <ref_id> <entry>
<rule: entry >
<prefix=([*][*][*])> <text>
|
<prefix=([>][>][>])> <text>
|
<prefix=([=][=][=])> <text>
|
<prefix=([:][:][:])> <text>
|
<text>
{
entry => {
text => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
prefix => '***',
text => 'emerge --jobs --autounmask-write...
},
ref_id => '1367874132'
},
{
entry => {
prefix => '>>>',
text => 'emerge (1 of 2) sys-apps/...
},
ref_id => '1367874256'
}
“entry” now contains optional prefix
Aliases can also assign tag results
● Aliases assign a
key to rule
results.
● The match from
“text” is aliased
to a named type
of log entry.
<rule: entry>
<prefix=([*][*][*])> <command=text>
|
<prefix=([>][>][>])> <stage=text>
|
<prefix=([=][=][=])> <status=text>
|
<prefix=([:][:][:])> <final=text>
|
<message=text>
{
entry => {
message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
command => 'emerge --jobs --autounmask-write --...
prefix => '***'
},
ref_id => '1367874132'
},
{
entry => {
command => 'terminating.',
prefix => '***'
},
ref_id => '1367874133'
},
Generic “text” replaced with a type:
Parsing without capturing
● At this point we don't really need the prefix strings
since the entries are labeled.
● A leading '.' tells R::G to parse but not store the
results in %/:
<rule: entry >
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
{
entry => {
message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
command => 'emerge --jobs --autounmask-write -...
},
ref_id => '1367874132'
},
{
entry => {
command => 'terminating.'
},
ref_id => '1367874133'
},
“entry” now has typed keys:
The “entry” nesting gets in the way
● The named subrule is not hard to get rid of: just
move its syntax up one level:
<ws:[\s:]*> <ref_id>
(
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
)
data => {
line => [
{
message => 'Started emerge on: May 06, 2013 21:02:12',
ref_id => '1367874132'
},
{
command => 'emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y --deep
talk',
ref_id => '1367874132'
},
{
command => 'terminating.',
ref_id => '1367874133'
},
{
message => 'Started emerge on: May 06, 2013 21:02:17',
ref_id => '1367874137'
},
Result: array of “line” with ref_id & type
Funny names for things
● Maybe “command” and “status” aren't the best way
to distinguish the text.
● You can store an optional token followed by text:
<rule: entry > <ws:[\s:]*> <ref_id> <type>? <text>
<token: type>
(
[*][*][*]
|
[>][>][>]
|
[=][=][=]
|
[:][:][:]
)
Entries now have “text” and “type”
entry => [
{
ref_id => '1367874132',
text => 'Started emerge on: May 06, 2013 21:02:12'
},
{
ref_id => '1367874133',
text => 'terminating.',
type => '***'
},
{
ref_id => '1367874137',
text => 'Started emerge on: May 06, 2013 21:02:17'
},
{
ref_id => '1367874137',
text => 'emerge --jobs --autounmask-write --...
type => '***'
},
Prefix alternations look ugly.
● Using a count works:
[*]{3} | [>]{3} | [:]{3} | [=]{3}
but isn't all that much more readable.
● Given the way these are used, use a character class:
[*>:=]{3}
qr
{
<nocontext:>
<data>
<rule: data > <[entry]>+
<rule: entry >
<ws:[\s:]*>
<ref_id> <prefix>? <text>
<token: ref_id > ^(\d+)
<token: prefix > [*>=:]{3}
<token: text > .+
}xm;
This is the skeleton parser:
● Doesn't take much:
– Declarative syntax.
– No Perl code at all!
● Easy to modify by
extending the
definition of “text”
for specific types of
messages.
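● A hypothetical driver, assuming the qr{} above is stored
in $parser:
use Data::Dumper;
my $log = do { local $/; <> };    # slurp the emerge log
$log =~ $parser
    or die "Log did not match\n";
print Dumper $/{ data }{ entry }; # array of { ref_id, prefix?, text }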
Finishing the parser
● Given the different line types it will be useful to
extract commands, switches, outcomes from
appropriate lines.
– Sub-rules can be defined for the different line types.
<rule: command> 'emerge'
<.ws><[switch]>+
<token: switch> ([-][-]\S+)
● This is what makes the grammars useful: nested,
context-sensitive content.
Inheriting & Extending Grammars
● <grammar: name> and <extends: name> allow a
building-block approach.
● Code can assemble the contents of a qr{} without
having to eval or deal with messy quoted strings.
● This makes modular or context-sensitive grammars
relatively simple to compose.
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference
or extend it from multiple other parsers.
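● A minimal sketch with hypothetical grammar names:
use Regexp::Grammars;
qr                          # define-only: registers the grammar
{
    <grammar: My::Base>
    <token: number > \d+
}xm;
my $parser = qr             # inherits My::Base's rules
{
    <extends: My::Base>
    <nocontext:>
    <[number]>+ % [,]
}xm;
'1,2,3' =~ $parser;         # $/{ number } is [ 1, 2, 3 ]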
The Non-Redundant File
● NCBI's “nr.gz” file is a list of sequences and all of
the places they are known to appear.
● It is moderately large: 140+GB uncompressed.
● The file consists of a simple FASTA format with
headings separated by ctrl-A characters:
>Heading 1
[amino-acid sequence characters...]
>Heading 2
...
Example: A short nr.gz FASTA entry
● Headings are grouped by species, separated by ctrl-A
(“\cA”) characters.
– Each species has a set of sources & identifier pairs
followed by a single description.
– Within-species separator is a pipe (“|”) with optional
whitespace.
– Species counts in some headers run into the thousands.
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
[Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI
RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1|
calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1|
hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...
VQKLLNPDQ
First step: Parse FASTA
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start> <head> <.ws> <[body]>+
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ [\n\w-]+
}xm;
● Instead of defining an entry rule, this just defines a
name “Parse::Fasta”.
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
The output needs help, however.
● The “<seq>” token captures newlines that need to be
stripped out to get a single string.
● Munging these requires adding code to the parser using
Perl's regex code-block syntax: (?{...})
– Allows inserting almost-arbitrary code into the regex.
– “almost” because the code cannot include regexen.
seq =>
[ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYD
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ
'
]
Munging results: $MATCH
● The $MATCH and %MATCH can be assigned to alter
the results from the current or lower levels of the parse.
● In this case I take the “seq” match contents out of %/,
join them with nothing, and use “tr” to strip the
newlines.
– join + split won't work because split uses a regex.
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
One more step: Remove the arrayref
● Now the body is a single string.
● No need for an arrayref to contain one string.
● Since the body has one entry, assign offset zero:
body =>
[
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
],
<rule: fasta> <.start> <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
Result: a generic FASTA parser.
{
fasta => [
{
body =>
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDIT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
head => 'gi|66816243|ref|XP_642131.1| hypothetical p
rotein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556
|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=C
AF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium
discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]
'
}
]
}
● The head and body are easily accessible.
● Next: parse the nr-specific header.
Deriving a grammar
● Existing grammars are “extended”.
● The derived grammars are capable of producing
results.
● In this case:
● References the grammar and extracts a list of fasta
entries.
<extends: Parse::Fasta>
<[fasta]>+
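● Wrapped into a complete parser (a sketch):
my $nr = do
{
    use Regexp::Grammars;
    qr
    {
        <nocontext:>
        <extends: Parse::Fasta>
        <[fasta]>+
    }xm
};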
Splitting the head into identifiers
● Overloading fasta's “head” rule allows splitting
identifiers for individual species.
● Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using “.+ [\cA\n]” walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar,
but R::G provides a cleaner way.
First pass: Literal “tail” item
● This works but is ugly:
– Have two rules for the main list and tail.
– Alias the tail to get them all in one place.
<rule: head> <[ident]>+ <[ident=final]>
(?{
# remove the matched anchors
tr/\cA\n//d for @{ $MATCH{ ident } };
})
<token: ident > .+? \cA
<token: final > .+ \n
Breaking up the header
● The last header item is aliased to “ident”.
● Breaks up all of the entries:
head => {
ident => [
'gi|66816243|ref|XP_642131.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]',
'gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
Full=Calfumirin-1; Short=CAF-1',
'gi|793761|dbj|BAA06266.1| calfumirin-1
[Dictyostelium discoideum]',
'gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]'
]
}
Dealing with separators: '%' <sep>
● Separators happen often enough:
– 1, 2, 3 , 4 ,13, 91 # numbers by commas, spaces
– g-c-a-g-t-t-a-c-a # characters by dashes
– /usr/local/bin # basenames by dir markers
– /usr:/usr/local:bin # dir's separated by colons
that R::G has special syntax for dealing with them.
● Combining the item with '%' and a separator:
<rule: list> <[item]>+ % <separator> # one-or-more
<rule: list_zom> <[item]>* % <separator> # zero-or-more
Cleaner nr.gz header rule
● Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing
newline.
– Non-greedy match “.+?” avoids capturing separators.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
<token: ident > .+?
}xm
Nested “ident” tag is extraneous
● Simpler to replace the “head” with a list of
identifiers.
● Replace $MATCH from the “head” rule with the
nested identifier contents:
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
(?{
$MATCH = delete $MATCH{ ident };
})
<token: ident > .+?
}xm
Result:
{
fasta => [
{
body => 'MASTQNIVEEVQKMLDT...NPDQ',
head => [
'gi|66816243|ref|XP_6...rt=CAF-1',
'gi|793761|dbj|BAA0626...oideum]',
'gi|60470106|gb|EAL68086...m discoideum AX4]'
]
}
]
}
● The fasta content is broken into the usual “body” plus
a “head” broken down on cA boundaries.
● Not bad for a dozen lines of grammar with a few
lines of code:
One more level of structure: idents.
● Species have <source> | <identifier> pairs followed
by a description.
● Add a separator clause “% (?: \s* [|] \s* )”
– This can be parsed into a hash something like:
gi|66816243|ref|XP_642131.1|hypothetical ...
Becomes:
{
gi => '66816243',
ref => 'XP_642131.1',
desc => 'hypothetical...'
}
Munging the separated input
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc }
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [cA]
<token: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
Result: head with sources, “desc”
{
fasta => {
body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
head => [
{
desc => '30S ribosomal protein S18 [Lactococ...
gi => '15674171',
ref => 'NP_268346.1'
},
{
desc => '30S ribosomal protein S18 [Lactoco...
gi => '116513137',
ref => 'YP_812044.1'
},
...
Balancing R::G with calling code
● The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally
millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input
on '>' removes it as a leading character.
– Making it optional with <.start>? fixes the problem:
local $/ = '>';
while( my $chunk = readline )
{
chomp;
length $chunk or do { --$.; next };
$chunk =~ $nr_gz;
# process single fasta record in %/
}
Fasta base grammar: 3 lines of code
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start>? <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ ( [\n\w-]+ )
}xm;
Extension to Fasta: 6 lines of code.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc };
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [\cA]
<rule: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
}xm
Result: Use grammars
● Most of the “real” work is done under the hood.
– Regexp::Grammars does the lexing, basic compilation.
– Code only needed for cleanups or re-arranging structs.
● Code can simplify your grammar.
– Too much code makes them hard to maintain.
– Trick is keeping the balance between simplicity in the
grammar and cleanup in the code.
● Either way, the result is going to be more
maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18
● v5.17 changed how the regex engine handles inline
code.
● Code that used to be eval-ed in the regex is now
compiled up front.
– This requires “use re 'eval'” and “no strict 'vars'”.
– One for the Perl code, the other for $MATCH and friends.
● The immediate fix for this is in the last few lines of
R::G::import, which push the pragmas into the caller:
● Look up $^H in perlvars to see how it works.
require re; re->import( 'eval' );
require strict; strict->unimport( 'vars' );
Use Regexp::Grammars
● Unless you have old YACC BNF grammars to
convert, the newer facility for defining the
grammars is cleaner.
– Frankly, even if you do have old grammars...
● Regexp::Grammars avoids the performance pitfalls
of P::RD.
– It is worth taking time to learn how to optimize
non-deterministic regexen, however.
● Or, better yet, use Perl6 grammars, available today
at your local copy of Rakudo Perl6.
More info on Regexp::Grammars
● The POD is thorough and quite descriptive
[comfortable chair, enjoyable beverage suggested].
● The ./demo directory has a number of working – if
un-annotated – examples.
● “perldoc perlre” shows how recursive matching works
in v5.10+.
● PerlMonks has plenty of good postings.
● Perl Review article by brian d foy on recursive
matching in Perl 5.10.
More Related Content

What's hot

Hyperledger 구조 분석
Hyperledger 구조 분석Hyperledger 구조 분석
Hyperledger 구조 분석
Jongseok Choi
 
typemap in Perl/XS
typemap in Perl/XS  typemap in Perl/XS
typemap in Perl/XS
charsbar
 
Introduction to PHP 5.3
Introduction to PHP 5.3Introduction to PHP 5.3
Introduction to PHP 5.3guestcc91d4
 
30 Minutes To CPAN
30 Minutes To CPAN30 Minutes To CPAN
30 Minutes To CPAN
daoswald
 
Use perl creating web services with xml rpc
Use perl creating web services with xml rpcUse perl creating web services with xml rpc
Use perl creating web services with xml rpcJohnny Pork
 
Php
PhpPhp
Open Gurukul Language PL/SQL
Open Gurukul Language PL/SQLOpen Gurukul Language PL/SQL
Open Gurukul Language PL/SQL
Open Gurukul
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and Lemon
Tristan Penman
 
10 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.610 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.6
Webline Infosoft P Ltd
 
Perl Basics with Examples
Perl Basics with ExamplesPerl Basics with Examples
Perl Basics with Examples
Nithin Kumar Singani
 
Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014
Hari K T
 
Internet Technology and its Applications
Internet Technology and its ApplicationsInternet Technology and its Applications
Internet Technology and its Applications
amichoksi
 
Introduction to Perl and BioPerl
Introduction to Perl and BioPerlIntroduction to Perl and BioPerl
Introduction to Perl and BioPerl
Bioinformatics and Computational Biosciences Branch
 
Perl Programming - 01 Basic Perl
Perl Programming - 01 Basic PerlPerl Programming - 01 Basic Perl
Perl Programming - 01 Basic Perl
Danairat Thanabodithammachari
 
Perl 5.10 in 2010
Perl 5.10 in 2010Perl 5.10 in 2010
Perl 5.10 in 2010
guest7899f0
 
Modern Perl Catch-Up
Modern Perl Catch-UpModern Perl Catch-Up
Modern Perl Catch-Up
Dave Cross
 
Stop overusing regular expressions!
Stop overusing regular expressions!Stop overusing regular expressions!
Stop overusing regular expressions!
Franklin Chen
 
Php
PhpPhp

What's hot (20)

Hyperledger 구조 분석
Hyperledger 구조 분석Hyperledger 구조 분석
Hyperledger 구조 분석
 
typemap in Perl/XS
typemap in Perl/XS  typemap in Perl/XS
typemap in Perl/XS
 
Introduction to PHP 5.3
Introduction to PHP 5.3Introduction to PHP 5.3
Introduction to PHP 5.3
 
30 Minutes To CPAN
30 Minutes To CPAN30 Minutes To CPAN
30 Minutes To CPAN
 
Use perl creating web services with xml rpc
Use perl creating web services with xml rpcUse perl creating web services with xml rpc
Use perl creating web services with xml rpc
 
Php
PhpPhp
Php
 
Cs3430 lecture 15
Cs3430 lecture 15Cs3430 lecture 15
Cs3430 lecture 15
 
Javascript
JavascriptJavascript
Javascript
 
Open Gurukul Language PL/SQL
Open Gurukul Language PL/SQLOpen Gurukul Language PL/SQL
Open Gurukul Language PL/SQL
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and Lemon
 
10 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.610 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.6
 
Perl Basics with Examples
Perl Basics with ExamplesPerl Basics with Examples
Perl Basics with Examples
 
Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014
 
Internet Technology and its Applications
Internet Technology and its ApplicationsInternet Technology and its Applications
Internet Technology and its Applications
 
Introduction to Perl and BioPerl
Introduction to Perl and BioPerlIntroduction to Perl and BioPerl
Introduction to Perl and BioPerl
 
Perl Programming - 01 Basic Perl
Perl Programming - 01 Basic PerlPerl Programming - 01 Basic Perl
Perl Programming - 01 Basic Perl
 
Perl 5.10 in 2010
Perl 5.10 in 2010Perl 5.10 in 2010
Perl 5.10 in 2010
 
Modern Perl Catch-Up
Modern Perl Catch-UpModern Perl Catch-Up
Modern Perl Catch-Up
 
Stop overusing regular expressions!
Stop overusing regular expressions!Stop overusing regular expressions!
Stop overusing regular expressions!
 
Php
PhpPhp
Php
 

Similar to Perly Parsing with Regexp::Grammars

Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
Ayush Sharma
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
Andrew Lowe
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
jnewmanux
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
Gleicon Moraes
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScript
Jorg Janke
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
Workhorse Computing
 
Functional Smalltalk
Functional SmalltalkFunctional Smalltalk
Functional Smalltalk
ESUG
 
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
digitalwave
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
Michael Renner
 
The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...
PVS-Studio
 
Perl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one linersPerl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one liners
Kirk Kimmel
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.
Poznań Ruby User Group
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
Hyderabad Scalability Meetup
 
7986-lect 7.pdf
7986-lect 7.pdf7986-lect 7.pdf
7986-lect 7.pdf
RiazAhmad521284
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
dwm042
 
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Workhorse Computing
 

Similar to Perly Parsing with Regexp::Grammars (20)

Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
 
Os Wilhelm
Os WilhelmOs Wilhelm
Os Wilhelm
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScript
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
 
Functional Smalltalk
Functional SmalltalkFunctional Smalltalk
Functional Smalltalk
 
JavaScripts & jQuery
JavaScripts & jQueryJavaScripts & jQuery
JavaScripts & jQuery
 
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
 
The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...
 
Perl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one linersPerl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one liners
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
7986-lect 7.pdf
7986-lect 7.pdf7986-lect 7.pdf
7986-lect 7.pdf
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
 

More from Workhorse Computing

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility Modules
Workhorse Computing
 
mro-every.pdf
mro-every.pdfmro-every.pdf
mro-every.pdf
Workhorse Computing
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add Up
Workhorse Computing
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.
Workhorse Computing
 
Unit Testing Lots of Perl
Unit Testing Lots of PerlUnit Testing Lots of Perl
Unit Testing Lots of Perl
Workhorse Computing
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in Posgresql
Workhorse Computing
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!
Workhorse Computing
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command Interpolation
Workhorse Computing
 
Findbin libs
Findbin libsFindbin libs
Findbin libs
Workhorse Computing
 
Memory Manglement in Raku
Memory Manglement in RakuMemory Manglement in Raku
Memory Manglement in Raku
Workhorse Computing
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic Interpolation
Workhorse Computing
 
Effective Benchmarks
Effective BenchmarksEffective Benchmarks
Effective Benchmarks
Workhorse Computing
 
Metadata-driven Testing
Metadata-driven TestingMetadata-driven Testing
Metadata-driven Testing
Workhorse Computing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.
Workhorse Computing
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.
Workhorse Computing
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.
Workhorse Computing
 
Smoking docker
Smoking dockerSmoking docker
Smoking docker
Workhorse Computing
 
Getting Testy With Perl6
Getting Testy With Perl6Getting Testy With Perl6
Getting Testy With Perl6
Workhorse Computing
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Workhorse Computing
 
Neatly folding-a-tree
Neatly folding-a-treeNeatly folding-a-tree
Neatly folding-a-tree
Workhorse Computing
 

More from Workhorse Computing (20)

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility Modules
 
mro-every.pdf
mro-every.pdfmro-every.pdf
mro-every.pdf
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add Up
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.
 
Unit Testing Lots of Perl
Unit Testing Lots of PerlUnit Testing Lots of Perl
Unit Testing Lots of Perl
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in Posgresql
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command Interpolation
 
Findbin libs
Findbin libsFindbin libs
Findbin libs
 
Memory Manglement in Raku
Memory Manglement in RakuMemory Manglement in Raku
Memory Manglement in Raku
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic Interpolation
 
Effective Benchmarks
Effective BenchmarksEffective Benchmarks
Effective Benchmarks
 
Metadata-driven Testing
Metadata-driven TestingMetadata-driven Testing
Metadata-driven Testing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.
 
Smoking docker
Smoking dockerSmoking docker
Smoking docker
 
Getting Testy With Perl6
Getting Testy With Perl6Getting Testy With Perl6
Getting Testy With Perl6
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
 
Neatly folding-a-tree
Neatly folding-a-treeNeatly folding-a-tree
Neatly folding-a-tree
 

Recently uploaded

20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 

Recently uploaded (20)

20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 

Perly Parsing with Regexp::Grammars

  • 2. Grammars are the guts of compilers ● Compilers convert text from one form to another. – C compilers convert C source to CPU-specific assembly. – Databases compile SQL into RDBMS op's. ● Grammars define structure, precedence, valid inputs. – Realistic ones are often recursive or context-sensitive. – The complexity in defining grammars led to a variety of tools for defining them. – The standard format for a long time has been “BNF”, which is the input to YACC. ● They are wasted on for 'flat text'. – If “split /t/” does the job skip grammars entirely.
  • 3. The first Yet Another: YACC ● Yet Another Compiler Compiler – YACC takes in a standard-format grammar structure. – It processes tokens and their values, organizing the results according to the grammar into a structure. ● Between the source and YACC is a tokenizer. – This parses the inputs into individual tokens defined by the grammar. – It doesn't know about structure, only breaking the text stream up into tokens.
  • 4. Parsing is a pain in the lex ● The real pain is gluing the parser and tokenizer together. – Tokenizers deal in the language of patterns. – Grammars are defined in terms of structure. ● Passing data between them makes for most of the difficulty. – One issue is the global yylex call, which makes having multiple parsers difficult. – Context-sensitive grammars with multiple sub- grammars are painful.
  • 5. The perly way ● Regexen, logic, glue... hmm... been there before. – The first approach most of us try is lexing with regexen. – Then add captures and if-blocks or excute (?{code}) blocks inside of each regex. ● The problem is that the grammar is embedded in your code structure. – You have to modify the code structure to change the grammar or its tokens. – Hubris, maybe, but Truly Lazy it ain't. – Was the whole reason for developing standard grammars & their handlers in the first place.
  • 6. Early Perl Grammar Modules ● These take in a YACC grammar and spit out compiler code. ● Intentionally looked like YACC: – Able to re-cycle existing YACC grammar files. – Benefit from using Perl as a built-in lexer. – Perl-byacc & Parse::Yapp. ● Good: Recycles knowledge for YACC users. ● Bad: Still not lazy: The grammars are difficult to maintain and you still have to plug in post- processing code to deal with the results.
  • 7. %right '=' %left '-' '+' %left '*' '/' %left NEG %right '^' %% input: #empty | input line { push(@{$_[1]},$_[2]); $_[1] } ; line: 'n' { $_[1] } | exp 'n' { print "$_[1]n" } | error 'n' { $_[0]->YYErrok } ; exp: NUM | VAR { $_[0]->YYData->{VARS}{$_[1]} } | VAR '=' exp { $_[0]->YYData->{VARS}{$_[1]}=$_[3] } | exp '+' exp { $_[1] + $_[3] } | exp '-' exp { $_[1] - $_[3] } | exp '*' exp { $_[1] * $_[3] } Example: Parse::Yapp grammar
  • 8. The Swiss Army Chainsaw ● Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers. ● Grammars are largely declarative, using OO Perl to do the heavy lifting. – OO interface allows multiple, context sensitive parsers. – Rules with Perl blocks allows the code to do anything. – Results can be acquired from a hash, an array, or $1. – Left, right, associative tags simplify messy situations.
  • 9. Example P::RD ● This is part of an infix formula compiler I wrote. ● It compiles equations to a sequence of closures. add_op : '+' | '-' | '%' { $item[ 1 ] } mult_op : '*' | '/' | '^' { $item[ 1 ] } add : <leftop: mult add_op mult> { compile_binop @{ $item[1] } } mult : <leftop: factor mult_op factor> { compile_binop @{ $item[1] } }
  • 10. Just enough rope to shoot yourself... ● The biggest problem: P::RD is sloooooooowsloooooooow. ● Learning curve is perl-ish: shallow and long. – Unless you really know what all of it does you may not be able to figure out the pieces. – Lots of really good docs that most people never read. ● Perly blocks also made it look too much like a job- dispatcher. – People used it for a lot of things that are not compilers. – Good & Bad thing: it really is a compiler.
  • 11. R.I.P. P::RD ● Supposed to be replaced with Parse::FastDescent. – Damian dropped work on P::FD for Perl6. – His goal was to replace the shortcomings with P::RD with something more complete, and quite a bit faster. ● The result is Perl6 Grammars. – Declarative syntax extends matching with rules. – Built into Perl6 as a structure, not an add-on. – Much faster. – Not available in Perl5
  • 12. Regex::Grammars ● Perl5 implementation derived from Perl6. – Back-porting an idea, not the Perl6 syntax. – Much better performance than P::RD. ● Extends the v5.10 recursive matching syntax, leveraging the regex engine. – Most of the speed issues are with regex design, not the parser itself. – Simplifies mixing code and matching. – Single place to get the final results. – Cleaner syntax with automatic whitespace handling.
  • 13. Extending regexen ● “use Regexp::Grammar” turns on added syntax. – block-scoped (avoids collisions with existing code). ● You will probably want to add “xm” or “xs” – extended syntax avoids whitespace issues. – multi-line mode (m) simplifies line anchors for line- oriented parsing. – single-line mode (s) makes ignoring line-wrap whitespace largely automatic. – I use “xm” with explicit “n” or “s” matches to span lines where necessary.
  • 14. What you get ● The parser is simply a regex-ref. – You can bless it or have multiple parsers for context grammars. ● Grammars can reference one another. – Extending grammars via objects or modules is straightforward. ● Comfortable for incremental development or refactoring. – Largely declarative syntax helps. – OOP provides inheritance with overrides for rules.
  • 15. my $compiler = do { use Regexp::Grammars; qr { <data> <rule: data > <[text]>+ <rule: text > .+ }xm }; Example: Creating a compiler ● Context can be a do-block, subroutine, or branch logic. ● “data” is the entry rule. ● All this does is read lines into an array with automatic ws handling.
  • 16. Results: %/
● The results of parsing are in a tree-hash named %/.
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or debugging).
– Easy to handle with Data::Dumper.
● The hash has at least one key for the entry rule, plus one empty key for the input data if context is being saved.
● For example, feeding two lines of a Gentoo emerge log through the line grammar gives:
  • 17. Parsing a few lines of logfile
{
  '' => '1367874132: Started emerge on: May 06, 2013 21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
  data =>
  {
    '' => '1367874132: Started emerge on: May 06, 2013 21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    text =>
    [
      '1367874132: Started emerge on: May 06, 2013 21:02:12',
      ' 1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
    ]
  }
}
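For reference, a dump like the one above can be produced with a couple of lines. This is a sketch, assuming the $compiler parser from slide 15 and the raw log text in a hypothetical $log_text:

use Data::Dumper;

# A successful match populates the tree-hash %/.
if( $log_text =~ $compiler )
{
    print Dumper \%/;
}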
  • 18. Getting rid of context
● The empty-keyed values are useful for development or explicit error messages.
● They also get in the way and can cost a lot of memory on large inputs.
● You can turn them on and off with <context:> and <nocontext:> in the rules, as in the sketch below.
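For instance, a sketch (using the rule names from the earlier line grammar) that drops context globally but keeps it for one rule where the raw text helps with error messages; whether this split is worth it depends on your input size:

qr
{
    <nocontext:>        # no '' keys anywhere...

    <data>

    <rule: data >   <[line]>+

    <rule: line >
        <context:>      # ...except under each “line”
        <ref_id> <text>

    <token: ref_id >    ^(\d+)
    <rule: text >       .+
}xm;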
  • 19. You usually want [] with +

qr
{
    <nocontext:>            # turn off globally

    <data>

    <rule: data > <text>+   # oops, left off the []!
    <rule: text > .+
}xm;

warn:
    Repeated subrule <text>+ will only capture its final match
    (Did you mean <[text]>+ instead?)

{
  data =>
  {
    text => ' 1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
  }
}
  • 20. An array[ref] of text

qr
{
    <nocontext:>    # turn off globally

    <data>

    <rule: data > <[text]>+
    <rule: text > (.+)
}xm;

{
  data =>
  {
    text =>     # the [text] parses to an array of text
    [
      '1367874132: Started emerge on: May 06, 2013 21:02:12',
      ' 1367874132: *** emerge --jobs --autounmask-write --...
    ],
    ...
  • 21. Breaking up lines
● Each log entry is prefixed with an entry id.
● Parsing the ref_id off the front adds:

<data>

<rule: data >   <[line]>+
<rule: line >   <ref_id> <[text]>

<token: ref_id > ^(\d+)
<rule: text >   .+

line =>
[
  {
    ref_id => '1367874132',
    text => ': Started emerge on: May 06, 2013 21:02:12'
  },
  …
]
  • 22. Removing cruft: “ws”
● It would be nice to remove the leading “: ” from the text lines.
● In this case the “whitespace” needs to include a colon along with the spaces.
● Whitespace is defined by <ws: … >:

<rule: line>    <ws:[\s:]+> <ref_id> <text>

{
  ref_id => '1367874132',
  text => '*** emerge --jobs --autounmask-wr...
}
  • 23. The '***' prefix means something
● It would be nice to know what type of line was being processed.
● <prefix= regex > assigns the regex's capture to the “prefix” tag:

<rule: line >   <ws:[\s:]*> <ref_id> <entry>

<rule: entry >
    <prefix=([*][*][*])> <text>
  | <prefix=([>][>][>])> <text>
  | <prefix=([=][=][=])> <text>
  | <prefix=([:][:][:])> <text>
  | <text>
  • 24. “entry” now contains an optional prefix

{
  entry =>
  {
    text => 'Started emerge on: May 06, 2013 21:02:12'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    prefix => '***',
    text => 'emerge --jobs --autounmask-write...
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    prefix => '>>>',
    text => 'emerge (1 of 2) sys-apps/...
  },
  ref_id => '1367874256'
}
  • 25. Aliases can also assign tag results
● Aliases assign a key to rule results.
● The match from “text” is aliased to a named type of log entry.

<rule: entry>
    <prefix=([*][*][*])> <command=text>
  | <prefix=([>][>][>])> <stage=text>
  | <prefix=([=][=][=])> <status=text>
  | <prefix=([:][:][:])> <final=text>
  | <message=text>
  • 26. Generic “text” replaced with a type:

{
  entry =>
  {
    message => 'Started emerge on: May 06, 2013 21:02:12'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'emerge --jobs --autounmask-write --...
    prefix => '***'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'terminating.',
    prefix => '***'
  },
  ref_id => '1367874133'
},
  • 27. Parsing without capturing
● At this point we don't really need the prefix strings since the entries are labeled.
● A leading '.' tells R::G to parse but not store the results in %/:

<rule: entry >
    <.prefix=([*][*][*])> <command=text>
  | <.prefix=([>][>][>])> <stage=text>
  | <.prefix=([=][=][=])> <status=text>
  | <.prefix=([:][:][:])> <final=text>
  | <message=text>
  • 28. “entry” now has typed keys:

{
  entry =>
  {
    message => 'Started emerge on: May 06, 2013 21:02:12'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'emerge --jobs --autounmask-write -...
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'terminating.'
  },
  ref_id => '1367874133'
},
  • 29. The “entry” nesting gets in the way
● The named subrule is not hard to get rid of: just move its syntax up one level:

<ws:[\s:]*> <ref_id>
(
    <.prefix=([*][*][*])> <command=text>
  | <.prefix=([>][>][>])> <stage=text>
  | <.prefix=([=][=][=])> <status=text>
  | <.prefix=([:][:][:])> <final=text>
  | <message=text>
)
  • 30. Result: array of “line” with ref_id & type

data =>
{
  line =>
  [
    {
      message => 'Started emerge on: May 06, 2013 21:02:12',
      ref_id => '1367874132'
    },
    {
      command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
      ref_id => '1367874132'
    },
    {
      command => 'terminating.',
      ref_id => '1367874133'
    },
    {
      message => 'Started emerge on: May 06, 2013 21:02:17',
      ref_id => '1367874137'
    },
  • 31. Funny names for things
● Maybe “command” and “status” aren't the best way to distinguish the text.
● You can store an optional token followed by text:

<rule: entry >  <ws:[\s:]*> <ref_id> <type>? <text>

<token: type>
(
    [*][*][*]
  | [>][>][>]
  | [=][=][=]
  | [:][:][:]
)
  • 32. Entries now have “text” and “type”

entry =>
[
  {
    ref_id => '1367874132',
    text => 'Started emerge on: May 06, 2013 21:02:12'
  },
  {
    ref_id => '1367874133',
    text => 'terminating.',
    type => '***'
  },
  {
    ref_id => '1367874137',
    text => 'Started emerge on: May 06, 2013 21:02:17'
  },
  {
    ref_id => '1367874137',
    text => 'emerge --jobs --autounmask-write --...
    type => '***'
  },
  • 33. Prefix alternations look ugly
● Using a count works:
    [*]{3} | [>]{3} | [:]{3} | [=]{3}
but isn't all that much more readable.
● Given the way these are used, a character class is simpler:
    [*>:=]{3}
  • 34. This is the skeleton parser
● Doesn't take much:
– Declarative syntax.
– No Perl code at all!
● Easy to modify by extending the definition of “text” for specific types of messages.

qr
{
    <nocontext:>

    <data>

    <rule: data >   <[entry]>+
    <rule: entry >  <ws:[\s:]*> <ref_id> <prefix>? <text>

    <token: ref_id > ^(\d+)
    <token: prefix > [*>=:]{3}
    <token: text >   .+
}xm;
  • 35. Finishing the parser
● Given the different line types it will be useful to extract commands, switches, outcomes from appropriate lines.
– Sub-rules can be defined for the different line types (see the sketch below).

<rule: command> 'emerge' <.ws> <[switch]>+
<token: switch> ([-][-]\S+)

● This is what makes the grammars useful: nested, context-sensitive content.
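A sketch of how that might plug into the skeleton. The alternation and rule layout here are my own illustration, not from the slides; lines that start an emerge command get their switches split out, everything else falls back to raw text:

my $nr_log = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>

        <data>

        <rule: data >    <[entry]>+

        # Try the structured command first, fall back to raw text.
        <rule: entry >   <ws:[\s:]*> <ref_id> <prefix>? ( <command> | <text> )

        <rule: command>  emerge <.ws> <[switch]>+
        <token: switch>  [-][-]\S+

        <token: ref_id > ^(\d+)
        <token: prefix > [*>=:]{3}
        <token: text >   .+
    }xm;
};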
  • 36. Inheriting & Extending Grammars ● <grammar: name> and <extends: name> allow a building-block approach. ● Code can assemble the contents of for a qr{} without having to eval or deal with messy quote strings. ● This makes modular or context-sensitive grammars relatively simple to compose. – References can cross package or module boundaries. – Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
  • 37. The Non-Redundant File
● NCBI's “nr.gz” file is a list of sequences and all of the places they are known to appear.
● It is moderately large: 140+GB uncompressed.
● The file consists of a simple FASTA format with headings separated by ctrl-A characters:

>Heading 1
[amino-acid sequence characters...]
>Heading 2
...
  • 38. Example: A short nr.gz FASTA entry
● Headings are grouped by species, separated by ctrl-A (“\cA”) characters.
– Each species has a set of source & identifier pairs followed by a single description.
– Within-species separator is a pipe (“|”) with optional whitespace.
– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...
VQKLLNPDQ
  • 39. First step: Parse FASTA
● Instead of defining an entry rule, this just defines a name, “Parse::Fasta”.
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.

qr
{
    <grammar: Parse::Fasta>
    <nocontext:>

    <rule: fasta >  <.start> <head> <.ws> <[body]>+
    <rule: head >   .+ <.ws>
    <rule: body >   ( <[seq]> | <.comment> ) <.ws>

    <token: start >     ^ [>]
    <token: comment >   ^ [;] .+
    <token: seq >       ^ [\n\w-]+
}xm;
  • 40. The output needs help, however
● The “<seq>” token captures newlines that need to be stripped out to get a single string.
● Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{...})
– Allows inserting almost-arbitrary code into the regex.
– “almost” because the code cannot include regexen.

seq =>
[
  'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYD
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ
'
]
  • 41. Munging results: $MATCH
● The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
● In this case I take the “seq” match contents out of %/, join them with nothing, and use “tr” to strip the newlines.
– join + split won't work because split uses a regex.

<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})
  • 42. One more step: Remove the arrayref
● Now the body is a single string.
● No need for an arrayref to contain one string.
● Since the body has one entry, assign offset zero:

body =>
[
  'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
],

<rule: fasta> <.start> <head> <.ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})
  • 43. Result: a generic FASTA parser
● The head and body are easily accessible.
● Next: parse the nr-specific header.

{
  fasta =>
  [
    {
      body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
      head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] '
    }
  ]
}
  • 44. Deriving a grammar
● Existing grammars are “extended”.
● The derived grammars are capable of producing results.
● In this case the derived parser references the grammar and extracts a list of fasta entries:

<extends: Parse::Fasta>
<[fasta]>+
  • 45. Splitting the head into identifiers
● Overloading fasta's “head” rule allows splitting identifiers for individual species.
● Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using “.+ [\cA\n]” walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
  • 46. First pass: Literal “tail” item
● This works but is ugly:
– Have two rules for the main list and tail.
– Alias the tail to get them all in one place.

<rule: head>
    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

<token: ident > .+? \cA
<token: final > .+ \n
  • 47. Breaking up the header
● The last header item is aliased to “ident”.
● Breaks up all of the entries:

head =>
{
  ident =>
  [
    'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
    'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
    'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
  ]
}
  • 48. Dealing with separators: % <sep>
● Separators happen often enough:
– 1, 2, 3, 4, 13, 91        # numbers by commas, spaces
– g-c-a-g-t-t-a-c-a         # characters by dashes
– /usr/local/bin            # basenames by dir markers
– /usr:/usr/local:bin       # dir's separated by colons
that R::G has special syntax for dealing with them.
● Combining the item with '%' and a separator:

<rule: list>     <[item]>+ % <separator>    # one-or-more
<rule: list_zom> <[item]>* % <separator>    # zero-or-more
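For example, a small self-contained sketch (the rule names “list” and “item” are my own) that parses a comma-separated list of numbers with the separator syntax:

use Data::Dumper;

my $list_rx = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <list>

        <rule: list >   <[item]>+ % [,]
        <token: item >  \d+
    }xm;
};

'1, 2, 3, 4, 13, 91' =~ $list_rx
    and print Dumper $/{ list }{ item };
# prints an arrayref of the six numbers

Note that “<rule:” (rather than “<token:”) handles the optional whitespace around the commas automatically.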
  • 49. Cleaner nr.gz header rule
● Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match “.+?” avoids capturing separators.

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head >   <[ident]>+ % [\cA]
    <token: ident > .+?
}xm
  • 50. Nested “ident” tag is extraneous
● Simpler to replace the “head” with a list of identifiers.
● Replace $MATCH from the “head” rule with the nested identifier contents:

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head >   <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

    <token: ident > .+?
}xm
  • 51. Result
● The fasta content is broken into the usual “body” plus a “head” broken down on \cA boundaries.
● Not bad for a dozen lines of grammar with a few lines of code.

{
  fasta =>
  [
    {
      body => 'MASTQNIVEEVQKMLDT...NPDQ',
      head =>
      [
        'gi|66816243|ref|XP_6...rt=CAF-1',
        'gi|793761|dbj|BAA0626...oideum]',
        'gi|60470106|gb|EAL68086...m discoideum AX4]'
      ]
    }
  ]
}
  • 52. One more level of structure: idents
● Species have <source> | <identifier> pairs followed by a description.
● Add a separator clause “ % (?: \s* [|] \s* )”.
– This can be parsed into a hash something like:

gi|66816243|ref|XP_642131.1|hypothetical ...

Becomes:

{
  gi => '66816243',
  ref => 'XP_642131.1',
  desc => 'hypothetical...'
}
  • 53. Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc }
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head >   <[ident]>+ % [\cA]
<token: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa >  .+?
  • 54. Result: head with sources, “desc”

{
  fasta =>
  {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
    head =>
    [
      {
        desc => '30S ribosomal protein S18 [Lactococ...
        gi   => '15674171',
        ref  => 'NP_268346.1'
      },
      {
        desc => '30S ribosomal protein S18 [Lactoco...
        gi   => '116513137',
        ref  => 'YP_812044.1'
      },
      ...
  • 55. Balancing R::G with calling code
● The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input on '>' removes it as a leading character.
– Making it optional with <.start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;

    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
  • 56. Fasta base grammar: 3 lines of code

qr
{
    <grammar: Parse::Fasta>
    <nocontext:>

    <rule: fasta >  <.start>? <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

    <rule: head >   .+ <.ws>

    <rule: body >   ( <[seq]> | <.comment> ) <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

    <token: start >     ^ [>]
    <token: comment >   ^ [;] .+
    <token: seq >       ^ ( [\n\w-]+ )
}xm;
  • 57. Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head >   <[ident]>+ % [\cA]
    <rule: ident >  <[taxa]>+ % (?: \s* [|] \s* )
    <token: taxa >  .+?
}xm
  • 58. Result: Use grammars
● Most of the “real” work is done under the hood.
– Regexp::Grammars does the lexing and basic compilation.
– Code is only needed for cleanups or re-arranging structs.
● Code can simplify your grammar.
– Too much code makes them hard to maintain.
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.
● Either way, the result is going to be more maintainable than hardwiring the grammar into code.
  • 59. Aside: KwikFix for Perl v5.18
● v5.17 changed how the regex engine handles inline code.
● Code that used to be eval-ed in the regex is now compiled up front.
– This requires “use re 'eval'” and “no strict 'vars'”.
– One for the Perl code, the other for $MATCH and friends.
● The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

require re;
re->import( 'eval' );

require strict;
strict->unimport( 'vars' );

● Look up $^H in perlvar to see how it works.
  • 60. Use Regexp::Grammars
● Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
– Frankly, even if you do have old grammars...
● Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize NFA regexen, however.
● Or, better yet, use Perl6 grammars, available today in your local copy of Rakudo Perl6.
  • 61. More info on Regexp::Grammars
● The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
● The ./demo directory has a number of working – if un-annotated – examples.
● “perldoc perlre” shows how recursive matching works in v5.10+.
● PerlMonks has plenty of good postings.
● Perl Review article by brian d foy on recursive matching in Perl 5.10.