Perly Parsers:
Perl-byacc
Parse::Yapp
Parse::RecDescent
Regexp::Grammars
Steven Lembark
Workhorse Computing
lembark@wrkhors.com
Grammars are the guts of compilers
● Compilers convert text from one form to another.
– C compilers convert C source to CPU-specific assembly.
– Databases compile SQL into RDBMS ops.
● Grammars define structure, precedence, valid inputs.
– Realistic ones are often recursive or context-sensitive.
– The complexity in defining grammars led to a variety of tools for defining
them.
– The standard format for a long time has been “BNF”, which is the input to
YACC.
● They are wasted on 'flat text'.
– If “split /\t/” does the job, skip grammars entirely.
The first Yet Another: YACC
● Yet Another Compiler Compiler
– YACC takes in a standard-format grammar structure.
– It processes tokens and their values, organizing the
results according to the grammar into a structure.
● Between the source and YACC is a tokenizer.
– This parses the inputs into individual tokens defined by
the grammar.
– It doesn't know about structure, only breaking the text
stream up into tokens.
Parsing is a pain in the lex
● The real pain is gluing the parser and tokenizer
together.
– Tokenizers deal in the language of patterns.
– Grammars are defined in terms of structure.
● Passing data between them makes for most of the
difficulty.
– One issue is the global yylex call, which makes having
multiple parsers difficult.
– Context-sensitive grammars with multiple sub-
grammars are painful.
The perly way
● Regexen, logic, glue... hmm... been there before.
– The first approach most of us try is lexing with regexen.
– Then add captures and if-blocks or execute (?{code})
blocks inside of each regex.
● The problem is that the grammar is embedded in
your code structure.
– You have to modify the code structure to change the
grammar or its tokens.
– Hubris, maybe, but Truly Lazy it ain't.
– This was the whole reason for developing standard
grammars & their handlers in the first place.
Early Perl Grammar Modules
● These take in a YACC grammar and spit out
compiler code.
● Intentionally looked like YACC:
– Able to re-cycle existing YACC grammar files.
– Benefit from using Perl as a built-in lexer.
– Perl-byacc & Parse::Yapp.
● Good: Recycles knowledge for YACC users.
● Bad: Still not lazy: The grammars are difficult to
maintain and you still have to plug in post-
processing code to deal with the results.
%right '='
%left '-' '+'
%left '*' '/'
%left NEG
%right '^'
%%
input: #empty
| input line { push(@{$_[1]},$_[2]); $_[1] }
;
line: '\n' { $_[1] }
| exp '\n' { print "$_[1]\n" }
| error '\n' { $_[0]->YYErrok }
;
exp: NUM
| VAR { $_[0]->YYData->{VARS}{$_[1]} }
| VAR '=' exp { $_[0]->YYData->{VARS}{$_[1]}=$_[3] }
| exp '+' exp { $_[1] + $_[3] }
| exp '-' exp { $_[1] - $_[3] }
| exp '*' exp { $_[1] * $_[3] }
Example: Parse::Yapp grammar
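● A driver sketch (hypothetical: assumes the grammar above was
compiled with “yapp -m Calc” into Calc.pm; the lexer follows
the style of the Parse::Yapp examples):
use strict;
use warnings;
use Calc;
# Hand-rolled lexer: consumes YYData->{INPUT} in place,
# returning ( TYPE, value ) pairs until the input is empty.
sub lexer
{
    my $parser = shift;
    for( $parser->YYData->{INPUT} )
    {
        s/^[ \t]+//;
        return ( '', undef )  if $_ eq '';
        return ( 'NUM', $1 )  if s/^(\d+(?:\.\d+)?)//;
        return ( 'VAR', $1 )  if s/^([A-Za-z_]\w*)//;
        return ( $1, $1 )     if s/^(.)//s;
    }
}
my $parser = Calc->new;
$parser->YYData->{INPUT} = "a = 1 + 2 * 3\n";
$parser->YYParse
(
    yylex   => \&lexer,
    yyerror => sub { warn "Syntax error\n" },
);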
The Swiss Army Chainsaw
● Parse::RecDescent extended the original BNF
syntax, combining the tokens & handlers.
● Grammars are largely declarative, using OO Perl to
do the heavy lifting.
– OO interface allows multiple, context-sensitive parsers.
– Rules with Perl blocks allow the code to do anything.
– Results can be acquired from a hash, an array, or $1.
– Left and right associative tags simplify messy situations.
Example P::RD
● This is part
of an infix
formula
compiler I
wrote.
● It compiles
equations to
a sequence
of closures.
add_op : '+' | '-' | '%' { $item[ 1 ] }
mult_op : '*' | '/' | '^' { $item[ 1 ] }
add : <leftop: mult add_op mult>
{
compile_binop @{ $item[1] }
}
mult : <leftop: factor mult_op factor>
{
compile_binop @{ $item[1] }
}
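● A self-contained sketch of the same shape (hypothetical rules;
it computes values directly instead of compiling closures):
use strict;
use warnings;
use Parse::RecDescent;
my $grammar = <<'END_GRAMMAR';
expr    : add /\z/  { $item[1] }
add_op  : '+' | '-'
mult_op : '*' | '/'
factor  : /\d+(?:\.\d+)?/
add     : <leftop: mult add_op mult>
          {
            # <leftop:> returns [ operand, op, operand, ... ]
            my @list  = @{ $item[1] };
            my $value = shift @list;
            while( @list )
            {
                my ( $op, $rhs ) = splice @list, 0, 2;
                $value = $op eq '+' ? $value + $rhs : $value - $rhs;
            }
            $value
          }
mult    : <leftop: factor mult_op factor>
          {
            my @list  = @{ $item[1] };
            my $value = shift @list;
            while( @list )
            {
                my ( $op, $rhs ) = splice @list, 0, 2;
                $value = $op eq '*' ? $value * $rhs : $value / $rhs;
            }
            $value
          }
END_GRAMMAR
my $parser = Parse::RecDescent->new( $grammar )
    or die "Bad grammar\n";
# Each top-level rule becomes a method on the parser.
defined( my $value = $parser->expr( '1 + 2 * 3' ) )
    or die "Parse failed\n";
print "$value\n";   # 7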
Just enough rope to shoot yourself...
● The biggest problem: P::RD is sloooooooow.
● Learning curve is perl-ish: shallow and long.
– Unless you really know what all of it does you may not
be able to figure out the pieces.
– Lots of really good docs that most people never read.
● Perly blocks also made it look too much like a job-
dispatcher.
– People used it for a lot of things that are not compilers.
– Good & Bad thing: it really is a compiler.
R.I.P. P::RD
● Supposed to be replaced with Parse::FastDescent.
– Damian dropped work on P::FD for Perl6.
– His goal was to replace the shortcomings of P::RD with
something more complete, and quite a bit faster.
● The result is Perl6 Grammars.
– Declarative syntax extends matching with rules.
– Built into Perl6 as a structure, not an add-on.
– Much faster.
– Not available in Perl5.
Regexp::Grammars
● Perl5 implementation derived from Perl6.
– Back-porting an idea, not the Perl6 syntax.
– Much better performance than P::RD.
● Extends the v5.10 recursive matching syntax,
leveraging the regex engine.
– Most of the speed issues are with regex design, not the
parser itself.
– Simplifies mixing code and matching.
– Single place to get the final results.
– Cleaner syntax with automatic whitespace handling.
Extending regexen
● “use Regexp::Grammars” turns on added syntax.
– block-scoped (avoids collisions with existing code).
● You will probably want to add “xm” or “xs”
– extended syntax avoids whitespace issues.
– multi-line mode (m) simplifies line anchors for line-
oriented parsing.
– single-line mode (s) makes ignoring line-wrap
whitespace largely automatic.
– I use “xm” with explicit “\n” or “\s” matches to span
lines where necessary.
What you get
● The parser is simply a regex-ref.
– You can bless it or have multiple parsers for context
grammars.
● Grammars can reference one another.
– Extending grammars via objects or modules is
straightforward.
● Comfortable for incremental development or
refactoring.
– Largely declarative syntax helps.
– OOP provides inheritance with overrides for rules.
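● For example (a sketch with hypothetical names), per-context
parsers can live in a plain hash of regex-refs:
my %parser =
(
    emerge => $emerge_parser,
    fasta  => $fasta_parser,
);
if( $text =~ $parser{ $context } )
{
    process_results( \%/ );   # hypothetical handler
}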
my $compiler
= do
{
use Regexp::Grammars;
qr
{
<data>
<rule: data > <[text]>+
<rule: text > .+
}xm
};
Example: Creating a compiler
● Context can be
a do-block,
subroutine, or
branch logic.
● “data” is the
entry rule.
● All this does is
read lines into
an array with
automatic ws
handling.
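● Applying the parser is just a regex match (a sketch, with
Data::Dumper to inspect the results):
use Data::Dumper;
my $text = do { local $/; <> };   # slurp the input
if( $text =~ $compiler )
{
    print Dumper \%/;             # data => { text => [ ... ] }
}
else
{
    warn "Input did not match\n";
}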
Results: %/
● The results of parsing are in a tree-hash named %/.
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or
debugging).
– Easy to handle with Data::Dumper.
● The hash has at least one key for the entry rule, plus one
empty key for the input data if context is being saved.
● For example, feeding two lines of a Gentoo emerge
log through the line grammar gives:
{
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
data =>
{
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
text =>
[
'1367874132: Started emerge on: May 06, 2013
21:02:12',
'
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
]
Parsing a few lines of logfile
Getting rid of context
● The empty-keyed values are useful for
development or explicit error messages.
● They also get in the way and can cost a lot of
memory on large inputs.
● You can turn them on and off with <context:> and
<nocontext:> in the rules.
qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <text>+ # oops, left off the []!
<rule: text > .+
}xm;
warn | Repeated subrule <text>+ will only capture its
final match
| (Did you mean <[text]>+ instead?)
|
{
data => {
text => '
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
}
}
You usually want [] with +
{
data =>
{
text => the [text] parses to an array of text
[
'1367874132: Started emerge on: May 06, 2013 21:02:12',
'
1367874132: *** emerge --jobs --autounmask-write –...
],
...
qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <[text]>+
<rule: text > (.+)
}xm;
An array[ref] of text
Breaking up lines
● Each log entry is prefixed with an entry id.
● Parsing the ref_id off the front adds:
<data>
<rule: data > <[line]>+
<rule: line > <ref_id> <[text]>
<token: ref_id > ^(\d+)
<rule: text > .+
line =>
[
{
ref_id => '1367874132',
text => ': Started emerge on: May 06, 2013 21:02:12'
},
…
]
Removing cruft: “ws”
● Be nice to remove the leading “: “ from text lines.
● In this case the “whitespace” needs to include a
colon along with the spaces.
● Whitespace is defined by <ws: … >
<rule: line> <ws:[\s:]+> <ref_id> <text>
{
ref_id => '1367874132',
text => '*** emerge --jobs --autounmask-wr...
}
The '***' prefix means something
● Be nice to know what type of line was being
processed.
● <prefix= regex > assigns the regex's capture to the
“prefix” tag:
<rule: line > <ws:[\s:]*> <ref_id> <entry>
<rule: entry >
<prefix=([*][*][*])> <text>
|
<prefix=([>][>][>])> <text>
|
<prefix=([=][=][=])> <text>
|
<prefix=([:][:][:])> <text>
|
<text>
{
entry => {
text => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
prefix => '***',
text => 'emerge --jobs --autounmask-write...
},
ref_id => '1367874132'
},
{
entry => {
prefix => '>>>',
text => 'emerge (1 of 2) sys-apps/...
},
ref_id => '1367874256'
}
“entry” now contains optional prefix
Aliases can also assign tag results
● Aliases assign a
key to rule
results.
● The match from
“text” is aliased
to a named type
of log entry.
<rule: entry>
<prefix=([*][*][*])> <command=text>
|
<prefix=([>][>][>])> <stage=text>
|
<prefix=([=][=][=])> <status=text>
|
<prefix=([:][:][:])> <final=text>
|
<message=text>
{
entry => {
message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
command => 'emerge --jobs --autounmask-write --...
prefix => '***'
},
ref_id => '1367874132'
},
{
entry => {
command => 'terminating.',
prefix => '***'
},
ref_id => '1367874133'
},
Generic “text” replaced with a type:
Parsing without capturing
● At this point we don't really need the prefix strings
since the entries are labeled.
● A leading '.' tells R::G to parse but not store the
results in %/:
<rule: entry >
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
{
entry => {
message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
command => 'emerge --jobs --autounmask-write -...
},
ref_id => '1367874132'
},
{
entry => {
command => 'terminating.'
},
ref_id => '1367874133'
},
“entry” now has typed keys:
The “entry” nesting gets in the way
● The named subrule is not hard to get rid of: just
move its syntax up one level:
<ws:[\s:]*> <ref_id>
(
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
)
data => {
line => [
{
message => 'Started emerge on: May 06, 2013 21:02:12',
ref_id => '1367874132'
},
{
command => 'emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y --deep
talk',
ref_id => '1367874132'
},
{
command => 'terminating.',
ref_id => '1367874133'
},
{
message => 'Started emerge on: May 06, 2013 21:02:17',
ref_id => '1367874137'
},
Result: array of “line” with ref_id & type
Funny names for things
● Maybe “command” and “status” aren't the best way
to distinguish the text.
● You can store an optional token followed by text:
<rule: entry > <ws:[\s:]*> <ref_id> <type>? <text>
<token: type>
(
[*][*][*]
|
[>][>][>]
|
[=][=][=]
|
[:][:][:]
)
Entries now have “text” and “type”
entry => [
{
ref_id => '1367874132',
text => 'Started emerge on: May 06, 2013 21:02:12'
},
{
ref_id => '1367874133',
text => 'terminating.',
type => '***'
},
{
ref_id => '1367874137',
text => 'Started emerge on: May 06, 2013 21:02:17'
},
{
ref_id => '1367874137',
text => 'emerge --jobs --autounmask-write --...
type => '***'
},
Prefix alternations look ugly.
● Using a count works:
[*]{3} | [>]{3} | [:]{3} | [=]{3}
but isn't all that much more readable.
● Given the way these are used, use a character class:
[*>:=]{3}
qr
{
<nocontext:>
<data>
<rule: data > <[entry]>+
<rule: entry >
<ws:[\s:]*>
<ref_id> <prefix>? <text>
<token: ref_id > ^(\d+)
<token: prefix > [*>=:]{3}
<token: text > .+
}xm;
This is the skeleton parser:
● Doesn't take much:
– Declarative syntax.
– No Perl code at all!
● Easy to modify by
extending the
definition of “text”
for specific types of
messages.
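● A hypothetical driver, assuming the qr{} above is stored
in $parser:
use Data::Dumper;
my $log = do { local $/; <> };    # slurp the emerge log
$log =~ $parser
    or die "Log did not match\n";
print Dumper $/{ data }{ entry }; # array of { ref_id, prefix?, text }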
Finishing the parser
● Given the different line types it will be useful to
extract commands, switches, outcomes from
appropriate lines.
– Sub-rules can be defined for the different line types.
<rule: command> 'emerge'
<.ws><[switch]>+
<token: switch> ([-][-]\S+)
● This is what makes the grammars useful: nested,
context-sensitive content.
Inheriting & Extending Grammars
● <grammar: name> and <extends: name> allow a
building-block approach.
● Code can assemble the contents of a qr{} without
having to eval or deal with messy quoted strings.
● This makes modular or context-sensitive grammars
relatively simple to compose.
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference
or extend it from multiple other parsers.
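● A minimal sketch with hypothetical grammar names:
use Regexp::Grammars;
qr                          # define-only: registers the grammar
{
    <grammar: My::Base>
    <token: number > \d+
}xm;
my $parser = qr             # inherits My::Base's rules
{
    <extends: My::Base>
    <nocontext:>
    <[number]>+ % [,]
}xm;
'1,2,3' =~ $parser;         # $/{ number } is [ 1, 2, 3 ]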
The Non-Redundant File
● NCBI's “nr.gz” file is a list of sequences and all of
the places they are known to appear.
● It is moderately large: 140+GB uncompressed.
● The file consists of a simple FASTA format with
headings separated by ctrl-A characters:
>Heading 1
[amino-acid sequence characters...]
>Heading 2
...
Example: A short nr.gz FASTA entry
● Headings are grouped by species, separated by ctrl-A
(“\cA”) characters.
– Each species has a set of sources & identifier pairs
followed by a single description.
– Within-species separator is a pipe (“|”) with optional
whitespace.
– Species counts in some headers run into the thousands.
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
[Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI
RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1|
calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1|
hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...
VQKLLNPDQ
First step: Parse FASTA
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start> <head> <.ws> <[body]>+
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ [\n\w-]+
}xm;
● Instead of defining an entry rule, this just defines a
name “Parse::Fasta”.
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
The output needs help, however.
● The “<seq>” token captures newlines that need to be
stripped out to get a single string.
● Munging these requires adding code to the parser using
Perl's regex code-block syntax: (?{...})
– Allows inserting almost-arbitrary code into the regex.
– “almost” because the code cannot include regexen.
seq =>
[ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYD
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ
'
]
Munging results: $MATCH
● The $MATCH and %MATCH can be assigned to alter
the results from the current or lower levels of the parse.
● In this case I take the “seq” match contents out of %/,
join them with nothing, and use “tr” to strip the
newlines.
– join + split won't work because split uses a regex.
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
One more step: Remove the arrayref
● Now the body is a single string.
● No need for an arrayref to contain one string.
● Since the body has one entry, assign offset zero:
body =>
[
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
],
<rule: fasta> <.start> <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
Result: a generic FASTA parser.
{
fasta => [
{
body =>
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDIT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
head => 'gi|66816243|ref|XP_642131.1| hypothetical p
rotein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556
|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=C
AF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium
discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]
'
}
]
}
● The head and body are easily accessible.
● Next: parse the nr-specific header.
Deriving a grammar
● Existing grammars are “extended”.
● The derived grammars are capable of producing
results.
● In this case:
● References the grammar and extracts a list of fasta
entries.
<extends: Parse::Fasta>
<[fasta]>+
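● Wrapped into a complete parser (a sketch):
my $nr = do
{
    use Regexp::Grammars;
    qr
    {
        <nocontext:>
        <extends: Parse::Fasta>
        <[fasta]>+
    }xm
};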
Splitting the head into identifiers
● Overloading fasta's “head” rule allows splitting
identifiers for individual species.
● Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using “.+ [\cA\n]” walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar,
but R::G provides a cleaner way.
First pass: Literal “tail” item
● This works but is ugly:
– Have two rules for the main list and tail.
– Alias the tail to get them all in one place.
<rule: head> <[ident]>+ <[ident=final]>
(?{
# remove the matched anchors
tr/\cA\n//d for @{ $MATCH{ ident } };
})
<token: ident > .+? \cA
<token: final > .+ \n
Breaking up the header
● The last header item is aliased to “ident”.
● Breaks up all of the entries:
head => {
ident => [
'gi|66816243|ref|XP_642131.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]',
'gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
Full=Calfumirin-1; Short=CAF-1',
'gi|793761|dbj|BAA06266.1| calfumirin-1
[Dictyostelium discoideum]',
'gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]'
]
}
Dealing with separators: '%' <sep>
● Separators happen often enough:
– 1, 2, 3 , 4 ,13, 91 # numbers by commas, spaces
– g-c-a-g-t-t-a-c-a # characters by dashes
– /usr/local/bin # basenames by dir markers
– /usr:/usr/local:bin # dir's separated by colons
that R::G has special syntax for dealing with them.
● Combining the item with '%' and a separator:
<rule: list> <[item]>+ % <separator> # one-or-more
<rule: list_zom> <[item]>* % <separator> # zero-or-more
Cleaner nr.gz header rule
● Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing
newline.
– Non-greedy match “.+?” avoids capturing separators.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
<token: ident > .+?
}xm
Nested “ident” tag is extraneous
● Simpler to replace the “head” with a list of
identifiers.
● Replace $MATCH from the “head” rule with the
nested identifier contents:
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
(?{
$MATCH = delete $MATCH{ ident };
})
<token: ident > .+?
}xm
Result:
{
fasta => [
{
body => 'MASTQNIVEEVQKMLDT...NPDQ',
head => [
'gi|66816243|ref|XP_6...rt=CAF-1',
'gi|793761|dbj|BAA0626...oideum]',
'gi|60470106|gb|EAL68086...m discoideum AX4]'
]
}
]
}
● The fasta content is broken into the usual “body” plus
a “head” broken down on cA boundaries.
● Not bad for a dozen lines of grammar with a few
lines of code:
One more level of structure: idents.
● Species have <source> | <identifier> pairs followed
by a description.
● Add a separator clause “% (?: \s* [|] \s* )”
– This can be parsed into a hash something like:
gi|66816243|ref|XP_642131.1|hypothetical ...
Becomes:
{
gi => '66816243',
ref => 'XP_642131.1',
desc => 'hypothetical...'
}
Munging the separated input
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc }
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [cA]
<token: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
Result: head with sources, “desc”
{
fasta => {
body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
head => [
{
desc => '30S ribosomal protein S18 [Lactococ...
gi => '15674171',
ref => 'NP_268346.1'
},
{
desc => '30S ribosomal protein S18 [Lactoco...
gi => '116513137',
ref => 'YP_812044.1'
},
...
Balancing R::G with calling code
● The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally
millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input
on '>' removes it as a leading character.
– Making it optional with <.start>? fixes the problem:
local $/ = '>';
while( my $chunk = readline )
{
chomp;
length $chunk or do { --$.; next };
$chunk =~ $nr_gz;
# process single fasta record in %/
}
Fasta base grammar: 3 lines of code
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start>? <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ ( [\n\w-]+ )
}xm;
Extension to Fasta: 6 lines of code.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc };
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [\cA]
<rule: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
}xm
Result: Use grammars
● Most of the “real” work is done under the hood.
– Regexp::Grammars does the lexing, basic compilation.
– Code only needed for cleanups or re-arranging structs.
● Code can simplify your grammar.
– Too much code makes them hard to maintain.
– Trick is keeping the balance between simplicity in the
grammar and cleanup in the code.
● Either way, the result is going to be more
maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18
● v5.17 changed how the regex engine handles inline
code.
● Code that used to be eval-ed in the regex is now
compiled up front.
– This requires “use re 'eval'” and “no strict 'vars'”.
– One for the Perl code, the other for $MATCH and friends.
● The immediate fix for this is in the last few lines of
R::G::import, which push the pragmas into the caller:
● Look up $^H in perlvars to see how it works.
require re; re->import( 'eval' );
require strict; strict->unimport( 'vars' );
Use Regexp::Grammars
● Unless you have old YACC BNF grammars to
convert, the newer facility for defining the
grammars is cleaner.
– Frankly, even if you do have old grammars...
● Regexp::Grammars avoids the performance pitfalls
of P::RD.
– It is worth taking time to learn how to optimize
non-deterministic regexen, however.
● Or, better yet, use Perl6 grammars, available today
at your local copy of Rakudo Perl6.
More info on Regexp::Grammars
● The POD is thorough and quite descriptive
[comfortable chair, enjoyable beverage suggested].
● The ./demo directory has a number of working – if
un-annotated – examples.
● “perldoc perlre” shows how recursive matching works
in v5.10+.
● PerlMonks has plenty of good postings.
● Perl Review article by brian d foy on recursive
matching in Perl 5.10.
More Related Content

What's hot

Hyperledger 구조 분석
Hyperledger 구조 분석Hyperledger 구조 분석
Hyperledger 구조 분석
Jongseok Choi
 
typemap in Perl/XS
typemap in Perl/XS  typemap in Perl/XS
typemap in Perl/XS
charsbar
 
Introduction to PHP 5.3
Introduction to PHP 5.3Introduction to PHP 5.3
Introduction to PHP 5.3guestcc91d4
 
30 Minutes To CPAN
30 Minutes To CPAN30 Minutes To CPAN
30 Minutes To CPAN
daoswald
 
Use perl creating web services with xml rpc
Use perl creating web services with xml rpcUse perl creating web services with xml rpc
Use perl creating web services with xml rpcJohnny Pork
 
Php
PhpPhp
Open Gurukul Language PL/SQL
Open Gurukul Language PL/SQLOpen Gurukul Language PL/SQL
Open Gurukul Language PL/SQL
Open Gurukul
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and Lemon
Tristan Penman
 
10 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.610 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.6
Webline Infosoft P Ltd
 
Perl Basics with Examples
Perl Basics with ExamplesPerl Basics with Examples
Perl Basics with Examples
Nithin Kumar Singani
 
Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014
Hari K T
 
Internet Technology and its Applications
Internet Technology and its ApplicationsInternet Technology and its Applications
Internet Technology and its Applications
amichoksi
 
Introduction to Perl and BioPerl
Introduction to Perl and BioPerlIntroduction to Perl and BioPerl
Introduction to Perl and BioPerl
Bioinformatics and Computational Biosciences Branch
 
Perl Programming - 01 Basic Perl
Perl Programming - 01 Basic PerlPerl Programming - 01 Basic Perl
Perl Programming - 01 Basic Perl
Danairat Thanabodithammachari
 
Perl 5.10 in 2010
Perl 5.10 in 2010Perl 5.10 in 2010
Perl 5.10 in 2010
guest7899f0
 
Modern Perl Catch-Up
Modern Perl Catch-UpModern Perl Catch-Up
Modern Perl Catch-Up
Dave Cross
 
Stop overusing regular expressions!
Stop overusing regular expressions!Stop overusing regular expressions!
Stop overusing regular expressions!
Franklin Chen
 
Php
PhpPhp

What's hot (20)

Hyperledger 구조 분석
Hyperledger 구조 분석Hyperledger 구조 분석
Hyperledger 구조 분석
 
typemap in Perl/XS
typemap in Perl/XS  typemap in Perl/XS
typemap in Perl/XS
 
Introduction to PHP 5.3
Introduction to PHP 5.3Introduction to PHP 5.3
Introduction to PHP 5.3
 
30 Minutes To CPAN
30 Minutes To CPAN30 Minutes To CPAN
30 Minutes To CPAN
 
Use perl creating web services with xml rpc
Use perl creating web services with xml rpcUse perl creating web services with xml rpc
Use perl creating web services with xml rpc
 
Php
PhpPhp
Php
 
Cs3430 lecture 15
Cs3430 lecture 15Cs3430 lecture 15
Cs3430 lecture 15
 
Javascript
JavascriptJavascript
Javascript
 
Open Gurukul Language PL/SQL
Open Gurukul Language PL/SQLOpen Gurukul Language PL/SQL
Open Gurukul Language PL/SQL
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and Lemon
 
10 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.610 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.6
 
Perl Basics with Examples
Perl Basics with ExamplesPerl Basics with Examples
Perl Basics with Examples
 
Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014
 
Internet Technology and its Applications
Internet Technology and its ApplicationsInternet Technology and its Applications
Internet Technology and its Applications
 
Introduction to Perl and BioPerl
Introduction to Perl and BioPerlIntroduction to Perl and BioPerl
Introduction to Perl and BioPerl
 
Perl Programming - 01 Basic Perl
Perl Programming - 01 Basic PerlPerl Programming - 01 Basic Perl
Perl Programming - 01 Basic Perl
 
Perl 5.10 in 2010
Perl 5.10 in 2010Perl 5.10 in 2010
Perl 5.10 in 2010
 
Modern Perl Catch-Up
Modern Perl Catch-UpModern Perl Catch-Up
Modern Perl Catch-Up
 
Stop overusing regular expressions!
Stop overusing regular expressions!Stop overusing regular expressions!
Stop overusing regular expressions!
 
Php
PhpPhp
Php
 

Similar to Perly Parsing with Regexp::Grammars

Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
Ayush Sharma
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
Andrew Lowe
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
jnewmanux
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
Gleicon Moraes
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScript
Jorg Janke
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
Workhorse Computing
 
Functional Smalltalk
Functional SmalltalkFunctional Smalltalk
Functional Smalltalk
ESUG
 
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
digitalwave
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
Michael Renner
 
The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...
PVS-Studio
 
Perl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one linersPerl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one liners
Kirk Kimmel
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.
Poznań Ruby User Group
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
Hyderabad Scalability Meetup
 
7986-lect 7.pdf
7986-lect 7.pdf7986-lect 7.pdf
7986-lect 7.pdf
RiazAhmad521284
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
dwm042
 
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Workhorse Computing
 

Similar to Perly Parsing with Regexp::Grammars (20)

Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
 
Os Wilhelm
Os WilhelmOs Wilhelm
Os Wilhelm
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScript
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
 
Functional Smalltalk
Functional SmalltalkFunctional Smalltalk
Functional Smalltalk
 
JavaScripts & jQuery
JavaScripts & jQueryJavaScripts & jQuery
JavaScripts & jQuery
 
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
 
The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...
 
Perl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one linersPerl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one liners
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
7986-lect 7.pdf
7986-lect 7.pdf7986-lect 7.pdf
7986-lect 7.pdf
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
 

More from Workhorse Computing

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility Modules
Workhorse Computing
 
mro-every.pdf
mro-every.pdfmro-every.pdf
mro-every.pdf
Workhorse Computing
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add Up
Workhorse Computing
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.
Workhorse Computing
 
Unit Testing Lots of Perl
Unit Testing Lots of PerlUnit Testing Lots of Perl
Unit Testing Lots of Perl
Workhorse Computing
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in Posgresql
Workhorse Computing
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!
Workhorse Computing
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command Interpolation
Workhorse Computing
 
Findbin libs
Findbin libsFindbin libs
Findbin libs
Workhorse Computing
 
Memory Manglement in Raku
Memory Manglement in RakuMemory Manglement in Raku
Memory Manglement in Raku
Workhorse Computing
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic Interpolation
Workhorse Computing
 
Effective Benchmarks
Effective BenchmarksEffective Benchmarks
Effective Benchmarks
Workhorse Computing
 
Metadata-driven Testing
Metadata-driven TestingMetadata-driven Testing
Metadata-driven Testing
Workhorse Computing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.
Workhorse Computing
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.
Workhorse Computing
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.
Workhorse Computing
 
Smoking docker
Smoking dockerSmoking docker
Smoking docker
Workhorse Computing
 
Getting Testy With Perl6
Getting Testy With Perl6Getting Testy With Perl6
Getting Testy With Perl6
Workhorse Computing
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Workhorse Computing
 
Neatly folding-a-tree
Neatly folding-a-treeNeatly folding-a-tree
Neatly folding-a-tree
Workhorse Computing
 

More from Workhorse Computing (20)

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility Modules
 
mro-every.pdf
mro-every.pdfmro-every.pdf
mro-every.pdf
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add Up
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.
 
Unit Testing Lots of Perl
Unit Testing Lots of PerlUnit Testing Lots of Perl
Unit Testing Lots of Perl
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in Posgresql
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command Interpolation
 
Findbin libs
Findbin libsFindbin libs
Findbin libs
 
Memory Manglement in Raku
Memory Manglement in RakuMemory Manglement in Raku
Memory Manglement in Raku
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic Interpolation
 
Effective Benchmarks
Effective BenchmarksEffective Benchmarks
Effective Benchmarks
 
Metadata-driven Testing
Metadata-driven TestingMetadata-driven Testing
Metadata-driven Testing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.
 
Smoking docker
Smoking dockerSmoking docker
Smoking docker
 
Getting Testy With Perl6
Getting Testy With Perl6Getting Testy With Perl6
Getting Testy With Perl6
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
 
Neatly folding-a-tree
Neatly folding-a-treeNeatly folding-a-tree
Neatly folding-a-tree
 

Recently uploaded

20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 

Recently uploaded (20)

20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 

Perly Parsing with Regexp::Grammars

  • 2. Grammars are the guts of compilers ● Compilers convert text from one form to another. – C compilers convert C source to CPU-specific assembly. – Databases compile SQL into RDBMS op's. ● Grammars define structure, precedence, valid inputs. – Realistic ones are often recursive or context-sensitive. – The complexity in defining grammars led to a variety of tools for defining them. – The standard format for a long time has been “BNF”, which is the input to YACC. ● They are wasted on for 'flat text'. – If “split /t/” does the job skip grammars entirely.
  • 3. The first Yet Another: YACC ● Yet Another Compiler Compiler – YACC takes in a standard-format grammar structure. – It processes tokens and their values, organizing the results according to the grammar into a structure. ● Between the source and YACC is a tokenizer. – This parses the inputs into individual tokens defined by the grammar. – It doesn't know about structure, only breaking the text stream up into tokens.
  • 4. Parsing is a pain in the lex ● The real pain is gluing the parser and tokenizer together. – Tokenizers deal in the language of patterns. – Grammars are defined in terms of structure. ● Passing data between them makes for most of the difficulty. – One issue is the global yylex call, which makes having multiple parsers difficult. – Context-sensitive grammars with multiple sub- grammars are painful.
  • 5. The perly way ● Regexen, logic, glue... hmm... been there before. – The first approach most of us try is lexing with regexen. – Then add captures and if-blocks or excute (?{code}) blocks inside of each regex. ● The problem is that the grammar is embedded in your code structure. – You have to modify the code structure to change the grammar or its tokens. – Hubris, maybe, but Truly Lazy it ain't. – Was the whole reason for developing standard grammars & their handlers in the first place.
  • 6. Early Perl Grammar Modules ● These take in a YACC grammar and spit out compiler code. ● Intentionally looked like YACC: – Able to re-cycle existing YACC grammar files. – Benefit from using Perl as a built-in lexer. – Perl-byacc & Parse::Yapp. ● Good: Recycles knowledge for YACC users. ● Bad: Still not lazy: The grammars are difficult to maintain and you still have to plug in post- processing code to deal with the results.
  • 7. %right '=' %left '-' '+' %left '*' '/' %left NEG %right '^' %% input: #empty | input line { push(@{$_[1]},$_[2]); $_[1] } ; line: 'n' { $_[1] } | exp 'n' { print "$_[1]n" } | error 'n' { $_[0]->YYErrok } ; exp: NUM | VAR { $_[0]->YYData->{VARS}{$_[1]} } | VAR '=' exp { $_[0]->YYData->{VARS}{$_[1]}=$_[3] } | exp '+' exp { $_[1] + $_[3] } | exp '-' exp { $_[1] - $_[3] } | exp '*' exp { $_[1] * $_[3] } Example: Parse::Yapp grammar
  • 8. The Swiss Army Chainsaw ● Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers. ● Grammars are largely declarative, using OO Perl to do the heavy lifting. – OO interface allows multiple, context sensitive parsers. – Rules with Perl blocks allows the code to do anything. – Results can be acquired from a hash, an array, or $1. – Left, right, associative tags simplify messy situations.
  • 9. Example P::RD ● This is part of an infix formula compiler I wrote. ● It compiles equations to a sequence of closures. add_op : '+' | '-' | '%' { $item[ 1 ] } mult_op : '*' | '/' | '^' { $item[ 1 ] } add : <leftop: mult add_op mult> { compile_binop @{ $item[1] } } mult : <leftop: factor mult_op factor> { compile_binop @{ $item[1] } }
  • 10. Just enough rope to shoot yourself... ● The biggest problem: P::RD is sloooooooowsloooooooow. ● Learning curve is perl-ish: shallow and long. – Unless you really know what all of it does you may not be able to figure out the pieces. – Lots of really good docs that most people never read. ● Perly blocks also made it look too much like a job- dispatcher. – People used it for a lot of things that are not compilers. – Good & Bad thing: it really is a compiler.
  • 11. R.I.P. P::RD ● Supposed to be replaced with Parse::FastDescent. – Damian dropped work on P::FD for Perl6. – His goal was to replace the shortcomings with P::RD with something more complete, and quite a bit faster. ● The result is Perl6 Grammars. – Declarative syntax extends matching with rules. – Built into Perl6 as a structure, not an add-on. – Much faster. – Not available in Perl5
  • 12. Regex::Grammars ● Perl5 implementation derived from Perl6. – Back-porting an idea, not the Perl6 syntax. – Much better performance than P::RD. ● Extends the v5.10 recursive matching syntax, leveraging the regex engine. – Most of the speed issues are with regex design, not the parser itself. – Simplifies mixing code and matching. – Single place to get the final results. – Cleaner syntax with automatic whitespace handling.
  • 13. Extending regexen ● “use Regexp::Grammar” turns on added syntax. – block-scoped (avoids collisions with existing code). ● You will probably want to add “xm” or “xs” – extended syntax avoids whitespace issues. – multi-line mode (m) simplifies line anchors for line- oriented parsing. – single-line mode (s) makes ignoring line-wrap whitespace largely automatic. – I use “xm” with explicit “n” or “s” matches to span lines where necessary.
  • 14. What you get ● The parser is simply a regex-ref. – You can bless it or have multiple parsers for context grammars. ● Grammars can reference one another. – Extending grammars via objects or modules is straightforward. ● Comfortable for incremental development or refactoring. – Largely declarative syntax helps. – OOP provides inheritance with overrides for rules.
  • 15. my $compiler = do { use Regexp::Grammars; qr { <data> <rule: data > <[text]>+ <rule: text > .+ }xm }; Example: Creating a compiler ● Context can be a do-block, subroutine, or branch logic. ● “data” is the entry rule. ● All this does is read lines into an array with automatic ws handling.
  • 16. Results: %/
● The results of parsing are in a tree-hash named %/.
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or debugging).
– Easy to handle with Data::Dumper.
● The hash has at least one key for the entry rule, plus one empty key for the input data if context is being saved.
● For example, feeding two lines of a Gentoo emerge log through the line grammar gives:
  • 17. Parsing a few lines of logfile
{
  '' => '1367874132: Started emerge on: May 06, 2013 21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
  data =>
  {
    '' => '1367874132: Started emerge on: May 06, 2013 21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    text =>
    [
      '1367874132: Started emerge on: May 06, 2013 21:02:12',
      ' 1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
    ]
  }
}
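For reference, a dump like the one above can be produced with a couple of lines. This is a sketch, assuming the $compiler parser from slide 15 and the raw log text in a hypothetical $log_text:

use Data::Dumper;

# A successful match populates the tree-hash %/.
if( $log_text =~ $compiler )
{
    print Dumper \%/;
}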
  • 18. Getting rid of context
● The empty-keyed values are useful for development or explicit error messages.
● They also get in the way and can cost a lot of memory on large inputs.
● You can turn them on and off with <context:> and <nocontext:> in the rules, as in the sketch below.
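For instance, a sketch (using the rule names from the earlier line grammar) that drops context globally but keeps it for one rule where the raw text helps with error messages; whether this split is worth it depends on your input size:

qr
{
    <nocontext:>        # no '' keys anywhere...

    <data>

    <rule: data >   <[line]>+

    <rule: line >
        <context:>      # ...except under each “line”
        <ref_id> <text>

    <token: ref_id >    ^(\d+)
    <rule: text >       .+
}xm;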
  • 19. You usually want [] with +

qr
{
    <nocontext:>            # turn off globally

    <data>

    <rule: data > <text>+   # oops, left off the []!
    <rule: text > .+
}xm;

warn:
    Repeated subrule <text>+ will only capture its final match
    (Did you mean <[text]>+ instead?)

{
  data =>
  {
    text => ' 1367874132: *** emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
  }
}
  • 20. An array[ref] of text

qr
{
    <nocontext:>    # turn off globally

    <data>

    <rule: data > <[text]>+
    <rule: text > (.+)
}xm;

{
  data =>
  {
    text =>     # the [text] parses to an array of text
    [
      '1367874132: Started emerge on: May 06, 2013 21:02:12',
      ' 1367874132: *** emerge --jobs --autounmask-write --...
    ],
    ...
  • 21. Breaking up lines
● Each log entry is prefixed with an entry id.
● Parsing the ref_id off the front adds:

<data>

<rule: data >   <[line]>+
<rule: line >   <ref_id> <[text]>

<token: ref_id > ^(\d+)
<rule: text >   .+

line =>
[
  {
    ref_id => '1367874132',
    text => ': Started emerge on: May 06, 2013 21:02:12'
  },
  …
]
  • 22. Removing cruft: “ws”
● It would be nice to remove the leading “: ” from the text lines.
● In this case the “whitespace” needs to include a colon along with the spaces.
● Whitespace is defined by <ws: … >:

<rule: line>    <ws:[\s:]+> <ref_id> <text>

{
  ref_id => '1367874132',
  text => '*** emerge --jobs --autounmask-wr...
}
  • 23. The '***' prefix means something
● It would be nice to know what type of line was being processed.
● <prefix= regex > assigns the regex's capture to the “prefix” tag:

<rule: line >   <ws:[\s:]*> <ref_id> <entry>

<rule: entry >
    <prefix=([*][*][*])> <text>
  | <prefix=([>][>][>])> <text>
  | <prefix=([=][=][=])> <text>
  | <prefix=([:][:][:])> <text>
  | <text>
  • 24. “entry” now contains an optional prefix

{
  entry =>
  {
    text => 'Started emerge on: May 06, 2013 21:02:12'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    prefix => '***',
    text => 'emerge --jobs --autounmask-write...
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    prefix => '>>>',
    text => 'emerge (1 of 2) sys-apps/...
  },
  ref_id => '1367874256'
}
  • 25. Aliases can also assign tag results
● Aliases assign a key to rule results.
● The match from “text” is aliased to a named type of log entry.

<rule: entry>
    <prefix=([*][*][*])> <command=text>
  | <prefix=([>][>][>])> <stage=text>
  | <prefix=([=][=][=])> <status=text>
  | <prefix=([:][:][:])> <final=text>
  | <message=text>
  • 26. Generic “text” replaced with a type:

{
  entry =>
  {
    message => 'Started emerge on: May 06, 2013 21:02:12'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'emerge --jobs --autounmask-write --...
    prefix => '***'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'terminating.',
    prefix => '***'
  },
  ref_id => '1367874133'
},
  • 27. Parsing without capturing
● At this point we don't really need the prefix strings since the entries are labeled.
● A leading '.' tells R::G to parse but not store the results in %/:

<rule: entry >
    <.prefix=([*][*][*])> <command=text>
  | <.prefix=([>][>][>])> <stage=text>
  | <.prefix=([=][=][=])> <status=text>
  | <.prefix=([:][:][:])> <final=text>
  | <message=text>
  • 28. “entry” now has typed keys:

{
  entry =>
  {
    message => 'Started emerge on: May 06, 2013 21:02:12'
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'emerge --jobs --autounmask-write -...
  },
  ref_id => '1367874132'
},
{
  entry =>
  {
    command => 'terminating.'
  },
  ref_id => '1367874133'
},
  • 29. The “entry” nesting gets in the way
● The named subrule is not hard to get rid of: just move its syntax up one level:

<ws:[\s:]*> <ref_id>
(
    <.prefix=([*][*][*])> <command=text>
  | <.prefix=([>][>][>])> <stage=text>
  | <.prefix=([=][=][=])> <status=text>
  | <.prefix=([:][:][:])> <final=text>
  | <message=text>
)
  • 30. Result: array of “line” with ref_id & type

data =>
{
  line =>
  [
    {
      message => 'Started emerge on: May 06, 2013 21:02:12',
      ref_id => '1367874132'
    },
    {
      command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
      ref_id => '1367874132'
    },
    {
      command => 'terminating.',
      ref_id => '1367874133'
    },
    {
      message => 'Started emerge on: May 06, 2013 21:02:17',
      ref_id => '1367874137'
    },
  • 31. Funny names for things
● Maybe “command” and “status” aren't the best way to distinguish the text.
● You can store an optional token followed by text:

<rule: entry >  <ws:[\s:]*> <ref_id> <type>? <text>

<token: type>
(
    [*][*][*]
  | [>][>][>]
  | [=][=][=]
  | [:][:][:]
)
  • 32. Entries now have “text” and “type”

entry =>
[
  {
    ref_id => '1367874132',
    text => 'Started emerge on: May 06, 2013 21:02:12'
  },
  {
    ref_id => '1367874133',
    text => 'terminating.',
    type => '***'
  },
  {
    ref_id => '1367874137',
    text => 'Started emerge on: May 06, 2013 21:02:17'
  },
  {
    ref_id => '1367874137',
    text => 'emerge --jobs --autounmask-write --...
    type => '***'
  },
  • 33. Prefix alternations look ugly
● Using a count works:
    [*]{3} | [>]{3} | [:]{3} | [=]{3}
but isn't all that much more readable.
● Given the way these are used, a character class is simpler:
    [*>:=]{3}
  • 34. This is the skeleton parser
● Doesn't take much:
– Declarative syntax.
– No Perl code at all!
● Easy to modify by extending the definition of “text” for specific types of messages.

qr
{
    <nocontext:>

    <data>

    <rule: data >   <[entry]>+
    <rule: entry >  <ws:[\s:]*> <ref_id> <prefix>? <text>

    <token: ref_id > ^(\d+)
    <token: prefix > [*>=:]{3}
    <token: text >   .+
}xm;
  • 35. Finishing the parser
● Given the different line types it will be useful to extract commands, switches, outcomes from appropriate lines.
– Sub-rules can be defined for the different line types (see the sketch below).

<rule: command> 'emerge' <.ws> <[switch]>+
<token: switch> ([-][-]\S+)

● This is what makes the grammars useful: nested, context-sensitive content.
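A sketch of how that might plug into the skeleton. The alternation and rule layout here are my own illustration, not from the slides; lines that start an emerge command get their switches split out, everything else falls back to raw text:

my $nr_log = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>

        <data>

        <rule: data >    <[entry]>+

        # Try the structured command first, fall back to raw text.
        <rule: entry >   <ws:[\s:]*> <ref_id> <prefix>? ( <command> | <text> )

        <rule: command>  emerge <.ws> <[switch]>+
        <token: switch>  [-][-]\S+

        <token: ref_id > ^(\d+)
        <token: prefix > [*>=:]{3}
        <token: text >   .+
    }xm;
};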
  • 36. Inheriting & Extending Grammars ● <grammar: name> and <extends: name> allow a building-block approach. ● Code can assemble the contents of for a qr{} without having to eval or deal with messy quote strings. ● This makes modular or context-sensitive grammars relatively simple to compose. – References can cross package or module boundaries. – Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
  • 37. The Non-Redundant File
● NCBI's “nr.gz” file is a list of sequences and all of the places they are known to appear.
● It is moderately large: 140+GB uncompressed.
● The file consists of a simple FASTA format with headings separated by ctrl-A characters:

>Heading 1
[amino-acid sequence characters...]
>Heading 2
...
  • 38. Example: A short nr.gz FASTA entry
● Headings are grouped by species, separated by ctrl-A (“\cA”) characters.
– Each species has a set of source & identifier pairs followed by a single description.
– Within-species separator is a pipe (“|”) with optional whitespace.
– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...
VQKLLNPDQ
  • 39. First step: Parse FASTA
● Instead of defining an entry rule, this just defines a name, “Parse::Fasta”.
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.

qr
{
    <grammar: Parse::Fasta>
    <nocontext:>

    <rule: fasta >  <.start> <head> <.ws> <[body]>+
    <rule: head >   .+ <.ws>
    <rule: body >   ( <[seq]> | <.comment> ) <.ws>

    <token: start >     ^ [>]
    <token: comment >   ^ [;] .+
    <token: seq >       ^ [\n\w-]+
}xm;
  • 40. The output needs help, however
● The “<seq>” token captures newlines that need to be stripped out to get a single string.
● Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{...})
– Allows inserting almost-arbitrary code into the regex.
– “almost” because the code cannot include regexen.

seq =>
[
  'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYD
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ
'
]
  • 41. Munging results: $MATCH
● The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.
● In this case I take the “seq” match contents out of %/, join them with nothing, and use “tr” to strip the newlines.
– join + split won't work because split uses a regex.

<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})
  • 42. One more step: Remove the arrayref
● Now the body is a single string.
● No need for an arrayref to contain one string.
● Since the body has one entry, assign offset zero:

body =>
[
  'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
],

<rule: fasta> <.start> <head> <.ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})
  • 43. Result: a generic FASTA parser
● The head and body are easily accessible.
● Next: parse the nr-specific header.

{
  fasta =>
  [
    {
      body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
      head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] '
    }
  ]
}
  • 44. Deriving a grammar
● Existing grammars are “extended”.
● The derived grammars are capable of producing results.
● In this case the derived parser references the grammar and extracts a list of fasta entries:

<extends: Parse::Fasta>
<[fasta]>+
  • 45. Splitting the head into identifiers
● Overloading fasta's “head” rule allows splitting identifiers for individual species.
● Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using “.+ [\cA\n]” walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
  • 46. First pass: Literal “tail” item
● This works but is ugly:
– Have two rules for the main list and tail.
– Alias the tail to get them all in one place.

<rule: head>
    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

<token: ident > .+? \cA
<token: final > .+ \n
  • 47. Breaking up the header
● The last header item is aliased to “ident”.
● Breaks up all of the entries:

head =>
{
  ident =>
  [
    'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
    'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
    'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
  ]
}
  • 48. Dealing with separators: % <sep>
● Separators happen often enough:
– 1, 2, 3, 4, 13, 91        # numbers by commas, spaces
– g-c-a-g-t-t-a-c-a         # characters by dashes
– /usr/local/bin            # basenames by dir markers
– /usr:/usr/local:bin       # dir's separated by colons
that R::G has special syntax for dealing with them.
● Combining the item with '%' and a separator:

<rule: list>     <[item]>+ % <separator>    # one-or-more
<rule: list_zom> <[item]>* % <separator>    # zero-or-more
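For example, a small self-contained sketch (the rule names “list” and “item” are my own) that parses a comma-separated list of numbers with the separator syntax:

use Data::Dumper;

my $list_rx = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <list>

        <rule: list >   <[item]>+ % [,]
        <token: item >  \d+
    }xm;
};

'1, 2, 3, 4, 13, 91' =~ $list_rx
    and print Dumper $/{ list }{ item };
# prints an arrayref of the six numbers

Note that “<rule:” (rather than “<token:”) handles the optional whitespace around the commas automatically.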
  • 49. Cleaner nr.gz header rule
● Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match “.+?” avoids capturing separators.

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head >   <[ident]>+ % [\cA]
    <token: ident > .+?
}xm
  • 50. Nested “ident” tag is extraneous
● Simpler to replace the “head” with a list of identifiers.
● Replace $MATCH from the “head” rule with the nested identifier contents:

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head >   <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

    <token: ident > .+?
}xm
  • 51. Result
● The fasta content is broken into the usual “body” plus a “head” broken down on \cA boundaries.
● Not bad for a dozen lines of grammar with a few lines of code.

{
  fasta =>
  [
    {
      body => 'MASTQNIVEEVQKMLDT...NPDQ',
      head =>
      [
        'gi|66816243|ref|XP_6...rt=CAF-1',
        'gi|793761|dbj|BAA0626...oideum]',
        'gi|60470106|gb|EAL68086...m discoideum AX4]'
      ]
    }
  ]
}
  • 52. One more level of structure: idents
● Species have <source> | <identifier> pairs followed by a description.
● Add a separator clause “ % (?: \s* [|] \s* )”.
– This can be parsed into a hash something like:

gi|66816243|ref|XP_642131.1|hypothetical ...

Becomes:

{
  gi => '66816243',
  ref => 'XP_642131.1',
  desc => 'hypothetical...'
}
  • 53. Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc }
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head >   <[ident]>+ % [\cA]
<token: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa >  .+?
  • 54. Result: head with sources, “desc”

{
  fasta =>
  {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
    head =>
    [
      {
        desc => '30S ribosomal protein S18 [Lactococ...
        gi   => '15674171',
        ref  => 'NP_268346.1'
      },
      {
        desc => '30S ribosomal protein S18 [Lactoco...
        gi   => '116513137',
        ref  => 'YP_812044.1'
      },
      ...
  • 55. Balancing R::G with calling code
● The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input on '>' removes it as a leading character.
– Making it optional with <.start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;

    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
  • 56. Fasta base grammar: 3 lines of code

qr
{
    <grammar: Parse::Fasta>
    <nocontext:>

    <rule: fasta >  <.start>? <head> <.ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

    <rule: head >   .+ <.ws>

    <rule: body >   ( <[seq]> | <.comment> ) <.ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

    <token: start >     ^ [>]
    <token: comment >   ^ [;] .+
    <token: seq >       ^ ( [\n\w-]+ )
}xm;
  • 57. Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head >   <[ident]>+ % [\cA]
    <rule: ident >  <[taxa]>+ % (?: \s* [|] \s* )
    <token: taxa >  .+?
}xm
  • 58. Result: Use grammars
● Most of the “real” work is done under the hood.
– Regexp::Grammars does the lexing and basic compilation.
– Code is only needed for cleanups or re-arranging structs.
● Code can simplify your grammar.
– Too much code makes them hard to maintain.
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.
● Either way, the result is going to be more maintainable than hardwiring the grammar into code.
  • 59. Aside: KwikFix for Perl v5.18
● v5.17 changed how the regex engine handles inline code.
● Code that used to be eval-ed in the regex is now compiled up front.
– This requires “use re 'eval'” and “no strict 'vars'”.
– One for the Perl code, the other for $MATCH and friends.
● The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller:

require re;
re->import( 'eval' );

require strict;
strict->unimport( 'vars' );

● Look up $^H in perlvar to see how it works.
  • 60. Use Regexp::Grammars
● Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
– Frankly, even if you do have old grammars...
● Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize NFA regexen, however.
● Or, better yet, use Perl6 grammars, available today in your local copy of Rakudo Perl6.
  • 61. More info on Regexp::Grammars
● The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].
● The ./demo directory has a number of working – if un-annotated – examples.
● “perldoc perlre” shows how recursive matching works in v5.10+.
● PerlMonks has plenty of good postings.
● Perl Review article by brian d foy on recursive matching in Perl 5.10.