Regular Expressions
The Little Engine That Could(n't)?
Twitter 
● #saintcon ● #perlreintro
Salt Lake Perl Mongers 
● The local “Perl Community” 
– Monthly meetings. 
– Partnership discounts. 
– Job announcements. 
– Everyone learns and grows. 
– For the love of Perl! 
● http://saltlake.pm.org
YAPC::NA::2015 – Salt Lake! 
● Yet Another Perl Conference, North America 
● Coming to Salt Lake City in June 2015 
● Check http://saltlake.pm.org for emerging 
details.
Who am I? 
● Dave Oswald – A Perpetual Hobbyist.
● Studied Economics and Computer Science at U of U. 
– Also CS in High School, SLCC, LAVC, and self-guided. 
● Independent software developer and consultant. 
– Focus on Perl, C++, and server-side development. 
● Solving problems is my hobby, passion ...and my work. 
● daoswald@gmail.com 
● Salt Lake Perl Mongers 
– http://saltlake.pm.org 
Aspiring to be Lazy, Impatient, and Hubristic.
This Is Our Goal Today 
https://xkcd.com/208/
oO(um...) This Is ^H^H^H^H^H^H^H^H^H^H^H^H^H^H
This Is NOT Our Goal Today
Examples will be in Perl 
$_ = 'Just another Perl hacker,';
s/Perl/$your_preference/; 
● Because regexes are an integral part of Perl's syntax. 
● Because I get to use some cool tools unique to Perl. 
● Because it doesn't matter (PCRE is nearly ubiquitous). 
● Because Perl's regexes are Unicode enabled (modern Perls). 
● Because it's my talk.
Some Definitions 
● Literal Characters 
abcdefghijklmnopqrstuvw 
xyz ABCDEFGHIJKLMNOP...
1234567890 
● Metacharacters
\ | ( ) [ { ^ $ * + ? .
● Metasymbols
\b \D \t \3 \s \n
...and many others 
● Operators 
m// (match) 
s/// (substitute) 
=~ or !~ (bind)
A trivial example 
$string = "Just another Perl hacker,";
# (Target) (Bound to) (Pattern)
say "Match!" if $string =~ m/Perl/;
Match!
Syntactic Shortcuts 
$_ = "Just another Perl hacker,";
# (Target) (Bound to) (Pattern)
say "Match!" if /Perl/;
Match!
NFA? 
DFA? 
Hybrid?
/(non)?deterministic finite automata/ 
● Deterministic Finite Automata 
– Text-directed match 
– No backtracking, more limited semantics. 
– awk, egrep, flex, lex, MySQL, Procmail 
● Non-deterministic Finite Automata 
– Regex-directed match 
– Backtracking, more flexible semantics 
– GNU Emacs, Java, grep, less, more, .NET, PCRE library, Perl, 
PHP, Python, Ruby, sed, vi, C++11
Our focus... 
● NFA – Nondeterministic Finite Automata 
– It's more interesting. 
– We tend to use it in more places. 
– Perl's regular expression engine is based on NFA.
– And so are most other general-purpose implementations.
Some Basics 
● Literals match literals 
"Hello world!" =~ m/Hello/; # true.
● Alternation
"Hello world!" =~ m/earth|world/; # true (world)
Meta-symbols 
● Some meta-symbols match classes of 
characters. 
● "Hello world" =~ m/\w\s\w/; # true: (o w)
● Common symbols
\w (an “identifier” character)
\s (a “space” character)
. (anything except newline – and sometimes newline too)
\d (a numeric digit)
● See perldoc perlrecharclass
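A minimal sketch of the meta-symbols above in action. The sample strings and variable names are illustrative assumptions, not taken from the talk.

```perl
use strict;
use warnings;
use feature 'say';

my $text = 'Hello world 42';

# \w\s\w: a word character, a whitespace character, a word character.
my ($around_space) = $text =~ m/(\w\s\w)/;

# \d+: one or more digits.
my ($digits) = $text =~ m/(\d+)/;

# Dot skips newline by default; the /s modifier lets it match newline too.
my $dot_plain = "a\nb" =~ m/a.b/  ? 1 : 0;
my $dot_s     = "a\nb" =~ m/a.b/s ? 1 : 0;

say "around_space='$around_space' digits=$digits";
say "dot_plain=$dot_plain dot_s=$dot_s";
```

The last two lines illustrate the "sometimes newline too" remark: only the /s form matches across the newline.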
Quantifiers 
● Quantifiers allow for atoms to match repeatedly. 
"Loooong day" =~ m/o+/; # true (oooo)
● Common quantifiers 
+ (One or more): /o+/ 
* (Zero or more): /Lo*/ 
{2} (Exactly 2): /o{2}/ 
{2,6} (2 to 6 times): /o{2,6}/
{2,} (2 or more times): /o{2,}/ 
? (0 or 1 times): /o?/
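To see what each quantifier actually consumes, the following sketch runs the patterns from this slide against the same target and records the matched text. The loop and the %matched hash are assumptions added for illustration.

```perl
use strict;
use warnings;
use feature 'say';

my $target   = 'Loooong day';
my @patterns = ('o+', 'Lo*', 'o{2}', 'o{2,6}', 'o{2,}', 'o?');

# Record the text each quantified pattern matches first.
my %matched;
for my $pattern (@patterns) {
    $matched{$pattern} = $1 if $target =~ m/($pattern)/;
}

say sprintf('%-7s matched "%s"', $_, $matched{$_}) for @patterns;
```

Note that /o?/ reports an empty match: zero occurrences is allowed, so the pattern succeeds immediately at the start of the string, before the "L".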
Controlling Greed 
● Greedy is the default. 
"looong" =~ m/o+/; # ooo
● ? after a quantifier makes it lazy, or non-greedy.
"looong" =~ m/o+?/; # o
Greedy and Non-greedy Quantifiers 
● Greedy 
*, +, {…}, {… , …}, ? 
'aaaaa' =~ /\w+a/ # aaaaa
● Non-Greedy
*?, +?, {…}?, {… , …}?, ??
'aaaaa' =~ /\w+?a/ # aa
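A short sketch that captures what the greedy and lazy forms consume. The quoted-phrase example at the end is an added assumption, chosen because it is a common place for greed to surprise people.

```perl
use strict;
use warnings;
use feature 'say';

# The examples from the slides, with captures to expose the matched text.
my ($greedy)   = 'looong' =~ m/(o+)/;      # takes every 'o' it can
my ($lazy)     = 'looong' =~ m/(o+?)/;     # stops after the first 'o'
my ($greedy_a) = 'aaaaa'  =~ m/(\w+a)/;    # \w+ gives back just one 'a'
my ($lazy_a)   = 'aaaaa'  =~ m/(\w+?a)/;   # \w+? takes one, then the 'a'

# Hypothetical example: pulling the first quoted phrase from a line.
my $line = 'say "hi" and "bye"';
my ($greedy_quote) = $line =~ m/"(.+)"/;   # runs to the LAST quote
my ($lazy_quote)   = $line =~ m/"(.+?)"/;  # stops at the NEXT quote

say "greedy=$greedy lazy=$lazy greedy_a=$greedy_a lazy_a=$lazy_a";
say "greedy_quote=[$greedy_quote] lazy_quote=[$lazy_quote]";
```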
Anchors / Zero-width assertions. 
"Hello world" =~ /^world/; # false.
"Hello world" =~ /world$/; # true.
● Common anchoring assertions
– ^ (Beginning of string or line – /m dependent)
– $ (End of string or line – /m dependent)
– \A (Beginning of string, always.)
– \z (End of string, always.)
– \b (Boundary between \w and \W): "Apple+" =~ /\w\b/
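The /m dependence is easiest to see on a two-line string. This is a sketch with made-up sample text; each flag is 1 when the pattern matches and 0 when it does not.

```perl
use strict;
use warnings;
use feature 'say';

my $text = "first line\nsecond line";

# Without /m, ^ and $ only anchor at the ends of the whole string.
my $caret_plain  = $text =~ m/^second/      ? 1 : 0;
my $dollar_plain = $text =~ m/first line$/  ? 1 : 0;

# With /m, ^ and $ also anchor at each embedded line boundary.
my $caret_multi  = $text =~ m/^second/m     ? 1 : 0;
my $dollar_multi = $text =~ m/first line$/m ? 1 : 0;

# \A and \z ignore /m: they always mean the very start and very end.
my $start_always = $text =~ m/\Asecond/m    ? 1 : 0;
my $end_always   = $text =~ m/line\z/       ? 1 : 0;

# \b is zero-width: the capture holds only the word character before it.
my ($edge) = 'Apple+' =~ m/(\w)\b/;

say "plain: ^=$caret_plain \$=$dollar_plain";
say "multi: ^=$caret_multi \$=$dollar_multi";
say "\\A=$start_always \\z=$end_always edge=$edge";
```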
Grouping 
● (?: … ) – Non-capturing. 
● "Daniel" =~ m/^(?:Dan|Nathan)iel$/; # true
● "Daniel" =~ m/^Dan|Nathaniel$/; # true, but by accident: this means (^Dan) or (Nathaniel$)
● ( … ) – Group and capture.
● "Daniel" =~ m/^(Dan|Nathan)iel$/;
# Captures "Dan" into $1.
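Alternation has the lowest precedence in a pattern, so leaving out the group changes what the anchors apply to. The sketch below, with hypothetical test names, shows inputs that only the ungrouped pattern lets through.

```perl
use strict;
use warnings;
use feature 'say';

my $grouped   = qr/^(?:Dan|Nathan)iel$/;   # anchors apply to both names
my $ungrouped = qr/^Dan|Nathaniel$/;       # means: ^Dan  OR  Nathaniel$

# The grouped pattern accepts exactly the two intended names.
my @intended = grep { $_ =~ $grouped } ('Daniel', 'Nathaniel');

# Hypothetical near-misses: starts with "Dan", or merely ends in "Nathaniel".
my @near_misses    = ('Dante', 'McNathaniel');
my @grouped_hits   = grep { $_ =~ $grouped }   @near_misses;
my @ungrouped_hits = grep { $_ =~ $ungrouped } @near_misses;

# The capturing form records which alternative matched.
my ($prefix) = 'Nathaniel' =~ m/^(Dan|Nathan)iel$/;

say "intended: @intended";
say 'grouped near-miss hits: ',   scalar @grouped_hits;
say 'ungrouped near-miss hits: ', scalar @ungrouped_hits;
say "prefix: $prefix";
```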
Captures 
● ( … ) captures populate $1, $2, $3... 
● Also \1, \2, \3 within regexp.
● Named captures: (?<name> … )
– Populates $+{name}
– Also \g{name} within regexp.
Capturing 
while(
'abc def ghi' =~ m/(?<trio>\w{3})/g
) { print "$+{trio}\n"; }
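A sketch pulling the capture features together: the named-capture loop from this slide, plus numbered and named backreferences used to find a doubled word. The doubled-word sentence is an assumption added for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Named capture inside a /g loop, collecting each three-letter group.
my $text = 'abc def ghi';
my @trios;
while ($text =~ m/(?<trio>\w{3})/g) {
    push @trios, $+{trio};
}

# Backreferences match the same text that a group captured earlier.
my $sentence = 'this is is a test';

# Numbered form: \1 refers to the first capture group.
my ($numbered) = $sentence =~ m/\b(\w+)\s+\1\b/;

# Named form: \g{word} refers to the group named "word".
my $named;
$named = $+{word} if $sentence =~ m/\b(?<word>\w+)\s+\g{word}\b/;

say "trios: @trios";
say "doubled word: $numbered (numbered), $named (named)";
```

The \b anchors matter here: without them the "is" inside "this" would pair with the following "is".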
Grouping creates composite atoms 
● "eieio" =~ /(?:ei)+/; # Matches "eiei"
Custom character classes 
● [ … ] (An affirmative character class)
"Hello" =~ m/[el]+/; # ell
● [^ … ] (A negated character class)
"Hello" =~ m/[^el]+/; # H
Character Class Ranges 
● - (hyphen) is special within character classes. 
"12345" =~ m/[2-4]+/; # 234
● A literal hyphen must be escaped, or placed at the end:
"123-5" =~ m/[345-]+/; # 3-5
● A literal ^ (caret) must be escaped, or must not be at the beginning.
"12^7" =~ m/[0-9^]+/; # 12^7
"12^7" =~ m/[^0-9]+/; # ^
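The position of - and ^ inside the brackets decides whether they are special. A minimal sketch using the strings from this slide, with captures added so the matched text can be inspected:

```perl
use strict;
use warnings;
use feature 'say';

# Hyphen between two characters: a range.
my ($range) = '12345' =~ m/([2-4]+)/;

# Hyphen at the end of the class: a literal hyphen.
my ($with_hyphen) = '123-5' =~ m/([345-]+)/;

# Caret anywhere but first: a literal caret.
my ($with_caret) = '12^7' =~ m/([0-9^]+)/;

# Caret first: negates the class, so this matches the non-digits.
my ($non_digits) = '12^7' =~ m/([^0-9]+)/;

say "range=$range with_hyphen=$with_hyphen";
say "with_caret=$with_caret non_digits=$non_digits";
```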
Character Class Ranges in 2014 
● Unicode means this is probably wrong 
m/\A[a-z]*\z/i
# Contains only letters (wrong)
# 52 possibilities.
● This is probably better
m/\A\p{Alpha}*\z/
# Contains only Alphabetic characters. 
# 102,159 possibilities.
Character Class Ranges in 2014 
● Broken.... A BUG! 
m/^[a-zA-Z]*$/i 
● You meant to say... 
m/\A\p{Alpha}*\z/
Or to put it another way... 
my $user_city
= "São João da Madeira";
reject() unless
$user_city =~ m/^[A-Za-z\s]+$/;
21000 people on the west coast of Portugal are 
now unable to specify a valid billing address.
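A sketch of the failure and one possible fix. It assumes the city name arrives as a decoded character string; the ã is written as the escape \x{E3} here only so the example does not depend on the source file's encoding. The combined class [\p{Alpha}\s] is an assumption about what a more forgiving check might look like, not a complete address validator.

```perl
use strict;
use warnings;
use feature 'say';

# "São João da Madeira", with each ã written as \x{E3}.
my $user_city = "S\x{E3}o Jo\x{E3}o da Madeira";

# ASCII-only ranges reject the accented letters.
my $ascii_ok = $user_city =~ m/^[A-Za-z\s]+$/ ? 1 : 0;

# The Unicode property accepts any alphabetic character.
my $unicode_ok = $user_city =~ m/\A[\p{Alpha}\s]+\z/ ? 1 : 0;

say "ascii_ok=$ascii_ok unicode_ok=$unicode_ok";
```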
Character classes may contain most 
metasymbols 
"1, 2, 3 Clap your hands for me"
=~ m/^[\w\s,]{12}/ # 1, 2, 3 Clap
● Metasymbols that represent two or more code points are
usually illegal inside character classes:
\X, \R, for example.
● Dot (.) is literal in character classes. 
● Quantifiers and alternation don't exist in character classes.
Escape “special characters” 
● Literal [ must be escaped with \
"John [Brown]" =~ m/\[(\w+)\]/;
– Captures “Brown”
● Adding a \ escapes any special character:
\\w
\^
\{2\}
\(...\)
Quotemeta 
● \Q and \E escape special characters between.
"O(n^2)" =~ m/\Q(n^\E/; # (n^
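\Q matters most when the pattern comes from a variable. In this sketch the $needle and $haystack values are invented for illustration; the built-in quotemeta function produces the same escaping that \Q applies inline.

```perl
use strict;
use warnings;
use feature 'say';

my $needle   = 'O(n^2)';
my $haystack = 'Bubble sort is O(n^2) in the worst case.';

# Interpolated raw, the parentheses group and ^ anchors, so this fails.
my $found_raw = $haystack =~ m/$needle/ ? 1 : 0;

# With \Q...\E every metacharacter in $needle is taken literally.
my $found_quoted = $haystack =~ m/\Q$needle\E/ ? 1 : 0;

# quotemeta() returns the escaped string itself.
my $escaped = quotemeta $needle;

say "found_raw=$found_raw found_quoted=$found_quoted";
say "escaped=$escaped";
```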
Zero-width Assertions 
● \b Match a word boundary
m/\w\b\W/
● (?= … ), (?! … ), (?<= … ), (?<! … )
'%a' =~ m/(?<!%)\w/; # false
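Lookarounds test the surrounding text without consuming it. A small sketch with assumed sample strings, one line per lookaround flavor plus the example from this slide:

```perl
use strict;
use warnings;
use feature 'say';

my $price = 'Cost: $42, qty 7';

# Positive lookbehind: digits that directly follow a dollar sign.
my ($after_dollar) = $price =~ m/(?<=\$)(\d+)/;

# Negative lookbehind: digits NOT preceded by a dollar sign or a digit.
my ($plain_number) = $price =~ m/(?<![\$\d])(\d+)/;

# Negative lookahead: "foo" not followed by "bar"; capture what follows.
my ($suffix) = 'foobar foobaz' =~ m/foo(?!bar)(\w+)/;

# Positive lookahead: "foo" only when "baz" follows; baz is not consumed.
my ($checked) = 'foobar foobaz' =~ m/(foo)(?=baz)/;

# The slide's example: the only word character is preceded by %.
my $pct = '%a' =~ m/(?<!%)\w/ ? 1 : 0;

say "after_dollar=$after_dollar plain_number=$plain_number";
say "suffix=$suffix checked=$checked pct=$pct";
```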
Avoiding Leaning Toothpick 
Syndrome
Avoid leaning toothpicks 
● Alternate delimiters 
"/usr/bin/perl" =~ m#^/([^/]+)/#;
– Captures usr
– Most non-identifier characters are fine as delimiters.
● A bad example
"/usr/bin/perl" =~ m/^\/([^\/]+)\//;
– Still captures usr, but ugly and prone to mistakes.
Deep breath...
Two big rules 
● The Match That Begins Earliest Wins 
'The dragging belly indicates your cat is too fat' 
/fat|cat|belly|your/ 
● The Standard Quantifiers Are Greedy 
'to be, or not to be' 
/(to.*)(or not to be)*/ 
$1 eq 'to be, or not to be'
$2 is undef (the starred group matched zero times)
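Both rules can be checked directly. In this sketch the alternation is wrapped in a capture group (an addition, so the winner can be printed). Note that Perl leaves the second capture undefined, rather than empty, when a starred group matches zero times.

```perl
use strict;
use warnings;
use feature 'say';

# Rule 1: the earliest match wins, whatever the order of the alternatives.
my $sentence = 'The dragging belly indicates your cat is too fat';
my ($winner) = $sentence =~ m/(fat|cat|belly|your)/;

# Rule 2: greedy .* takes everything, and the optional second group
# never forces it to give anything back.
my ($first, $second) = 'to be, or not to be' =~ m/(to.*)(or not to be)*/;

say "winner: $winner";
say "first:  $first";
say 'second: ', defined $second ? $second : '(undef)';
```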
Backtracking 
'hot tonic tonight!' 
/to(nite|knight|night)/ 
$1 == 'night' 
Matched “tonight” 
● The engine first finds “to” in “tonic” and tries “nite|knight|night” there; all three fail.
● It then backtracks, advances the starting position to the 'o', and keeps going until “tonight” matches.
Forcing greedy quantifiers to give up ground 
'to be, or not to be' 
/(to.*)(or not to be)/ 
$1 == 'to be, ' 
$2 == 'or not to be' 
Watch the backtracking happen... 
...twelve times.
Backtracking... 
'aaaaaab' 
/(a*)*[^Bb]$/
Backtracking out of control 
'aaaaaab' 
/(a*)*[^Bb]$/ 
“Regex failed to match after 213 steps”
Backtracking under control 
'aaaaaab' 
/(a*)*+[^Bb]$/ 
“Regex failed to match after 79 steps” 
*+, ++, ?+, {n,m}+: possessive quantifiers.
Possessive Quantifiers 
● A + symbol after a quantifier makes it possessive. 
● (?> … ) 
– Another possessive construct. 
● Possessive quantifiers stand their ground. 
– Backtracking through a possessive quantifier is disallowed.
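A deliberately tiny sketch of "standing their ground". The pattern and string are assumptions chosen so that the match only succeeds if the quantifier gives a character back. Possessive quantifiers need Perl 5.10 or newer.

```perl
use strict;
use warnings;
use feature 'say';

my $str = 'aaab';

# Greedy: a+ takes 'aaa', then backtracks one so the literal 'a' can match.
my $greedy = $str =~ m/a+ab/ ? 1 : 0;

# Possessive: a++ takes every 'a' and refuses to give one back.
my $possessive = $str =~ m/a++ab/ ? 1 : 0;

# Atomic group: (?> ... ) behaves the same way for the whole group.
my $atomic = $str =~ m/(?>a+)ab/ ? 1 : 0;

say "greedy=$greedy possessive=$possessive atomic=$atomic";
```

Failing fast is the point: where the possessive form cannot match, it says so without exploring alternatives that could never succeed.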
An extreme example 
'a' x 64 
/a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[Bb]/ 
● This will run for septillions of septillions of years (or until you kill the 
process). 
'a' x 64 
/(?> 
a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a* 
)[Bb]/x 
● This will not (4550 iterations). 
(?> … ) is another possessive construct.
Longest Leftmost? 
● Not necessarily... 
'oneselfsufficient' 
/one(self)?(selfsufficient)?/ 
● Matches 
oneself 
● Captures 
self 
● Greedy quantifiers only give up if forced.
Greedy, Lazy 
'I said foo' 
/.*foo/ # Greedy; backtracks backwards. 
/.*?foo/ # Lazy; backtracks forward. 
'CamelCase' # (We want up to two captures.) 
/([A-Z].*?)([A-Z].*)?/ # $1:'C' GOTCHA! 
/([A-Z].*)([A-Z].*)?/ # $1:'CamelCase' GOTCHA! 
/([A-Z][^A-Z]*)([A-Z][^A-Z]*)?/ # ok (kinda)
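The three CamelCase attempts, run side by side. The final /g variant is an added suggestion for the "(kinda)" caveat, on the assumption that the real goal is to split any number of humps; it is a sketch, not the talk's own solution.

```perl
use strict;
use warnings;
use feature 'say';

my $word = 'CamelCase';

# Lazy first group: .*? is happy with nothing, and the optional
# second group is then allowed to match zero times.
my @lazy = $word =~ m/([A-Z].*?)([A-Z].*)?/;

# Greedy first group: .* swallows the whole word.
my @greedy = $word =~ m/([A-Z].*)([A-Z].*)?/;

# Negated class: each group stops right before the next capital.
my @negated = $word =~ m/([A-Z][^A-Z]*)([A-Z][^A-Z]*)?/;

# Assumed alternative: one group with /g handles any number of humps.
my @humps = 'CamelCaseString' =~ m/([A-Z][^A-Z]*)/g;

# Unmatched optional groups come back as undef, so label them.
my $show = sub { join ', ', map { defined $_ ? "'$_'" : 'undef' } @_ };

say 'lazy:    ', $show->(@lazy);
say 'greedy:  ', $show->(@greedy);
say 'negated: ', $show->(@negated);
say 'humps:   ', $show->(@humps);
```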
More NFA rules 
● Matches occur as far left as possible. 
● Alternation has left-to-right precedence. 
● If an assertion doesn't match, backtracking occurs to try higher-pecking-order
assertions with different choices (such as
quantifier values, or alternatives).
● Quantifiers must be satisfied within their permissible range. 
● Each atom matches according to its designated semantics. If it 
fails, the engine backtracks and twiddles the atom's quantifier 
within the quantifier's permissible range.
The golden rule of programming 
Break the problem into manageable (smaller) 
problems.
Shorter segments are often easier 
'Brian and John attended' 
if( /Brian/ && /John/ ) { … } 
...is much easier to understand than... 
if( /Brian.*John|John.*Brian/ ) { … }
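A quick check, over a few invented attendance lines, that the two simple tests select the same records as the single combined pattern:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical input lines.
my @lines = (
    'Brian and John attended',
    'John met Brian afterwards',
    'Brian attended alone',
);

# Two small patterns joined by Perl's && operator.
my @two_tests = grep { /Brian/ && /John/ } @lines;

# One pattern that has to spell out both orderings.
my @one_regex = grep { /Brian.*John|John.*Brian/ } @lines;

say 'two tests: ', scalar @two_tests, ' lines';
say 'one regex: ', scalar @one_regex, ' lines';
```

Adding a third required name means one more && in the first form, but six orderings to enumerate in the second.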
Short-circuiting may be more 
runtime efficient. 
if( m/(john|guillermo)/i ) … 
if( m/john/ || m/guillermo/ ) … 
● The former has trie optimization. 
● The latter may still win if you live in North America.
Modifiers 
● /g (Match iteratively, or repeatedly) 
● /m (Alters semantics of ^ and $) 
● /s (Alters semantics of . (dot) ) 
● /x (Allow freeform whitespace)
Unicode semantic modifiers 
● ASCII Semantics: a 
● ASCII Really Really only: aa 
● Dual personality: d 
– The Pre-5.14 standard. 
● Unicode Semantics: u 
– use v5.14 or newer.
Freeform modifier
● /x ignores most whitespace.
m/(Now)\s # Comments.
(is)\s
(the)\s
(time.+)\z
/x
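The same pattern as a runnable sketch, applied to an assumed sample sentence. Two things to watch under /x: a literal space must be written as \s or escaped, and a comment must not contain the pattern delimiter.

```perl
use strict;
use warnings;
use feature 'say';

my $sentence = 'Now is the time for all good men';

my @parts = $sentence =~ m/
    (Now) \s      # the first word, then one whitespace character
    (is)  \s
    (the) \s
    (time.+) \z   # from "time" through the end of the string
/x;

say "[$_]" for @parts;
```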
/g modifier 
while( "string" =~ m/(.)/g ) {
print "$1\n";
} 
s 
t 
r 
...
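/g behaves differently in list and scalar context. This sketch shows both; the digit-run example and the use of pos() are additions for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# List context: /g returns every capture at once.
my @chars = 'string' =~ m/(.)/g;

# Scalar context (as in a while loop): each evaluation resumes where
# the previous match ended; pos() reports that offset.
my $text = 'a1b22c333';
my @found;
while ($text =~ m/(\d+)/g) {
    push @found, "$1 ends at " . pos($text);
}

say "chars: @chars";
say $_ for @found;
```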
Validation
The Prussian Stance 
Whitelist ● Allow what you trust.
The American Stance 
Blacklist ● Reject what you distrust
The stances 
● American (Blacklist) 
reject() 
if m/.../ 
● Prussian (Whitelist) 
accept() 
if m/.../
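A sketch contrasting the two stances on a filename-like input. The helper names blacklist_ok and whitelist_ok, the character lists, and the sample inputs are all hypothetical; real validation rules depend on where the value ends up.

```perl
use strict;
use warnings;
use feature 'say';

# Blacklist: reject only what we thought to distrust.
sub blacklist_ok {
    my ($input) = @_;
    return $input !~ m/[;<>|`]/ ? 1 : 0;
}

# Whitelist: accept only what we explicitly trust.
sub whitelist_ok {
    my ($input) = @_;
    return $input =~ m/\A[\w.-]+\z/ ? 1 : 0;
}

for my $input ('report.txt', 'a;b', 'backup$(reboot)') {
    say sprintf '%-16s blacklist=%d whitelist=%d',
        $input, blacklist_ok($input), whitelist_ok($input);
}
```

The third input slips past the blacklist because nobody listed `$` or parentheses; the whitelist rejects it without needing to anticipate it.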
Some people, when confronted with a problem, 
think "I know, I'll use regular expressions." Now 
they have two problems. 
– Jamie Zawinski
Perl's nature encourages the use of regular 
expressions almost to the exclusion of all other 
techniques; they are far and away the most 
"obvious" (at least, to people who don't know any 
better) way to get from point A to point B. 
– Jamie Zawinski
This issue is no longer unique to Perl
Know your problem. 
(And know when not to use regexes.)
RegExes are for matching patterns
● This should be obvious, but... 
– HTML? (Probably not...) 
● Tom Christiansen wrote an HTML parser 
– He recommends against it.
– JSON? (Um, no...) 
● Merlyn wrote a regex JSON parser. 
● JSON::Tiny provides a more robust solution, yet still 
compact enough for embedding.
– Email Addresses? (Don't waste your time...) 
● Mastering Regular Expressions, 1st Edition demonstrates 
a regular expression for matching email addresses. 
– It was two pages long, not fully compliant, and was omitted from 
the 2nd and 3rd editions.
“Regexes optimal for small HTML parsing 
problems, pessimal for large ones” 
“...it is much, much, much harder than almost anyone 
ever thinks it is.” 
“...you will eventually reach a point where you have to 
work harder to effect a solution that uses regexes than 
you would have to using a parsing class.” 
– Tom Christiansen
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to 
correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will 
not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the 
constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex 
queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even 
enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me 
crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot 
parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child 
weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the 
realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too 
late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If 
you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the 
One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the ne rves of 
the sentient whilst you observe, your psyche withering in the onslaught of horror. Regẻx̔̿-based HTML parsers are the cancer 
that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chil͡d ensures regex will 
consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone 
survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using 
regex as a tool to process HTML establishes a breach between this world and the dread realm of cͪ͒oͪͪ rrupt entities (like 
SGML entities, but more corrupt) a mere glimpse of the world of rege x parsers for HTML will inst antly transport a 
programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour 
your HTM L parser, application and existence for all time like Visual Basic only worse he comes he comes do not fig ht he 
come̡s̶, h̕i̵s unh̨ol͞y radianće ́destro҉ying all enliĝ̍̈́̈́htenment, HTML tags leak͠iņg͘ fro̶m̨ y̡ou ͟r eyes͢ ̸l̛i̕ke͏ liqu id pain, the song of 
reg̸ular expr ession parsing will extin guish the voices of mort al man from the sph ere I can see it can you see ̖̙̲͚ͪît̩̩̱̲͎́́ͪ̋̀ it is 
beautiful th e final snuffing of the lies of Man ALL IS LOŚ̩ͪ̏̈́T̗̪ ͇ ALL IS̷ LOST the pony̶ he comes he co̮mes he comes the icho r 
permeates all MY FACE MY FACE ᵒh god no NO NOOO̼O NΘ stop the an* ̶̅̾̾ ͑ͪg̙̤͏ͪͪ̑̾͆l̫͇̗̟̩̳̍͆ͪe͉̅s̠a̧͎͈ͪre̽̾̈́͒͑ no t rèͪ̌̑a͂ͪl̃ͪ ̘̙̝̆̾ZAL̡͊͠͝GΌ ISͪ̂҉̯͈ͪ ̘̱̹ 
TO̹̺͇ͅƝ̴ȳȳ TH̳̘Ë́̉ͪ ͠P̭̯O͍͊̚ N̐Y̡ H̸̡̪̯ͪ̅̎̽̾Ȩ̩̬ͪ̾̈́̾̀́͘ ̶̧̨̭̯̱̹ͪ̏͟C̷̙̝̲̮ͪ͏O̝̪ͪM̴͍̖̲͊ͪ̒̑̚̚͜E̞̟̟͌ͪ̿̔͝S̨̥̫͎̭ͪ̀ͅ 
Have you tried using an XML parser instead? 
-- Famous StackOverflow Rant
Appropriate Alternatives 
● Complex grammars 
– Parsing classes. 
● Fixed-width fields 
– unpack, substr. 
● Comma Separated Values 
– CSV libraries. 
● Uncomplicated, predictably patterned data. 
– Regular Expressions!
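For the fixed-width case, a minimal sketch using the unpack and substr built-ins. The record layout (two 10-character fields and one 3-character field) and the sample values are assumptions made up for this example.

```perl
use strict;
use warnings;
use feature 'say';

# Build a hypothetical fixed-width record: 10 + 10 + 3 columns.
my $record = sprintf '%-10s%-10s%-3s', 'Smith', 'Jane', 'SLC';

# unpack with 'A' fields slices by width and strips trailing spaces.
my ($last, $first, $city) = unpack 'A10 A10 A3', $record;

# substr reaches a single field by offset and length.
my $city_again = substr $record, 20, 3;

say "last=$last first=$first city=$city city_again=$city_again";
```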
Abuse! 
● Check if a number is prime: 
% perl -E 'say "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 1234567
– Attributed to Abigail: 
● http://www.cpan.org/misc/japh 
– brian d foy (Author of Mastering Perl) dissects it: 
● http://www.masteringperl.org/2013/06/how-abigails-prime-number-checker-works/
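The one-liner, unrolled into a commented sketch. The wrapper name is_prime_regex is hypothetical. The trick is unary arithmetic: the number n becomes a string of n ones, and the backreference asks whether that string splits into two or more equal chunks of length two or more.

```perl
use strict;
use warnings;
use feature 'say';

sub is_prime_regex {
    my ($n) = @_;
    my $unary = '1' x $n;    # e.g. 6 becomes '111111'

    # ^1?$         matches 0 and 1, which are not prime.
    # ^(11+?)\1+$  matches when the ones divide evenly into repeated
    #              groups of size >= 2, meaning n has a divisor.
    return $unary !~ m/^1?$|^(11+?)\1+$/ ? 1 : 0;
}

my @primes = grep { is_prime_regex($_) } 0 .. 20;
say "@primes";
```

Every failed group size is discovered by backtracking, so this is a curiosity rather than a practical primality test.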
“Driving home last night, I started 
realizing that the problem is solvable 
with pure regexes.” 
● N Queens Problem: A pure-regexp solution. 
– Abigail, again: http://perlmonks.org/?node=297616
References 
● Programming Perl, 4th Edition (O'Reilly)
● Mastering Regular Expressions, 3rd Edition (O'Reilly)
● Mastering Perl, 2nd Edition (O'Reilly)
● Regexp::Debugger – Damian Conway 
● perlre, perlretut, perlrecharclass
Dave Oswald 
daoswald@gmail.com 
http://saltlake.pm.org (PerlMongers) 
http://www.slideshare.net/daoswald/regex-talk-30408635 (SlideShare)

Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Regular Expressions: Backtracking, and The Little Engine that Could(n't)?

  • 1. Regular Expressions: The Little Engine That Could(n't)?
  • 2. Twitter ● #saintcon ● #perlreintro
  • 3. Salt Lake Perl Mongers ● The local “Perl Community” – Monthly meetings. – Partnership discounts. – Job announcements. – Everyone learns and grows. – For the love of Perl! ● http://saltlake.pm.org
  • 5. YAPC::NA::2015 – Salt Lake! ● Yet Another Perl Conference, North America ● Coming to Salt Lake City in June 2015 ● Check http://saltlake.pm.org for emerging details.
  • 6. Who am I? ● Dave Oswald – A Perpetual Hobbyist. ● Studied Economics and Computer Science at U of U. – Also CS in High School, SLCC, LAVC, and self-guided. ● Independent software developer and consultant. – Focus on Perl, C++, and server-side development. ● Solving problems is my hobby, passion ...and my work. ● daoswald@gmail.com ● Salt Lake Perl Mongers – http://saltlake.pm.org Aspiring to be Lazy, Impatient, and Hubristic.
  • 7. This Is Our Goal Today https://xkcd.com/208/
  • 8. oO(um...) This Is ^H^H^H^H^H^H^H^H^H^H^H^H^H^H
  • 9. This Is NOT Our Goal Today
  • 10. Examples will be in Perl $_ = 'Just another Perl hacker,'; s/Perl/$your_preference/; ● Because regexes are an integral part of Perl's syntax. ● Because I get to use some cool tools unique to Perl. ● Because it doesn't matter (PCRE is nearly ubiquitous). ● Because Perl's regexes are Unicode enabled (modern Perls). ● Because it's my talk.
  • 11. Some Definitions ● Literal Characters abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOP... 1234567890 ● Metacharacters \ | ( ) [ { ^ $ * + ? . ● Metasymbols \b \D \t \3 \s \n ...and many others ● Operators m// (match) s/// (substitute) =~ or !~ (bind)
  • 12. A trivial example $string = "Just another Perl hacker,"; # (Target) (Bound to) (Pattern) say "Match!" if $string =~ m/Perl/; Output: Match!
  • 13. Syntactic Shortcuts $_ = "Just another Perl hacker,"; # (Target) (Bound to) (Pattern) say "Match!" if /Perl/; Output: Match!
  • 15. /(non)?deterministic finite automata/ ● Deterministic Finite Automata – Text-directed match – No backtracking, more limited semantics. – awk, egrep, flex, lex, MySQL, Procmail ● Non-deterministic Finite Automata – Regex-directed match – Backtracking, more flexible semantics – GNU Emacs, Java, grep, less, more, .NET, PCRE library, Perl, PHP, Python, Ruby, sed, vi, C++11
  • 16. Our focus... ● NFA – Nondeterministic Finite Automata – It's more interesting. – We tend to use it in more places. – Perl's regular expression engine is based on NFA.
  • 17. Our focus... ● NFA – Nondeterministic Finite Automata – It's more interesting. – We tend to use it in more places. – Perl's regular expression engine is based on NFA. – And so are most other general-purpose implementations.
  • 18. Some Basics ● Literals match literals "Hello world!" =~ m/Hello/; # true. ● Alternation "Hello world!" =~ m/earth|world/; # true (world)
  • 19. Meta-symbols ● Some meta-symbols match classes of characters. ● "Hello world" =~ m/\w\s\w/; # true: (o w) ● Common symbols \w (an “identifier” character) \s (a “space” character) . (anything except newline – and sometimes newline too) \d (a numeric digit) ● See perldoc perlrecharclass
  • 20. Quantifiers ● Quantifiers allow for atoms to match repeatedly. "Loooong day" =~ m/o+/; # true (oooo) ● Common quantifiers + (One or more): /o+/ * (Zero or more): /Lo*/ {2} (Exactly 2): /o{2}/ {2,6} (2 to 6 times): /o{2,6}/ {2,} (2 or more times): /o{2,}/ ? (0 or 1 times): /o?/
  • 21. Controlling Greed ● Greedy is the default. "looong" =~ m/o+/; # ooo ● ? after a quantifier makes it lazy, or non-greedy. "looong" =~ m/o+?/; # o
  • 22. Greedy and Non-greedy Quantifiers ● Greedy *, +, {…}, {… , …}, ? 'aaaaa' =~ /\w+a/ # aaaaa ● Non-Greedy *?, +?, {…}?, {… , …}?, ?? 'aaaaa' =~ /\w+?a/ # aa
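A minimal sketch of the greedy and lazy examples from the two slides above, runnable as a script. The helper name first_match is hypothetical and exists only for this illustration; it returns the text the pattern actually consumed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the matched text, or undef if nothing matched.
sub first_match {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $& : undef;
}

print first_match('looong', qr/o+/),   "\n";   # ooo   (greedy takes all it can)
print first_match('looong', qr/o+?/),  "\n";   # o     (lazy takes the minimum)
print first_match('aaaaa',  qr/\w+a/), "\n";   # aaaaa (\w+ gives back one 'a')
print first_match('aaaaa',  qr/\w+?a/), "\n";  # aa    (\w+? takes one, then 'a')
```

Both forms find a match starting at the same position; the quantifier only changes how much of the string the match covers.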
  • 23. Anchors / Zero-width assertions. "Hello world" =~ /^world/; # false. "Hello world" =~ /world$/; # true. ● Common anchoring assertions – ^ (Beginning of string or line – /m dependent) – $ (End of string or line – /m dependent) – \A (Beginning of string, always.) – \z (End of string, always.) – \b (Boundary between \w and \W): "Apple+" =~ /\w\b/
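  • 24. Grouping ● (?: … ) – Non-capturing. ● "Daniel" =~ m/^(?:Dan|Nathan)iel$/; # true ● "Daniel" =~ m/^Dan|Nathaniel$/; # also true, but for the wrong reason: without grouping this means ^Dan OR Nathaniel$, so "Dan" alone satisfies it. ● ( … ) – Group and capture. ● "Daniel" =~ m/^(Dan|Nathan)iel$/; # Captures "Dan" into $1.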
  • 25. Captures ● ( … ) captures populate $1, $2, $3... ● Also \1, \2, \3 within regexp. ● Named captures: (?<name> … ) – Populates $+{name} – Also \g{name} within regexp.
  • 26. Capturing while( 'abc def ghi' =~ m/(?<trio>\w{3})/g ) { print "$+{trio}\n"; }
  • 27. Grouping creates composite atoms ● "eieio" =~ /(?:ei)+/; # Matches "eiei"
  • 28. Custom character classes ● [ … ] (An affirmative character class) "Hello" =~ m/[el]+/; # ell ● [^ … ] (A negated character class) "Hello" =~ m/[^el]+/; # H
  • 29. Character Class Ranges ● - (hyphen) is special within character classes. "12345" =~ m/[2-4]+/; # 234 ● A literal hyphen must be escaped, or placed at the end: "123-5" =~ m/[345-]+/; # 3-5 ● A literal ^ (caret) must be escaped, or must not be at the beginning. "12^7" =~ m/[0-9^]+/; # 12^7 "12^7" =~ m/[^0-9]+/; # ^
  • 30. Character Class Ranges in 2014 ● Unicode means this is probably wrong m/\A[a-z]*\z/i # Contains only letters (wrong) # 52 possibilities. ● This is probably better m/\A\p{Alpha}*\z/ # Contains only Alphabetic characters. # 102,159 possibilities.
  • 31. Character Class Ranges in 2014 ● Broken.... A BUG! m/^[a-zA-Z]*$/i ● You meant to say... m/\A\p{Alpha}*\z/
  • 32. Or to put it another way... my $user_city = "São João da Madeira"; reject() unless $user_city =~ m/^[A-Za-z\s]+$/; 21000 people on the west coast of Portugal are now unable to specify a valid billing address.
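The billing-address failure above can be reproduced in a few lines. This is a sketch only: the sub names ascii_city_ok and unicode_city_ok are hypothetical, and the city name is written with \x{e3} escapes for ã purely so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# "São João da Madeira", with ã spelled as \x{e3} to keep the source ASCII.
my $user_city = "S\x{e3}o Jo\x{e3}o da Madeira";

# ASCII-only whitelist: any accented letter causes a rejection.
sub ascii_city_ok   { return $_[0] =~ /\A[A-Za-z\s]+\z/ ? 1 : 0 }

# Unicode-aware whitelist: \p{Alpha} accepts any alphabetic code point.
sub unicode_city_ok { return $_[0] =~ /\A[\p{Alpha}\s]+\z/ ? 1 : 0 }

printf "ASCII class:   %s\n", ascii_city_ok($user_city)   ? 'accepted' : 'rejected';
printf "Unicode class: %s\n", unicode_city_ok($user_city) ? 'accepted' : 'rejected';
# ASCII class:   rejected
# Unicode class: accepted
```

In a real program the input would also need to be decoded into characters (for example from UTF-8) before matching; that step is outside this sketch.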
  • 33. Character classes may contain most metasymbols "1, 2, 3 Clap your hands for me" =~ m/^[\w\s,]{12}/ # 1, 2, 3 Clap ● Metasymbols that represent two or more code points are usually illegal inside character classes: \X, \R, for example. ● Dot (.) is literal in character classes. ● Quantifiers and alternation don't exist in character classes.
  • 34. Escape “special characters” ● Literal [ must be escaped with \ "John [Brown]" =~ m/\[(\w+)\]/; – Captures "Brown" ● Adding a \ escapes any special character: \\w \^ \{2\} \(...\)
  • 35. Quotemeta ● \Q and \E escape special characters between. "O(n^2)" =~ m/\Q(n^\E/; # (n^
  • 36. Zero-width Assertions ● \b Match a word boundary m/\w\b\W/ ● (?= … ), (?! … ), (?<= … ), (?<! … ) '%a' =~ m/(?<!%)\w/; # false
  • 38. Avoid leaning toothpicks ● Alternate delimiters "/usr/bin/perl" =~ m#^/([^/]+)/#; – Captures usr – Most non-identifier characters are fine as delimiters. ● A bad example "/usr/bin/perl" =~ m/^\/([^\/]+)\//; – Still captures usr, but ugly and prone to mistakes.
  • 40. Two big rules ● The Match That Begins Earliest Wins 'The dragging belly indicates your cat is too fat' /fat|cat|belly|your/ # matches 'belly' ● The Standard Quantifiers Are Greedy 'to be, or not to be' /(to.*)(or not to be)*/ $1 eq 'to be, or not to be' $2 is undef (the optional group never participates)
  • 41. Backtracking 'hot tonic tonight!' /to(nite|knight|night)/ $1 eq 'night' Matched "tonight" ● First tries to match "tonic" with "nite|knight|night" ● Then backtracked, advanced the position, attempted at 'o'
  • 42. Forcing greedy quantifiers to give up ground 'to be, or not to be' /(to.*)(or not to be)/ $1 eq 'to be, ' $2 eq 'or not to be' Watch the backtracking happen... ...twelve times.
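The two 'to be, or not to be' patterns from slides 40 and 42 can be compared side by side. A minimal sketch, assuming a hypothetical helper named captures that simply returns the capture groups of the first match:

```perl
use strict;
use warnings;

# Hypothetical helper: return the capture groups of the first match as a list.
sub captures {
    my ($string, $regex) = @_;
    my @groups = $string =~ $regex;
    return @groups;
}

my $text = 'to be, or not to be';

# Optional second group: .* keeps the whole string, so group 2 never
# participates and comes back undefined.
my ($all, $none) = captures($text, qr/(to.*)(or not to be)*/);
printf "[%s] [%s]\n", $all, defined $none ? $none : 'undef';
# [to be, or not to be] [undef]

# Required second group: .* must hand characters back until group 2 fits.
my ($head, $tail) = captures($text, qr/(to.*)(or not to be)/);
printf "[%s] [%s]\n", $head, $tail;
# [to be, ] [or not to be]
```

The only difference between the patterns is the trailing *, which decides whether the engine is ever forced to backtrack into .* at all.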
  • 44. Backtracking out of control 'aaaaaab' /(a*)*[^Bb]$/ “Regex failed to match after 213 steps”
  • 45. Backtracking under control 'aaaaaab' /(a*)*+[^Bb]$/ “Regex failed to match after 79 steps” *+, ++, ?+, {n,m}+: possessive quantifiers.
  • 46. Possessive Quantifiers ● A + symbol after a quantifier makes it possessive. ● (?> … ) – Another possessive construct. ● Possessive quantifiers stand their ground. – Backtracking through a possessive quantifier is disallowed.
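To see a possessive quantifier "stand its ground", here is a small sketch comparing a greedy, a possessive, and an atomic-group version of the same pattern. The helper is_match is hypothetical; possessive quantifiers need Perl 5.10 or newer.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the pattern matches the string, otherwise 0.
sub is_match {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

my $text = 'aaa';
for my $case (
    [ 'greedy     /a+a/',     qr/a+a/     ],   # a+ gives back one 'a'
    [ 'possessive /a++a/',    qr/a++a/    ],   # a++ keeps every 'a'
    [ 'atomic     /(?>a+)a/', qr/(?>a+)a/ ],   # same effect as a++
) {
    my ($label, $regex) = @$case;
    printf "%-22s %s\n", $label, is_match($text, $regex) ? 'matches' : 'fails';
}
# greedy     /a+a/       matches
# possessive /a++a/      fails
# atomic     /(?>a+)a/   fails
```

The possessive forms fail here by design: they trade the ability to backtrack for a guarantee that a failing match is abandoned quickly.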
  • 47. An extreme example 'a' x 64 /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[Bb]/ ● This will run for septillions of septillions of years (or until you kill the process). 'a' x 64 /(?> a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a* )[Bb]/x ● This will not (4550 iterations). (?> … ) is another possessive construct.
  • 48. Longest Leftmost? ● Not necessarily... 'oneselfsufficient' /one(self)?(selfsufficient)?/ ● Matches oneself ● Captures self ● Greedy quantifiers only give up if forced.
  • 49. Greedy, Lazy 'I said foo' /.*foo/ # Greedy; backtracks backwards. /.*?foo/ # Lazy; backtracks forward. 'CamelCase' # (We want up to two captures.) /([A-Z].*?)([A-Z].*)?/ # $1:'C' GOTCHA! /([A-Z].*)([A-Z].*)?/ # $1:'CamelCase' GOTCHA! /([A-Z][^A-Z]*)([A-Z][^A-Z]*)?/ # ok (kinda)
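The three CamelCase attempts above are easy to check directly. A sketch under the assumption that a hypothetical helper, two_words, returns both capture groups with '(undef)' standing in for a group that did not participate:

```perl
use strict;
use warnings;

# Hypothetical helper: return the capture groups, showing '(undef)' for
# any group that did not take part in the match.
sub two_words {
    my ($string, $regex) = @_;
    my @groups = $string =~ $regex;
    return map( defined($_) ? $_ : '(undef)', @groups );
}

my $word = 'CamelCase';

print join(' | ', two_words($word, qr/([A-Z].*?)([A-Z].*)?/)), "\n";
# C | (undef)          lazy .*? stops at once; the optional group is skipped

print join(' | ', two_words($word, qr/([A-Z].*)([A-Z].*)?/)), "\n";
# CamelCase | (undef)  greedy .* takes everything and is never forced back

print join(' | ', two_words($word, qr/([A-Z][^A-Z]*)([A-Z][^A-Z]*)?/)), "\n";
# Camel | Case         a negated class says what each word may contain
```

The third pattern works because it describes what each word contains instead of relying on greed or laziness to find the boundary.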
  • 50. More NFA rules ● Matches occur as far left as possible. ● Alternation has left-to-right precedence. ● If an assertion doesn't match, backtracking occurs to try higher-pecking-order assertions with different choices (such as quantifier values, or alternatives). ● Quantifiers must be satisfied within their permissible range. ● Each atom matches according to its designated semantics. If it fails, the engine backtracks and twiddles the atom's quantifier within the quantifier's permissible range.
  • 51. The golden rule of programming Break the problem into manageable (smaller) problems.
  • 52. Shorter segments are often easier 'Brian and John attended' if( /Brian/ && /John/ ) { … } ...is much easier to understand than... if( /Brian.*John|John.*Brian/ ) { … }
  • 53. Short-circuiting may be more runtime efficient. if( m/(john|guillermo)/i ) … if( m/john/ || m/guillermo/ ) … ● The former has trie optimization. ● The latter may still win if you live in North America.
  • 54. Modifiers ● /g (Match iteratively, or repeatedly) ● /m (Alters semantics of ^ and $) ● /s (Alters semantics of . (dot) ) ● /x (Allow freeform whitespace)
  • 55. Unicode semantic modifiers ● ASCII Semantics: a ● ASCII Really Really only: aa ● Dual personality: d – The Pre-5.14 standard. ● Unicode Semantics: u – use v5.14 or newer.
  • 56. Freeform modifier ● /x ignores most whitespace.
      m/(Now)\s     # Comments.
        (is)\s
        (the)\s
        (time.+)\z
      /x
  • 57. /g modifier while( "string" =~ m/(.)/g ) { print "$1\n"; } Output: s t r ...
  • 59. The Prussian Stance Whitelist ● Allow what you trust.
  • 60. The American Stance Blacklist ● Reject what you distrust
  • 61. The stances ● American (Blacklist) reject() if m/.../ ● Prussian (Whitelist) accept() if m/.../
  • 62. Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. – Jamie Zawinski
  • 63. Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most "obvious" (at least, to people who don't know any better) way to get from point A to point B. – Jamie Zawinski
  • 64. This issue is no longer unique to Perl
  • 65. Know your problem. (And know when not to use regexes.)
  • 66. RegExes are for matching patterns ● This should be obvious, but... – HTML? (Probably not...) ● Tom Christiansen wrote an HTML parser – He recommends against it.
  • 67. RegExes are for matching patterns ● This should be obvious, but... – HTML? (Probably not...) – JSON? (Um, no...) ● Merlyn wrote a regex JSON parser. ● JSON::Tiny provides a more robust solution, yet still compact enough for embedding.
  • 68. RegExes are for matching patterns ● This should be obvious, but... – HTML? (Probably not...) – JSON? (Um, no...) – Email Addresses? (Don't waste your time...) ● Mastering Regular Expressions, 1st Edition demonstrates a regular expression for matching email addresses. – It was two pages long, not fully compliant, and was omitted from the 2nd and 3rd editions.
  • 69. “Regexes optimal for small HTML parsing problems, pessimal for large ones” “...it is much, much, much harder than almost anyone ever thinks it is.” “...you will eventually reach a point where you have to work harder to effect a solution that uses regexes than you would have to using a parsing class.” – Tom Christiansen
  • 70. You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the ne rves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Regẻx̔̿-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chil͡d ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of cͪ͒oͪͪ rrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of rege x parsers for HTML will inst antly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTM L parser, application and existence for all time like Visual Basic only worse he comes he comes do not fig ht he come̡s̶, h̕i̵s unh̨ol͞y radianće ́destro҉ying all enliĝ̍̈́̈́htenment, HTML tags leak͠iņg͘ fro̶m̨ y̡ou ͟r eyes͢ ̸l̛i̕ke͏ liqu id pain, the song of reg̸ular expr ession parsing will extin guish the voices of mort al man from the sph ere I can see it can you see ̖̙̲͚ͪît̩̩̱̲͎́́ͪ̋̀ it is beautiful th e final snuffing of the lies of Man ALL IS LOŚ̩ͪ̏̈́T̗̪ ͇ ALL IS̷ LOST the pony̶ he comes he co̮mes he comes the icho r permeates all MY FACE MY FACE ᵒh god no NO NOOO̼O NΘ stop the an* ̶̅̾̾ ͑ͪg̙̤͏ͪͪ̑̾͆l̫͇̗̟̩̳̍͆ͪe͉̅s̠a̧͎͈ͪre̽̾̈́͒͑ no t rèͪ̌̑a͂ͪl̃ͪ ̘̙̝̆̾ZAL̡͊͠͝GΌ ISͪ̂҉̯͈ͪ ̘̱̹ TO̹̺͇ͅƝ̴ȳȳ TH̳̘Ë́̉ͪ ͠P̭̯O͍͊̚ N̐Y̡ H̸̡̪̯ͪ̅̎̽̾Ȩ̩̬ͪ̾̈́̾̀́͘ ̶̧̨̭̯̱̹ͪ̏͟C̷̙̝̲̮ͪ͏O̝̪ͪM̴͍̖̲͊ͪ̒̑̚̚͜E̞̟̟͌ͪ̿̔͝S̨̥̫͎̭ͪ̀ͅ Have you tried using an XML parser instead? -- Famous StackOverflow Rant
  • 71. Appropriate Alternatives ● Complex grammars – Parsing classes. ● Fixed-width fields – unpack, substr. ● Comma Separated Values – CSV libraries. ● Uncomplicated, predictably patterned data. – Regular Expressions!
  • 72. Abuse! ● Check if a number is prime: % perl -E 'say "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 1234567 – Attributed to Abigail: ● http://www.cpan.org/misc/japh – brian d foy (Author of Mastering Perl) dissects it: ● http://www.masteringperl.org/2013/06/how-abigails-prime-number-checker-works/
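For anyone who wants to poke at the prime test, the one-liner can be wrapped in a sub. This is only a sketch of the same trick: the name is_prime is hypothetical, and the approach is a curiosity, since the unary string and the backtracking make it impractical for large numbers.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex attributed to Abigail.
# The number is written in unary ('1' repeated $n times). The pattern
# matches when the string is empty or a single '1' (0 and 1 are not prime),
# or when it splits into two or more equal groups of two or more 1s,
# which means $n has a divisor and is composite.
sub is_prime {
    my ($n) = @_;
    my $unary = '1' x $n;
    return $unary !~ /^1?$|^(11+?)\1+$/ ? 1 : 0;
}

print join(' ', grep { is_prime($_) } 0 .. 30), "\n";
# 2 3 5 7 11 13 17 19 23 29
```

The backreference \1 does the real work: it forces every later group to be the same length as the first, so a successful match is a successful division.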
  • 73. “Driving home last night, I started realizing that the problem is solvable with pure regexes.” ● N Queens Problem: A pure-regexp solution. – Abigail, again: http://perlmonks.org/?node=297616
  • 74. References ● Programming Perl, 4th Edition (O'Reilly) ● Mastering Regular Expressions, 3rd Edition (O'Reilly) ● Mastering Perl, 2nd Edition (O'Reilly) ● Regexp::Debugger – Damian Conway ● perlre, perlretut, perlrecharclass
  • 75. Dave Oswald daoswald@gmail.com http://saltlake.pm.org (PerlMongers) http://www.slideshare.net/daoswald/regex-talk-30408635 (SlideShare)