Stop overusing regular expressions!
Franklin Chen
http://franklinchen.com/
Pittsburgh Tech Fest 2013
June 1, 2013
Famous quote
In 1997, Jamie Zawinski famously wrote:
Some people, when confronted with a problem, think,
“I know, I’ll use regular expressions.”
Now they have two problems.
Purpose of this talk
Assumption: you already have experience using regexes
Goals:
Change how you think about and use regexes
Introduce you to advantages of using parser combinators
Show a smooth way to transition from regexes to parsers
Discuss practical tradeoffs
Show only tested, running code:
https://github.com/franklinchen/talk-on-overusing-regular-expressions
Non-goals:
Will not discuss computer science theory
Will not discuss parser generators such as yacc and ANTLR
Example code in which languages?
Considerations:
This is a polyglot conference
Time limitations
Decision: focus primarily on two representative languages.
Ruby: dynamic typing
Perl, Python, JavaScript, Clojure, etc.
Scala: static typing
Java, C++, C#, F#, ML, Haskell, etc.
GitHub repository also has similar-looking code for
Perl
JavaScript
An infamous regex for email
The reason for my talk!
A big Perl regex for email address based on RFC 822 grammar:
/(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:".[] 000-031]+
)+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:r
rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[]
?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]
t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@
31]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[
](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:
(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^
(?:rn)?[ t])*))*|(?:[^()<>@,;:".[] 000-031]+(?:(?:
|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[
?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:".[] 0
rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|
t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-
?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*
)*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-03
t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*]
Personal story: my email address fiasco
To track and prevent spam: FranklinChen+spam@cmu.edu
Some sites wrongly claimed invalid (because of +)
Other sites did allow registration
I caught spam
Unsubscribing failed!
Some wrongly claimed the address was invalid (!?!)
Some silently failed to unsubscribe
Had to set up spam filter
Problem: different regexes for email?
Examples: which one is better?
/\A[^@]+@[^@]+\z/
vs.
/\A[a-zA-Z]+@([^@.]+\.)+[^@.]+\z/
Readability: first example
Use the /x flag for readability!
/\A # match string begin
[^@]+ # local part: 1+ chars of not @
@ # @
[^@]+ # domain: 1+ chars of not @
\z/x # match string end
matches
FranklinChen+spam
@
cmu.edu
Advice: please write regexes in this formatted style!
Readability: second example
/\A # match string begin
[a-zA-Z]+ # local part: 1+ of Roman alphabetic
@ # @
( # 1+ of this group
  [^@.]+ # 1+ of not (@ or dot)
  \. # dot
)+
[^@.]+ # 1+ of not (@ or dot)
\z/x # match string end
does not match
FranklinChen+spam
@
cmu.edu
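The two formatted regexes can be tried side by side. The following is a minimal Ruby sketch, assuming Ruby's regex syntax (which also supports /x, \A, and \z); the constant names SIMPLE_EMAIL and STRICT_EMAIL are illustrative and do not come from the talk's repository.

```ruby
# Illustrative names; not taken from the talk's repository.
SIMPLE_EMAIL = /\A   # match string begin
  [^@]+              # local part: 1+ chars of not @
  @                  # the at sign
  [^@]+              # domain: 1+ chars of not @
\z/x

STRICT_EMAIL = /\A   # match string begin
  [a-zA-Z]+          # local part: 1+ of Roman alphabetic
  @                  # the at sign
  (                  # 1+ of this group
    [^@.]+           # 1+ chars that are neither @ nor dot
    \.               # dot
  )+
  [^@.]+             # 1+ chars that are neither @ nor dot
\z/x

address = 'FranklinChen+spam@cmu.edu'
puts SIMPLE_EMAIL.match?(address)  # true: the permissive regex accepts the +
puts STRICT_EMAIL.match?(address)  # false: the letters-only local part rejects it
```

Neither regex explains why it rejected the input; it only answers yes or no.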
Don’t Do It Yourself: find libraries
Infamous regex revisited: was automatically generated into the Perl
module Mail::RFC822::Address based on the RFC 822 spec.
If you use Perl and need to validate email addresses:
# Install the library
$ cpanm Mail::RFC822::Address
Other Perl regexes: Regexp::Common
Regex match success vs. failure
Parentheses for captured grouping:
if ($address =~ /\A # match string begin
    ([a-zA-Z]+) # $1: local part: 1+ Roman alphabetic
    @ # @
    ( # $2: entire domain
      (?: # 1+ of this non-capturing group
        [^@.]+ # 1+ of not (@ or dot)
        \. # dot
      )+
      ([^@.]+) # $3: 1+ of not (@ or dot)
    )
    \z/x) { # match string end
    print "local part: $1; domain: $2; top level: $3\n";
}
else {
    print "Regex match failed!\n";
}
Regexes do not report errors usefully
Success:
$ perl/extract_parts.pl 'prez@whitehouse.gov'
local part: prez; domain: whitehouse.gov; top level: gov
But failure:
$ perl/extract_parts.pl 'FranklinChen+spam@cmu.edu'
Regex match failed!
We have a dilemma:
Bug in the regex?
The data really doesn’t match?
Better error reporting: how?
Would prefer something at least minimally like:
[1.13] failure: ‘@’ expected but ‘+’ found
FranklinChen+spam@cmu.edu
^
Features of parser libraries:
Extraction of line, column information of error
Automatically generated error explanation
Hooks for customizing error explanation
Hooks for error recovery
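To show where such a message can come from, here is a hand-rolled sketch using StringScanner from Ruby's standard library. It is not how any parser library is implemented; check_email and its list of steps are hypothetical names made up for this illustration. The idea is only that a parser consumes input piece by piece, so it always knows the position at which an expectation failed.

```ruby
require 'strscan'

# Hypothetical helper, not from the talk: consume the address step by step
# and report the column of the first step that fails.
def check_email(address)
  scanner = StringScanner.new(address)
  steps = [
    [/[a-zA-Z]+/, 'letters'],
    [/@/, '"@"'],
    [/[^@.]+(?:\.[^@.]+)+/, 'domain'],
  ]
  steps.each do |pattern, expected|
    next if scanner.scan(pattern)
    found = scanner.eos? ? 'end of input' : scanner.peek(1).inspect
    return "[1.#{scanner.pos + 1}] failure: #{expected} expected but #{found} found"
  end
  return "[1.#{scanner.pos + 1}] failure: end of input expected" unless scanner.eos?
  'ok'
end

puts check_email('prez@whitehouse.gov')
puts check_email('FranklinChen+spam@cmu.edu')
```

The column is simply the scanner position plus one, which assumes single-line ASCII input.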
Moving from regexes to grammar parsers
Only discuss parser combinators, not generators
Use the second email regex as a starting point
Use modularization
Modularized regex: Ruby
Ruby allows interpolation of regexes into other regexes:
class EmailValidator::PrettyRegexMatcher
  LOCAL_PART = /[a-zA-Z]+/x
  DOMAIN_CHAR = /[^@.]/x
  SUB_DOMAIN = /#{DOMAIN_CHAR}+/x
  AT = /@/x
  DOT = /\./x
  DOMAIN = /( #{SUB_DOMAIN} #{DOT} )+ #{SUB_DOMAIN}/x
  EMAIL = /\A #{LOCAL_PART} #{AT} #{DOMAIN} \z/x
  def match(s)
    s =~ EMAIL
  end
end
Modularized regex: Scala
object PrettyRegexMatcher {
  val user = """[a-zA-Z]+"""
  val domainChar = """[^@.]"""
  val domainSegment = raw"""(?x) $domainChar+"""
  val at = """@"""
  val dot = """\."""
  val domain = raw"""(?x)
    ($domainSegment $dot)+ $domainSegment"""
  val email = raw"""(?x) \A $user $at $domain \z"""
  val emailRegex = email.r
  def matches(s: String): Boolean =
    (emailRegex findFirstMatchIn s).nonEmpty
}
Triple quotes are for raw strings (no backslash interpretation)
raw allows raw variable interpolation
.r is a method converting String to Regex
Email parser: Ruby
Ruby does not have a standard parser combinator library. One that
is popular is Parslet.
require 'parslet'
class EmailValidator::Parser < Parslet::Parser
  rule(:local_part) { match['a-zA-Z'].repeat(1) }
  rule(:domain_char) { match['^@.'] }
  rule(:at) { str('@') }
  rule(:dot) { str('.') }
  rule(:sub_domain) { domain_char.repeat(1) }
  rule(:domain) { (sub_domain >> dot).repeat(1) >>
                  sub_domain }
  rule(:email) { local_part >> at >> domain }
end
Email parser: Scala
Scala comes with a standard parser combinator library.
import scala.util.parsing.combinator.RegexParsers
object EmailParsers extends RegexParsers {
  override def skipWhitespace = false
  def localPart: Parser[String] = """[a-zA-Z]+""".r
  def domainChar = """[^@.]""".r
  def at = "@"
  def dot = "."
  def subDomain = rep1(domainChar)
  def domain = rep1(subDomain ~ dot) ~ subDomain
  def email = localPart ~ at ~ domain
}
Inheriting from RegexParsers allows the implicit conversions from
regexes into parsers.
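To make the combinator idea concrete without any library, here is a from-scratch Ruby sketch. The names lit, seq, rep1, and parse_all are hypothetical; they are not the Parslet or Scala APIs, and real libraries also build result values and error messages. Here each parser is just a lambda that takes the input and a position and returns the new position, or nil on failure.

```ruby
# Each parser is a lambda: (input, pos) -> new pos on success, nil on failure.

# Primitive parser: match a regex anchored at the current position.
def lit(pattern)
  anchored = /\G(?:#{pattern.source})/
  ->(input, pos) { (m = anchored.match(input, pos)) ? m.end(0) : nil }
end

# Sequencing (like >> in Parslet or ~ in Scala): thread the position through.
def seq(*parsers)
  ->(input, pos) { parsers.reduce(pos) { |p, parser| p && parser.call(input, p) } }
end

# One or more repetitions (like repeat(1) or rep1), greedy, no backtracking.
# Assumes the repeated parser always consumes at least one character.
def rep1(parser)
  lambda do |input, pos|
    last = parser.call(input, pos)
    while last && (nxt = parser.call(input, last))
      last = nxt
    end
    last
  end
end

# Succeeds only if the parser consumes the whole input.
def parse_all(parser, input)
  parser.call(input, 0) == input.length
end

local_part = lit(/[a-zA-Z]+/)
at         = lit(/@/)
dot        = lit(/\./)
sub_domain = lit(/[^@.]+/)
domain     = seq(rep1(seq(sub_domain, dot)), sub_domain)
email      = seq(local_part, at, domain)

puts parse_all(email, 'prez@whitehouse.gov')        # true
puts parse_all(email, 'FranklinChen+spam@cmu.edu')  # false
```

The grammar rules read the same as in the Parslet and Scala versions; only the plumbing underneath differs.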
Running and reporting errors: Ruby
parser = EmailValidator::Parser.new
begin
  parser.email.parse(address)
  puts "Successfully parsed #{address}"
rescue Parslet::ParseFailed => error
  puts "Parse failed, and here's why:"
  puts error.cause.ascii_tree
end
$ validate_email 'FranklinChen+spam@cmu.edu'
Parse failed, and here's why:
Failed to match sequence (LOCAL_PART AT DOMAIN) at
line 1 char 13.
`- Expected "@", but got "+" at line 1 char 13.
Running and reporting errors: Scala
def runParser(address: String): Unit = {
  import EmailParsers._
  // parseAll is a method inherited from RegexParsers
  parseAll(email, address) match {
    case Success(_, _) =>
      println(s"Successfully parsed $address")
    case failure: NoSuccess =>
      println("Parse failed, and here's why:")
      println(failure)
  }
}
Parse failed, and here's why:
[1.13] failure: ‘@’ expected but ‘+’ found
FranklinChen+spam@cmu.edu
^
Infamous email regex revisited: conversion!
Mail::RFC822::Address Perl regex: actually manually
back-converted from a parser!
Original module used Perl parser library Parse::RecDescent
Regex author turned the grammar rules into a modularized
regex
Reason: speed
Perl modularized regex source code excerpt:
# ...
my $localpart = "$word(?:\\.$lwsp*$word)*";
my $sub_domain = "(?:$atom|$domain_literal)";
my $domain = "$sub_domain(?:\\.$lwsp*$sub_domain)*";
my $addr_spec = "$localpart\@$lwsp*$domain";
# ...
Why validate email address anyway?
Software development: always look at the bigger picture
This guy advocates simply sending a user an activation email
Engineering tradeoff: the email sending and receiving
programs need to handle the email address anyway
Email example wrapup
It is possible to write a regex
But first try to use someone else’s tested regex
If you do write your own regex, modularize it
For error reporting, use a parser: convert from modularized
regex
Do you even need to solve the problem?
More complex parsing: toy JSON
{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}
Trick question: could you use a regex to parse this?
Recursive regular expressions?
2007: Perl 5.10 introduced “recursive regular expressions” that can
parse context-free grammars that have rules embedding a reference
to themselves.
$regex = qr/
  \( # match an open paren (
  ( # followed by
    [^()]+ # 1+ non-paren characters
    | # OR
    (??{$regex}) # the regex itself
  )* # repeated 0+ times
  \) # followed by a close paren )
/x;
Not recommended!
But, gateway to context-free grammars
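Ruby's regex engine offers a comparable feature through named subexpression calls (\g<name>). The following minimal sketch matches balanced parentheses the same way; the constant BALANCED and the group name paren are chosen for illustration only.

```ruby
# Balanced parentheses via a named group that calls itself.
BALANCED = /\A
  (?<paren>
    \(              # an open paren
    (?:
      [^()]+        # 1+ non-paren characters
    |               # OR
      \g<paren>     # the group itself, recursively
    )*              # repeated 0+ times
    \)              # a close paren
  )
\z/x

puts BALANCED.match?('(a(b)(c(d)))')  # true
puts BALANCED.match?('(a(b)')         # false
```

As with the Perl version, this still gives no hint about where an unbalanced input went wrong.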
Toy JSON parser: Ruby
To save time:
No more Ruby code in this presentation
Continuing with Scala since most concise
Remember: concepts are language-independent!
Full code for a toy JSON parser is available in the Parslet repository
at https://github.com/kschiess/parslet/blob/master/example/json.rb
Toy JSON parser: Scala
object ToyJSONParsers extends JavaTokenParsers {
  def value: Parser[Any] = obj | arr |
    stringLiteral | floatingPointNumber |
    "null" | "true" | "false"
  def obj = "{" ~ repsep(member, ",") ~ "}"
  def arr = "[" ~ repsep(value, ",") ~ "]"
  def member = stringLiteral ~ ":" ~ value
}
Inherit from JavaTokenParsers: reuse the stringLiteral and floatingPointNumber parsers
~ is an overloaded operator that means sequencing of parsers
value parser returns Any (equivalent to Java Object) because we have not yet refined the parser with our own domain model
Fancier toy JSON parser: domain modeling
We want to actually query and use the data once it is parsed.
{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}
We may want to traverse the JSON to address and then to
the second neighbor, to check whether it is "Y"
Pseudocode after storing in domain model:
data.address.neighbors[1] == "Y"
Domain modeling as objects
{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}
A natural domain model:
JObject(
Map(distance -> JFloat(5.6),
address -> JObject(
Map(street -> JString(0 Nowhere Road),
neighbors ->
JArray(List(JString(X), JString(Y))),
garage -> JNull))))
Domain modeling: Scala
// sealed means: can't extend outside the source file
sealed abstract class ToyJSON
case class JObject(map: Map[String, ToyJSON]) extends ToyJSON
case class JArray(list: List[ToyJSON]) extends ToyJSON
case class JString(string: String) extends ToyJSON
case class JFloat(float: Float) extends ToyJSON
case object JNull extends ToyJSON
case class JBoolean(boolean: Boolean) extends ToyJSON
Fancier toy JSON parser: Scala
def value: Parser[ToyJSON] = obj | arr |
  stringLiteralStripped ^^ { s => JString(s) } |
  floatingPointNumber ^^ { s => JFloat(s.toFloat) } |
  "null" ^^^ JNull | "true" ^^^ JBoolean(true) |
  "false" ^^^ JBoolean(false)
def stringLiteralStripped: Parser[String] =
  stringLiteral ^^ { s => s.substring(1, s.length-1) }
def obj: Parser[JObject] = "{" ~> (repsep(member, ",") ^^
  { aList => JObject(aList.toMap) }) <~ "}"
def arr: Parser[JArray] = "[" ~> (repsep(value, ",") ^^
  { aList => JArray(aList) }) <~ "]"
def member: Parser[(String, ToyJSON)] =
  ((stringLiteralStripped <~ ":") ~ value) ^^
  { case s ~ v => s -> v }
Running the fancy toy JSON parser
val j = """{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}"""
parseAll(value, j) match {
  case Success(result, _) =>
    println(result)
  case failure: NoSuccess =>
    println("Parse failed, and here's why:")
    println(failure)
}
Output: what we proposed when data modeling!
JSON: reminder not to reinvent
Use a standard JSON parsing library!
Scala has one.
Most languages have a standard or widely used JSON parsing library.
Shop around: alternate libraries have different tradeoffs.
Other standard formats: HTML, XML, CSV, etc.
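In Ruby, for instance, the json library that ships with the language handles the toy document and the earlier traversal query directly. A small sketch, with variable names chosen for illustration:

```ruby
require 'json'

document = <<~DOC
  {
    "distance" : 5.6,
    "address" : {
      "street" : "0 Nowhere Road",
      "neighbors" : ["X", "Y"],
      "garage" : null
    }
  }
DOC

# JSON objects become Hashes, arrays become Arrays, null becomes nil.
data = JSON.parse(document)
second_neighbor = data['address']['neighbors'][1]
puts second_neighbor == 'Y'  # true

# Malformed input raises an exception carrying a diagnostic message.
begin
  JSON.parse('{"distance" : }')
rescue JSON::ParserError => e
  puts "Parse failed: #{e.message}"
end
```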
JSON wrapup
Parsing can be simple
Domain modeling is trickier but still fits on one page
Use an existing JSON parsing library already!
Final regex example
These slides use the Python library Pygments for syntax
highlighting of all code (highly recommended)
Experienced bugs in the highlighting of regexes in Perl code
Found out why, in source code for class PerlLexer
#TODO: give this to a perl guy who knows how to parse perl
tokens = {
’balanced-regex’: [
(r’/(|[^]|[^/])*/[egimosx]*’, String.Regex, ’#p
(r’!(|[^]|[^!])*![egimosx]*’, String.Regex, ’#p
(r’(|[^])*[egimosx]*’, String.Regex, ’#pop’),
(r’{(|[^]|[^}])*}[egimosx]*’, String.Regex, ’#p
(r’<(|[^]|[^>])*>[egimosx]*’, String.Regex, ’#p
(r’[(|[^]|[^]])*][egimosx]*’, String.Regex,
(r’((|[^]|[^)])*)[egimosx]*’, String.Regex,
(r’@(|[^]|[^@])*@[egimosx]*’, String.Regex, ’#
(r’%(|[^]|[^%])*%[egimosx]*’, String.Regex, ’#
(r’$(|[^]|[^$])*$[egimosx]*’, String.Regex,
Conclusion
Regular expressions
No error reporting
Flat data
More general grammars
Composable structure (using combinator parsers)
Hierarchical, nested data
Avoid reinventing
Concepts are language-independent: find and use a parser
library for your language
All materials for this talk are available at
https://github.com/franklinchen/talk-on-overusing-regular-expressions
The hyperlinks on the slide PDFs are clickable.

 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 

Stop overusing regular expressions!

  • 1. Stop overusing regular expressions! Franklin Chen http://franklinchen.com/ Pittsburgh Tech Fest 2013 June 1, 2013
  • 2. Famous quote In 1997, Jamie Zawinski famously wrote: Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems.
  • 3. Purpose of this talk
    Assumption: you already have experience using regexes
    Goals:
      Change how you think about and use regexes
      Introduce you to advantages of using parser combinators
      Show a smooth way to transition from regexes to parsers
      Discuss practical tradeoffs
      Show only tested, running code: https://github.com/franklinchen/talk-on-overusing-regular-expressions
    Non-goals:
      Will not discuss computer science theory
      Will not discuss parser generators such as yacc and ANTLR
  • 4. Example code in which languages?
    Considerations:
      This is a polyglot conference
      Time limitations
    Decision: focus primarily on two representative languages.
      Ruby: dynamic typing
        Perl, Python, JavaScript, Clojure, etc.
      Scala: static typing
        Java, C++, C#, F#, ML, Haskell, etc.
    GitHub repository also has similar-looking code for
      Perl
      JavaScript
  • 5. An infamous regex for email
    The reason for my talk!
    A big Perl regex for email address based on RFC 822 grammar:
      /(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+
      )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r
      \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
      ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]
      \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@
      31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[
      ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:
      (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^
      (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:
      |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
      ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \0
      r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|
      \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-
      ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*
      )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\03
      \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]
  • 6. Personal story: my email address fiasco
    To track and prevent spam: FranklinChen+spam@cmu.edu
    Some sites wrongly claimed the address was invalid (because of +)
    Other sites did allow registration
      I caught spam
      Unsubscribing failed!
        Some wrongly claimed the address was invalid (!?!)
        Some silently failed to unsubscribe
      Had to set up spam filter
  • 7. Problem: different regexes for email?
    Examples: which one is better?
      /\A[^@]+@[^@]+\z/
    vs.
      /\A[a-zA-Z]+@([^@.]+\.)+[^@.]+\z/
  • 8. Readability: first example
    Use x for readability!
      /\A       # match string begin
       [^@]+    # local part: 1+ chars of not @
       @        # @
       [^@]+    # domain: 1+ chars of not @
       \z/x     # match string end
    matches FranklinChen+spam@cmu.edu
    Advice: please write regexes in this formatted style!
  • 9. Readability: second example
      /\A          # match string begin
       [a-zA-Z]+   # local part: 1+ of Roman alphabetic
       @           # @
       (           # 1+ of this group
         [^@.]+    # 1+ of not (@ or dot)
         \.        # dot
       )+
       [^@.]+      # 1+ of not (@ or dot)
       \z/x        # match string end
    does not match FranklinChen+spam@cmu.edu
  • 10. Don’t Do It Yourself: find libraries
    Infamous regex revisited: was automatically generated into the Perl module Mail::RFC822::Address based on the RFC 822 spec.
    If you use Perl and need to validate email addresses:
      # Install the library
      $ cpanm Mail::RFC822::Address
    Other Perl regexes: Regexp::Common
  • 11. Regex match success vs. failure
    Parentheses for captured grouping:
      if ($address =~
            /\A               # match string begin
             ([a-zA-Z]+)      # $1: local part: 1+ Roman alphabetic
             @                # @
             (                # $2: entire domain
               (?:            # 1+ of this non-capturing group
                 [^@.]+       # 1+ of not (@ or dot)
                 \.           # dot
               )+
               ([^@.]+)       # $3: 1+ of not (@ or dot)
             )
             \z/x) {          # match string end
        print "local part: $1; domain: $2; top level: $3\n";
      } else {
        print "Regex match failed!\n";
      }
  • 12. Regexes do not report errors usefully
    Success:
      $ perl/extract_parts.pl 'prez@whitehouse.gov'
      local part: prez; domain: whitehouse.gov; top level: gov
    But failure:
      $ perl/extract_parts.pl 'FranklinChen+spam@cmu.edu'
      Regex match failed!
    We have a dilemma:
      Bug in the regex?
      The data really doesn’t match?
  • 13. Better error reporting: how?
    Would prefer something at least minimally like:
      [1.13] failure: `@' expected but `+' found

      FranklinChen+spam@cmu.edu
                  ^
    Features of parser libraries:
      Extraction of line, column information of error
      Automatically generated error explanation
      Hooks for customizing error explanation
      Hooks for error recovery
  • 14. Moving from regexes to grammar parsers Only discuss parser combinators, not generators Use the second email regex as a starting point Use modularization
  • 15. Modularized regex: Ruby
    Ruby allows interpolation of regexes into other regexes:
      class EmailValidator::PrettyRegexMatcher
        LOCAL_PART = /[a-zA-Z]+/x
        DOMAIN_CHAR = /[^@.]/x
        SUB_DOMAIN = /#{DOMAIN_CHAR}+/x
        AT = /@/x
        DOT = /\./x
        DOMAIN = /( #{SUB_DOMAIN} #{DOT} )+ #{SUB_DOMAIN}/x
        EMAIL = /\A #{LOCAL_PART} #{AT} #{DOMAIN} \z/x

        def match(s)
          s =~ EMAIL
        end
      end
  • 16. Modularized regex: Scala
      object PrettyRegexMatcher {
        val user = """[a-zA-Z]+"""
        val domainChar = """[^@.]"""
        val domainSegment = raw"""(?x) $domainChar+"""
        val at = """@"""
        val dot = """\."""
        val domain = raw"""(?x) ($domainSegment $dot)+ $domainSegment"""
        val email = raw"""(?x) \A $user $at $domain \z"""
        val emailRegex = email.r

        def matches(s: String): Boolean =
          (emailRegex findFirstMatchIn s).nonEmpty
      }
    Triple quotes are for raw strings (no backslash interpretation)
    raw allows raw variable interpolation
    .r is a method converting String to Regex
  • 17. Email parser: Ruby
    Ruby does not have a standard parser combinator library. One that is popular is Parslet.
      require 'parslet'

      class EmailValidator::Parser < Parslet::Parser
        rule(:local_part) { match['a-zA-Z'].repeat(1) }
        rule(:domain_char) { match['^@.'] }
        rule(:at) { str('@') }
        rule(:dot) { str('.') }
        rule(:sub_domain) { domain_char.repeat(1) }
        rule(:domain) { (sub_domain >> dot).repeat(1) >> sub_domain }
        rule(:email) { local_part >> at >> domain }
      end
  • 18. Email parser: Scala
    Scala comes with a standard parser combinator library.
      import scala.util.parsing.combinator.RegexParsers

      object EmailParsers extends RegexParsers {
        override def skipWhitespace = false

        def localPart: Parser[String] = """[a-zA-Z]+""".r
        def domainChar = """[^@.]""".r
        def at = "@"
        def dot = "."
        def subDomain = rep1(domainChar)
        def domain = rep1(subDomain ~ dot) ~ subDomain
        def email = localPart ~ at ~ domain
      }
    Inheriting from RegexParsers enables the implicit conversions from regexes into parsers.
  • 19. Running and reporting errors: Ruby
      parser = EmailValidator::Parser.new
      begin
        parser.email.parse(address)
        puts "Successfully parsed #{address}"
      rescue Parslet::ParseFailed => error
        puts "Parse failed, and here's why:"
        puts error.cause.ascii_tree
      end

      $ validate_email 'FranklinChen+spam@cmu.edu'
      Parse failed, and here's why:
      Failed to match sequence (LOCAL_PART AT DOMAIN) at line 1 char 13.
      `- Expected "@", but got "+" at line 1 char 13.
  • 20. Running and reporting errors: Scala
      def runParser(address: String) {
        import EmailParsers._

        // parseAll is method inherited from RegexParsers
        parseAll(email, address) match {
          case Success(_, _) =>
            println(s"Successfully parsed $address")
          case failure: NoSuccess =>
            println("Parse failed, and here's why:")
            println(failure)
        }
      }

      Parse failed, and here's why:
      [1.13] failure: `@' expected but `+' found

      FranklinChen+spam@cmu.edu
                  ^
  • 21. Infamous email regex revisited: conversion!
    Mail::RFC822::Address Perl regex: actually manually back-converted from a parser!
      Original module used Perl parser library Parse::RecDescent
      Regex author turned the grammar rules into a modularized regex
      Reason: speed
    Perl modularized regex source code excerpt:
      # ...
      my $localpart = "$word(?:\\.$lwsp*$word)*";
      my $sub_domain = "(?:$atom|$domain_literal)";
      my $domain = "$sub_domain(?:\\.$lwsp*$sub_domain)*";
      my $addr_spec = "$localpart\@$lwsp*$domain";
      # ...
  • 22. Why validate email address anyway? Software development: always look at the bigger picture This guy advocates simply sending a user an activation email Engineering tradeoff: the email sending and receiving programs need to handle the email address anyway
  • 23. Email example wrapup
    It is possible to write a regex
    But first try to use someone else’s tested regex
    If you do write your own regex, modularize it
    For error reporting, use a parser: convert from modularized regex
    Do you even need to solve the problem?
  • 24. More complex parsing: toy JSON
      {
        "distance" : 5.6,
        "address" : {
          "street" : "0 Nowhere Road",
          "neighbors" : ["X", "Y"],
          "garage" : null
        }
      }
    Trick question: could you use a regex to parse this?
  • 25. Recursive regular expressions?
    2007: Perl 5.10 introduced “recursive regular expressions” that can parse context-free grammars that have rules embedding a reference to themselves.
      $regex = qr/
        \(                 # match an open paren (
          (                # followed by
            [^()]+         # 1+ non-paren characters
          |                # OR
            (??{$regex})   # the regex itself
          )*               # repeated 0+ times
        \)                 # followed by a close paren )
      /x;
    Not recommended!
    But, gateway to context-free grammars
  • 26. Toy JSON parser: Ruby
    To save time:
      No more Ruby code in this presentation
      Continuing with Scala since most concise
      Remember: concepts are language-independent!
    Full code for a toy JSON parser is available in the Parslet web site at https://github.com/kschiess/parslet/blob/master/example/json.rb
  • 27. Toy JSON parser: Scala
      object ToyJSONParsers extends JavaTokenParsers {
        def value: Parser[Any] = obj | arr |
          stringLiteral | floatingPointNumber |
          "null" | "true" | "false"

        def obj = "{" ~ repsep(member, ",") ~ "}"

        def arr = "[" ~ repsep(value, ",") ~ "]"

        def member = stringLiteral ~ ":" ~ value
      }
    Inherit from JavaTokenParsers: reuse stringLiteral and floatingPointNumber parsers
    ~ is overloaded operator to mean sequencing of parsers
    value parser returns Any (equivalent to Java Object) because we have not yet refined the parser with our own domain model
  • 28. Fancier toy JSON parser: domain modeling
    Want to actually query the data upon parsing, to use.
      {
        "distance" : 5.6,
        "address" : {
          "street" : "0 Nowhere Road",
          "neighbors" : ["X", "Y"],
          "garage" : null
        }
      }
    We may want to traverse the JSON to address and then to the second neighbor, to check whether it is "Y"
    Pseudocode after storing in domain model:
      data.address.neighbors[1] == "Y"
  • 29. Domain modeling as objects
      {
        "distance" : 5.6,
        "address" : {
          "street" : "0 Nowhere Road",
          "neighbors" : ["X", "Y"],
          "garage" : null
        }
      }
    A natural domain model:
      JObject(
        Map(distance -> JFloat(5.6),
            address -> JObject(
              Map(street -> JString(0 Nowhere Road),
                  neighbors -> JArray(List(JString(X), JString(Y))),
                  garage -> JNull))))
  • 30. Domain modeling: Scala
      // sealed means: can't extend outside the source file
      sealed abstract class ToyJSON

      case class JObject(map: Map[String, ToyJSON]) extends ToyJSON
      case class JArray(list: List[ToyJSON]) extends ToyJSON
      case class JString(string: String) extends ToyJSON
      case class JFloat(float: Float) extends ToyJSON
      case object JNull extends ToyJSON
      case class JBoolean(boolean: Boolean) extends ToyJSON
  • 31. Fancier toy JSON parser: Scala
      def value: Parser[ToyJSON] = obj | arr |
        stringLiteralStripped ^^ { s => JString(s) } |
        floatingPointNumber ^^ { s => JFloat(s.toFloat) } |
        "null" ^^^ JNull | "true" ^^^ JBoolean(true) |
        "false" ^^^ JBoolean(false)

      def stringLiteralStripped: Parser[String] =
        stringLiteral ^^ { s => s.substring(1, s.length-1) }

      def obj: Parser[JObject] = "{" ~>
        (repsep(member, ",") ^^ { aList => JObject(aList.toMap) }) <~ "}"

      def arr: Parser[JArray] = "[" ~>
        (repsep(value, ",") ^^ { aList => JArray(aList) }) <~ "]"

      def member: Parser[(String, ToyJSON)] =
        ((stringLiteralStripped <~ ":") ~ value) ^^ { case s ~ v => s -> v }
  • 32. Running the fancy toy JSON parser
      val j = """{
        "distance" : 5.6,
        "address" : {
          "street" : "0 Nowhere Road",
          "neighbors" : ["X", "Y"],
          "garage" : null
        }
      }"""

      parseAll(value, j) match {
        case Success(result, _) =>
          println(result)
        case failure: NoSuccess =>
          println("Parse failed, and here's why:")
          println(failure)
      }
    Output: what we proposed when data modeling!
  • 33. JSON: reminder not to reinvent Use a standard JSON parsing library! Scala has one. All languages have a standard JSON parsing library. Shop around: alternate libraries have different tradeoffs. Other standard formats: HTML, XML, CSV, etc.
  • 34. JSON wrapup
    Parsing can be simple
    Domain modeling is trickier but still fits on one page
    Use an existing JSON parsing library already!
  • 35. Final regex example
    These slides use the Python library Pygments for syntax highlighting of all code (highly recommended)
    Experienced bugs in the highlighting of regexes in Perl code
    Found out why, in source code for class PerlLexer
      #TODO: give this to a perl guy who knows how to parse perl
      tokens = {
          'balanced-regex': [
              (r'/(\\\\|\\[^\\]|[^\\/])*/[egimosx]*', String.Regex, '#p
              (r'!(\\\\|\\[^\\]|[^\\!])*![egimosx]*', String.Regex, '#p
              (r'\\(\\\\|[^\\])*\\[egimosx]*', String.Regex, '#pop'),
              (r'{(\\\\|\\[^\\]|[^\\}])*}[egimosx]*', String.Regex, '#p
              (r'<(\\\\|\\[^\\]|[^\\>])*>[egimosx]*', String.Regex, '#p
              (r'\[(\\\\|\\[^\\]|[^\\\]])*\][egimosx]*', String.Regex,
              (r'\((\\\\|\\[^\\]|[^\\)])*\)[egimosx]*', String.Regex,
              (r'@(\\\\|\\[^\\]|[^\\@])*@[egimosx]*', String.Regex, '#
              (r'%(\\\\|\\[^\\]|[^\\%])*%[egimosx]*', String.Regex, '#
              (r'\$(\\\\|\\[^\\]|[^\\$])*\$[egimosx]*', String.Regex,
  • 36. Conclusion
    Regular expressions
      No error reporting
      Flat data
    More general grammars
      Composable structure (using combinator parsers)
      Hierarchical, nested data
    Avoid reinventing
    Concepts are language-independent: find and use a parser library for your language
    All materials for this talk available at https://github.com/franklinchen/talk-on-overusing-regular-expressions.
    The hyperlinks on the slide PDFs are clickable.
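
The error reporting asked for on slide 13 can also be tried without installing a parser library. The following is a minimal sketch, not the talk's Parslet code: it hand-writes a small recursive-descent parser for the slide 17 grammar using StringScanner from Ruby's standard library. The names TinyEmailParser, ParseError and expect are hypothetical, and the message format only imitates the Scala output shown on slide 20.

```ruby
require 'strscan'

# Hypothetical illustration: a hand-rolled parser for the grammar
#   email  = local_part "@" domain
#   domain = (sub_domain ".")+ sub_domain
# It is not a parser combinator library; it only shows position-aware errors.
class TinyEmailParser
  class ParseError < StandardError; end

  LOCAL_PART = /[a-zA-Z]+/
  SUB_DOMAIN = /[^@.]+/

  # Returns a hash of the parts, or raises ParseError with a column and caret.
  def parse(address)
    @input = address
    @scanner = StringScanner.new(address)

    local = expect(LOCAL_PART, 'local part')
    expect(/@/, '"@"')

    labels = [expect(SUB_DOMAIN, 'domain label')]
    # At least one dot is required; keep going until the input is used up.
    loop do
      expect(/\./, '"."')
      labels << expect(SUB_DOMAIN, 'domain label')
      break if @scanner.eos?
    end

    { local_part: local, domain: labels.join('.'), top_level: labels.last }
  end

  private

  # Consume the pattern at the current position or report where parsing stopped.
  def expect(pattern, description)
    token = @scanner.scan(pattern)
    return token if token

    found = @scanner.eos? ? 'end of input' : @scanner.rest[0].inspect
    offset = @scanner.charpos
    raise ParseError,
          "[1.#{offset + 1}] failure: #{description} expected but #{found} found\n" \
          "#{@input}\n#{' ' * offset}^"
  end
end
```

Calling TinyEmailParser.new.parse('prez@whitehouse.gov') returns the three parts as a hash, while parsing 'FranklinChen+spam@cmu.edu' raises a ParseError whose message points at column 13. A real parser library adds composable rules, error trees and recovery hooks on top of this basic idea, which is the reason the talk recommends using one.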