Stop overusing regular expressions!


Published on

Regular expressions are very commonly used to process and validate text data. Unfortunately, they suffer from various limitations. I will discuss the limitations and illustrate how using grammars instead can be an improvement. I will make the examples oncrete using parsing libraries in a couple of representative languages, although the ideas are language-independent.

I will emphasize ease of use, since one reason for the overuse of regular expressions is that they are so easy to pull out of one's toolbox.

Published in: Software
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Stop overusing regular expressions!

  1. 1. Stop overusing regular expressions!Franklin Chen Tech Fest 2013June 1, 2013
  2. 2. Famous quoteIn 1997, Jamie Zawinski famously wrote:Some people, when confronted with a problem, think,“I know, I’ll use regular expressions.”Now they have two problems.
  3. 3. Purpose of this talkAssumption: you already have experience using regexesGoals:Change how you think about and use regexesIntroduce you to advantages of using parser combinatorsShow a smooth way to transition from regexes to parsersDiscuss practical tradeoffsShow only tested, running code: not discuss computer science theoryWill not discuss parser generators such as yacc and ANTLR
  4. 4. Example code in which languages?Considerations:This is a polyglot conferenceTime limitationsDecision: focus primarily on two representative languages.Ruby: dynamic typingPerl, Python, JavaScript, Clojure, etc.Scala: static typing:Java, C++, C#, F#, ML, Haskell, etc.GitHub repository also has similar-looking code forPerlJavaScript
  5. 5. An infamous regex for emailThe reason for my talk!A big Perl regex for email address based on RFC 822 grammar:/(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:".[] 000-031]+)+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rrn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[]?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@31]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^(?:rn)?[ t])*))*|(?:[^()<>@,;:".[] 000-031]+(?:(?:|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:".[] 0rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*)*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-03t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*]
  6. 6. Personal story: my email address fiascoTo track and prevent spam: FranklinChen+spam@cmu.eduSome sites wrongly claimed invalid (because of +)Other sites did allow registrationI caught spamUnsubscribing failed!Some wrong claimed invalid (!?!)Some silently failed to unsubscribeHad to set up spam filter
  7. 7. Problem: different regexes for email?Examples: which one is better?/A[^@]+@[^@]+z/vs./A[a-zA-Z]+@([^@.]+.)+[^@.]+z/
  8. 8. Readability: first exampleUse x for readability!/A # match string begin[^@]+ # local part: 1+ chars of not @@ # @[^@]+ # domain: 1+ chars of not @z/x # match string endmatchesFranklinChen+spam@cmu.eduAdvice: please write regexes in this formatted style!
  9. 9. Readability: second example/A # match string begin[a-zA-Z]+ # local part: 1+ of Roman alphabetic@ # @( # 1+ of this group[^@.]+ # 1+ of not (@ or dot). # dot)+[^@.]+ # 1+ of not (@ or dot)z/x # match string enddoes not
  10. 10. Don’t Do It Yourself: find librariesInfamous regex revisited: was automatically generated into the Perlmodule Mail::RFC822::Address based on the RFC 822 spec.If you use Perl and need to validate email addresses:# Install the library$ cpanm Mail::RFC822::AddressOther Perl regexes: Regexp::Common
  11. 11. Regex match success vs. failureParentheses for captured grouping:if ($address =~ /A # match string begin([a-zA-Z]+) # $1: local part: 1+ Roman alphabetic@ # @( # $2: entire domain(?: # 1+ of this non-capturing group[^@.]+ # 1+ of not (@ or dot). # dot)+([^@.]+) # $3: 1+ of not (@ or dot))z/x) { # match string endprint "local part: $1; domain: $2; top level: $3n";}else {print "Regex match failed!n";}
  12. 12. Regexes do not report errors usefullySuccess:$ perl/ ’’local part: prez; domain:; top level: govBut failure:$ perl/ ’’Regex match failed!We have a dilemma:Bug in the regex?The data really doesn’t match?
  13. 13. Better error reporting: how?Would prefer something at least minimally like:[1.13] failure: ‘@’ expected but ‘+’^Features of parser libraries:Extraction of line, column information of errorAutomatically generated error explanationHooks for customizing error explanationHooks for error recovery
  14. 14. Moving from regexes to grammar parsersOnly discuss parser combinators, not generatorsUse the second email regex as a starting pointUse modularization
  15. 15. Modularized regex: RubyRuby allows interpolation of regexes into other regexes:class EmailValidator::PrettyRegexMatcherLOCAL_PART = /[a-zA-Z]+/xDOMAIN_CHAR = /[^@.]/xSUB_DOMAIN = /#{DOMAIN_CHAR}+/xAT = /@/xDOT = /./xDOMAIN = /( #{SUB_DOMAIN} #{DOT} )+ #{SUB_DOMAIN}/xEMAIL = /A #{LOCAL_PART} #{AT} #{DOMAIN} z/xdef match(s)s =~ EMAILendend
  16. 16. Modularized regex: Scalaobject PrettyRegexMatcher {val user = """[a-zA-Z]+"""val domainChar = """[^@.]"""val domainSegment = raw"""(?x) $domainChar+"""val at = """@"""val dot = """."""val domain = raw"""(?x)($domainSegment $dot)+ $domainSegment"""val email = raw"""(?x) A $user $at $domain z"""val emailRegex = email.rdef matches(s: String): Boolean =(emailRegex findFirstMatchIn s).nonEmptyendTriple quotes are for raw strings (no backslash interpretation)raw allows raw variable interpolation.r is a method converting String to Regex
  17. 17. Email parser: RubyRuby does not have a standard parser combinator library. One thatis popular is Parslet.require ’parslet’class EmailValidator::Parser < Parslet::Parserrule(:local_part) { match[’a-zA-Z’].repeat(1) }rule(:domain_char) { match[’^@.’] }rule(:at) { str(’@’) }rule(:dot) { str(’.’) }rule(:sub_domain) { domain_char.repeat(1) }rule(:domain) { (sub_domain >> dot).repeat(1) >>sub_domain }rule(:email) { local_part >> at >> domain }end
  18. 18. Email parser: ScalaScala comes with standard parser combinator library.import scala.util.parsing.combinator.RegexParsersobject EmailParsers extends RegexParsers {override def skipWhitespace = falsedef localPart: Parser[String] = """[a-zA-Z]+""".rdef domainChar = """[^@.]""".rdef at = "@"def dot = "."def subDomain = rep1(domainChar)def domain = rep1(subDomain ~ dot) ~ subDomaindef email = localPart ~ at ~ domain}Inheriting from RegexParsers allows the implicit conversions fromregexes into parsers.
  19. 19. Running and reporting errors: Rubyparser = "Successfully parsed #{address}"rescue Parslet::ParseFailed => errorputs "Parse failed, and here’s why:"puts error.cause.ascii_treeend$ validate_email ’’Parse failed, and here’s why:Failed to match sequence (LOCAL_PART AT DOMAIN) atline 1 char 13.‘- Expected "@", but got "+" at line 1 char 13.
  20. 20. Running and reporting errors: Scaladef runParser(address: String) {import EmailParsers._// parseAll is method inherited from RegexParsersparseAll(email, address) match {case Success(_, _) =>println(s"Successfully parsed $address")case failure: NoSuccess =>println("Parse failed, and here’s why:")println(failure)}}Parse failed, and here’s why:[1.13] failure: ‘@’ expected but ‘+’^
  21. 21. Infamous email regex revisited: conversion!Mail::RFC822::Address Perl regex: actually manuallyback-converted from a parser!Original module used Perl parser library Parse::RecDescentRegex author turned the grammar rules into a modularizedregexReason: speedPerl modularized regex source code excerpt:# $localpart = "$word(?:.$lwsp*$word)*";my $sub_domain = "(?:$atom|$domain_literal)";my $domain = "$sub_domain(?:.$lwsp*$sub_domain)*";my $addr_spec = "$localpart@$lwsp*$domain";# ...
  22. 22. Why validate email address anyway?Software development: always look at the bigger pictureThis guy advocates simply sending a user an activation emailEngineering tradeoff: the email sending and receivingprograms need to handle the email address anyway
  23. 23. Email example wrapupIt is possible to write a regexBut first try to use someone else’s tested regexIf you do write your own regex, modularize itFor error reporting, use a parser: convert from modularizedregexDo you even need to solve the problem?
  24. 24. More complex parsing: toy JSON{"distance" : 5.6,"address" : {"street" : "0 Nowhere Road","neighbors" : ["X", "Y"],"garage" : null}}Trick question: could you use a regex to parse this?
  25. 25. Recursive regular expressions?2007: Perl 5.10 introduced “recursive regular expressions” that canparse context-free grammars that have rules embedding a referenceto themselves.$regex = qr/( # match an open paren (( # followed by[^()]+ # 1+ non-paren characters| # OR(??{$regex}) # the regex itself)* # repeated 0+ times) # followed by a close paren )/x;Not recommended!But, gateway to context-free grammars
  26. 26. Toy JSON parser: RubyTo save time:No more Ruby code in this presentationContinuing with Scala since most conciseRemember: concepts are language-independent!Full code for a toy JSON parser is available in the Parslet web siteat
  27. 27. Toy JSON parser: Scalaobject ToyJSONParsers extends JavaTokenParsers {def value: Parser[Any] = obj | arr |stringLiteral | floatingPointNumber |"null" | "true" | "false"def obj = "{" ~ repsep(member, ",") ~ "}"def arr = "[" ~ repsep(value, ",") ~ "]"def member = stringLiteral ~ ":" ~ value}Inherit from JavaTokenParsers: reuse stringLiteral andfloatingPointNumber parsers~ is overloaded operator to mean sequencing of parsersvalue parser returns Any (equivalent to Java Object)because we have not yet refined the parser with our owndomain model
  28. 28. Fancier toy JSON parser: domain modelingWant to actually query the data upon parsing, to use.{"distance" : 5.6,"address" : {"street" : "0 Nowhere Road","neighbors" : ["X", "Y"],"garage" : null}}We may want to traverse the JSON to address and then tothe second neighbor, to check whether it is "Y"Pseudocode after storing in domain model:data.address.neighbors[1] == "Y"
  29. 29. Domain modeling as objects{"distance" : 5.6,"address" : {"street" : "0 Nowhere Road","neighbors" : ["X", "Y"],"garage" : null}}A natural domain model:JObject(Map(distance -> JFloat(5.6),address -> JObject(Map(street -> JString(0 Nowhere Road),neighbors ->JArray(List(JString(X), JString(Y))),garage -> JNull))))
  30. 30. Domain modeling: Scala// sealed means: can’t extend outside the source filesealed abstract class ToyJSONcase class JObject(map: Map[String, ToyJSON]) extendsToyJSONcase class JArray(list: List[ToyJSON]) extends ToyJSONcase class JString(string: String) extends ToyJSONcase class JFloat(float: Float) extends ToyJSONcase object JNull extends ToyJSONcase class JBoolean(boolean: Boolean) extends ToyJSON
  31. 31. Fancier toy JSON parser: Scaladef value: Parser[ToyJSON] = obj | arr |stringLiteralStripped ^^ { s => JString(s) } |floatingPointNumber ^^ { s => JFloat(s.toFloat) } |"null" ^^^ JNull | "true" ^^^ JBoolean(true) |"false" ^^^ JBoolean(false)def stringLiteralStripped: Parser[String] =stringLiteral ^^ { s => s.substring(1, s.length-1) }def obj: Parser[JObject] = "{" ~> (repsep(member, ",") ^^{ aList => JObject(aList.toMap) }) <~ "}"def arr: Parser[JArray] = "[" ~> (repsep(value, ",") ^^{ aList => JArray(aList) }) <~ "]"def member: Parser[(String, ToyJSON)] =((stringLiteralStripped <~ ":") ~ value) ^^{ case s ~ v => s -> v }
  32. 32. Running the fancy toy JSON parserval j = """{"distance" : 5.6,"address" : {"street" : "0 Nowhere Road","neighbors" : ["X", "Y"],"garage" : null}}"""parseAll(value, j) match {case Success(result, _) =>println(result)case failure: NoSuccess =>println("Parse failed, and here’s why:")println(failure)}Output: what we proposed when data modeling!
  33. 33. JSON: reminder not to reinventUse a standard JSON parsing library!Scala has one.All languages have a standard JSON parsing library.Shop around: alternate libraries have different tradeoffs.Other standard formats: HTML, XML, CSV, etc.
  34. 34. JSON wrapupParsing can be simpleDomain modeling is trickier but still fit on one pageUse an existing JSON parsing library already!
  35. 35. Final regex exampleThese slides use the Python library Pygments for syntaxhighlighting of all code (highly recommended)Experienced bugs in the highlighting of regexes in Perl codeFound out why, in source code for class PerlLexer#TODO: give this to a perl guy who knows how to parse perltokens = {’balanced-regex’: [(r’/(|[^]|[^/])*/[egimosx]*’, String.Regex, ’#p(r’!(|[^]|[^!])*![egimosx]*’, String.Regex, ’#p(r’(|[^])*[egimosx]*’, String.Regex, ’#pop’),(r’{(|[^]|[^}])*}[egimosx]*’, String.Regex, ’#p(r’<(|[^]|[^>])*>[egimosx]*’, String.Regex, ’#p(r’[(|[^]|[^]])*][egimosx]*’, String.Regex,(r’((|[^]|[^)])*)[egimosx]*’, String.Regex,(r’@(|[^]|[^@])*@[egimosx]*’, String.Regex, ’#(r’%(|[^]|[^%])*%[egimosx]*’, String.Regex, ’#(r’$(|[^]|[^$])*$[egimosx]*’, String.Regex,
  36. 36. ConclusionRegular expressionsNo error reportingFlat dataMore general grammarsComposable structure (using combinator parsers)Hierarchical, nested dataAvoid reinventingConcepts are language-independent: find and use a parserlibrary for your languageAll materials for this talk available at hyperlinks on the slide PDFs are clickable.
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.