Regular expressions, Alex Perry, Google, PyCon2014

Published in: Data & Analytics, Technology

  1. Memorable uses for a Regular Expression library
     Learning the syntax by examples
     Alex Perry, SRE, Google, Los Angeles
     April 2014
  2. Outline
     ● Simple Regular Expressions
     ● import re
       ○ http://docs.python.org/2/library/re.html
     ● Parsing
     ● import sre
     ● Formatting
     ● import sre_yield
     ● Arithmetic
     ● Performance uncertainty
     ● import re2
  3. Basic Regular Expressions
     Pattern     Matches
     abc         “abc”
     [abc]       “a” “b” “c”
     abc?        “ab” “abc”
     abc*        “ab” “abc” “abcc” ...
     abc+        “abc” “abcc” “abccc” ...
     abc{3,4}    “abccc” “abcccc”
     ab|c+       “ab” “c” “cc” ...
     ab.         “ab.” “ab1” … “ab\n” (with DOTALL)
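The table above can be spot-checked mechanically. This is a minimal sketch, assuming Python 3.4 or later for `re.fullmatch` (the slides themselves target Python 2); the `TABLE` dictionary and the `check` helper are names invented for this illustration, and the "should not match" strings are extra examples, not from the slide.

```python
import re

# Pattern -> (strings that should match in full, strings that should not).
TABLE = {
    "abc": (["abc"], ["ab", "abcc"]),
    "[abc]": (["a", "b", "c"], ["ab", "d"]),
    "abc?": (["ab", "abc"], ["a", "abcc"]),
    "abc*": (["ab", "abc", "abcc"], ["a", "abcb"]),
    "abc+": (["abc", "abcc", "abccc"], ["ab"]),
    "abc{3,4}": (["abccc", "abcccc"], ["abcc", "abccccc"]),
    "ab|c+": (["ab", "c", "cc"], ["abc", "a"]),
    "ab.": (["ab.", "ab1"], ["ab", "ab\n"]),
}


def check(table, flags=0):
    """Return the (pattern, string) pairs that behave unexpectedly."""
    failures = []
    for pattern, (good, bad) in table.items():
        compiled = re.compile(pattern, flags)
        failures += [(pattern, s) for s in good if not compiled.fullmatch(s)]
        failures += [(pattern, s) for s in bad if compiled.fullmatch(s)]
    return failures


# An empty list means every row behaved as listed.
print(check(TABLE))
# With DOTALL the dot also matches a newline.
print(bool(re.fullmatch("ab.", "ab\n", re.DOTALL)))
```

Using `fullmatch` here keeps the check about the whole string, so a pattern cannot pass by matching only a prefix.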
  4. The standard library - compiling
     >>> import re
     >>> o = re.compile("abc?")
     >>> [bool(o.match(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
     [False, True, True, True, False]
     >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
     [False, True, True, True, True]
  5. The standard library - endings
     >>> o = re.compile("^abc?$")
     >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
     [False, True, True, False, False]
     >>> s = re.compile("i*")         # yes, that s matches ""
     >>> s.split("oiooiioooiii")      # split ignores that silliness
     ['o', 'oo', 'ooo', '']
     >>> s.sub("x", "oiooiioooiii")   # but sub does not
     'xoxoxoxoxoxox'
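As an aside for readers on Python 3.4 or later (an assumption; the slides use Python 2), `re.fullmatch` gives the same answers as the explicit `^...$` anchors for these five inputs. The names `samples`, `anchored` and `unanchored` exist only for this sketch.

```python
import re

samples = ["a", "ab", "abc", "abcc", "aabcc"]

anchored = re.compile("^abc?$")
unanchored = re.compile("abc?")

# search() with explicit anchors, as on the slide.
with_anchors = [bool(anchored.search(s)) for s in samples]
# fullmatch() on the unanchored pattern requires the whole string to match.
with_fullmatch = [bool(unanchored.fullmatch(s)) for s in samples]

print(with_anchors)
print(with_anchors == with_fullmatch)
```

The two spellings are not interchangeable in general: `$` also matches just before a trailing newline, which `fullmatch` does not allow.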
  6. Parsing strings easily
     >>> import re
     >>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
     ...                   r"(?P<col>[$]?[0-9]+)")
     >>> m = cell.search("Spreadsheet cell aa$15")
     >>> m
     <_sre.SRE_Match object at 0x7f220a8e9360>
     >>> m.groupdict()
     {'col': '$15', 'row': 'aa'}
  7. Formatting after parsing using a regular expression
     >>> rc = m.groupdict()
     >>> rc
     {'col': '$15', 'row': 'aa'}
     >>> 'It was row %(row)s and column %(col)s' % rc
     'It was row aa and column $15'
     >>> txt = "from a1 2 b$22 as well as 4 $c4"
     >>> f = r"<%(col)s,%(row)s>"
     >>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
     '<1,a>;<$22,b>;<4,$c>'
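The parse-then-format idea from the two slides above can be wrapped in one function. This is a sketch assuming Python 3 and `str.format` in place of the `%` operator; the `format_cells` helper and its `template` parameter are hypothetical names, and the pattern is the one from the slides.

```python
import re

# Same pattern as on the slides: letters land in "row", digits in "col".
cell = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")


def format_cells(text, template="<{col},{row}>"):
    """Format every cell reference found in text using its named groups."""
    return ";".join(
        template.format(**m.groupdict()) for m in cell.finditer(text)
    )


print(format_cells("from a1 2 b$22 as well as 4 $c4"))
# prints <1,a>;<$22,b>;<4,$c>
```

Because the template refers to groups by name, the regular expression can gain or reorder groups without breaking the formatting code.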
  8. Secret (labs) RE engine - internals
     ● Originally separate from module “re”
       ○ From version 2.0 onwards they’re equivalent
       ○ Call it “sre” in any backward compatible code
     >>> import sre_parse
     >>> sre_parse.parse("ab|c")
     [('branch', (None, [
         [('literal', 97), ('literal', 98)],
         [('literal', 99)]
     ]))]
  9. Secret Regular Expression Yield
     ● New module called sre_yield
       ○ https://github.com/google/sre_yield
     ● def Values(regex, flags=0, charset=CHARSET)
       ○ Examines output from sre_parse.parse()
       ○ Returns a convenient sequence-like object
     ● Sequence has an efficient membership test
       ○ We were given a regex describing its content
     ● Some features (lookahead, etc.) still missing
       ○ Easy to add if sequence can contain None
  10. Iterating over all matching strings
      >>> import sre_yield
      >>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
      ['123', '1234', '14', '149']
      >>> len(sre_yield.Values('.'))
      256
      >>> sre_yield.Values('a*')[5:10]
      ['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']
  11. What do we do about infinite repetitions
      >>> len(sre_yield.Values('0*'))
      65536
      # Yes, really. sre library can only specify 65534 max
      >>> a77k = 'a' * 77000
      >>> len(re.compile(r'.{,65534}').match(a77k).group(0))
      65534
      >>> len(re.compile(r'.{,65535}').match(a77k).group(0))
      77000
      >>> len(re.compile(r'.{60000}.{,6000}|.{,60000}').match(a77k).group(0))
      66000
  12. How many matching strings
      >>> import sre_yield
      >>> bits = sre_yield.Values('[01]*')   # All binary nums
      >>> len(bits)                          # how many are there?
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      OverflowError: long int too large to convert to int
      >>> bits.__len__() == 2**65536 - 1     # check the answer
      True
      >>> len(str(bits.__len__()))           # Is the number that big?
      19729
      >>> "001001" in bits, "002001" in bits
      (True, False)
  13. Python does understand working with large numbers
      >>> import sre_yield
      >>> anything = sre_yield.Values('.*')
      >>> a = 1
      >>> for _ in xrange(65535): a = a * 256 + 1
      >>> anything.__len__() == a
      True
      >>> str_a = str(a)   # This does take a while
      >>> len(str_a)
      157825
      >>> str_a[:9], str_a[-9:]
      ('101818453', '945826561')
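The two counts above follow from a geometric series and can be reproduced without sre_yield. This is a sketch under stated assumptions: Python 3, repetition capped at 65535 as on the slides, and two helper names (`count_strings`, `digit_count`) invented for the example. It counts digits by integer comparison, not `str()`, because Python 3.11 and later limit the length of int-to-str conversions by default.

```python
import math


def count_strings(alphabet_size, max_len):
    """Number of strings of length 0..max_len over the alphabet.

    1 + k + k**2 + ... + k**max_len == (k**(max_len + 1) - 1) // (k - 1)
    """
    return (alphabet_size ** (max_len + 1) - 1) // (alphabet_size - 1)


def digit_count(n):
    """Decimal digits of a positive int, without building the string."""
    estimate = int(math.log10(n)) + 1
    # Correct any floating-point rounding with exact integer comparisons.
    while 10 ** (estimate - 1) > n:
        estimate -= 1
    while 10 ** estimate <= n:
        estimate += 1
    return estimate


bits = count_strings(2, 65535)        # strings matching [01]*
anything = count_strings(256, 65535)  # strings matching .* over 256 characters

print(bits == 2 ** 65536 - 1)
print(digit_count(bits))      # 19729, as on the slide
print(digit_count(anything))  # 157825, as on the slide
```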
  14. But why bother yielding from a regex
      ● It can be more compact than a literal list, for example:
        ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
      ● That doesn’t get much shorter when rewritten:
        (ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
      ● On the other hand, others are more convenient:
        www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
      ● Some things are better machine generated:
        192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
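The claim that the hand-compacted pattern covers the same names as the literal list can be checked with the standard library alone. This is a sketch assuming Python 3.4 or later for `fullmatch`; `REGIONS`, `literal`, `compact` and `accepts_same` are illustrative names, and the literal alternation is generated by machine in the spirit of the last bullet.

```python
import re

REGIONS = [
    "ap-northeast-1", "ap-southeast-1", "ap-southeast-2", "eu-west-1",
    "sa-east-1", "us-east-1", "us-west-1", "us-west-2",
]

# Literal alternation built by machine; re.escape guards special characters.
literal = re.compile("|".join(re.escape(name) for name in REGIONS))

# Hand-compacted form from the slide.
compact = re.compile(
    r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2"
)


def accepts_same(names):
    """True if both patterns accept every name in full."""
    return all(
        bool(literal.fullmatch(n)) and bool(compact.fullmatch(n))
        for n in names
    )


print(accepts_same(REGIONS))
# A name outside the list is rejected by the compact form.
print(bool(compact.fullmatch("eu-west-2")))
```

This only shows that every listed name is accepted; proving the two patterns accept nothing else requires enumerating the compact form, which is what sre_yield is for.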
  15. How fast is the “re” library
      ● Implementation uses backtracking, as PCRE does
        ○ So it is fast providing it never guesses wrong
        ○ Trivial to write an expression that is … slow
      import re
      import timeit

      def test(n):
          t = "a" * n
          r = "a?" * n + t
          return bool(re.match(r, t))

      timeit.timeit(stmt="test(6)", setup="from __main__ import test")
  16. The RE2 library
      ● https://code.google.com/p/re2
      ● https://github.com/axiak/pyre2
      ● RE2 tries all possible code paths in parallel
        ○ never backtracks, so omits features that need it
      ● drops support for backreferences
        ○ and generalized zero-width assertions
      ● Predictable worst case performance for any input
        ○ Safe to accept untrusted regular expressions
      test(10) takes 4 milliseconds instead of one minute
  17. Summary
      ● Regular expressions are built into Python
        ○ re_obj = re.compile(pattern)
        ○ print re_obj.pattern
      ● They can parse strings into a dictionary
        ○ Or iteratively many dictionaries
      ● They can compactly represent large lists
        ○ Without expanding the whole iterator out
      ● For reliable performance, use RE2
        ○ Especially if users are supplying patterns
  18. Questions?
      ● mail -s us.pycon.org/2014
        ○ Alex.Perry@Google.com
      ● Nothing to do with me, but pretty good:
        ○ http://qntm.org/files/re/re.html
