Regular expressions, Alex Perry, Google, PyCon2014
Published in: Data & Analytics, Technology
Transcript

  • 1. Memorable uses for a Regular Expression library: learning the syntax by examples. Alex Perry, SRE, Google, Los Angeles. April 2014
  • 2. Outline
      ● Simple Regular Expressions
      ● import re
        ○ http://docs.python.org/2/library/re.html
      ● Parsing
      ● import sre
      ● Formatting
      ● import sre_yield
      ● Arithmetic
      ● Performance uncertainty
      ● import re2
  • 3. Basic Regular Expressions
      abc        “abc”
      [abc]      “a” “b” “c”
      abc?       “ab” “abc”
      abc*       “ab” “abc” “abcc” ...
      abc+       “abc” “abcc” “abccc” ...
      abc{3,4}   “abccc” “abcccc”
      ab|c+      “ab” “c” “cc” ...
      ab.        “ab.” “ab1” … “ab\n” (the newline only with DOTALL)
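The syntax table on this slide can be checked mechanically. The sketch below is written for Python 3 (the slides themselves use Python 2) and runs each pattern through `re.fullmatch`. The `examples` dictionary and the `matches` helper are illustrative names, not part of the talk.

```python
import re

# Pattern -> (strings that should match in full, strings that should not).
examples = {
    "abc":      (["abc"], ["ab", "abcc"]),
    "[abc]":    (["a", "b", "c"], ["ab", "d"]),
    "abc?":     (["ab", "abc"], ["abcc"]),
    "abc*":     (["ab", "abc", "abcc"], ["a"]),
    "abc+":     (["abc", "abcc"], ["ab"]),
    "abc{3,4}": (["abccc", "abcccc"], ["abcc", "abccccc"]),
    "ab|c+":    (["ab", "c", "cc"], ["abc", "a"]),
    "ab.":      (["ab.", "ab1"], ["ab", "ab\n"]),
}

def matches(pattern, text, flags=0):
    """True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text, flags) is not None

for pattern, (good, bad) in examples.items():
    assert all(matches(pattern, s) for s in good), pattern
    assert not any(matches(pattern, s) for s in bad), pattern

# "." only matches a newline when the DOTALL flag is set.
assert matches("ab.", "ab\n", re.DOTALL)
```

Note that the alternation in `ab|c+` binds more loosely than the repetition, so the pattern reads as "`ab`, or one or more `c`".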
  • 4. The standard library - compiling
      >>> import re
      >>> o = re.compile("abc?")
      >>> [bool(o.match(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
      [False, True, True, True, False]
      >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
      [False, True, True, True, True]
  • 5. The standard library - endings
      >>> o = re.compile("^abc?$")
      >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
      [False, True, True, False, False]
      >>> s = re.compile("i*")  # yes, that s matches ""
      >>> s.split("oiooiioooiii")  # split ignores that silliness
      ['o', 'oo', 'ooo', '']
      >>> s.sub("x", "oiooiioooiii")  # but sub does not
      'xoxoxoxoxoxox'
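The difference between `match`, `search` and explicit anchors on the two slides above can be put side by side. This is a minimal sketch for Python 3. It also uses `fullmatch`, which is not in the slides and only exists in Python 3.4 and later. The variable names are illustrative.

```python
import re

words = ["a", "ab", "abc", "abcc", "aabcc"]
pattern = re.compile("abc?")

# match() is anchored at the start only; search() looks anywhere in the string.
at_start = [bool(pattern.match(w)) for w in words]
anywhere = [bool(pattern.search(w)) for w in words]

# Anchoring both ends with ^...$ and calling fullmatch() agree on these inputs.
anchored = [bool(re.search("^abc?$", w)) for w in words]
whole = [bool(pattern.fullmatch(w)) for w in words]

print(at_start)  # [False, True, True, True, False]
print(anywhere)  # [False, True, True, True, True]
print(anchored)  # [False, True, True, False, False]
print(whole)     # [False, True, True, False, False]
```

The two anchored forms are not identical in general: `$` also matches just before a trailing newline, so `re.search("^abc?$", "ab\n")` succeeds while `pattern.fullmatch("ab\n")` does not.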
  • 6. Parsing strings easily
      >>> import re
      >>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
      ...                   r"(?P<col>[$]?[0-9]+)")
      >>> m = cell.search("Spreadsheet cell aa$15")
      >>> m
      <_sre.SRE_Match object at 0x7f220a8e9360>
      >>> m.groupdict()
      {'col': '$15', 'row': 'aa'}
  • 7. Formatting after parsing using a regular expression
      >>> rc = m.groupdict()
      >>> rc
      {'col': '$15', 'row': 'aa'}
      >>> 'It was row %(row)s and column %(col)s' % rc
      'It was row aa and column $15'
      >>> txt = "from a1 2 b$22 as well as 4 $c4"
      >>> f = r"<%(col)s,%(row)s>"
      >>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
      '<1,a>;<$22,b>;<4,$c>'
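The parse-then-format round trip from the two slides above also works with `str.format`. The sketch below, for Python 3, reuses the `cell` pattern and the `txt` string from the slides. `cells` and `formatted` are illustrative names.

```python
import re

# Same pattern as the slides: letters go to "row", digits to "col".
cell = re.compile(r"(?P<row>[$]?[a-z]+)"
                  r"(?P<col>[$]?[0-9]+)")
txt = "from a1 2 b$22 as well as 4 $c4"

# One dictionary per match, keyed by group name.
cells = [m.groupdict() for m in cell.finditer(txt)]

# Named groups feed directly into a format template.
formatted = ";".join("<{col},{row}>".format(**d) for d in cells)
print(formatted)  # <1,a>;<$22,b>;<4,$c>
```

The bare `2` and `4` in the text are skipped because the pattern requires at least one letter before the digits.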
  • 8. Secret (labs) RE engine - internals
      ● Originally separate from module “re”
        ○ From version 2.0 onwards they’re equivalent
        ○ Call it “sre” in any backward compatible code
      >>> import sre_parse
      >>> sre_parse.parse("ab|c")
      [('branch', (None, [
          [('literal', 97), ('literal', 98)],
          [('literal', 99)]
      ]))]
  • 9. Secret Regular Expression Yield
      ● New module called sre_yield
        ○ https://github.com/google/sre_yield
      ● def Values(regex, flags=0, charset=CHARSET)
        ○ Examines output from sre_parse.parse()
        ○ Returns a convenient sequence like object
      ● Sequence has an efficient membership test
        ○ We were given a regex describing its content
      ● Some features (lookahead, etc) still missing
        ○ Easy to add if sequence can contain None
  • 10. Iterating over all matching strings
      >>> import sre_yield
      >>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
      ['123', '1234', '14', '149']
      >>> len(sre_yield.Values('.'))
      256
      >>> sre_yield.Values('a*')[5:10]
      ['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']
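To see what "all matching strings" means without installing sre_yield, the results on this slide can be reproduced by brute force using only the standard library. This is explicitly not how sre_yield works (the slides say it walks the output of `sre_parse.parse()`). It is a stand-in that tries every candidate string over a small alphabet. The function name `brute_force_values` and its parameters are hypothetical, and the code targets Python 3.

```python
import itertools
import re

def brute_force_values(pattern, alphabet, max_len):
    """Return, sorted, every string over `alphabet` of length <= max_len
    that the pattern matches in full. Exponential cost; illustration only."""
    compiled = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return sorted(found)

values = brute_force_values(r"1(?P<x>234?|49?)", "12349", 4)
print(values)  # ['123', '1234', '14', '149']

# Same slice as the slide, with the repetition capped at 9 instead of 65535.
print(brute_force_values("a*", "a", 9)[5:10])
```

The brute-force version visits every candidate, which is why a generator that reads the parsed pattern directly is the practical choice for anything beyond toy alphabets.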
  • 11. What do we do about infinite repetitions
      >>> len(sre_yield.Values('0*'))
      65536
      # Yes, really. sre library can only specify 65534 max
      >>> a77k = 'a' * 77000
      >>> len(re.compile(r'.{,65534}').match(a77k).group(0))
      65534
      >>> len(re.compile(r'.{,65535}').match(a77k).group(0))
      77000
      >>> len(re.compile(r'.{60000}.{,6000}|.{,60000}')
      ...     .match(a77k).group(0))
      66000
  • 12. How many matching strings
      >>> import sre_yield
      >>> bits = sre_yield.Values('[01]*')  # All binary nums
      >>> len(bits)  # how many are there?
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      OverflowError: long int too large to convert to int
      >>> bits.__len__() == 2**65536 - 1  # check the answer
      True
      >>> len(str(bits.__len__()))  # Is the number that big?
      19729
      >>> "001001" in bits, "002001" in bits
      (True, False)
  • 13. Python does understand working with large numbers
      >>> import sre_yield
      >>> anything = sre_yield.Values('.*')
      >>> a = 1
      >>> for _ in xrange(65535): a = a * 256 + 1
      ...
      >>> anything.__len__() == a
      True
      >>> str_a = str(a)  # This does take a while
      >>> len(str_a)
      157825
      >>> str_a[:9], str_a[-9:]
      ('101818453', '945826561')
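The two counts on the slides above are the same geometric series with different alphabet sizes: strings of length 0 through 65535 over 2 symbols for `[01]*`, and over 256 symbols for `.*`. The sketch below checks the arithmetic in Python 3 without sre_yield. `count_star`, `decimal_digits` and `MAX_REPEAT` are illustrative names, and the digit count uses `math.log10` rather than `str()` so that it does not depend on how the interpreter converts very large integers to text.

```python
import math

MAX_REPEAT = 65535  # longest repetition that '*' expands to, per the slides

def count_star(alphabet_size, max_repeat=MAX_REPEAT):
    """Count strings of length 0..max_repeat over an alphabet of >= 2 symbols.

    Geometric series: k**0 + k**1 + ... + k**n = (k**(n+1) - 1) / (k - 1).
    """
    return (alphabet_size ** (max_repeat + 1) - 1) // (alphabet_size - 1)

def decimal_digits(n):
    """Number of decimal digits of a positive integer."""
    return math.floor(math.log10(n)) + 1

bits = count_star(2)        # matches for '[01]*'
anything = count_star(256)  # matches for '.*'

print(bits == 2**65536 - 1)      # True
print(decimal_digits(bits))      # 19729
print(decimal_digits(anything))  # 157825
```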
  • 14. But why bother yielding from a regex
      ● It can be more compact than a literal list, for example:
          ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
      ● That doesn’t get much shorter when rewritten:
          (ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
      ● On the other hand, others are more convenient:
          www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
      ● Some things would be better machine generated:
          192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
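Two of the claims on this slide are easy to check with the standard library: that the hand-compressed region pattern accepts exactly the eight listed names, and that the address pattern is the kind of thing a program should assemble. The sketch below is for Python 3. It assumes the backslashes in the address pattern are `\.` and `\d` as in ordinary regex syntax, and the names `regions`, `compact`, `octet` and `private` are illustrative.

```python
import re

# The explicit list from the slide.
regions = ("ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|"
           "sa-east-1|us-east-1|us-west-1|us-west-2").split("|")

# The rewritten form from the slide.
compact = re.compile(
    r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2")

# Every listed region matches; a few near misses do not.
assert all(compact.fullmatch(r) for r in regions)
assert not any(compact.fullmatch(r)
               for r in ["eu-west-2", "ap-northeast-2", "us-east-2"])

# "Machine generated": build the address pattern from one octet fragment.
octet = r"(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"
private = re.compile(r"192\.168(?:\." + octet + r"){2}")

print(bool(private.fullmatch("192.168.0.1")))    # True
print(bool(private.fullmatch("192.168.256.1")))  # False
```

Keeping the octet as a named fragment makes the final pattern reviewable even though the expanded string is not.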
  • 15. How fast is the “re” library
      ● Implementation uses backtracking, in the style of PCRE
        ○ So it is fast provided it never guesses wrong
        ○ Trivial to write an expression that is … slow
      import re
      import timeit
      def test(n):
          t = "a" * n
          r = "a?" * n + t
          return bool(re.match(r, t))
      timeit.timeit(stmt="test(6)", setup="from __main__ import test")
  • 16. The RE2 library
      ● https://code.google.com/p/re2
      ● https://github.com/axiak/pyre2
      ● RE2 tries all possible code paths in parallel
        ○ never backtracks, so omits features that need it
      ● drops support for backreferences
        ○ and generalized zero-width assertions
      ● Predictable worst case performance for any input
        ○ Safe to accept untrusted regular expressions
      test(10) takes 4 milliseconds instead of one minute
  • 17. Summary
      ● Regular expressions are built into Python
        ○ re_obj = re.compile(pattern)
        ○ print re_obj.pattern
      ● They can parse strings into a dictionary
        ○ Or iteratively many dictionaries
      ● They can compactly represent large lists
        ○ Without expanding the whole iterator out
      ● For reliable performance, use RE2
        ○ Especially if users are supplying patterns
  • 18. Questions?
      ● mail -s us.pycon.org/2014
        ○ Alex.Perry@Google.com
      ● Nothing to do with me, but pretty good:
        ○ http://qntm.org/files/re/re.html
