Regular expressions, Alex Perry, Google, PyCon2014

Published in: Data & Analytics, Technology
Transcript

• 1. Memorable uses for a Regular Expression library: learning the syntax by examples. Alex Perry, SRE, Google, Los Angeles. April 2014.
• 2. Outline
    ● Simple Regular Expressions
    ● import re
        ○ http://docs.python.org/2/library/re.html
    ● Parsing
    ● import sre
    ● Formatting
    ● import sre_yield
    ● Arithmetic
    ● Performance uncertainty
    ● import re2
• 3. Basic Regular Expressions
    abc        "abc"
    [abc]      "a" "b" "c"
    abc?       "ab" "abc"
    abc*       "ab" "abc" "abcc" ...
    abc+       "abc" "abcc" "abccc" ...
    abc{3,4}   "abccc" "abcccc"
    ab|c\+     "ab" "c+"
    ab.        "ab." "ab1" … "ab\n" (DOTALL)
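The table on this slide can be checked mechanically. The following is a minimal sketch, assuming Python 3 (the slides themselves use Python 2). It uses `re.fullmatch` so that each pattern has to consume the whole candidate string. The `EXAMPLES` table and the `check` helper are illustrative names, and the non-matching strings are added here for contrast.

```python
import re

# Pattern -> (strings that should match, strings that should not).
EXAMPLES = {
    r"abc": (["abc"], ["ab", "abcc"]),
    r"[abc]": (["a", "b", "c"], ["ab", "d"]),
    r"abc?": (["ab", "abc"], ["a", "abcc"]),
    r"abc*": (["ab", "abc", "abcc"], ["a", "abab"]),
    r"abc+": (["abc", "abcc", "abccc"], ["ab"]),
    r"abc{3,4}": (["abccc", "abcccc"], ["abcc", "abccccc"]),
}


def check(pattern, should_match, should_fail, flags=0):
    """True if every expected match succeeds and every expected miss fails."""
    compiled = re.compile(pattern, flags)
    return (all(compiled.fullmatch(s) for s in should_match)
            and not any(compiled.fullmatch(s) for s in should_fail))


results = {p: check(p, yes, no) for p, (yes, no) in EXAMPLES.items()}

# "." does not match a newline unless re.DOTALL is set.
dot_default = check(r"ab.", ["ab.", "ab1"], ["ab\n"])
dot_all = check(r"ab.", ["ab.", "ab1", "ab\n"], [], flags=re.DOTALL)
```

Each entry in `results` should be `True` if the table is right; a `False` would point at a row worth re-reading.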
• 4. The standard library - compiling
    >>> import re
    >>> o = re.compile("abc?")
    >>> [bool(o.match(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, True, False]
    >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, True, True]
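The difference between the two result lists comes down to where the match is allowed to start. A small sketch, assuming Python 3, that repeats the slide's experiment and adds `fullmatch` for comparison; the variable names are illustrative.

```python
import re

pattern = re.compile(r"abc?")
samples = ["a", "ab", "abc", "abcc", "aabcc"]

# match() must start at index 0, search() may start anywhere,
# fullmatch() must cover the entire string.
anchored_at_start = [bool(pattern.match(s)) for s in samples]
found_anywhere = [bool(pattern.search(s)) for s in samples]
whole_string = [bool(pattern.fullmatch(s)) for s in samples]

# Where search() found its match in the last sample, and what it matched.
hit = pattern.search("aabcc")
span_and_text = (hit.span(), hit.group(0))
```

For "aabcc", `search` skips the first character and reports the match at offsets 1 to 4, which is why it succeeds where `match` does not.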
• 5. The standard library - endings
    >>> o = re.compile("^abc?$")
    >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, False, False]
    >>> s = re.compile("i*")  # yes, that s matches ""
    >>> s.split("oiooiioooiii")  # split ignores that silliness
    ['o', 'oo', 'ooo', '']
    >>> s.sub("x", "oiooiioooiii")  # but sub does not
    'xoxoxoxoxoxox'
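One detail of anchoring with `$` is easy to miss: it also matches just before a newline that ends the string. A minimal sketch, assuming Python 3, comparing the slide's anchored pattern with `fullmatch` on the same samples plus one string with a trailing newline (that extra string is an illustration, not from the talk).

```python
import re

dollar = re.compile(r"^abc?$")
strict = re.compile(r"abc?")

samples = ["a", "ab", "abc", "abcc", "aabcc"]

# On the slide's samples, "^...$" with search() and fullmatch() agree.
with_anchors = [bool(dollar.search(s)) for s in samples]
with_fullmatch = [bool(strict.fullmatch(s)) for s in samples]

# They disagree once the input ends in a newline.
with_newline = "abc\n"
dollar_accepts = bool(dollar.search(with_newline))
strict_accepts = bool(strict.fullmatch(with_newline))
```

If a trailing newline must be rejected, `fullmatch` (or `\Z` in place of `$`) is the stricter choice.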
• 6. Parsing strings easily
    >>> import re
    >>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
    ...                   r"(?P<col>[$]?[0-9]+)")
    >>> m = cell.search("Spreadsheet cell aa$15")
    >>> m
    <_sre.SRE_Match object at 0x7f220a8e9360>
    >>> m.groupdict()
    {'col': '$15', 'row': 'aa'}
• 7. Formatting after parsing using a regular expression
    >>> rc = m.groupdict()
    >>> rc
    {'col': '$15', 'row': 'aa'}
    >>> 'It was row %(row)s and column %(col)s' % rc
    'It was row aa and column $15'
    >>> txt = "from a1 2 b$22 as well as 4 $c4"
    >>> f = r"<%(col)s,%(row)s>"
    >>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
    '<1,a>;<$22,b>;<4,$c>'
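The parse-then-format idea from the last two slides can be wrapped in one helper. This is a sketch assuming Python 3 and `str.format` rather than the `%` operator used in the talk; `describe_cells` and its `template` parameter are hypothetical names chosen for illustration.

```python
import re

# Same pattern as the slide: letters go to "row", digits to "col".
CELL = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")


def describe_cells(text, template="<{col},{row}>"):
    """Format every cell reference found in text using the named groups."""
    return ";".join(
        template.format(**m.groupdict()) for m in CELL.finditer(text)
    )


summary = describe_cells("from a1 2 b$22 as well as 4 $c4")
```

Because `groupdict()` returns a plain dictionary, any template that names `row` and `col` works without touching the pattern.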
• 8. Secret (labs) RE engine - internals
    ● Originally separate from module “re”
        ○ As of version 2.0 onwards they’re equivalent
        ○ Call it “sre” in any backward compatible code
    >>> import sre_parse
    >>> sre_parse.parse("ab|c")
    [('branch', (None, [
        [('literal', 97), ('literal', 98)],
        [('literal', 99)]
    ]))]
• 9. Secret Regular Expression Yield
    ● New module called sre_yield
        ○ https://github.com/google/sre_yield
    ● def Values(regex, flags=0, charset=CHARSET)
        ○ Examines output from sre_parse.parse()
        ○ Returns a convenient sequence-like object
    ● Sequence has an efficient membership test
        ○ We were given a regex describing its content
    ● Some features (lookahead, etc.) still missing
        ○ Easy to add if sequence can contain None
• 10. Iterating over all matching strings
    >>> import sre_yield
    >>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
    ['123', '1234', '14', '149']
    >>> len(sre_yield.Values('.'))
    256
    >>> sre_yield.Values('a*')[5:10]
    ['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']
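To see what such an enumeration contains without installing sre_yield, one can brute-force it with the standard library. The sketch below assumes Python 3 and is not how sre_yield works: it tries every string over a small alphabet and keeps the full matches, which is only feasible for tiny cases. The function name and its parameters are illustrative.

```python
import itertools
import re


def brute_force_values(pattern, alphabet, max_length):
    """Return every string over alphabet, up to max_length, that fully matches."""
    compiled = re.compile(pattern)
    matches = []
    for length in range(max_length + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                matches.append(candidate)
    return matches


# 781 candidates are tested to find the four strings the slide lists.
found = brute_force_values(r"1(?P<x>234?|49?)", "12349", 4)
```

The order differs from the slide (shortest strings come first here), and the cost grows exponentially with `max_length`, which is the motivation for walking the parsed pattern instead.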
• 11. What do we do about infinite repetitions
    >>> len(sre_yield.Values('0*'))
    65536
    # Yes, really. sre library can only specify 65534 max
    >>> a77k = 'a' * 77000
    >>> len(re.compile(r'.{,65534}').match(a77k).group(0))
    65534
    >>> len(re.compile(r'.{,65535}').match(a77k).group(0))
    77000
    >>> len(re.compile(r'.{60000}.{,6000}|.{,60000}')
    ...     .match(a77k).group(0))
    66000
• 12. How many matching strings
    >>> import sre_yield
    >>> bits = sre_yield.Values('[01]*')  # All binary nums
    >>> len(bits)  # how many are there?
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: long int too large to convert to int
    >>> bits.__len__() == 2**65536 - 1  # check the answer
    True
    >>> len(str(bits.__len__()))  # Is the number that big?
    19729
    >>> "001001" in bits, "002001" in bits
    (True, False)
• 13. Python does understand working with large numbers
    >>> import sre_yield
    >>> anything = sre_yield.Values('.*')
    >>> a = 1
    >>> for _ in xrange(65535): a = a * 256 + 1
    >>> anything.__len__() == a
    True
    >>> str_a = str(a)  # This does take a while
    >>> len(str_a)
    157825
    >>> str_a[:9], str_a[-9:]
    ('101818453', '945826561')
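The counts on the last three slides are geometric series: with an alphabet of k characters and repeat counts from 0 to 65535, there are 1 + k + k² + … + k^65535 strings. A sketch of that arithmetic, assuming Python 3 and using only the standard library (no sre_yield). `MAX_REPEAT`, `count_star`, and `decimal_digits` are illustrative names, and the digit count uses a logarithm rather than `str()` because recent Python versions limit int-to-string conversion length by default.

```python
import math

# Assumption taken from the slides: repeat counts run from 0 to 65535.
MAX_REPEAT = 65535


def count_star(alphabet_size, max_repeat=MAX_REPEAT):
    """Count strings of length 0..max_repeat over an alphabet (geometric series)."""
    if alphabet_size == 1:
        return max_repeat + 1
    return (alphabet_size ** (max_repeat + 1) - 1) // (alphabet_size - 1)


def decimal_digits(n):
    """Number of decimal digits, without converting a huge int to a string."""
    return math.floor(math.log10(n)) + 1


zero_star = count_star(1)      # corresponds to '0*'
binary_star = count_star(2)    # corresponds to '[01]*'
any_star = count_star(256)     # corresponds to '.*' over 256 characters
```

The closed form avoids the 65535-step loop from the slide while producing the same integer.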
• 14. But why bother yielding from a regex
    ● It can be more compact than a literal list, for example:
        ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
    ● That doesn’t get much shorter when rewritten:
        (ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
    ● On the other hand, others are more convenient:
        www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
    ● Some things would better be machine generated:
        192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
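Hand-compressed alternations like the region pattern are easy to get subtly wrong, so it is worth checking the rewrite against the literal list. A minimal sketch, assuming Python 3; the `NEAR_MISSES` strings are made up for this check and are not from the talk.

```python
import re

# The literal list from the slide.
REGIONS = [
    "ap-northeast-1", "ap-southeast-1", "ap-southeast-2", "eu-west-1",
    "sa-east-1", "us-east-1", "us-west-1", "us-west-2",
]

# The rewritten, factored form from the slide.
COMPACT = re.compile(
    r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2"
)

all_listed_match = all(COMPACT.fullmatch(name) for name in REGIONS)

# Plausible-looking names that the literal list does not contain.
NEAR_MISSES = ["eu-west-2", "ap-northeast-2", "us-east-2", "sa-west-1"]
none_extra_match = not any(COMPACT.fullmatch(name) for name in NEAR_MISSES)
```

A spot check like this cannot prove the two forms are equivalent, but enumerating the compact pattern and comparing the result with the list would.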
• 15. How fast is the “re” library
    ● Implementation uses backtracking, in the style of PCRE
        ○ So it is fast providing it never guesses wrong
        ○ Trivial to write an expression that is … slow
    import re
    import timeit
    def test(n):
        t = "a" * n
        r = "a?" * n + t
        return bool(re.match(r, t))
    timeit.timeit(stmt="test(6)",
                  setup="from __main__ import test")
• 16. The RE2 library
    ● https://code.google.com/p/re2
    ● https://github.com/axiak/pyre2
    ● RE2 tries all possible code paths in parallel
        ○ never backtracks, so omits features that need it
    ● drops support for backreferences
        ○ and generalized zero-width assertions
    ● Predictable worst case performance for any input
        ○ Safe to accept untrusted regular expressions
    test(10) takes 4 milliseconds instead of one minute
• 17. Summary
    ● Regular expressions are built into Python
        ○ re_obj = re.compile(pattern)
        ○ print re_obj.pattern
    ● They can parse strings into a dictionary
        ○ Or iteratively many dictionaries
    ● They can compactly represent large lists
        ○ Without expanding the whole iterator out
    ● For reliable performance, use RE2
        ○ Especially if users are supplying patterns
• 18. Questions?
    ● mail -s us.pycon.org/2014
        ○ Alex.Perry@Google.com
    ● Nothing to do with me, but pretty good:
        ○ http://qntm.org/files/re/re.html