Hacker102 - RegExes w/JavaScript and Python

1,030 views

Published on

Basic introduction to regexes using JavaScript and Python. Developed for code4lib 2010 conference preconf "Hacker 101/102".

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,030
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Hacker102 - RegExes w/JavaScript and Python

  1. 1. hacker 102 code4lib 2010 preconference Asheville, NC, USA 2010-02-21
  2. 2. iv. regular expressions JavaScript
  3. 3. if all language looked like “aabaaaabbbabaababa” it’d be easy to parse
  4. 4. parsing “aabaaaabbbabaababa” • there are two elements, “a” and “b” • either may occur in any order • /([ab]+)/
  5. 5. • [] denotes “elements” or “class” • // demarcates regex • + denotes “one or more of previous thing” • () denotes “remember this matched group” • /[ab]/ # an ‘a’ or a ‘b’ • /[ab]+/ # one or more ‘a’s or ‘b’s • /([ab]+)/ # a group of one or more ‘a’s or ‘b’s
  6. 6. to firebug!
  7. 7. • [a-z] is any lower case char bet. a-z • [0-9] is any digit • + is one or more of previous thing • ? is zero or one of previous thing • | is or, e.g. [a|b] is ‘a’ or ‘b’ • * is zero to many of previous thing • . matches any character
  8. 8. • [^a-z] is anything *but* [a-z] • [a-zA-Z0-9] is any of a-z, A-Z, 0-9 • {5} matches only 5 of the preceding thing • {2,} matches at least 2 of the preceding thing • {2,6} matches from 2 to 6 of preceding thing • [d] is like [0-9] (any digit) • [S] is any non-whitespace
  9. 9. try this • visit any web page • open firebug console • title = window.document.title • try regexes to match parts of the title
  10. 10. most every language has regex support
  11. 11. try unix “grep”
  12. 12. v. glue it together Python
  13. 13. problem: Carol’s data
  14. 14. TITLE: ABA journal. BD. HOLDINGS: Vol. 70 (1984) - Vol. 94 (2008) CURRENT VOL.: Vol. 95 (2009) - OTHER LIBRARIES: Miami:v. 68 (1982) - USDC: v. 88 (2002) - Birm.:v. 89 (2003) - (Formerly: American Bar Association Journal) (Bound and on Hein) TITLE: Administrative law review. BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008) CURRENT VOL.: Vol. 61 (2009) - (Bound and on Hein)
  15. 15. starter code for you
  16. 16. #!/usr/bin/env python import re re_tag = re.compile(r'([A-Z .]+):') re_title = re.compile('TITLE: (.*)') for line in open('journals-carol-bean.txt'): line = line.strip() m1 = re_tag.match(line) m2 = re_title.match(line) if line == "": continue print "n->", line, "<-" if m1 or m2: print "MATCH" if m1: print 'tag:', m1.groups() if m2: print 'title:', m2.groups()

×