A Taste of Python - Devdays Toronto 2009
Explores Peter Norvig's spell corrector written in Python as an example of the language's elegance and readability

Published in: Technology, Education

Transcript

  • 1. A Taste of Python. Presented by Jordan Baker, October 23, 2009, DevDays Toronto
  • 2. About Me • Open Source Developer • Founder of Open Source Web Application and CMS service provider: Scryent - www.scryent.com • Founder of Toronto Plone Users Group - www.torontoplone.ca
  • 3. Agenda • About Python • Show me your CODE • A Spell Checker in 21 lines of code • Why Python ROCKS • Resources for further exploration
  • 4. About Python http://www.flickr.com/photos/schoffer/196079076/
  • 5. About Python • Gotta love a language named after Monty Python’s Flying Circus • Used in more places than you might know
  • 6. Significant Whitespace
      C-like:
          if (x == 2) {
              do_something();
          }
          do_something_else();
      Python:
          if x == 2:
              do_something()
          do_something_else()
  • 7. Significant Whitespace • less code clutter • eliminates many common syntax errors • proper code layout • use an indentation-aware editor or IDE • Get over it!
  • 8. Python is Interactive
      Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
      [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>>
  • 9. FIZZ BUZZ 1 2 FIZZ 4 BUZZ ... 14 FIZZ BUZZ
  • 10. FIZZ BUZZ
      def fizzbuzz(n):
          # Start at 1: with range(n + 1) the loop would begin at 0 and print "Fizz Buzz" first.
          for i in range(1, n + 1):
              if not i % 3:
                  print "Fizz",
              if not i % 5:
                  print "Buzz",
              if i % 3 and i % 5:
                  print i,
              print

      fizzbuzz(50)
  • 11. FIZZ BUZZ (same code as slide 10)
  • 12. FIZZ BUZZ (OO)
      class FizzBuzzWriter(object):
          def __init__(self, limit):
              self.limit = limit

          def run(self):
              for n in range(1, self.limit + 1):
                  self.write_number(n)

          def write_number(self, n):
              if not n % 3:
                  print "Fizz",
              if not n % 5:
                  print "Buzz",
              if n % 3 and n % 5:
                  print n,
              print

      fizzbuzz = FizzBuzzWriter(50)
      fizzbuzz.run()
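The FizzBuzz slides use Python 2 print statements. As a rough sketch (not part of the talk), the same rules can be written for Python 3 by returning strings instead of printing them, which also makes the logic easy to check. The helper name `fizzbuzz_line` is an assumption made for this illustration.

```python
def fizzbuzz_line(i):
    """Return the FizzBuzz text for a single positive integer."""
    parts = []
    if i % 3 == 0:
        parts.append("Fizz")
    if i % 5 == 0:
        parts.append("Buzz")
    # Fall back to the number itself when neither rule applies.
    return " ".join(parts) or str(i)


def fizzbuzz(n):
    """Return the FizzBuzz lines for 1..n inclusive."""
    return [fizzbuzz_line(i) for i in range(1, n + 1)]


if __name__ == "__main__":
    print("\n".join(fizzbuzz(15)))
```

Returning a list keeps the rules separate from the output, so the caller decides whether to print, join, or test the result.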
  • 13. A Spell Checker in 21 Lines of Code • Written by Peter Norvig • Duplicated in many languages • Simple Spellchecking algorithm based on probability • http://norvig.com/spell-correct.html
  • 14. The Approach • Census by frequency • Morph the word (werd) • Insertions: waerd, wberd, werzd • Deletions: wrd, wed, wer • Transpositions: ewrd, wred, wedr • Replacements: aerd, ward, wbrd, word, wzrd, werz • Find the one with the highest frequency: were
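A minimal sketch of the final step of this approach, written for Python 3 and assuming a tiny made-up frequency table. The counts below are invented for illustration and do not come from the talk's corpus. Only candidates that appear in the census are kept, and the most frequent one wins.

```python
# Hypothetical word counts; a real census would come from a large text corpus.
counts = {"were": 4000, "word": 300, "ward": 50, "wed": 20}

# A few of the one-edit variations of "werd" listed on the slide.
candidates = ["waerd", "wrd", "wed", "ward", "word", "were", "wzrd"]

# Keep only the variations that are real (known) words.
known = [w for w in candidates if w in counts]

# Pick the known word that occurs most often in the census.
best = max(known, key=counts.get)
print(best)
```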
  • 15. Norvig Spellchecker
      import re, collections

      def words(text):
          return re.findall('[a-z]+', text.lower())

      def train(words):
          model = collections.defaultdict(int)
          for w in words:
              model[w] += 1
          return model

      NWORDS = train(words(file('big.txt').read()))

      alphabet = 'abcdefghijklmnopqrstuvwxyz'

      def edits1(word):
          s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
          deletes    = [a + b[1:] for a, b in s if b]
          transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
          replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
          inserts    = [a + c + b     for a, b in s for c in alphabet]
          return set(deletes + transposes + replaces + inserts)

      def known_edits2(word):
          return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

      def known(words):
          return set(w for w in words if w in NWORDS)

      def correct(word):
          candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
          return max(candidates, key=NWORDS.get)
  • 16. Regular Expressions
      def words(text):
          return re.findall('[a-z]+', text.lower())

      >>> words("The cat in the hat!")
      ['the', 'cat', 'in', 'the', 'hat']
  • 17. Dictionaries
      >>> d = {'cat':1}
      >>> d
      {'cat': 1}
      >>> d['cat']
      1
      >>> d['cat'] += 1
      >>> d
      {'cat': 2}
      >>> d['dog'] += 1
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      KeyError: 'dog'
  • 18. defaultdict
      # Has a factory for missing keys
      >>> d = collections.defaultdict(int)
      >>> d['dog'] += 1
      >>> d
      defaultdict(<type 'int'>, {'dog': 1})
      >>> int
      <type 'int'>
      >>> int()
      0

      def train(words):
          model = collections.defaultdict(int)
          for w in words:
              model[w] += 1
          return model

      >>> train(words("The cat in the hat!"))
      defaultdict(<type 'int'>, {'cat': 1, 'the': 2, 'hat': 1, 'in': 1})
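For readers on Python 3, the following sketch (an addition, not from the slides) repeats the training step and contrasts `defaultdict(int)` with the standard library's `collections.Counter`, which does the same counting. One difference worth knowing is that looking up a missing key inserts it into a `defaultdict` but not into a `Counter`.

```python
import collections
import re


def words(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(features):
    """Count occurrences with a defaultdict, as on the slide."""
    model = collections.defaultdict(int)
    for f in features:
        model[f] += 1
    return model


text = "The cat in the hat!"
model = train(words(text))
counter = collections.Counter(words(text))

# Both structures hold the same counts at this point.
same_counts = dict(model) == dict(counter)

# Reading a missing key returns 0 from both, with different side effects.
missing_from_model = model["dog"]      # inserts 'dog': 0 into the defaultdict
missing_from_counter = counter["dog"]  # leaves the Counter unchanged

print(same_counts, "dog" in model, "dog" in counter)
```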
  • 19. Reading the File
      >>> text = file('big.txt').read()
      >>> NWORDS = train(words(text))
      >>> NWORDS
      defaultdict(<type 'int'>, {'nunnery': 3, 'presnya': 1, 'woods': 22, 'clotted': 1, 'spiders': 1,
      'hanging': 42, 'disobeying': 2, 'scold': 3, 'originality': 6,
      'grenadiers': 8, 'pigment': 16, 'appropriation': 6, 'strictest': 1,
      'bringing': 48, 'revelers': 1, 'wooded': 8, 'wooden': 37,
      'wednesday': 13, 'shows': 50, 'immunities': 3, 'guardsmen': 4,
      'sooty': 1, 'inevitably': 32, 'clavicular': 9, 'sustaining': 5,
      'consenting': 1, 'scraped': 21, 'errors': 16, 'semicircular': 1,
      'cooking': 6, 'spiroch': 25, 'designing': 1, 'pawed': 1,
      'succumb': 12, 'shocks': 1, 'crouch': 2, 'chins': 1, 'awistocwacy': 1,
      'sunbeams': 1, 'perforations': 6, 'china': 43, 'affiliated': 4,
      'chunk': 22, 'natured': 34, 'uplifting': 1, 'slaveholders': 2,
      'climbed': 13, 'controversy': 33, 'natures': 2, 'climber': 1,
      'lency': 2, 'joyousness': 1, 'reproaching': 3, 'insecurity': 1,
      'abbreviations': 1, 'definiteness': 1, 'music': 56, 'therefore': 186,
      'expeditionary': 3, 'primeval': 1, 'unpack': 1, 'circumstances': 107,
      ... (about 6500 more lines) ...
      >>> NWORDS['the']
      80030
      >>> NWORDS['unusual']
      32
      >>> NWORDS['cephalopod']
      0
  • 20. Training the Probability Model
      import re, collections

      def words(text):
          return re.findall('[a-z]+', text.lower())

      def train(words):
          model = collections.defaultdict(int)
          for w in words:
              model[w] += 1
          return model

      NWORDS = train(words(file('big.txt').read()))
  • 21. List Comprehensions
      # These two are equivalent:
      result = []
      for v in iter:
          if cond:
              result.append(expr)

      [ expr for v in iter if cond ]

      # You can nest loops also:
      result = []
      for v1 in iter1:
          for v2 in iter2:
              if cond:
                  result.append(expr)

      [ expr for v1 in iter1 for v2 in iter2 if cond ]
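The slide states the equivalence with placeholders (`expr`, `iter`, `cond`). A concrete Python 3 illustration, with values chosen only for this example, might look like this:

```python
numbers = range(10)

# Explicit loop: squares of the even numbers.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# The same result as a list comprehension.
squares_comp = [n * n for n in numbers if n % 2 == 0]

# Nested loops: every pairing of a letter from each string.
pairs_loop = []
for a in "ab":
    for c in "xy":
        pairs_loop.append(a + c)

# The for clauses appear in the same order as the nested loops.
pairs_comp = [a + c for a in "ab" for c in "xy"]

print(squares_comp)
print(pairs_comp)
```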
  • 22. String Slicing
      >>> word = "spam"
      >>> word[:1]
      's'
      >>> word[1:]
      'pam'
      >>> (word[:1], word[1:])
      ('s', 'pam')
      >>> range(len(word) + 1)
      [0, 1, 2, 3, 4]
      >>> [(word[:i], word[i:]) for i in range(len(word) + 1)]
      [('', 'spam'), ('s', 'pam'), ('sp', 'am'), ('spa', 'm'), ('spam', '')]
  • 23. Deletions
      >>> word = "spam"
      >>> s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
      >>> deletes = [a + b[1:] for a, b in s if b]
      >>> deletes
      ['pam', 'sam', 'spm', 'spa']
      >>> a, b = ('s', 'pam')
      >>> a
      's'
      >>> b
      'pam'
      >>> bool('pam')
      True
      >>> bool('')
      False
  • 24. Transpositions
      For example: teh => the
      >>> transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
      >>> transposes
      ['psam', 'sapm', 'spma']
  • 25. Replacements
      >>> alphabet = "abcdefghijklmnopqrstuvwxyz"
      >>> replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
      >>> replaces
      ['apam', 'bpam', ..., 'zpam', 'saam', ..., 'szam', ..., 'spaz']
  • 26. Insertion
      >>> alphabet = "abcdefghijklmnopqrstuvwxyz"
      >>> inserts = [a + c + b for a, b in s for c in alphabet]
      >>> inserts
      ['aspam', ..., 'zspam', 'sapam', ..., 'szpam', 'spaam', ..., 'spamz']
  • 27. Find all Edits
      alphabet = 'abcdefghijklmnopqrstuvwxyz'

      def edits1(word):
          s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
          deletes = [a + b[1:] for a, b in s if b]
          transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
          replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
          inserts = [a + c + b for a, b in s for c in alphabet]
          return set(deletes + transposes + replaces + inserts)

      >>> edits1("spam")
      set(['sptm', 'skam', 'spzam', 'vspam', 'spamj', 'zpam', 'sbam', 'spham',
      'snam', 'sjpam', 'spma', 'swam', 'spaem', 'tspam', 'spmm', 'slpam',
      'upam', 'spaim', 'sppm', 'spnam', 'spem', 'sparm', 'spamr', 'lspam',
      'sdpam', 'spams', 'spaml', 'spamm', 'spamn', 'spum', 'spamh', 'spami',
      'spatm', 'spamk', 'spamd', ..., 'spcam', 'spamy'])
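To get a feel for how many candidates one edit produces, here is a Python 3 sketch (an addition, not from the slides) that keeps the four kinds of edit separate and counts them. The function name `edit_lists` is hypothetical. For a word of length n the lists hold n deletions, n - 1 transpositions, 26n replacements and 26(n + 1) insertions, 54n + 25 strings in total before duplicates are removed.

```python
import string

ALPHABET = string.ascii_lowercase


def edit_lists(word):
    """Return the one-edit variations of word, grouped by kind of edit."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        "deletes": [a + b[1:] for a, b in splits if b],
        "transposes": [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1],
        "replaces": [a + c + b[1:] for a, b in splits for c in ALPHABET if b],
        "inserts": [a + c + b for a, b in splits for c in ALPHABET],
    }


lists = edit_lists("spam")
sizes = {kind: len(items) for kind, items in lists.items()}
total = sum(sizes.values())

# The set is smaller than the total because some edits coincide,
# for example replacing a letter with itself gives back "spam".
unique = set().union(*lists.values())

print(sizes, total, len(unique) < total)
```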
  • 28. Known Words
      def known(words):
          """ Return the known words from `words`. """
          return set(w for w in words if w in NWORDS)
  • 29. Correct
      def known(words):
          """ Return the known words from `words`. """
          return set(w for w in words if w in NWORDS)

      def correct(word):
          candidates = known([word]) or known(edits1(word)) or [word]
          return max(candidates, key=NWORDS.get)

      >>> bool(set([]))
      False
      >>> correct("computr")
      'computer'
      >>> correct("computor")
      'computer'
      >>> correct("computerr")
      'computer'
  • 30. Edit Distance 2
      def known_edits2(word):
          return set(
              e2
              for e1 in edits1(word)
              for e2 in edits1(e1)
              if e2 in NWORDS
          )

      def correct(word):
          # The parentheses let the expression continue onto the next line.
          candidates = (known([word]) or known(edits1(word)) or
                        known_edits2(word) or [word])
          return max(candidates, key=NWORDS.get)

      >>> correct("conpuler")
      'computer'
      >>> correct("cmpuler")
      'computer'
  • 31. The complete spell checker (same listing as slide 15)
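The listing in the talk targets Python 2 (`file()` no longer exists in Python 3) and reads a corpus file named big.txt. The sketch below is one possible Python 3 adaptation, not the author's code. It assumes a short made-up sample sentence in place of big.txt, uses `collections.Counter` for the counts, and passes the counts as an argument instead of using the module-level `NWORDS`.

```python
import collections
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def train(features):
    # Counter does the same counting as the defaultdict(int) loop.
    return collections.Counter(features)


def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in ALPHABET if b]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(candidates, counts):
    return {w for w in candidates if w in counts}


def known_edits2(word, counts):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in counts}


def correct(word, counts):
    # Prefer the word itself, then one edit away, then two, else give up.
    candidates = (known([word], counts) or known(edits1(word), counts)
                  or known_edits2(word, counts) or [word])
    return max(candidates, key=counts.get)


# Made-up sample text standing in for the big.txt corpus (an assumption).
SAMPLE_TEXT = "the computer ran the program and the computer stopped"
COUNTS = train(words(SAMPLE_TEXT))

if __name__ == "__main__":
    for typo in ["computr", "computor", "conpuler", "zzzzzz"]:
        print(typo, "->", correct(typo, COUNTS))
```

With such a small sample only a handful of words can ever be suggested. The quality of the corrections depends on the size of the corpus.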
  • 32. Comparing Python & Java Versions • http://raelcunha.com/spell-correct.php • 35 lines of Java
  • 33.
      import java.io.*;
      import java.util.*;
      import java.util.regex.*;

      class Spelling {

          private final HashMap<String, Integer> nWords = new HashMap<String, Integer>();

          public Spelling(String file) throws IOException {
              BufferedReader in = new BufferedReader(new FileReader(file));
              Pattern p = Pattern.compile("\\w+");
              for(String temp = ""; temp != null; temp = in.readLine()){
                  Matcher m = p.matcher(temp.toLowerCase());
                  while(m.find()) nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
              }
              in.close();
          }

          private final ArrayList<String> edits(String word) {
              ArrayList<String> result = new ArrayList<String>();
              for(int i=0; i < word.length(); ++i) result.add(word.substring(0, i) + word.substring(i+1));
              for(int i=0; i < word.length()-1; ++i) result.add(word.substring(0, i) + word.substring(i+1, i+2) + word.substring(i, i+1) + word.substring(i+2));
              for(int i=0; i < word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i+1));
              for(int i=0; i <= word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));
              return result;
          }

          public final String correct(String word) {
              if(nWords.containsKey(word)) return word;
              ArrayList<String> list = edits(word);
              HashMap<Integer, String> candidates = new HashMap<Integer, String>();
              for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
              if(candidates.size() > 0) return candidates.get(Collections.max(candidates.keySet()));
              for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
              return candidates.size() > 0 ? candidates.get(Collections.max(candidates.keySet())) : word;
          }

          public static void main(String args[]) throws IOException {
              if(args.length > 0) System.out.println((new Spelling("big.txt")).correct(args[0]));
          }

      }
  • 34. The Python version again, for comparison (same listing as slide 15)
  • 35. IDEs for Python • IDEs for Python include: • PyDev for Eclipse • WingIDE • IDLE for Windows/Linux/Mac • and more
  • 36. Why Python ROCKS • Elegant and readable language - “Executable Pseudocode” • Standard Libraries - “Batteries Included” • Very high-level data types • Dynamically Typed • It’s FUN!
  • 37. An Open Source Community • Projects: Plone, Zope, Grok, BFG, Django, SciPy & NumPy, Google App Engine, PyGame • PyCon
  • 38. Resources • PyGTA • Toronto Plone Users • Toronto Django Users • Stackoverflow • Dive into Python • Python Tutorial
  • 39. Thanks • I’d love to hear your questions or comments on this presentation. Reach me at: • jbb@scryent.com • http://twitter.com/hexsprite