A Taste of Python - Devdays Toronto 2009
 

Explores Peter Norvig's spell corrector written in Python as an example of the language's elegance and readability

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • A Taste of Python • Presented by Jordan Baker • October 23, 2009 • DevDays Toronto
    • About Me • Open Source Developer • Founder of Open Source Web Application and CMS service provider: Scryent - www.scryent.com • Founder of Toronto Plone Users Group - www.torontoplone.ca
    • Agenda • About Python • Show me your CODE • A Spell Checker in 21 lines of code • Why Python ROCKS • Resources for further exploration
    • About Python http://www.flickr.com/photos/schoffer/196079076/
    • About Python • Gotta love a language named after Monty Python’s Flying Circus • Used in more places than you might know
    • Significant Whitespace
        C-like:
            if(x == 2) {
                do_something();
            }
            do_something_else();
        Python:
            if x == 2:
                do_something()
            do_something_else()
    • Significant Whitespace • less code clutter • eliminates many common syntax errors • enforces proper code layout • use an indentation-aware editor or IDE • Get over it!
    • Python is Interactive
        Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
        [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
        Type "help", "copyright", "credits" or "license" for more information.
        >>>
    • FIZZ BUZZ 1 2 FIZZ 4 BUZZ ... 14 FIZZ BUZZ
    • FIZZ BUZZ
        def fizzbuzz(n):
            # Start at 1 so the output begins "1 2 Fizz ..." rather than with 0
            for i in range(1, n + 1):
                if not i % 3:
                    print "Fizz",
                if not i % 5:
                    print "Buzz",
                if i % 3 and i % 5:
                    print i,
                print

        fizzbuzz(50)
    • FIZZ BUZZ (OO)
        class FizzBuzzWriter(object):
            def __init__(self, limit):
                self.limit = limit

            def run(self):
                for n in range(1, self.limit + 1):
                    self.write_number(n)

            def write_number(self, n):
                if not n % 3:
                    print "Fizz",
                if not n % 5:
                    print "Buzz",
                if n % 3 and n % 5:
                    print n,
                print

        fizzbuzz = FizzBuzzWriter(50)
        fizzbuzz.run()
    • A Spell Checker in 21 Lines of Code • Written by Peter Norvig • Duplicated in many languages • Simple Spellchecking algorithm based on probability • http://norvig.com/spell-correct.html
    • The Approach • Census by frequency • Morph the word (werd) • Insertions: waerd, wberd, werzd • Deletions: wrd, wed, wer • Transpositions: ewrd, wred, wedr • Replacements: aerd, ward, wbrd, word, wzrd, werz • Find the one with the highest frequency: were
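The selection step of the approach can be sketched on its own. This is a minimal illustration in Python 3 (the slides use Python 2.6); the word counts in `census` are invented for the example, and the list of variants is a hand-picked subset of the one-edit variants of "werd", not the full set.

```python
# Hypothetical word counts; a real census would be built from a large text corpus.
census = {"were": 4289, "word": 298, "ward": 23}

# A hand-picked handful of one-edit variants of the misspelling "werd".
variants = ["waerd", "wrd", "wred", "ward", "word", "were", "werz"]

# Keep only the variants that appear in the census ...
known_variants = [v for v in variants if v in census]

# ... and choose the one with the highest count.
best = max(known_variants, key=census.get)

print(known_variants)  # ['ward', 'word', 'were']
print(best)            # were
```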
    • Norvig Spellchecker
        import re, collections

        def words(text):
            return re.findall('[a-z]+', text.lower())

        def train(words):
            model = collections.defaultdict(int)
            for w in words:
                model[w] += 1
            return model

        NWORDS = train(words(file('big.txt').read()))

        alphabet = 'abcdefghijklmnopqrstuvwxyz'

        def edits1(word):
            s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
            deletes    = [a + b[1:] for a, b in s if b]
            transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
            replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
            inserts    = [a + c + b     for a, b in s for c in alphabet]
            return set(deletes + transposes + replaces + inserts)

        def known_edits2(word):
            return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

        def known(words):
            return set(w for w in words if w in NWORDS)

        def correct(word):
            candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
            return max(candidates, key=NWORDS.get)
    • Regular Expressions
        def words(text):
            return re.findall('[a-z]+', text.lower())

        >>> words("The cat in the hat!")
        ['the', 'cat', 'in', 'the', 'hat']
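One consequence of the `[a-z]+` pattern is worth seeing directly: anything that is not a lowercase ASCII letter acts as a separator, so apostrophes split words and digits disappear. A small sketch in Python 3 (the input sentence is made up for illustration):

```python
import re

def words(text):
    # Lowercase first, then collect every maximal run of the letters a-z.
    return re.findall('[a-z]+', text.lower())

tokens = words("Don't panic: it's 2009!")
print(tokens)  # ['don', 't', 'panic', 'it', 's']
```

For a frequency census this is usually acceptable, since fragments such as 't' and 's' are just extra low-value entries in the model.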
    • Dictionaries
        >>> d = {'cat':1}
        >>> d
        {'cat': 1}
        >>> d['cat']
        1
        >>> d['cat'] += 1
        >>> d
        {'cat': 2}
        >>> d['dog'] += 1
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        KeyError: 'dog'
    • defaultdict
        # Has a factory for missing keys
        >>> d = collections.defaultdict(int)
        >>> d['dog'] += 1
        >>> d
        defaultdict(<type 'int'>, {'dog': 1})
        >>> int
        <type 'int'>
        >>> int()
        0

        def train(words):
            model = collections.defaultdict(int)
            for w in words:
                model[w] += 1
            return model

        >>> train(words("The cat in the hat!"))
        defaultdict(<type 'int'>, {'cat': 1, 'the': 2, 'hat': 1, 'in': 1})
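As a point of comparison, later Python versions (2.7 and 3.x) also ship `collections.Counter`, which does the same tallying in one call. The sketch below, written for Python 3, is not part of the talk; it only checks that the two approaches agree on the slide's example sentence.

```python
import collections
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(tokens):
    # Same idea as the slide: missing keys start at int() == 0.
    model = collections.defaultdict(int)
    for w in tokens:
        model[w] += 1
    return model

tokens = words("The cat in the hat!")
model = train(tokens)
counts = collections.Counter(tokens)  # alternative, not used in the talk

# Both produce the same word -> count mapping.
print(dict(model) == dict(counts))  # True
print(counts.most_common(1))        # [('the', 2)]
```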
    • Reading the File
        >>> text = file('big.txt').read()
        >>> NWORDS = train(words(text))
        >>> NWORDS
        defaultdict(<type 'int'>, {'nunnery': 3, 'presnya': 1, 'woods': 22, 'clotted': 1, 'spiders': 1,
        'hanging': 42, 'disobeying': 2, 'scold': 3, 'originality': 6,
        'grenadiers': 8, 'pigment': 16, 'appropriation': 6, 'strictest': 1,
        'bringing': 48, 'revelers': 1, 'wooded': 8, 'wooden': 37,
        'wednesday': 13, 'shows': 50, 'immunities': 3, 'guardsmen': 4,
        'sooty': 1, 'inevitably': 32, 'clavicular': 9, 'sustaining': 5,
        'consenting': 1, 'scraped': 21, 'errors': 16, 'semicircular': 1,
        'cooking': 6, 'spiroch': 25, 'designing': 1, 'pawed': 1,
        'succumb': 12, 'shocks': 1, 'crouch': 2, 'chins': 1, 'awistocwacy': 1,
        'sunbeams': 1, 'perforations': 6, 'china': 43, 'affiliated': 4,
        'chunk': 22, 'natured': 34, 'uplifting': 1, 'slaveholders': 2,
        'climbed': 13, 'controversy': 33, 'natures': 2, 'climber': 1,
        'lency': 2, 'joyousness': 1, 'reproaching': 3, 'insecurity': 1,
        'abbreviations': 1, 'definiteness': 1, 'music': 56, 'therefore': 186,
        'expeditionary': 3, 'primeval': 1, 'unpack': 1, 'circumstances': 107,
        ... (about 6500 more lines) ...
        >>> NWORDS['the']
        80030
        >>> NWORDS['unusual']
        32
        >>> NWORDS['cephalopod']
        0
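The last lookup above has a side effect that is easy to miss: indexing a `defaultdict` with a missing key stores that key with the default value. Since the corrector later tests membership with `in NWORDS`, probing the model with square brackets can make an unseen word look known. A short Python 3 sketch with a toy model (the words are placeholders):

```python
import collections

NWORDS = collections.defaultdict(int)
NWORDS['the'] += 1

before = 'cephalopod' in NWORDS   # False: never seen
value = NWORDS['cephalopod']      # 0, but this lookup also inserts the key
after = 'cephalopod' in NWORDS    # True: the key now exists with count 0

# .get() reads without inserting, so it is the safer way to peek.
peek = NWORDS.get('squid', 0)     # 0
still_absent = 'squid' not in NWORDS

print(before, value, after, peek, still_absent)  # False 0 True 0 True
```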
    • Training the Probability Model
        import re, collections

        def words(text):
            return re.findall('[a-z]+', text.lower())

        def train(words):
            model = collections.defaultdict(int)
            for w in words:
                model[w] += 1
            return model

        NWORDS = train(words(file('big.txt').read()))
    • List Comprehensions
        # These two are equivalent:
        result = []
        for v in iter:
            if cond:
                result.append(expr)

        [ expr for v in iter if cond ]

        # You can nest loops also:
        result = []
        for v1 in iter1:
            for v2 in iter2:
                if cond:
                    result.append(expr)

        [ expr for v1 in iter1 for v2 in iter2 if cond ]
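The templates above use placeholders (`expr`, `iter`, `cond`). A concrete instance may help; the values below are arbitrary and chosen only for illustration (Python 3).

```python
# Loop form: squares of the even numbers below 10.
squares_loop = []
for n in range(10):
    if n % 2 == 0:
        squares_loop.append(n * n)

# Equivalent comprehension.
squares = [n * n for n in range(10) if n % 2 == 0]
print(squares)  # [0, 4, 16, 36, 64]

# Nested loops read left to right, in the same order as the for statements.
pairs = [(a, b) for a in 'ab' for b in 'xy' if a + b != 'ay']
print(pairs)  # [('a', 'x'), ('b', 'x'), ('b', 'y')]
```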
    • String Slicing
        >>> word = "spam"
        >>> word[:1]
        's'
        >>> word[1:]
        'pam'
        >>> (word[:1], word[1:])
        ('s', 'pam')
        >>> range(len(word) + 1)
        [0, 1, 2, 3, 4]
        >>> [(word[:i], word[i:]) for i in range(len(word) + 1)]
        [('', 'spam'), ('s', 'pam'), ('sp', 'am'), ('spa', 'm'), ('spam', '')]
    • Deletions
        >>> word = "spam"
        >>> s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        >>> deletes = [a + b[1:] for a, b in s if b]
        >>> deletes
        ['pam', 'sam', 'spm', 'spa']
        >>> a, b = ('s', 'pam')
        >>> a
        's'
        >>> b
        'pam'
        >>> bool('pam')
        True
        >>> bool('')
        False
    • Transpositions
        For example: teh => the
        >>> transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
        >>> transposes
        ['psam', 'sapm', 'spma']
    • Replacements
        >>> alphabet = "abcdefghijklmnopqrstuvwxyz"
        >>> replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
        >>> replaces
        ['apam', 'bpam', ..., 'zpam', 'saam', ..., 'szam', ..., 'spaz']
    • Insertion
        >>> alphabet = "abcdefghijklmnopqrstuvwxyz"
        >>> inserts = [a + c + b for a, b in s for c in alphabet]
        >>> inserts
        ['aspam', ..., 'zspam', 'sapam', ..., 'szpam', 'spaam', ..., 'spamz']
    • Find all Edits
        alphabet = 'abcdefghijklmnopqrstuvwxyz'

        def edits1(word):
            s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
            deletes = [a + b[1:] for a, b in s if b]
            transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
            replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
            inserts = [a + c + b for a, b in s for c in alphabet]
            return set(deletes + transposes + replaces + inserts)

        >>> edits1("spam")
        set(['sptm', 'skam', 'spzam', 'vspam', 'spamj', 'zpam', 'sbam', 'spham',
        'snam', 'sjpam', 'spma', 'swam', 'spaem', 'tspam', 'spmm', 'slpam',
        'upam', 'spaim', 'sppm', 'spnam', 'spem', 'sparm', 'spamr', 'lspam',
        'sdpam', 'spams', 'spaml', 'spamm', 'spamn', 'spum', 'spamh', 'spami',
        'spatm', 'spamk', 'spamd', ..., 'spcam', 'spamy'])
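To get a feel for how many candidates `edits1` generates, the four lists can be counted separately. For a word of length n there are n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, which adds up to 54n+25 before duplicates are removed. The helper `edit_counts` below is a hypothetical name introduced for this Python 3 sketch; it reuses the list comprehensions from the slide.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_counts(word):
    """Return the size of each edit list and the number of unique results."""
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    unique = set(deletes + transposes + replaces + inserts)
    return len(deletes), len(transposes), len(replaces), len(inserts), len(unique)

d, t, r, i, unique = edit_counts("spam")
total = d + t + r + i
print(d, t, r, i)                        # 4 3 104 130
print(total == 54 * len("spam") + 25)    # True
print(unique < total)                    # True: the set drops duplicates
```

The set is smaller than the raw total because some edits coincide, for example replacing a letter with itself gives back the original word.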
    • Known Words
        def known(words):
            """ Return the known words from `words`. """
            return set(w for w in words if w in NWORDS)
    • Correct
        def known(words):
            """ Return the known words from `words`. """
            return set(w for w in words if w in NWORDS)

        def correct(word):
            candidates = known([word]) or known(edits1(word)) or [word]
            return max(candidates, key=NWORDS.get)

        >>> bool(set([]))
        False
        >>> correct("computr")
        'computer'
        >>> correct("computor")
        'computer'
        >>> correct("computerr")
        'computer'
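The `or` chain and the frequency-based `max` can be tried without `big.txt`. The sketch below is a Python 3 adaptation that trains on a short made-up sentence (`SAMPLE` is an assumption standing in for the real corpus), so the counts are tiny but the control flow is the same as on the slide.

```python
import collections
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(tokens):
    model = collections.defaultdict(int)
    for w in tokens:
        model[w] += 1
    return model

# Hypothetical stand-in for big.txt: "computer" appears twice, "compute" once.
SAMPLE = "the computer ran the program to compute and the computer crashed"
NWORDS = train(words(SAMPLE))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    return set(w for w in candidates if w in NWORDS)

def correct(word):
    # An empty set is falsy, so `or` falls through to the next option.
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)

print(sorted(known(edits1("computr"))))  # ['compute', 'computer']
print(correct("computr"))   # computer (the more frequent candidate wins)
print(correct("program"))   # program (already a known word)
print(correct("xyzzy"))     # xyzzy (nothing known within one edit)
```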
    • Edit Distance 2
        def known_edits2(word):
            return set(
                e2
                    for e1 in edits1(word)
                        for e2 in edits1(e1)
                            if e2 in NWORDS
                )

        def correct(word):
            # Parentheses let the expression continue across lines
            candidates = (known([word]) or known(edits1(word)) or
                          known_edits2(word) or [word])
            return max(candidates, key=NWORDS.get)

        >>> correct("conpuler")
        'computer'
        >>> correct("cmpuler")
        'computer'
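The fallback to distance 2 can also be exercised with a toy model. In this Python 3 sketch `NWORDS` is a plain dict with invented counts (an assumption replacing the model trained on `big.txt`); the misspelling "cmputr" has no known word within one edit, so the distance-2 search supplies the candidates.

```python
# Hypothetical counts standing in for the model trained on big.txt.
NWORDS = {"computer": 2, "compute": 1, "the": 3}

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    return set(w for w in candidates if w in NWORDS)

def known_edits2(word):
    # Apply edits1 twice, keeping only results that are known words.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word)) or
                  known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)

print(known(edits1("cmputr")))          # set(): nothing within one edit
print(sorted(known_edits2("cmputr")))   # ['compute', 'computer']
print(correct("cmputr"))                # computer
```

Filtering by `NWORDS` inside the generator keeps the result small even though the number of raw distance-2 strings is large.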
    • Comparing Python & Java Versions • http://raelcunha.com/spell-correct.php • 35 lines of Java
    • Java version
        import java.io.*;
        import java.util.*;
        import java.util.regex.*;

        class Spelling {

            private final HashMap<String, Integer> nWords = new HashMap<String, Integer>();

            public Spelling(String file) throws IOException {
                BufferedReader in = new BufferedReader(new FileReader(file));
                Pattern p = Pattern.compile("\\w+");
                for(String temp = ""; temp != null; temp = in.readLine()){
                    Matcher m = p.matcher(temp.toLowerCase());
                    while(m.find()) nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
                }
                in.close();
            }

            private final ArrayList<String> edits(String word) {
                ArrayList<String> result = new ArrayList<String>();
                for(int i=0; i < word.length(); ++i) result.add(word.substring(0, i) + word.substring(i+1));
                for(int i=0; i < word.length()-1; ++i) result.add(word.substring(0, i) + word.substring(i+1, i+2) + word.substring(i, i+1) + word.substring(i+2));
                for(int i=0; i < word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i+1));
                for(int i=0; i <= word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));
                return result;
            }

            public final String correct(String word) {
                if(nWords.containsKey(word)) return word;
                ArrayList<String> list = edits(word);
                HashMap<Integer, String> candidates = new HashMap<Integer, String>();
                for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
                if(candidates.size() > 0) return candidates.get(Collections.max(candidates.keySet()));
                for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
                return candidates.size() > 0 ? candidates.get(Collections.max(candidates.keySet())) : word;
            }

            public static void main(String args[]) throws IOException {
                if(args.length > 0) System.out.println((new Spelling("big.txt")).correct(args[0]));
            }

        }
    • IDEs for Python • IDEs for Python include: • PyDev for Eclipse • WingIDE • IDLE for Windows/Linux/Mac • and more
    • Why Python ROCKS • Elegant and readable language - “Executable Pseudocode” • Standard Libraries - “Batteries Included” • Very high-level datatypes • Dynamically typed • It’s FUN!
    • An Open Source Community • Projects: Plone, Zope, Grok, BFG, Django, SciPy & NumPy, Google App Engine, PyGame • PyCon
    • Resources • PyGTA • Toronto Plone Users • Toronto Django Users • Stack Overflow • Dive into Python • Python Tutorial
    • Thanks • I’d love to hear your questions or comments on this presentation. Reach me at: • jbb@scryent.com • http://twitter.com/hexsprite