Generator Tricks
        For Systems Programmers
                                                 David Beazley
                                            http://www.dabeaz.com

                                            Presented at PyCon'2008



Copyright (C) 2008, http://www.dabeaz.com                             1- 1




                                     An Introduction

              • Generators are cool!
              • But what are they?
              • And what are they good for?
              • That's what this tutorial is about

About Me

              • I'm a long-time Pythonista
              • First started using Python with version 1.3
              • Author : Python Essential Reference
              • Responsible for a number of open source
                     Python-related packages (Swig, PLY, etc.)







                                            My Story
               My addiction to generators started innocently
                      enough. I was just a happy Python
                 programmer working away in my secret lair
                 when I got "the call." A call to sort through
                  1.5 Terabytes of C++ source code (~800
                weekly snapshots of a million line application).
                   That's when I discovered the os.walk()
               function. I knew this wasn't going to end well...



Back Story

              • I think generators are wicked cool
              • An extremely useful language feature
              • Yet, they still seem rather exotic
              • I still don't think I've fully wrapped my brain
                      around the best approach to using them







                                            A Complaint
               • The coverage of generators in most Python
                      books is lame (mine included)
               • Look at all of these cool examples!
                  • Fibonacci Numbers
                  • Squaring a list of numbers
                  • Randomized sequences
               • Wow! Blow me over!
This Tutorial
              • Some more practical uses of generators
              • Focus is "systems programming"
              • Which loosely includes files, file systems,
                      parsing, networking, threads, etc.
              • My goal : To provide some more compelling
                      examples of using generators
              • Planting some seeds




                                            Support Files

               • Files used in this tutorial are available here:
                                  http://www.dabeaz.com/generators/

               • Go there to follow along with the examples


Disclaimer
               • This isn't meant to be an exhaustive tutorial
                      on generators and related theory
               • Will be looking at a series of examples
               • I don't know if the code I've written is the
                      "best" way to solve any of these problems.
               • So, let's have a discussion





                         Performance Details
              • Some performance numbers appear later
              • Python 2.5.1 on OS X 10.4.11
              • All tests were conducted on the following:
                   • Mac Pro 2x2.66 GHz Dual-Core Xeon
                   • 3 Gbytes RAM
                   • WDC WD2500JS-41SGB0 Disk (250G)
              • Timings are a 3-run average of the 'time' command
Part 1
                           Introduction to Iterators and Generators








                                            Iteration
                  • As you know, Python has a "for" statement
                  • You use it to loop over a collection of items
                             >>> for x in [1,4,5,10]:
                             ...      print x,
                             ...
                             1 4 5 10
                             >>>


                   • And, as you have probably noticed, you can
                          iterate over many different kinds of objects
                          (not just lists)


Iterating over a Dict
                  • If you loop over a dictionary you get keys
                           >>> prices = { 'GOOG' : 490.10,
                           ...            'AAPL' : 145.23,
                           ...            'YHOO' : 21.71 }
                           ...
                           >>> for key in prices:
                           ...     print key
                           ...
                           YHOO
                           GOOG
                           AAPL
                           >>>








                     Iterating over a String
                  • If you loop over a string, you get characters
                          >>> s = "Yow!"
                          >>> for c in s:
                          ...     print c
                          ...
                          Y
                          o
                          w
                          !
                          >>>




Iterating over a File
                   • If you loop over a file you get lines
                        >>> for line in open("real.txt"):
                        ...     print line,
                        ...
                                 Real Programmers write in FORTRAN

                                    Maybe they do now,
                                    in this decadent era of
                                    Lite beer, hand calculators, and "user-friendly" software
                                    but back in the Good Old Days,
                                    when the term "software" sounded funny
                                    and Real Computers were made out of drums and vacuum tubes,
                                    Real Programmers wrote in machine code.
                                    Not FORTRAN. Not RATFOR. Not, even, assembly language.
                                    Machine Code.
                                    Raw, unadorned, inscrutable hexadecimal numbers.
                                    Directly.





                        Consuming Iterables
                   • Many functions consume an "iterable" object
                   • Reductions:
                               sum(s), min(s), max(s)

                    • Constructors
                               list(s), tuple(s), set(s), dict(s)


                    • in operator
                               item in s


                    • Many others in the library
Iteration Protocol
               • The reason why you can iterate over different
                       objects is that there is a specific protocol
                         >>> items = [1, 4, 5]
                         >>> it = iter(items)
                         >>> it.next()
                         1
                         >>> it.next()
                         4
                         >>> it.next()
                         5
                         >>> it.next()
                         Traceback (most recent call last):
                           File "<stdin>", line 1, in <module>
                         StopIteration
                         >>>







                                 Iteration Protocol
                   • An inside look at the for statement
                           for x in obj:
                               # statements


                  • Underneath the covers
                          _iter = iter(obj)            # Get iterator object
                          while 1:
                              try:
                                   x = _iter.next()    # Get next item
                              except StopIteration:    # No more items
                                   break
                              # statements
                              ...

                  • Any object that supports iter() is said to be
                          "iterable"; the iterator it returns supplies next()
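
The expansion above can be tried directly. This is a minimal sketch written for Python 3, where the built-in next(it) replaces the it.next() method shown on the slides; the helper name manual_for is made up for illustration.

```python
def manual_for(obj):
    """Collect items from obj the way a for loop would, using the raw protocol."""
    results = []
    it = iter(obj)                # Get iterator object
    while True:
        try:
            x = next(it)          # Get next item (it.next() in Python 2)
        except StopIteration:     # No more items
            break
        results.append(x)         # Stands in for the loop body
    return results

print(manual_for([1, 4, 5, 10]))
print(manual_for("Yow!"))
```

The same helper works on a list and on a string because both hand back an iterator from iter().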

Supporting Iteration
                 • User-defined objects can support iteration
                 • Example: Counting down...
                          >>> for x in countdown(10):
                          ...     print x,
                          ...
                          10 9 8 7 6 5 4 3 2 1
                          >>>


                • To do this, you just have to make the object
                        implement __iter__() and next()







                           Supporting Iteration
               • Sample implementation
                            class countdown(object):
                                def __init__(self,start):
                                    self.count = start
                                def __iter__(self):
                                    return self
                                def next(self):
                                    if self.count <= 0:
                                        raise StopIteration
                                    r = self.count
                                    self.count -= 1
                                    return r
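
The class above targets Python 2, where the iterator method is called next(). In Python 3 the method is named __next__(). The following sketch, with the class renamed Countdown only to keep it distinct, defines both spellings so it should run under either version.

```python
class Countdown(object):
    def __init__(self, start):
        self.count = start

    def __iter__(self):
        return self               # The object is its own iterator

    def __next__(self):           # Python 3 name for the method
        if self.count <= 0:
            raise StopIteration
        r = self.count
        self.count -= 1
        return r

    next = __next__               # Python 2 name, kept as an alias

print(list(Countdown(5)))
```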




Iteration Example

                  • Example use:
                               >>> c =      countdown(5)
                               >>> for      i in c:
                               ...          print i,
                               ...
                               5 4 3 2      1
                               >>>








                    Iteration Commentary

                  • There are many subtle details involving the
                         design of iterators for various objects
                  • However, we're not going to cover that
                  • This isn't a tutorial on "iterators"
                  • We're talking about generators...

Generators
                  • A generator is a function that produces a
                         sequence of results instead of a single value
                            def countdown(n):
                                while n > 0:
                                    yield n
                                    n -= 1
                            >>> for i in countdown(5):
                            ...     print i,
                            ...
                            5 4 3 2 1
                            >>>


                  • Instead of returning a value, you generate a
                         series of values (using the yield statement)





                                            Generators
                  • Behavior is quite different from a normal function
                  • Calling a generator function creates a
                         generator object. However, it does not start
                         running the function.
                         def countdown(n):
                             print "Counting down from", n
                             while n > 0:
                                 yield n
                                 n -= 1

                         >>> x = countdown(10)     # Notice that no output was produced
                         >>> x
                         <generator object at 0x58490>
                         >>>


Generator Functions
                 • The function only executes on next()
                         >>> x = countdown(10)
                         >>> x
                         <generator object at 0x58490>
                         >>> x.next()
                         Counting down from 10           <- Function starts executing here
                         10
                         >>>

                 • yield produces a value, but suspends the function
                 • Function resumes on next call to next()
                         >>> x.next()
                         9
                         >>> x.next()
                         8
                         >>>






                         Generator Functions

                 • When the generator returns, iteration stops
                         >>> x.next()
                         1
                         >>> x.next()
                         Traceback (most recent call last):
                           File "<stdin>", line 1, in ?
                         StopIteration
                         >>>
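
The start, suspend, resume, and stop behavior can be checked without reading interpreter output. This is a small sketch for Python 3 (next(x) instead of x.next()); the events list is an illustrative addition that records when the function body actually runs.

```python
events = []                       # Records when the function body runs

def countdown(n):
    events.append("started")      # Runs on the first next(), not at call time
    while n > 0:
        yield n                   # Produce a value and suspend here
        n -= 1
    events.append("finished")     # Runs when the loop ends, before StopIteration

x = countdown(2)
before_first_next = list(events)  # Nothing has executed yet
first = next(x)                   # Body starts and runs to the first yield
second = next(x)                  # Resumes right after the yield
try:
    next(x)                       # The function returns, so iteration stops
    exhausted = False
except StopIteration:
    exhausted = True

print(before_first_next, first, second, exhausted, events)
```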




Generator Functions

                • A generator function is mainly a more
                       convenient way of writing an iterator
                • You don't have to worry about the iterator
                       protocol (.next, .__iter__, etc.)
                • It just works





                Generators vs. Iterators
               • A generator function is slightly different
                      than an object that supports iteration
               • A generator is a one-time operation. You
                      can iterate over the generated data once,
                      but if you want to do it again, you have to
                      call the generator function again.
               • This is different than a list (which you can
                      iterate over as many times as you want)
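
A quick way to see the one-time behavior, sketched for Python 3 with made-up variable names:

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

g = countdown(3)
first_pass = list(g)              # Consumes the generator
second_pass = list(g)             # Already exhausted, so nothing is left
fresh_pass = list(countdown(3))   # Calling the function again starts over

items = [3, 2, 1]
list_passes = (list(items), list(items))   # A list can be iterated repeatedly

print(first_pass, second_pass, fresh_pass, list_passes)
```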


Generator Expressions
                   • A generated version of a list comprehension
                            >>> a = [1,2,3,4]
                            >>> b = (2*x for x in a)
                            >>> b
                            <generator object at 0x58760>
                            >>> for i in b: print i,
                            ...
                            2 4 6 8
                            >>>


                   • This loops over a sequence of items and applies
                          an operation to each item
                   • However, results are produced one at a time
                          using a generator





                   Generator Expressions
                  • Important differences from a list comp.
                           •      Does not construct a list.

                           •      Only useful purpose is iteration

                           •      Once consumed, can't be reused

                  • Example:
                          >>> a = [1,2,3,4]
                          >>> b = [2*x for x in a]
                          >>> b
                          [2, 4, 6, 8]
                          >>> c = (2*x for x in a)
                          >>> c
                          <generator object at 0x58760>
                          >>>

Generator Expressions
                  • General syntax
                           (expression for i in s if cond1
                                       for j in t if cond2
                                       ...
                                       if condfinal)



                  • What it means
                           for i in s:
                               if cond1:
                                   for j in t:
                                       if cond2:
                                           ...
                                           if condfinal: yield expression
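
To make the equivalence concrete, here is a sketch for Python 3 that writes one two-level generator expression and its expanded generator function side by side. The inputs s and t, the two conditions, and the name pairs_expanded are all made up for illustration.

```python
s = range(1, 5)
t = "ab"

# Generator expression form: two for clauses, each with its own condition
pairs = ((i, j) for i in s if i % 2 == 0
                for j in t if j != "b")

# The same logic written as nested loops in a generator function
def pairs_expanded(s, t):
    for i in s:
        if i % 2 == 0:
            for j in t:
                if j != "b":
                    yield (i, j)

from_expression = list(pairs)
from_function = list(pairs_expanded(s, t))
print(from_expression == from_function)
```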








                                A Note on Syntax
                   • The parens on a generator expression can be
                          dropped if used as a single function argument
                   • Example:
                           sum(x*x for x in s)




                               Generator expression
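
A short sketch of the rule, with a made-up list s. The parentheses can only be dropped when the generator expression is the sole argument; once a second argument is passed, the expression needs its own parentheses again.

```python
s = [1, 2, 3, 4]

# Sole argument: the call's parentheses are enough
total = sum(x*x for x in s)

# With a second argument (the start value), parenthesize the expression
total_plus_start = sum((x*x for x in s), 10)

print(total, total_plus_start)
```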




Interlude
                  • We now have two basic building blocks
                  • Generator functions:
                            def countdown(n):
                                while n > 0:
                                     yield n
                                     n -= 1

                  • Generator expressions
                            squares = (x*x for x in s)


                  • In both cases, we get an object that
                         generates values (which are typically
                         consumed in a for loop)




                                                Part 2
                                            Processing Data Files

                                  (Show me your Web Server Logs)




Programming Problem
                    Find out how many bytes of data were
                    transferred by summing up the last column
                    of data in this Apache web server log
            81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587
            81.107.39.38 - ... "GET /favicon.ico HTTP/1.1" 404 133
            81.107.39.38 - ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903
            81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238
            81.107.39.38 - ... "GET /ply/example.html HTTP/1.1" 200 2359
            66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447



              Oh yeah, and the log file might be huge (Gbytes)






                                            The Log File
                  • Each line of the log looks like this:
                        81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238


                  • The number of bytes is the last column
                           bytestr = line.rsplit(None,1)[1]


                  • It's either a number or a missing value (-)
                         81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -


                  • Converting the value
                           if bytestr != '-':
                              bytes = int(bytestr)
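
The two steps can be wrapped in a small helper and tried on the sample lines above (with the elided middle columns left as "..."). The function name bytes_sent and the choice to count a missing value as 0 are assumptions made for this sketch; the programs on the following slides skip such lines instead, which gives the same sum.

```python
def bytes_sent(line):
    """Return the byte count from the last column, or 0 if it is '-'."""
    bytestr = line.rsplit(None, 1)[1]   # Split once from the right on whitespace
    if bytestr != '-':
        return int(bytestr)
    return 0                            # Assumption: a missing value counts as 0

with_count = bytes_sent('81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238')
missing = bytes_sent('81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -')
print(with_count, missing)
```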



A Non-Generator Soln
                • Just do a simple for-loop
                       wwwlog = open("access-log")
                       total = 0
                       for line in wwwlog:
                           bytestr = line.rsplit(None,1)[1]
                           if bytestr != '-':
                               total += int(bytestr)

                       print "Total", total


                 • We read line-by-line and just update a sum
                 • However, that's so 90s...




                      A Generator Solution
                  • Let's use some generator expressions
                          wwwlog     = open("access-log")
                          bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
                          bytes      = (int(x) for x in bytecolumn if x != '-')

                          print "Total", sum(bytes)



                  • Whoa! That's different!
                     • Less code
                     • A completely different programming style
Generators as a Pipeline
                   • To understand the solution, think of it as a data
                          processing pipeline

  access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total




                   • Each step is defined by iteration/generation
                          wwwlog     = open("access-log")
                          bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
                          bytes      = (int(x) for x in bytecolumn if x != '-')

                          print "Total", sum(bytes)







                                   Being Declarative
            • At each step of the pipeline, we declare an
                   operation that will be applied to the entire
                   input stream
  access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total




              bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)




                            This operation gets applied to
                               every line of the log file

Being Declarative

                • Instead of focusing on the problem at a
                       line-by-line level, you just break it down
                       into big operations that operate on the
                       whole file
                • This is very much a "declarative" style
                • The key : Think big...





                           Iteration is the Glue
               • The glue that holds the pipeline together is the
                      iteration that occurs in each step
                       wwwlog     = open("access-log")

                       bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)

                       bytes      = (int(x) for x in bytecolumn if x != '-')

                       print "Total", sum(bytes)


               • The calculation is being driven by the last step
               • The sum() function is consuming values being
                      pushed through the pipeline (via .next() calls)
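
One way to watch the last step drive the pipeline is to count reads at the source. In this Python 3 sketch an in-memory list of made-up lines stands in for the log file, and the source() generator and trace list are illustrative additions. The variable is named bytes_ here only to avoid shadowing the Python 3 built-in bytes.

```python
trace = []                                   # Records each line pulled from the source

def source(lines):
    for line in lines:
        trace.append("read")
        yield line

lines = ['a ... 200 10', 'b ... 304 -', 'c ... 200 5']   # Made-up stand-ins for log lines

wwwlog     = source(lines)
bytecolumn = (line.rsplit(None, 1)[1] for line in wwwlog)
bytes_     = (int(x) for x in bytecolumn if x != '-')

reads_before_sum = len(trace)                # Building the pipeline reads nothing
total = sum(bytes_)                          # sum() pulls every item through all stages
reads_after_sum = len(trace)

print(reads_before_sum, total, reads_after_sum)
```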

Performance
                       • Surely, this generator approach has all
                              sorts of fancy-dancy magic that is slow.
                       • Let's check it out on a 1.3 Gbyte log file...
                  % ls -l big-access-log
                  -rw-r--r-- beazley 1303238000 Feb 29 08:06 big-access-log








                       Performance Contest
            wwwlog = open("big-access-log")
            total = 0
            for line in wwwlog:
                bytestr = line.rsplit(None,1)[1]
                if bytestr != '-':
                    total += int(bytestr)
            print "Total", total

            Time: 27.20


            wwwlog     = open("big-access-log")
            bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
            bytes      = (int(x) for x in bytecolumn if x != '-')

            print "Total", sum(bytes)

            Time: 25.96
Commentary
                  • Not only was it not slow, it was 5% faster
                  • And it was less code
                  • And it was relatively easy to read
                  • And frankly, I like it a whole lot better...
               "Back in the old days, we used AWK for this and
                 we liked it. Oh, yeah, and get off my lawn!"






                       Performance Contest
            wwwlog     = open("big-access-log")
            bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
            bytes      = (int(x) for x in bytecolumn if x != '-')

            print "Total", sum(bytes)

            Time: 25.96


             % awk '{ total += $NF } END { print total }' big-access-log

            Time: 37.33

            Note: extracting the last column may not be awk's strong point

Food for Thought

                    • At no point in our generator solution did
                           we ever create large temporary lists
                    • Thus, not only is that solution faster, it can
                           be applied to enormous data files
                    • It's competitive with traditional tools





                                            More Thoughts
                    • The generator solution was based on the
                           concept of pipelining data between
                           different components
                    • What if you had more advanced kinds of
                           components to work with?
                    • Perhaps you could perform different kinds
                           of processing by just plugging various
                           pipeline components together


This Sounds Familiar

                    • The Unix philosophy
                    • Have a collection of useful system utils
                    • Can hook these up to files or each other
                    • Perform complex tasks by piping data






                                               Part 3
                                       Fun with Files and Directories




Programming Problem
           You have hundreds of web server logs scattered
           across various directories. In addition, some of
           the logs are compressed. Modify the last program
           so that you can easily read all of these logs

                              foo/
                                  access-log-012007.gz
                                  access-log-022007.gz
                                  access-log-032007.gz
                                  ...
                                  access-log-012008
                              bar/
                                  access-log-092007.bz2
                                  ...
                                  access-log-022008





                                            os.walk()
              • A very useful function for searching the
                      file system
                       import os

                       for path, dirlist, filelist in os.walk(topdir):
                           # path     : Current directory
                           # dirlist : List of subdirectories
                           # filelist : List of files
                           ...



              • This utilizes generators to recursively walk
                      through the file system
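
To see what each step of the walk yields, the following Python 3 sketch builds a tiny throwaway directory tree with tempfile and records the three values for every directory visited. The directory and file names are made up, loosely following the layout in the problem statement.

```python
import os
import tempfile

top = tempfile.mkdtemp()                          # Throwaway directory for the demo
os.makedirs(os.path.join(top, "foo"))
for name in ("access-log-012008", "notes.txt"):
    open(os.path.join(top, "foo", name), "w").close()

walked = []
for path, dirlist, filelist in os.walk(top):
    # path     : Current directory (stored relative to top for readability)
    # dirlist  : List of subdirectories
    # filelist : List of files
    walked.append((os.path.relpath(path, top), sorted(dirlist), sorted(filelist)))

print(walked)
```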


find
                • Generate all filenames in a directory tree
                       that match a given filename pattern
                       import os
                       import fnmatch

                       def gen_find(filepat,top):
                           for path, dirlist, filelist in os.walk(top):
                               for name in fnmatch.filter(filelist,filepat):
                                   yield os.path.join(path,name)


                • Examples
                       pyfiles = gen_find("*.py","/")
                       logs    = gen_find("access-log*","/usr/www/")
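
Searching "/" or "/usr/www/" depends on the machine, so here is a self-contained Python 3 sketch that runs gen_find against a temporary directory instead. The file names are made up, and the results are sorted because os.walk() does not promise a particular order between sibling directories.

```python
import fnmatch
import os
import tempfile

def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

top = tempfile.mkdtemp()                          # Throwaway tree for the demo
for sub, name in [("foo", "access-log-012008"),
                  ("bar", "access-log-022008"),
                  ("bar", "README")]:
    os.makedirs(os.path.join(top, sub), exist_ok=True)
    open(os.path.join(top, sub, name), "w").close()

logs = sorted(os.path.basename(p) for p in gen_find("access-log*", top))
print(logs)
```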








                       Performance Contest
             pyfiles = gen_find("*.py","/")
             for name in pyfiles:
                 print name

             Wall Clock Time: 559s


             % find / -name '*.py'

             Wall Clock Time: 468s

          Performed on a 750GB file system
           containing about 140000 .py files

A File Opener
               • Open a sequence of filenames
                         import gzip, bz2
                         def gen_open(filenames):
                             for name in filenames:
                                 if name.endswith(".gz"):
                                       yield gzip.open(name)
                                 elif name.endswith(".bz2"):
                                       yield bz2.BZ2File(name)
                                 else:
                                       yield open(name)


               • This is interesting.... it takes a sequence of
                      filenames as input and yields a sequence of open
                      file objects





                                                 cat
                • Concatenate items from one or more
                       sources into a single sequence of items
                       def gen_cat(sources):
                           for s in sources:
                               for item in s:
                                   yield item


                • Example:
                       lognames = gen_find("access-log*", "/usr/www")
                       logfiles = gen_open(lognames)
                       loglines = gen_cat(logfiles)
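A small Python 3 sketch of what gen_cat does, using in-memory sources instead of files (the sample values are made up). As far as I can tell the standard library's itertools.chain.from_iterable gives the same result, so it is shown next to it for comparison.

```python
import itertools

def gen_cat(sources):
    """Yield every item of every source, one source after another."""
    for s in sources:
        for item in s:
            yield item

# Any mix of iterables works: a list, an iterator, a generator expression.
sources = [["a", "b"], iter(["c"]), (x for x in ["d", "e"])]
flat = list(gen_cat(sources))

# Standard-library equivalent, fed with fresh copies of the same data.
same = list(itertools.chain.from_iterable([["a", "b"], ["c"], ["d", "e"]]))

print(flat)
```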




grep
              • Generate a sequence of lines that contain
                      a given regular expression
                       import re

                       def gen_grep(pat, lines):
                           patc = re.compile(pat)
                           for line in lines:
                               if patc.search(line): yield line


              • Example:
                       lognames = gen_find("access-log*", "/usr/www")
                       logfiles = gen_open(lognames)
                       loglines = gen_cat(logfiles)
                       patlines = gen_grep(pat, loglines)







                                                       Example
              • Find out how many bytes transferred for a
                     specific pattern in a whole directory of logs
                      pat        = r"somepattern"
                      logdir     = "/some/dir/"

                      filenames  = gen_find("access-log*",logdir)
                      logfiles   = gen_open(filenames)
                      loglines   = gen_cat(logfiles)
                      patlines   = gen_grep(pat,loglines)
                      bytecolumn = (line.rsplit(None,1)[1] for line in patlines)
                      bytes      = (int(x) for x in bytecolumn if x != '-')

                      print "Total", sum(bytes)




Important Concept
                • Generators decouple iteration from the
                       code that uses the results of the iteration
                • In the last example, we're performing a
                       calculation on a sequence of lines
                • It doesn't matter where or how those
                       lines are generated
                • Thus, we can plug any number of
                       components together up front as long as
                       they eventually produce a line sequence
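To make the decoupling concrete, here is a Python 3 sketch that runs the byte-count calculation on two different line sources: a plain list and a file-like object. The helper name total_bytes and the sample log lines are assumptions made for this illustration.

```python
import io
import re

def gen_grep(pat, lines):
    """Yield only the lines that match the regular expression."""
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line):
            yield line

def total_bytes(pat, lines):
    """Sum the last column of matching lines, skipping '-' entries."""
    patlines = gen_grep(pat, lines)
    bytecolumn = (line.rsplit(None, 1)[1] for line in patlines)
    return sum(int(x) for x in bytecolumn if x != "-")

# Made-up sample data in the access-log layout used in the slides.
SAMPLE = (
    '1.2.3.4 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ply.html HTTP/1.1" 200 7587\n'
    '1.2.3.5 - - [24/Feb/2008:00:09:10 -0600] "GET /ply/ply.html HTTP/1.1" 304 -\n'
    '1.2.3.6 - - [24/Feb/2008:00:09:30 -0600] "GET /index.html HTTP/1.1" 200 100\n'
)

from_list = total_bytes(r"ply", SAMPLE.splitlines())   # lines from a list
from_file = total_bytes(r"ply", io.StringIO(SAMPLE))   # lines from a file object

print(from_list, from_file)
```

The calculation never needs to know which kind of source it was given; it only iterates.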





                                                  Part 4
                                            Parsing and Processing Data




Programming Problem
           Web server logs consist of different columns of
           data. Parse each line into a useful data structure
           that allows us to easily inspect the different fields.


      81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET ..." 200 7587

      host referrer user [datetime] "request" status bytes








                              Parsing with Regex
               • Let's route the lines through a regex parser
                        logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] ' \
                                  r'"(\S+) (\S+) (\S+)" (\S+) (\S+)'

                        logpat = re.compile(logpats)

                        groups = (logpat.match(line) for line in loglines)
                        tuples = (g.groups() for g in groups if g)


                 • This generates a sequence of tuples
                       ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
                       'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')
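A quick Python 3 check of the parser on in-memory lines. The first sample line is reconstructed from the tuple shown above, and the second is a deliberately malformed line added to show that non-matching input is dropped by the `if g` filter.

```python
import re

logpats = (r'(\S+) (\S+) (\S+) \[(.*?)\] '
           r'"(\S+) (\S+) (\S+)" (\S+) (\S+)')
logpat = re.compile(logpats)

loglines = [
    '71.201.176.194 - - [26/Feb/2008:10:30:08 -0600] '
    '"GET /ply/ply.html HTTP/1.1" 200 97238',
    'this line is not in the expected format',
]

groups = (logpat.match(line) for line in loglines)
tuples = [g.groups() for g in groups if g]   # a list here, only so it can be printed

print(tuples)
```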




Tuples to Dictionaries
               • Let's turn tuples into dictionaries
                    colnames                = ('host','referrer','user','datetime',
                                               'method','request','proto','status','bytes')

                    log                     = (dict(zip(colnames,t)) for t in tuples)


               • This generates a sequence of named fields
                          { 'status' :           '200',
                            'proto'   :          'HTTP/1.1',
                            'referrer':          '-',
                            'request' :          '/ply/ply.html',
                            'bytes'   :          '97238',
                            'datetime':          '24/Feb/2008:00:08:59 -0600',
                            'host'    :          '140.180.132.213',
                            'user'    :          '-',
                            'method' :           'GET'}





                                    Field Conversion
               • Map specific dictionary fields through a function
                        def field_map(dictseq,name,func):
                            for d in dictseq:
                                d[name] = func(d[name])
                                yield d



               • Example: Convert a few field values
                        log = field_map(log,"status", int)
                        log = field_map(log,"bytes",
                                        lambda s: int(s) if s !='-' else 0)
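A short Python 3 sketch of field_map on hand-made records (the two dictionaries are invented sample data). It also shows a detail that is easy to miss: each dictionary is converted in place rather than copied.

```python
def field_map(dictseq, name, func):
    """Replace d[name] with func(d[name]) in every dictionary passing through."""
    for d in dictseq:
        d[name] = func(d[name])
        yield d

records = [
    {"status": "200", "bytes": "97238"},
    {"status": "304", "bytes": "-"},
]

log = field_map(records, "status", int)
log = field_map(log, "bytes", lambda s: int(s) if s != "-" else 0)
converted = list(log)

print(converted)
print(converted[0] is records[0])   # same object: the original dict was modified
```

If the unconverted strings are still needed elsewhere, a variant that yields a copy of each dictionary would be one way around this.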




Field Conversion
            • Creates dictionaries of converted values
                    { 'status': 200,                       # Note conversion
                      'proto': 'HTTP/1.1',
                      'referrer': '-',
                      'request': '/ply/ply.html',
                      'datetime': '24/Feb/2008:00:08:59 -0600',
                      'bytes': 97238,                      # Note conversion
                      'host': '140.180.132.213',
                      'user': '-',
                      'method': 'GET'}


              • Again, this is just one big processing pipeline





                                   The Code So Far
        lognames = gen_find("access-log*","www")
        logfiles = gen_open(lognames)
        loglines = gen_cat(logfiles)
        groups   = (logpat.match(line) for line in loglines)
        tuples   = (g.groups() for g in groups if g)

        colnames = ('host','referrer','user','datetime','method',
                    'request','proto','status','bytes')

        log      = (dict(zip(colnames,t)) for t in tuples)
        log      = field_map(log,"bytes",
                             lambda s: int(s) if s != '-' else 0)
        log      = field_map(log,"status",int)




Packaging
            • To make it more sane, you may want to package
                   parts of the code into functions
                      def lines_from_dir(filepat, dirname):
                          names   = gen_find(filepat,dirname)
                          files   = gen_open(names)
                          lines   = gen_cat(files)
                          return lines


            • This is a general purpose function that reads all
                   lines from a series of files in a directory







                                               Packaging
         • Parse an Apache log
            def apache_log(lines):
                groups   = (logpat.match(line) for line in lines)
                tuples   = (g.groups() for g in groups if g)

                colnames = ('host','referrer','user','datetime','method',
                            'request','proto','status','bytes')

                log      = (dict(zip(colnames,t)) for t in tuples)
                log      = field_map(log,"bytes",
                                     lambda s: int(s) if s != '-' else 0)
                log      = field_map(log,"status",int)

                return log




Example Use
               • It's easy
                        lines = lines_from_dir("access-log*","www")
                        log   = apache_log(lines)

                        for r in log:
                            print r



               • Different components have been subdivided
                       according to the data that they process






                               A Query Language
            • Now that we have our log, let's do some queries
            • Find the set of all documents that 404
                     stat404 = set(r['request'] for r in log
                                         if r['status'] == 404)


            • Print all requests that transfer over a megabyte
                     large = (r for r in log
                                if r['bytes'] > 1000000)

                     for r in large:
                         print r['request'], r['bytes']



A Query Language
            • Find the largest data transfer
                    print "%d %s" % max((r['bytes'],r['request'])
                                         for r in log)


            • Collect all unique host IP addresses
                    hosts = set(r['host'] for r in log)


            • Find the number of downloads of a file
                     sum(1 for r in log
                              if r['request'] == '/ply/ply-2.3.tar.gz')
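One practical point when trying these queries one after another: `log` is a generator, so it can be iterated only once, and a second query on the same object sees nothing. A minimal Python 3 sketch, assuming a hypothetical make_log() helper that rebuilds the pipeline and a few invented records:

```python
# Invented records standing in for parsed log entries.
records = [
    {"host": "1.2.3.4", "request": "/ply/ply-2.3.tar.gz", "status": 200, "bytes": 70000},
    {"host": "1.2.3.5", "request": "/missing.html", "status": 404, "bytes": 0},
    {"host": "1.2.3.4", "request": "/ply/ply-2.3.tar.gz", "status": 200, "bytes": 70000},
]

def make_log():
    """Return a fresh generator over the records (stand-in for apache_log)."""
    return (r for r in records)

log = make_log()
hosts = set(r["host"] for r in log)
leftover = list(log)          # empty: the first query used the generator up

# Each further query gets its own fresh pipeline.
downloads = sum(1 for r in make_log() if r["request"] == "/ply/ply-2.3.tar.gz")
largest = max((r["bytes"], r["request"]) for r in make_log())

print(hosts, leftover, downloads, largest)
```

Rebuilding the pipeline re-reads the input. Collecting the records into a list once is the other obvious option when they fit in memory.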








                               A Query Language
            • Find out who has been hitting robots.txt
                     addrs = set(r['host'] for r in log
                                   if 'robots.txt' in r['request'])

                     import socket
                     for addr in addrs:
                         try:
                              print socket.gethostbyaddr(addr)[0]
                         except socket.herror:
                              print addr




Performance Study
            • Sadly, the last example doesn't run so fast on a
                   huge input file (53 minutes on the 1.3GB log)
            • But, the beauty of generators is that you can plug
                   filters in at almost any stage
                    lines = lines_from_dir("big-access-log",".")
                    lines = (line for line in lines if 'robots.txt' in line)
                    log   = apache_log(lines)
                    addrs = set(r['host'] for r in log)
                    ...


            • That version takes 93 seconds
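The speedup plausibly comes from how few lines reach the regex parser once the cheap substring test runs first. The Python 3 sketch below only counts parser inputs on synthetic data; it does not reproduce the timings above. The counted() helper and the generated log lines are assumptions for the illustration.

```python
import re

logpat = re.compile(r'(\S+) (\S+) (\S+) \[(.*?)\] '
                    r'"(\S+) (\S+) (\S+)" (\S+) (\S+)')

def counted(lines, counter):
    """Pass lines through unchanged, counting them in counter[0]."""
    for line in lines:
        counter[0] += 1
        yield line

# 99 ordinary requests plus one robots.txt request (synthetic data).
lines = ['10.0.0.%d - - [24/Feb/2008:00:08:59 -0600] "GET /page%d.html HTTP/1.1" 200 10'
         % (i, i) for i in range(1, 100)]
lines.append('10.0.0.200 - - [24/Feb/2008:00:09:00 -0600] '
             '"GET /robots.txt HTTP/1.1" 200 10')

# Filter late: every line goes through the regex, then the field is tested.
parsed_all = [0]
matches = (logpat.match(line) for line in counted(lines, parsed_all))
slow = set(m.group(1) for m in matches if m and 'robots.txt' in m.group(6))

# Filter early: only lines containing the text reach the regex.
parsed = [0]
prefiltered = (line for line in lines if 'robots.txt' in line)
matches = (logpat.match(line) for line in counted(prefiltered, parsed))
addrs = set(m.group(1) for m in matches if m)

print(parsed_all[0], parsed[0], slow == addrs)
```

The early test looks at the whole line, so it could let through a line that mentions robots.txt in some other field. Repeating the field test after parsing would tighten it.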




                                        Some Thoughts
           • I like the idea of using generator expressions as a
                   pipeline query language
           • You can write simple filters, extract data, etc.
           • If you pass dictionaries/objects through the
                   pipeline, it becomes quite powerful
           • Feels similar to writing SQL queries

Part 5
                                            Processing Infinite Data








                                              Question
                 • Have you ever used 'tail -f' in Unix?
                          % tail -f logfile
                          ...
                          ... lines of output ...
                          ...


                 • This prints the lines written to the end of a file
                 • The "standard" way to watch a log file
                 • I used this all of the time when working on
                        scientific simulations ten years ago...


Infinite Sequences

                 • Tailing a log file results in an "infinite" stream
                 • It constantly watches the file and yields lines as
                        soon as new data is written
                 • But you don't know how much data will actually
                        be written (in advance)
                 • And log files can often be enormous





                                            Tailing a File
               • A Python version of 'tail -f'
                         import time
                         def follow(thefile):
                             thefile.seek(0,2)      # Go to the end of the file
                             while True:
                                  line = thefile.readline()
                                  if not line:
                                      time.sleep(0.1)    # Sleep briefly
                                      continue
                                  yield line


               • Idea : Seek to the end of the file and repeatedly
                       try to read new lines. If new data is written to
                       the file, we'll pick it up.
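One way to try follow() without a live web server is to let a background thread append to a scratch file. This is a Python 3 sketch under assumptions: the writer helper, the half-second delay, and the file contents are invented, and itertools.islice is used only so the demonstration stops after two lines.

```python
import itertools
import os
import tempfile
import threading
import time

def follow(thefile):
    thefile.seek(0, 2)               # go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)          # nothing new yet, sleep briefly
            continue
        yield line

def writer(path, lines):
    """Hypothetical helper: append lines after a short delay."""
    time.sleep(0.5)                  # give the reader time to reach the end
    with open(path, "a") as f:
        for line in lines:
            f.write(line)
            f.flush()

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "access-log")
    with open(path, "w") as f:
        f.write("old line\n")        # already in the file, so never reported

    t = threading.Thread(target=writer, args=(path, ["new 1\n", "new 2\n"]))
    with open(path) as logfile:
        t.start()
        first_two = list(itertools.islice(follow(logfile), 2))
    t.join()

print(first_two)
```

Note that the seek to the end happens on the first request for a line, not when follow() is called, because a generator body does not run until it is iterated.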
Example
               • Using our follow function
                           logfile  = open("access-log")
                           loglines = follow(logfile)

                           for line in loglines:
                               print line,



               • This produces the same output as 'tail -f'






                                            Example
               • Turn the real-time log file into records
                        logfile  = open("access-log")
                        loglines = follow(logfile)
                        log      = apache_log(loglines)



               • Print out all 404 requests as they happen
                        r404 = (r for r in log if r['status'] == 404)
                        for r in r404:
                            print r['host'],r['datetime'],r['request']




Commentary
              • We just plugged this new input scheme onto
                     the front of our processing pipeline
              • Everything else still works, with one caveat:
                     functions that consume an entire iterable won't
                     terminate (min, max, sum, set, etc.)
              • Nevertheless, we can easily write processing
                     steps that operate on an infinite data stream
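When a total or a sample is still wanted from an endless stream, one option is to cut off a bounded slice first. A minimal Python 3 sketch using itertools.islice; the ticks() generator is a made-up stand-in for an endless source such as follow().

```python
import itertools

def ticks():
    """Stand-in for an endless source: every third record is a 404."""
    n = 0
    while True:
        yield {"status": 404 if n % 3 == 0 else 200, "bytes": n}
        n += 1

r404 = (r for r in ticks() if r["status"] == 404)   # still infinite

# sum(r404) would never return; islice makes the input finite first.
first_three = list(itertools.islice(r404, 3))
total = sum(r["bytes"] for r in first_three)

print(first_three, total)
```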






                                             Thoughts


              • This data pipeline idea is really quite powerful
              • Captures a lot of common systems problems
              • Especially consumer-producer problems


Part 6
                                            Feeding the Pipeline








                            Feeding Generators
                 • In order to feed a generator processing
                        pipeline, you need to have an input source
                 • So far, we have looked at two file-based inputs
                 • Reading a file
                            lines = open(filename)


                 • Tailing a file
                            lines = follow(open(filename))




Generating Connections
                 • Generate a sequence of TCP connections
                           import socket
                           def receive_connections(addr):
                               s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
                               s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
                               s.bind(addr)
                               s.listen(5)
                               while True:
                                    client = s.accept()
                                    yield client


                 • Example:
                          for c,a in receive_connections(("",9000)):
                              c.send("Hello World\n")
                              c.close()






                         Generating Messages
             • Receive a sequence of UDP messages
                      import socket
                      def receive_messages(addr,maxsize):
                          s = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
                          s.bind(addr)
                          while True:
                               msg = s.recvfrom(maxsize)
                               yield msg

             • Example:
                     for msg, addr in receive_messages(("",10000),1024):
                         print msg, "from", addr




I/O Multiplexing
             • Generating I/O events on a set of sockets
                     import select
                     def gen_events(socks):
                         while True:
                             rdr,wrt,err = select.select(socks,socks,socks,0.1)
                             for r in rdr:
                                 yield "read",r
                             for w in wrt:
                                 yield "write",w
                             for e in err:
                                 yield "error",e


             • Note: Using this one is a little tricky
             • Example : Reading from multiple client sockets




                                      I/O Multiplexing
                      import threading

                      clientset = []

                      def acceptor(sockset,addr):
                          for c,a in receive_connections(addr):
                              sockset.append(c)

                      acc_thr = threading.Thread(target=acceptor,
                                                 args=(clientset,("",12000)))
                      acc_thr.setDaemon(True)
                      acc_thr.start()

                      for evt,s in gen_events(clientset):
                          if evt == 'read':
                              data = s.recv(1024)
                              if not data:
                                  print "Closing", s
                                  s.close()
                                  clientset.remove(s)
                              else:
                                  print s,data
Consuming a Queue
             • Generate a sequence of items from a queue
                      def consume_queue(thequeue):
                          while True:
                               item = thequeue.get()
                               if item is StopIteration: break
                               yield item


             • Note: Using StopIteration as a sentinel
             • Might be used to feed a generator pipeline as a
                    consumer thread






                          Consuming a Queue
             • Example:
                    import Queue, threading

                    def consumer(q):
                        for item in consume_queue(q):
                            print "Consumed", item
                        print "Done"

                    in_q = Queue.Queue()
                    con_thr = threading.Thread(target=consumer,args=(in_q,))
                    con_thr.start()

                    for i in xrange(100):
                        in_q.put(i)
                    in_q.put(StopIteration)
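For readers on Python 3, here is a sketch of the same example with the renamed `queue` module. It collects the items in a list instead of printing them and joins the thread at the end; both are changes made for this illustration so the result can be inspected.

```python
import queue
import threading

def consume_queue(thequeue):
    """Yield items from the queue until the StopIteration sentinel arrives."""
    while True:
        item = thequeue.get()
        if item is StopIteration:
            break
        yield item

consumed = []

def consumer(q):
    for item in consume_queue(q):
        consumed.append(item)

in_q = queue.Queue()
con_thr = threading.Thread(target=consumer, args=(in_q,))
con_thr.start()

for i in range(5):
    in_q.put(i)
in_q.put(StopIteration)     # sentinel: tells the consumer to finish
con_thr.join()

print(consumed)
```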



Part 7
                                            Extending the Pipeline








                               Multiple Processes
               • Can you extend a processing pipeline across
                      processes and machines?



               [Diagram: process 1 and process 2 connected by a socket or pipe]


Pickler/Unpickler
               • Turn a generated sequence into pickled objects
                        import pickle

                        def gen_pickle(source):
                            for item in source:
                                yield pickle.dumps(item)

                        def gen_unpickle(infile):
                            while True:
                                try:
                                    item = pickle.load(infile)
                                    yield item
                                except EOFError:
                                    return


               • Now, attach these to a pipe or socket
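Before involving a network, the pair can be tried against an in-memory buffer. This Python 3 sketch uses io.BytesIO as a stand-in for the pipe or socket; the two sample records are made up.

```python
import io
import pickle

def gen_pickle(source):
    """Yield each item of the source as a pickled byte string."""
    for item in source:
        yield pickle.dumps(item)

def gen_unpickle(infile):
    """Yield objects read from a binary file until it runs out of data."""
    while True:
        try:
            yield pickle.load(infile)
        except EOFError:
            return

records = [{"status": 404, "request": "/missing.html"},
           {"status": 200, "request": "/"}]

buf = io.BytesIO()                 # stand-in for a pipe or socket
for chunk in gen_pickle(records):
    buf.write(chunk)
buf.seek(0)

restored = list(gen_unpickle(buf))
print(restored)
```

Unpickling can execute arbitrary code, so this is only appropriate when the sending side is trusted.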




                                     Sender/Receiver
               • Example: Sender
                        def sendto(source,addr):
                            s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
                            s.connect(addr)
                            for pitem in gen_pickle(source):
                                s.sendall(pitem)
                            s.close()

               • Example: Receiver
                        def receivefrom(addr):
                            s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
                            s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
                            s.bind(addr)
                            s.listen(5)
                            c,a = s.accept()
                            for item in gen_unpickle(c.makefile()):
                                yield item
                            c.close()
Example Use
               • Example: Read log lines and parse into records
                        # netprod.py

                        lines = follow(open("access-log"))
                        log   = apache_log(lines)
                        sendto(log,("",15000))


               • Example: Pick up the log on another machine
                        # netcons.py
                        for r in receivefrom(("",15000)):
                            print r








                                            Fanning Out
               • In all of our examples, the processing pipeline is
                       driven by a single consumer
                         for item in gen:
                             # Consume item


               • Can you expand the pipeline to multiple
                       consumers?
               [Diagram: one generator fanning out to consumer1, consumer2, consumer3]

Broadcasting
               • Consume a generator and send items to a set
                       of consumers
                           def broadcast(source, consumers):
                               for item in source:
                                   for c in consumers:
                                       c.send(item)


               • This changes the control-flow
               • The broadcaster is what consumes items
               • Those items have to be sent to consumers for
                       processing





                                            Consumers
               • To create a consumer, define an object with a
                       send method on it
                         class Consumer(object):
                             def send(self,item):
                                 print self, "got", item



               • Example:
                         c1 = Consumer()
                         c2 = Consumer()
                         c3 = Consumer()

                         lines = follow(open("access-log"))
                         broadcast(lines,[c1,c2,c3])
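The example above never returns because follow() is endless. To see the fan-out on a finite source, here is a Python 3 sketch with a hypothetical Collector consumer that simply remembers what it was sent.

```python
def broadcast(source, consumers):
    """Send every item of the source to every consumer."""
    for item in source:
        for c in consumers:
            c.send(item)

class Collector(object):
    """Hypothetical consumer that stores each item it receives."""
    def __init__(self):
        self.items = []
    def send(self, item):
        self.items.append(item)

c1, c2 = Collector(), Collector()
broadcast(iter(["line 1\n", "line 2\n"]), [c1, c2])

print(c1.items, c2.items)
```

Every consumer sees every item, in order, and the source is consumed exactly once by broadcast itself.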


Consumers
               • Sadly, inside consumers, it is not possible to
                       continue the same processing pipeline idea
               • In order for it to work, there has to be a single
                       iteration that is driving the pipeline
               • With multiple consumers, you would have to be
                       iterating in more than one location at once
               • However, you can do this with threads or
                       distributed processes





                          Network Consumer
                 • Example:
                           import socket,pickle
                           class NetConsumer(object):
                               def __init__(self,addr):
                                    self.s = socket.socket(socket.AF_INET,
                                                           socket.SOCK_STREAM)
                                    self.s.connect(addr)
                               def send(self,item):
                                   pitem = pickle.dumps(item)
                                   self.s.sendall(pitem)
                               def close(self):
                                   self.s.close()


                    • This will route items to a network receiver
Network Consumer
                 • Example Usage:
                          class Stat404(NetConsumer):
                              def send(self,item):
                                  if item['status'] == 404:
                                      NetConsumer.send(self,item)

                          lines = follow(open("access-log"))
                          log   = apache_log(lines)

                          stat404 = Stat404(("somehost",15000))

                          broadcast(log, [stat404])


                    • The 404 entries will go elsewhere...




                                 Consumer Thread
                 • Example:
                             import Queue, threading

                             class ConsumerThread(threading.Thread):
                                 def __init__(self,target):
                                     threading.Thread.__init__(self)
                                     self.setDaemon(True)
                                     self.in_queue = Queue.Queue()
                                     self.target = target
                                 def send(self,item):
                                     self.in_queue.put(item)
                                 def generate(self):
                                     while True:
                                         item = self.in_queue.get()
                                         yield item
                                 def run(self):
                                     self.target(self.generate())


Consumer Thread
                 • Sample usage (building on earlier code)
                        def find_404(log):
                            for r in (r for r in log if r['status'] == 404):
                                print r['status'],r['datetime'],r['request']

                        def bytes_transferred(log):
                            total = 0
                            for r in log:
                                total += r['bytes']
                                print "Total bytes", total

                        c1 = ConsumerThread(find_404)
                        c1.start()
                        c2 = ConsumerThread(bytes_transferred)
                        c2.start()

                        lines = follow(open("access-log"))  # Follow a log
                        log   = apache_log(lines)           # Turn into records
                        broadcast(log,[c1,c2])              # Broadcast to consumers
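A Python 3 variation that can run to completion on a finite input. The close() method and its sentinel are not on the slides; they are added here, as an assumption, so that each thread can finish and be joined. The three records are invented sample data.

```python
import queue
import threading

class ConsumerThread(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        self.daemon = True
        self.in_queue = queue.Queue()
        self.target = target
    def send(self, item):
        self.in_queue.put(item)
    def close(self):
        # Addition for this sketch: a sentinel that ends generate().
        self.in_queue.put(StopIteration)
    def generate(self):
        while True:
            item = self.in_queue.get()
            if item is StopIteration:
                return
            yield item
    def run(self):
        self.target(self.generate())

def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)

found, totals = [], []

def find_404(log):
    for r in log:
        if r["status"] == 404:
            found.append(r["request"])

def bytes_transferred(log):
    totals.append(sum(r["bytes"] for r in log))

log = [{"status": 200, "request": "/", "bytes": 100},
       {"status": 404, "request": "/missing.html", "bytes": 0},
       {"status": 200, "request": "/ply/ply.html", "bytes": 50}]

c1 = ConsumerThread(find_404)
c2 = ConsumerThread(bytes_transferred)
for c in (c1, c2):
    c.start()
broadcast(log, [c1, c2])
for c in (c1, c2):
    c.close()
    c.join()

print(found, totals)
```

Inside each thread the target is an ordinary generator pipeline again, which is the point of the design: the queue turns pushed items back into something that can be iterated.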




                                     Multiple Sources
               • In all of our examples, the processing pipeline is
                      being fed by a single source
               • But, what if you had multiple sources?
               [Diagram: three inputs labeled source1, source2, source3]




Concatenation
               • Concatenate one source after another
                          def concatenate(sources):
                              for s in sources:
                                  for item in s:
                                      yield item



               • This generates one big sequence
               • Consumes each generator one at a time
               • Only works with generators that terminate





                                     Parallel Iteration
               • Zipping multiple generators together
                       import itertools

                       z = itertools.izip(s1,s2,s3)


               • This one is only marginally useful
               • Requires generators to go lock-step
               • Terminates when the first exits
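A small Python 3 sketch of the lock-step behavior. In Python 3 the built-in `zip()` is lazy and takes the place of `itertools.izip`; the `numbers` helper is a hypothetical finite source.

```python
def numbers(n):
    # Hypothetical finite source: yields 0, 1, ..., n-1
    for i in range(n):
        yield i

s1 = numbers(5)
s2 = numbers(3)   # the shortest source decides when iteration stops
s3 = numbers(4)

# Each tuple takes exactly one item from every source
pairs = list(zip(s1, s2, s3))
print(pairs)      # [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
```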

Multiplexing
               • Consume from multiple generators in real time,
                       producing values as they are generated
               • Example use
                       log1 = follow(open("foo/access-log"))
                       log2 = follow(open("bar/access-log"))

                       lines = gen_multiplex([log1,log2])


               • There is no way to poll a generator. So, how do
                       you do this?





               Multiplexing Generators
             import Queue, threading

             def gen_multiplex(genlist):
                 item_q = Queue.Queue()
                 def run_one(source):
                     for item in source: item_q.put(item)

                 def run_all():
                     thrlist = []
                     for source in genlist:
                         t = threading.Thread(target=run_one,args=(source,))
                         t.start()
                         thrlist.append(t)
                     for t in thrlist: t.join()
                     item_q.put(StopIteration)

                 threading.Thread(target=run_all).start()
                 while True:
                     item = item_q.get()
                     if item is StopIteration: return
                     yield item

Multiplexing Generators
             def gen_multiplex(genlist):
                 item_q = Queue.Queue()
                 # Each generator runs in a thread and drops items
                 # onto a queue
                 def run_one(source):
                     for item in source: item_q.put(item)

                 def run_all():
                     thrlist = []
                     for source in genlist:
                         t = threading.Thread(target=run_one,args=(source,))
                         t.start()
                         thrlist.append(t)
                     for t in thrlist: t.join()
                     item_q.put(StopIteration)

                 threading.Thread(target=run_all).start()
                 while True:
                     item = item_q.get()
                     if item is StopIteration: return
                     yield item





               Multiplexing Generators
             def gen_multiplex(genlist):
                 item_q = Queue.Queue()
                 def run_one(source):
                     for item in source: item_q.put(item)

                 def run_all():
                     thrlist = []
                     for source in genlist:
                         t = threading.Thread(target=run_one,args=(source,))
                         t.start()
                         thrlist.append(t)
                     for t in thrlist: t.join()
                     item_q.put(StopIteration)

                 threading.Thread(target=run_all).start()
                 # Pull items off the queue and yield them
                 while True:
                     item = item_q.get()
                     if item is StopIteration: return
                     yield item

Multiplexing Generators
             def gen_multiplex(genlist):
                 item_q = Queue.Queue()
                 def run_one(source):
                     for item in source: item_q.put(item)

                 # Run all of the generators, wait for them to terminate,
                 # then put a sentinel on the queue (StopIteration)
                 def run_all():
                     thrlist = []
                     for source in genlist:
                         t = threading.Thread(target=run_one,args=(source,))
                         t.start()
                         thrlist.append(t)
                     for t in thrlist: t.join()
                     item_q.put(StopIteration)

                 threading.Thread(target=run_all).start()
                 while True:
                     item = item_q.get()
                     if item is StopIteration: return
                     yield item
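A Python 3 sketch of the same multiplexer, assuming only that `Queue` is now the `queue` module. The `tagged` sources are hypothetical finite generators, so the sentinel is eventually delivered and the loop ends; the arrival order depends on thread scheduling, so the output is sorted before printing.

```python
import queue
import threading

def gen_multiplex(genlist):
    item_q = queue.Queue()

    def run_one(source):
        # One thread per source: push every item onto the shared queue
        for item in source:
            item_q.put(item)

    def run_all():
        # Start all source threads, wait for them, then signal completion
        thrlist = []
        for source in genlist:
            t = threading.Thread(target=run_one, args=(source,))
            t.start()
            thrlist.append(t)
        for t in thrlist:
            t.join()
        item_q.put(StopIteration)   # sentinel, as in the slides

    threading.Thread(target=run_all).start()
    while True:
        item = item_q.get()
        if item is StopIteration:
            return
        yield item

def tagged(tag, n):
    # Hypothetical finite source used only for this demonstration
    for i in range(n):
        yield (tag, i)

items = sorted(gen_multiplex([tagged("a", 3), tagged("b", 2)]))
print(items)   # [('a', 0), ('a', 1), ('a', 2), ('b', 0), ('b', 1)]
```

With sources that never terminate, such as `follow()`, the sentinel is never queued and the multiplexed generator runs forever.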





                                            Part 8
                    Various Programming Tricks (And Debugging)




Putting it all Together

               • This data processing pipeline idea is powerful
               • But, it's also potentially mind-boggling
               • Especially when you have dozens of pipeline
                       stages, broadcasting, multiplexing, etc.
               • Let's look at a few useful tricks





                         Creating Generators
               • Any single-argument function is easy to turn
                       into a generator function
                          def generate(func):
                              def gen_func(s):
                                  for item in s:
                                      yield func(item)
                              return gen_func



               • Example:
                        import math
                        gen_sqrt = generate(math.sqrt)
                        for x in gen_sqrt(xrange(100)):
                            print x
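A Python 3 sketch of the same wrapper, with a short hypothetical input list in place of `xrange(100)`. In Python 3 the built-in `map()` is also lazy and gives the same values.

```python
import math

def generate(func):
    # Turn a one-argument function into a generator function
    def gen_func(s):
        for item in s:
            yield func(item)
    return gen_func

gen_sqrt = generate(math.sqrt)
roots = list(gen_sqrt([0, 1, 4, 9]))
mapped = list(map(math.sqrt, [0, 1, 4, 9]))
print(roots)    # [0.0, 1.0, 2.0, 3.0]
```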




Debug Tracing
               • A debugging function that will print items going
                       through a generator
                          def trace(source):
                              for item in source:
                                  print item
                                  yield item


               • This can easily be placed around any generator
                          lines = follow(open("access-log"))
                          log   = trace(apache_log(lines))

                          r404  = trace(r for r in log if r['status'] == 404)


               • Note: Might consider logging module for this
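One possible shape for a logging-based tracer, written for Python 3. The logger name, the `label` parameter, and the in-memory `ListHandler` are all assumptions made for this sketch; a real pipeline would attach whatever handler it already uses.

```python
import logging

def trace(source, logger, label):
    # Log each item at DEBUG level, then pass it through unchanged
    for item in source:
        logger.debug("%s: %r", label, item)
        yield item

class ListHandler(logging.Handler):
    # Hypothetical handler that keeps messages in memory for inspection
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

logger = logging.getLogger("pipeline.trace")
logger.setLevel(logging.DEBUG)
handler = ListHandler()
logger.addHandler(handler)

evens = trace((n for n in range(6) if n % 2 == 0), logger, "even")
result = list(evens)
print(result)             # [0, 2, 4]
print(handler.messages)   # ['even: 0', 'even: 2', 'even: 4']
```

Raising the logger's level above DEBUG silences the tracing without touching the pipeline.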




                Recording the Last Item
               • Store the last item generated in the generator
                          class storelast(object):
                              def __init__(self,source):
                                  self.source = source
                              def next(self):
                                  item = self.source.next()
                                  self.last = item
                                  return item
                              def __iter__(self):
                                  return self

               • This can be easily wrapped around a generator
                          lines = storelast(follow(open("access-log")))
                          log   = apache_log(lines)

                          for r in log:
                              print r
                              print lines.last
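The class above uses the Python 2 iterator protocol. A Python 3 sketch needs `__next__` instead of `next`; the two sample lines and the status-parsing expression are hypothetical stand-ins for `follow()` and `apache_log()`.

```python
class storelast(object):
    def __init__(self, source):
        self.source = iter(source)
        self.last = None          # nothing consumed yet

    def __next__(self):
        item = next(self.source)
        self.last = item          # remember the most recent raw item
        return item

    def __iter__(self):
        return self

lines = storelast(["GET /a 200", "GET /b 404"])
statuses = (int(line.split()[-1]) for line in lines)

seen = []
for status in statuses:
    # lines.last is the raw line that produced the current status
    seen.append((status, lines.last))
print(seen)   # [(200, 'GET /a 200'), (404, 'GET /b 404')]
```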
Shutting Down
               • Generators can be shut down using .close()
                         import time
                         def follow(thefile):
                             thefile.seek(0,2)      # Go to the end of the file
                             while True:
                                  line = thefile.readline()
                                  if not line:
                                      time.sleep(0.1)    # Sleep briefly
                                      continue
                                  yield line


               • Example:
                           lines = follow(open("access-log"))
                           for i,line in enumerate(lines):
                               print line,
                               if i == 10: lines.close()






                                            Shutting Down
                • In the generator, GeneratorExit is raised
                          import time
                          def follow(thefile):
                              thefile.seek(0,2)      # Go to the end of the file
                              try:
                                   while True:
                                       line = thefile.readline()
                                       if not line:
                                           time.sleep(0.1)    # Sleep briefly
                                           continue
                                       yield line
                              except GeneratorExit:
                                   print "Follow: Shutting down"


                • This allows for resource cleanup (if needed)
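A self-contained Python 3 sketch of the same shutdown sequence. The `countup` generator and the `events` list are hypothetical; they replace `follow()` and the print statement so the effect of `.close()` can be inspected.

```python
def countup(events):
    n = 0
    try:
        while True:
            yield n
            n += 1
    except GeneratorExit:
        # Raised at the paused yield when .close() is called
        events.append("shutting down")

events = []
g = countup(events)
seen = []
for x in g:
    seen.append(x)
    if x == 3:
        g.close()   # generator is paused at yield here, so this is allowed

print(seen)     # [0, 1, 2, 3]
print(events)   # ['shutting down']
```

After `.close()` the generator is finished, so the `for` loop ends on its next iteration.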
Ignoring Shutdown
                • Question: Can you ignore GeneratorExit?
                          import time
                          def follow(thefile):
                              thefile.seek(0,2)      # Go to the end of the file
                              while True:
                                  try:
                                       line = thefile.readline()
                                       if not line:
                                           time.sleep(0.1)    # Sleep briefly
                                           continue
                                       yield line
                                  except GeneratorExit:
                                       print "Forget about it"


                • Answer: No. You'll get a RuntimeError
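A Python 3 sketch that triggers the error. The `state` dictionary is an assumption added so the generator can be told to cooperate afterwards and shut down cleanly.

```python
def stubborn(state):
    while True:
        try:
            yield
        except GeneratorExit:
            if state["give_up"]:
                return   # cooperate and let .close() finish
            # Otherwise swallow the request and loop back to yield again

state = {"give_up": False}
g = stubborn(state)
next(g)                  # advance to the first yield

try:
    g.close()
    outcome = "closed"
except RuntimeError as e:
    outcome = "RuntimeError: %s" % e
print(outcome)

state["give_up"] = True
g.close()                # succeeds now that GeneratorExit is honored
```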




                    Shutdown and Threads
                • Question: Can a thread shut down a generator
                       running in a different thread?
                  lines = follow(open("foo/test.log"))

                  def sleep_and_close(s):
                      time.sleep(s)
                      lines.close()
                  threading.Thread(target=sleep_and_close,args=(30,)).start()

                  for line in lines:
                      print line,




Shutdown and Threads
                • A separate thread cannot call .close() while the
                       generator is executing
                • Output:
                      Exception in thread Thread-1:
                      Traceback (most recent call last):
                        File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 460, in __bootstrap
                          self.run()
                        File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 440, in run
                          self.__target(*self.__args, **self.__kwargs)
                        File "genfollow.py", line 31, in sleep_and_close
                          lines.close()
                      ValueError: generator already executing








                       Shutdown and Signals
                • Can you shut down a generator with a signal?
                         import signal
                         def sigusr1(signo,frame):
                             print "Closing it down"
                             lines.close()

                         signal.signal(signal.SIGUSR1,sigusr1)

                         lines = follow(open("access-log"))
                         for line in lines:
                             print line,

                • From the command line
                         % kill -USR1 pid




Shutdown and Signals
                • This also fails:
                         Traceback (most recent call last):
                           File "genfollow.py", line 35, in <module>
                             for line in lines:
                           File "genfollow.py", line 8, in follow
                             time.sleep(0.1)
                           File "genfollow.py", line 30, in sigusr1
                             lines.close()
                         ValueError: generator already executing


                • Sigh.
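The failure can be reproduced without signals or threads. In this Python 3 sketch the hypothetical `on_idle` callback stands in for a handler that fires while the generator body is running (as it does during `time.sleep()` in `follow`).

```python
def follow_like(on_idle):
    while True:
        on_idle()        # stand-in for a handler interrupting the generator
        yield "line"

def handler():
    # Runs while the generator frame is still executing
    lines.close()

lines = follow_like(handler)
try:
    next(lines)
    outcome = "closed"
except ValueError as e:
    outcome = str(e)
print(outcome)   # generator already executing
```

`.close()` only works when the generator is paused at a `yield`, which is why both the thread and the signal attempts fail.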





                                            Shutdown
                • The only way to shut down a generator externally
                       is to instrument it with a flag or some kind
                       of check
                         def follow(thefile,shutdown=None):
                             thefile.seek(0,2)
                             while True:
                                 if shutdown and shutdown.isSet(): break
                                 line = thefile.readline()
                                 if not line:
                                     time.sleep(0.1)
                                     continue
                                 yield line




Shutdown
                • Example:
                         import threading,signal

                         shutdown = threading.Event()
                         def sigusr1(signo,frame):
                             print "Closing it down"
                             shutdown.set()
                         signal.signal(signal.SIGUSR1,sigusr1)

                         lines = follow(open("access-log"),shutdown)
                         for line in lines:
                             print line,
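A Python 3 sketch of the flag-based shutdown that runs without sending a signal. Two things here are assumptions: a `threading.Timer` sets the event in place of the SIGUSR1 handler, and an empty temporary file stands in for `access-log`. In Python 3 the method is spelled `is_set()`.

```python
import tempfile
import threading
import time

def follow(thefile, shutdown=None):
    thefile.seek(0, 2)                      # go to the end of the file
    while True:
        if shutdown and shutdown.is_set():  # checked once per loop pass
            break
        line = thefile.readline()
        if not line:
            time.sleep(0.05)                # sleep briefly
            continue
        yield line

shutdown = threading.Event()
# Stand-in for the signal handler: request shutdown after 0.3 seconds
threading.Timer(0.3, shutdown.set).start()

received = []
with tempfile.TemporaryFile("w+") as logfile:
    for line in follow(logfile, shutdown):
        received.append(line)

print("follow() returned; lines received:", len(received))
```

The generator only notices the flag between reads, so shutdown can lag by up to one sleep interval.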








                                             Part 9
                                             Co-routines




The Final Frontier
                 • In Python 2.5, generators picked up the ability
                        to receive values using .send()
                             def recv_count():
                                 try:
                                     while True:
                                         n = (yield)     # Yield expression
                                         print "T-minus", n
                                 except GeneratorExit:
                                     print "Kaboom!"


                 • Think of this function as receiving values rather
                        than generating them






                                            Example Use
                 • Using a receiver
                           >>> r = recv_count()
                           >>> r.next()                  # Note: must call .next() here
                           >>> for i in range(5,0,-1):
                           ...       r.send(i)
                           ...
                           T-minus 5
                           T-minus 4
                           T-minus 3
                           T-minus 2
                           T-minus 1
                           >>> r.close()
                           Kaboom!
                           >>>




Co-routines
               • This form of a generator is a "co-routine"
               • Also sometimes called a "reverse-generator"
               • Python books (mine included) do a pretty poor
                      job of explaining how co-routines are supposed
                      to be used
               • I like to think of them as "receivers" or
                      "consumers". They receive values sent to them.






                    Setting up a Coroutine
                 • To get a co-routine to run properly, you have to
                        ping it with a .next() operation first
                             def recv_count():
                                 try:
                                     while True:
                                         n = (yield)     # Yield expression
                                         print "T-minus", n
                                 except GeneratorExit:
                                     print "Kaboom!"


                 • Example:
                           r = recv_count()
                           r.next()


                 • This advances it to the first yield--where it will
                        receive its first value
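A Python 3 sketch of what goes wrong without the priming step. Here `next(r)` replaces `r.next()`, and the `out` list is a hypothetical addition that collects the messages instead of printing them.

```python
def recv_count(out):
    try:
        while True:
            n = (yield)                  # wait for a value from .send()
            out.append("T-minus %d" % n)
    except GeneratorExit:
        out.append("Kaboom!")

out = []
r = recv_count(out)

# Sending to a coroutine that has not reached its first yield fails
try:
    r.send(5)
    error_type = None
except TypeError as e:
    error_type = type(e).__name__

next(r)        # advance to the first yield
r.send(5)      # now the value is received
r.close()
print(error_type)   # TypeError
print(out)          # ['T-minus 5', 'Kaboom!']
```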
@consumer decorator
                 • The .next() bit can be handled via decoration
                             def consumer(func):
                                 def start(*args,**kwargs):
                                     c = func(*args,**kwargs)
                                     c.next()
                                     return c
                                 return start

                 • Example:
                           @consumer
                           def recv_count():
                               try:
                                   while True:
                                       n = (yield)     # Yield expression
                                       print "T-minus", n
                               except GeneratorExit:
                                   print "Kaboom!"




                   @consumer decorator
                 • Using the decorated version
                            >>> r = recv_count()
                            >>> for i in range(5,0,-1):
                            ...     r.send(i)
                            ...
                            T-minus 5
                            T-minus 4
                            T-minus 3
                            T-minus 2
                            T-minus 1
                            >>> r.close()
                            Kaboom!
                            >>>

                  • Don't need the extra .next() step here
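A Python 3 sketch of the decorator, using `next(c)` in place of `c.next()`. The `functools.wraps` call and the `out` list are additions for this sketch, not part of the slides.

```python
import functools

def consumer(func):
    @functools.wraps(func)      # keep the wrapped function's name (addition)
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)                 # prime: advance to the first yield
        return c
    return start

@consumer
def recv_count(out):
    try:
        while True:
            n = (yield)
            out.append("T-minus %d" % n)
    except GeneratorExit:
        out.append("Kaboom!")

out = []
r = recv_count(out)             # already primed by the decorator
for i in range(3, 0, -1):
    r.send(i)
r.close()
print(out)   # ['T-minus 3', 'T-minus 2', 'T-minus 1', 'Kaboom!']
```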
Coroutine Pipelines
                 • Co-routines also set up a processing pipeline
                 • Instead of being defined by iteration, it's
                        defined by pushing values into the pipeline
                        using .send()


                        .send()             .send()     .send()


                  • We already saw some of this with broadcasting
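One way such a push-style pipeline might look in Python 3. The stage names `only_status` and `collect_requests`, and the sample records, are hypothetical; each stage receives a value with `(yield)` and pushes it onward with `.send()`.

```python
def consumer(func):
    # Prime a coroutine so it is ready to receive values
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)
        return c
    return start

@consumer
def only_status(code, target):
    # Middle stage: forward only records with the given status
    while True:
        r = (yield)
        if r["status"] == code:
            target.send(r)

@consumer
def collect_requests(results):
    # End point: store the request path of each record received
    while True:
        r = (yield)
        results.append(r["request"])

results = []
pipeline = only_status(404, collect_requests(results))
for record in [
    {"status": 200, "request": "/index.html"},
    {"status": 404, "request": "/missing"},
    {"status": 404, "request": "/gone"},
]:
    pipeline.send(record)      # the source pushes; nothing is pulled
print(results)   # ['/missing', '/gone']
```

The pipeline is assembled from the end point backwards, because each stage needs its target when it is created.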




                    Broadcasting (Reprise)
               • Consume a generator and send items to a set
                       of consumers
                           def broadcast(source, consumers):
                               for item in source:
                                   for c in consumers:
                                       c.send(item)



               • Notice the .send() operation there
               • The consumers could be co-routines

Example
              @consumer
              def find_404():
                  while True:
                       r = (yield)
                       if r['status'] == 404:
                           print r['status'],r['datetime'],r['request']

              @consumer
              def bytes_transferred():
                  total = 0
                  while True:
                       r = (yield)
                       total += r['bytes']
                       print "Total bytes", total

              lines = follow(open("access-log"))
              log   = apache_log(lines)
              broadcast(log,[find_404(),bytes_transferred()])
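A runnable Python 3 sketch of the same broadcast. The three sample records are hypothetical and replace `apache_log(follow(...))`, and the coroutines append to lists (an assumption for this sketch) instead of printing.

```python
def consumer(func):
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)                    # prime the coroutine
        return c
    return start

def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)

@consumer
def find_404(found):
    while True:
        r = (yield)
        if r['status'] == 404:
            found.append(r['request'])

@consumer
def bytes_transferred(totals):
    total = 0
    while True:
        r = (yield)
        total += r['bytes']
        totals.append(total)       # running total after each record

log = [
    {'status': 200, 'bytes': 100, 'request': '/a'},
    {'status': 404, 'bytes': 0,   'request': '/missing'},
    {'status': 200, 'bytes': 50,  'request': '/b'},
]
found, totals = [], []
broadcast(log, [find_404(found), bytes_transferred(totals)])
print(found)    # ['/missing']
print(totals)   # [100, 100, 150]
```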






                                            Discussion

                 • In the last example, there were multiple consumers
                 • However, there were no threads
                 • Further exploration along these lines can take
                        you into co-operative multitasking, concurrent
                        programming without using threads
                 • That's an entirely different tutorial!

Wrap Up







                                            The Big Idea
                • Generators are an incredibly useful tool for a
                        variety of "systems"-related problems
                • Power comes from the ability to set up
                        processing pipelines
                • Can create components that plug into the
                        pipeline as reusable pieces
                • Can extend the pipeline idea in many directions
                        (networking, threads, co-routines)

Code Reuse

                • I like the way that code gets reused with
                        generators
                • Small components that just process a data
                        stream
                • Personally, I think this is much easier than what
                        you commonly see with OO patterns







                                             Example
             • SocketServer Module (Strategy Pattern)
                    import SocketServer
                    class HelloHandler(SocketServer.BaseRequestHandler):
                         def handle(self):
                             self.request.sendall("Hello World\n")

                    serv = SocketServer.TCPServer(("",8000),HelloHandler)
                    serv.serve_forever()


             • My generator version
                      for c,a in receive_connections(("",8000)):
                          c.send("Hello World\n")
                          c.close()




Pitfalls
             • I don't think many programmers really
                    understand generators yet
             • Springing this on the uninitiated might cause
                    their heads to explode
             • Error handling is really tricky because you have
                    lots of components chained together
             • Need to pay careful attention to debugging,
                    reliability, and other issues.





                                                Thanks!

             • I hope you got some new ideas from this class
             • Please feel free to contact me
                                            http://www.dabeaz.com




Copyright (C) 2008, http://www.dabeaz.com                           1-
                                                                     142

More Related Content

What's hot

도커 없이 컨테이너 만들기 5편 마운트 네임스페이스와 오버레이 파일시스템
도커 없이 컨테이너 만들기 5편 마운트 네임스페이스와 오버레이 파일시스템도커 없이 컨테이너 만들기 5편 마운트 네임스페이스와 오버레이 파일시스템
도커 없이 컨테이너 만들기 5편 마운트 네임스페이스와 오버레이 파일시스템
Sam Kim
 
Generator Tricks for Systems Programmers
Generator Tricks for Systems ProgrammersGenerator Tricks for Systems Programmers
Generator Tricks for Systems Programmers
David Beazley (Dabeaz LLC)
 
PuppetConf 2016: Nice and Secure: Good OpSec Hygiene With Puppet! – Peter Sou...
PuppetConf 2016: Nice and Secure: Good OpSec Hygiene With Puppet! – Peter Sou...PuppetConf 2016: Nice and Secure: Good OpSec Hygiene With Puppet! – Peter Sou...
PuppetConf 2016: Nice and Secure: Good OpSec Hygiene With Puppet! – Peter Sou...
Puppet
 
TDC2016POA | Trilha Ruby - Stack Level too Deep e Tail Call Optimization: É u...
TDC2016POA | Trilha Ruby - Stack Level too Deep e Tail Call Optimization: É u...TDC2016POA | Trilha Ruby - Stack Level too Deep e Tail Call Optimization: É u...
TDC2016POA | Trilha Ruby - Stack Level too Deep e Tail Call Optimization: É u...
tdc-globalcode
 
Using Python3 to Build a Cloud Computing Service for my Superboard II
Using Python3 to Build a Cloud Computing Service for my Superboard IIUsing Python3 to Build a Cloud Computing Service for my Superboard II
Using Python3 to Build a Cloud Computing Service for my Superboard II
David Beazley (Dabeaz LLC)
 
Simplest-Ownage-Human-Observed… - Routers
 Simplest-Ownage-Human-Observed… - Routers Simplest-Ownage-Human-Observed… - Routers
Simplest-Ownage-Human-Observed… - Routers
Logicaltrust pl
 
Perl-C/C++ Integration with Swig
Perl-C/C++ Integration with SwigPerl-C/C++ Integration with Swig
Perl-C/C++ Integration with Swig
David Beazley (Dabeaz LLC)
 
Authen Free Bsd6 2
Authen Free Bsd6 2Authen Free Bsd6 2
Authen Free Bsd6 2
Kwanchai Charoennet
 
bioinfolec_2nd_20070622
bioinfolec_2nd_20070622bioinfolec_2nd_20070622
bioinfolec_2nd_20070622
sesejun
 
Web Server Free Bsd
Web Server Free BsdWeb Server Free Bsd
Web Server Free Bsd
Kwanchai Charoennet
 
Introduction to VeriFast @ Kyoto
Introduction to VeriFast @ KyotoIntroduction to VeriFast @ Kyoto
Introduction to VeriFast @ Kyoto
Kiwamu Okabe
 
Generators: The Final Frontier
Generators: The Final FrontierGenerators: The Final Frontier
Generators: The Final Frontier
David Beazley (Dabeaz LLC)
 
Generator Tricks for Systems Programmers

  • 1. Generator Tricks For Systems Programmers
       David Beazley
       http://www.dabeaz.com
       Presented at PyCon'2008
       Copyright (C) 2008, http://www.dabeaz.com

       An Introduction
       • Generators are cool!
       • But what are they?
       • And what are they good for?
       • That's what this tutorial is about
  • 2. About Me
       • I'm a long-time Pythonista
       • First started using Python with version 1.3
       • Author: Python Essential Reference
       • Responsible for a number of open source Python-related packages (Swig, PLY, etc.)

       My Story
       My addiction to generators started innocently enough. I was just a happy Python programmer working away in my secret lair when I got "the call." A call to sort through 1.5 Terabytes of C++ source code (~800 weekly snapshots of a million line application). That's when I discovered the os.walk() function. I knew this wasn't going to end well...
  • 3. Back Story
       • I think generators are wicked cool
       • An extremely useful language feature
       • Yet, they still seem rather exotic
       • I still don't think I've fully wrapped my brain around the best approach to using them

       A Complaint
       • The coverage of generators in most Python books is lame (mine included)
       • Look at all of these cool examples!
           • Fibonacci Numbers
           • Squaring a list of numbers
           • Randomized sequences
       • Wow! Blow me over!
  • 4. This Tutorial
       • Some more practical uses of generators
       • Focus is "systems programming"
       • Which loosely includes files, file systems, parsing, networking, threads, etc.
       • My goal: To provide some more compelling examples of using generators
       • Planting some seeds

       Support Files
       • Files used in this tutorial are available here: http://www.dabeaz.com/generators/
       • Go there to follow along with the examples
  • 5. Disclaimer
       • This isn't meant to be an exhaustive tutorial on generators and related theory
       • Will be looking at a series of examples
       • I don't know if the code I've written is the "best" way to solve any of these problems.
       • So, let's have a discussion

       Performance Details
       • There are some later performance numbers
       • Python 2.5.1 on OS X 10.4.11
       • All tests were conducted on the following:
           • Mac Pro 2x2.66 Ghz Dual-Core Xeon
           • 3 Gbytes RAM
           • WDC WD2500JS-41SGB0 Disk (250G)
       • Timings are 3-run average of 'time' command
  • 6. Part I
       Introduction to Iterators and Generators

       Iteration
       • As you know, Python has a "for" statement
       • You use it to loop over a collection of items

           >>> for x in [1,4,5,10]:
           ...     print x,
           ...
           1 4 5 10
           >>>

       • And, as you have probably noticed, you can iterate over many different kinds of objects (not just lists)
  • 7. Iterating over a Dict
       • If you loop over a dictionary you get keys

           >>> prices = { 'GOOG' : 490.10,
           ...            'AAPL' : 145.23,
           ...            'YHOO' : 21.71 }
           ...
           >>> for key in prices:
           ...     print key
           ...
           YHOO
           GOOG
           AAPL
           >>>

       Iterating over a String
       • If you loop over a string, you get characters

           >>> s = "Yow!"
           >>> for c in s:
           ...     print c
           ...
           Y
           o
           w
           !
           >>>
  • 8. Iterating over a File
       • If you loop over a file you get lines

           >>> for line in open("real.txt"):
           ...     print line,
           ...
           Real Programmers write in FORTRAN
           Maybe they do now, in this decadent era of Lite beer, hand calculators, and "user-friendly" software but back in the Good Old Days, when the term "software" sounded funny and Real Computers were made out of drums and vacuum tubes, Real Programmers wrote in machine code. Not FORTRAN. Not RATFOR. Not, even, assembly language. Machine Code. Raw, unadorned, inscrutable hexadecimal numbers. Directly.

       Consuming Iterables
       • Many functions consume an "iterable" object
       • Reductions: sum(s), min(s), max(s)
       • Constructors: list(s), tuple(s), set(s), dict(s)
       • in operator: item in s
       • Many others in the library
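The consumers listed under "Consuming Iterables" can be tried on small in-memory objects. The following is a minimal sketch, assuming Python 3 (the slides use Python 2.5); the `prices` dictionary and the string come from the earlier slides, while the other variable names are only illustrative.

```python
# Python 3 sketch; the slides themselves target Python 2.5.
prices = {'GOOG': 490.10, 'AAPL': 145.23, 'YHOO': 21.71}
s = "Yow!"

# Reductions consume any iterable, not just lists.
highest = max(prices.values())
smallest_char = min(s)            # compares characters by code point

# Constructors build a new container from whatever iteration yields.
keys = list(prices)               # iterating a dict yields its keys
chars = tuple(s)
unique = set("aabbc")
rebuilt = dict([('a', 1), ('b', 2)])

# The in operator also works through iteration when needed.
has_w = 'w' in s
```

Each call pulls items from its argument one at a time, which is why the same functions later work unchanged on generators.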
  • 9. Iteration Protocol
       • The reason why you can iterate over different objects is that there is a specific protocol

           >>> items = [1, 4, 5]
           >>> it = iter(items)
           >>> it.next()
           1
           >>> it.next()
           4
           >>> it.next()
           5
           >>> it.next()
           Traceback (most recent call last):
             File "<stdin>", line 1, in <module>
           StopIteration
           >>>

       Iteration Protocol
       • An inside look at the for statement

           for x in obj:
               # statements

       • Underneath the covers

           _iter = iter(obj)              # Get iterator object
           while 1:
               try:
                   x = _iter.next()       # Get next item
               except StopIteration:      # No more items
                   break
               # statements
               ...

       • Any object that supports iter() and next() is said to be "iterable."
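The "underneath the covers" expansion of the for statement can be run directly. This is a minimal sketch, assuming Python 3, where the built-in `next(it)` replaces the `it.next()` method shown on the slides; `manual_for` is a hypothetical helper name, and appending to a list stands in for the loop body.

```python
def manual_for(obj):
    """Visit the items of obj the way a for statement would (hypothetical helper)."""
    collected = []
    _iter = iter(obj)              # Get iterator object
    while True:
        try:
            x = next(_iter)        # Get next item (Python 3 spelling)
        except StopIteration:      # No more items
            break
        collected.append(x)        # Stand-in for "# statements"
    return collected

items = [1, 4, 5]
result = manual_for(items)
```

Because the helper relies only on `iter()` and `next()`, it accepts lists, strings, dictionaries, and files alike.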
  • 10. Supporting Iteration
       • User-defined objects can support iteration
       • Example: Counting down...

           >>> for x in countdown(10):
           ...     print x,
           ...
           10 9 8 7 6 5 4 3 2 1
           >>>

       • To do this, you just have to make the object implement __iter__() and next()

       Supporting Iteration
       • Sample implementation

           class countdown(object):
               def __init__(self,start):
                   self.count = start
               def __iter__(self):
                   return self
               def next(self):
                   if self.count <= 0:
                       raise StopIteration
                   r = self.count
                   self.count -= 1
                   return r
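For readers on Python 3, the sample implementation needs one change: the method is named `__next__()` rather than `next()`. The sketch below is an adaptation under that assumption, not the code from the slides; the capitalized class name `Countdown` is only a naming choice made here.

```python
class Countdown:
    """Python 3 adaptation of the slide's countdown iterator."""

    def __init__(self, start):
        self.count = start

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):            # named next() in Python 2
        if self.count <= 0:
            raise StopIteration
        r = self.count
        self.count -= 1
        return r

values = list(Countdown(5))
```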
  • 11. Iteration Example
       • Example use:

           >>> c = countdown(5)
           >>> for i in c:
           ...     print i,
           ...
           5 4 3 2 1
           >>>

       Iteration Commentary
       • There are many subtle details involving the design of iterators for various objects
       • However, we're not going to cover that
       • This isn't a tutorial on "iterators"
       • We're talking about generators...
  • 12. Generators
       • A generator is a function that produces a sequence of results instead of a single value

           def countdown(n):
               while n > 0:
                   yield n
                   n -= 1

           >>> for i in countdown(5):
           ...     print i,
           ...
           5 4 3 2 1
           >>>

       • Instead of returning a value, you generate a series of values (using the yield statement)

       Generators
       • Behavior is quite different from a normal function
       • Calling a generator function creates a generator object. However, it does not start running the function.

           def countdown(n):
               print "Counting down from", n
               while n > 0:
                   yield n
                   n -= 1

           >>> x = countdown(10)
           >>> x
           <generator object at 0x58490>
           >>>

         (Notice that no output was produced)
  • 13. Generator Functions
       • The function only executes on next()

           >>> x = countdown(10)
           >>> x
           <generator object at 0x58490>
           >>> x.next()
           Counting down from 10
           10
           >>>

         (Function starts executing here, on the first call to next())

       • yield produces a value, but suspends the function
       • Function resumes on next call to next()

           >>> x.next()
           9
           >>> x.next()
           8
           >>>

       Generator Functions
       • When the generator returns, iteration stops

           >>> x.next()
           1
           >>> x.next()
           Traceback (most recent call last):
             File "<stdin>", line 1, in ?
           StopIteration
           >>>
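The delayed start and the suspend/resume behavior described above can be observed without relying on printed output. This is a minimal sketch, assuming Python 3 (`next(x)` instead of `x.next()`); the `events` list is a hypothetical stand-in for the slide's print statement so the order of execution can be inspected.

```python
events = []                        # hypothetical trace of what the function body did

def countdown(n):
    events.append("started")       # stands in for the slide's print statement
    while n > 0:
        yield n                    # hand back a value and suspend here
        n -= 1
    events.append("finished")

x = countdown(3)
events_after_call = list(events)   # body has not run yet

first = next(x)                    # body runs up to the first yield
events_after_first = list(events)

rest = list(x)                     # resumes repeatedly until the function returns
events_at_end = list(events)

# Once the generator has returned, further requests raise StopIteration.
try:
    next(x)
    exhausted = False
except StopIteration:
    exhausted = True
```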
  • 14. Generator Functions
       • A generator function is mainly a more convenient way of writing an iterator
       • You don't have to worry about the iterator protocol (.next, .__iter__, etc.)
       • It just works

       Generators vs. Iterators
       • A generator function is slightly different than an object that supports iteration
       • A generator is a one-time operation. You can iterate over the generated data once, but if you want to do it again, you have to call the generator function again.
       • This is different than a list (which you can iterate over as many times as you want)
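The one-time nature of a generator is easy to check by iterating twice. A minimal sketch, assuming Python 3 and reusing the `countdown` generator function from the earlier slides; the other variable names are illustrative.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
first_pass = list(gen)             # consumes the generator
second_pass = list(gen)            # nothing left to produce

nums = [3, 2, 1]
list_first = list(nums)            # a list can be iterated repeatedly
list_second = list(nums)

fresh = list(countdown(3))         # calling the function again gives new data
```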
  • 15. Generator Expressions
       • A generated version of a list comprehension

           >>> a = [1,2,3,4]
           >>> b = (2*x for x in a)
           >>> b
           <generator object at 0x58760>
           >>> for i in b: print i,
           ...
           2 4 6 8
           >>>

       • This loops over a sequence of items and applies an operation to each item
       • However, results are produced one at a time using a generator

       Generator Expressions
       • Important differences from a list comprehension
           • Does not construct a list.
           • Only useful purpose is iteration
           • Once consumed, can't be reused
       • Example:

           >>> a = [1,2,3,4]
           >>> b = [2*x for x in a]
           >>> b
           [2, 4, 6, 8]
           >>> c = (2*x for x in a)
           >>> c
           <generator object at 0x58760>
           >>>
  • 16. Generator Expressions
       • General syntax

           (expression for i in s if cond1
                       for j in t if cond2
                       ...
                       if condfinal)

       • What it means

           for i in s:
               if cond1:
                   for j in t:
                       if cond2:
                           ...
                           if condfinal: yield expression

       A Note on Syntax
       • The parens on a generator expression can be dropped if it is used as a single function argument
       • Example:

           sum(x*x for x in s)

         (the argument to sum() is a generator expression)
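The correspondence between the general syntax and its nested-loop meaning can be checked on a small case. This is a minimal sketch, assuming Python 3; the sequences `s` and `t` and both conditions are illustrative choices, not taken from the slides.

```python
s = range(1, 5)                    # illustrative input: 1, 2, 3, 4
t = "ab"                           # illustrative second sequence

# Generator expression with two for clauses and two conditions.
pairs_expr = ((i, j) for i in s if i % 2 == 0
                     for j in t if j != "a")

# The same logic written out as nested loops in a generator function.
def pairs_func():
    for i in s:
        if i % 2 == 0:
            for j in t:
                if j != "a":
                    yield (i, j)

same = list(pairs_expr) == list(pairs_func())

# Parentheses dropped because the expression is the only argument.
sum_of_squares = sum(x*x for x in s)
```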
  • 17. Interlude
       • We now have two basic building blocks
       • Generator functions:

           def countdown(n):
               while n > 0:
                   yield n
                   n -= 1

       • Generator expressions:

           squares = (x*x for x in s)

       • In both cases, we get an object that generates values (which are typically consumed in a for loop)

       Part 2
       Processing Data Files
       (Show me your Web Server Logs)
  • 18. Programming Problem
    Find out how many bytes of data were transferred by summing up the last column of data in this Apache web server log

        81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587
        81.107.39.38 - ... "GET /favicon.ico HTTP/1.1" 404 133
        81.107.39.38 - ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903
        81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238
        81.107.39.38 - ... "GET /ply/example.html HTTP/1.1" 200 2359
        66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447

    Oh yeah, and the log file might be huge (Gbytes)

    The Log File
    • Each line of the log looks like this:
        81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238
    • The number of bytes is the last column
        bytestr = line.rsplit(None,1)[1]
    • It's either a number or a missing value (-)
        81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -
    • Converting the value
        if bytestr != '-':
            bytes = int(bytestr)
  • 19. A Non-Generator Soln
    • Just do a simple for-loop
        wwwlog = open("access-log")
        total = 0
        for line in wwwlog:
            bytestr = line.rsplit(None,1)[1]
            if bytestr != '-':
                total += int(bytestr)
        print "Total", total
    • We read line-by-line and just update a sum
    • However, that's so 90s...

    A Generator Solution
    • Let's use some generator expressions
        wwwlog     = open("access-log")
        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
        bytes      = (int(x) for x in bytecolumn if x != '-')
        print "Total", sum(bytes)
    • Whoa! That's different!
        • Less code
        • A completely different programming style
  • 20. Generators as a Pipeline
    • To understand the solution, think of it as a data processing pipeline
        access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total
    • Each step is defined by iteration/generation
        wwwlog     = open("access-log")
        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
        bytes      = (int(x) for x in bytecolumn if x != '-')
        print "Total", sum(bytes)

    Being Declarative
    • At each step of the pipeline, we declare an operation that will be applied to the entire input stream
        access-log -> wwwlog -> bytecolumn -> bytes -> sum() -> total

        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
    • This operation gets applied to every line of the log file
  • 21. Being Declarative
    • Instead of focusing on the problem at a line-by-line level, you just break it down into big operations that operate on the whole file
    • This is very much a "declarative" style
    • The key: Think big...

    Iteration is the Glue
    • The glue that holds the pipeline together is the iteration that occurs in each step
        wwwlog     = open("access-log")
        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
        bytes      = (int(x) for x in bytecolumn if x != '-')
        print "Total", sum(bytes)
    • The calculation is being driven by the last step
    • The sum() function is consuming values being pushed through the pipeline (via .next() calls)
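The claim that the last step drives the whole calculation can be observed with a small experiment. This is a sketch in Python 3 syntax (an assumption; the slides use Python 2). The in-memory `log_lines` list stands in for the log file, `traced` is a hypothetical helper that records each line as it is pulled, and `byte_values` replaces the slide's `bytes` name to avoid shadowing the Python 3 built-in.

```python
log_lines = [
    '81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587',
    '81.107.39.38 - ... "GET /favicon.ico HTTP/1.1" 404 133',
    '81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 -',
]

pulled = []   # records each line at the moment the pipeline reads it

def traced(lines):
    # Hypothetical helper: pass lines through, noting when each is consumed
    for line in lines:
        pulled.append(line)
        yield line

wwwlog = traced(log_lines)
bytecolumn = (line.rsplit(None, 1)[1] for line in wwwlog)
byte_values = (int(x) for x in bytecolumn if x != '-')

lines_read_before_sum = len(pulled)   # pipeline is assembled, nothing has run yet
total = sum(byte_values)              # sum() pulls values through every stage
lines_read_after_sum = len(pulled)
```

No line is read while the stages are being wired together; all the work happens inside `sum()`.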
  • 22. Performance
    • Surely, this generator approach has all sorts of fancy-dancy magic that is slow.
    • Let's check it out on a 1.3GB log file...
        % ls -l big-access-log
        -rw-r--r-- beazley 1303238000 Feb 29 08:06 big-access-log

    Performance Contest
    • For-loop version (Time: 27.20)
        wwwlog = open("big-access-log")
        total = 0
        for line in wwwlog:
            bytestr = line.rsplit(None,1)[1]
            if bytestr != '-':
                total += int(bytestr)
        print "Total", total
    • Generator version (Time: 25.96)
        wwwlog     = open("big-access-log")
        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
        bytes      = (int(x) for x in bytecolumn if x != '-')
        print "Total", sum(bytes)
  • 23. Commentary
    • Not only was it not slow, it was 5% faster
    • And it was less code
    • And it was relatively easy to read
    • And frankly, I like it a whole lot better...

    "Back in the old days, we used AWK for this and we liked it. Oh, yeah, and get off my lawn!"

    Performance Contest
    • Generator version (Time: 25.96)
        wwwlog     = open("access-log")
        bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
        bytes      = (int(x) for x in bytecolumn if x != '-')
        print "Total", sum(bytes)
    • AWK version (Time: 37.33)
        % awk '{ total += $NF } END { print total }' big-access-log
    • Note: extracting the last column may not be awk's strong point
  • 24. Food for Thought
    • At no point in our generator solution did we ever create large temporary lists
    • Thus, not only is that solution faster, it can be applied to enormous data files
    • It's competitive with traditional tools

    More Thoughts
    • The generator solution was based on the concept of pipelining data between different components
    • What if you had more advanced kinds of components to work with?
    • Perhaps you could perform different kinds of processing by just plugging various pipeline components together
  • 25. This Sounds Familiar
    • The Unix philosophy
    • Have a collection of useful system utils
    • Can hook these up to files or each other
    • Perform complex tasks by piping data

    Part 3
    Fun with Files and Directories
  • 26. Programming Problem
    You have hundreds of web server logs scattered across various directories. In addition, some of the logs are compressed. Modify the last program so that you can easily read all of these logs

        foo/
            access-log-012007.gz
            access-log-022007.gz
            access-log-032007.gz
            ...
            access-log-012008
        bar/
            access-log-092007.bz2
            ...
            access-log-022008

    os.walk()
    • A very useful function for searching the file system
        import os
        for path, dirlist, filelist in os.walk(topdir):
            # path     : Current directory
            # dirlist  : List of subdirectories
            # filelist : List of files
            ...
    • This utilizes generators to recursively walk through the file system
  • 27. find
    • Generate all filenames in a directory tree that match a given filename pattern
        import os
        import fnmatch
        def gen_find(filepat,top):
            for path, dirlist, filelist in os.walk(top):
                for name in fnmatch.filter(filelist,filepat):
                    yield os.path.join(path,name)
    • Examples
        pyfiles = gen_find("*.py","/")
        logs    = gen_find("access-log*","/usr/www/")

    Performance Contest
    • gen_find version (Wall Clock Time: 559s)
        pyfiles = gen_find("*.py","/")
        for name in pyfiles:
            print name
    • find version (Wall Clock Time: 468s)
        % find / -name '*.py'
    • Performed on a 750GB file system containing about 140000 .py files
  • 28. A File Opener
    • Open a sequence of filenames
        import gzip, bz2
        def gen_open(filenames):
            for name in filenames:
                if name.endswith(".gz"):
                    yield gzip.open(name)
                elif name.endswith(".bz2"):
                    yield bz2.BZ2File(name)
                else:
                    yield open(name)
    • This is interesting... it takes a sequence of filenames as input and yields a sequence of open file objects

    cat
    • Concatenate items from one or more sources into a single sequence of items
        def gen_cat(sources):
            for s in sources:
                for item in s:
                    yield item
    • Example:
        lognames = gen_find("access-log*", "/usr/www")
        logfiles = gen_open(lognames)
        loglines = gen_cat(logfiles)
  • 29. grep
    • Generate a sequence of lines that contain a given regular expression
        import re
        def gen_grep(pat, lines):
            patc = re.compile(pat)
            for line in lines:
                if patc.search(line): yield line
    • Example:
        lognames = gen_find("access-log*", "/usr/www")
        logfiles = gen_open(lognames)
        loglines = gen_cat(logfiles)
        patlines = gen_grep(pat, loglines)

    Example
    • Find out how many bytes transferred for a specific pattern in a whole directory of logs
        pat    = r"somepattern"
        logdir = "/some/dir/"

        filenames  = gen_find("access-log*",logdir)
        logfiles   = gen_open(filenames)
        loglines   = gen_cat(logfiles)
        patlines   = gen_grep(pat,loglines)
        bytecolumn = (line.rsplit(None,1)[1] for line in patlines)
        bytes      = (int(x) for x in bytecolumn if x != '-')

        print "Total", sum(bytes)
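The find/open/cat/grep pipeline can be exercised end to end on a throwaway directory. This is a sketch in Python 3 syntax (an assumption; the slides use Python 2). The temporary directory, the two sample log files, and the `/ply/` pattern are made up for illustration, and the compressed files are opened in text mode (`"rt"`) so that lines come back as strings. As in the slides, the opened files are not closed explicitly.

```python
import bz2
import fnmatch
import gzip
import os
import re
import tempfile

def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name, "rt")    # text mode so lines are str
        elif name.endswith(".bz2"):
            yield bz2.open(name, "rt")
        else:
            yield open(name)

def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item

def gen_grep(pat, lines):
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line):
            yield line

with tempfile.TemporaryDirectory() as logdir:
    # Hypothetical sample data: one plain log and one gzip-compressed log
    with open(os.path.join(logdir, "access-log-1"), "w") as f:
        f.write('h - ... "GET /ply/ HTTP/1.1" 200 100\n')
        f.write('h - ... "GET /other HTTP/1.1" 200 5\n')
    with gzip.open(os.path.join(logdir, "access-log-2.gz"), "wt") as f:
        f.write('h - ... "GET /ply/x HTTP/1.1" 200 20\n')
        f.write('h - ... "GET /ply/y HTTP/1.1" 304 -\n')

    filenames = gen_find("access-log*", logdir)
    logfiles = gen_open(filenames)
    loglines = gen_cat(logfiles)
    patlines = gen_grep(r"/ply/", loglines)
    bytecolumn = (line.rsplit(None, 1)[1] for line in patlines)
    total = sum(int(x) for x in bytecolumn if x != '-')
```

Only the lines matching the pattern contribute to the total, regardless of which file or compression format they came from.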
  • 30. Important Concept
    • Generators decouple iteration from the code that uses the results of the iteration
    • In the last example, we're performing a calculation on a sequence of lines
    • It doesn't matter where or how those lines are generated
    • Thus, we can plug any number of components together up front as long as they eventually produce a line sequence

    Part 4
    Parsing and Processing Data
  • 31. Programming Problem
    Web server logs consist of different columns of data. Parse each line into a useful data structure that allows us to easily inspect the different fields.

        81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET ..." 200 7587
        host referrer user [datetime] "request" status bytes

    Parsing with Regex
    • Let's route the lines through a regex parser
        logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] ' \
                  r'"(\S+) (\S+) (\S+)" (\S+) (\S+)'
        logpat  = re.compile(logpats)

        groups  = (logpat.match(line) for line in loglines)
        tuples  = (g.groups() for g in groups if g)
    • This generates a sequence of tuples
        ('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
         'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')
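The pattern depends on regex escapes: `\S+` for a run of non-whitespace and `\[` / `\]` for the literal brackets around the timestamp. A small check in Python 3 syntax (an assumption; the slides use Python 2) is sketched below. The first sample line is assembled from the field values shown in the tuple above; the second is a made-up non-matching line to show the `if g` filter at work.

```python
import re

logpats = (r'(\S+) (\S+) (\S+) \[(.*?)\] '
           r'"(\S+) (\S+) (\S+)" (\S+) (\S+)')
logpat = re.compile(logpats)

loglines = [
    '71.201.176.194 - - [26/Feb/2008:10:30:08 -0600] '
    '"GET /ply/ply.html HTTP/1.1" 200 97238',
    'this line is not a log entry',     # hypothetical junk line
]

groups = (logpat.match(line) for line in loglines)
tuples = [g.groups() for g in groups if g]   # non-matching lines are dropped
```

Lines that do not match produce `None` from `match()` and are silently filtered out.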
  • 32. Tuples to Dictionaries
    • Let's turn tuples into dictionaries
        colnames = ('host','referrer','user','datetime',
                    'method','request','proto','status','bytes')

        log = (dict(zip(colnames,t)) for t in tuples)
    • This generates a sequence of named fields
        { 'status'  : '200',
          'proto'   : 'HTTP/1.1',
          'referrer': '-',
          'request' : '/ply/ply.html',
          'bytes'   : '97238',
          'datetime': '24/Feb/2008:00:08:59 -0600',
          'host'    : '140.180.132.213',
          'user'    : '-',
          'method'  : 'GET'}

    Field Conversion
    • Map specific dictionary fields through a function
        def field_map(dictseq,name,func):
            for d in dictseq:
                d[name] = func(d[name])
                yield d
    • Example: Convert a few field values
        log = field_map(log,"status", int)
        log = field_map(log,"bytes",
                        lambda s: int(s) if s != '-' else 0)
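The two steps (tuples to dictionaries, then field conversion) can be tried on a couple of hand-written tuples. This is a sketch in Python 3 syntax (an assumption; the slides use Python 2). The first tuple reuses the field values shown above; the second, with a missing byte count, is hypothetical.

```python
def field_map(dictseq, name, func):
    # Apply func to one named field of every dictionary passing through
    for d in dictseq:
        d[name] = func(d[name])
        yield d

colnames = ('host', 'referrer', 'user', 'datetime',
            'method', 'request', 'proto', 'status', 'bytes')

tuples = [
    ('140.180.132.213', '-', '-', '24/Feb/2008:00:08:59 -0600',
     'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238'),
    # Hypothetical record with a missing byte count
    ('140.180.132.213', '-', '-', '24/Feb/2008:00:09:01 -0600',
     'GET', '/ply/', 'HTTP/1.1', '304', '-'),
]

log = (dict(zip(colnames, t)) for t in tuples)
log = field_map(log, "status", int)
log = field_map(log, "bytes", lambda s: int(s) if s != '-' else 0)
records = list(log)
```

Note that `field_map` modifies each dictionary in place and passes the same object downstream.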
  • 33. Field Conversion
    • Creates dictionaries of converted values
        { 'status': 200,                  # Note conversion
          'proto': 'HTTP/1.1',
          'referrer': '-',
          'request': '/ply/ply.html',
          'datetime': '24/Feb/2008:00:08:59 -0600',
          'bytes': 97238,                 # Note conversion
          'host': '140.180.132.213',
          'user': '-',
          'method': 'GET'}
    • Again, this is just one big processing pipeline

    The Code So Far
        lognames = gen_find("access-log*","www")
        logfiles = gen_open(lognames)
        loglines = gen_cat(logfiles)
        groups   = (logpat.match(line) for line in loglines)
        tuples   = (g.groups() for g in groups if g)

        colnames = ('host','referrer','user','datetime','method',
                    'request','proto','status','bytes')

        log      = (dict(zip(colnames,t)) for t in tuples)
        log      = field_map(log,"bytes",
                             lambda s: int(s) if s != '-' else 0)
        log      = field_map(log,"status",int)
  • 34. Packaging
    • To make it more sane, you may want to package parts of the code into functions
        def lines_from_dir(filepat, dirname):
            names = gen_find(filepat,dirname)
            files = gen_open(names)
            lines = gen_cat(files)
            return lines
    • This is a general purpose function that reads all lines from a series of files in a directory

    Packaging
    • Parse an Apache log
        def apache_log(lines):
            groups = (logpat.match(line) for line in lines)
            tuples = (g.groups() for g in groups if g)

            colnames = ('host','referrer','user','datetime','method',
                        'request','proto','status','bytes')

            log = (dict(zip(colnames,t)) for t in tuples)
            log = field_map(log,"bytes",
                            lambda s: int(s) if s != '-' else 0)
            log = field_map(log,"status",int)
            return log
  • 35. Example Use
    • It's easy
        lines = lines_from_dir("access-log*","www")
        log   = apache_log(lines)

        for r in log:
            print r
    • Different components have been subdivided according to the data that they process

    A Query Language
    • Now that we have our log, let's do some queries
    • Find the set of all documents that 404
        stat404 = set(r['request'] for r in log
                      if r['status'] == 404)
    • Print all requests that transfer over a megabyte
        large = (r for r in log
                 if r['bytes'] > 1000000)

        for r in large:
            print r['request'], r['bytes']
  • 36. A Query Language
    • Find the largest data transfer
        print "%d %s" % max((r['bytes'],r['request'])
                            for r in log)
    • Collect all unique host IP addresses
        hosts = set(r['host'] for r in log)
    • Find the number of downloads of a file
        sum(1 for r in log
            if r['request'] == '/ply/ply-2.3.tar.gz')

    A Query Language
    • Find out who has been hitting robots.txt
        addrs = set(r['host'] for r in log
                    if 'robots.txt' in r['request'])

        import socket
        for addr in addrs:
            try:
                print socket.gethostbyaddr(addr)[0]
            except socket.herror:
                print addr
  • 37. Performance Study
    • Sadly, the last example doesn't run so fast on a huge input file (53 minutes on the 1.3GB log)
    • But, the beauty of generators is that you can plug filters in at almost any stage
        lines = lines_from_dir("big-access-log",".")
        lines = (line for line in lines if 'robots.txt' in line)
        log   = apache_log(lines)
        addrs = set(r['host'] for r in log)
        ...
    • That version takes 93 seconds

    Some Thoughts
    • I like the idea of using generator expressions as a pipeline query language
    • You can write simple filters, extract data, etc.
    • If you pass dictionaries/objects through the pipeline, it becomes quite powerful
    • Feels similar to writing SQL queries
  • 38. Part 5
    Processing Infinite Data

    Question
    • Have you ever used 'tail -f' in Unix?
        % tail -f logfile
        ...
        ... lines of output ...
        ...
    • This prints the lines written to the end of a file
    • The "standard" way to watch a log file
    • I used this all of the time when working on scientific simulations ten years ago...
  • 39. Infinite Sequences
    • Tailing a log file results in an "infinite" stream
    • It constantly watches the file and yields lines as soon as new data is written
    • But you don't know how much data will actually be written (in advance)
    • And log files can often be enormous

    Tailing a File
    • A Python version of 'tail -f'
        import time
        def follow(thefile):
            thefile.seek(0,2)      # Go to the end of the file
            while True:
                line = thefile.readline()
                if not line:
                    time.sleep(0.1)    # Sleep briefly
                    continue
                yield line
    • Idea: Seek to the end of the file and repeatedly try to read new lines. If new data is written to the file, we'll pick it up.
  • 40. Example
    • Using our follow function
        logfile  = open("access-log")
        loglines = follow(logfile)

        for line in loglines:
            print line,
    • This produces the same output as 'tail -f'

    Example
    • Turn the real-time log file into records
        logfile  = open("access-log")
        loglines = follow(logfile)
        log      = apache_log(loglines)
    • Print out all 404 requests as they happen
        r404 = (r for r in log if r['status'] == 404)
        for r in r404:
            print r['host'],r['datetime'],r['request']
  • 41. Commentary
    • We just plugged this new input scheme onto the front of our processing pipeline
    • Everything else still works, with one caveat: functions that consume an entire iterable won't terminate (min, max, sum, set, etc.)
    • Nevertheless, we can easily write processing steps that operate on an infinite data stream

    Thoughts
    • This data pipeline idea is really quite powerful
    • Captures a lot of common systems problems
    • Especially consumer-producer problems
  • 42. Part 6
    Feeding the Pipeline

    Feeding Generators
    • In order to feed a generator processing pipeline, you need to have an input source
    • So far, we have looked at two file-based inputs
    • Reading a file
        lines = open(filename)
    • Tailing a file
        lines = follow(open(filename))
  • 43. Generating Connections
    • Generate a sequence of TCP connections
        import socket
        def receive_connections(addr):
            s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
            s.bind(addr)
            s.listen(5)
            while True:
                client = s.accept()
                yield client
    • Example:
        for c,a in receive_connections(("",9000)):
            c.send("Hello World\n")
            c.close()

    Generating Messages
    • Receive a sequence of UDP messages
        import socket
        def receive_messages(addr,maxsize):
            s = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
            s.bind(addr)
            while True:
                msg = s.recvfrom(maxsize)
                yield msg
    • Example:
        for msg, addr in receive_messages(("",10000),1024):
            print msg, "from", addr
  • 44. I/O Multiplexing
    • Generating I/O events on a set of sockets
        import select
        def gen_events(socks):
            while True:
                rdr,wrt,err = select.select(socks,socks,socks,0.1)
                for r in rdr:
                    yield "read",r
                for w in wrt:
                    yield "write",w
                for e in err:
                    yield "error",e
    • Note: Using this one is a little tricky
    • Example: Reading from multiple client sockets

    I/O Multiplexing
        import threading
        clientset = []

        def acceptor(sockset,addr):
            for c,a in receive_connections(addr):
                sockset.append(c)

        acc_thr = threading.Thread(target=acceptor,
                                   args=(clientset,("",12000)))
        acc_thr.setDaemon(True)
        acc_thr.start()

        for evt,s in gen_events(clientset):
            if evt == 'read':
                data = s.recv(1024)
                if not data:
                    print "Closing", s
                    s.close()
                    clientset.remove(s)
                else:
                    print s,data
  • 45. Consuming a Queue
    • Generate a sequence of items from a queue
        def consume_queue(thequeue):
            while True:
                item = thequeue.get()
                if item is StopIteration: break
                yield item
    • Note: Using StopIteration as a sentinel
    • Might be used to feed a generator pipeline as a consumer thread

    Consuming a Queue
    • Example:
        import Queue, threading

        def consumer(q):
            for item in consume_queue(q):
                print "Consumed", item
            print "Done"

        in_q = Queue.Queue()
        con_thr = threading.Thread(target=consumer,args=(in_q,))
        con_thr.start()

        for i in xrange(100):
            in_q.put(i)
        in_q.put(StopIteration)
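The same producer/consumer arrangement can be written for Python 3, where the module is named `queue` (lowercase) and `xrange` is `range`. The following is a sketch under that assumption. The `consumed` list is a hypothetical stand-in for the print statements so the result can be inspected afterwards.

```python
import queue
import threading

def consume_queue(thequeue):
    # Yield items until the StopIteration class object arrives as a sentinel
    while True:
        item = thequeue.get()
        if item is StopIteration:
            break
        yield item

consumed = []   # hypothetical: collects items instead of printing them

def consumer(q):
    for item in consume_queue(q):
        consumed.append(item)

in_q = queue.Queue()
con_thr = threading.Thread(target=consumer, args=(in_q,))
con_thr.start()

for i in range(5):
    in_q.put(i)
in_q.put(StopIteration)   # tell the consumer thread to finish
con_thr.join()
```

The sentinel is compared with `is`, so it is the `StopIteration` class itself that ends the loop; no exception is raised.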
  • 46. Part 7
    Extending the Pipeline

    Multiple Processes
    • Can you extend a processing pipeline across processes and machines?
        [Diagram: process 1 and process 2 connected by a socket or pipe]
  • 47. Pickler/Unpickler
    • Turn a generated sequence into pickled objects
        import pickle
        def gen_pickle(source):
            for item in source:
                yield pickle.dumps(item)

        def gen_unpickle(infile):
            while True:
                try:
                    item = pickle.load(infile)
                    yield item
                except EOFError:
                    return
    • Now, attach these to a pipe or socket

    Sender/Receiver
    • Example: Sender
        import socket
        def sendto(source,addr):
            s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
            s.connect(addr)
            for pitem in gen_pickle(source):
                s.sendall(pitem)
            s.close()
    • Example: Receiver
        def receivefrom(addr):
            s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
            s.bind(addr)
            s.listen(5)
            c,a = s.accept()
            for item in gen_unpickle(c.makefile()):
                yield item
            c.close()
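The pickling pair can be tested without a network by using an in-memory byte stream in place of the socket. This is a sketch in Python 3 syntax (an assumption; the slides use Python 2). The `io.BytesIO` buffer and the two sample records are stand-ins chosen for illustration, not part of the original sender/receiver.

```python
import io
import pickle

def gen_pickle(source):
    for item in source:
        yield pickle.dumps(item)

def gen_unpickle(infile):
    # Keep loading pickles until the stream runs dry
    while True:
        try:
            yield pickle.load(infile)
        except EOFError:
            return

records = [
    {'status': 404, 'request': '/missing'},   # hypothetical sample records
    {'status': 200, 'request': '/ply/'},
]

wire = io.BytesIO()                 # stands in for the pipe or socket
for pitem in gen_pickle(records):
    wire.write(pitem)
wire.seek(0)                        # rewind, as if reading from the other end

received = list(gen_unpickle(wire))
```

Because each pickle is self-delimiting, the receiver can pull objects off the stream one at a time with no extra framing.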
  • 48. Example Use
    • Example: Read log lines and parse into records
        # netprod.py
        lines = follow(open("access-log"))
        log   = apache_log(lines)
        sendto(log,("",15000))
    • Example: Pick up the log on another machine
        # netcons.py
        for r in receivefrom(("",15000)):
            print r

    Fanning Out
    • In all of our examples, the processing pipeline is driven by a single consumer
        for item in gen:
            # Consume item
    • Can you expand the pipeline to multiple consumers?
        [Diagram: one generator fanning out to consumer1, consumer2, consumer3]
  • 49. Broadcasting
    • Consume a generator and send items to a set of consumers
        def broadcast(source, consumers):
            for item in source:
                for c in consumers:
                    c.send(item)
    • This changes the control-flow
    • The broadcaster is what consumes items
    • Those items have to be sent to consumers for processing

    Consumers
    • To create a consumer, define an object with a send method on it
        class Consumer(object):
            def send(self,item):
                print self, "got", item
    • Example:
        c1 = Consumer()
        c2 = Consumer()
        c3 = Consumer()

        lines = follow(open("access-log"))
        broadcast(lines,[c1,c2,c3])
  • 50. Consumers
    • Sadly, inside consumers, it is not possible to continue the same processing pipeline idea
    • In order for it to work, there has to be a single iteration that is driving the pipeline
    • With multiple consumers, you would have to be iterating in more than one location at once
    • You can do this with threads or distributed processes, however

    Network Consumer
    • Example:
        import socket,pickle
        class NetConsumer(object):
            def __init__(self,addr):
                self.s = socket.socket(socket.AF_INET,
                                       socket.SOCK_STREAM)
                self.s.connect(addr)
            def send(self,item):
                pitem = pickle.dumps(item)
                self.s.sendall(pitem)
            def close(self):
                self.s.close()
    • This will route items to a network receiver
  • 51. Network Consumer
    • Example Usage:
        class Stat404(NetConsumer):
            def send(self,item):
                if item['status'] == 404:
                    NetConsumer.send(self,item)

        lines = follow(open("access-log"))
        log   = apache_log(lines)

        stat404 = Stat404(("somehost",15000))

        broadcast(log, [stat404])
    • The 404 entries will go elsewhere...

    Consumer Thread
    • Example:
        import Queue, threading

        class ConsumerThread(threading.Thread):
            def __init__(self,target):
                threading.Thread.__init__(self)
                self.setDaemon(True)
                self.in_queue = Queue.Queue()
                self.target = target
            def send(self,item):
                self.in_queue.put(item)
            def generate(self):
                while True:
                    item = self.in_queue.get()
                    yield item
            def run(self):
                self.target(self.generate())
  • 52. Consumer Thread
    • Sample usage (building on earlier code)
        def find_404(log):
            for r in (r for r in log if r['status'] == 404):
                print r['status'],r['datetime'],r['request']

        def bytes_transferred(log):
            total = 0
            for r in log:
                total += r['bytes']
                print "Total bytes", total

        c1 = ConsumerThread(find_404)
        c1.start()
        c2 = ConsumerThread(bytes_transferred)
        c2.start()

        lines = follow(open("access-log"))   # Follow a log
        log   = apache_log(lines)            # Turn into records
        broadcast(log,[c1,c2])               # Broadcast to consumers

    Multiple Sources
    • In all of our examples, the processing pipeline is being fed by a single source
    • But, what if you had multiple sources?
        [Diagram: source1, source2, source3]
  • 53. Concatenation
    • Concatenate one source after another
        def concatenate(sources):
            for s in sources:
                for item in s:
                    yield item
    • This generates one big sequence
    • Consumes each generator one at a time
    • Only works with generators that terminate

    Parallel Iteration
    • Zipping multiple generators together
        import itertools

        z = itertools.izip(s1,s2,s3)
    • This one is only marginally useful
    • Requires generators to go lock-step
    • Terminates when the first exits
  • 54. Multiplexing
    • Consume from multiple generators in real time, producing values as they are generated
    • Example use
        log1 = follow(open("foo/access-log"))
        log2 = follow(open("bar/access-log"))

        lines = gen_multiplex([log1,log2])
    • There is no way to poll a generator. So, how do you do this?

    Multiplexing Generators
        import Queue, threading
        def gen_multiplex(genlist):
            item_q = Queue.Queue()
            def run_one(source):
                for item in source: item_q.put(item)

            def run_all():
                thrlist = []
                for source in genlist:
                    t = threading.Thread(target=run_one,args=(source,))
                    t.start()
                    thrlist.append(t)
                for t in thrlist: t.join()
                item_q.put(StopIteration)

            threading.Thread(target=run_all).start()
            while True:
                item = item_q.get()
                if item is StopIteration: return
                yield item
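To see the multiplexer terminate cleanly, it helps to feed it generators that finish. This is a sketch of the same function in Python 3 syntax (an assumption; the slides use Python 2, where the module is `Queue`). The `evens` and `odds` sources are hypothetical. Since the interleaving depends on thread scheduling, the output is sorted before it is inspected.

```python
import queue
import threading

def gen_multiplex(genlist):
    item_q = queue.Queue()

    def run_one(source):
        # Each source generator is drained in its own thread
        for item in source:
            item_q.put(item)

    def run_all():
        # Start one thread per source, wait for all, then post the sentinel
        thrlist = []
        for source in genlist:
            t = threading.Thread(target=run_one, args=(source,))
            t.start()
            thrlist.append(t)
        for t in thrlist:
            t.join()
        item_q.put(StopIteration)

    threading.Thread(target=run_all).start()
    while True:
        item = item_q.get()
        if item is StopIteration:
            return
        yield item

evens = (x for x in range(0, 10, 2))
odds = (x for x in range(1, 10, 2))
merged = sorted(gen_multiplex([evens, odds]))   # arrival order is not deterministic
```

With infinite sources such as `follow()`, the sentinel is never posted and the multiplexed stream simply keeps going.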
  • 55. Multiplexing Generators
        def gen_multiplex(genlist):
            item_q = Queue.Queue()

            # Each generator runs in a thread and drops items onto a queue
            def run_one(source):
                for item in source: item_q.put(item)

            # Run all of the generators, wait for them to terminate,
            # then put a sentinel on the queue (StopIteration)
            def run_all():
                thrlist = []
                for source in genlist:
                    t = threading.Thread(target=run_one,args=(source,))
                    t.start()
                    thrlist.append(t)
                for t in thrlist: t.join()
                item_q.put(StopIteration)

            threading.Thread(target=run_all).start()

            # Pull items off the queue and yield them
            while True:
                item = item_q.get()
                if item is StopIteration: return
                yield item

  • 56. Part 8
    Various Programming Tricks (And Debugging)
  • 57. Putting it all Together
    • This data processing pipeline idea is powerful
    • But, it's also potentially mind-boggling
    • Especially when you have dozens of pipeline stages, broadcasting, multiplexing, etc.
    • Let's look at a few useful tricks

    Creating Generators
    • Any single-argument function is easy to turn into a generator function
        def generate(func):
            def gen_func(s):
                for item in s:
                    yield func(item)
            return gen_func
    • Example:
        import math
        gen_sqrt = generate(math.sqrt)
        for x in gen_sqrt(xrange(100)):
            print x
  • 58. Debug Tracing
    • A debugging function that will print items going through a generator
        def trace(source):
            for item in source:
                print item
                yield item
    • This can easily be placed around any generator
        lines = follow(open("access-log"))
        log   = trace(apache_log(lines))

        r404  = trace(r for r in log if r['status'] == 404)
    • Note: Might consider logging module for this

    Recording the Last Item
    • Store the last item generated in the generator
        class storelast(object):
            def __init__(self,source):
                self.source = source
            def next(self):
                item = self.source.next()
                self.last = item
                return item
            def __iter__(self):
                return self
    • This can be easily wrapped around a generator
        lines = storelast(follow(open("access-log")))
        log   = apache_log(lines)

        for r in log:
            print r
            print lines.last
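In Python 3 the iterator method is spelled `__next__` and is invoked through the built-in `next()`, so the wrapper needs a small adjustment. The following is a sketch under that assumption. The in-memory list of lines is hypothetical, and `trace` here appends to a `seen` list (an assumed variation) instead of printing, so the effect can be checked.

```python
class storelast(object):
    # Iterator wrapper that remembers the most recent item it produced
    def __init__(self, source):
        self.source = iter(source)
    def __next__(self):
        item = next(self.source)
        self.last = item
        return item
    def __iter__(self):
        return self

def trace(source, seen):
    # Variation on the slide's trace(): record items rather than print them
    for item in source:
        seen.append(item)
        yield item

seen = []
lines = storelast(["line 1\n", "line 2\n", "line 3\n"])
upper = [line.upper() for line in trace(lines, seen)]
```

After the loop finishes, `lines.last` still holds the final raw line, even though downstream stages have transformed it.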
  • 59. Shutting Down
    • Generators can be shut down using .close()
        import time
        def follow(thefile):
            thefile.seek(0,2)      # Go to the end of the file
            while True:
                line = thefile.readline()
                if not line:
                    time.sleep(0.1)    # Sleep briefly
                    continue
                yield line
    • Example:
        lines = follow(open("access-log"))
        for i,line in enumerate(lines):
            print line,
            if i == 10: lines.close()

    Shutting Down
    • In the generator, GeneratorExit is raised
        import time
        def follow(thefile):
            thefile.seek(0,2)      # Go to the end of the file
            try:
                while True:
                    line = thefile.readline()
                    if not line:
                        time.sleep(0.1)    # Sleep briefly
                        continue
                    yield line
            except GeneratorExit:
                print "Follow: Shutting down"
    • This allows for resource cleanup (if needed)
  • 60. Ignoring Shutdown
    • Question: Can you ignore GeneratorExit?
        import time
        def follow(thefile):
            thefile.seek(0,2)      # Go to the end of the file
            while True:
                try:
                    line = thefile.readline()
                    if not line:
                        time.sleep(0.1)    # Sleep briefly
                        continue
                    yield line
                except GeneratorExit:
                    print "Forget about it"
    • Answer: No. You'll get a RuntimeError

    Shutdown and Threads
    • Question: Can a thread shut down a generator running in a different thread?
        lines = follow(open("foo/test.log"))

        def sleep_and_close(s):
            time.sleep(s)
            lines.close()

        threading.Thread(target=sleep_and_close,args=(30,)).start()

        for line in lines:
            print line,
  • 61. Shutdown and Threads
    • Separate threads cannot call .close()
    • Output:
        Exception in thread Thread-1:
        Traceback (most recent call last):
          File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 460, in __bootstrap
            self.run()
          File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 440, in run
            self.__target(*self.__args, **self.__kwargs)
          File "genfollow.py", line 31, in sleep_and_close
            lines.close()
        ValueError: generator already executing

    Shutdown and Signals
    • Can you shut down a generator with a signal?
        import signal
        def sigusr1(signo,frame):
            print "Closing it down"
            lines.close()

        signal.signal(signal.SIGUSR1,sigusr1)

        lines = follow(open("access-log"))
        for line in lines:
            print line,
    • From the command line
        % kill -USR1 pid
  • 62. Shutdown and Signals
    • This also fails:
        Traceback (most recent call last):
          File "genfollow.py", line 35, in <module>
            for line in lines:
          File "genfollow.py", line 8, in follow
            time.sleep(0.1)
          File "genfollow.py", line 30, in sigusr1
            lines.close()
        ValueError: generator already executing
    • Sigh.

    Shutdown
    • The only way to externally shut down a generator would be to instrument it with a flag or some kind of check
        def follow(thefile,shutdown=None):
            thefile.seek(0,2)
            while True:
                if shutdown and shutdown.isSet(): break
                line = thefile.readline()
                if not line:
                    time.sleep(0.1)
                    continue
                yield line
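The flag-based shutdown can be tried without signals by letting a timer set the event. This is a sketch in Python 3 syntax (an assumption; the slides use Python 2, where the method is spelled `isSet()`). The empty temporary file and the 0.3 second timer are hypothetical stand-ins for a real log and a real external trigger.

```python
import tempfile
import threading
import time

def follow(thefile, shutdown=None):
    thefile.seek(0, 2)                 # Go to the end of the file
    while True:
        # Check the flag on every pass so an external party can stop us
        if shutdown and shutdown.is_set():
            break
        line = thefile.readline()
        if not line:
            time.sleep(0.1)            # Sleep briefly
            continue
        yield line

shutdown = threading.Event()

with tempfile.NamedTemporaryFile("w+") as f:
    lines = follow(f, shutdown)
    # Another thread requests shutdown after a short delay
    threading.Timer(0.3, shutdown.set).start()
    received = list(lines)             # returns once the flag is seen
```

Nothing is written to the file in this sketch, so the loop ends with an empty result. The point is that `list()` returns at all, which it would not do for the unflagged `follow()`.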
  • 63. Shutdown
      • Example:

            import threading,signal

            shutdown = threading.Event()
            def sigusr1(signo,frame):
                print "Closing it down"
                shutdown.set()
            signal.signal(signal.SIGUSR1,sigusr1)

            lines = follow(open("access-log"),shutdown)
            for line in lines:
                print line,

                                     Part 9
                                  Co-routines
  • 64. The Final Frontier
      • In Python 2.5, generators picked up the ability to receive
        values using .send()

            def recv_count():
                try:
                    while True:
                        n = (yield)      # Yield expression
                        print "T-minus", n
                except GeneratorExit:
                    print "Kaboom!"

      • Think of this function as receiving values rather than
        generating them

    Example Use
      • Using a receiver

            >>> r = recv_count()
            >>> r.next()              # Note: must call .next() here
            >>> for i in range(5,0,-1):
            ...     r.send(i)
            ...
            T-minus 5
            T-minus 4
            T-minus 3
            T-minus 2
            T-minus 1
            >>> r.close()
            Kaboom!
            >>>
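For readers on Python 3, the same receiver might look like the sketch below. The differences are assumptions of this sketch, not part of the slides: next(r) replaces r.next(), and the messages are appended to a list (the hypothetical log argument) instead of printed, so the result is easy to inspect.

```python
# Sketch only: Python 3 syntax; the `log` list and `countdown` are hypothetical.
def recv_count(log):
    try:
        while True:
            n = yield                    # Receives the value passed to .send()
            log.append("T-minus %d" % n)
    except GeneratorExit:
        log.append("Kaboom!")            # Raised inside the generator by .close()


def countdown(start=5):
    log = []
    r = recv_count(log)
    next(r)                              # Advance to the first yield
    for i in range(start, 0, -1):
        r.send(i)
    r.close()
    return log
```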
  • 65. Co-routines
      • This form of a generator is a "co-routine"
      • Also sometimes called a "reverse-generator"
      • Python books (mine included) do a pretty poor job of
        explaining how co-routines are supposed to be used
      • I like to think of them as "receivers" or "consumers".
        They receive values sent to them.

    Setting up a Coroutine
      • To get a co-routine to run properly, you have to ping it
        with a .next() operation first

            def recv_count():
                try:
                    while True:
                        n = (yield)      # Yield expression
                        print "T-minus", n
                except GeneratorExit:
                    print "Kaboom!"

      • Example:

            r = recv_count()
            r.next()

      • This advances it to the first yield, where it will receive
        its first value
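It helps to see what happens when the priming step is skipped. A small sketch, assuming Python 3 (where next(c) replaces c.next()); echo and the two helper functions are hypothetical names used only for this illustration.

```python
# Sketch only: Python 3 syntax; all names here are made up for illustration.
def echo(seen):
    while True:
        seen.append((yield))      # Record each value sent in


def send_without_priming():
    """Sending to a fresh co-routine fails: it is not at a yield yet."""
    c = echo([])
    try:
        c.send("hello")
    except TypeError as exc:
        return str(exc)
    return None


def send_after_priming():
    """After next(), the co-routine is parked at the yield and can receive."""
    seen = []
    c = echo(seen)
    next(c)                       # c.send(None) would have the same effect
    c.send("hello")
    return seen
```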
  • 66. @consumer decorator
      • The .next() bit can be handled via decoration

            def consumer(func):
                def start(*args,**kwargs):
                    c = func(*args,**kwargs)
                    c.next()       # Advance to the first yield
                    return c
                return start

      • Example:

            @consumer
            def recv_count():
                try:
                    while True:
                        n = (yield)      # Yield expression
                        print "T-minus", n
                except GeneratorExit:
                    print "Kaboom!"

    @consumer decorator
      • Using the decorated version

            >>> r = recv_count()
            >>> for i in range(5,0,-1):
            ...     r.send(i)
            ...
            T-minus 5
            T-minus 4
            T-minus 3
            T-minus 2
            T-minus 1
            >>> r.close()
            Kaboom!
            >>>

      • Don't need the extra .next() step here
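A Python 3 take on the decorator, as a sketch. Two things here are additions of this sketch rather than part of the slides: functools.wraps, which keeps the decorated function's name and docstring, and the running_total example co-routine, which is hypothetical.

```python
# Sketch only: Python 3 syntax; `running_total` is a made-up example.
import functools


def consumer(func):
    @functools.wraps(func)            # Preserve func.__name__ and docstring
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)                       # Advance to the first yield
        return c
    return start


@consumer
def running_total(results):
    """Append the running sum of everything sent in to `results`."""
    total = 0
    while True:
        total += (yield)
        results.append(total)
```

The caller can start sending right away:

    results = []
    t = running_total(results)
    t.send(10)
    t.send(5)       # results is now [10, 15]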
  • 67. Coroutine Pipelines
      • Co-routines also set up a processing pipeline
      • Instead of being defined by iteration, it's defined by
        pushing values into the pipeline using .send()

            [Figure: pipeline stages linked by .send()]

      • We already saw some of this with broadcasting

    Broadcasting (Reprise)
      • Consume a generator and send items to a set of consumers

            def broadcast(source, consumers):
                for item in source:
                    for c in consumers:
                        c.send(item)

      • Notice the .send() operation there
      • The consumers could be co-routines
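The push-style pipeline described under "Coroutine Pipelines" can be sketched as a chain in which each stage forwards to the next with .send(). This assumes Python 3, and the stage names grep, collect, and push are hypothetical; they are not defined in this part of the slides.

```python
# Sketch only: Python 3 syntax; `grep`, `collect`, and `push` are made-up names.
def consumer(func):
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)                        # Advance to the first yield
        return c
    return start


@consumer
def grep(pattern, target):
    """Filter stage: forward only the lines containing `pattern`."""
    while True:
        line = (yield)
        if pattern in line:
            target.send(line)          # Push downstream


@consumer
def collect(results):
    """Sink stage: store whatever arrives."""
    while True:
        results.append((yield))


def push(source, target):
    """Drive the pipeline by pushing every item of `source` into `target`."""
    for item in source:
        target.send(item)
```

The pipeline is built back to front, since each stage needs a reference to the stage it sends to:

    results = []
    push(["GET /a 200", "GET /b 404"], grep("404", collect(results)))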
  • 68. Example

            @consumer
            def find_404():
                while True:
                    r = (yield)
                    if r['status'] == 404:
                        print r['status'],r['datetime'],r['request']

            @consumer
            def bytes_transferred():
                total = 0
                while True:
                    r = (yield)
                    total += r['bytes']
                    print "Total bytes", total

            lines = follow(open("access-log"))
            log = apache_log(lines)
            broadcast(log,[find_404(),bytes_transferred()])

    Discussion
      • In the last example, there were multiple consumers
      • However, there were no threads
      • Further exploration along these lines can take you into
        co-operative multitasking, concurrent programming
        without using threads
      • That's an entirely different tutorial!
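The broadcast example can be exercised without a live log file. A sketch in Python 3, with two departures from the slide that are assumptions of this sketch: the records are hand-made dictionaries (hypothetical sample data using the same keys), and the consumers append to lists instead of printing.

```python
# Sketch only: Python 3 syntax; sample records and list arguments are made up.
def consumer(func):
    def start(*args, **kwargs):
        c = func(*args, **kwargs)
        next(c)                          # Advance to the first yield
        return c
    return start


def broadcast(source, consumers):
    for item in source:
        for c in consumers:
            c.send(item)                 # Every consumer sees every item


@consumer
def find_404(hits):
    while True:
        r = (yield)
        if r["status"] == 404:
            hits.append(r["request"])


@consumer
def bytes_transferred(totals):
    total = 0
    while True:
        r = (yield)
        total += r["bytes"]
        totals.append(total)             # Running total after each record


# Hypothetical records standing in for the output of apache_log()
SAMPLE_LOG = [
    {"status": 200, "request": "/index.html", "bytes": 1000},
    {"status": 404, "request": "/missing.html", "bytes": 200},
    {"status": 200, "request": "/logo.png", "bytes": 3000},
]


def run_sample():
    hits, totals = [], []
    broadcast(SAMPLE_LOG, [find_404(hits), bytes_transferred(totals)])
    return hits, totals
```

Both consumers run in the same thread, each resuming only for the duration of one .send() call.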
  • 69. Wrap Up

    The Big Idea
      • Generators are an incredibly useful tool for a variety of
        "systems"-related problems
      • Power comes from the ability to set up processing pipelines
      • Can create components that plug into the pipeline as
        reusable pieces
      • Can extend the pipeline idea in many directions
        (networking, threads, co-routines)
  • 70. Code Reuse
      • I like the way that code gets reused with generators
      • Small components that just process a data stream
      • Personally, I think this is much easier than what you
        commonly see with OO patterns

    Example
      • SocketServer Module (Strategy Pattern)

            import SocketServer
            class HelloHandler(SocketServer.BaseRequestHandler):
                def handle(self):
                    self.request.sendall("Hello World\n")

            serv = SocketServer.TCPServer(("",8000),HelloHandler)
            serv.serve_forever()

      • My generator version

            for c,a in receive_connections(("",8000)):
                c.send("Hello World\n")
                c.close()
  • 71. Pitfalls
      • I don't think many programmers really understand
        generators yet
      • Springing this on the uninitiated might cause their heads
        to explode
      • Error handling is really tricky because you have lots of
        components chained together
      • Need to pay careful attention to debugging, reliability,
        and other issues.

    Thanks!
      • I hope you got some new ideas from this class
      • Please feel free to contact me

            http://www.dabeaz.com